O’REILLY Artificial Intelligence Conference, San Jose 2019

Running large-scale experiments in the cloud

Shashank Prasanna, @shshnkp

September 12, 2019

© 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved.

Agenda

• Machine learning, experiments, and the scientific method

• Scaling challenges with machine learning setups

• Container technologies for large-scale machine learning

• Demo 1: Hyperparameter optimization with Amazon SageMaker

• Demo 2: Custom designed experiments with Amazon SageMaker

• Demo 3: Hyperparameter optimization with Amazon EKS, Kubeflow, and Katib

• Summary and additional resources

Experimentation in machine learning

• Data acquisition, curation, and labeling
• Data preparation for training
• Design and run experiments
• Model optimization and validation
• Distributed training
• Deployment

Why do we run experiments?

The scientific method

Question

Hypothesis

Experiment: an empirical procedure to determine whether observations agree or conflict with our hypothesis

Interpret

Iterate

Why do we run machine learning experiments?

A machine learning researcher may want to:
• Identify factors that affect model performance
• Explain the root causes of performance
• Choose between alternate models
• Study the variability and robustness of a model
• Find a Pareto-optimal model (e.g., accuracy vs. complexity)

A common example: hyperparameter optimization.
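One of the goals listed above, finding a Pareto-optimal model (e.g., accuracy vs. complexity), is easy to make concrete. A minimal sketch in plain Python; the candidate models and their numbers are made up for illustration, not from the talk:

```python
# Sketch: find Pareto-optimal models trading off accuracy (higher is
# better) against parameter count in millions (lower is better).
candidates = [
    {"name": "small-cnn", "accuracy": 0.90, "params_m": 1.2},
    {"name": "resnet50",  "accuracy": 0.93, "params_m": 25.6},
    {"name": "resnet152", "accuracy": 0.94, "params_m": 60.2},
    {"name": "bad-model", "accuracy": 0.88, "params_m": 30.0},
]

def dominates(a, b):
    """a dominates b if a is at least as good on both axes and strictly better on one."""
    return (a["accuracy"] >= b["accuracy"] and a["params_m"] <= b["params_m"]
            and (a["accuracy"] > b["accuracy"] or a["params_m"] < b["params_m"]))

# Keep only candidates no other candidate dominates.
pareto = [c for c in candidates
          if not any(dominates(o, c) for o in candidates if o is not c)]
print([c["name"] for c in pareto])  # bad-model drops out: it is dominated by small-cnn
```

Here "bad-model" is excluded because "small-cnn" beats it on both accuracy and size; the other three each win on at least one axis.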

en.wikipedia.org/wiki/Scientific_method

Machine learning experiments

(Diagram: controlled factors → machine learning system → responses)

Factors:
• ML algorithms (trees, GLM, DNN, …)
• Datasets and features
• DNN network architectures
• Model hyperparameters
• others

Responses:
• Generalization accuracy, error
• Variability
• Reproducibility
• Model execution time

Common machine learning setups

On-premises or in the cloud, every setup needs:
1. Code & frameworks
2. Compute (CPUs, GPUs)
3. Storage

On AWS: CLI → EC2 instance (Deep Learning AMI, 8x GPUs), with Amazon S3 for storage.

Most experiment runs can be scaled out

(Diagram: a CLI driving a single EC2 instance vs. a CLI driving a cluster of instances)

ML experiment runs are computationally expensive, but are usually embarrassingly parallel jobs
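Because the runs are independent, an entire grid of configurations can be fanned out at once. A local sketch, using a thread pool as a stand-in for a fleet of instances; the factor levels and the `train` stub are illustrative:

```python
import itertools
from concurrent.futures import ThreadPoolExecutor

# Illustrative experiment factors; a real study would also vary
# algorithms, datasets, and network architectures.
factors = {
    "lr": [0.1, 0.01, 0.001],
    "batch_size": [32, 128],
}
configs = [dict(zip(factors, levels))
           for levels in itertools.product(*factors.values())]

def train(config):
    # Stub: in practice this would launch one training job (e.g., on
    # its own EC2 instance or container) and return measured responses.
    return {**config, "accuracy": 0.0}

# Each run is independent -> embarrassingly parallel fan-out.
with ThreadPoolExecutor(max_workers=len(configs)) as pool:
    results = list(pool.map(train, configs))

print(len(results))  # 3 learning rates x 2 batch sizes = 6 runs
```

Swapping the thread pool for remote job submission (one instance per config) is exactly the scale-out pictured above.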

…but there are challenges to scaling

• Code and dependencies

• Cluster management

• Infrastructure management

The machine learning stack is complex

• “My code requires building several dependencies from source”
• “My code isn’t taking advantage of the GPU/GPUs”
• “Is cuDNN, NCCL installed? Is it the right version?”
• “My code is running slow on CPUs”
• “Oh wait, is it taking advantage of the AVX instruction set?!?”
• “I updated my drivers and training is now slower/errors out”
• “My cluster runs a different version of the framework/Linux distro”

This makes portability, collaboration, and scaling experiment runs really, really hard!

Multiple points of failure between environments:

Development system:
• My code
• TensorFlow 1.13 with Keras, scikit-learn, horovod, pandas, numpy, openmpi, scipy, Python, others…
• CPU: MKL 2019 v3
• GPU: cuDNN 7.1, cuBLAS 10, NCCL 2, CUDA toolkit 10
• NVIDIA drivers 436.15
• Ubuntu 16.04

Training cluster:
• My code
• TensorFlow 1.14 with Keras, scikit-learn, horovod, pandas, numpy, openmpi, scipy, Python, others…
• CPU: MKL 2019 v2
• GPU: cuDNN 7.5, cuBLAS 10, NCCL 2.4, CUDA toolkit 10
• NVIDIA drivers 410.68
• CentOS 7

Containers for machine learning

A TensorFlow container image packages the entire ML environment:
• Packages: TensorFlow, Keras, scikit-learn, horovod, pandas, numpy, openmpi, scipy, Python, others…
• CPU: MKL
• GPU: cuDNN, cuBLAS, NCCL, CUDA toolkit

The image runs on a container runtime, on top of the NVIDIA drivers and the host OS.

The same container image moves between environments:

Development system → push → container registry → pull → training cluster

Each system runs the identical TensorFlow container image (packages, CPU and GPU libraries) on its own container runtime, NVIDIA drivers, host OS, and infrastructure.

AWS Deep Learning Containers

https://docs.aws.amazon.com/dlami/latest/devguide/deep-learning-containers-images.html

Challenges with scaling ML experiments

• ML code and dependencies

• Cluster management

• Infrastructure management

ML infrastructure and cluster management

• ML services: Amazon SageMaker, a fully managed service that covers the entire machine learning workflow (Jupyter notebook instances, high-performance algorithms, large-scale training, optimization, one-click deployment, fully managed with auto-scaling)

• Management: Amazon Elastic Container Service (Amazon ECS) and Amazon Elastic Kubernetes Service (Amazon EKS), for deployment, scheduling, scaling, and management of containerized applications

• Image registry: Amazon Elastic Container Registry (Amazon ECR), a container image repository

• Compute: Amazon EC2, where the containers run

DEMO 1: Random search for hyperparameter optimization with Amazon SageMaker

Approach:
• Bring your own training script
• Choose a hyperparameter search strategy (Bayesian, random, custom)
• Launch a tuning job

• Learning rate: between 0.0001 and 0.1 on a log scale
• Batch size: 32, 128, 512, 1024
• Momentum: between 0.9 and 0.99
• Optimizer: SGD, Adam

Hyperparameter optimization with Amazon SageMaker
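Random search over ranges like those in Demo 1 (learning rate 0.0001 to 0.1 on a log scale; batch size 32, 128, 512, or 1024; momentum 0.9 to 0.99; SGD or Adam) can be sketched in plain Python. The SageMaker tuner does this sampling service-side; this only illustrates the idea, and the parameter names are illustrative:

```python
import math
import random

random.seed(0)  # reproducible sampling for the sketch

def sample_config():
    """Draw one hyperparameter configuration uniformly at random."""
    return {
        # Between 0.0001 and 0.1, uniform on a log scale
        "learning_rate": 10 ** random.uniform(math.log10(0.0001), math.log10(0.1)),
        "batch_size": random.choice([32, 128, 512, 1024]),
        "momentum": random.uniform(0.9, 0.99),       # between 0.9 and 0.99
        "optimizer": random.choice(["sgd", "adam"]),
    }

# e.g., 10 parallel tuning jobs, each training with an independent sample
configs = [sample_config() for _ in range(10)]
print(len(configs))
```

Sampling the learning rate on a log scale matters: uniform sampling in [0.0001, 0.1] would almost never try values near 0.0001.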

1. Provide a training script (cifar10.py) and a pre-built Deep Learning Container from the container registry
2. Configure via the SageMaker SDK: instance type (CPU, GPU), hyperparameter ranges, number of parallel jobs (10, 100, …), objective metric, …
3. Launch the tuning job on a fully managed SageMaker cluster; results land in Amazon S3

DEMO 2: Custom experiment with bring-your-own-container on Amazon SageMaker

Approach:
• Bring your own container
• Specify parameter values
• Launch a tuning job

Large-scale experiments with Amazon SageMaker

1. Build a Docker image from your code files (docker build) and push it to the container registry
2. Configure via the SageMaker SDK: instance type (CPU, GPU), parameters to vary, number of parallel jobs (10, 100, …)
3. Launch on a fully managed SageMaker cluster; results land in Amazon S3

DEMO 3: Hyperparameter optimization with Amazon EKS, Kubeflow and Katib

• Katib: hyperparameter tuning and neural architecture search

• Kubeflow: machine learning workflows on Kubernetes

• Amazon Elastic Kubernetes Service (Amazon EKS)

Create a Kubernetes cluster

Create the cluster:

eksctl create cluster \
  --name eks-gpu \
  --version 1.12 \
  --region us-west-2 \
  --nodegroup-name gpu-nodes \
  --node-type p3.8xlarge \
  --nodes 8 \
  --timeout=40m \
  --ssh-access \
  --ssh-public-key= \
  --auto-kubeconfig

Then submit training jobs from the CLI.

Hyperparameter optimization with Kubeflow and Katib

1. Create a Kubernetes cluster and install Kubeflow
2. Build a custom container from your code files and push it to the container registry
3. Specify the hyperparameter search space
4. Launch on the Amazon EKS cluster

Takeaways

• Apply the scientific method to machine learning experiments
• Embrace containers – they let you bundle code and dependencies
• Leverage services such as Amazon SageMaker and Kubernetes + Kubeflow for cluster and infrastructure management of large-scale ML workloads
• Choose fully managed or self-managed based on your needs

Demos: github.com/shashankprasanna/oreilly-ai-conference-sanjose19.git

Resources

• Documentation: docs.aws.amazon.com/sagemaker/latest/dg/whatis.html

• Examples on GitHub: github.com/awslabs/amazon-sagemaker-examples

• AWS ML Blog: aws.amazon.com/blogs/machine-learning/category/artificial-intelligence/sagemaker/

Thank you and happy experimenting!

Shashank Prasanna, Sr. Technical Evangelist, AI/ML

Questions? Happy to help:
Twitter: @shshnkp
LinkedIn: linkedin.com/in/shashankprasanna

Please rate this session!
