O’REILLY Artificial Intelligence Conference, San Jose 2019

Running large-scale experiments in the cloud

Shashank Prasanna, @shshnkp

September 12, 2019

© 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved.

Agenda

• Machine learning, experiments, and the scientific method

• Scaling challenges with machine learning setups

• Container technologies for large-scale machine learning

• Demo 1: Hyperparameter optimization with Amazon SageMaker

• Demo 2: Custom designed experiments with Amazon SageMaker

• Demo 3: Hyperparameter optimization with Amazon EKS, Kubeflow, and Katib

• Summary and additional resources

Experimentation in machine learning

• Data acquisition, curation, and labeling
• Data preparation for training
• Design and run experiments
• Model optimization and validation
• Distributed training
• Deployment

Why do we run experiments?

The scientific method

Question

Hypothesis

Experiment: an empirical procedure to determine whether observations agree or conflict with our hypothesis

Interpret

Iterate

Why do we run machine learning experiments?

A machine learning researcher may want to:
• Identify factors that affect model performance
• Explain the root causes of performance
• Choose between alternate models
• Study the variability and robustness of a model
• Find a Pareto-optimal model (e.g., accuracy vs. complexity)

A common example: hyperparameter optimization.
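One of the goals listed above, finding a Pareto-optimal model (e.g., accuracy vs. complexity), is easy to make concrete. A minimal sketch in plain Python; the candidate models and their numbers are made up for illustration, not from the talk:

```python
# Sketch: find Pareto-optimal models trading off accuracy (higher is
# better) against parameter count in millions (lower is better).
candidates = [
    {"name": "small-cnn", "accuracy": 0.90, "params_m": 1.2},
    {"name": "resnet50",  "accuracy": 0.93, "params_m": 25.6},
    {"name": "resnet152", "accuracy": 0.94, "params_m": 60.2},
    {"name": "bad-model", "accuracy": 0.88, "params_m": 30.0},
]

def dominates(a, b):
    """a dominates b if a is at least as good on both axes and strictly better on one."""
    return (a["accuracy"] >= b["accuracy"] and a["params_m"] <= b["params_m"]
            and (a["accuracy"] > b["accuracy"] or a["params_m"] < b["params_m"]))

# Keep only candidates no other candidate dominates.
pareto = [c for c in candidates
          if not any(dominates(o, c) for o in candidates if o is not c)]
print([c["name"] for c in pareto])  # bad-model drops out: it is dominated by small-cnn
```

Here "bad-model" is excluded because "small-cnn" beats it on both accuracy and size; the other three each win on at least one axis.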

en.wikipedia.org/wiki/Scientific_method

Machine learning experiments

(Diagram: controlled factors → machine learning system → responses)

Factors:
• ML algorithms (trees, GLM, DNN, …)
• Datasets and features
• DNN network architectures
• Model hyperparameters
• others

Responses:
• Generalization accuracy, error
• Variability
• Reproducibility
• Model execution time

Common machine learning setups

On-premises or in the cloud, every setup needs:
1. Code & frameworks
2. Compute (CPUs, GPUs)
3. Storage

On AWS: CLI → EC2 instance (Deep Learning AMI, 8x GPUs), with Amazon S3 for storage.

Most experiment runs can be scaled out

(Diagram: a CLI driving a single EC2 instance vs. a CLI driving a cluster of instances)

ML experiment runs are computationally expensive, but are usually embarrassingly parallel jobs
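Because the runs are independent, an entire grid of configurations can be fanned out at once. A local sketch, using a thread pool as a stand-in for a fleet of instances; the factor levels and the `train` stub are illustrative:

```python
import itertools
from concurrent.futures import ThreadPoolExecutor

# Illustrative experiment factors; a real study would also vary
# algorithms, datasets, and network architectures.
factors = {
    "lr": [0.1, 0.01, 0.001],
    "batch_size": [32, 128],
}
configs = [dict(zip(factors, levels))
           for levels in itertools.product(*factors.values())]

def train(config):
    # Stub: in practice this would launch one training job (e.g., on
    # its own EC2 instance or container) and return measured responses.
    return {**config, "accuracy": 0.0}

# Each run is independent -> embarrassingly parallel fan-out.
with ThreadPoolExecutor(max_workers=len(configs)) as pool:
    results = list(pool.map(train, configs))

print(len(results))  # 3 learning rates x 2 batch sizes = 6 runs
```

Swapping the thread pool for remote job submission (one instance per config) is exactly the scale-out pictured above.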

…but there are challenges to scaling

• Code and dependencies

• Cluster management

• Infrastructure management

The machine learning stack is complex

• “My code requires building several dependencies from source”
• “My code isn’t taking advantage of the GPU/GPUs”
• “Is cuDNN, NCCL installed? Is it the right version?”
• “My code is running slow on CPUs”
• “Oh wait, is it taking advantage of the AVX instruction set?!?”
• “I updated my drivers and training is now slower/errors out”
• “My cluster runs a different version of the framework/Linux distro”

This makes portability, collaboration, and scaling experiment runs really, really hard!

Multiple points of failure between environments:

Development system:
• My code
• TensorFlow 1.13 with Keras, scikit-learn, horovod, pandas, numpy, openmpi, scipy, Python, others…
• CPU: MKL 2019 v3
• GPU: cuDNN 7.1, cuBLAS 10, NCCL 2, CUDA toolkit 10
• NVIDIA drivers 436.15
• Ubuntu 16.04

Training cluster:
• My code
• TensorFlow 1.14 with Keras, scikit-learn, horovod, pandas, numpy, openmpi, scipy, Python, others…
• CPU: MKL 2019 v2
• GPU: cuDNN 7.5, cuBLAS 10, NCCL 2.4, CUDA toolkit 10
• NVIDIA drivers 410.68
• CentOS 7

Containers for machine learning

A TensorFlow container image packages the entire ML environment:
• Packages: TensorFlow, Keras, scikit-learn, horovod, pandas, numpy, openmpi, scipy, Python, others…
• CPU: MKL
• GPU: cuDNN, cuBLAS, NCCL, CUDA toolkit

The image runs on a container runtime, on top of the NVIDIA drivers and the host OS.

The same container image moves between environments:

Development system → push → container registry → pull → training cluster

Each system runs the identical TensorFlow container image (packages, CPU and GPU libraries) on its own container runtime, NVIDIA drivers, host OS, and infrastructure.

AWS Deep Learning Containers

https://docs.aws.amazon.com/dlami/latest/devguide/deep-learning-containers-images.html

Challenges with scaling ML experiments

• ML code and dependencies

• Cluster management

• Infrastructure management

ML infrastructure and cluster management

• ML services: Amazon SageMaker, a fully managed service that covers the entire machine learning workflow (Jupyter notebook instances, high-performance algorithms, large-scale training, optimization, one-click deployment, fully managed with auto-scaling)

• Management: Amazon Elastic Container Service (Amazon ECS) and Amazon Elastic Kubernetes Service (Amazon EKS), for deployment, scheduling, scaling, and management of containerized applications

• Image registry: Amazon Elastic Container Registry (Amazon ECR), a container image repository

• Compute: Amazon EC2, where the containers run

DEMO 1: Random search for hyperparameter optimization with Amazon SageMaker

Approach:
• Bring your own training script
• Choose a hyperparameter search strategy (Bayesian, random, custom)
• Launch a tuning job

• Learning rate: between 0.0001 and 0.1 on a log scale
• Batch size: 32, 128, 512, 1024
• Momentum: between 0.9 and 0.99
• Optimizer: SGD, Adam

Hyperparameter optimization with Amazon SageMaker
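Random search over ranges like those in Demo 1 (learning rate 0.0001 to 0.1 on a log scale; batch size 32, 128, 512, or 1024; momentum 0.9 to 0.99; SGD or Adam) can be sketched in plain Python. The SageMaker tuner does this sampling service-side; this only illustrates the idea, and the parameter names are illustrative:

```python
import math
import random

random.seed(0)  # reproducible sampling for the sketch

def sample_config():
    """Draw one hyperparameter configuration uniformly at random."""
    return {
        # Between 0.0001 and 0.1, uniform on a log scale
        "learning_rate": 10 ** random.uniform(math.log10(0.0001), math.log10(0.1)),
        "batch_size": random.choice([32, 128, 512, 1024]),
        "momentum": random.uniform(0.9, 0.99),       # between 0.9 and 0.99
        "optimizer": random.choice(["sgd", "adam"]),
    }

# e.g., 10 parallel tuning jobs, each training with an independent sample
configs = [sample_config() for _ in range(10)]
print(len(configs))
```

Sampling the learning rate on a log scale matters: uniform sampling in [0.0001, 0.1] would almost never try values near 0.0001.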

1. Provide a training script (cifar10.py) and a pre-built Deep Learning Container from the container registry
2. Configure via the SageMaker SDK: instance type (CPU, GPU), hyperparameter ranges, number of parallel jobs (10, 100, …), objective metric, …
3. Launch the tuning job on a fully managed SageMaker cluster; results land in Amazon S3

DEMO 2: Custom experiment with bring-your-own-container on Amazon SageMaker

Approach:
• Bring your own container
• Specify parameter values
• Launch a tuning job

Large-scale experiments with Amazon SageMaker

1. Build a Docker image from your code files (docker build) and push it to the container registry
2. Configure via the SageMaker SDK: instance type (CPU, GPU), parameters to vary, number of parallel jobs (10, 100, …)
3. Launch on a fully managed SageMaker cluster; results land in Amazon S3

DEMO 3: Hyperparameter optimization with Amazon EKS, Kubeflow and Katib

• Katib: hyperparameter tuning and neural architecture search

• Kubeflow: machine learning workflows on Kubernetes

• Amazon Elastic Kubernetes Service (Amazon EKS)

Create a Kubernetes cluster

Create the cluster:

eksctl create cluster \
  --name eks-gpu \
  --version 1.12 \
  --region us-west-2 \
  --nodegroup-name gpu-nodes \
  --node-type p3.8xlarge \
  --nodes 8 \
  --timeout=40m \
  --ssh-access \
  --ssh-public-key= \
  --auto-kubeconfig

Then submit training jobs from the CLI.

Hyperparameter optimization with Kubeflow and Katib

1. Create a Kubernetes cluster and install Kubeflow
2. Build a custom container from your code files and push it to the container registry
3. Specify the hyperparameter search space
4. Launch on the Amazon EKS cluster

Takeaways

• Apply the scientific method to machine learning experiments
• Embrace containers – they let you bundle code and dependencies
• Leverage services such as Amazon SageMaker and Kubernetes + Kubeflow for cluster and infrastructure management of large-scale ML workloads
• Choose fully managed or self-managed based on your needs

Demos: github.com/shashankprasanna/oreilly-ai-conference-sanjose19.git

Resources

• Documentation: docs.aws.amazon.com/sagemaker/latest/dg/whatis.html

• Examples on GitHub: github.com/awslabs/amazon-sagemaker-examples

• AWS ML Blog: aws.amazon.com/blogs/machine-learning/category/artificial-intelligence/sagemaker/

Thank you and happy experimenting!

Shashank Prasanna, Sr. Technical Evangelist, AI/ML

Questions? Happy to help:
Twitter: @shshnkp
LinkedIn: linkedin.com/in/shashankprasanna

Please rate this session!
