using Amazon EKS

Arun Gupta, @arungupta

© 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved.

Machine Learning 101

The Amazon ML stack: Broadest & deepest set of capabilities

VISION: Rekognition Image, Rekognition Video, Textract
SPEECH: Polly, Transcribe
LANGUAGE: Translate, Comprehend
CHATBOTS: Lex
FORECASTING: Forecast
RECOMMENDATIONS: Personalize

ML Services
Amazon SageMaker: Ground Truth, Notebooks, Algorithms + Marketplace, Reinforcement Learning, Training, Optimization, Deployment, Hosting

ML Frameworks + Infrastructure
FRAMEWORKS | INTERFACES | INFRASTRUCTURE
Infrastructure: EC2 P3 & P3dn, EC2 G4, EC2 C5, FPGAs, Greengrass, Elastic inference, Inferentia

Storage and Analytics for ML

MACHINE LEARNING: the Amazon ML stack from the previous slide

ANALYTICS: Amazon Athena, Amazon Elasticsearch Service, Amazon EMR (Hadoop, Spark, Presto, Pig, Hive…19 total), Amazon Kinesis, AWS Glue, Amazon Redshift + Redshift Spectrum, Amazon QuickSight

STORAGE: Amazon EBS, Amazon S3 Standard, Amazon S3 Standard-IA, Amazon S3 Intelligent-Tiering, Amazon S3 One Zone-IA, Amazon Glacier, Amazon S3 Glacier Deep Archive

Machine Learning using Amazon EKS

Why Machine Learning on Kubernetes?

ON-PREMISES CLOUD

Composability Portability Scalability

Amazon EKS: run Kubernetes in cloud

Managed Kubernetes control plane, attach data plane

Native upstream Kubernetes experience

Platform for enterprises to run production-grade workloads

Integrates with additional AWS services

Getting started with Amazon EKS

eksctl CLI: create Amazon EKS clusters (eksctl.io)

Creates all resources needed for the cluster

Create an EKS cluster using eksctl

Install

Default cluster settings:
• Auto-generated cluster name
• 2x m5.large nodes
• Uses AWS EKS AMI
• us-west-2 region
• Dedicated VPCs
• Static AMI resolver

GPU-powered cluster
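The two cluster flavors named on this slide can be sketched as eksctl command lines. The snippet below is only an illustration: it builds the argument lists in Python without running them, and the cluster name `ml-gpu`, the node count, and the `p3.2xlarge` instance type are assumptions chosen for the example, not values from the slides.

```python
def eksctl_create_cluster(name=None, region=None, node_type=None, nodes=None):
    """Build an `eksctl create cluster` argument list.

    Options left as None are omitted, so eksctl falls back to its defaults
    (auto-generated name, 2x m5.large nodes, us-west-2).
    """
    args = ["eksctl", "create", "cluster"]
    options = {
        "--name": name,
        "--region": region,
        "--node-type": node_type,
        "--nodes": nodes,
    }
    for flag, value in options.items():
        if value is not None:
            args += [flag, str(value)]
    return args


# Cluster with all defaults.
default_cmd = eksctl_create_cluster()

# GPU-powered cluster; the name and instance type are hypothetical choices.
gpu_cmd = eksctl_create_cluster(
    name="ml-gpu", region="us-west-2", node_type="p3.2xlarge", nodes=2
)

print(" ".join(default_cmd))
print(" ".join(gpu_cmd))
```

On a machine with eksctl installed and AWS credentials configured, either list could be passed to `subprocess.run(cmd, check=True)`.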

Set up K8s for ML: Option 1

Train | Inference

[Diagram, steps 1 to 4: data, training cluster, trained model, inference cluster]

Create K8s cluster for ML: Option 1

Create training cluster

Create inference cluster

Set up K8s for ML: Option 2a

Train & inference

[Diagram, steps 1 to 4: data and the trained model move through a single cluster whose nodes are labeled role: train or role: inference]

Create K8s cluster for ML: Option 2a

eksctl cluster configuration with two node groups

Create cluster
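A configuration with two node groups might look like the sketch below, which builds an eksctl `ClusterConfig` as a Python dictionary and prints it as JSON. The cluster name, region, instance types, and node counts are assumptions for illustration; the `role` labels follow the diagram for this option. eksctl configuration files are normally written in YAML, and since JSON is a subset of YAML this output should load the same way, though that is worth checking against your eksctl version.

```python
import json


def node_group(name, instance_type, count, role):
    """Describe one eksctl node group whose nodes carry a `role` label."""
    return {
        "name": name,
        "instanceType": instance_type,
        "desiredCapacity": count,
        "labels": {"role": role},
    }


# Hypothetical values: cluster name, region, instance types, and sizes.
cluster_config = {
    "apiVersion": "eksctl.io/v1alpha5",
    "kind": "ClusterConfig",
    "metadata": {"name": "ml-cluster", "region": "us-west-2"},
    "nodeGroups": [
        node_group("train", "p3.2xlarge", 2, "train"),          # GPU nodes
        node_group("inference", "c5.xlarge", 2, "inference"),   # CPU nodes
    ],
}

print(json.dumps(cluster_config, indent=2))
```

Saving the output to a file and running `eksctl create cluster -f <file>` would then create both node groups in one cluster.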

Set up K8s for ML: Option 2b

Train, inference, & applications

[Diagram: a single cluster whose nodes are labeled role: train, role: inference, or role: apps]
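With nodes labeled by role, a workload can be steered to the right nodes through a Kubernetes `nodeSelector`. The sketch below generates a Pod manifest that targets `role: train` nodes; the pod name and the container image are placeholders invented for the example.

```python
import json


def pod_for_role(name, image, role, gpus=0):
    """Build a Pod manifest that is scheduled only on nodes labeled role=<role>."""
    container = {"name": name, "image": image}
    if gpus:
        # GPUs are requested through the NVIDIA device plugin resource name.
        container["resources"] = {"limits": {"nvidia.com/gpu": gpus}}
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "nodeSelector": {"role": role},
            "restartPolicy": "Never",
            "containers": [container],
        },
    }


# Hypothetical pod name and image, for illustration only.
train_pod = pod_for_role("mnist-train", "example.com/my-training-image:latest", "train", gpus=1)
print(json.dumps(train_pod, indent=2))
```

The same helper with `role="inference"` or `role="apps"` would place serving or application pods on their own node groups.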

Scaling the cluster

Challenges in setting up containers for ML

AWS Deep Learning Containers

• Pre-packaged container images, fully configured and validated
• Best performance and scalability without tuning
• Works with Amazon EKS, Amazon ECS, and Amazon EC2

KEY FEATURES

• Customizable container images
• Support for TensorFlow, Apache MXNet
• Single and multi-node training and inference

16 container images

Training Inference

GPU CPU

Python 2.7 Python 3.6
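The three axes on this slide give 2 x 2 x 2 = 8 combinations. The total of 16 is consistent with multiplying by the two frameworks listed on the AWS Deep Learning Containers slide (TensorFlow and Apache MXNet); that reading is an inference, not something the slide states. A short enumeration under that assumption:

```python
from itertools import product

# Assumption: the two frameworks come from the Deep Learning Containers slide.
frameworks = ["TensorFlow", "MXNet"]
job_types = ["training", "inference"]
devices = ["gpu", "cpu"]
pythons = ["py27", "py36"]

# One entry per (framework, job type, device, Python version) combination.
variants = [
    {"framework": f, "job": j, "device": d, "python": p}
    for f, j, d, p in product(frameworks, job_types, devices, pythons)
]

print(len(variants))  # 16
for variant in variants[:3]:
    print(variant)
```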

Train twice as fast with TensorFlow

STOCK TENSORFLOW: 65% scaling efficiency with 256 GPUs
AWS-OPTIMIZED TENSORFLOW: 90%+ scaling efficiency with 256 GPUs
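To make the percentages concrete, the sketch below assumes the common definition of scaling efficiency: measured speedup on N GPUs divided by N. The slide does not define the term, so treat this definition as an assumption. Under it, efficiency times GPU count gives the number of "ideal" GPUs the cluster is effectively worth.

```python
def effective_gpus(num_gpus, efficiency):
    """Ideal-GPU equivalent of a cluster, assuming efficiency = speedup / num_gpus."""
    return num_gpus * efficiency


stock = effective_gpus(256, 0.65)      # stock TensorFlow figure from the slide
optimized = effective_gpus(256, 0.90)  # lower bound of the "90%+" figure

print(f"stock: ~{stock:.1f} ideal GPUs")
print(f"AWS-optimized: ~{optimized:.1f} ideal GPUs or more")
```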

ML on K8s: Without Kubeflow

Credits: @aronchik

ML on K8s: With Kubeflow

Credits: @aronchik

What’s in Kubeflow?

Kubeflow Configuration

Jupyter Notebook

• Web application to build, deploy, and train ML models
• Create and share documents that contain live code, equations, visualizations, and narrative text
• 40+ programming languages
• Use cases: data cleaning and transformation, numerical simulation, data visualization, machine learning

Jupyter Notebook

Single Node Training & Inference

Horovod

Distributed training framework for TensorFlow, Keras, PyTorch, and MXNet

Install
• MPI Operator
• CSI driver for Amazon FSx Lustre
• Create PV and PVC
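For orientation, a Horovod training script with Keras typically follows the pattern sketched below: initialize Horovod, pin each process to one GPU, scale the learning rate by the worker count, wrap the optimizer, and broadcast the initial weights from rank 0. This is a minimal sketch and not the code used in the talk; the MNIST dataset, the model, and the hyperparameters are assumptions, and details such as extra `compile` arguments can vary across TensorFlow and Horovod versions.

```python
def scaled_learning_rate(base_lr, num_workers):
    """Linear scaling rule: grow the learning rate with the number of workers."""
    return base_lr * num_workers


def train(base_lr=0.01, epochs=1):
    # Imported here so this module still loads where TensorFlow/Horovod are absent.
    import tensorflow as tf
    import horovod.tensorflow.keras as hvd

    hvd.init()

    # Pin each process to a single GPU, chosen by its local rank on the node.
    gpus = tf.config.experimental.list_physical_devices("GPU")
    if gpus:
        tf.config.experimental.set_visible_devices(gpus[hvd.local_rank()], "GPU")

    # Each worker downloads to its own file and trains on its own shard.
    (x, y), _ = tf.keras.datasets.mnist.load_data(path="mnist-%d.npz" % hvd.rank())
    x = x[hvd.rank()::hvd.size()] / 255.0
    y = y[hvd.rank()::hvd.size()]

    model = tf.keras.Sequential([
        tf.keras.layers.Flatten(input_shape=(28, 28)),
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(10, activation="softmax"),
    ])

    # DistributedOptimizer averages gradients across all workers.
    optimizer = hvd.DistributedOptimizer(
        tf.keras.optimizers.SGD(scaled_learning_rate(base_lr, hvd.size()))
    )
    model.compile(optimizer=optimizer,
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])

    model.fit(
        x, y, batch_size=64, epochs=epochs,
        # Start every worker from the same weights as rank 0.
        callbacks=[hvd.callbacks.BroadcastGlobalVariablesCallback(0)],
        verbose=1 if hvd.rank() == 0 else 0,
    )
```

A script that calls `train()` would be started once per GPU, for example with `horovodrun -np 4 python train.py` on a single machine, or by the launcher pod that the MPI Operator creates on Kubernetes.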

Storage: FSx for Lustre

• High-performance file system for processing Amazon S3 or on-premises data
• Low latency and high throughput
• Works natively with Amazon S3
• Container Storage Interface (CSI) driver
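With the CSI driver installed, an existing FSx for Lustre file system can be exposed to pods through a PersistentVolume (PV) and a PersistentVolumeClaim (PVC). The sketch below builds both manifests as Python dictionaries. The file system ID, DNS name, mount name, and size are placeholders, and the `volumeAttributes` keys are written from memory of the driver's static provisioning example, so verify them against the driver version you install.

```python
import json


def fsx_static_volume(fs_id, dns_name, mount_name, size="1200Gi", name="fsx-pv"):
    """Return (pv, pvc) manifests for an existing FSx for Lustre file system."""
    pv = {
        "apiVersion": "v1",
        "kind": "PersistentVolume",
        "metadata": {"name": name},
        "spec": {
            "capacity": {"storage": size},
            "volumeMode": "Filesystem",
            "accessModes": ["ReadWriteMany"],  # many training pods share the data
            "csi": {
                "driver": "fsx.csi.aws.com",
                "volumeHandle": fs_id,
                # Assumed attribute names; check the driver documentation.
                "volumeAttributes": {"dnsname": dns_name, "mountname": mount_name},
            },
        },
    }
    pvc = {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": name + "-claim"},
        "spec": {
            "accessModes": ["ReadWriteMany"],
            "storageClassName": "",  # empty class binds to the pre-created PV
            "volumeName": name,
            "resources": {"requests": {"storage": size}},
        },
    }
    return pv, pvc


# Placeholder identifiers, not real resources.
pv, pvc = fsx_static_volume(
    "fs-0123456789abcdef0",
    "fs-0123456789abcdef0.fsx.us-west-2.amazonaws.com",
    "fsx",
)
print(json.dumps([pv, pvc], indent=2))
```

Training pods would then mount the claim `fsx-pv-claim` as a volume to read the dataset.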

Distributed Node Training

Advantages of Kubeflow on AWS

kubeflow.org/docs/aws

• EKS cluster provision with …
• External traffic with …
• … to manage Lustre file system
• Centralized and unified K8s logs in …
• TLS and Auth with … and …
• … for your K8s API server endpoint
• Detect GPU instance and install …

Best Practices for Optimizing Distributed Deep Learning Performance on Amazon EKS

https://aws.amazon.com/blogs/opensource/optimizing-distributed-deep-learning-performance-amazon-eks/

Machine Learning pipeline

1. Collect & prepare training data
2. Choose and optimize your ML algorithm
3. Set up and manage environments for training
4. Train and tune model (trial and error)
5. Deploy model in production
6. Scale & manage environment in production

Machine Learning pipeline for Amazon EKS

1. Collect & prepare training data: EMR, Redshift, S3
2. Choose and optimize your ML algorithm: Linear regression, decision tree, BYOA, …
3. Set up and manage environments for training: GPU- and CPU-based clusters, *operators (TensorFlow, MXNet, …)
4. Train and tune model (trial and error): TensorFlow, Horovod, MXNet, PyTorch, Keras
5. Deploy model in production: TensorFlow Serving, MXNet Model Server, …
6. Scale & manage environment in production: EKS

References

Workshop: eksworkshop.com/kubeflow

Code: github.com/aws-samples/machine-learning-using-k8s/

Optimizing Machine Learning performance: aws.amazon.com/blogs/opensource/optimizing-distributed-deep-learning-performance-amazon-eks/
