Machine Learning using Amazon EKS
Arun Gupta, @arungupta
© 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Machine Learning 101
© 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved. The Amazon ML stack: Broadest & deepest set of capabilities
FRAMEWORKS INTERFACES INFRASTRUCTURE
EC2 P3 EC2 EC2 Elastic FPGAs Greengrass Inferentia & P3dn G4 C5 inference
© 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved. The Amazon ML stack: Broadest & deepest set of capabilities
Algorithms + Reinforcement Ground Truth Notebooks Marketplace Learning Training Optimization Deployment Hosting
FRAMEWORKS INTERFACES INFRASTRUCTURE ML Frameworks + Infrastructure EC2 P3 EC2 EC2 Elastic FPGAs Greengrass Inferentia & P3dn G4 C5 inference
© 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved. The Amazon ML stack: Broadest & deepest set of capabilities
VISION SPEECH LANGUAGE CHATBOTS FORECASTING RECOMMENDATIONS
REKOGNITION REKOGNITION TEXTRACT POLLY TRANSCRIBE TRANSLATE COMPREHEND LEX FORECAST PERSONALIZE IMAGE VIDEO
Amazon ML Services Reinforcement SageMaker Ground Truth Notebooks Algorithms + Marketplace Learning Training Optimization Deployment Hosting
FRAMEWORKS INTERFACES INFRASTRUCTURE ML Frameworks + Infrastructure EC2 P3 EC2 EC2 Elastic FPGAs Greengrass Inferentia & P3dn G4 C5 inference
© 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved. The Amazon ML stack: Broadest & deepest set of capabilities
VISION SPEECH LANGUAGE CHATBOTS FORECASTING RECOMMENDATIONS
REKOGNITION REKOGNITION TEXTRACT POLLY TRANSCRIBE TRANSLATE COMPREHEND LEX FORECAST PERSONALIZE IMAGE VIDEO
Amazon Reinforcement SageMaker Ground Truth Notebooks Algorithms + Marketplace Learning Training Optimization Deployment Hosting
FRAMEWORKS INTERFACES INFRASTRUCTURE
EC2 P3 EC2 EC2 Elastic FPGAs Greengrass Inferentia & P3dn G4 C5 inference
© 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Storage and Analytics for ML
MACHINE LEARNING ANALYTICS Amazon Amazon Elasticsearch
VISION SPEECH LANGUAGE CHATBOTS FORECASTING RECOMMENDATIONS Athena Service
Amazon EMR Hadoop, Spark, Presto, Amazon Pig, Hive…19 total Kinesis REKOGNITION REKOGNITION TEXTRACT POLLY TRANSCRIBE TRANSLATE COMPREHEND LEX FORECAST PERSONALIZE AWS Glue IMAGE VIDEO Amazon Redshift Amazon + Redshift Spectrum QuickSight
Amazon Algorithms + Reinforcement Ground Truth Notebooks Training Optimization Deployment Hosting STORAGE SageMaker Marketplace Learning NEW
FRAMEWORKS INTERFACES INFRASTRUCTURE
EC2 P3 EC2 EC2 Elastic Amazon Amazon S3 Amazon S3 Amazon S3 & P3dn G4 C5 FPGAs Greengrass inference Inferentia EBS Standard-IA Standard Intelligent- Tiering
NEW
Amazon S3 Amazon Amazon S3 One Zone-IA Glacier Glacier Deep Archive
© 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Machine Learning using Amazon EKS
VISION SPEECH LANGUAGE CHATBOTS FORECASTING RECOMMENDATIONS
REKOGNITION REKOGNITION TEXTRACT POLLY TRANSCRIBE TRANSLATE COMPREHEND LEX FORECAST PERSONALIZE IMAGE VIDEO
Amazon Reinforcement SageMaker Ground Truth Notebooks Algorithms + Marketplace Learning Training Optimization Deployment Hosting
FRAMEWORKS INTERFACES INFRASTRUCTURE
EC2 P3 EC2 EC2 Elastic FPGAs Greengrass Inferentia & P3dn G4 C5 inference
© 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Why Machine Learning on Kubernetes?
ON- PREMISES CLOUD
Composability Portability Scalability
http://www.shutterstock.com/gallery-635827p1.html © 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Amazon EKS: run Kubernetes in cloud
Managed Kubernetes control plane, attach data plane
Native upstream Kubernetes experience
Platform for enterprises to run production-grade workloads
Integrates with additional AWS services
© 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Getting started with Amazon EKS eksctl CLI—create Amazon EKS clusters (eksctl.io)
Creates all resources needed for the cluster
© 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Create an EKS cluster using eksctl
Install
Auto generated cluster name 2x m5.large nodes Uses AWS EKS AMI us-west-2 region Dedicated VPCs Static AMI resolver
GPU-powered cluster
© 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Set up K8s for ML: Option 1
Train Inference
1 2 3 4
Data Trained model
© 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Create K8s cluster for ML: Option 1
Create training cluster
Create inference cluster
© 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Set up K8s for ML: Option 2a
Train & inference 2 Trained role: train model 1 3 role: train role: inference
4 Data role: train role: inference
© 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Create K8s cluster for ML: Option 2a
Eksctl cluster configuration with two node groups
Create cluster
© 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Set up K8s for ML: Option 2b
Train, inference, & applications
role: train
role: train role: inference role: apps
role: train role: inference role: apps
© 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Scaling the cluster
© 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Challenges in setting up containers for ML
© 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved. AWS Deep Learning Containers
Pre-packaged Docker Best performance Works with Amazon EKS, container images and scalability Amazon ECS, fully configured without tuning and Amazon EC2 and validated
KEY FEATURES
Customizable Support for TensorFlow, Single and multi-node container images Apache MXNet training and inference
© 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved. 16 container images
Training Inference
GPU CPU
Python 2.7 Python 3.6
© 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Train twice as fast 65% 90%+ with TensorFlow Scaling efficiency Scaling efficiency with 256 GPUs with 256 GPUS
STOCK TENSORFLOW AWS-OPTIMIZED TENSORFLOW
© 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved. ML on K8s: Without KubeFlow
Credits: @aronchik
© 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved. ML on K8s: With KubeFlow
Credits: @aronchik
© 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved. What’s in KubeFlow?
© 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved. KubeFlow Configuration
© 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Jupyter Notebook
Web application to build, deploy, and train ML models Create and share documents that contain live code, equations, visualizations, and narrative text 40+ programming languages Use cases data cleaning and transformation numerical simulation data visualization machine learning
© 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Jupyter Notebook
© 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Single Node Training & Inference
© 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Horovod
Distributed training framework for TensorFlow, Keras, PyTorch, and MXNet Install • MPI Operator • CSI driver for Amazon FSx Lustre • Create PV and PVC
© 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Storage: FSx for Lustre
High performance file system for processing Amazon S3 or on-premises data Low latency and high throughput Works natively with Amazon S3 Container Storage Interface (CSI) driver
© 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Distributed Node Training
© 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Advantages of KubeFlow on AWS
kubeflow.org/docs/aws EKS cluster provision with
External traffic with
to manage Lustre file system
Centralized and unified K8s logs in
TLS and Auth with and
for your K8s API server endpoint
Detect GPU instance and install
© 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Best Practices for Optimizing Distributed Deep Learning Performance on Amazon EKS
https://aws.amazon.com/blogs/opensource/optimizing-distributed-deep-learning-performance-amazon-eks/
© 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Machine Learning pipeline
Setup and Scale & Collect & Choose and Train and manage Deploy model manage prepare Optimize your tune model environments in production environment training data ML algorithm (trial and error) for training in production
© 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Machine Learning pipeline for Amazon EKS
Setup and Scale & Collect & Choose and Train and manage Deploy model manage prepare Optimize your tune model environments in production environment training data ML algorithm (trial and error) for training in production
GPU- and CPU- TensorFlow, Linear TensorFlow based clusters, Horovod, EMR, regression, Serving, MXNet *operators MXNet, EKS Redshift, S3 decision tree, Model Server, (TensorFlow, PyTorch, Keras, BYOA Seldon, … MXNet, …) …
© 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved. References
Workshop: eksworkshop.com/kubeflow
Code: github.com/aws-samples/machine-learning-using- k8s/
Optimizing Machine Learning performance: aws.amazon.com/blogs/opensource/optimizing- distributed-deep-learning-performance-amazon-eks/
© 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved.