
Introducing Krylov: eBay AI Platform - Made Easy

Henry Saputra, Technical Lead for Krylov - eBay Unified AI Platform

GPU Technology Conference, 2018

Agenda

1. Data Science and Machine Learning at eBay
2. Introducing Krylov
3. Compute Cluster and Accelerator Support with GPU
4. Quickstart Example
5. Future Roadmap
6. Q & A

Data Science and Machine Learning at eBay

eBay Patterns - Tools and Frameworks

Patterns for ML Training
• Single node
• Distributed training
• (GPUs)

Tools
• Languages: R, Python, Scala, C++
• IDE-like: RStudio, Notebooks (Jupyter), Python IDE
• Frameworks: SciPy, matplotlib, Scikit-learn, Spark MLlib, H2O, Weka, XGBoost, Moses
• Pipelines: Cron, Luigi, Apache Airflow, Apache Oozie

Distributed Training
Deep Learning

Key takeaway = CHOICE
1. Flexibility of
2. Flexibility of hardware configuration

Problems and Challenges

1. 50%-70% of the work is plumbing
   a. Accessing and moving secured data
   b. Environment and tools setup
   c. Sub-optimal compute instances - NVIDIA GPUs and high-memory/CPU instances
   d. Long wait time from platform and infrastructure
2. Loss of productivity and opportunities
   a. ML lifecycle management of models and features
   b. Building robust model training pipelines: data preparation, algorithms, hyperparameter tuning, cross-validation
3. Collaboration is almost impossible
4. Research vs. applied ML

Introducing Krylov: Unified eBay AI Platform

Overview

● Krylov is the core project of the eBay unified AI Platform initiative, which aims to provide an easy-to-use and powerful cloud-based data science and machine learning platform.
● The objective of the project is to enable machine learning jobs with easy access to secured data and eBay cloud computing resources.
● The main goals for the Krylov initiative are:
  ○ Easy and secure access to training datasets
  ○ Access to compute on high-performance machines, such as GPUs, or clusters of machines
  ○ Familiar tools and flexible software to run machine learning model training jobs
  ○ Interactive data analysis and visualization, with multi-tenancy support to allow quick prototyping of algorithms and data access
  ○ Sharing and collaboration of ML work between teams in eBay

ML Lifecycle Management

Lifecycle

● MODEL BUILDING - Interactive, iterative
● MODEL TRAINING - Automatable, repeatable, scalable
● MODEL INFERENCING - Deployable, scalable
● MODEL RE-FITTING - Interactive, iterative
● MODEL RE-TRAINING - Interactive, iterative
● Data + Lifecycle Management

Krylov Staircase Design for AI Platform

eBay AI Platform Components

AI Modules
● Speech Recognition
● Machine Translation
● Computer Vision
● Information Retrieval
● Natural Language Understanding
● …

AI Engine - Krylov
● Data: Access, Movement, Discovery, Preparation
● Learning Pipelines
● Model Experimentation
● Data Scientist Workspaces
● Model Lifecycle Management
● Inferencing
● AI Hub (Shared Repository)

Infrastructure - Krylov
● GPU
● Tall instances
● Fast Storage

Krylov High Level Architecture

Krylov Main Features and Concepts

1. Client Command Line Interface (CLI) via krylovctl program
2. ML Application and Run Specification
3. ML Pipelines: Workflow and Workspace
4. Namespaces - For quota and data isolation
5. Jobs and Runs - Managed by Krylov Tools and Minions
6. Secure Data Access - HDFS, NFS, OpenStack Swift, Custom

Krylov CLI - krylovctl

Krylov ML Application

● A Krylov ML Application is a versioned unit of deployment that contains the declaration of the developers’ programs
● Implemented as a client project that is used to build the deployment artifact
● Three main parts:
  ○ mlapplication.json and artifact.json configuration files
  ○ Source code of the programs
  ○ Dependency management via Dockerfile
● Supported types of programs: JVM languages (Java, Scala), Python, shell scripts
● Using the ML Application as the source, developers build a deployment artifact; the Run Specification file then references that artifact to deploy it to one of the nodes in the cluster

Krylov ML Application Example
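Before the slide's own example, here is a minimal sketch of how the "tasks" section of an mlapplication.json file could be sanity-checked. The `validate_tasks` helper and its rules are assumptions made for illustration and are not part of Krylov; only the `train_model` task is copied from the slide.

```python
import json

# Trimmed-down mlapplication.json content. Only the "train_model" task is
# taken from the slide; everything else about this check is hypothetical.
ML_APPLICATION = """
{
  "tasks": {
    "train_model": {
      "program": "com.ebay.oss.krylov.workflow.PythonProgram",
      "parameters": {"file": "helloai-python/helloai/helloworld.py", "args": []}
    }
  }
}
"""


def validate_tasks(app):
    """Return a list of problems found in the "tasks" section (empty if none)."""
    problems = []
    tasks = app.get("tasks", {})
    if not tasks:
        problems.append("no tasks declared")
    for name, task in tasks.items():
        # Assumed rule: every task names a program and carries a parameters object.
        if not isinstance(task.get("program"), str):
            problems.append(f"task {name!r} has no program")
        if not isinstance(task.get("parameters"), dict):
            problems.append(f"task {name!r} has no parameters object")
    return problems


app = json.loads(ML_APPLICATION)
print(validate_tasks(app))
```

A check like this would run on the developer's machine before building the artifact; the real krylovctl validation rules, if any, are not described in the slides.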

{
  "tasks": {
    "prepare_data": {
      "program": "com..oss.krylov.workflow.JvmMainProgram",
      "parameters": {
        "className": "com.ebay.krylov.helloai.HelloWorld"
      }
    },
    "train_model": {
      "program": "com.ebay.oss.krylov.workflow.PythonProgram",
      "parameters": {
        "file": "helloai-python/helloai/helloworld.py",
        "args": []
      }
    },
    ...

Krylov Run Specification

● The Krylov Run Specification is a runtime configuration that supplies configuration overrides and passes parameters to each Task in an ML Application job submission
● It tells the Krylov master API server which artifact created from the ML Application will be used in the compute cluster
● It is defined in a runspec.json file, or can be passed as an argument to the krylovctl client program
● The runspec.json file also defines the compute resources, such as which NVIDIA GPUs to use, CPU, and memory, and which Docker image provides the dependencies used by the ML Application programs

Krylov Run Specification Example
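The override behaviour described for the Run Specification can be illustrated with a small sketch. It assumes, hypothetically, that each task's `taskParameters` in the run spec are merged over the `parameters` declared for the same task in the ML Application; the slides do not spell out the actual merge rules, and `apply_run_spec` is not a Krylov function.

```python
import copy


def apply_run_spec(ml_application, run_spec):
    """Return a copy of the application's tasks with run-spec overrides merged in.

    Assumed semantics: run-spec "taskParameters" win over the task's own
    "parameters" when the same key appears in both.
    """
    tasks = copy.deepcopy(ml_application.get("tasks", {}))
    for name, override in run_spec.get("tasks", {}).items():
        if name not in tasks:
            raise KeyError(f"run spec refers to unknown task {name!r}")
        params = tasks[name].setdefault("parameters", {})
        params.update(override.get("taskParameters", {}))
    return tasks


# Task parameters as declared in the ML Application (program field omitted here).
ml_application = {
    "tasks": {
        "prepare_data": {
            "parameters": {"className": "com.ebay.krylov.helloai.HelloWorld"}
        }
    }
}

# Per-task overrides in the shape used by runspec.json.
run_spec = {
    "jobName": "job-sample",
    "tasks": {
        "prepare_data": {
            "taskParameters": {
                "prepare_data_parameter_key": "prepare_data_parameter_value"
            }
        }
    },
}

merged = apply_run_spec(ml_application, run_spec)
print(merged["prepare_data"]["parameters"])
```

The original application dictionary is left untouched, so the same ML Application can be submitted with several different run specs.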

{
  "jobName": "job-sample",
  "artifact": "myartifact",
  "artifactTag": "latest",
  "mlApplication": "com.ebay.oss.krylov.workflow.app.GenericMLApplication",
  "applicationParameters": { },
  "tasks": {
    "prepare_data": {
      "taskParameters": {
        "prepare_data_parameter_key": "prepare_data_parameter_value"
      }
    }
  }
}

Krylov ML Pipelines: Workflow

● The Krylov ML batch lifecycle pipeline is defined as a Krylov Workflow definition
  ○ Declarative
  ○ Default Generic Workflow
● Important concepts for Krylov Workflow:
  ○ Workflow - A single pipeline defined within Krylov and the unit of deployment for an ML Application
    ■ Each Workflow contains one or more Tasks
    ■ The Tasks are connected to each other in a Directed Acyclic Graph (DAG) structure
  ○ Task - The smallest unit of execution; it runs a developer’s Program and is executed on a single machine
  ○ Flows - One or more key-value pairs, each mapping a name to a declaration of a Task DAG
  ○ Flow - The key, chosen from the entries in the Flows definition, that will be run

Workflow Example in mlapplication.json
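To make the Flows, Flow, and DAG concepts concrete, the sketch below derives one valid execution order for the selected Flow. It assumes that each key in a flow lists the tasks that run after it; the slides do not state the edge direction, and `execution_order` is a hypothetical helper, not Krylov code. It uses `graphlib` from the Python standard library (Python 3.9+).

```python
from graphlib import CycleError, TopologicalSorter


def execution_order(app):
    """Return the tasks of the selected flow in an order that respects the DAG."""
    flow = app["flows"][app["flow"]]  # "flow" picks one entry out of "flows"
    sorter = TopologicalSorter()
    for task, downstream in flow.items():
        sorter.add(task)
        for nxt in downstream:
            # Assumed direction: `nxt` runs after `task`.
            sorter.add(nxt, task)
    # static_order() raises CycleError if the declaration is not acyclic.
    return list(sorter.static_order())


app = {
    "flows": {
        "sample_flow": {
            "prepare_data": ["train_model"],
            "train_model": ["output"],
        }
    },
    "flow": "sample_flow",
}

print(execution_order(app))
```

For the linear `sample_flow` there is only one valid order: `prepare_data`, then `train_model`, then `output`. A flow whose tasks depend on each other in a loop would be rejected with `CycleError`.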

{
  "tasks": {
    ...
  },
  "flows": {
    "sample_flow": {
      "prepare_data": ["train_model"],
      "train_model": ["output"]
    }
  },
  "flow": "sample_flow"
}

Workflow Runs Flow

Krylov ML Pipelines: Workspace

● A Workspace is an interactive web application that lets developers use a web browser for ML model prototyping, data preparation, and exploration
● The Workspace runs as Jupyter Notebook servers launched on high-CPU/memory or NVIDIA GPU instances
● Krylov enhances the JupyterHub project to allow distributed launching of multi-tenant Jupyter Notebook servers in the Krylov compute cluster using Kubernetes
● A Krylov Workspace uses a configuration file at creation time to override and customize default parameters

Workspace Deployment Flow

Krylov Compute Cluster

Krylov Cluster Infrastructure

Krylov Compute Cluster Deployment

Krylov Cluster Monitoring
● Metrics - Grafana, InfluxDB, and Telegraf for GPU monitoring

Krylov Metrics Management Flow

Krylov Compute Resources Management

Quickstart Example

Steps to Submit Krylov Workflow Job with CLI

1. Download the krylovctl program from the Krylov release repository
2. Run `krylovctl project create` to create a new project on the local machine
3. Update or add code to the Krylov project for the machine learning programs
4. Register each program as a Program within a Task in mlapplication.json
5. Add a new Flow for the defined Tasks to construct the Workflow as a Directed Acyclic Graph (DAG)
6. Run `krylovctl project build` to build the project
7. Run `krylovctl artifact create` to copy the runnables of the program into an artifact file
8. Run `krylovctl artifact upload` to upload the artifact file for remote execution
9. Run `krylovctl job run` for local execution, or `krylovctl job submit` to run it in the computing cluster

Demo Time
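As a recap of the CLI steps, the following sketch assembles the krylovctl command sequence for a local or a cluster run and prints it as a dry run. It uses only the subcommands named in the steps and adds no flags; the assumption that a local run skips `artifact upload` comes from the wording "for remote execution" and is not confirmed by the slides. Nothing is executed.

```python
import shlex


def krylov_commands(remote):
    """Return the krylovctl invocations for one run, as argument lists.

    Code edits and mlapplication.json changes happen by hand between
    `project create` and `project build`, so this is a checklist, not a script.
    """
    steps = [
        "krylovctl project create",
        "krylovctl project build",
        "krylovctl artifact create",
    ]
    if remote:
        # Assumption: upload is needed only when the job runs in the cluster.
        steps += ["krylovctl artifact upload", "krylovctl job submit"]
    else:
        steps.append("krylovctl job run")
    return [shlex.split(step) for step in steps]


for argv in krylov_commands(remote=True):
    print(" ".join(argv))
```

Passing each argument list to `subprocess.run` would turn the dry run into a real invocation once krylovctl is installed.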

● Here we go ...

Future Roadmap

1. Inferencing Platform
2. Exploration and documentation of RESTful API for job management
3. Data Source and Dataset abstraction via Krylov SDKs
4. Managed ML Pipelines - Computer Vision, NLP, Machine Translation
5. Distributed Deep Learning
6. AutoML - Hyperparameter Tuning
7. AI Hub to share ML Applications and Datasets

Questions?