Deploying the AI R&D Mlops Platform to Enable End-To-End ML Workflows

Deploying the AI R&D MLOps Platform to Enable End-to-End ML Workflows Deploying the Kubeflow MLOps platform in AWS to enabled our Data Science team to create end-to-end ML workflows for automated delivery of machine-learning models. April | 2021 Authors: Patrick McDermott, Christi Kazakov, Seth Havermann, Ethan McDowell, Sean Doyle, Maksim Poletaev and Jordan Walker of Object Computing Inc. World Wide Technology wwt.com Table of Contents Abstract .......................................................................................................3 Business Justification ...........................................................................................3 Experimental Setup and Methodology ...........................................................................4 Conclusion .................................................................................................... 12 References. 12 Deploying the AI R&D MLOps Platform to Enable End-to-End ML Workflows 2 Abstract In this white paper, you will learn about the MLOps platform that a WWT machine-learning (ML) platform infrastructure team built to reliably deliver trained and validated ML models into production. By deploying the Kubeflow MLOps platform in AWS as a component of our common ML infrastructure, the team enabled WWT data scientists to create end-to-end ML workflows. As part of the MLOps platform deployment, the team built an automated delivery pipeline proof-of-concept to train and productionize a natural language processing (NLP) deep learning model, along with microservices that enable a user to search for relevant WWT platform articles that have been ranked by that productionized model. Business Justification As more organizations adopt a data-driven culture by moving up the Data Maturity Curve, MLOps has become a necessity for any data mature company. While ML capabilities have progressed greatly over the last decade, many organizations still struggle with how to deploy these models consistently and reliably. As the industry enters the second decade of the machine learning revolution, there is a need for machine learning systems that can operate in an automated, repeatable and platform independent manner. That upgraded environment should allow data scientists to focus more on the data science and less on where and how they will do the data science. Further, as cloud computing has allowed for big data and massive modeling (e.g., large language models like Bidirectional Encoder Representations from Transformers (BERT) which are specialized Natural Language Processing (NLP) neural networks that often contain 110 million or more parameters) the demand for ML pipelines that encompass the entire machine learning lifecycle is also increasing. Such pipelines allow data scientists to work faster in a repeatable manner while still allowing for innovation. For example, many organizations have built data science teams deploying many models but have siloed the various elements of these models in such a way that these models start to break down in a way that requires constant attention. Through MLOps these models can constantly be deployed reliably and automatically. Here, we present a full ML pipeline with a massive NLP model using Kubeflow as example of a successful pipeline. This by no means constitutes the only available ML pipeline but demonstrates what is possible. MLOps involves building an advanced infrastructure in which the various elements of the pipeline and tools need to integrate seamlessly. Thus, organizations may put together a different puzzle based on data maturity, resources and current infrastructure. That said, many of the lessons learned while building the end-to-end ML workflow with Kubeflow can be applied to a variety of ML pipelines. Deploying the AI R&D MLOps Platform to Enable End-to-End ML Workflows 3 Experimental Setup and Methodology An effective ML infrastructure team enables data scientists to iterate quickly and more efficiently. An organization’s ML infrastructure typically includes platforms and internal tools for ML model training, experimentation and deployment. By unburdening data scientists from the complexity of building ML infrastructure, they can focus on researching the latest concepts in Artificial Intelligence and Deep Learning, performing feature engineering, and training ML models that will eventually provide value in production. We selected Kubeflow to enhance the MLOps capabilities of our AI R&D platform and built a proof-of-concept system including data ingestion, an end-to-end ML workflow and a microservices ecosystem. MLOps Platform Selection When selecting the platform, we assessed multiple options, including MLFlow, Flyte, Domino Data Lab and Kubeflow. We chose Kubeflow because it had several advantages including its level of adoption, its deployment flexibility and its development community. Many organizations have successfully adopted Kubeflow to build reproducible ML pipelines, e.g., Spotify, Bloomberg, Volvo, US Bank, Chase and others. Kubeflow is freely available and, because it is installed on a Kubernetes cluster, it shares Kubernetes’ deployment flexibility in any of the primary cloud providers (AWS, GCP and Azure) and on-premises. To provide the type of MLOps platform that had previously only been achievable by major tech companies building ad-hoc platforms with large ML infrastructure teams, Kubeflow was created by developers at Google, Cisco, IBM, Red Hat, CoreOS and CaiCloud. Critically, the Kubeflow development community provides valuable guidance around installation and use of the MLOps platform and its components. Many MLOps platform options exist today. Selecting one of these platforms and performing a deep dive has allowed WWT to not only gain expertise with Kubeflow but has also enabled WWT data scientists to learn generalized MLOps skills, techniques, patterns that readily transfer across MLOps platforms. Building the Proof-of-Concept As a proof-of-concept of the AI R&D Group’s MLOps capabilities, our ML Platform Infrastructure team built an ML- driven system that enables users to search for technology articles (such as the one you are reading right now) that have been published on WWT’s website, the Advanced Technology Center Platform (which we will call the WWT platform). To build this system, the team deployed Kubeflow, created a component that ingests data from the WWT platform, trained a deep learning model via an end-to-end ML workflow, and built a web application that enables the user to search for articles (Figure 1). The team consisted of a Data Scientist, a Data Engineer and three ML Infrastructure Engineers. For your convenience, we are bolding the key features below. Deploying the AI R&D MLOps Platform to Enable End-to-End ML Workflows 4 Figure 1 The proof-of-concept system ingests articles from the WWT platform, trains a model in the end-to- end ML workflow, and serves predictions to APIs running in the microservices ecosystem. Kubeflow Deployment Because the AI R&D group had adopted AWS for our development environment, we deployed Kubeflow in an Amazon Elastic Kubernetes Service (EKS) cluster. In addition to deploying the MLOps pipeline to Amazon EKS in AWS Cloud, the team also deployed the solution to an EKS cluster running on AWS Outposts. AWS Outposts is a fully managed offering by AWS that offers similar AWS Infrastructure, services, and APIs locally to any datacenter or facility. This is provided via a rack of hardware delivered to a local facility. GitHub Actions along with the Terraform infrastructure-as-code utility helped us automate cluster provisioning and Kubeflow installation. Terraform allowed the team to provisiondifferent cloud services with varying configurations, then perform experiments, and immediately tear down the services to avoid the cost associated with keeping the cloud services (including GPU-accelerated instances) available. In addition to provisioning the EKS cluster, Terraform provisions many other AWS services, including the AWS Elastic File System (EFS). EFS enables multiple Kubeflow components (running as Kubernetes pods) to share training data and ML models with other components. As part of the deployment, after Terraform deploys the AWS cloud services, we use the kfctl utility to install the Kubeflow custom resources on top of the running EKS cluster. By integrating with GitHub Actions, the team built a one-click deployment feature that enables a Data Scientist to easily spin up a Kubeflow cluster without having to gain expertise in Kubernetes or cloud service provisioning. While a deployed Kubeflow cluster and Pipelines let you build end-to-end ML workflows, an ML infrastructure team must enable multiple data scientists to share the cluster. For this multi-tenancy, the team configured multi-user isolation (included with Kubeflow 1.1) which supports multiple user profiles in individual Kubernetes namespaces. In addition, dex (an OpenID Connect identity provider) authenticated each Data Scientist via web-based login form. To further reduce the cluster running cost, the ML infrastructure team implemented autoscaling so GPUs would only run during training. Deploying the AI R&D MLOps Platform to Enable End-to-End ML Workflows 5 Data Ingestion For data ingestion, the data engineer and data scientist collaborated to determine how to transform the WWT article content. The text data from the WWT platform was extracted using the Advanced Technology Center (ATC) connect API. Text extraction used a Python 3.8 program to run locally, ingesting a single

Deploying the AI R&D Mlops Platform to Enable End-To-End ML Workflows

Running ML/DL Workloads Using Red Hat Openshift Container Platform V3.11 Accelerate Your ML/DL Projects Platform Using Kubeflow and NVIDIA Gpus

Application Development with Azure

Running Large-Scale Machine Learning Experiments in the Cloud

Chapter 1 - Overview

Tensorflow 2.0 and Kubeflow for Scalable and Reproducable Enterprise Ai

Using Ipus from Docker Release 1.4.0

Reference Architecture for Kubeflow on Openshift Accelerate ML/DL Workloads Using Kubeflow and Poweredge Servers

CNCF Webinar Taming Your AI/ML Workloads with Kubeflow

Kubeflow: End to End ML Platform Animesh Singh

Machine Learning at Scale with Kubernetes Aug 23Rd, 2018

Application Development with Azure

Build an Event Driven Machine Learning Pipeline on Kubernetes