Machine Learning Using the Dell EMC Ready Architecture for Red Hat OpenShift Container Platform
Streamline your Machine Learning Projects on OpenShift Container Platform v3.11 using Kubeflow

December 2020
H17870.3
White Paper

Abstract
This white paper describes how to deploy Kubeflow v0.5 on Red Hat OpenShift Container Platform using the Dell EMC Ready Architecture and provides recommendations for achieving optimal performance using the latest Intel Xeon Scalable processors.

Dell Technologies Solutions

Copyright
The information in this publication is provided as is. Dell Inc. makes no representations or warranties of any kind with respect to the information in this publication, and specifically disclaims implied warranties of merchantability or fitness for a particular purpose.

Use, copying, and distribution of any software described in this publication requires an applicable software license.

Copyright © 2020 Dell Inc. or its subsidiaries. All Rights Reserved. Dell Technologies, Dell, EMC, Dell EMC and other trademarks are trademarks of Dell Inc. or its subsidiaries. Intel, the Intel logo, the Intel Inside logo and Xeon are trademarks of Intel Corporation in the U.S. and/or other countries. Other trademarks may be trademarks of their respective owners. Published in the USA 12/20 White Paper H17870.3.

Dell Inc. believes the information in this document is accurate as of its publication date. The information is subject to change without notice.

Contents
Executive summary
Solution technology
Machine learning with Kubeflow
Installing and deploying Kubeflow
Using Kubeflow
Summary
References

Executive summary

Business case
Enterprises are increasing their investments in infrastructure platforms to support Artificial Intelligence (AI) use cases and the computing needs of their data science teams. Machine Learning (ML) and Deep Learning (DL) are AI techniques that have demonstrated success across every industry vertical, including manufacturing, healthcare, retail, and cloud services.

Kubeflow, a Kubernetes-native platform for ML workloads for enterprises, was released as an open-source project in December 2017. Kubeflow makes it easier to develop, deploy, and manage ML applications. For more information, see Kubeflow: The Machine Learning Toolkit for Kubernetes.

Kubeflow requires a Kubernetes environment such as Google Kubernetes Engine or Red Hat OpenShift.
Running Kubeflow on OpenShift offers several advantages in an ML context:
• Using Kubernetes as the underlying platform makes it easier for an ML engineer to develop a model locally using a development system such as a laptop before deploying the application to a production Kubernetes environment.
• Running ML workloads in the same environment as the rest of the company’s applications reduces IT complexity.

Red Hat OpenShift Container Platform
Dell Technologies and Red Hat offer a proven platform design that provides accelerated delivery of stateless and stateful cloud-native applications using enterprise-grade Kubernetes container orchestration. Dell Technologies provides tested and validated design guidance to help customers rapidly implement OpenShift Container Platform on Dell Technologies infrastructure. For more information, see the Dell EMC Ready Architecture for Red Hat OpenShift Container Platform v3.11 Architecture Guide.

Dell Technologies uses this enterprise-ready platform as the foundation for building a robust, high-performance ML/DL platform that supports the various life cycle stages of an AI project: model development that uses Jupyter notebooks, experimentation and testing using TensorFlow, and deployment of ML models.

Document scope
This white paper describes how to install and deploy Kubeflow v0.5 on OpenShift Container Platform and provides an example of running an ML training job on the OpenShift platform using a Kubeflow TFJob to run distributed training.

Audience
This white paper is for IT administrators and decision makers who intend to build an ML platform using on-premises infrastructure. Familiarity with ML processes and OpenShift technology is desirable but not essential.
We value your feedback
Dell Technologies and the authors of this document welcome your feedback on the solution and the solution documentation. Contact the Dell Technologies Solutions team by email or provide your comments by completing our documentation survey. Alternatively, contact the Dell Technologies OpenShift team at [email protected].

Authors
Dell EMC: Ramesh Radhakrishnan, John Terpstra
Red Hat: Jacob Liberman, Pete MacKinnon

Note: The OpenShift Container Platform Info Hub for Ready Solutions space on the Dell EMC Communities website provides links to additional, relevant documentation.

Solution technology

Solution architecture
As shown in Figure 1, the Dell EMC reference architecture for OpenShift Container Platform on Dell EMC infrastructure uses five node types: bastion, control-plane, infrastructure, application, and storage.
• Bastion node: The bastion node serves as the main deployment and management server for the OpenShift cluster.
• Control-plane nodes: The control-plane nodes perform control and management functions for the entire cluster environment. These nodes are responsible for the creation, scheduling, and management of all objects specific to OpenShift, including the API, controller management, and scheduler capabilities.
• Infrastructure nodes: The infrastructure nodes execute a range of control plane services, including the OpenShift Container registry, the HAProxy router, and the Heketi service.
• Application nodes: The application nodes run containerized workloads. They contain a single binary of OpenShift node components and are used by control-plane nodes to schedule and control containers. Kubeflow uses the application node resources for running ML/DL jobs.
• Storage nodes: The storage nodes provide persistent storage for the environment. Kubernetes storage classes can create persistent volumes manually or automatically.

The Dell EMC PowerEdge R740 server is used for storage and the PowerEdge R640 server is used for the remaining node types.

Figure 1. Rack diagram

Configuration for OpenShift Container Platform with Kubeflow
This section describes the modifications to OpenShift Container Platform v3.11 that Dell EMC recommends for ML workloads using Kubeflow.

ML/DL training is one of the most computationally intensive workloads in the enterprise data center. Dell EMC recommends using the latest Intel Xeon Scalable processor family to take advantage of the Intel Deep Learning Boost optimizations that are targeted for complex ML workloads. The following table shows the recommended configuration for application nodes:

Table 1. Application node configuration

Hardware   Description                                                      SKU
Server     3 x Dell EMC PowerEdge R640                                      210-AKWU
CPU        2 x Intel Xeon Gold 6248 processor (20 cores, 2.5 GHz, 150W)     338-BRVO
Memory     384 GB (12 x 32 GB 2666MHz DDR4 ECC RDIMM)                       370-ADNF
Storage    Capacity Tier: 2 x 1.6 TB Intel SSD DC P4610                     400-BELT
           Boot Drive: BOSS controller card + with 2 x M.2 sticks 480 GB    403-BBTJ

Machine learning with Kubeflow
Kubeflow integrates commonly used ML tools such as TensorFlow and Jupyter Notebooks into a single platform. Multiple iterations are produced as the models are tuned and new datasets are incorporated. Using automation to manage, build, and maintain the