Apache Submarine: a Unified Machine Learning Platform Made Simple

Apache Submarine: A Unified Machine Learning Platform Made Simple Kai-Hsun Chen Huan-Ping Su Wei-Chiu Chuang Academia Sinica Academia Sinica Cloudera [email protected] [email protected] [email protected] Hung-Chang Hsiao Wangda Tan Zhankun Tang National Cheng Kung University Cloudera Cloudera [email protected] [email protected] [email protected] Xun Liu Yanbo Liang Wen-Chih Lo DiDi Facebook National Cheng Kung University [email protected] [email protected] [email protected] Wanqiang Ji Byron Hsu Keqiu Hu JD.com UC Berkeley LinkedIn [email protected] [email protected] [email protected] HuiYang Jian Quan Zhou Chien-Min Wang KE Holdings Ant Group Academia Sinica [email protected] [email protected] [email protected] ABSTRACT Submarine has been widely used in many technology compa- As machine learning is applied more widely, it is necessary to have nies, including Ke.com and LinkedIn. We present two use cases in a machine learning platform for both infrastructure administrators Section 6. and users including expert data scientists and citizen data scientists [27] to improve their productivity. However, existing machine learn- 1 INTRODUCTION ing platforms are ill-equipped to address the “Machine Learning The ubiquity of machine learning in all aspects of businesses im- tech debts” [39] such as glue code, reproducibility, and portability. poses plenty of challenges for infrastructure administrators, expert Furthermore, existing platforms only take expert data scientists data scientists, and citizen data scientists. For example, as described into consideration, and thus they are inflexible for infrastructure in [31], models with large parameters’ space cannot fit into the administrators and non-user-friendly for citizen data scientists. We memory of a single machine, necessitating model parallelism sup- propose Submarine, a unified machine learning platform, to address port. In addition, data parallelism support is necessary for models the challenges. with high computation cost to shorten the training process. There- For infrastructure administrators, Submarine is a flexible and fore, distributed training techniques like data parallelism and model portable platform. To improve resource efficiency and ensure porta- parallelism which leverage distributed computing resources to train bility, Submarine supports both on-premise clusters including Ku- the models are necessary. bernetes [20] and Hadoop YARN [43], and cloud services, rather Despite rapid development of distributed machine learning plat- than only focusing on Kubernetes. In addition, Submarine supports forms, existing platforms fail to meet the needs of both infrastruc- arXiv:2108.09615v1 [cs.DC] 22 Aug 2021 multiple machine learning frameworks to avoid tight coupling be- ture administrators and users, who have different requirements for tween the modeling and infrastructure layers caused by glue code. the system. We will discuss the problems in the existing platforms For expert data scientists, Submarine is a unified system to respectively from perspectives of both infrastructure administra- streamline the workflow from idea to serving models in produc- tors and users including expert data scientists and non-expert users, tion. Data scientists without experience in distributed systems can or “citizen data scientists” [27], who are experts in other fields such operate Submarine smoothly, since Submarine hides the details of as finance, medicine, and education, but who are not necessarily infrastructure. familiar with machine learning techniques. For citizen data scientists, Submarine is an easy-to-use platform From an expert data scientists’ perspective, a machine learn- to save their time on machine learning framework API and help ing platform must be able to help them improve the productivity, them focus on domain-specific problems. To elaborate, Submarine shortening the model development time to production. However, provides a high-level SDK that allows users to develop their models existing platforms are not properly designed to meet this expecta- in a few lines of code. Also, the Submarine Predefined Template tion. Firstly, existing platforms like TFX [19] do not cover the whole Service enables users to run experiments without writing a single model lifecycle which is composed mainly of data preparation, pro- line of code. totyping, model training, and model serving, forcing data scientists to switch toolsets for different stages of the model development, clusters managed by Kubernetes and YARN and clouds to ensure creating an inconsistent user experience. Secondly, developing on portability and resource-efficiency. Also, Submarine supports pop- these platforms requires deep expertise. For instance, MLflow [44] ular machine learning frameworks including TensorFlow, PyTorch requires data scientists to write a Kubernetes Job Spec to run an [37], and MXNet [16] to ensure the flexibility and extensibility of MLflow project on Kubernetes. In Determined [21], data scientists the system. need to rewrite their model code to integrate with its API. Machine Learning Platform is a new and exciting research topic, From a citizen data scientists’ perspective, they spend most of where it sits at the intersection of Systems and Machine Learn- their time understanding the business problems and are less in- ing. Therefore, there are not many best practices available so far. terested in learning machine learning APIs. Hence, citizen data Leading technology companies [13] tried to build their in-house scientists expect a machine learning platform which saves their machine learning platforms, often to fail or only to claim success time on machine learning framework API and helps them focus after repeated trials. We decided to develop Submarine as an open on domain-specific problems. However, existing platforms only source project governed by the Apache Software Foundation so take expert data scientists into consideration, and thus they are we can build a community and share the best practices. Today, the non-user-friendly for citizen data scientists. Ironically, the citizen project is joined by contributors all over the world, including the data scientist population is a much larger and fast growing com- developers from Cloudera, DiDi, Facebook, JD.com, LinkedIn, KE munity than the expert data scientist population. A new platform Holdings, Ant Group, Microsoft, among others. that considers both expert data scientists and citizen data scientists The rest of the paper is organized as follows. At first, Section 2 is a must. gives a brief overview of existing machine learning platforms. Sec- From an infrastructure administrators’ perspective, a machine tion 3 and Section 4 describe working and in-progress compo- learning platform should ideally be resource-efficient and flexible, nents in Submarine. Then, we discuss the uniqueness of Submarine but existing platforms were not capable of achieving these goals. and compare Submarine with other open source platforms in Sec- Firstly, most platforms such as Kubeflow [18] only support Ku- tion 5. We present two use cases in technology companies including bernetes as a resource orchestrator. However, many technology LinkedIn and Ke.com in Section 6 and discuss future works in Sec- companies like LinkedIn have deeply invested in other orchestra- tion 7. Finally, we conclude this work in Section 8. tion technologies, such as YARN. Secondly, some platforms like Metaflow [34] are built on top of proprietary services offered by public cloud providers such as Amazon Web Services, Microsoft 2 RELATED WORK Azure, or Google Cloud Platform. Hence, they are not suitable for Nowadays, many machine learning platforms have been developed companies that deploy machine learning platforms in data centers. by companies and open source communities. We briefly overview Thirdly, existing platforms like TFX are tightly integrated with a some major open source works including Google’s TFX [19], Kube- single machine learning framework (e.g. TensorFlow [8]). The tight flow [18], Netflix’s Metaflow [34], Determined-AI’s Determined coupling between the modeling layer and the platform layer limits [21], Microsoft’s NNI (Neural Network Intelligence) [7], Tencent’s the flexibility and extensibility of the system, because the develop- Angel-ML [11], Databricks’ MLflow [44], and Anyscale’s Ray [35] ers lose the ability to use a different machine learning framework, in this section. which may be more efficient for a different use case. TFX is a toolkit for building machine learning pipelines. Users We made a decision to build Apache Submarine, a unified ma- can define their TFX pipelines which are a sequence of compo- chine learning platform, to address the problems mentioned above. nents from data engineering to model serving. Moreover, with TFX A “unified” machine learning platform describes a platform that pipelines, users can orchestrate their machine learning workflows provides a unified solution for infrastructure administrators, expert on several platforms, such as Apache Airflow [9] Apache Beam data scientists and citizen data scientists. Submarine takes both [14], and Kubeflow Pipelines. infrastructure administrators and users including expert data sci- Kubeflow is an end-to-end Machine Learning platform which entists and citizen data scientists into consideration. For expert aims to make deployments of machine learning workflows on Ku- data scientists, Submarine is a unified platform that provides data bernetes. Kubeflow also provides

Apache Submarine: a Unified Machine Learning Platform Made Simple

Programming Models to Support Data Science Workflows

Presto: the Definitive Guide

Summary Areas of Interest Skills Experience

TR-4798: Netapp AI Control Plane

Using Amazon EMR with Apache Airflow: How & Why to Do It

Migrating from Snowflake to Bigquery Data and Analytics

Spring Boot AMQP Starter 1.5.8.RELEASE

Amazon EMR Migration Guide How to Move Apache Spark and Apache Hadoop from On-Premises to AWS

Building a Google Cloud Data Platform

Quality of Analytics Management of Data Pipelines for Retail Forecasting

Airflow Documentation

Apache Airflow Overview Viktor Kotliar