
MMLSpark: Unifying Machine Learning Ecosystems at Massive Scales

Mark Hamilton 1, Sudarshan Raghunathan 2, Ilya Matiach 3, Andrew Schonhoffer 3, Anand Raman 2, Eli Barzilay 1, Karthik Rajendran 4 5, Dalitso Banda 4 5, Casey Jisoo Hong 4 5, Manon Knoertzer 4 5, Ben Brodsky 2, Minsoo Thigpen 4, Janhavi Suresh Mahajan 4, Courtney Cochrane 4, Abhiram Eswaran 4, Ari Green 4

*Equal contribution. 1 Microsoft Applied AI, Cambridge, Massachusetts, USA. 2 Microsoft Applied AI, Redmond, Washington, USA. 3 Microsoft Azure Machine Learning, Cambridge, Massachusetts, USA. 4 Microsoft AI Development Acceleration Program, Cambridge, Massachusetts, USA. 5 AUTHORERR: Missing icmlaffiliation. Correspondence to: Mark Hamilton <[email protected]>.

Proceedings of the 36th International Conference on Machine Learning, Long Beach, California, PMLR 97, 2019. Copyright 2019 by the author(s). arXiv:1810.08744v2 [cs.LG] 21 Jun 2019.

Abstract

We introduce Microsoft Machine Learning for Apache Spark (MMLSpark), an open-source library that expands the Apache Spark distributed computing library to tackle problems in deep learning, micro-service orchestration, gradient boosting, model interpretability, and other areas of modern machine learning. We also present a novel machine learning deployment system called Spark Serving that can deploy Apache Spark programs as distributed, sub-millisecond latency web services with significantly greater flexibility and lower latencies than existing frameworks. Spark Serving generalizes beyond map-style computations, allowing distributed aggregations, joins, and shuffles, and enabling users to leverage the same cluster for both training and deployment. Our contributions allow easy composition across machine learning frameworks, compute modes (batch, streaming, and RESTful web serving), and cluster types (static, elastic, and serverless). We demonstrate the value of MMLSpark by creating a method for deep object detection capable of learning without human-labeled data and demonstrate its effectiveness for Snow Leopard conservation. We also demonstrate its ability to create large-scale image search engines.

1. Introduction

As the field of machine learning has advanced, frameworks for using, authoring, and training machine learning systems have proliferated. These different frameworks often have dramatically different APIs, data models, usage patterns, and scalability considerations. This heterogeneity makes it difficult to combine systems and complicates production deployments. In this work, we present Microsoft Machine Learning for Apache Spark (MMLSpark), an ecosystem that aims to unify major machine learning workloads into a single API for execution in a variety of distributed, production-grade environments and languages. We describe the techniques and principles used to unify a representative sample of machine learning technologies, each with its own software stack, communication requirements, and paradigms. We also introduce tools for deploying these technologies as distributed real-time web services. Code and documentation for MMLSpark can be found through our website, https://aka.ms/spark.

2. Background

Throughout this work we build upon the distributed computing framework Apache Spark (Zaharia et al., 2016). Spark is capable of a broad range of workloads and applications, such as fault-tolerant and distributed map, reduce, filter, and aggregation style programs. Spark improves on its predecessors MapReduce and Hadoop by reducing disk IO with in-memory computing and whole-program optimization (Dean & Ghemawat, 2008; Shvachko et al., 2010). Spark clusters can adaptively resize to compute a workload efficiently (elasticity) and can run on resource managers such as Yarn, Mesos, Kubernetes, or manually created clusters. Furthermore, Spark has language bindings in several popular languages like Scala, Java, Python, R, Julia, C#, and F#, making it usable from almost any project.
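To make the data-parallel style described above concrete, the following minimal sketch (ours, not from the paper) uses PySpark to run a fault-tolerant, distributed map, filter, and aggregation; the application name is arbitrary.

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

# On a cluster this session would be backed by a resource manager
# such as Yarn, Mesos, or Kubernetes; locally it runs in-process.
spark = SparkSession.builder.appName("spark-background-example").getOrCreate()

# A distributed map / filter / aggregate program: each stage executes
# in parallel across the cluster's executors, and intermediate results
# stay in memory rather than on disk.
total = (spark.range(1_000_000)                    # distributed ids 0..999999
         .filter(F.col("id") % 2 == 0)             # filter: keep even ids
         .withColumn("squared", F.col("id") ** 2)  # map: square each id
         .agg(F.sum("squared").alias("total")))    # aggregate: global sum
total.show()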
In recent years, Spark has expanded its scope to support SQL, streaming, machine learning, and graph-style computations (Armbrust et al., 2015; Meng et al., 2016; Xin et al., 2013). This broad set of APIs allows a rich space of computations that we can leverage for our work. More specifically, we build upon the SparkML API, which is similar to the popular Python machine learning library scikit-learn (Buitinck et al., 2013). Like scikit-learn, all SparkML models have the same API, which makes it easy to create, substitute, and compose machine learning algorithms into "pipelines". However, SparkML has several key advantages, such as limitless scalability, streaming compatibility, support for structured datasets, broad language support, a fluent API for initializing complex algorithms, and a type system that differentiates computations based on whether they extract state (learn) from data. In addition, Spark clusters can use a wide variety of hardware SKUs, making it possible to leverage modern advances in GPU-accelerated frameworks like Tensorflow, CNTK, and PyTorch (Abadi et al., 2016; Seide & Agarwal, 2016; Paszke et al., 2017). These properties make the SparkML API a natural and principled choice to unify the APIs of other machine learning frameworks.
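The shared Estimator/Transformer API can be illustrated with a short SparkML pipeline sketch (our illustration; the toy dataset and column names are arbitrary):

from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, HashingTF
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("sparkml-pipeline-example").getOrCreate()

# A tiny toy dataset; in practice this would be a distributed table.
train_df = spark.createDataFrame(
    [("spark is fast", 1.0), ("slow batch job", 0.0)], ["text", "label"])

# Every SparkML stage shares one API: Estimators implement fit(),
# Transformers implement transform(), and both compose into Pipelines.
tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashing_tf = HashingTF(inputCol="words", outputCol="features")
lr = LogisticRegression(featuresCol="features", labelCol="label")
pipeline = Pipeline(stages=[tokenizer, hashing_tf, lr])

# fit() extracts state (learns) from data and returns a PipelineModel,
# a pure Transformer applicable to batch data or streams alike.
model = pipeline.fit(train_df)
model.transform(train_df).select("text", "prediction").show()

This uniformity is what lets MMLSpark substitute its own learners and featurizers into the same pipeline abstraction.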
Across the broader computing literature, many have turned to intermediate languages to "unify" and integrate disparate forms of computation. One of the most popular of these languages is the Hypertext Transfer Protocol (HTTP), used widely throughout internet communications. To enable broad adoption and integration of code, one simply needs to create a web-hosted HTTP endpoint or "service". Putting compute behind an intermediate language allows different system components to scale independently to minimize bottlenecks. If services reside on the same machine, one can use local networking capabilities to bypass internet data transfer costs and come closer to the latency of normal function dispatch. This pattern is referred to as a "micro-service" architecture, and powers many of today's large-scale applications (Sill, 2016).

Many machine learning workflows rely on deploying learned models as web endpoints for use in front-end applications. In the Spark ecosystem, there are several ways to deploy applications as web services, such as Azure Machine Learning Services (AML), Clipper, and MLeap. However, these frameworks all compromise on the breadth of models they export or the latency of their deployed services. AML deploys PySpark code in a dockerized Flask application that uses Spark's Batch API on a single-node standalone cluster (Microsoft, a; Grinberg, 2018). Clipper uses an intermediate RPC communication service to invoke a Spark batch job for each request (Crankshaw et al., 2017). Both methods use Spark's Batch API, which adds large overheads. Furthermore, if back-end containers are isolated, this precludes services with inter-node communication like shuffles and joins. MLeap achieves millisecond latencies by re-implementing SparkML models in single-threaded Scala and exporting SparkML pipelines to this alternate implementation (Combust). This incurs a twofold development cost, a lag behind the SparkML library, and an export limitation to models in the core SparkML library. Our system, Spark Serving, avoids these compromises, making the transition from batch or streaming computation to distributed serving seamless and instant.

Many companies such as Microsoft, Amazon, IBM, and Google have embraced model deployment with web services to provide pre-built intelligent algorithms for a wide range of applications (Jackson et al., 2010; Microsoft, c; High, 2012). This standardization enables easy use of cloud intelligence and abstracts away implementation details, environment setup, and compute requirements. Furthermore, intelligent services allow application developers to quickly use existing state-of-the-art models to prototype ideas. In the Azure ecosystem, the Cognitive Services provide intelligent services in domains such as text, vision, speech, search, time series, and geospatial workloads.

3. Contributions

In this section we describe our contributions in three key areas: 1) Unifying several machine learning ecosystems with Spark. 2) Integrating Spark with the networking protocol HTTP and several intelligent web services. 3) Deploying Spark computations as distributed web services with Spark Serving.

These contributions allow users to create scalable machine learning systems that draw from a wide variety of libraries and expose these contributions as web services for others to use. All of these contributions carry the common theme of building a single distributed API that can easily and elegantly create a variety of different intelligent applications. In Section 4 we show how to combine these contributions to solve problems in unsupervised object detection, wildlife ecology, and visual search engine creation.

3.1. Algorithms and Frameworks Unified in MMLSpark

3.1.1. DEEP LEARNING

To enable GPU-accelerated deep learning on Spark, we have previously parallelized Microsoft's deep learning framework, the Cognitive Toolkit (CNTK) (Seide & Agarwal, 2016; Hamilton et al., 2018). This framework powers roughly 80% of Microsoft's internal deep learning workloads and is flexible enough to create most models described in the deep learning literature. CNTK is similar to other automatic differentiation systems like Tensorflow, PyTorch, and MxNet, as they all create symbolic computation graphs that automatically differentiate and compile to machine code. These tools liberate developers and researchers from the difficult task of deriving training
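As a rough illustration of what this integration looks like in practice, the sketch below applies a pre-trained CNTK network to a DataFrame of featurized images using MMLSpark's CNTKModel transformer. This is a sketch based on our reading of the MMLSpark documentation, not code from the paper: the package path, setter names, data paths, and column names are assumptions that have varied across releases.

from pyspark.sql import SparkSession
# Package path is an assumption; it has differed across MMLSpark releases.
from mmlspark import CNTKModel

spark = SparkSession.builder.appName("cntk-on-spark-example").getOrCreate()

# Hypothetical input: a DataFrame whose "features" column holds
# pre-processed image tensors; the path below is a placeholder.
images_df = spark.read.parquet("wasb:///data/featurized_images.parquet")

# CNTKModel wraps a pre-trained CNTK graph as a SparkML Transformer,
# so GPU-accelerated evaluation distributes like any pipeline stage.
# The model location and exact setters are assumptions for illustration.
model = (CNTKModel()
         .setModelLocation("wasb:///models/ConvNet.model")
         .setInputCol("features")
         .setOutputCol("scores"))

scored = model.transform(images_df)  # distributed forward passes
scored.select("scores").show()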