Matrix Computations and Optimization in Apache Spark

Reza Bosagh Zadeh∗, Stanford and Matroid, 475 Via Ortega, Stanford, CA 94305, [email protected]
Xiangrui Meng, Databricks, 160 Spear Street, 13th Floor, San Francisco, CA 94105, [email protected]
Alexander Ulanov, HP Labs, 1501 Page Mill Rd, Palo Alto, CA 94304, [email protected]
Burak Yavuz, Databricks, 160 Spear Street, 13th Floor, San Francisco, CA 94105, [email protected]
Li Pu, Twitter, 1355 Market Street Suite 900, San Francisco, CA 94103, [email protected]
Shivaram Venkataraman, UC Berkeley, 465 Soda Hall, Berkeley, CA 94720, [email protected]
Evan Sparks, UC Berkeley, 465 Soda Hall, Berkeley, CA 94720, [email protected]
Aaron Staple, Databricks, 160 Spear Street, 13th Floor, San Francisco, CA 94105, [email protected]
Matei Zaharia, MIT and Databricks, 160 Spear Street, 13th Floor, San Francisco, CA 94105, [email protected]

∗Corresponding author.

ABSTRACT

We describe matrix computations available in the cluster programming framework, Apache Spark. Out of the box, Spark provides abstractions and implementations for distributed matrices and optimization routines using these matrices. When translating single-node algorithms to run on a distributed cluster, we observe that often a simple idea is enough: separating matrix operations from vector operations and shipping the matrix operations to be run on the cluster, while keeping vector operations local to the driver. In the case of the Singular Value Decomposition, by taking this idea to an extreme, we are able to exploit the computational power of a cluster, while running code written decades ago for a single core. Another example is our Spark port of the popular TFOCS optimization package, originally built for MATLAB, which allows for solving Linear programs as well as a variety of other convex programs. We conclude with a comprehensive set of benchmarks for hardware accelerated matrix computations from the JVM, which is interesting in its own right, as many cluster programming frameworks use the JVM. The contributions described in this paper are already merged into Apache Spark and available on Spark installations by default, and commercially supported by a slew of companies which provide further services.

CCS Concepts

•Mathematics of computing → Mathematical software; Solvers; •Computing methodologies → MapReduce algorithms; Machine learning algorithms; Concurrent algorithms;

Keywords

Distributed Linear Algebra, Matrix Computations, Optimization, Machine Learning, MLlib, Spark

1. INTRODUCTION

Modern datasets are rapidly growing in size and many datasets come in the form of matrices. There is a pressing need to handle large matrices spread across many machines with the same familiar linear algebra tools that are available for single-machine analysis. Several ‘next generation’ data flow engines that generalize MapReduce [5] have been developed for large-scale data processing, and building linear algebra functionality on these engines is a problem of great interest. In particular, Apache Spark [12] has emerged as a widely used open-source engine. Spark is a fault-tolerant and general-purpose cluster computing system providing APIs in Java, Scala, Python, and R, along with an optimized engine that supports general execution graphs.

In this work we present Spark's distributed linear algebra and optimization libraries, the largest concerted cross-institution effort to build a distributed linear algebra and optimization library. The library targets large-scale matrices that benefit from row, column, entry, or block sparsity to store and operate on distributed and local matrices. The library, named linalg, consists of fast and scalable implementations of standard matrix computations for common linear algebra operations including basic operations such as multiplication and more advanced ones such as factorizations. It also provides a variety of underlying primitives such as column and block statistics. Written in Scala and using native (C++ and Fortran based) linear algebra libraries on each node, linalg includes Java, Scala, and Python APIs, and is released as part of the Spark project under the Apache 2.0 license.

1.1 Apache Spark

We restrict our attention to Spark, because it has several features that are particularly attractive for matrix computations:

1. The Spark storage abstraction called Resilient Distributed Datasets (RDDs) is essentially a distributed fault-tolerant vector on which programmers can perform a subset of operations expected from a regular local vector.

2. RDDs permit user-defined data partitioning, and the execution engine can exploit this to co-partition RDDs and co-schedule tasks to avoid data movement.

3. Spark logs the lineage of operations used to build an RDD, enabling automatic reconstruction of lost partitions upon failures. Since the lineage graph is relatively small even for long-running applications, this approach incurs negligible runtime overhead, unlike checkpointing, and can be left on without concern for performance. Furthermore, Spark supports optional in-memory distributed replication to reduce the amount of recomputation on failure.

4. Spark provides a high-level API in Scala that can be easily extended. This aided in creating a coherent API for both collections and matrix computations.

There exists a history of using clusters of machines for distributed linear algebra, for example [3]. These systems are often not fault-tolerant to hardware failures and assume random access to non-local memory. In contrast, our library is built on Spark, which is a dataflow system without direct access to non-local memory, designed for clusters of commodity computers with relatively slow and cheap interconnects, and abundant machine failures. All of the contributions described in this paper are already merged into Apache Spark and available on Spark installations by default, and commercially supported by a slew of companies which provide further services.

1.2 Challenges and Contributions

Given that we have access to RDDs in a JVM environment, four key challenges arise in building a distributed linear algebra library, each of which we address:

1. Data representation: how should one partition the entries of a matrix across machines so that subsequent matrix operations can be implemented as efficiently as possible? This led us to develop three different distributed matrix representations, each of which has benefits depending on the sparsity pattern of the data. We have built

(a) CoordinateMatrix, which puts each nonzero into a separate RDD entry.

(b) BlockMatrix, which treats the matrix as dense blocks of non-zeros, each block small enough to fit in memory on a single machine.

(c) RowMatrix, which assumes each row is small enough to fit in memory. There is an option to use a sparse or dense representation for each row.

These matrix types and the design decisions behind them are outlined in Section 2.

2. Matrix computations must be adapted for running on a cluster, as we cannot readily reuse linear algebra algorithms available for single-machine situations. A key idea that lets us distribute many operations is to separate algorithms into portions that require matrix operations versus vector operations. Since matrices are often quadratically larger than vectors, a reasonable assumption is that vectors fit in memory on a single machine, while matrices do not. Exploiting this idea, we were able to distribute the Singular Value Decomposition via code written decades ago in FORTRAN90, as part of the ARPACK [6] software package. By separating matrix from vector computations, and shipping the matrix computations to the cluster while keeping vector operations local to the driver, we were able to distribute two classes of optimization problems:

(a) Spectral programs: Singular Value Decomposition (SVD) and PCA

(b) Convex programs: Gradient Descent, LBFGS, Accelerated Gradient, and other unconstrained optimization methods. We provide a port of the popular TFOCS optimization framework [1], which covers Linear Programs and a variety of other convex objectives.

Separating matrix operations from vector operations helps in the case that vectors fit in memory, but matrices do not. This covers a wide array of applications, since matrices are often quadratically larger. However, there are some cases for which vectors do not fit in memory on a single machine. For such cases, we use an RDD for the vector as well, and use BlockMatrix for data storage.

We give an outline of the most interesting of these computations in Section 3.

3. Many frameworks such as Spark and Hadoop run on the Java Virtual Machine (JVM), which means that achieving hardware-specific acceleration for computation can be difficult. We provide a comprehensive survey of tools that allow matrix computations to be pushed down to hardware via the Basic Linear Algebra Subprograms (BLAS) interface from the JVM. In addition to a comprehensive set of benchmarks, we have made all the code for producing the benchmarks public to allow for reproducibility. In Section 4 we provide results and pointers to code and benchmarks.

4. Given that there are many cases when distributed matrices and local matrices need to interact (for example multiplying a distributed matrix by a local one, as sketched below), we also briefly describe the local linear algebra library we built to make this possible, although the focus of the paper is distributed linear algebra.
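As a small, concrete instance of this last point, the following sketch (ours, not from the paper) multiplies a distributed RowMatrix, one of the types introduced above, by a local matrix; the local operand is small enough to be shipped to every task. The helper name multiplyByLocal is hypothetical, while RowMatrix.multiply and Matrices.dense are the MLlib calls.

  import org.apache.spark.mllib.linalg.{Matrices, Matrix}
  import org.apache.spark.mllib.linalg.distributed.RowMatrix

  // Sketch: multiply a distributed matrix by a small local one.
  // `distMat` is an m x 3 RowMatrix spread across the cluster; the
  // 3 x 2 local matrix (column-major values) is shipped to every task.
  def multiplyByLocal(distMat: RowMatrix): RowMatrix = {
    val local: Matrix = Matrices.dense(3, 2,
      Array(1.0, 0.0, 0.0, 0.0, 1.0, 0.0))
    distMat.multiply(local)  // result: a distributed m x 2 RowMatrix
  }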

2. DISTRIBUTED MATRIX

Before we can build algorithms to perform distributed matrix computations, we need to lay out the matrix across machines. We do this in several ways, all of which use the sparsity pattern to optimize computation and space usage. A distributed matrix has long-typed row and column indices and double-typed values, stored distributively in one or more RDDs. It is very important to choose the right format to store large and distributed matrices. Converting a distributed matrix to a different format may require a global shuffle, which is quite expensive. Three types of distributed matrices have been implemented so far.

2.1 RowMatrix and IndexedRowMatrix

A RowMatrix is a row-oriented distributed matrix without meaningful row indices, backed by an RDD of its rows, where each row is a local vector. Since each row is represented by a local vector, the number of columns is limited by the integer range, but it should be much smaller in practice. We assume that the number of columns is not huge for a RowMatrix, so that a single local vector can be reasonably communicated to the driver and can also be stored / operated on using a single machine.

An IndexedRowMatrix is similar to a RowMatrix but with meaningful row indices. It is backed by an RDD of indexed rows, so that each row is represented by its index (long-typed) and a local vector.

2.2 CoordinateMatrix

A CoordinateMatrix is a distributed matrix backed by an RDD of its entries. Each entry is a tuple of (i: Long, j: Long, value: Double), where i is the row index, j is the column index, and value is the entry value. A CoordinateMatrix should be used only when both dimensions of the matrix are huge and the matrix is very sparse. A CoordinateMatrix can be created from an RDD[MatrixEntry] instance, where MatrixEntry is a wrapper over (Long, Long, Double). A CoordinateMatrix can be converted to an IndexedRowMatrix with sparse rows by calling toIndexedRowMatrix.

2.3 BlockMatrix

A BlockMatrix is a distributed matrix backed by an RDD of MatrixBlocks, where a MatrixBlock is a tuple of ((Int, Int), Matrix): the (Int, Int) is the index of the block, and Matrix is the sub-matrix at the given index with size rowsPerBlock × colsPerBlock. BlockMatrix supports methods such as add and multiply with another BlockMatrix. BlockMatrix also has a helper function validate which can be used to check whether the BlockMatrix is set up properly.

2.4 Local Vectors and Matrices

Spark supports local vectors and matrices stored on a single machine, as well as distributed matrices backed by one or more RDDs. Local vectors and local matrices are simple data models that serve as public interfaces. The underlying linear algebra operations are provided by Breeze and jblas. A local vector has integer-typed and 0-based indices and double-typed values, stored on a single machine. Spark supports two types of local vectors: dense and sparse. A dense vector is backed by a double array representing its entry values, while a sparse vector is backed by two parallel arrays: indices and values. For example, a vector (1.0, 0.0, 3.0) can be represented in dense format as [1.0, 0.0, 3.0] or in sparse format as (3, [0, 2], [1.0, 3.0]), where 3 is the size of the vector.
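To make these types concrete, the following is a short sketch of our own (not from the paper) that builds each representation through the MLlib linalg API in a spark-shell session, where a SparkContext named sc is predefined; the matrix contents are arbitrary.

  import org.apache.spark.mllib.linalg.{Vector, Vectors}
  import org.apache.spark.mllib.linalg.distributed._

  // Local vectors: dense and sparse representations of (1.0, 0.0, 3.0).
  val dense: Vector = Vectors.dense(1.0, 0.0, 3.0)
  val sparse: Vector = Vectors.sparse(3, Array(0, 2), Array(1.0, 3.0))

  // RowMatrix: an RDD of local vectors, without meaningful row indices.
  val rowMat = new RowMatrix(sc.parallelize(Seq(dense, sparse)))

  // IndexedRowMatrix: each row carries a long-typed index.
  val indexedMat = new IndexedRowMatrix(sc.parallelize(Seq(
    IndexedRow(0L, dense), IndexedRow(5L, sparse))))

  // CoordinateMatrix: one MatrixEntry(i, j, value) per nonzero.
  val coordMat = new CoordinateMatrix(sc.parallelize(Seq(
    MatrixEntry(0L, 0L, 1.0), MatrixEntry(1L, 2L, 3.0))))

  // BlockMatrix: dense blocks of a fixed size; validate() checks the layout.
  val blockMat: BlockMatrix = coordMat.toBlockMatrix(1024, 1024)
  blockMat.validate()
  val product = blockMat.multiply(blockMat.transpose)

  println(s"${rowMat.numRows()} x ${rowMat.numCols()}, " +
    s"${product.numRowBlocks} row blocks")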

3. MATRIX COMPUTATIONS

We now move to the most challenging of tasks: rebuilding algorithms from single-core modes of computation to operate on our distributed matrices in parallel. Here we outline some of the more interesting approaches.

3.1 Singular Value Decomposition

The rank k singular value decomposition (SVD) of an m × n real matrix A is a factorization of the form A = UΣV^T, where U is an m × k unitary matrix, Σ is a k × k diagonal matrix with non-negative real numbers on the diagonal, and V is an n × k unitary matrix. The diagonal entries of Σ are known as the singular values. The k columns of U and the k columns of V are called the “left-singular vectors” and “right-singular vectors” of A, respectively.

Depending on whether the m × n input matrix A is tall and skinny (m ≫ n) or square, we use different algorithms to compute the SVD. In the case that A is roughly square, we use the ARPACK package for computing eigenvalue decompositions, which can then be used to compute a singular value decomposition via the eigenvalue decomposition of A^T A. In the case that A is tall and skinny, we compute A^T A, which is small, and use it locally. In the following two sections we detail the approaches for each of these two cases. Note that the SVD of a wide and short matrix can be recovered from its transpose, which is tall and skinny, and so we do not consider the wide and short case.

There is a well known connection between the eigen and singular value decompositions of a matrix: the two decompositions are the same for positive semidefinite matrices, and A^T A is positive semidefinite with its singular values being the squares of the singular values of A. So one can recover the SVD of A from the eigenvalue decomposition of A^T A. We exploit this relationship in the following two sections.

3.1.1 Square SVD with ARPACK

ARPACK is a collection of Fortran77 subroutines designed to solve eigenvalue problems [6]. Written many decades ago and compiled for specific architectures, it is surprising that it can be effectively distributed on a modern commodity cluster.

The package is designed to compute a few eigenvalues and corresponding eigenvectors of a general n × n matrix A. In the local setting, it is appropriate for sparse or structured matrices, where structured means that a matrix-vector product requires order n rather than the usual order n^2 floating point operations and storage. ARPACK is based upon an algorithmic variant of the Arnoldi process called the Implicitly Restarted Arnoldi Method (IRAM). When the matrix A is symmetric it reduces to a variant of the Lanczos process called the Implicitly Restarted Lanczos Method (IRLM). These variants may be viewed as a synthesis of the Arnoldi/Lanczos process with the Implicitly Shifted QR technique. The Arnoldi process only interacts with the matrix via matrix-vector multiplies.

ARPACK is designed to compute a few, say k, eigenvalues with user-specified features, such as those of largest real part or largest magnitude. Storage requirements are on the order of nk doubles, with no auxiliary storage required. A set of Schur basis vectors for the desired k-dimensional eigen-space is computed which is numerically orthogonal to working precision. The only interaction that ARPACK needs with a matrix is the result of matrix-vector multiplies. By separating matrix operations from vector operations, we are able to distribute the computations required by ARPACK.

An important feature of ARPACK is its ability to allow for arbitrary matrix formats. This is because it does not operate on the matrix directly, but instead acts on the matrix via prespecified operations, such as matrix-vector multiplies. When a matrix operation is required, ARPACK gives control to the calling program with a request for a matrix-vector multiply. The calling program must then perform the multiply and return the result to ARPACK. By using the distributed-computing utility of Spark, we can distribute the matrix-vector multiplies, and thus exploit the computational resources available in the entire cluster.

Since ARPACK is written in Fortran77, it cannot immediately be used on the Java Virtual Machine. However, through the netlib-java and Breeze packages, we use ARPACK on the JVM on the driver node and ship the computations required for matrix-vector multiplies to the cluster. This also means that low-level hardware optimizations can be exploited for any local linear algebraic operations. As with all linear algebraic operations within MLlib, we use hardware acceleration whenever possible. This functionality has been available since Spark 1.1.

We provide experimental results using this idea. A very popular matrix in the recommender systems community is the Netflix Prize Matrix. The matrix has 17,770 rows, 480,189 columns, and 100,480,507 non-zeros. Below we report results on several larger matrices, up to 16x larger. With the Spark implementation of SVD using ARPACK, looking for the top 5 singular vectors and measuring wall-clock time with 68 executors and 8GB of memory each, we can factorize larger matrices distributed in RAM across a cluster in a few seconds, with times listed in Table 1.

3.1.2 Tall and Skinny SVD

In the case that the input matrix has few enough columns that A^T A can fit in memory on the driver node, we can avoid shipping the eigen-decomposition to the cluster and avoid the associated communication costs.

First we compute Σ and V. We do this by computing A^T A, which can be done with one all-to-one communication, details of which are available in [11, 10]. Since A^T A = VΣ^2 V^T is of dimension n × n, for small n (for example n = 10^4) we can compute the eigen-decomposition of A^T A directly and locally on the driver to retrieve V and Σ.

Once V and Σ are computed, we can recover U. Since in this case n is small enough to fit n^2 doubles in memory, V and Σ will also fit in memory on the driver. U, however, will not fit in memory on a single node and will need to be distributed, and we still need to compute it. This can be achieved by computing U = AVΣ^-1, which is derived from A = UΣV^T. Σ^-1 is easy to compute since it is diagonal, and the pseudo-inverse of V is its transpose and also easy to compute. We can distribute the computation of U = A(VΣ^-1) by broadcasting VΣ^-1 to all nodes holding rows of U, and from there each row of U can be computed locally from the corresponding row of A.

The method computeSVD on the RowMatrix class takes care of which of the tall and skinny or square versions to invoke, so the user does not need to make that decision.
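For reference, a minimal sketch of our own (not from the paper) of invoking this from user code in the spark-shell; computeSVD internally selects between the ARPACK-based and tall-and-skinny paths. The helper name topKSvd is hypothetical.

  import org.apache.spark.mllib.linalg.{Matrix, SingularValueDecomposition}
  import org.apache.spark.mllib.linalg.distributed.RowMatrix

  // Sketch: top-k SVD of a distributed RowMatrix `mat`.
  // computeU = true also materializes U as a distributed RowMatrix.
  def topKSvd(mat: RowMatrix, k: Int): Unit = {
    val svd: SingularValueDecomposition[RowMatrix, Matrix] =
      mat.computeSVD(k, computeU = true)
    val U: RowMatrix = svd.U  // distributed left singular vectors
    val s = svd.s             // local vector of the top k singular values
    val V: Matrix = svd.V     // local n x k matrix of right singular vectors
    println(s"singular values: $s; U: ${U.numRows()} x ${U.numCols()}; " +
      s"V: ${V.numRows} x ${V.numCols}")
  }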

Matrix size            Number of nonzeros    Time per iteration (s)    Total time (s)
23,000,000 × 38,000    51,000,000            0.2                       10
63,000,000 × 49,000    440,000,000           1                         50
94,000,000 × 4,000     1,600,000,000         0.5                       50

Table 1: Runtimes for ARPACK Singular Value Decomposition

3.2 Spark TFOCS: Templates for First-Order Conic Solvers

To allow users of single-machine optimization algorithms to use commodity clusters, we have developed Spark TFOCS, an implementation of the TFOCS convex solver for Apache Spark. The original Matlab TFOCS library [1] provides building blocks to construct efficient solvers for convex problems. Spark TFOCS implements a useful subset of this functionality, in Scala, and is designed to operate on distributed data using Spark. Spark TFOCS includes support for:

• Convex optimization using Nesterov's accelerated method (Auslender and Teboulle variant)

• Adaptive step size using backtracking Lipschitz estimation

• Automatic acceleration restart using the gradient test

• Linear operator structure optimizations

• Smoothed Conic Dual (SCD) formulation solver, with continuation support

• Smoothed linear program solver

• Multiple data distribution patterns. (Currently support is only implemented for RDD[Vector] row matrices.)

The name “TFOCS” is being used with permission from the original TFOCS developers, who are not involved in the development of this package and hence not responsible for its support. To report issues or download code, please see the project's GitHub page:

https://github.com/databricks/spark-tfocs

3.2.1 TFOCS

TFOCS is a state of the art numeric solver; formally, a first order convex solver [1]. This means that it optimizes functions that have a global minimum without additional local minima, and that it operates by evaluating an objective function, and the gradient of that objective function, at a series of probe points. The key optimization algorithm implemented in TFOCS is Nesterov's accelerated gradient descent method, an extension of the familiar gradient descent algorithm. In traditional gradient descent, optimization is performed by moving “downhill” along a function gradient from one probe point to the next, iteration after iteration. The accelerated gradient descent algorithm tracks a linear combination of prior probe points, rather than only the most recent point, using a clever technique that greatly improves asymptotic performance.

TFOCS fine-tunes the accelerated gradient descent algorithm in several ways to ensure good performance in practice, often with minimal configuration. For example, TFOCS supports backtracking line search. Using this technique, the optimizer analyzes the rate of change of an objective function and dynamically adjusts the step size when descending along its gradient. As a result, no explicit step size needs to be provided by the user when running TFOCS.

Matlab TFOCS contains an extensive feature set. While the initial version of Spark TFOCS implements only a subset of the many possible features, it contains sufficient functionality to solve several interesting problems.

3.2.2 Example: LASSO Regression

A LASSO linear regression problem (otherwise known as L1 regularized least squares regression) can be described and solved easily using TFOCS. Objective functions are provided to TFOCS in three separate parts, which are together referred to as a composite objective function. The complete LASSO objective function can be represented as:

(1/2) ||Ax − b||_2^2 + λ ||x||_1

This function is provided to TFOCS in three parts. The first part, the linear component, implements matrix multiplication:

Ax

The next part, the smooth component, implements quadratic loss:

(1/2) || · − b||_2^2

And the final part, the nonsmooth component, implements L1 regularization:

λ ||x||_1

The TFOCS optimizer is specifically implemented to leverage this separation of a composite objective function into component parts. For example, the optimizer may evaluate the (expensive) linear component and cache the result for later use. Concretely, in Spark TFOCS the above LASSO regression problem can be solved as follows:

TFOCS.optimize(new SmoothQuad(b), new LinopMatrix(A), new ProxL1(lambda), x0)

Here, SmoothQuad is the quadratic loss smooth component, LinopMatrix is the matrix multiplication linear component, and ProxL1 is the L1 norm nonsmooth component. The x0 variable is an initial starting point for gradient descent. Spark TFOCS also provides a helper function for solving LASSO problems, which can be called as follows:

SolverL1RLS.run(A, b, lambda)

3.2.3 Example: Linear Programming

Solving a linear programming problem requires minimizing a linear objective function subject to a set of linear constraints. TFOCS supports solving smoothed linear programs, which include an approximation term that simplifies finding a solution. Smoothed linear programs can be represented as:

minimize    c^T x + (1/2) ||x − x_0||_2^2
subject to  Ax = b, x ≥ 0

A smoothed linear program can be solved in Spark TFOCS using a helper function as follows:

SolverSLP.run(c, A, b, mu)

A complete linear program example is presented here:

https://github.com/databricks/spark-tfocs

3.3 Convex Optimization

We now focus on convex optimization via gradient descent for separable objective functions, that is, objective functions that can be written in the form

F(w) = Σ_{i=1}^{n} F_i(w)

where w is a d-dimensional vector of parameters to be tuned and each F_i(w) represents the loss of the model for the i'th training point. In the case that d doubles can fit in memory on the driver, the gradient of F(w) can be computed using the computational resources on the cluster, and then collected on the driver, where it will also fit in memory. A simple gradient update can be done locally and then the new guess for w broadcast out to the cluster. This idea is essentially separating the matrix operations from the vector operations, since the vector of optimization variables is much smaller than the data matrix.

Given that the gradient can be computed using the cluster and then collected on the driver, all computations on the driver can proceed oblivious to how the gradient was computed. This means in addition to gradient descent, we can use traditional single-node implementations of all first-order optimization methods that only use the gradient, such as accelerated gradient methods, LBFGS, and variants thereof. Indeed, we have LBFGS and accelerated gradient methods implemented in this way and available as part of MLlib.
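To make the driver/cluster split concrete, here is an illustrative sketch of our own (not MLlib's actual implementation) of full-batch gradient descent for least squares: the per-point gradients are summed on the cluster with treeAggregate, while the small parameter vector w is updated locally on the driver and re-broadcast each iteration. It assumes training data given as an RDD of (label, features) pairs with Breeze dense vectors.

  import breeze.linalg.DenseVector
  import org.apache.spark.rdd.RDD

  // Illustrative sketch: gradient descent for the separable objective
  // F(w) = sum_i 0.5 * (x_i . w - y_i)^2. The gradient is computed on
  // the cluster; the (small) vector update happens on the driver.
  def gradientDescent(data: RDD[(Double, DenseVector[Double])], d: Int,
                      iterations: Int, stepSize: Double): DenseVector[Double] = {
    var w = DenseVector.zeros[Double](d)
    for (_ <- 1 to iterations) {
      // Ship the current parameters to the executors.
      val wBc = data.sparkContext.broadcast(w)
      // Sum the per-point gradients across the cluster.
      val gradient = data.treeAggregate(DenseVector.zeros[Double](d))(
        (acc, point) => {
          val (y, x) = point
          acc + x * ((x dot wBc.value) - y)  // gradient of 0.5*(x.w - y)^2
        },
        (a, b) => a + b)
      // Local, driver-side vector update; the next iteration re-broadcasts w.
      w = w - gradient * stepSize
      wBc.unpersist()
    }
    w
  }

MLlib's own GradientDescent and LBFGS implementations in org.apache.spark.mllib.optimization follow the same pattern, with pluggable gradient and update rules.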

For the first time we provide convergence plots for these optimization primitives available in Spark, listed in Figure 1. We have available the following optimization algorithms, with convergence plots in Figure 1:

• gra: gradient descent implementation [7] using full batch

• acc: accelerated descent as in [1], without automatic restart [8]

• acc r: accelerated descent, with automatic restart [1]

• acc b: accelerated descent, with backtracking, without automatic restart [1]

• acc rb: accelerated descent, with backtracking, with automatic restart [1]

• lbfgs: an L-BFGS implementation [13]

In Figure 1 the x axis shows the number of outer loop iterations of the optimization algorithm. Note that for backtracking implementations, the full cost of backtracking is not represented in this outer loop count. For non-backtracking implementations, the number of outer loop iterations is the same as the number of Spark map reduce jobs. The y axis is the log of the difference from the best determined optimized value. The optimization test runs were:

• linear: A scaled up version of the test data from TFOCS's 'test_LASSO.m' example [1], with 10000 observations on 1024 features. 512 of the features are actually correlated with the result. Unregularized linear regression was used. As expected, the Scala/Spark acceleration implementation was observed to be consistent with the TFOCS implementation on this dataset.

• linear l1: The same as 'linear', but with L1 regularization

• logistic: Each feature of each observation is generated by summing a feature gaussian specific to the observation's binary category with a noise gaussian. 10000 observations on 250 features. Unregularized logistic regression was used.

• logistic l2: Same as 'logistic', but using L2 regularization

For all runs, all optimization methods were given the same initial step size. We now note some observations. First, acceleration consistently converges more quickly than standard gradient descent, given the same initial step size. Second, automatic restarts are indeed helpful for accelerating convergence. Third, backtracking can significantly boost convergence rates in some cases (measured in terms of outer loop iterations), but the full cost of backtracking was not measured in these runs. Finally, LBFGS generally outperformed accelerated gradient descent in these test runs.

3.4 Other matrix algorithms

There are several matrix computations on distributed matrices that use algorithms previously published, so we only cite them here:

• RowMatrix provides QR decomposition as described in [2]

• RowMatrix provides optimized computation of A^T A via DIMSUM [11]

• BlockMatrix will provide large linear model parallelism [4, 9]

4. HARDWARE ACCELERATION

4.1 CPU and GPU acceleration

To allow full use of hardware-specific linear algebraic operations on a single node, we use the BLAS (Basic Linear Algebra Subprograms) interface with relevant libraries for CPU and GPU acceleration. Native libraries can be used in Scala as follows. First, native libraries must have a C BLAS interface or wrapper. The latter is called through the Java Native Interface as implemented in the Netlib-java library and wrapped by the Scala library called Breeze. We consider the following implementations of BLAS-like routines:

• f2jblas - Java implementation of Fortran BLAS

• OpenBLAS - open source CPU-optimized C implementation of BLAS

• MKL - proprietary CPU-optimized C and Fortran implementation of BLAS by Intel

• cuBLAS - proprietary GPU-optimized implementation of BLAS-like routines by nVidia. nVidia provides a Fortran BLAS wrapper for cuBLAS called NVBLAS. It can be used in Netlib-java through the CBLAS interface.

In addition to the mentioned libraries, we also consider the BIDMat matrix library, which can use MKL or cuBLAS. Our benchmark includes the matrix-matrix multiplication routine called GEMM. This operation comes from BLAS Level 3 and can be hardware optimized, as opposed to the operations from the lower BLAS levels. We benchmark GEMM with different matrix sizes, both for single and double precision. The system used for the benchmark is as follows:

• CPU - 2x Xeon X5650 @ 2.67GHz (32 GB RAM)

• GPU - 3x Tesla M2050 3GB, 575MHz, 448 CUDA cores

• Software - RedHat 6.3, Cuda 7, nVidia driver 346.72, BIDMat 1.0.3

The results for the double precision matrices are depicted in Figure 2. A full spreadsheet of results and code is available at https://github.com/avulanov/scala-blas. Results show that MKL provides similar performance to OpenBLAS, except for tall matrices, where the latter is slower. Most of the time the GPU is less effective due to the overhead of copying matrices to/from the GPU. However, when multiplying sufficiently large matrices, i.e. starting from 10000×10000 by 10000×1000, the overhead becomes negligible with respect to the computation complexity. At that point the GPU is several times more effective than the CPU. Interestingly, adding more GPUs speeds up the computation almost linearly for big matrices.

Because it is unreasonable to expect all machines that Spark is run on to have GPUs, we have made OpenBLAS the default method of choice for Spark's local matrix computations. Note that these performance numbers are useful anytime the JVM is used for linear algebra, including Hadoop, Storm, and popular commodity cluster programming frameworks. As an example of BLAS usage in Spark, Neural Networks available in MLlib use the interface heavily, since the forward and backpropagation steps in neural networks are a series of matrix-vector multiplies. A full spreadsheet of results and code is available at:

https://github.com/avulanov/scala-blas
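As a small illustration of this path (our example, not part of the benchmark suite), GEMM can be called from Scala through netlib-java; whether the pure-Java f2jblas fallback or a native backend such as OpenBLAS or MKL answers the call depends on which native libraries are installed on the node. The object name GemmExample is ours.

  import com.github.fommil.netlib.BLAS

  object GemmExample {
    // C = A * B via BLAS dgemm; netlib-java expects column-major flat arrays.
    def gemm(m: Int, n: Int, k: Int,
             a: Array[Double], b: Array[Double]): Array[Double] = {
      val c = new Array[Double](m * n)
      // "N", "N": neither A nor B is transposed.
      BLAS.getInstance().dgemm("N", "N", m, n, k,
        1.0, a, m,   // alpha = 1.0; A is m x k with leading dimension m
        b, k,        // B is k x n with leading dimension k
        0.0, c, m)   // beta = 0.0; C is m x n with leading dimension m
      c
    }

    def main(args: Array[String]): Unit = {
      // Report which BLAS implementation was loaded at run time.
      println(BLAS.getInstance().getClass.getName)
      // Multiply a 2 x 2 matrix by the 2 x 2 identity.
      val c = gemm(2, 2, 2, Array(1.0, 2.0, 3.0, 4.0), Array(1.0, 0.0, 0.0, 1.0))
      println(c.mkString(" "))  // prints 1.0 2.0 3.0 4.0
    }
  }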

Figure 1: Error per iteration for optimization primitives. From left to right, top to bottom: logistic regression, least squares regression, L2 regularized logistic regression, L1 regularized least squares (LASSO).

Figure 2: Benchmarks for hardware acceleration from the JVM. Full results and all numbers are available at https://github.com/avulanov/scala-blas

4.2 Sparse Single-Core Linear Algebra

The BLAS interface is made specifically for dense linear algebra. There are not many libraries on the JVM that efficiently handle sparse matrix operations, or even provide the option to store a local matrix in sparse format. MLlib provides SparseMatrix, which provides memory-efficient storage in Compressed Column Storage (CCS) format. In this format, a row index and a value are stored for each non-zero element in separate arrays. The columns are formed by storing the first and the last indices of the elements for that column in a separate array.

MLlib has specialized implementations for performing Sparse Matrix × Dense Matrix and Sparse Matrix × Dense Vector multiplications, where matrices can be optionally transposed. These implementations outperform libraries such as Breeze, and are competitive against libraries like SciPy, where implementations are backed by C. Benchmarks are available at https://github.com/apache/spark/pull/2294.

Conclusions

We described the distributed and local matrix computations available in Apache Spark, a widely used cluster programming framework. By separating matrix operations from vector operations, we are able to distribute a large number of traditional algorithms meant for single-node usage. This allowed us to solve Spectral and Convex optimization problems, opening the door to easy distribution of many machine learning algorithms. We conclude by providing a comprehensive set of benchmarks on accessing hardware-level optimizations for matrix computations from the JVM.

Acknowledgments

We thank all Spark contributors, a list of which can be found at:

https://github.com/apache/spark/graphs/contributors

Spark and MLlib are cross-institutional efforts, and we thank the Stanford ICME, Berkeley AMPLab, MIT CSAIL, Databricks, Twitter, HP Labs, and many other institutions for their support. We further thank Ion Stoica, Stephen Boyd, Emmanuel Candes, and Steven Diamond for their valuable discussions.

5. REFERENCES

[1] Stephen R. Becker, Emmanuel J. Candès, and Michael C. Grant. Templates for convex cone problems with applications to sparse signal recovery. Mathematical Programming Computation, 3(3):165–218, 2011.

[2] Austin R. Benson, David F. Gleich, and James Demmel. Direct QR factorizations for tall-and-skinny matrices in MapReduce architectures. In 2013 IEEE International Conference on Big Data, pages 264–272. IEEE, 2013.

[3] L. Susan Blackford, Jaeyoung Choi, Andy Cleary, Eduardo D'Azevedo, James Demmel, Inderjit Dhillon, Jack Dongarra, Sven Hammarling, Greg Henry, Antoine Petitet, et al. ScaLAPACK Users' Guide, volume 4. SIAM, 1997.

[4] Weizhu Chen, Zhenghao Wang, and Jingren Zhou. Large-scale L-BFGS using MapReduce. In Advances in Neural Information Processing Systems, pages 1332–1340, 2014.

[5] Jeffrey Dean and Sanjay Ghemawat. MapReduce: simplified data processing on large clusters. Communications of the ACM, 51(1):107–113, 2008.

[6] Richard B. Lehoucq, Danny C. Sorensen, and Chao Yang. ARPACK Users' Guide: Solution of Large-Scale Eigenvalue Problems with Implicitly Restarted Arnoldi Methods, volume 6. SIAM, 1998.

[7] Xiangrui Meng, Joseph Bradley, Burak Yavuz, Evan Sparks, Shivaram Venkataraman, Davies Liu, Jeremy Freeman, DB Tsai, Manish Amde, Sean Owen, Doris Xin, Reynold Xin, Michael J. Franklin, Reza Zadeh, Matei Zaharia, and Ameet Talwalkar. MLlib: Machine learning in Apache Spark. arXiv preprint arXiv:1505.06807, 2015.

[8] Brendan O'Donoghue and Emmanuel Candes. Adaptive restart for accelerated gradient schemes. Foundations of Computational Mathematics, 15(3):715–732, 2013.

[9] Reza Bosagh Zadeh. Large linear model parallelism via a join and reduceByKey. https://issues.apache.org/jira/browse/SPARK-6567, 2015. Accessed: 2015-08-09.

[10] Reza Bosagh Zadeh and Gunnar Carlsson. Dimension independent matrix square using MapReduce. In Foundations of Computer Science (FOCS 2013) - Poster, 2013.

[11] Reza Bosagh Zadeh and Ashish Goel. Dimension independent similarity computation. The Journal of Machine Learning Research, 14(1):1605–1626, 2013.

[12] Matei Zaharia, Mosharaf Chowdhury, Michael J. Franklin, Scott Shenker, and Ion Stoica. Spark: cluster computing with working sets. In Proceedings of the 2nd USENIX Conference on Hot Topics in Cloud Computing, volume 10, page 10, 2010.

[13] Ciyou Zhu, Richard H. Byrd, Peihuang Lu, and Jorge Nocedal. Algorithm 778: L-BFGS-B: Fortran subroutines for large-scale bound-constrained optimization. ACM Transactions on Mathematical Software (TOMS), 23(4):550–560, 1997.

APPENDIX

Appendix A - BLAS references

To find information about the implementations used, here we provide links to the implementations used in Figure 2.

1. Netlib, Reference BLAS and CBLAS: http://www.netlib.org/blas/

2. Netlib-java: https://github.com/fommil/netlib-java

3. Breeze: https://github.com/scalanlp/breeze

4. BIDMat: https://github.com/BIDData/BIDMat/

5. OpenBLAS: https://github.com/xianyi/OpenBLAS

6. CUDA: http://www.nvidia.com/object/cuda_home_new.html

7. NVBLAS: http://docs.nvidia.com/cuda/nvblas
