Presto: Distributed Machine Learning and Graph Processing with Sparse Matrices
Shivaram Venkataraman^1, Erik Bodzsar^2, Indrajit Roy, Alvin AuYoung, Robert S. Schreiber
^1 UC Berkeley, ^2 University of Chicago, HP Labs
[email protected], {erik.bodzsar, indrajitr, alvina, [email protected]}

Abstract

It is cumbersome to write machine learning and graph algorithms in data-parallel models such as MapReduce and Dryad. We observe that these algorithms are based on matrix computations and, hence, are inefficient to implement with the restrictive programming and communication interface of such frameworks.

In this paper we show that array-based languages such as R [3] are suitable for implementing complex algorithms and can outperform current data-parallel solutions. Since R is single-threaded and does not scale to large datasets, we have built Presto, a distributed system that extends R and addresses many of its limitations. Presto efficiently shares sparse structured data, can leverage multi-cores, and dynamically partitions data to mitigate load imbalance. Our results show the promise of this approach: many important machine learning and graph algorithms can be expressed in a single framework and are substantially faster than those in Hadoop and Spark.

1. A matrix-based approach

Many real-world applications require sophisticated analysis on massive datasets. Most of these applications use machine learning, graph algorithms, and statistical analyses that are easily expressed as matrix operations.

For example, PageRank corresponds to the dominant eigenvector of a matrix G that represents the Web graph. It can be calculated by starting with an initial vector x and repeatedly performing x = G ∗ x until convergence [8]. Similarly, recommendation systems in companies like Netflix are implemented using matrix decomposition [37]. Even graph algorithms, such as shortest path, centrality measures, strongly connected components, etc., can be expressed as operations on the matrix representation of a graph [19].

Array-based languages such as R and MATLAB provide an appropriate programming model to express such machine learning and graph algorithms. The core construct of arrays makes these languages well suited to representing vectors and matrices and to performing matrix computations. R has thousands of freely available packages and is widely used by data miners and statisticians, albeit for problems with relatively small amounts of data. It has serious limitations when applied to very large datasets: limited support for distributed processing, no strategy for load balancing, no fault tolerance, and it is constrained by a single server's DRAM capacity.

1.1 Towards an efficient distributed R

We validate our hypothesis that R can be used to efficiently execute machine learning and graph algorithms on large-scale datasets. Specifically, we tackle the following challenges:

Effective use of multi-cores. R is single-threaded. The easiest way to incorporate parallelism is to execute programs across multiple R processes. Existing solutions for parallelizing R use message passing techniques, including network communication, to communicate among processes [25]. This multi-process approach, also used in commercial parallel MATLAB, has two limitations. First, it makes local copies of many data objects, boosting memory requirements: Figure 1 shows that two R instances on a single physical server would hold two copies of the same data, hindering scalability to larger datasets. Second, the network communication overhead becomes proportional to the number of cores utilized instead of the number of distinct servers, again limiting scalability.

Existing efforts for parallelizing R have another limitation: they do not support point-to-point communication. Instead, data has to be moved from worker processes to a designated master process after each phase.
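A toy cost model makes this concrete: when every phase must route results through the master, cross-process traffic grows with the number of R processes (i.e., cores), not with the number of servers. The function below is an illustrative sketch under that simple model, not Presto's or any existing package's actual protocol:

```python
def bytes_moved_per_phase(workers: int, payload: int, point_to_point: bool) -> int:
    """Bytes crossing process boundaries during one phase of an iterative job.

    Master-mediated: each worker ships its partial result (payload bytes)
    to the master, which sends the merged state back to every worker.
    Point-to-point: each worker exchanges state with a single neighbor
    (e.g., in a ring), so no central process is involved.
    """
    if point_to_point:
        return workers * payload          # one neighbor exchange per worker
    return 2 * workers * payload          # gather to master + redistribute

# Doubling the number of R processes doubles the traffic through the
# master, even when the extra processes run on the same physical server.
for p in (2, 4, 8):
    print(p, bytes_moved_per_phase(p, payload=1 << 20, point_to_point=False))
```

Under this (admittedly crude) model the master becomes a serial bottleneck for any algorithm whose phases exchange non-trivial amounts of state.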
Thus, it is inefficient to execute anything that is not embarrassingly parallel [25]. Even simple iterative algorithms are costly due to the communication overhead via the master.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. EuroSys'13, April 15-17, 2013, Prague, Czech Republic. Copyright © 2013 ACM 978-1-4503-1994-2/13/04. $15.00.

Figure 1. R's poor multi-core support: multiple copies of data on the same server and high communication overhead across servers.

Imbalance in sparse computations. Most real-world datasets are sparse. For example, the Netflix prize dataset is a matrix with 480K users (rows) and 17K movies (columns), but only 100 million of the roughly 8 billion possible ratings are available. Similarly, very few of the possible edges are present in Web graphs. It is important to store and manipulate such data as sparse matrices, retaining only the non-zero entries. These datasets also exhibit skew due to their power-law distributions [14], resulting in severe computation and communication imbalance when the data is partitioned for parallel execution. Figure 2 illustrates the result of naïve partitioning of various sparse datasets: LiveJournal (68M edges) [4], Twitter (2B edges), a pre-processed ClueWeb sample (1.2B edges), and the ratings from the Netflix prize (100M ratings). The y-axis represents the block density relative to the sparsest block when each matrix is partitioned into 100 blocks having the same number of rows/columns. The plot shows that a dense block may have 1000× more elements than a sparse block. Depending upon the algorithm, this variance in block density can have a substantial impact on performance (Section 7).

Figure 2. Variance in block density. Y-axis shows density of a block normalized by that of the sparsest block. Lower is better.

1.2 Limitations of current data-parallel approaches

Existing distributed data processing frameworks, such as MapReduce and DryadLINQ, simplify large-scale data processing [12, 17]. Unfortunately, the simplicity of the programming model (as in MapReduce) or the reliance on relational algebra (as in DryadLINQ) makes these systems unsuitable for implementing complex algorithms based on matrix operations. Current systems either do not support stateful computations, do not retain the structure of global shared data (e.g., the mapping of data to matrices), or do not allow point-to-point communication. These limitations of the programming model have led to inefficient implementations of algorithms or to the development of domain-specific systems. For example, Pregel was created for graph algorithms because MapReduce passes the entire state of the graph between steps [24].

There have been recent efforts to better support large-scale matrix operations. Ricardo [11] and HAMA [30] convert matrix operations to MapReduce functions, but end up inheriting the inefficiencies of the MapReduce interface. PowerGraph [14] uses a vertex-centric programming model (a non-matrix approach) to implement data mining and graph algorithms. MadLINQ provides a linear algebra platform on Dryad but does not efficiently handle sparse matrix computations [29]. Unlike MadLINQ and PowerGraph, our aim is to address the issues in scaling R, a system which already has a large user community. Additionally, our techniques for handling load imbalance in sparse matrices are applicable to existing systems like MadLINQ.

1.3 Our Contribution

We present Presto, an R prototype to efficiently process large, sparse datasets. Presto introduces the distributed array, darray, as the abstraction for processing both dense and sparse datasets in parallel. Distributed arrays store data across multiple machines. Programmers can execute parallel functions that communicate with each other and share state using arrays, making it efficient to express complex algorithms.

Presto programs are executed by a set of worker processes which are controlled by a master. For efficient multi-core support, each worker on a server encapsulates multiple R instances that read shared data. To achieve zero copying overhead, we modify R's memory allocator to directly map data from the worker into the R objects. This mapping preserves the metadata in the object headers and ensures that the allocation is garbage-collection safe.

To mitigate load imbalance, the runtime tracks the execution time and the number of elements in each array partition. In case of imbalance, the runtime dynamically merges or sub-divides array partitions between iterations and assigns them to a new task, thus varying the parallelism and load in the system. Dynamic repartitioning is especially helpful for iterative algorithms, where computations are repeated across iterations.

We have implemented seven different applications in Presto, ranging from a recommendation system to a graph centrality measure. Our experience shows that Presto programs are easy to write and can be used to express a wide variety of complex algorithms. Compared to published results of Hadoop and Spark [36], Presto achieves equally good execution times with only a handful of multi-core servers. For the PageRank algorithm, Presto
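Since PageRank recurs throughout this discussion, a minimal self-contained sketch of the power iteration from Section 1 (repeatedly applying x = G ∗ x, with the usual damping/teleport term added) may be helpful. The 4-vertex graph, damping factor, and tolerance below are illustrative choices, not values or code from Presto:

```python
def pagerank(edges, n, d=0.85, tol=1e-10, max_iter=200):
    """Power iteration for PageRank on a graph given as (src, dst) edges.

    Simplifying assumption: every vertex has at least one outgoing edge,
    so no dangling-node correction is needed.
    """
    out = [0] * n
    for src, _ in edges:
        out[src] += 1
    x = [1.0 / n] * n                          # initial uniform vector
    for _ in range(max_iter):
        nxt = [(1 - d) / n] * n                # teleport term
        for src, dst in edges:
            nxt[dst] += d * x[src] / out[src]  # x = d * G * x + teleport
        if sum(abs(a - b) for a, b in zip(nxt, x)) < tol:
            return nxt
        x = nxt
    return x

# Tiny example: vertex 2 is the most linked-to, so it ranks highest.
ranks = pagerank([(0, 1), (0, 2), (1, 2), (2, 0), (3, 2)], n=4)
```

Each iteration is a sparse matrix-vector multiply, which is why the block partitioning of G (and any density skew in it) directly determines per-worker load in a distributed implementation.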
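The block-density skew of Figure 2 is easy to reproduce in miniature. The sketch below gives row i a Zipf-like non-zero count proportional to 1/(i+1), partitions the rows into equal-sized blocks (naïve partitioning), and reports how unbalanced the blocks are; the sizes and the exact 1/rank law are illustrative stand-ins for real power-law data:

```python
def block_nnz_skew(n_rows=1000, n_blocks=10, scale=10000):
    """Non-zeros per equal-row block for a synthetic power-law matrix.

    Row i holds roughly scale/(i+1) non-zeros (a Zipf-like degree law),
    mimicking vertices sorted by popularity.
    """
    nnz = [scale // (i + 1) for i in range(n_rows)]
    rows_per_block = n_rows // n_blocks
    blocks = [sum(nnz[b * rows_per_block:(b + 1) * rows_per_block])
              for b in range(n_blocks)]
    return blocks, max(blocks) / min(blocks)

# The block of popular vertices holds tens of times more non-zeros than
# the last block, even though every block has the same number of rows.
blocks, ratio = block_nnz_skew()
```

Equal row counts therefore do not imply equal work, which is precisely the imbalance that dynamic merging and splitting of array partitions is meant to correct.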