arXiv:1801.04723v1 [cs.DC] 15 Jan 2018

SPIN: A Fast and Scalable Matrix Inversion Method in Apache Spark

Chandan Misra ([email protected]), Sourangshu Bhattacharya ([email protected]), Swastik Haldar, and Soumya K. Ghosh ([email protected])
Indian Institute of Technology Kharagpur, West Bengal, India

ICDCN '18, January 4–7, 2018, Varanasi, India. DOI: 10.1145/3154273.3154300

ABSTRACT
The growth of big data in domains such as Earth Sciences, Social Networks, Physical Sciences, etc. has led to an immense need for efficient and scalable linear algebra operations, e.g. matrix inversion. Existing methods for efficient and distributed matrix inversion using big data platforms rely on LU decomposition based block-recursive algorithms. However, these are complex algorithms and require a lot of side calculations, e.g. matrix multiplication, at various levels of recursion. In this paper, we propose a different scheme based on Strassen's matrix inversion algorithm (mentioned in Strassen's original 1969 paper), which uses far fewer operations at each level of recursion. We implement the proposed algorithm and, through extensive experimentation, show that it is more efficient than the state of the art methods. Furthermore, we provide a detailed theoretical analysis of the proposed algorithm and derive theoretical running times which match closely with the empirically observed wall clock running times, thus explaining the U-shaped behaviour w.r.t. block-sizes.

CCS CONCEPTS
• Computing methodologies → MapReduce algorithms;

KEYWORDS
Linear Algebra, Matrix Inversion, Spark, Strassen's Algorithm

1 INTRODUCTION
Dense matrix inversion is a basic procedure used by many applications in Data Science, Earth Science and Scientific Computing, and has become an essential component of many such systems. It is an expensive operation, both in terms of space and computational complexity, and hence consumes a large fraction of resources in many of the workloads. In the big data era, many of these applications have to work on huge matrices, possibly stored over multiple servers, and thus consume huge amounts of computational resources for matrix inversion. Hence, designing efficient large scale distributed matrix inversion algorithms is an important challenge.

There are a variety of existing inversion algorithms, e.g. methods based on QR decomposition [13], LU decomposition [13], Cholesky decomposition [5], Gaussian Elimination [3], etc. Most of them require O(n³) time (where n denotes the order of the matrix), and the main speed-ups in shared memory settings come from architecture specific optimizations (reviewed in section 2). Surprisingly, there are not many studies on distributed matrix inversion using big data frameworks, where jobs could be distributed over machines with a diverse set of architectures.

Since its release in 2012, Spark [19] has been adopted as a dominant solution for scalable and fault-tolerant processing of huge datasets in many applications, e.g., machine learning [11], graph processing [7], climate science [12], social media analytics [2], etc. Spark has gained its popularity for its in-memory interactive and iterative processing ability, which runs distributed data processing faster than Hadoop MapReduce. Its close integration with Scala / Java and the flexible structure of RDDs allow distributed recursive algorithms to be implemented efficiently, without compromising on scalability and fault-tolerance. Hence, in this paper we focus on Spark for the implementation of large scale distributed matrix inversion.

LU decomposition is the most widely used technique for distributed matrix inversion, possibly due to its efficient block-recursive structure. Xiang et al. [17] proposed a Hadoop based implementation relying on computing the LU decomposition of a matrix and inverting it from the decomposition, and discussed many Hadoop specific optimizations. Recently, Liu et al. [10] proposed several optimized block-recursive inversion algorithms based on LU decomposition on Spark. In the block recursive approach [10], the computation is broken down into a pipeline of subtasks that are computed as Spark tasks on a cluster. The costliest part of the computation is matrix multiplication, and the authors give a couple of optimized algorithms to reduce the number of multiplications. However, in spite of being optimized, the implementation requires 9 O(n³) operations on the leaf nodes of the recursion tree, 12 multiplications at each recursion level of the LU decomposition, and an additional 7 multiplications after the LU decomposition of the matrix, which makes the implementation perform slower.

In this paper, we use a much simpler and less exploited algorithm, proposed by Strassen in his 1969 multiplication paper [16]. The algorithm follows a similar block-recursion structure as LU decomposition, yet provides a simpler approach to matrix inversion. This approach involves no additional multiplications at the leaf level of recursion, and requires only 6 multiplications at intermediate levels. We propose and implement a distributed matrix inversion algorithm based on Strassen's original serial inversion scheme. We also provide a detailed analysis of wall clock time for the proposed algorithm, thus revealing the 'U'-shaped behaviour with respect to block size. Experimentally, we show comprehensively that the proposed approach is superior to the LU decomposition based approaches for all corresponding block sizes, and hence overall. We also demonstrate that our analysis of the proposed approach matches the empirically observed wall clock time, and is similar to the ideal scaling behaviour. In summary:

(1) We propose and implement a novel approach (SPIN) to distributed matrix inversion, based on an algorithm proposed by Strassen [16].
(2) We provide a theoretical analysis of our proposed algorithm which matches closely with the empirically observed wall clock time.
(3) Through extensive experimentation, we show that the proposed algorithm is superior to the LU decomposition based approach.

2 RELATED WORK
The literature on parallel and distributed matrix inversion can be divided broadly into three categories: 1) HPC based approaches, 2) GPU based approaches and 3) Hadoop and Spark based approaches. Here, we briefly review them.

2.1 HPC based approach
LINPACK, LAPACK and ScaLAPACK are some of the most robust software packages that support matrix inversion. LINPACK was written in Fortran and used on shared-memory vector computers. It has been superseded by LAPACK, which runs more efficiently on modern cache-based architectures. LAPACK has also been extended to run on distributed-memory MIMD parallel computers in the ScaLAPACK package. However, these packages are based on architectures and frameworks which are not fault tolerant, and MapReduce based matrix inversion is more scalable than ScaLAPACK, as shown in [17]. Lau et al. [8] presented two algorithms for inverting sparse, symmetric and positive definite matrices on SIMD and MIMD machines respectively. The algorithms exploit the sparseness of the matrix to achieve higher performance. Bientinesi et al. [4] presented a parallel implementation of symmetric positive definite matrix inversion on three architectures — sequential processors, symmetric multiprocessors and distributed memory parallel computers — using the Cholesky factorization technique. Yang et al. [18] presented a parallel algorithm for matrix inversion based on Gauss-Jordan elimination with partial pivoting. It used an efficient mechanism to reduce the communication overhead and also provides good scalability. Bailey et al. presented techniques to compute the inverse of a matrix using an algorithm suggested by Strassen in [16]. It uses the Newton iteration method to increase its stability while preserving parallelism. Most of the above works are based on specialized matrices and are not meant for general matrices. In this paper, we concentrate on any kind of square positive definite and invertible matrices which are distributed on large clusters, for which the above algorithms are not suitable.

2.2 Multicore and GPU based approach
In order to fully exploit the multicore architecture, tile algorithms have been developed. Agullo et al. [1] developed such a tile algorithm to invert a symmetric positive definite matrix using Cholesky decomposition. Sharma et al. [15] presented a modified Gauss-Jordan algorithm for matrix inversion on a CUDA based GPU platform and studied the performance metrics of the algorithm. Ezzatti et al. [6] presented several algorithms for computing the matrix inverse based on the Gauss-Jordan algorithm on a hybrid platform consisting of multicore processors connected to several GPUs. Although the above works have demonstrated that GPUs can considerably reduce the computational time of matrix inversion, they are non-scalable centralized methods and need special hardware.

2.3 MapReduce based approach
MadLINQ [14] offered a highly scalable, efficient and fault tolerant matrix computation system with a unified programming model which integrates with DryadLINQ, a data parallel computing system. However, it does not mention any inversion algorithm explicitly. Xiang et al. [17] implemented the first LU decomposition based matrix inversion in the Hadoop MapReduce framework. However, it suffers from typical Hadoop shortcomings like redundant data communication between map and reduce phases and the inability to preserve the distributed recursion structure. Liu et al. [10] provide the same LU based distributed inversion on the Spark platform. It optimizes the algorithm by eliminating redundant matrix multiplications to achieve faster execution. Almost all the MapReduce based approaches rely on LU decomposition to invert a matrix. The reason is that it partitions the computation in a way suitable for MapReduce based systems. In this paper, we show that matrix inversion can be performed efficiently in a distributed environment like Spark by implementing Strassen's scheme, which requires a smaller number of multiplications than the earlier approaches, providing faster execution.

3 ALGORITHM DESIGN
In this section, we discuss the implementation of SPIN on the Spark framework. First, we describe the original Strassen's inversion algorithm [16] for serial matrix inversion in section 3.1. Next, in section 3.2, we describe the BlockMatrix data structure from MLLib which is used in our algorithm to distribute the large input matrix into the distributed file system. Finally, section 3.3 describes the distributed inversion algorithm and its implementation strategy using BlockMatrix.

3.1 Strassen's Algorithm for Matrix Inversion
Strassen's matrix inversion algorithm appeared in the same paper in which the well known Strassen's matrix multiplication was published. This algorithm can be described as follows. Let the two matrices A and C = A^{-1} be split into half-sized sub-matrices:

    [ A11  A12 ]^{-1}   [ C11  C12 ]
    [ A21  A22 ]      = [ C21  C22 ]

Then the result C can be calculated as shown in Algorithm 1. Intuitively, the steps involved in the algorithm are difficult to perform in parallel. However, for input matrices which are too large to fit into the memory of a single server, each such step is required to be processed distributively. These steps include breaking a matrix into four equal size sub-matrices, multiplication and subtraction of two matrices, multiplying a matrix by a scalar, and arranging four half-sized sub-matrices into a full matrix. All these steps are done by splitting the matrix into blocks, which act as the execution unit of the Spark job. A brief description of the block data structure is given below.

Algorithm 1: Strassen's Serial Inversion Algorithm

function Inverse();
Input : Matrix A (input matrix of size n × n), int threshold
Output: Matrix C (inverse of matrix A)
begin
    if n = threshold then
        invert A in any approach (e.g., LU, QR, SVD);
    else
        Compute A11, A12, A21, A22 by computing n = n/2;
        I   ← A11^{-1}
        II  ← A21 . I
        III ← I . A12
        IV  ← A21 . III
        V   ← IV − A22
        VI  ← V^{-1}
        C12 ← III . VI
        C21 ← VI . II
        VII ← III . C21
        C11 ← I − VII
        C22 ← −VI
    end
    return C
end

3.2 Data Structure
In order to distribute the matrix in the HDFS (Hadoop Distributed File System), we create a distributed matrix called BlockMatrix, which is basically an RDD of MatrixBlocks spread over the cluster. Distributing the matrix as a collection of blocks makes them easy to process in parallel and follows the divide and conquer approach. A MatrixBlock is a block of the matrix represented as a tuple ((rowIndex, columnIndex), Matrix). Here, rowIndex and columnIndex are the row and column index of a block of the matrix. Matrix refers to a one-dimensional array representing the elements of the matrix arranged in a column major fashion.

[Figure 1: Recursion tree for Algorithm 2 — the upper-left sub-matrix A11 and the intermediate matrix V are inverted at each level, recursively down to the leaves.]

3.3 Distributed Block-recursive Matrix Inversion Algorithm
The distributed block-recursive algorithm can be visualized as in Figure 1, where the upper left sub-matrix is divided recursively until it can be inverted serially on a single machine. After the leaf node inversion, the inverted matrix is used to compute intermediate matrices, where each step is done distributively. Another recursive call is performed for matrix VI until a leaf node is reached. Like A11, it is also inverted on a single node when the leaf node is reached. The core inversion algorithm (described in Algorithm 2) takes a matrix (say A) represented as a BlockMatrix as input, as shown in Figure 1. The core computation performed by the algorithm is based on six distributed methods, which are as follows:

• breakMat: Breaks a matrix into four equal sized sub-matrices.
• xy: Returns one of the four sub-matrices after the breaking, according to the index specified by x and y.
• multiply: Multiplies two BlockMatrix.
• subtract: Subtracts two BlockMatrix.
• scalarMul: Multiplies a scalar with a BlockMatrix.
• arrange: Arranges four equal quarter BlockMatrices into a single full BlockMatrix.

Below we describe the methods in a little more detail and also provide the algorithm for each.

breakMat method breaks a matrix into four sub-matrices but does not return the four sub-matrices to the caller. It just prepares the input matrix in a form which helps filter out each part easily. As described in Algorithm 3, it takes a BlockMatrix and returns a PairRDD of tag and Block using a mapToPair transformation. First, the BlockMatrix is converted into an RDD of MatrixBlocks. Then, each MatrixBlock of the RDD is mapped to a tuple of (tag, MatrixBlock), resulting in a pairRDD of such tuples. Inside the mapToPair transformation, we carefully tag each MatrixBlock according to which quadrant it belongs to.

xy method is a generic method signature for the four methods used for accessing one of the four sub-matrices of size 2^{n−1} from a matrix of size 2^n. Each method consists of two transformations — filter and map. filter takes the matrix as a pairRDD of (tag, MatrixBlock) tuples, which was the output of the breakMat method, and filters the appropriate portion against the tag associated with the MatrixBlock. Then it converts the pairRDD into an RDD using the map transformation.
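The distributed implementation itself is not reproduced in the paper; the following is only a minimal Scala sketch of the tagging and filtering just described, written against MLlib's BlockMatrix (whose blocks field is an RDD of ((rowIndex, colIndex), Matrix) pairs). The method names and quadrant tags follow the pseudocode of Algorithms 3 and 4, and size is assumed to be the already-halved number of block-rows per quadrant, as in Algorithm 2.

    import org.apache.spark.mllib.linalg.Matrix
    import org.apache.spark.mllib.linalg.distributed.BlockMatrix
    import org.apache.spark.rdd.RDD

    // breakMat (cf. Algorithm 3): tag every block of A with the quadrant it
    // falls in and re-index it relative to that quadrant.
    def breakMat(A: BlockMatrix, size: Int): RDD[(String, ((Int, Int), Matrix))] =
      A.blocks.map { case ((ri, ci), mat) =>
        val tag = (ri / size, ci / size) match {
          case (0, 0) => "A11"
          case (0, 1) => "A12"
          case (1, 0) => "A21"
          case _      => "A22"
        }
        (tag, ((ri % size, ci % size), mat))
      }

    // xy (cf. Algorithm 4): keep only the blocks carrying the requested tag and
    // wrap them as a stand-alone BlockMatrix; blockSize is the per-block dimension.
    def xy(broken: RDD[(String, ((Int, Int), Matrix))], tag: String,
           blockSize: Int): BlockMatrix =
      new BlockMatrix(broken.filter(_._1 == tag).map(_._2), blockSize, blockSize)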

Algorithm 2: Spark Algorithm for Strassen's Inversion Scheme

function Inverse();
Input : BlockMatrix A, int size, int blockSize
Output: BlockMatrix AInv
size = Size of matrix A or B; blockSize = Size of a single matrix block;
begin
    n ← size / blockSize
    if n = 1 then
        RDD<Block> invA ← A.toRDD()
        Map();
        begin
            Input : Block block
            Output: Block block
            block.matrix ← locInverse(block.matrix)
            return block
        end
        blockAInv ← invA.toBlockMatrix()
        return blockAInv
    else
        size ← size/2
        pairRDD ← breakMat(A, size)
        A11 ← 11(pairRDD, blockSize); A12 ← 12(pairRDD, blockSize)
        A21 ← 21(pairRDD, blockSize); A22 ← 22(pairRDD, blockSize)
        I   ← Inverse(A11, size, blockSize)
        II  ← multiply(A21, I)
        III ← multiply(I, A12)
        IV  ← multiply(A21, III)
        V   ← subtract(IV, A22)
        VI  ← Inverse(V, size, blockSize)
        C12 ← multiply(III, VI)
        C21 ← multiply(VI, II)
        VII ← multiply(III, C21)
        C11 ← subtract(I, VII)
        C22 ← scalarMul(VI, −1, blockSize)
        C   ← arrange(C11, C12, C21, C22, size, blockSize)
        return C
    end
end

Algorithm 3: Spark Algorithm for breaking a BlockMatrix

function breakMat();
begin
    Input : BlockMatrix A, int size
    Output: PairRDD brokenRDD
    ARDD ← A.toRDD()
    MapToPair();
    begin
        Input : block of ARDD
        Output: tuple of brokenRDD
        ri ← block.rowIndex; ci ← block.colIndex
        if ri/size = 0 & ci/size = 0 then tag ← "A11"
        else if ri/size = 0 & ci/size = 1 then tag ← "A12"
        else if ri/size = 1 & ci/size = 0 then tag ← "A21"
        else tag ← "A22"
        end
        block.rowIndex ← ri % size; block.colIndex ← ci % size
        return Tuple2(tag, block)
    end
    return brokenRDD
end

Algorithm 4: Spark Algorithm for filtering one sub-matrix out of a broken BlockMatrix

function xy();
begin
    Input : PairRDD brokenRDD
    Output: BlockMatrix xy
    filter();
    begin
        Input : PairRDD brokenRDD
        Output: PairRDD filteredRDD
        return brokenRDD.tag = "Axy"
    end
    map();
    begin
        Input : PairRDD filteredRDD
        Output: RDD rdd
        return filteredRDD.block
    end
    xy ← rdd.toBlockMatrix()
    return xy
end

multiply method multiplies two input sub-matrices and returns another sub-matrix of BlockMatrix type. The multiply method in our algorithm uses the naive block matrix multiplication approach, which replicates the blocks of the matrices and groups the blocks to be multiplied together in the same node. It uses co-group to reduce the communication cost.

subtract method subtracts two BlockMatrix and returns the result as a BlockMatrix.
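As an illustration only (not the paper's published implementation), the else branch of Algorithm 2 composes these primitives roughly as in the sketch below. For brevity it leans on MLlib's built-in BlockMatrix.multiply and BlockMatrix.subtract rather than the custom co-group based versions described above, and it assumes inverse, scalarMul and arrange helpers corresponding to the pseudocode.

    import org.apache.spark.mllib.linalg.distributed.BlockMatrix

    // One internal (else-branch) step of Algorithm 2. The six multiplications
    // and two subtractions per recursion level are visible in the call sequence.
    def strassenStep(a11: BlockMatrix, a12: BlockMatrix,
                     a21: BlockMatrix, a22: BlockMatrix,
                     inverse: BlockMatrix => BlockMatrix,
                     scalarMul: (BlockMatrix, Double) => BlockMatrix,
                     arrange: (BlockMatrix, BlockMatrix, BlockMatrix, BlockMatrix) => BlockMatrix): BlockMatrix = {
      val i   = inverse(a11)          // I   <- A11^-1  (recursive call or leaf inversion)
      val ii  = a21.multiply(i)       // II  <- A21 . I
      val iii = i.multiply(a12)       // III <- I . A12
      val iv  = a21.multiply(iii)     // IV  <- A21 . III
      val v   = iv.subtract(a22)      // V   <- IV - A22
      val vi  = inverse(v)            // VI  <- V^-1    (second recursive call)
      val c12 = iii.multiply(vi)      // C12 <- III . VI
      val c21 = vi.multiply(ii)       // C21 <- VI . II
      val vii = iii.multiply(c21)     // VII <- III . C21
      val c11 = i.subtract(vii)       // C11 <- I - VII
      val c22 = scalarMul(vi, -1.0)   // C22 <- -VI
      arrange(c11, c12, c21, c22)     // reassemble the full inverse
    }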

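The leaf case (the if branch of Algorithm 2) inverts a single block locally. A hypothetical locInverse along these lines, based on the JBlas library that the experiments in section 5 use for block-level operations, might be:

    import org.apache.spark.mllib.linalg.{Matrices, Matrix}
    import org.jblas.{DoubleMatrix, Solve}

    // Invert one (n/b) x (n/b) block locally: Solve.solve(A, I) returns A^-1
    // through LAPACK. Both MLlib matrices and JBlas use column-major storage.
    def locInverse(block: Matrix): Matrix = {
      val a   = new DoubleMatrix(block.numRows, block.numCols, block.toArray: _*)
      val inv = Solve.solve(a, DoubleMatrix.eye(block.numRows))  // solves A * X = I
      Matrices.dense(block.numRows, block.numCols, inv.toArray)
    }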
scalarMul method (as described in Algorithm 5) takes a BlockMatrix and returns another BlockMatrix using a map transformation. The map takes the blocks one by one and multiplies each element of the block with the scalar.

Algorithm 5: Spark Algorithm for multiplying a scalar to a distributed matrix

function scalarMul();
begin
    Input : BlockMatrix A, double scalar, int blockSize
    Output: BlockMatrix productMat
    ARDD ← A.toRDD()
    Map();
    begin
        Input : block of ARDD
        Output: block of productRDD
        product ← block.matrix.toDoubleMatrix
        block.matrix ← product.toMatrix
        return block
    end
    productMat ← productRDD.toBlockMatrix()
    return productMat
end

arrange method (as described in Algorithm 6) takes four sub-matrices of size 2^{n−1}, which represent the four quadrants of a full matrix of size 2^n, arranges them into the latter and returns it as a BlockMatrix. It consists of four maps, one for each separate BlockMatrix. Each map maps the block index to a different block index that provides the final position of the block in the result matrix.

Algorithm 6: Spark Algorithm for rearranging four sub-matrices into a single matrix

function arrange();
begin
    Input : BlockMatrix C11, BlockMatrix C12, BlockMatrix C21, BlockMatrix C22, int size, int blockSize
    Output: BlockMatrix arranged
    C11RDD ← C11.toRDD(); C12RDD ← C12.toRDD()
    C21RDD ← C21.toRDD(); C22RDD ← C22.toRDD()
    Map();
    begin
        Input : block of C12RDD
        Output: block of C1
        block.colIndex ← block.colIndex + size
        return block
    end
    Map();
    begin
        Input : block of C21RDD
        Output: block of C2
        block.rowIndex ← block.rowIndex + size
        return block
    end
    Map();
    begin
        Input : block of C22RDD
        Output: block of C3
        block.rowIndex ← block.rowIndex + size
        block.colIndex ← block.colIndex + size
        return block
    end
    unionRDD ← C11RDD.union(C1.union(C2.union(C3)))
    C ← unionRDD.toBlockMatrix()
    return C
end

4 PERFORMANCE ANALYSIS
In this section, we attempt to estimate the performance of the proposed approach and of the state-of-the-art approach using LU decomposition for distributed matrix inversion. In this work, we are interested in the wall clock running time of the algorithms for varying numbers of nodes, matrix sizes and other algorithmic parameters, e.g., partition / block sizes. This is because we are interested in the practical efficiency of our algorithm, which includes not only the time spent by the processes in the CPU, but also the time spent waiting for the CPU as well as on data communication during shuffle. The wall clock time depends on three independently analyzed quantities: the total computational complexity of the sub-tasks to be executed, the total communication complexity between executors of different sub-tasks on each of the nodes, and the parallelization factor of each of the sub-tasks, i.e. the total number of processor cores available.

Later, in section 5, we compare the theoretically derived estimates of wall clock time with the empirically observed ones, for validation. We consider only square matrices of dimension 2^p for all the derivations. The key input and tunable parameters for the algorithms are:

• n = 2^p : number of rows or columns in matrix A
• b : number of splits
• n/b = 2^q : block size in matrix A
• cores : total number of physical cores in the cluster
• i : current processing level of the algorithm in the recursion tree
• m : total number of levels of the recursion tree

Therefore,

• total number of blocks in matrix A or B = b²
• b = 2^{p−q}
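Purely as a toy illustration of how this per-level accounting is assembled (the exact per-method expressions are the ones derived in the lemmas below and summarized in Table 1, not the ones hard-coded here), a wall clock estimate sums, over the m recursion levels, the work of each method divided by its parallelization factor min(tasks, cores):

    import scala.math.{log, min, pow}

    // Generic wall-clock model: at level i the work of a method is divided by
    // its parallelization factor min(tasks(i), cores); the terms are summed.
    def wallClock(m: Int, cores: Int,
                  workAtLevel: Int => Double,
                  tasksAtLevel: Int => Double): Double =
      (0 until m).map(i => workAtLevel(i) / min(tasksAtLevel(i), cores.toDouble)).sum

    // Example plug-in (illustrative model only): 2^i nodes at level i, each
    // issuing 6 block multiplications of dimension n / 2^(i+1); parallelism is
    // the (b / 2^(i+1))^2 blocks available at that level.
    def multiplyTerm(n: Double, b: Double, cores: Int): Double = {
      val m = (log(b) / log(2)).toInt                      // m = log2(b) levels
      wallClock(m, cores,
        i => pow(2, i) * 6 * pow(n / pow(2, i + 1), 3),
        i => pow(b / pow(2, i + 1), 2))
    }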

Lemma 4.1. The proposed distributed block recursive Strassen's matrix inversion algorithm, or SPIN (presented in Algorithm 2), has a complexity, in terms of wall clock execution time, where n is the matrix dimension, b is the number of splits, and cores is the actual number of physical cores available in the cluster, of

    Cost_SPIN = n³/b² + (10b² − 6b) / min[b²/4^i, cores]
              + (b − 1)(9b² + n²(b + 1)) / (6b · min[b²/4^{i+1}, cores])
              + n²(b²n + b² − 2n) / (6b² · min[n²/4^{i+1}, cores])                 (1)

Proof. Before going into the details of the analysis, we give the performance analysis of the methods described in section 3.3. A summary of the independently analyzed quantities is given in Table 1.

There are two primary parts of the algorithm — the if part and the else part. The if part does the calculation at the leaf nodes of the recursion tree shown in Figure 1, while the else part does the computation for the internal nodes. It is clearly seen from the figure that, at level i, there are 2^i nodes and the leaf level contains 2^{p−q} nodes. There is only one transformation in the if part, which is a map. It calculates the inverse of a matrix block on a single node using a serial matrix inversion method. The size of each block is n/b and we need ≈ (n/b)³ time to perform each such inversion. Therefore, the computation cost to process all the leaf nodes is

    Comp_leafNode = 2^{p−q} × (n/b)³ = n³/b²                                       (2)

In SPIN, a leaf node processes one block on a single machine of the cluster. In spite of it being small enough to be accommodated in a single node, we do not collect it in the master node, to avoid the communication cost. Instead, we do a map which takes the only block of the RDD, does the calculation and returns the RDD again.

breakMat method takes a BlockMatrix and returns a PairRDD of tag and Block using a mapToPair transformation. If the method is executed for m levels, the computation cost of breakMat is

    Comp_breakMat = Σ_{i=0}^{m−1} 2^i × b²/4^i = 2b(b − 1)                         (3)

Note that the ith level contains 2^i nodes. Here each block is consumed in parallel, giving a parallelization factor of

    PF_breakMat = min[b²/4^i, cores]                                               (4)

The total numbers of blocks processed in filter and map are b²/4^i and b²/4^{i+1} at the ith level respectively. Consequently, the parallelization factors of the two are min[b²/4^i, cores] and min[b²/4^{i+1}, cores] respectively. Therefore, the computation cost for xy is

    Comp_xy = (Σ_{i=0}^{m−1} 2^i × b²/4^i) / min[b²/4^i, cores]
            + (Σ_{i=0}^{m−1} 2^i × b²/4^{i+1}) / min[b²/4^{i+1}, cores]
            = (8b² − 4b) / min[b²/4^i, cores] + (2b² − 2b) / min[b²/4^{i+1}, cores]  (5)

multiply method multiplies two BlockMatrices; summed over all recursion levels, its computation cost comes to

    Comp_multiply = n²(b²n + b² − 2n) / (6b²)                                      (6)

and the parallelization factor will be

    PF_multiply = min[n²/4^{i+1}, cores]                                           (7)

subtract method subtracts two BlockMatrices using a map transformation. There are two subtractions at each recursion level. Therefore,

    Comp_subtract = Σ_{i=0}^{m−1} 2^i × n²/4^{i+1} = n²(b − 1) / (2b)              (8)

and the parallelization factor will be

    PF_subtract = min[n²/4^{i+1}, cores]                                           (9)

scalarMul method (as described in Algorithm 5) takes a BlockMatrix and returns another BlockMatrix using a map transformation. The map takes the blocks one by one and multiplies each element of the block with the scalar. Therefore, the computation cost of scalarMul is

    Comp_scalarMul = Σ_{i=0}^{m−1} 2^i × b²/4^{i+1} = (b/2)(b − 1)                 (10)

Again, here each block is consumed in parallel, giving a parallelization factor of

    PF_scalarMul = min[b²/4^{i+1}, cores]                                          (11)

arrange method (as described in Algorithm 6) takes four sub-matrices of size 2^{n−1}, which represent the four quadrants of a full matrix of size 2^n, arranges them into the latter and returns it as a BlockMatrix. It consists of four maps, one for each separate BlockMatrix. Each map maps the block index to a different block index that provides the final position of the block in the result matrix. The computation cost and parallelization factor of these maps are the same as those of scalarMul, given in equations 10 and 11.

SPIN requires 4 xy method calls, 6 multiplications, and 2 subtractions for each recursion level. Summing these up gives equation 1. □

Lemma 4.2. The proposed distributed block recursive LU decomposition based matrix inversion algorithm (presented in Algorithms 5, 6, and 7 in [10]) has a complexity, in terms of wall clock execution time, where n is the matrix dimension, b is the number of splits, and cores is the actual number of physical cores available in the cluster, as below

Table 1: Summary of the cost analysis of LU and SPIN

Method                         | Computation Cost (LU)          | Computation Cost (SPIN) | Parallelization Factor (LU) | Parallelization Factor (SPIN)
leafNode                       | 9 × n³/b²                      | n³/b²                   | —                           | —
breakMat                       | (2/3)(b² − 3b + 2)             | 2b² − 2b                | min[b²/4^i, cores]          | min[b²/4^i, cores]
xy (filter)                    | (2/3)(b² − 3b + 2)             | 8b² − 4b                | min[b²/4^{i+1}, cores]      | min[b²/4^i, cores]
xy (map)                       | (1/6)(b² − 3b + 2)             | 2b² − 2b                | min[b²/4^{i+2}, cores]      | min[b²/4^{i+1}, cores]
multiply (large)               | (16n³/21b³)(b³ − 7b + 6)       | (n³/6b²)(b² − 1)        | min[n²/4^i, cores]          | min[n²/4^{i+1}, cores]
multiply communication (large) | 8n²(b² − 1)(8b² − 112)/(105b²) | n²(b² − 1)/(6b)         | min[b²/4^i, cores]          | min[b²/4^{i+1}, cores]
multiply (small)               | (8n³/42b³)(b³ − 7b + 6)        | —                       | min[n²/4^{i+1}, cores]      | —
multiply communication (small) | n²(b² − 1)(8b² − 112)/(105b²)  | —                       | min[b²/4^{i+1}, cores]      | —
subtract                       | (2n²/3b²)(b² − 3b + 2)         | (n²/2b)(b − 1)          | min[n²/4^i, cores]          | min[n²/4^{i+1}, cores]
scalarMul                      | (4/3)(b² − 3b + 2)             | (b/2)(b − 1)            | min[b²/4^i, cores]          | min[b²/4^{i+1}, cores]
arrange                        | —                              | (b/2)(b − 1)            | —                           | min[b²/4^{i+1}, cores]
Additional Cost                | 7 × (n/2)³                     | —                       | min[n²/4, cores]            | —

    Cost_LU = 9n³/b² + (b − 1)[210b²(b − 2) + 64n²(b + 1)(b² − 14)] / (105b² · min[b²/4^i, cores])
            + (b − 1)[70b²(b − 2) + 8n²(b + 1)(b² − 14)] / (105b² · min[b²/4^{i+1}, cores])
            + n²(b − 1)(b − 2) / (105b³ · min[b²/4^{i+2}, cores])
            + 2n²(b − 1)[8n(b² + b + 6) + 7b(b − 2)] / (21b³ · min[n²/4^i, cores])
            + 8n³(b − 1)(b² + b − 6) / (42b³ · min[n²/4^{i+1}, cores])
            + 7n³ / (8 · min[n²/4, cores])                                         (12)

Proof. Liu et al. in [10] have described several algorithms for distributed matrix inversion using LU decomposition. We refer to the most optimized one (stated as Algorithms 5, 6 and 7 in that paper) for the performance analysis. The core computation of the algorithm is done with 1) a call to a recursive method LU, which decomposes the input matrix recursively until the leaf nodes of the tree, where the size of the matrix reaches the block size, and 2) the computation after the LU decomposition. The matrix inversion algorithm performs 7 additional multiplications (as given in Algorithm 5 in [10]) of matrices of dimension n/2. We call this the Additional Cost, and it can be obtained as

    Comp_AdditionalCost = 7 × (n/2)³ / min[n²/4, cores]                            (13)

There are two primary parts of the LU method — the if part and the else part. The if part does the LU decomposition at the leaf nodes of the recursion tree, while the else part does it for the internal nodes. The if part requires 2 LU decompositions, 4 matrix inversions and 3 matrix multiplications, and there are 2^{p−q} leaf nodes in the recursion tree. Each of these operations requires O((n/b)³) time for a matrix of dimension n/b. Therefore, the total cost of the if part is

    Comp_leafNode = 9 × 2^{p−q} × (n/b)³ = 9 × n³/b²                               (14)

The else part requires 4 multiply, 1 subtraction and 2 calls to the getLU method. The getLU method composes the LU of a matrix by taking 9 matrices of dimension 2^k and arranging them to return 3 matrices of size 2^{k+1}. It requires 4 multiply and 2 scalarMul calls on matrices of dimension 2^k.

The recursion scheme of LU decomposition is a little different from that of SPIN. Here the number of LU calls at level i is 2^i − 1 instead of the 2^i of SPIN. The computation and communication costs for the methods (summarized in Table 1) can be summed up to get equation 12. □

5 EXPERIMENTS
In this section, we perform experiments to evaluate the execution efficiency of our implementation SPIN, comparing it with the distributed LU decomposition based inversion approach (referred to as LU from now on), and the scalability of the algorithm compared to ideal scalability. First, we select the fastest wall clock execution time among the different partition sizes for each approach and compare them.

Table 2: Summary of Test setup components specifications

Component Name   | Component Size   | Specification
Processor        | 2                | Intel Xeon 2.60 GHz
Core             | 6 per processor  | NA
Physical Memory  | 132 GB           | NA
Ethernet         | 14 Gb/s          | Infini Band
OS               | NA               | CentOS 5
File System      | NA               | Ext3
Apache Spark     | NA               | 2.1.0
Apache Hadoop    | NA               | 2.6.0
Java             | NA               | 1.7.0 update 79

[Figure 2: Fastest running time of LU and Strassen's based inversion among different block sizes — x-axis: Matrix Size; y-axis: Execution Time in Sec.; curves: LU, SPIN, O(n³), O(n^3.2)]

Second, we conduct a series of experiments to individually evaluate the effect of partition size and matrix size on each competing approach. At last, we evaluate the scalability of our implementation.

5.1 Test Setup
All the experiments are carried out on a dedicated cluster of 3 nodes. Software and hardware specifications are summarized in Table 2. Here NA means Not Applicable.

For block level multiplications both the implementations use JBlas [9], a linear algebra library for Java based on BLAS and LAPACK. We have tested the algorithms on matrices with increasing cardinality from (16 × 16) to (16384 × 16384). All of these test matrices have been generated randomly using the Java Random class.

Resource Utilization Plan. While running the jobs in the cluster, we customize three parameters — the number of executors, the executor memory and the executor cores. We wanted a fair comparison among the competing approaches and therefore we ensured that jobs did not experience thrashing and that in none of the cases did tasks fail so that jobs had to be restarted. For this reason, we restricted ourselves to parameter values which provide good utilization of cluster resources while mitigating the chance of task failures. By experimentation we found that keeping the executor memory at 50 GB ensures successful execution of jobs without "out of memory" errors or any task failures for all the competing approaches. This includes the small amount of overhead added to the full request to YARN for each executor, which is equal to 3.5 GB. Therefore, the executor memory is 46.5 GB. Though the physical memory of each node is 132 GB, we keep only 100 GB as YARN resource allocated memory for each node. Therefore, the total physical memory for job execution is 100 GB, resulting in 2 executors per node and a total of 6 executors. We reserve 1 core for the operating system and hadoop daemons. Therefore, the available total cores per node is 11. This leaves 5 cores for each executor. We used these values of the run time resource parameters in all the experiments except the scalability test, where we have tested the approach with a varied number of executors.

5.2 Comparison with state-of-the-art distributed systems
In this section, we compare the performance of SPIN with LU. We report the running time of the competing approaches with increasing matrix dimension in Figure 2. We take the best wall clock time (fastest) among all the running times taken for different block sizes. It can be seen that SPIN takes the minimum amount of time for all matrix dimensions. Also, as expected, the wall clock execution time increases with the matrix dimension, non-linearly (roughly as O(n³)). Also, the gap in wall clock execution time between SPIN and LU increases monotonically with the input matrix dimension. As we shall see in the next section, both LU and SPIN follow a U shaped curve as a function of block size, hence allowing us to report the minimum wall clock execution time over all block sizes.

5.3 Variation with partition size
In this experiment, we examine the performance of SPIN and LU with increasing partition size for each matrix size. We report the wall clock execution time of the approaches when the partition size is increased within a particular matrix size. For each matrix size (from (4096 × 4096) to (16384 × 16384)) we increase the partition size until we get an intuitive change in the results, as shown in Figure 3. It can be seen that both LU and SPIN follow a U shaped curve. However, SPIN outperforms LU when they have the same partition size, for all the matrix sizes. The reasons for this are manifold. First of all, LU requires 9 times more O((n/b)³) operations compared to a single operation of SPIN. For small partition sizes, where leafNode dominates the overall wall clock execution time, this cost is responsible for LU's slower performance.

Additionally, when the partition size increases, the number of recursion levels also increases and consequently the cost of the multiply method, which is the costliest method call, increases. Though there is a difference between the numbers of recursion levels for any partition size, the additional matrix multiplication cost (as shown in Table 1) is enough to slow down LU's performance.

[Figure 3: three panels — (a) 4096 × 4096, (b) 8192 × 8192, (c) 16384 × 16384; x-axis: Partition Size, y-axis: Execution Time in Sec.; curves: LU, SPIN]

Figure 3: Comparing running time of LU and SPIN for matrix size (4096 × 4096), (8192 × 8192), (16384 × 16384) for increasing partition size

5.4 Comparison between theoretical and experimental result
In this experiment, we compare the theoretical cost of SPIN with the experimental wall clock execution time to validate our theoretical cost analysis. Figure 4 shows the comparison for three matrix sizes (from (4096 × 4096) to (16384 × 16384)) and, for each matrix size, with increasing partition size.

Table 3: Experimental results of wall clock execution time of different methods in SPIN (the unit of execution time is millisecond)

Method    | b=2   | b=4   | b=8   | b=16
leafNode  | 43504 | 11550 | 5040  | 3980
breakMat  | 178   | 441   | 901   | 1764
xy        | 2913  | 1353  | 693   | 309
multiply  | 7836  | 13116 | 23256 | 37968
subtract  | 1412  | 1854  | 2820  | 5592
scalar    | 333   | 728   | 1308  | 2450
arrange   | 307   | 685   | 1510  | 3074
Total     | 56483 | 29727 | 35528 | 55137

As expected, both theoretical and experimental wall clock execution times show a U shaped curve with increasing partition size. The reason is that, for smaller partition sizes, the block size becomes very large for large matrix sizes. As a result, the single node matrix inversion takes most of the execution time and subdues the effect of the matrix multiplication execution time, which is processed distributedly. That is why we find a large execution time at the beginning, which is also depicted in Table 3, where the experimental wall clock execution time is tabulated for the different methods used in the algorithm for a matrix of dimension 4096. It is seen that for b = 2, the leafNode cost is far more than that of the matrix multiply method. Later, when the partition size increases further, the leaf node cost drops sharply, as the cost depends on n³/b², which decreases with the square of the partition size. On the other hand, the number of multiplications becomes large for the increased number of recursion levels, and this becomes the effective cost which subdues the effect of the leafNode cost. As shown in Table 3, from b = 8 onwards the multiply cost becomes more and more dominant, resulting in a further increase in wall clock execution time.

5.5 Scalability
In this section, we investigate the scalability of SPIN. For this, we generate three test cases, each containing a different set of two matrices of sizes equal to (4096 × 4096), (8192 × 8192) and (16384 × 16384). The running time vs. the number of Spark executors for these 3 pairs of matrices is shown in Figure 5. The ideal scalability line (i.e. T(n) = T(1)/n, where n is the number of executors) has been over-plotted on this figure in order to demonstrate the scalability of our algorithm. We can see that SPIN has good scalability, with a minor deviation from ideal scalability when the size of the matrix is low (i.e. for (4096 × 4096) and (8192 × 8192)).

6 CONCLUSION
In this paper, we have focused on the problem of distributed matrix inversion of large matrices using the Spark framework. To make large scale matrix inversion faster, we have implemented Strassen's matrix inversion technique, which requires six multiplications in each recursion step. We have given the detailed algorithm, called SPIN, of the implementation and also presented the details of the cost analysis along with the baseline approach using LU decomposition. By doing that, we discovered that the primary bottleneck of the inversion algorithm is matrix multiplication and that SPIN is faster as it requires a smaller number of multiplications compared to the LU based approach.

We have also performed extensive experiments on the wall clock execution time of both approaches for increasing partition size as well as increasing matrix size. Results showed that SPIN outperformed LU for all the partition and matrix sizes and also that the difference increases as we increase the matrix size. We also showed the resemblance between the theoretical and experimental findings for SPIN, which validated our cost analysis. At last we showed that SPIN has good scalability with increasing matrix size.

[Figure 4: three panels — (a) 4096 × 4096, (b) 8192 × 8192, (c) 16384 × 16384; x-axis: Partition Size, y-axis: Execution Time in Sec.; curves: Theoretical, Experimental]

Figure 4: Comparing theoretical and experimental running time of SPIN for matrix size (4096×4096), (8192×8192), (16384×16384) for increasing partition size

[Figure 5: three panels — (a) 4096 × 4096, (b) 8192 × 8192, (c) 16384 × 16384; x-axis: Total Number of Physical Cores, y-axis: Execution Time in Sec.; curves: Ideal, SPIN]

Figure 5: The scalability of SPIN, in comparison with ideal scalability (blue line), on matrix (4096 × 4096), (8192 × 8192) and (16384 × 16384)

REFERENCES
[1] Emmanuel Agullo, Henricus Bouwmeester, Jack Dongarra, Jakub Kurzak, Julien Langou, and Lee Rosenberg. 2010. Towards an Efficient Tile Matrix Inversion of Symmetric Positive Definite Matrices on Multicore Architectures. In VECPAR, Vol. 10. Springer, 129–138.
[2] Ali Al Essa and Miad Faezipour. 2017. MapReduce and Spark-Based Analytic Framework Using Social Media Data for Earlier Flu Outbreak Detection. In Industrial Conference on Data Mining. Springer, 246–257.
[3] Steven C Althoen and Renate Mclaughlin. 1987. Gauss-Jordan reduction: A brief history. The American Mathematical Monthly 94, 2 (1987), 130–142.
[4] Paolo Bientinesi, Brian Gunter, and Robert A Geijn. 2008. Families of algorithms related to the inversion of a symmetric positive definite matrix. ACM Transactions on Mathematical Software (TOMS) 35, 1 (2008), 3.
[5] Adrian Burian, Jarmo Takala, and Mikko Ylinen. 2003. A fixed-point implementation of matrix inversion using Cholesky decomposition. In Circuits and Systems, 2003 IEEE 46th Midwest Symposium on, Vol. 3. IEEE, 1431–1434.
[6] Pablo Ezzatti, Enrique S Quintana-Orti, and Alfredo Remon. 2011. High performance matrix inversion on a multi-core platform with several GPUs. In Parallel, Distributed and Network-Based Processing (PDP), 2011 19th Euromicro International Conference on. IEEE, 87–93.
[7] Joseph E Gonzalez, Reynold S Xin, Ankur Dave, Daniel Crankshaw, Michael J Franklin, and Ion Stoica. 2014. GraphX: Graph Processing in a Distributed Dataflow Framework. In OSDI, Vol. 14. 599–613.
[8] KK Lau, MJ Kumar, and R Venkatesh. 1996. Parallel matrix inversion techniques. In Algorithms & Architectures for Parallel Processing, 1996. ICAPP 96. 1996 IEEE Second International Conference on. IEEE, 515–521.
[9] Linear Algebra for Java 2017. Linear Algebra for Java. http://jblas.org/. (2017). [Online; accessed 30-July-2017].
[10] Jun Liu, Yang Liang, and Nirwan Ansari. 2016. Spark-based large-scale matrix inversion for big data processing. IEEE Access 4 (2016), 2166–2176.
[11] Xiangrui Meng, Joseph Bradley, Burak Yavuz, Evan Sparks, Shivaram Venkataraman, Davies Liu, Jeremy Freeman, DB Tsai, Manish Amde, Sean Owen, et al. 2016. MLlib: Machine learning in Apache Spark. The Journal of Machine Learning Research 17, 1 (2016), 1235–1241.
[12] Rahul Palamuttam, Renato Marroquín Mogrovejo, Chris Mattmann, Brian Wilson, Kim Whitehall, Rishi Verma, Lewis McGibbney, and Paul Ramirez. 2015. SciSpark: Applying in-memory distributed computing to weather event detection and tracking. In Big Data (Big Data), 2015 IEEE International Conference on. IEEE, 2020–2026.
[13] William H Press. 2007. Numerical Recipes 3rd Edition: The Art of Scientific Computing. Cambridge University Press.
[14] Zhengping Qian, Xiuwei Chen, Nanxi Kang, Mingcheng Chen, Yuan Yu, Thomas Moscibroda, and Zheng Zhang. 2012. MadLINQ: large-scale distributed matrix computation for the cloud. In Proceedings of the 7th ACM European Conference on Computer Systems. ACM, 197–210.
[15] Girish Sharma, Abhishek Agarwala, and Baidurya Bhattacharya. 2013. A fast parallel Gauss Jordan algorithm for matrix inversion using CUDA. Computers & Structures 128 (2013), 31–37.
[16] Volker Strassen. 1969. Gaussian elimination is not optimal. Numerische Mathematik 13, 4 (1969), 354–356.
[17] Jingen Xiang, Huangdong Meng, and Ashraf Aboulnaga. 2014. Scalable matrix inversion using MapReduce. In Proceedings of the 23rd International Symposium on High-performance Parallel and Distributed Computing. ACM, 177–190.
[18] Kaiqi Yang, Yubai Li, and Yijia Xia. 2013. A parallel method for matrix inversion based on Gauss-Jordan algorithm. Journal of Computational Information Systems 9, 14 (2013), 5561–5567.
[19] Matei Zaharia, Mosharaf Chowdhury, Michael J Franklin, Scott Shenker, and Ion Stoica. 2010. Spark: Cluster computing with working sets. HotCloud 10, 10-10 (2010), 95.