arXiv:1801.04723v1 [cs.DC] 15 Jan 2018

SPIN: A Fast and Scalable Matrix Inversion Method in Apache Spark

Chandan Misra ([email protected]), Sourangshu Bhattacharya ([email protected]), Swastik Haldar, and Soumya K. Ghosh ([email protected])
Indian Institute of Technology Kharagpur, West Bengal, India

ICDCN '18, January 4–7, 2018, Varanasi, India. DOI: 10.1145/3154273.3154300

ABSTRACT
The growth of big data in domains such as Earth Sciences, Social Networks, Physical Sciences, etc. has led to an immense need for efficient and scalable linear algebra operations, e.g. matrix inversion. Existing methods for efficient and distributed matrix inversion using big data platforms rely on LU decomposition based block-recursive algorithms. However, these are complex algorithms and require a lot of side calculations, e.g. matrix multiplication, at various levels of recursion. In this paper, we propose a different scheme based on Strassen's matrix inversion algorithm (mentioned in Strassen's original 1969 paper), which uses far fewer operations at each level of recursion. We implement the proposed algorithm and, through extensive experimentation, show that it is more efficient than the state of the art methods. Furthermore, we provide a detailed theoretical analysis of the proposed algorithm and derive theoretical running times which match closely with the empirically observed wall clock running times, thus explaining the U-shaped behaviour w.r.t. block-sizes.

CCS CONCEPTS
• Computing methodologies → MapReduce algorithms;

KEYWORDS
Linear Algebra, Matrix Inversion, Spark, Strassen's Algorithm

1 INTRODUCTION
Dense matrix inversion is a basic procedure used by many applications in Data Science, Earth Science and Scientific Computing, and has become an essential component of many such systems. It is an expensive operation, both in terms of space and computational complexity, and hence consumes a large fraction of resources in many of the workloads. In the big data era, many of these applications have to work on huge matrices, possibly stored over multiple servers, and thus consume huge amounts of computational resources for matrix inversion. Hence, designing efficient large scale distributed matrix inversion algorithms is an important challenge.

There are a variety of existing inversion algorithms, e.g. methods based on QR decomposition [13], LU decomposition [13], Cholesky decomposition [5], Gaussian Elimination [3], etc. Most of them require O(n³) time (where n denotes the order of the matrix), and the main speed-ups in shared memory settings come from architecture specific optimizations (reviewed in section 2). Surprisingly, there are not many studies on distributed matrix inversion using big data frameworks, where jobs could be distributed over machines with a diverse set of architectures.

Since its release in 2012, Spark [19] has been adopted as a dominant solution for scalable and fault-tolerant processing of huge datasets in many applications, e.g., machine learning [11], graph processing [7], climate science [12], social media analytics [2], etc. Spark has gained its popularity for its in-memory interactive and iterative processing ability, which runs distributed data processing faster than Hadoop MapReduce. Its close integration with Scala / Java and the flexible structure of RDDs allow distributed recursive algorithms to be implemented efficiently, without compromising on scalability and fault-tolerance. Hence, in this paper we focus on Spark for the implementation of large scale distributed matrix inversion.

LU decomposition is the most widely used technique for distributed matrix inversion, possibly due to its efficient block-recursive structure. Xiang et al. [17] proposed a Hadoop based implementation relying on computing the LU decomposition of a matrix and inverting it from the decomposition, and discussed many Hadoop specific optimizations. Recently, Liu et al. [10] proposed several optimized block-recursive inversion algorithms based on LU decomposition on Spark. In the block recursive approach [10], the computation is broken down into a pipeline of subtasks that are computed as Spark tasks on a cluster. The costliest part of the computation is matrix multiplication, and the authors give a couple of optimized algorithms to reduce the number of multiplications. However, in spite of being optimized, the implementation requires 9 O(n³) operations on the leaf nodes of the recursion tree, 12 multiplications at each recursion level of the LU decomposition, and an additional 7 multiplications after the LU decomposition of the matrix, which makes the implementation perform slower.

In this paper, we use a much simpler and less exploited algorithm, proposed by Strassen in his 1969 multiplication paper [16]. The algorithm follows a similar block-recursion structure as LU decomposition, yet provides a simpler approach to matrix inversion. This approach involves no additional multiplications at the leaf level of recursion, and requires only 6 multiplications at intermediate levels. We propose and implement a distributed matrix inversion algorithm based on Strassen's original serial inversion scheme. We also provide a detailed analysis of wall clock time for the proposed algorithm, thus revealing the 'U'-shaped behaviour with respect to block size. Experimentally, we show comprehensively that the proposed approach is superior to the LU decomposition based approaches for all corresponding block sizes, and hence overall. We also demonstrate that our analysis of the proposed approach matches the empirically observed wall clock time, and is similar to the ideal scaling behaviour. In summary:

(1) We propose and implement a novel approach (SPIN) to distributed matrix inversion, based on an algorithm proposed by Strassen [16].
(2) We provide a theoretical analysis of our proposed algorithm which matches closely with the empirically observed wall clock time.
(3) Through extensive experimentation, we show that the proposed algorithm is superior to the LU decomposition based approach.

2 RELATED WORK
The literature on parallel and distributed matrix inversion can be divided broadly into three categories: 1) HPC based approaches, 2) GPU based approaches and 3) Hadoop and Spark based approaches. Here, we briefly review them.

2.1 HPC based approach
LINPACK, LAPACK and ScaLAPACK are some of the most robust software packages that support matrix inversion. LINPACK was written in Fortran and used on shared-memory vector computers. It has been superseded by LAPACK, which runs more efficiently on modern cache-based architectures. LAPACK has also been extended to run on distributed-memory MIMD parallel computers in the ScaLAPACK package. However, these packages are based on architectures and frameworks which are not fault tolerant, and MapReduce based matrix inversion is more scalable than ScaLAPACK, as shown in [17]. Lau et al. [8] presented two algorithms for inverting sparse, symmetric and positive definite matrices on SIMD and MIMD machines respectively. The algorithms exploit the sparseness of the matrix to achieve higher performance. Bientinesi et al. [4] presented a parallel implementation of symmetric positive definite matrix inversion on three architectures — sequential processors, symmetric multiprocessors and distributed memory parallel computers — using the Cholesky factorization technique. Yang et al. [18] presented a parallel algorithm for matrix inversion based on Gauss-Jordan elimination with partial pivoting. It used an efficient mechanism to reduce the communication overhead and also provides good scalability. Bailey et al. presented techniques to compute the inverse of a matrix using an algorithm suggested by Strassen in [16]. It uses the Newton iteration method to increase its stability while preserving parallelism. Most of the above works are based on specialized matrices and are not meant for general matrices. In this paper, we concentrate on any kind of square positive definite and invertible matrices which are distributed on large clusters, for which the above algorithms are not suitable.

2.2 Multicore and GPU based approach
In order to fully exploit the multicore architecture, tile algorithms have been developed. Agullo et al. [1] developed such a tile algorithm to invert a symmetric positive definite matrix using Cholesky decomposition. Sharma et al. [15] presented a modified Gauss-Jordan algorithm for matrix inversion on a CUDA based GPU platform and studied the performance metrics of the algorithm. Ezzatti et al. [6] presented several algorithms for computing the matrix inverse based on the Gauss-Jordan algorithm on a hybrid platform consisting of multicore processors connected to several GPUs. Although the above works have demonstrated that GPUs can considerably reduce the computational time of matrix inversion, they are non-scalable centralized methods and need special hardware.

2.3 MapReduce based approach
MadLINQ [14] offered a highly scalable, efficient and fault tolerant matrix computation system with a unified programming model which integrates with DryadLINQ, a data parallel computing system. However, it does not mention any inversion algorithm explicitly. Xiang et al. [17] implemented the first LU decomposition based matrix inversion in the Hadoop MapReduce framework. However, it suffers from typical Hadoop shortcomings like redundant data communication between map and reduce phases and the inability to preserve the distributed recursion structure. Liu et al. [10] provide the same LU based distributed inversion on the Spark platform. It optimizes the algorithm by eliminating redundant matrix multiplications to achieve faster execution. Almost all the MapReduce based approaches rely on LU decomposition to invert a matrix. The reason is that it partitions the computation in a way suitable for MapReduce based systems. In this paper, we show that matrix inversion can be performed efficiently in a distributed environment like Spark by implementing Strassen's scheme, which requires a smaller number of multiplications than the earlier approaches, providing faster execution.

3 ALGORITHM DESIGN
In this section, we discuss the implementation of SPIN on the Spark framework. First, we describe the original Strassen's inversion algorithm [16] for serial matrix inversion in section 3.1. Next, in section 3.2, we describe the BlockMatrix data structure from MLLib which is used in our algorithm to distribute the large input matrix into the distributed file system. Finally, section 3.3 describes the distributed inversion algorithm and its implementation strategy using BlockMatrix.

3.1 Strassen's Algorithm for Matrix Inversion
Strassen's matrix inversion algorithm appeared in the same paper in which the well known Strassen's matrix multiplication was published. This algorithm can be described as follows. Let the two matrices A and C = A^{-1} be split into half-sized sub-matrices:

    [ A11  A12 ]^{-1}   [ C11  C12 ]
    [ A21  A22 ]      = [ C21  C22 ]

Then the result C can be calculated as shown in Algorithm 1. Intuitively, the steps involved in the algorithm are difficult to perform in parallel. However, for input matrices which are too large to fit into the memory of a single server, each such step is required to be processed distributively. These steps include breaking a matrix into four equal size sub-matrices, multiplication and subtraction of two matrices, multiplying a matrix by a scalar, and arranging four half-sized sub-matrices into a full matrix. All these steps are done by splitting the matrix into blocks, which act as the execution unit of the Spark job. A brief description of the block data structure is given below.

Algorithm 1: Strassen's Serial Inversion Algorithm

function Inverse();
Input : Matrix A (input matrix of size n × n), int threshold
Output: Matrix C (inverse of matrix A)
begin
    if n = threshold then
        invert A in any approach (e.g., LU, QR, SVD);
    else
        Compute A11, A12, A21, A22 by computing n = n/2;
        I   ← A11^{-1}
        II  ← A21 . I
        III ← I . A12
        IV  ← A21 . III
        V   ← IV − A22
        VI  ← V^{-1}
        C12 ← III . VI
        C21 ← VI . II
        VII ← III . C21
        C11 ← I − VII
        C22 ← −VI
    end
    return C
end

3.2 Data Structure
In order to distribute the matrix in the HDFS (Hadoop Distributed File System), we create a distributed matrix called BlockMatrix, which is basically an RDD of MatrixBlocks spread over the cluster. Distributing the matrix as a collection of blocks makes them easy to process in parallel and follows the divide and conquer approach. A MatrixBlock is a block of the matrix represented as a tuple ((rowIndex, columnIndex), Matrix). Here, rowIndex and columnIndex are the row and column index of a block of the matrix. Matrix refers to a one-dimensional array representing the elements of the matrix arranged in a column major fashion.

[Figure 1: Recursion tree for Algorithm 2 — the upper-left sub-matrix A11 and the intermediate matrix V are inverted at each level, recursively down to the leaves.]

3.3 Distributed Block-recursive Matrix Inversion Algorithm
The distributed block-recursive algorithm can be visualized as in Figure 1, where the upper left sub-matrix is divided recursively until it can be inverted serially on a single machine. After the leaf node inversion, the inverted matrix is used to compute intermediate matrices, where each step is done distributively. Another recursive call is performed for matrix VI until a leaf node is reached. Like A11, it is also inverted on a single node when the leaf node is reached. The core inversion algorithm (described in Algorithm 2) takes a matrix (say A) represented as a BlockMatrix as input, as shown in Figure 1. The core computation performed by the algorithm is based on six distributed methods, which are as follows:

• breakMat: Breaks a matrix into four equal sized sub-matrices.
• xy: Returns one of the four sub-matrices after the breaking, according to the index specified by x and y.
• multiply: Multiplies two BlockMatrix.
• subtract: Subtracts two BlockMatrix.
• scalarMul: Multiplies a scalar with a BlockMatrix.
• arrange: Arranges four equal quarter BlockMatrices into a single full BlockMatrix.

Below we describe the methods in a little more detail and also provide the algorithm for each.

breakMat method breaks a matrix into four sub-matrices but does not return the four sub-matrices to the caller. It just prepares the input matrix in a form which helps filter out each part easily. As described in Algorithm 3, it takes a BlockMatrix and returns a PairRDD of tag and Block using a mapToPair transformation. First, the BlockMatrix is converted into an RDD of MatrixBlocks. Then, each MatrixBlock of the RDD is mapped to a tuple of (tag, MatrixBlock), resulting in a pairRDD of such tuples. Inside the mapToPair transformation, we carefully tag each MatrixBlock according to which quadrant it belongs to.

xy method is a generic method signature for the four methods used for accessing one of the four sub-matrices of size 2^{n−1} from a matrix of size 2^n. Each method consists of two transformations — filter and map. filter takes the matrix as a pairRDD of (tag, MatrixBlock) tuples, which was the output of the breakMat method, and filters the appropriate portion against the tag associated with the MatrixBlock. Then it converts the pairRDD into an RDD using the map transformation.
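The distributed implementation itself is not reproduced in the paper; the following is only a minimal Scala sketch of the tagging and filtering just described, written against MLlib's BlockMatrix (whose blocks field is an RDD of ((rowIndex, colIndex), Matrix) pairs). The method names and quadrant tags follow the pseudocode of Algorithms 3 and 4, and size is assumed to be the already-halved number of block-rows per quadrant, as in Algorithm 2.

    import org.apache.spark.mllib.linalg.Matrix
    import org.apache.spark.mllib.linalg.distributed.BlockMatrix
    import org.apache.spark.rdd.RDD

    // breakMat (cf. Algorithm 3): tag every block of A with the quadrant it
    // falls in and re-index it relative to that quadrant.
    def breakMat(A: BlockMatrix, size: Int): RDD[(String, ((Int, Int), Matrix))] =
      A.blocks.map { case ((ri, ci), mat) =>
        val tag = (ri / size, ci / size) match {
          case (0, 0) => "A11"
          case (0, 1) => "A12"
          case (1, 0) => "A21"
          case _      => "A22"
        }
        (tag, ((ri % size, ci % size), mat))
      }

    // xy (cf. Algorithm 4): keep only the blocks carrying the requested tag and
    // wrap them as a stand-alone BlockMatrix; blockSize is the per-block dimension.
    def xy(broken: RDD[(String, ((Int, Int), Matrix))], tag: String,
           blockSize: Int): BlockMatrix =
      new BlockMatrix(broken.filter(_._1 == tag).map(_._2), blockSize, blockSize)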

Algorithm 2: Spark Algorithm for Strassen's Inversion Scheme

function Inverse();
Input : BlockMatrix A, int size, int blockSize
Output: BlockMatrix AInv
size = Size of matrix A or B; blockSize = Size of a single matrix block;
begin
    n ← size / blockSize
    if n = 1 then
        RDD<Block> invA ← A.toRDD()
        Map();
        begin
            Input : Block block
            Output: Block block
            block.matrix ← locInverse(block.matrix)
            return block
        end
        blockAInv ← invA.toBlockMatrix()
        return blockAInv
    else
        size ← size/2
        pairRDD ← breakMat(A, size)
        A11 ← 11(pairRDD, blockSize); A12 ← 12(pairRDD, blockSize)
        A21 ← 21(pairRDD, blockSize); A22 ← 22(pairRDD, blockSize)
        I   ← Inverse(A11, size, blockSize)
        II  ← multiply(A21, I)
        III ← multiply(I, A12)
        IV  ← multiply(A21, III)
        V   ← subtract(IV, A22)
        VI  ← Inverse(V, size, blockSize)
        C12 ← multiply(III, VI)
        C21 ← multiply(VI, II)
        VII ← multiply(III, C21)
        C11 ← subtract(I, VII)
        C22 ← scalarMul(VI, −1, blockSize)
        C   ← arrange(C11, C12, C21, C22, size, blockSize)
        return C
    end
end

Algorithm 3: Spark Algorithm for breaking a BlockMatrix

function breakMat();
begin
    Input : BlockMatrix A, int size
    Output: PairRDD brokenRDD
    ARDD ← A.toRDD()
    MapToPair();
    begin
        Input : block of ARDD
        Output: tuple of brokenRDD
        ri ← block.rowIndex; ci ← block.colIndex
        if ri/size = 0 & ci/size = 0 then tag ← "A11"
        else if ri/size = 0 & ci/size = 1 then tag ← "A12"
        else if ri/size = 1 & ci/size = 0 then tag ← "A21"
        else tag ← "A22"
        end
        block.rowIndex ← ri % size; block.colIndex ← ci % size
        return Tuple2(tag, block)
    end
    return brokenRDD
end

Algorithm 4: Spark Algorithm for filtering one sub-matrix out of a broken BlockMatrix

function xy();
begin
    Input : PairRDD brokenRDD
    Output: BlockMatrix xy
    filter();
    begin
        Input : PairRDD brokenRDD
        Output: PairRDD filteredRDD
        return brokenRDD.tag = "Axy"
    end
    map();
    begin
        Input : PairRDD filteredRDD
        Output: RDD rdd
        return filteredRDD.block
    end
    xy ← rdd.toBlockMatrix()
    return xy
end

multiply method multiplies two input sub-matrices and returns another sub-matrix of BlockMatrix type. The multiply method in our algorithm uses the naive block matrix multiplication approach, which replicates the blocks of the matrices and groups the blocks to be multiplied together in the same node. It uses co-group to reduce the communication cost.

subtract method subtracts two BlockMatrix and returns the result as a BlockMatrix.
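As an illustration only (not the paper's published implementation), the else branch of Algorithm 2 composes these primitives roughly as in the sketch below. For brevity it leans on MLlib's built-in BlockMatrix.multiply and BlockMatrix.subtract rather than the custom co-group based versions described above, and it assumes inverse, scalarMul and arrange helpers corresponding to the pseudocode.

    import org.apache.spark.mllib.linalg.distributed.BlockMatrix

    // One internal (else-branch) step of Algorithm 2. The six multiplications
    // and two subtractions per recursion level are visible in the call sequence.
    def strassenStep(a11: BlockMatrix, a12: BlockMatrix,
                     a21: BlockMatrix, a22: BlockMatrix,
                     inverse: BlockMatrix => BlockMatrix,
                     scalarMul: (BlockMatrix, Double) => BlockMatrix,
                     arrange: (BlockMatrix, BlockMatrix, BlockMatrix, BlockMatrix) => BlockMatrix): BlockMatrix = {
      val i   = inverse(a11)          // I   <- A11^-1  (recursive call or leaf inversion)
      val ii  = a21.multiply(i)       // II  <- A21 . I
      val iii = i.multiply(a12)       // III <- I . A12
      val iv  = a21.multiply(iii)     // IV  <- A21 . III
      val v   = iv.subtract(a22)      // V   <- IV - A22
      val vi  = inverse(v)            // VI  <- V^-1    (second recursive call)
      val c12 = iii.multiply(vi)      // C12 <- III . VI
      val c21 = vi.multiply(ii)       // C21 <- VI . II
      val vii = iii.multiply(c21)     // VII <- III . C21
      val c11 = i.subtract(vii)       // C11 <- I - VII
      val c22 = scalarMul(vi, -1.0)   // C22 <- -VI
      arrange(c11, c12, c21, c22)     // reassemble the full inverse
    }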

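The leaf case (the if branch of Algorithm 2) inverts a single block locally. A hypothetical locInverse along these lines, based on the JBlas library that the experiments in section 5 use for block-level operations, might be:

    import org.apache.spark.mllib.linalg.{Matrices, Matrix}
    import org.jblas.{DoubleMatrix, Solve}

    // Invert one (n/b) x (n/b) block locally: Solve.solve(A, I) returns A^-1
    // through LAPACK. Both MLlib matrices and JBlas use column-major storage.
    def locInverse(block: Matrix): Matrix = {
      val a   = new DoubleMatrix(block.numRows, block.numCols, block.toArray: _*)
      val inv = Solve.solve(a, DoubleMatrix.eye(block.numRows))  // solves A * X = I
      Matrices.dense(block.numRows, block.numCols, inv.toArray)
    }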
scalarMul method (as described in Algorithm 5) takes a BlockMatrix and returns another BlockMatrix using a map transformation. The map takes the blocks one by one and multiplies each element of the block with the scalar.

Algorithm 5: Spark Algorithm for multiplying a scalar to a distributed matrix

function scalarMul();
begin
    Input : BlockMatrix A, double scalar, int blockSize
    Output: BlockMatrix productMat
    ARDD ← A.toRDD()
    Map();
    begin
        Input : block of ARDD
        Output: block of productRDD
        product ← block.matrix.toDoubleMatrix
        block.matrix ← product.toMatrix
        return block
    end
    productMat ← productRDD.toBlockMatrix()
    return productMat
end

arrange method (as described in Algorithm 6) takes four sub-matrices of size 2^{n−1}, which represent the four quadrants of a full matrix of size 2^n, arranges them into the latter and returns it as a BlockMatrix. It consists of four maps, one for each separate BlockMatrix. Each map maps the block index to a different block index that provides the final position of the block in the result matrix.

Algorithm 6: Spark Algorithm for rearranging four sub-matrices into a single matrix

function arrange();
begin
    Input : BlockMatrix C11, BlockMatrix C12, BlockMatrix C21, BlockMatrix C22, int size, int blockSize
    Output: BlockMatrix arranged
    C11RDD ← C11.toRDD(); C12RDD ← C12.toRDD()
    C21RDD ← C21.toRDD(); C22RDD ← C22.toRDD()
    Map();
    begin
        Input : block of C12RDD
        Output: block of C1
        block.colIndex ← block.colIndex + size
        return block
    end
    Map();
    begin
        Input : block of C21RDD
        Output: block of C2
        block.rowIndex ← block.rowIndex + size
        return block
    end
    Map();
    begin
        Input : block of C22RDD
        Output: block of C3
        block.rowIndex ← block.rowIndex + size
        block.colIndex ← block.colIndex + size
        return block
    end
    unionRDD ← C11RDD.union(C1.union(C2.union(C3)))
    C ← unionRDD.toBlockMatrix()
    return C
end

4 PERFORMANCE ANALYSIS
In this section, we attempt to estimate the performance of the proposed approach and of the state-of-the-art approach using LU decomposition for distributed matrix inversion. In this work, we are interested in the wall clock running time of the algorithms for varying numbers of nodes, matrix sizes and other algorithmic parameters, e.g., partition / block sizes. This is because we are interested in the practical efficiency of our algorithm, which includes not only the time spent by the processes in the CPU, but also the time spent waiting for the CPU as well as on data communication during shuffle. The wall clock time depends on three independently analyzed quantities: the total computational complexity of the sub-tasks to be executed, the total communication complexity between executors of different sub-tasks on each of the nodes, and the parallelization factor of each of the sub-tasks, i.e. the total number of processor cores available.

Later, in section 5, we compare the theoretically derived estimates of wall clock time with the empirically observed ones, for validation. We consider only square matrices of dimension 2^p for all the derivations. The key input and tunable parameters for the algorithms are:

• n = 2^p : number of rows or columns in matrix A
• b : number of splits
• n/b = 2^q : block size in matrix A
• cores : total number of physical cores in the cluster
• i : current processing level of the algorithm in the recursion tree
• m : total number of levels of the recursion tree

Therefore,

• total number of blocks in matrix A or B = b²
• b = 2^{p−q}
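Purely as a toy illustration of how this per-level accounting is assembled (the exact per-method expressions are the ones derived in the lemmas below and summarized in Table 1, not the ones hard-coded here), a wall clock estimate sums, over the m recursion levels, the work of each method divided by its parallelization factor min(tasks, cores):

    import scala.math.{log, min, pow}

    // Generic wall-clock model: at level i the work of a method is divided by
    // its parallelization factor min(tasks(i), cores); the terms are summed.
    def wallClock(m: Int, cores: Int,
                  workAtLevel: Int => Double,
                  tasksAtLevel: Int => Double): Double =
      (0 until m).map(i => workAtLevel(i) / min(tasksAtLevel(i), cores.toDouble)).sum

    // Example plug-in (illustrative model only): 2^i nodes at level i, each
    // issuing 6 block multiplications of dimension n / 2^(i+1); parallelism is
    // the (b / 2^(i+1))^2 blocks available at that level.
    def multiplyTerm(n: Double, b: Double, cores: Int): Double = {
      val m = (log(b) / log(2)).toInt                      // m = log2(b) levels
      wallClock(m, cores,
        i => pow(2, i) * 6 * pow(n / pow(2, i + 1), 3),
        i => pow(b / pow(2, i + 1), 2))
    }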

Lemma 4.1. The proposed distributed block recursive Strassen's matrix inversion algorithm, or SPIN (presented in Algorithm 2), has a complexity, in terms of wall clock execution time, where n is the matrix dimension, b is the number of splits, and cores is the actual number of physical cores available in the cluster, of

    Cost_SPIN = n³/b² + (10b² − 6b) / min[b²/4^i, cores]
              + (b − 1)(9b² + n²(b + 1)) / (6b · min[b²/4^{i+1}, cores])
              + n²(b²n + b² − 2n) / (6b² · min[n²/4^{i+1}, cores])                 (1)

Proof. Before going into the details of the analysis, we give the performance analysis of the methods described in section 3.3. A summary of the independently analyzed quantities is given in Table 1.

There are two primary parts of the algorithm — the if part and the else part. The if part does the calculation at the leaf nodes of the recursion tree shown in Figure 1, while the else part does the computation for the internal nodes. It is clearly seen from the figure that, at level i, there are 2^i nodes and the leaf level contains 2^{p−q} nodes. There is only one transformation in the if part, which is a map. It calculates the inverse of a matrix block on a single node using a serial matrix inversion method. The size of each block is n/b and we need ≈ (n/b)³ time to perform each such inversion. Therefore, the computation cost to process all the leaf nodes is

    Comp_leafNode = 2^{p−q} × (n/b)³ = n³/b²                                       (2)

In SPIN, a leaf node processes one block on a single machine of the cluster. In spite of it being small enough to be accommodated in a single node, we do not collect it in the master node, to avoid the communication cost. Instead, we do a map which takes the only block of the RDD, does the calculation and returns the RDD again.

breakMat method takes a BlockMatrix and returns a PairRDD of tag and Block using a mapToPair transformation. If the method is executed for m levels, the computation cost of breakMat is

    Comp_breakMat = Σ_{i=0}^{m−1} 2^i × b²/4^i = 2b(b − 1)                         (3)

Note that the ith level contains 2^i nodes. Here each block is consumed in parallel, giving a parallelization factor of

    PF_breakMat = min[b²/4^i, cores]                                               (4)

The total numbers of blocks processed in filter and map are b²/4^i and b²/4^{i+1} at the ith level respectively. Consequently, the parallelization factors of the two are min[b²/4^i, cores] and min[b²/4^{i+1}, cores] respectively. Therefore, the computation cost for xy is

    Comp_xy = (Σ_{i=0}^{m−1} 2^i × b²/4^i) / min[b²/4^i, cores]
            + (Σ_{i=0}^{m−1} 2^i × b²/4^{i+1}) / min[b²/4^{i+1}, cores]
            = (8b² − 4b) / min[b²/4^i, cores] + (2b² − 2b) / min[b²/4^{i+1}, cores]  (5)

multiply method multiplies two BlockMatrices; summed over all recursion levels, its computation cost comes to

    Comp_multiply = n²(b²n + b² − 2n) / (6b²)                                      (6)

and the parallelization factor will be

    PF_multiply = min[n²/4^{i+1}, cores]                                           (7)

subtract method subtracts two BlockMatrices using a map transformation. There are two subtractions at each recursion level. Therefore,

    Comp_subtract = Σ_{i=0}^{m−1} 2^i × n²/4^{i+1} = n²(b − 1) / (2b)              (8)

and the parallelization factor will be

    PF_subtract = min[n²/4^{i+1}, cores]                                           (9)

scalarMul method (as described in Algorithm 5) takes a BlockMatrix and returns another BlockMatrix using a map transformation. The map takes the blocks one by one and multiplies each element of the block with the scalar. Therefore, the computation cost of scalarMul is

    Comp_scalarMul = Σ_{i=0}^{m−1} 2^i × b²/4^{i+1} = (b/2)(b − 1)                 (10)

Again, here each block is consumed in parallel, giving a parallelization factor of

    PF_scalarMul = min[b²/4^{i+1}, cores]                                          (11)

arrange method (as described in Algorithm 6) takes four sub-matrices of size 2^{n−1}, which represent the four quadrants of a full matrix of size 2^n, arranges them into the latter and returns it as a BlockMatrix. It consists of four maps, one for each separate BlockMatrix. Each map maps the block index to a different block index that provides the final position of the block in the result matrix. The computation cost and parallelization factor of these maps are the same as those of scalarMul, given in equations 10 and 11.

SPIN requires 4 xy method calls, 6 multiplications, and 2 subtractions for each recursion level. Summing these up gives equation 1. □

Lemma 4.2. The proposed distributed block recursive LU decomposition based matrix inversion algorithm (presented in Algorithms 5, 6, and 7 in [10]) has a complexity, in terms of wall clock execution time, where n is the matrix dimension, b is the number of splits, and cores is the actual number of physical cores available in the cluster, as below

Table 1: Summary of the cost analysis of LU and SPIN

Method                         | Computation Cost (LU)          | Computation Cost (SPIN) | Parallelization Factor (LU) | Parallelization Factor (SPIN)
leafNode                       | 9 × n³/b²                      | n³/b²                   | —                           | —
breakMat                       | (2/3)(b² − 3b + 2)             | 2b² − 2b                | min[b²/4^i, cores]          | min[b²/4^i, cores]
xy (filter)                    | (2/3)(b² − 3b + 2)             | 8b² − 4b                | min[b²/4^{i+1}, cores]      | min[b²/4^i, cores]
xy (map)                       | (1/6)(b² − 3b + 2)             | 2b² − 2b                | min[b²/4^{i+2}, cores]      | min[b²/4^{i+1}, cores]
multiply (large)               | (16n³/21b³)(b³ − 7b + 6)       | (n³/6b²)(b² − 1)        | min[n²/4^i, cores]          | min[n²/4^{i+1}, cores]
multiply communication (large) | 8n²(b² − 1)(8b² − 112)/(105b²) | n²(b² − 1)/(6b)         | min[b²/4^i, cores]          | min[b²/4^{i+1}, cores]
multiply (small)               | (8n³/42b³)(b³ − 7b + 6)        | —                       | min[n²/4^{i+1}, cores]      | —
multiply communication (small) | n²(b² − 1)(8b² − 112)/(105b²)  | —                       | min[b²/4^{i+1}, cores]      | —
subtract                       | (2n²/3b²)(b² − 3b + 2)         | (n²/2b)(b − 1)          | min[n²/4^i, cores]          | min[n²/4^{i+1}, cores]
scalarMul                      | (4/3)(b² − 3b + 2)             | (b/2)(b − 1)            | min[b²/4^i, cores]          | min[b²/4^{i+1}, cores]
arrange                        | —                              | (b/2)(b − 1)            | —                           | min[b²/4^{i+1}, cores]
Additional Cost                | 7 × (n/2)³                     | —                       | min[n²/4, cores]            | —

    Cost_LU = 9n³/b² + (b − 1)[210b²(b − 2) + 64n²(b + 1)(b² − 14)] / (105b² · min[b²/4^i, cores])
            + (b − 1)[70b²(b − 2) + 8n²(b + 1)(b² − 14)] / (105b² · min[b²/4^{i+1}, cores])
            + n²(b − 1)(b − 2) / (105b³ · min[b²/4^{i+2}, cores])
            + 2n²(b − 1)[8n(b² + b + 6) + 7b(b − 2)] / (21b³ · min[n²/4^i, cores])
            + 8n³(b − 1)(b² + b − 6) / (42b³ · min[n²/4^{i+1}, cores])
            + 7n³ / (8 · min[n²/4, cores])                                         (12)

Proof. Liu et al. in [10] have described several algorithms for distributed matrix inversion using LU decomposition. We refer to the most optimized one (stated as Algorithms 5, 6 and 7 in that paper) for the performance analysis. The core computation of the algorithm is done with 1) a call to a recursive method LU, which decomposes the input matrix recursively until the leaf nodes of the tree, where the size of the matrix reaches the block size, and 2) the computation after the LU decomposition. The matrix inversion algorithm performs 7 additional multiplications (as given in Algorithm 5 in [10]) of matrices of dimension n/2. We call this the Additional Cost, and it can be obtained as

    Comp_AdditionalCost = 7 × (n/2)³ / min[n²/4, cores]                            (13)

There are two primary parts of the LU method — the if part and the else part. The if part does the LU decomposition at the leaf nodes of the recursion tree, while the else part does it for the internal nodes. The if part requires 2 LU decompositions, 4 matrix inversions and 3 matrix multiplications, and there are 2^{p−q} leaf nodes in the recursion tree. Each of these operations requires O((n/b)³) time for a matrix of dimension n/b. Therefore, the total cost of the if part is

    Comp_leafNode = 9 × 2^{p−q} × (n/b)³ = 9 × n³/b²                               (14)

The else part requires 4 multiply, 1 subtraction and 2 calls to the getLU method. The getLU method composes the LU of a matrix by taking 9 matrices of dimension 2^k and arranging them to return 3 matrices of size 2^{k+1}. It requires 4 multiply and 2 scalarMul calls on matrices of dimension 2^k.

The recursion scheme of LU decomposition is a little different from that of SPIN. Here the number of LU calls at level i is 2^i − 1 instead of the 2^i of SPIN. The computation and communication costs for the methods (summarized in Table 1) can be summed up to get equation 12. □

5 EXPERIMENTS
In this section, we perform experiments to evaluate the execution efficiency of our implementation SPIN, comparing it with the distributed LU decomposition based inversion approach (referred to as LU from now on), and the scalability of the algorithm compared to ideal scalability. First, we select the fastest wall clock execution time among the different partition sizes for each approach and compare them.

Table 2: Summary of Test setup components specifications

Component Name   | Component Size   | Specification
Processor        | 2                | Intel Xeon 2.60 GHz
Core             | 6 per processor  | NA
Physical Memory  | 132 GB           | NA
Ethernet         | 14 Gb/s          | Infini Band
OS               | NA               | CentOS 5
File System      | NA               | Ext3
Apache Spark     | NA               | 2.1.0
Apache Hadoop    | NA               | 2.6.0
Java             | NA               | 1.7.0 update 79

[Figure 2: Fastest running time of LU and Strassen's based inversion among different block sizes — x-axis: Matrix Size; y-axis: Execution Time in Sec.; curves: LU, SPIN, O(n³), O(n^3.2)]

Second, we conduct a series of experiments to individually evaluate the effect of partition size and matrix size on each competing approach. At last, we evaluate the scalability of our implementation.

5.1 Test Setup
All the experiments are carried out on a dedicated cluster of 3 nodes. Software and hardware specifications are summarized in Table 2. Here NA means Not Applicable.

For block level multiplications both the implementations use JBlas [9], a linear algebra library for Java based on BLAS and LAPACK. We have tested the algorithms on matrices with increasing cardinality from (16 × 16) to (16384 × 16384). All of these test matrices have been generated randomly using the Java Random class.

Resource Utilization Plan. While running the jobs in the cluster, we customize three parameters — the number of executors, the executor memory and the executor cores. We wanted a fair comparison among the competing approaches and therefore we ensured that jobs did not experience thrashing and that in none of the cases did tasks fail so that jobs had to be restarted. For this reason, we restricted ourselves to parameter values which provide good utilization of cluster resources while mitigating the chance of task failures. By experimentation we found that keeping the executor memory at 50 GB ensures successful execution of jobs without "out of memory" errors or any task failures for all the competing approaches. This includes the small amount of overhead added to the full request to YARN for each executor, which is equal to 3.5 GB. Therefore, the executor memory is 46.5 GB. Though the physical memory of each node is 132 GB, we keep only 100 GB as YARN resource allocated memory for each node. Therefore, the total physical memory for job execution is 100 GB, resulting in 2 executors per node and a total of 6 executors. We reserve 1 core for the operating system and hadoop daemons. Therefore, the available total cores per node is 11. This leaves 5 cores for each executor. We used these values of the run time resource parameters in all the experiments except the scalability test, where we have tested the approach with a varied number of executors.

5.2 Comparison with state-of-the-art distributed systems
In this section, we compare the performance of SPIN with LU. We report the running time of the competing approaches with increasing matrix dimension in Figure 2. We take the best wall clock time (fastest) among all the running times taken for different block sizes. It can be seen that SPIN takes the minimum amount of time for all matrix dimensions. Also, as expected, the wall clock execution time increases with the matrix dimension, non-linearly (roughly as O(n³)). Also, the gap in wall clock execution time between SPIN and LU increases monotonically with the input matrix dimension. As we shall see in the next section, both LU and SPIN follow a U shaped curve as a function of block size, hence allowing us to report the minimum wall clock execution time over all block sizes.

5.3 Variation with partition size
In this experiment, we examine the performance of SPIN and LU with increasing partition size for each matrix size. We report the wall clock execution time of the approaches when the partition size is increased within a particular matrix size. For each matrix size (from (4096 × 4096) to (16384 × 16384)) we increase the partition size until we get an intuitive change in the results, as shown in Figure 3. It can be seen that both LU and SPIN follow a U shaped curve. However, SPIN outperforms LU when they have the same partition size, for all the matrix sizes. The reasons for this are manifold. First of all, LU requires 9 times more O((n/b)³) operations compared to a single operation of SPIN. For small partition sizes, where leafNode dominates the overall wall clock execution time, this cost is responsible for LU's slower performance.

Additionally, when the partition size increases, the number of recursion levels also increases and consequently the cost of the multiply method, which is the costliest method call, increases. Though there is a difference between the numbers of recursion levels for any partition size, the additional matrix multiplication cost (as shown in Table 1) is enough to slow down LU's performance.

[Figure 3: three panels — (a) 4096 × 4096, (b) 8192 × 8192, (c) 16384 × 16384; x-axis: Partition Size, y-axis: Execution Time in Sec.; curves: LU, SPIN]

Figure 3: Comparing running time of LU and SPIN for matrix size (4096 × 4096), (8192 × 8192), (16384 × 16384) for increasing partition size

5.4 Comparison between theoretical and experimental result
In this experiment, we compare the theoretical cost of SPIN with the experimental wall clock execution time to validate our theoretical cost analysis. Figure 4 shows the comparison for three matrix sizes (from (4096 × 4096) to (16384 × 16384)) and, for each matrix size, with increasing partition size.

Table 3: Experimental results of wall clock execution time of different methods in SPIN (the unit of execution time is millisecond)

Method    | b=2   | b=4   | b=8   | b=16
leafNode  | 43504 | 11550 | 5040  | 3980
breakMat  | 178   | 441   | 901   | 1764
xy        | 2913  | 1353  | 693   | 309
multiply  | 7836  | 13116 | 23256 | 37968
subtract  | 1412  | 1854  | 2820  | 5592
scalar    | 333   | 728   | 1308  | 2450
arrange   | 307   | 685   | 1510  | 3074
Total     | 56483 | 29727 | 35528 | 55137

As expected, both theoretical and experimental wall clock execution times show a U shaped curve with increasing partition size. The reason is that, for smaller partition sizes, the block size becomes very large for large matrix sizes. As a result, the single node matrix inversion takes most of the execution time and subdues the effect of the matrix multiplication execution time, which is processed distributedly. That is why we find a large execution time at the beginning, which is also depicted in Table 3, where the experimental wall clock execution time is tabulated for the different methods used in the algorithm for a matrix of dimension 4096. It is seen that for b = 2, the leafNode cost is far more than that of the matrix multiply method. Later, when the partition size increases further, the leaf node cost drops sharply, as the cost depends on n³/b², which decreases with the square of the partition size. On the other hand, the number of multiplications becomes large for the increased number of recursion levels, and this becomes the effective cost which subdues the effect of the leafNode cost. As shown in Table 3, from b = 8 onwards the multiply cost becomes more and more dominant, resulting in a further increase in wall clock execution time.

5.5 Scalability
In this section, we investigate the scalability of SPIN. For this, we generate three test cases, each containing a different set of two matrices of sizes equal to (4096 × 4096), (8192 × 8192) and (16384 × 16384). The running time vs. the number of Spark executors for these 3 pairs of matrices is shown in Figure 5. The ideal scalability line (i.e. T(n) = T(1)/n, where n is the number of executors) has been over-plotted on this figure in order to demonstrate the scalability of our algorithm. We can see that SPIN has good scalability, with a minor deviation from ideal scalability when the size of the matrix is low (i.e. for (4096 × 4096) and (8192 × 8192)).

6 CONCLUSION
In this paper, we have focused on the problem of distributed matrix inversion of large matrices using the Spark framework. To make large scale matrix inversion faster, we have implemented Strassen's matrix inversion technique, which requires six multiplications in each recursion step. We have given the detailed algorithm, called SPIN, of the implementation and also presented the details of the cost analysis along with the baseline approach using LU decomposition. By doing that, we discovered that the primary bottleneck of the inversion algorithm is matrix multiplication and that SPIN is faster as it requires a smaller number of multiplications compared to the LU based approach.

We have also performed extensive experiments on the wall clock execution time of both approaches for increasing partition size as well as increasing matrix size. Results showed that SPIN outperformed LU for all the partition and matrix sizes and also that the difference increases as we increase the matrix size. We also showed the resemblance between the theoretical and experimental findings for SPIN, which validated our cost analysis. At last we showed that SPIN has good scalability with increasing matrix size.

[Figure 4: three panels — (a) 4096 × 4096, (b) 8192 × 8192, (c) 16384 × 16384; x-axis: Partition Size, y-axis: Execution Time in Sec.; curves: Theoretical, Experimental]

Figure 4: Comparing theoretical and experimental running time of SPIN for matrix size (4096×4096), (8192×8192), (16384×16384) for increasing partition size

[Figure 5: three panels — (a) 4096 × 4096, (b) 8192 × 8192, (c) 16384 × 16384; x-axis: Total Number of Physical Cores, y-axis: Execution Time in Sec.; curves: Ideal, SPIN]

Figure 5: The scalability of SPIN, in comparison with ideal scalability (blue line), on matrix (4096 × 4096), (8192 × 8192) and (16384 × 16384)

REFERENCES
[1] Emmanuel Agullo, Henricus Bouwmeester, Jack Dongarra, Jakub Kurzak, Julien Langou, and Lee Rosenberg. 2010. Towards an Efficient Tile Matrix Inversion of Symmetric Positive Definite Matrices on Multicore Architectures. In VECPAR, Vol. 10. Springer, 129–138.
[2] Ali Al Essa and Miad Faezipour. 2017. MapReduce and Spark-Based Analytic Framework Using Social Media Data for Earlier Flu Outbreak Detection. In Industrial Conference on Data Mining. Springer, 246–257.
[3] Steven C Althoen and Renate Mclaughlin. 1987. Gauss-Jordan reduction: A brief history. The American Mathematical Monthly 94, 2 (1987), 130–142.
[4] Paolo Bientinesi, Brian Gunter, and Robert A Geijn. 2008. Families of algorithms related to the inversion of a symmetric positive definite matrix. ACM Transactions on Mathematical Software (TOMS) 35, 1 (2008), 3.
[5] Adrian Burian, Jarmo Takala, and Mikko Ylinen. 2003. A fixed-point implementation of matrix inversion using Cholesky decomposition. In Circuits and Systems, 2003 IEEE 46th Midwest Symposium on, Vol. 3. IEEE, 1431–1434.
[6] Pablo Ezzatti, Enrique S Quintana-Orti, and Alfredo Remon. 2011. High performance matrix inversion on a multi-core platform with several GPUs. In Parallel, Distributed and Network-Based Processing (PDP), 2011 19th Euromicro International Conference on. IEEE, 87–93.
[7] Joseph E Gonzalez, Reynold S Xin, Ankur Dave, Daniel Crankshaw, Michael J Franklin, and Ion Stoica. 2014. GraphX: Graph Processing in a Distributed Dataflow Framework. In OSDI, Vol. 14. 599–613.
[8] KK Lau, MJ Kumar, and R Venkatesh. 1996. Parallel matrix inversion techniques. In Algorithms & Architectures for Parallel Processing, 1996. ICAPP 96. 1996 IEEE Second International Conference on. IEEE, 515–521.
[9] Linear Algebra for Java 2017. Linear Algebra for Java. http://jblas.org/. (2017). [Online; accessed 30-July-2017].
[10] Jun Liu, Yang Liang, and Nirwan Ansari. 2016. Spark-based large-scale matrix inversion for big data processing. IEEE Access 4 (2016), 2166–2176.
[11] Xiangrui Meng, Joseph Bradley, Burak Yavuz, Evan Sparks, Shivaram Venkataraman, Davies Liu, Jeremy Freeman, DB Tsai, Manish Amde, Sean Owen, et al. 2016. MLlib: Machine learning in Apache Spark. The Journal of Machine Learning Research 17, 1 (2016), 1235–1241.
[12] Rahul Palamuttam, Renato Marroquín Mogrovejo, Chris Mattmann, Brian Wilson, Kim Whitehall, Rishi Verma, Lewis McGibbney, and Paul Ramirez. 2015. SciSpark: Applying in-memory distributed computing to weather event detection and tracking. In Big Data (Big Data), 2015 IEEE International Conference on. IEEE, 2020–2026.
[13] William H Press. 2007. Numerical Recipes 3rd Edition: The Art of Scientific Computing. Cambridge University Press.
[14] Zhengping Qian, Xiuwei Chen, Nanxi Kang, Mingcheng Chen, Yuan Yu, Thomas Moscibroda, and Zheng Zhang. 2012. MadLINQ: large-scale distributed matrix computation for the cloud. In Proceedings of the 7th ACM European Conference on Computer Systems. ACM, 197–210.
[15] Girish Sharma, Abhishek Agarwala, and Baidurya Bhattacharya. 2013. A fast parallel Gauss Jordan algorithm for matrix inversion using CUDA. Computers & Structures 128 (2013), 31–37.
[16] Volker Strassen. 1969. Gaussian elimination is not optimal. Numerische Mathematik 13, 4 (1969), 354–356.
[17] Jingen Xiang, Huangdong Meng, and Ashraf Aboulnaga. 2014. Scalable matrix inversion using MapReduce. In Proceedings of the 23rd International Symposium on High-performance Parallel and Distributed Computing. ACM, 177–190.
[18] Kaiqi Yang, Yubai Li, and Yijia Xia. 2013. A parallel method for matrix inversion based on Gauss-Jordan algorithm. Journal of Computational Information Systems 9, 14 (2013), 5561–5567.
[19] Matei Zaharia, Mosharaf Chowdhury, Michael J Franklin, Scott Shenker, and Ion Stoica. 2010. Spark: Cluster computing with working sets. HotCloud 10, 10-10 (2010), 95.