Towards a GraphBLAS Library in Chapel

Ariful Azad, Aydın Buluç
Computational Research Division
Lawrence Berkeley National Laboratory

Abstract—The adoption of a programming language is positively influenced by the breadth of its software libraries. Chapel is a modern and relatively young parallel programming language. Consequently, not many domain-specific software libraries exist that are written for Chapel. Graph processing is an important domain with many applications in cyber security, energy, social networking, and health. Implementing graph algorithms in the language of linear algebra enables many advantages including rapid development, flexibility, high performance, and scalability. The GraphBLAS initiative aims to standardize an interface for linear-algebraic primitives for graph computations. This paper presents initial experiences and findings of implementing a subset of important GraphBLAS operations in Chapel. We analyzed the bottlenecks in both shared and distributed memory. We also provided alternative implementations whenever the default implementation lacked performance or scaling.

I. INTRODUCTION

Chapel is a high-performance programming language developed by Cray [1]. Supporting a multithreaded execution model, Chapel provides a different model of programming than the Single Program, Multiple Data (SPMD) paradigm that is prevalent in many of the HPC languages and libraries.

GraphBLAS [2] is a community effort to standardize linear-algebraic building blocks for graph computations (hence the suffix -BLAS in its name). In GraphBLAS, the graph itself is represented as a matrix, which is often sparse, and the operations on graphs are expressed in basic linear algebra operations such as matrix-vector multiplication or generalized matrix indexing [3]. Chapel provides support for index sets as first-class citizens, making it an interesting and potentially productive language for implementing distributed sparse matrices.

In this work, we report on our early experiences in implementing a sizable set of GraphBLAS operations in Chapel. Our experience so far suggests that built-in implementations of many sparse matrix operations are not scalable enough to be used for large-scale distributed computing. For some of these operations, we provide alternative implementations that improve the scalability substantially.

In many cases, we have found that adhering to a stricter SPMD programming style provides better performance than relying on the recommended Chapel multithreaded programming style. According to our preliminary analysis, this is due to the thread creation and communication costs involved in spawning threads in distributed memory, especially when the data size is not large enough to create work that would amortize the parallelization overheads. This problem, often called burdened parallelism in the literature [4], is not specific to Chapel; it manifests in many parallel programming platforms such as OpenMP. However, the problem is exacerbated in distributed memory due to increased thread creation and communication costs. The primary goal in implementing a GraphBLAS-compliant library is performance. Consequently, we believe that divergence from the recommended programming style is justified, as the library backend is rarely inspected by users and might not even be available for inspection.

II. BACKGROUND

A. Matrix and vector notations

A matrix A ∈ R^{m×n} is said to be sparse when it is computationally advantageous to treat it differently from a dense matrix. In our experiments, we only use square matrices and denote the number of rows/columns of the matrix by n. The capacity of a vector x ∈ R^{n×1} is the number of entries it can store. The nnz() function computes the number of nonzeros in its input, e.g., nnz(x) returns the number of nonzeros in x. For a sparse vector x, nnz(x) is less than or equal to capacity(x).

Our algorithms work for all inputs with different sparsity structures of the matrices and vectors. However, for simplicity, we only experimented with randomly generated matrices and vectors. Randomly generated matrices give us precise control over the nonzero distribution; therefore, they are very useful in evaluating our prototype library. In the Erdős–Rényi random graph model G(n, p), each edge is present with probability p independently of the others. For p = d/m where d ≪ m, in expectation d nonzeros are uniformly distributed in each column. We use f as shorthand for nnz(x)/capacity(x), which is the density of a sparse vector.

In this paper we only considered the Compressed Sparse Rows (CSR) format to store a sparse matrix because this is what is supported in Chapel. CSR has three arrays: rowptrs is an integer array of length n+1 that effectively stores pointers to the start and end positions of the nonzeros for each row; colids is an integer array of length nnz that stores the column ids of the nonzeros; and values is an array of length nnz that stores the numerical values of the nonzeros. CSR supports random access to the start of a row in constant time. In Chapel, CSR matrices keep the column ids of nonzeros within each row sorted. In Chapel, the indices of sparse vectors are also kept sorted and stored in an array. This format is space efficient, requiring only O(nnz) space.

B. Chapel notations

A locale is a Chapel abstraction for a piece of a target architecture that has processing and storage capabilities. Therefore, a locale is often used to represent a node of a distributed-memory system.

In this paper we only used 2-D block-distributed partitions of sparse matrices and vectors [5], since they have been shown to be more scalable than 1-D block distributions of matrices and vectors. In a 2-D block distribution, locales are organized in a two-dimensional grid and array indices are partitioned "evenly" across the target locales.

An example of creating a 2-D block-distributed sparse matrix is shown in Listing 1. A 2-D block-distributed array relies on four classes: (a) SparseBlockDom, (b) LocSparseBlockDom, (c) SparseBlockArr, and (d) LocSparseBlockArr. SparseBlockDom and SparseBlockArr describe the distributed domains and arrays, respectively. LocSparseBlockDom and LocSparseBlockArr describe non-distributed domains and arrays placed on individual locales. The SparseBlockDom class defines locDoms: a non-distributed array of local domain classes. Similarly, the SparseBlockArr class defines locArr: a non-distributed array of local array classes. For efficiency, we directly manipulate local domains and arrays via the _value field of these classes. The actual local domains and arrays in the SparseBlockDom and SparseBlockArr classes can be accessed by mySparseBlock and myElems, respectively.

Listing 1: Creating a block-distributed sparse array

  var n = 6;
  var D = {0..#n, 0..#n} dmapped Block({0..#n, 0..#n}, sparseLayoutType=CSR);
  var spD: sparse subdomain(D); // sparse domain
  spD = ((0,0), (2,3));         // adding indices
  var A: [spD] int;             // sparse array

Listing 2: apply() - version 1

  // Implementing apply() using a forall loop
  proc Apply1(spArr, unaryOp)
  {
    forall a in spArr do
      a = unaryOp(a);
  }

Listing 3: apply() - version 2

  // Implementing apply() with local arrays
  proc Apply2(spArr, unaryOp) {
    var locArrs = spArr._value.locArr;
    coforall locArr in locArrs do
      on locArr {
        forall a in locArr.myElems do
          a = unaryOp(a);
      }
  }

C. Experimental platform

We evaluate the performance of our implementations on Edison, a Cray XC30 supercomputer at NERSC. In Edison, nodes are interconnected with the Cray Aries network using a Dragonfly topology. Each compute node is equipped with 64 GB RAM and two 12-core 2.4 GHz Intel Ivy Bridge processors, each with 30 MB L3 cache. We built Chapel version 1.14.0 from source using gcc 6.1.0. We built Chapel from source because the Cray-provided compiler on Edison is much older and does not have several of the latest sparse array functionalities. We used the aries conduit for GASNet and the slurm-srun launcher. Finally, the qthreads threading package [6] from Sandia National Labs was used for threading.

III. GRAPHBLAS OPERATIONS

The upcoming GraphBLAS specification and the C language API contain approximately ten distinct functions, not accounting for overloads for different objects [7]. The API does not differentiate matrices as sparse or dense. Instead, it leaves it to the runtime to fetch the most appropriate implementation. Consequently, it also does not differentiate operations based on the sparsity of their operands. For example, the MXV operation can be used to multiply a dense matrix with a dense vector, a sparse matrix with a sparse vector, or a sparse matrix with a dense vector. Efficient backend implementations, however, have to specialize their implementations based on sparsity for optimal performance.

In this work, we target a sizable subset of the GraphBLAS specification. Our operations are chosen such that they can be composed to implement an efficient breadth-first search algorithm, which is often the "hello world" example of GraphBLAS. Since we are illustrating an efficient backend, we also specialize our operations based on the sparsity of their operands. Below is the list of operations we focus on in this paper:

  • Apply applies a unary operator to only the nonzeros of a matrix or a vector.
  • Assign assigns a matrix (vector) to a subset of indices of another matrix (vector).
  • eWiseMult can be used to perform element-wise multiplication of two matrices (vectors).
  • SpMSpV multiplies a sparse matrix with a sparse vector on a semiring.

A powerful aspect of GraphBLAS is its ability to work on arbitrary semirings, monoids, and functions. In layman's terms, a GraphBLAS semiring allows overloading the scalar multiplication and addition with user-defined binary operators. A semiring also has to contain an additive identity element. A GraphBLAS monoid is a semiring with only one binary operator and an identity element. Finally, a GraphBLAS function is simply a binary operator and is allowed in operations that do not require an identity element.
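To make the semiring machinery concrete, the following sketch shows SpMSpV over a user-supplied semiring. It is written in Python for brevity rather than Chapel, operates on the CSR arrays described in Section II-A with the sparse vector held as an index-to-value map, and every name in it is illustrative only; none of this is the GraphBLAS API or our library's code.

```python
# Illustrative sketch (not our Chapel code): y = A x over a semiring, where
# A is stored in CSR (rowptrs, colids, values) and the sparse vector x is a
# dict {index: value}.  add_op, mul_op, and identity define the semiring.

def spmspv(rowptrs, colids, values, x, add_op, mul_op, identity):
    y = {}
    nrows = len(rowptrs) - 1
    for i in range(nrows):
        acc, hit = identity, False
        for k in range(rowptrs[i], rowptrs[i + 1]):
            j = colids[k]
            if j in x:  # multiply only where both A[i,j] and x[j] are nonzero
                acc = add_op(acc, mul_op(values[k], x[j]))
                hit = True
        if hit:
            y[i] = acc  # keep the output sparse
    return y
```

Instantiating add_op/mul_op with (+, *) and identity 0 recovers the ordinary matrix-vector product restricted to the nonzeros of x, while booleans with (or, and) yield the traversal semiring that breadth-first search composes from these operations.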
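Because Chapel keeps sparse-vector indices sorted (Section II-A), eWiseMult reduces to a linear-time merge of two sorted index lists. The fragment below is an illustrative Python sketch of that merge, not our Chapel implementation; the function name and pair-list representation are assumptions made for the example.

```python
# Illustrative sketch: element-wise multiplication of two sparse vectors
# stored as (index, value) pairs with indices sorted ascending, as Chapel
# stores sparse-vector indices.  Runs in O(nnz(x) + nnz(y)) time.

def ewisemult(x, y, mul_op):
    out, i, j = [], 0, 0
    while i < len(x) and j < len(y):
        xi, yj = x[i][0], y[j][0]
        if xi == yj:            # index present in both: emit a product
            out.append((xi, mul_op(x[i][1], y[j][1])))
            i += 1
            j += 1
        elif xi < yj:           # index only in x: skip it
            i += 1
        else:                   # index only in y: skip it
            j += 1
    return out
```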
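The 2-D block distribution of Section II-B can be pictured as mapping each index pair of an m × n array to one locale in a pr × pc grid. The helper below is a hypothetical sketch of such a mapping in Python; Chapel's Block distribution computes its partitioning internally and this is only meant to convey the idea.

```python
# Illustrative sketch of 2-D block partitioning (Section II-B): which locale
# in a pr x pc grid owns index (i, j) of an m x n array.  Hypothetical
# helper, not Chapel's Block dmap.

def owner_locale(i, j, m, n, pr, pc):
    return ((i * pr) // m, (j * pc) // n)
```

For a 6 × 6 array on a 2 × 2 locale grid, rows 0-2 fall in grid row 0 and rows 3-5 in grid row 1 (and likewise for columns), so the indices are split "evenly" across the four locales.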
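The claim in Section II-A that the Erdős–Rényi model G(n, p) with p = d/m (so p = d/n for our square matrices) yields d nonzeros per column in expectation can be checked empirically with a short sketch; the Python below is illustrative only and not part of the experimental setup.

```python
# Illustrative check: in G(n, p) with p = d/n, a column holds d nonzeros
# in expectation.  Samples `cols` columns and returns the average count.
import random

def avg_column_nnz(n, d, cols, seed=1):
    rng = random.Random(seed)
    p = d / n
    total = sum(sum(1 for _ in range(n) if rng.random() < p)
                for _ in range(cols))
    return total / cols
```

With n = 10000 and d = 8, the average over 100 sampled columns comes out close to 8, matching the expectation.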
