
IBM POWER9 and cognitive computing

M. Kumar, W. P. Horn, J. Kepner, J. E. Moreira, P. Pattnaik

Cognitive applications are complex and are composed of multiple components exhibiting diverse workload behavior. Efficient execution of these applications requires systems that can effectively handle this diversity. In this paper, we show that IBM POWER9™ shared memory systems have the compute capacity and memory throughput to efficiently handle the broad spectrum of computing requirements for cognitive workloads. We first review the GraphBLAS interface defined for supporting cognitive applications, particularly whole-graph analytics. We show that this application-programming interface effectively separates the concerns between the analytics application developer and the system developer and simultaneously enables good performance by permitting system developers to make platform-specific optimizations. A linear algebra formulation and execution of the betweenness centrality kernel in the High-Performance Computing Scalable Graph Analysis Benchmark, for graphs with 256 million vertices and 2 billion edges, delivers a sixfold reduction in execution time over a reference implementation. Following that, we present the results of benchmarking the forward propagation step of deep neural networks (DNNs) written in GraphBLAS and executed on POWER9. We present the rationale and evidence for weight matrices of large DNNs being sparse and show that for sparse weight matrices, GraphBLAS/POWER has a two orders-of-magnitude performance advantage over dense implementations. Applications requiring analysis of graphs larger than several tens of billions of vertices require distributed computing environments such as Apache Spark to provide resilience and parallelism. We show that when linear algebra techniques are implemented in an Apache Spark environment, we are able to leverage the parallelism available in POWER9 servers.

Introduction

Cognitive systems create actionable knowledge from data. The recent growth in cognitive computing is due to the availability of a large volume of relevant data, large amounts of computational power, and the high value of the actionable knowledge to many large businesses [1]. The creation of actionable knowledge by a cognitive system encompasses four processing stages (Figure 1). Stage 1 is primarily intra-record analysis of extremely diverse data sources such as call records, click streams, images, or videos. The output of this stage is data tagged with metadata, the tags enabling fusion or linking of data from the diverse sources into a large graph in Stage 2. Various applications in healthcare, social network analytics, and financial fraud prevention use this large graph representation for modeling and analysis in Stage 3 [2]. Stage 2 also includes the data preparation (e.g., selection, curation, sampling, interpolation) prior to modeling in Stage 3. In this paper, we focus on whole-graph analytics. We do not dwell on queries that retrieve a fraction of the data, which are supported by various NoSQL databases such as Accumulo, Apache Giraph, Cassandra, CouchDB, MongoDB, and Neo4J.

The modeling phase of cognitive computing, Stage 3 depicted in Figure 1, encompasses two approaches. The first is driven by statistical models, primarily based on Bayesian methods. Various regression, classification, and clustering techniques, kernel methods in general, and support vector machines reside in this category [3, 4]. The second approach comprises neural-network-based approaches, particularly deep neural networks (DNNs) of various flavors [5–8].

Digital Object Identifier: 10.1147/JRD.2018.2846958

© Copyright 2018 by International Business Machines Corporation. Copying in printed form for private use is permitted without payment of royalty provided that (1) each reproduction is done without alteration and (2) the Journal reference and IBM copyright notice are included on the first page. The title and abstract, but no other portions, of this paper may be copied by any means or distributed royalty free without further permission by computer-based and other information-service systems. Permission to republish any other portion of this paper must be obtained from the Editor. 0018-8646/18 © 2018 IBM

Figure 1. Four pillars of cognitive computing: intra-source analysis, data linking, actionable knowledge (model) extraction, and model deployment.

These networks currently have hundreds of stages with thousands of neurons in each stage. The success of DNNs is driving consideration of even larger DNNs, and recent research suggests that the weight matrices for large DNNs can be made sparse without sacrificing their prediction accuracy [9–12]. This enables larger DNN models to be evaluated on a given hardware platform capable of taking advantage of sparsity.

In the next section of this paper ("Linear algebra formulation of graph analytics"), we describe the processing and storage requirements of Stage 2 of cognitive applications for creating and storing these large graphs, and we highlight features of IBM POWER exploited in achieving high performance. We summarize GraphBLAS [13, 14], an interface defined for the sparse-matrix linear algebra approach to graph analytics, and describe the implementation of the high-performance Graph Processing Interface (GPI) library, which currently implements an early variant of the GraphBLAS interface [15, 16] and is optimized for POWER processors [17]. Then, we report the performance on POWER9 of representative kernels in graph analytics, performed in Stages 2 and 3 of Figure 1, and illustrate the advantage of the linear algebra approach over conventional methods. In the third section of this paper ("DNN computations on POWER9"), we present the performance of POWER9 on the forward propagation kernel of artificial neural networks. We show that the sparse-matrix implementations outperform their dense-matrix counterparts, even with limited sparsity in the weight matrices.

In the linear algebra formulation of graph analytics, a graph is represented as an adjacency matrix, which is usually very sparse. A significant amount of the time in graph analytics is spent in multiplication of this adjacency matrix with a vector representing a set of nodes or node properties. As graphs become very large, parallel and distributed computing solutions are necessary to address both storage capacity and computation time requirements. In the fourth section of this paper ("Computational performance in the Spark environment"), we discuss the implementation of a GraphBLAS model of computation on the Apache Spark distributed computing framework. We analyze the scalability of an example graph algorithm and show that we can efficiently use the multiple parallel resources in a POWER9 server.

Our key message in this paper is that the computational requirements of the various tasks of cognitive systems, or in other words the workload behavior of the tasks outlined in Figure 1, are diverse. This diversity is discussed again in the conclusion of this paper. Specialized systems for narrow tasks such as DNN training exist [18, 19] and perform very well on them. However, POWER systems effectively cover all stages of cognitive computing illustrated in Figure 1 because of their large shared memory multiprocessing capability and high bandwidth to memory. This breadth of coverage includes Stage 4 in Figure 1, where the actionable business knowledge is deployed in business systems.

We refer the reader to companion articles in this issue for an exploration of the features and capabilities of the POWER9 processor. In particular, Le et al. [17] describe the high-performance processing cores of POWER9, and Arimilli et al. [20] describe the cache hierarchy that supports the computing capacity of those cores.

Figure 2. GraphBLAS primitives to support the linear algebra formulation of graph analytics. Uppercase letters are matrices; lowercase b, c, and m are vectors. M and m are masks. Red and blue lettering indicates optional parameters and modifiers.

Starke et al. [21] describe the memory architecture and connectivity of POWER9 systems.

Linear algebra formulation of graph analytics

Attaining good performance on analysis of linked data in Stage 2 of cognitive applications is difficult for the analytics application developer because of the need to manage the performance consequences of irregular memory accesses over a large address space and to exploit complex hardware features, as explained first in this section. The GraphBLAS interface, described next, unburdens the application developer from the chores of managing parallelism and platform-specific optimizations. The platform-specific optimizations are factored into a library of select graph operations, the GPI library in our case. We observed that an order-of-magnitude speedup for typical operations on large graphs (vertices not contained in the last-level cache on chip) can be obtained by using such platform-optimized libraries.

Challenges in analyzing linked/graph data on modern hardware

Graph analytics applications differ from the more conventional high-performance computing, transaction processing, and emerging machine intelligence and deep learning applications in three important respects. First, the size of graph data is very large. Social graphs are already approaching billions of vertices with several hundred attributes per vertex or edge, requiring algorithms to be crafted carefully to minimize latency for data access across the cache/memory hierarchy. Second, the data access patterns are highly irregular, i.e., they lack predictable strides or locality of reference. This makes the task of managing the cache/memory hierarchy substantially more difficult. Finally, the graph data is highly non-uniform in terms of the in/out degree distribution of vertices and the presence of community structures, a very loose definition of clique. This complicates the exploitation of parallelism on modern multi-core processor-based parallel systems [17, 22], because load balancing, minimization of synchronization overhead, and minimization of inter-task communication become more difficult to manage.

In addition to these idiosyncrasies of graph analytics problems, modern multi-core processors have their own complexities that need to be factored into application programs to minimize execution time. Programmers must restructure their applications to optimize the performance of graph analytics on modern processors [23–25]. Cache line sizes, cache capacities, page sizes, and limited translation look-aside buffer (TLB) entries are a few examples of hardware implementation-specific parameters reflected in highly optimized application programs. Developing useful and innovative graph analytics applications that also achieve high performance thus requires two complementary and highly developed skills. Accordingly, computer science research has responded to this challenge by developing various graph analytics frameworks that separate the concern of developing the applications from the concern of their optimal execution. Some of the notable frameworks are described elsewhere [26–29].

GraphBLAS interface and the library implementing it on POWER

To support large graph analytics on POWER9, we espoused the linear algebra approach advocated by Kepner and Gilbert [26].

The ability of the linear algebra approach to express a wide range of graph analytics is covered in detail in their book [26]. Furthermore, we observed that a compact set of linear algebra primitives (Figure 2) suffices to implement most whole-graph analytics algorithms. Many of the primitives we originally defined in GPI became part of GraphBLAS, a recently standardized interface defining linear algebra operations for graph analytics [14]. We currently implement GraphBLAS on POWER by providing a thin adaptation layer from GraphBLAS to the GPI library optimized for POWER.

Figure 3. Betweenness Centrality algorithm expressed in linear algebra notation.

The GraphBLAS API calls for the primitives listed in Figure 2 have three important features. First, the assignment operation is preceded by the accumulate operator, distinct from the other operators to the right of the assignment operator. The existing value of a matrix or vector element being updated and the value produced by the right-hand expression for that element are combined using the accumulate operator to obtain the value assigned. Second, each API call specifies a mask, with each mask element converted to a Boolean true or false, matching the size and shape of the output variable. The mask can be optionally negated, and it allows the elements of the matrix or vector to be updated selectively. Finally, each call has a descriptor associated with it. The descriptor specifies whether the mask needs to be inverted, and whether the matrices used as input need to be transposed before their use. These features enable optimizations that minimize data transfer from off-chip memory, a significant impediment in the linear algebra formulation of graph analytics, which nevertheless remains a better choice despite its memory latency issues. The apply call in Figure 2 applies a unary function f element-wise to the elements of the vector or matrix.
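To make these three features concrete, the sketch below shows one frontier-expansion step of a breadth-first search written against the GraphBLAS C API [14]. It is our illustration, not code from the GPI library; the predefined names (GrB_LOR_LAND_SEMIRING_BOOL, GrB_SCMP) follow particular revisions of the C API specification and are spelled slightly differently in others.

```c
#include <GraphBLAS.h>

/* One BFS frontier-expansion step: q<!v> = q (lor.land) A.
 * q is the current frontier, v the set of visited vertices,
 * A the Boolean adjacency matrix. */
void bfs_step(GrB_Vector q, GrB_Vector v, GrB_Matrix A)
{
    /* Descriptor: complement the mask (touch only unvisited vertices)
     * and clear q before writing the new frontier into it. GrB_SCMP is
     * the pre-1.3 name of the mask-complement setting. */
    GrB_Descriptor desc;
    GrB_Descriptor_new(&desc);
    GrB_Descriptor_set(desc, GrB_MASK, GrB_SCMP);
    GrB_Descriptor_set(desc, GrB_OUTP, GrB_REPLACE);

    /* q<!v> = q vxm A over the Boolean (lor, land) semiring:
     * the new frontier is the unvisited neighbors of the frontier. */
    GrB_vxm(q, v, GrB_NULL, GrB_LOR_LAND_SEMIRING_BOOL, q, A, desc);

    /* v lor= q : fold the new frontier into the visited set through
     * the accumulate operator (identity as the applied function). */
    GrB_apply(v, GrB_NULL, GrB_LOR, GrB_IDENTITY_BOOL, q, GrB_NULL);

    GrB_free(&desc);
}
```

The mask confines work and memory traffic to the unvisited vertices, and the accumulate operator updates the visited set without a separate read-modify-write pass, which is precisely the off-chip-traffic saving described above.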
The linear algebra formulation of graph analytics results in: 1) simplification of application logic, as the code for traversal across the vertices of the graph, managing parallelism, and maintaining graph data structures is subsumed in the sparse-matrix operations implemented in the library; 2) sometimes an expansion in the number of instructions executed, when tasks associated with active vertices are performed on all vertices, the inactive vertices not impacting the result; and 3) a reduction in overall execution time, as the instructions are of higher quality, i.e., they execute much more efficiently, more than compensating for the expansion in instruction count. Figure 3 illustrates the brevity of the linear algebra formulation of the betweenness centrality algorithm by Brandes [30]. In Figure 3, one operator stands for element-wise multiplication of two vectors, while the other stands for standard vector-matrix multiplication. The OpenMP C implementation of the algorithm in SSCA [31] is 200 to 300 lines of code. In addition to being compact, the linear algebra formulation performs better than the SSCA2 reference implementation, as discussed later in this section.

Algorithmic innovations to optimize graph analytics performance on POWER

Multiplication of a sparse matrix with a dense or sparse vector, y = Mv or y = vM, is the key operation in most whole-graph analytics algorithms. Its performance is gated by memory access latency. For example, if M is stored in compressed storage by row (CSR) format, there is locality in access to y when performing the Mv operation. However, access to v has poor cache locality.

Figure 4. Left: Speedup of the GPI implementation (linear algebra formulation) over the standard implementation. Right: The ratio of instructions executed in the GPI implementation to the instructions executed in the standard implementation.

In general, for the various representations of sparse matrices, we have poor access latencies either for the input vector or for the output vector, or a compromise between the two.

To improve the memory access locality in accessing both the input and output vectors, we use a two-phase approach that is described elsewhere [23]. In the first phase, we scale the matrix M with the values of the input vector to create a scaled matrix S, where each column of S is the corresponding column of M multiplied by the corresponding element of vector v. In the second phase, we reduce the rows of S to get the result y, the vector obtained by summing the columns of S together element-wise. In between the two phases, the elements of matrix S are buffered in on-chip caches before being written to local memory, using different buffers for different cacheable regions of y. This ensures that we get good cache locality while writing as well as while reading back. In [23], we detail how a wide range of graph algorithms can benefit from similarly changing the order in which input/output operands and intermediate results are read from and written into local memory.
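A minimal plain-C sketch of the two-phase idea follows. It is our illustration of the scheme in [23], not the GPI code: phase 1 streams the columns of a CSC-stored M, scaling by v (accessed sequentially) and scattering the entries of S into bins that each cover a cache-sized range of rows; phase 2 reduces each bin into y. The type, function, and parameter names are ours.

```c
#include <stddef.h>

/* CSC storage of M: col_ptr[0..n], row_idx/val[0..nnz-1]. */
typedef struct {
    size_t n;
    const size_t *col_ptr;
    const size_t *row_idx;
    const double *val;
} csc_t;

typedef struct { size_t row; double val; } entry_t;

/* y = M*v in two phases. bins/fill are caller-provided scratch:
 * nbins arrays, each large enough to hold its share of the nnz
 * scaled entries of S. */
void spmv_two_phase(const csc_t *M, const double *v, double *y,
                    entry_t **bins, size_t *fill, size_t nbins)
{
    size_t rows_per_bin = (M->n + nbins - 1) / nbins;

    /* Phase 1: stream the columns of M, scaling by v[j] (sequential
     * access to v) and scattering entries of S into per-region bins. */
    for (size_t b = 0; b < nbins; b++) fill[b] = 0;
    for (size_t j = 0; j < M->n; j++) {
        for (size_t k = M->col_ptr[j]; k < M->col_ptr[j + 1]; k++) {
            size_t b = M->row_idx[k] / rows_per_bin;
            bins[b][fill[b]++] = (entry_t){ M->row_idx[k],
                                            M->val[k] * v[j] };
        }
    }

    /* Phase 2: reduce each bin; all updates of one bin fall into a
     * single cache-sized region of y, giving locality on the writes. */
    for (size_t i = 0; i < M->n; i++) y[i] = 0.0;
    for (size_t b = 0; b < nbins; b++)
        for (size_t k = 0; k < fill[b]; k++)
            y[bins[b][k].row] += bins[b][k].val;
}
```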
In the above discussion, we assumed that the matrix was in CSR format and the input vector v was dense. Our implementation of GraphBLAS does not rely on such assumptions. GraphBLAS objects are opaque, i.e., the programmer cannot access the data by means other than the well-defined access methods of GraphBLAS. GraphBLAS implementations can therefore change the representation of the matrices and vectors to optimize performance. For example, if the input vector v were sparse, the CSR representation could be automatically converted to the compressed sparse blocks (CSB) representation, better suited for multiplying sparse matrices with sparse vectors. Each operation in a GraphBLAS library can have multiple implementations; the right implementation is selected automatically depending on the representation of the matrix/vector objects and their sparsity. For example, if the matrix in the above paragraph were in compressed storage by columns (CSC) format for the Mv operation, or CSR format for the vM operation, a GraphBLAS implementation would rearrange the read pattern for y instead of the write pattern for S. The decision criterion for choosing an optimal implementation of a GraphBLAS operation, from the many available, is built into the GraphBLAS implementation.

The two-phase approach outlined earlier minimizes latency by executing extra instructions to reorder accesses and by using extra bandwidth to write and read the intermediate matrix S. In vector-vector operations on large vectors, performance is often gated by that bandwidth. The masks between vectors in the GraphBLAS application programming interface (API) save the round trip to memory for intermediate results. Furthermore, the input operands can be negated before being used. The implementation on POWER also leverages POWER-specific instructions like 'cntlz', which counts the number of leading zeros in mask vectors, to expedite access to sparse vectors.

Performance of the GraphBLAS library on POWER

The performance advantage of the linear algebra formulation comes from the tradeoff of executing more instructions efficiently as opposed to fewer instructions inefficiently. The active vertices in a graph, even though often sparse, are treated as a dense vector if the sparsity is below a certain threshold. That causes redundant vertices to be processed but eliminates the need for indexed access to the vertices. The computations performed for non-active vertices are discarded without overhead by using the mask operand of the GraphBLAS functions.

Figure 4 illustrates the instruction count and efficiency-of-execution tradeoff between the linear algebra formulation and straightforward implementations of breadth-first search (BFS), betweenness centrality (BC), and Google's PageRank algorithms. The graphs evaluated are rMat graphs [32] having parameter values a = 0.55, b = c = 0.2, and preferential attachment graphs having parameter values p = 0.545 and q = 0.273.
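For reference, an rMat graph [32] is built by sampling each edge through recursive descent into the four quadrants of the adjacency matrix with probabilities a, b, c, and d = 1 - a - b - c. A compact sketch of one edge draw (our illustration, not the generator used in the experiments; seeding of rand() is elided):

```c
#include <stdlib.h>

/* Sample one edge of a 2^scale-vertex rMat graph [32]: at each of
 * `scale` levels, pick one of the four adjacency-matrix quadrants with
 * probabilities a, b, c, d = 1-a-b-c (here a=0.55, b=c=0.2, d=0.05). */
void rmat_edge(int scale, double a, double b, double c,
               unsigned long *src, unsigned long *dst)
{
    unsigned long u = 0, v = 0;
    for (int bit = 0; bit < scale; bit++) {
        double r = (double)rand() / RAND_MAX;
        u <<= 1; v <<= 1;
        if (r < a)              { /* top-left: no bits set      */ }
        else if (r < a + b)     { v |= 1; }         /* top-right    */
        else if (r < a + b + c) { u |= 1; }         /* bottom-left  */
        else                    { u |= 1; v |= 1; } /* bottom-right */
    }
    *src = u; *dst = v;
}
```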

The numerical suffix _xx in the legend indicates that the graph has 2^xx vertices. Each vertex has an average of 16 undirected edges. The vectors of interest in BFS, such as the vertices visited and the vertices on a search frontier, are relatively dense. Since these relatively dense vectors are treated as dense vectors, we have an increase in the number of instructions executed, as shown on the right-hand side of Figure 4. However, we can process these dense vectors far more efficiently, resulting in a net speedup, as shown on the left side of Figure 4. The speedup for larger graphs is more significant than that for smaller graphs because accesses to data in larger graphs miss the last-level cache almost entirely, and therefore the GPI approach has a greater impact on off-chip memory latency.

Experiments were performed on an IBM Power Systems AC922 server configured with 20 POWER9 cores and running the Ubuntu 16.04 distribution of Linux for PowerPC Little Endian. All code was compiled with Gnu 5.4 compilers (gcc/g++). Each core can run one (single-thread or ST) to eight (SMT8) simultaneous threads of execution. The cores have a nominal frequency of 3.2 GHz, although POWER9 automatically adjusts the running frequency based on power, thermal, and workload considerations. The server is configured with 512 GiB of memory, accessible through a total of 16 memory channels. Total bandwidth from processing cores to memory is over 300 GB/s, available for both read and write operations.

We measured the Traversed Edges Per Second (TEPS) score for kernel 4 of the HPC Scalable Graph Analysis Benchmark SSCA2 v2.2 [31]. Kernel 4 is betweenness centrality. We compared the TEPS scores for a 256-million-node graph with an average of eight directed edges per node for the reference implementation (SSCA) and the GPI implementation. We chose to operate the cores in SMT8 (eight threads per core) mode for the SSCA implementation, the thread/core choice that gives the best performance for SSCA2, and in ST mode for the GPI implementation, once again the thread/core choice that gives the best performance for GPI. While the SSCA2 reference implementation [31] performs at 40 million TEPS, the GraphBLAS implementation performs at about 233 million TEPS. The per-node TEPS performance is comparable to the per-node performance reported for other POWER-based architectures and other many-core architectures without attached accelerators; however, we are processing much larger graphs on a per-core basis, leveraging the ability of POWER9 to support large shared memories, and running the simple algorithm shown in Figure 3.

Figure 5. Illustration of a Deep Neural Network.

DNN computations on POWER9

The recent and significant improvements in neural network training algorithms make it possible to train neural networks that are capable of better-than-human performance in a variety of important artificial intelligence problems [33–35]. The availability of large corpora of validated data sets [35–38], and the increases in computational power fueled by graphics processing units (GPUs) [39, 40], have allowed the effective training of large DNNs with 100,000s of input features (N = y_0 in Figure 5) and hundreds of layers (L = 4 in the figure) that are capable of choosing from among 100,000s of categories (M = y_4), as shown in Figure 5. The impressive performance of large DNNs encourages the training and testing of even larger networks. However, increasing N, L, and M each by a factor of 10 results in a 1,000-fold increase in the number of weight and bias parameters. Tradeoffs are currently being made between the precision and accuracy of DNN weight matrices to save storage and computation [9–12]. In this section, we show that DNN inferencing can be carried out efficiently on sparse weight matrices by the GPI/GraphBLAS library on POWER9.

Computation underlying inferencing in DNNs

The primary mathematical operation performed by a DNN is the inference, or forward-propagation, step. Inference is executed repeatedly during training to determine both the weight matrices W_k and the bias vectors b_k of the DNN. The inference computation shown in Figure 5 is given by

y_{k+1} = h(W_k y_k + b_k),

where h(x) is a non-linear function applied to each element of the vector. A commonly used function is the rectified linear unit (ReLU), given by h(y) = max(y, 0), which sets values less than 0 to 0 and leaves other values unchanged. When training a DNN, it is common to compute multiple y_k vectors at once in a batch that can be denoted as the matrix Y_k.

In matrix form, the inference step becomes

Y_{k+1} = h(W_k Y_k + B_k),

where B_k is a replication of b_k along the columns. In GraphBLAS, the inference computation can be expressed as a linear function over two different semirings. First, the matrix-vector (or matrix-matrix) multiplication is performed using the conventional arithmetic semiring (×, +); then the addition of the bias and the rectification are performed using a max-plus semiring (+, max).
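As a concrete sketch, one inference layer can be written against the GraphBLAS C API roughly as below. This is our illustration, not the GPI-based code used in the experiments: for simplicity it expresses the bias addition and rectification as element-wise accumulate/apply steps rather than through the (+, max) semiring, and the predefined semiring name assumes C API 1.3 (earlier revisions build it with GrB_Semiring_new).

```c
#include <GraphBLAS.h>

/* ReLU: h(x) = max(x, 0), as a GraphBLAS unary operator. */
void relu_fp32(void *z, const void *x)
{
    float v = *(const float *)x;
    *(float *)z = v > 0.0f ? v : 0.0f;
}

/* One forward-propagation layer: Ynext = h(W*Y + B); W is m x m,
 * Y, B, and Ynext are m x batch, all single precision. */
void forward_layer(GrB_Matrix Ynext, GrB_Matrix W, GrB_Matrix Y,
                   GrB_Matrix B)
{
    /* Ynext = W (+.x) Y over the conventional arithmetic semiring. */
    GrB_mxm(Ynext, GrB_NULL, GrB_NULL, GrB_PLUS_TIMES_SEMIRING_FP32,
            W, Y, GrB_NULL);

    /* Ynext += B : the bias addition, expressed through the
     * accumulate operator with identity as the applied function. */
    GrB_apply(Ynext, GrB_NULL, GrB_PLUS_FP32, GrB_IDENTITY_FP32,
              B, GrB_NULL);

    /* Ynext = h(Ynext) : element-wise rectification. In-place apply
     * works in SuiteSparse:GraphBLAS; a strict reading of early API
     * revisions may require a temporary. A real implementation would
     * create the operator once, not per layer. */
    GrB_UnaryOp relu;
    GrB_UnaryOp_new(&relu, relu_fp32, GrB_FP32, GrB_FP32);
    GrB_apply(Ynext, GrB_NULL, GrB_NULL, relu, Ynext, GrB_NULL);
    GrB_free(&relu);
}
```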

Experiments

All our DNN experiments are performed on the same IBM Power Systems AC922 described in the previous section. As mentioned earlier, the GraphBLAS API was implemented as a compatibility layer on top of the GPI library, which relies on OpenMP for multithreaded processing. The GPI library transparently uses various storage formats and strategies for dividing the work among multiple threads. In our specific case, the weight matrices were represented in compressed sparse row (CSR) format, and they were distributed by rows. For dense linear algebra, we use OpenBLAS version 0.2.18.

Figure 6. POWER9 execution times of the OpenBLAS and GPI/GraphBLAS implementations.

We measured the execution time of the equation Y_{k+1} = W_k Y_k + B_k for weight matrices W of different sizes and sparsity. This represents the forward propagation step in one layer of the neural network. All weight matrices are m × m square matrices of single-precision (32-bit) floating-point numbers. All layer input/output matrices Y are tall and skinny m × 64 matrices that represent a mini-batch of size 64. Initially, dense weight matrices are generated by populating each entry with a random number chosen from a U[1, 3) distribution. Sparse weight matrices are generated from these dense matrices by selecting the locations of nonzero entries using independent Bernoulli distributions and taking the corresponding entry values (generated from the U[1, 3) distribution). Layer input matrices are generated using a U[0, 1) distribution for the entry values.
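A short sketch of that generation recipe (our reconstruction in plain C; the helper name and the fill-factor parameter f are ours, and rand() seeding is elided):

```c
#include <stdlib.h>

/* Fill an m x m row-major weight matrix W with Bernoulli-selected
 * nonzeros: each entry is kept with probability 1/f (f = 1 gives the
 * dense case) and, when kept, drawn from U[1, 3). */
void gen_sparse_weights(float *W, size_t m, double f)
{
    for (size_t i = 0; i < m * m; i++) {
        int keep = (double)rand() / RAND_MAX < 1.0 / f;
        double u = (double)rand() / RAND_MAX;      /* U[0, 1) */
        W[i] = keep ? (float)(1.0 + 2.0 * u) : 0.0f;
    }
}
```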
Results

Figure 6 shows the execution times of the forward propagation step in one layer of the neural network for single-threaded execution of both the OpenBLAS and the GPI/GraphBLAS implementations. The x-axis is the inverse of sparsity, i.e., m²/non-zeros. The y-axis is the execution time in seconds. Both axes are on a log scale. One can readily observe that for dense matrices GraphBLAS is approximately 20% to 50% slower than BLAS. That is basically the overhead of supporting the sparse representation. The execution time of OpenBLAS does not depend on sparsity. The 6.6-second execution time for performing the forward propagation step for a 32K × 32K weight matrix and a 32K × 64 mini-batch corresponds to a lower bound of 76% of the theoretical peak floating-point operations per second (FLOPS) for a POWER core.

As the weight matrix becomes sparse, the execution time for GraphBLAS initially scales down proportionally to the number of non-zeros. Even at a sparsity (or fill factor) of 50%, GPI/GraphBLAS outperforms OpenBLAS. For a highly sparse 32K × 32K matrix, the GraphBLAS implementation outperforms the OpenBLAS implementation by three orders of magnitude. The execution time of GraphBLAS as a function of sparsity begins to bottom out when the sparsity is about one non-zero entry per 16 matrix rows. This saturation value of the execution time indicates the overhead of processing an empty matrix in GraphBLAS.

The performance of DNN inference scales well as multiple threads are deployed. The speedup for 16-thread execution, with 16 cores running in single-threaded mode (ST mode), is shown in Figure 7. Once again, the x-axis is inverse sparsity. This time the y-axis is speedup with respect to single-threaded execution. For smaller matrices, the speedup tails off with decreasing sparsity, most likely due to limited parallelism.

Computational performance in the Apache Spark environment

As mentioned earlier in this paper, cognitive computing and analytics are emerging workloads that provide significant challenges for modern processors. The potential performance advantages of native C and C++-based programs have made these languages popular choices for implementing applications that execute these workloads.

Figure 7. Speedup achieved with 16 OpenMP threads for the GraphBLAS implementation of the forward propagation step, with respect to one-thread OpenMP execution.

However, many of the more interesting problems are born on the web and, for a variety of reasons [41], the programming interfaces tend to be available in non-native languages (e.g., Java, .Net, and Python). In addition to these advantages, recent studies [42, 43] indicate that Java is a reasonable candidate for computationally intensive applications.

Perhaps the most popular Java-based system for addressing challenges in cognitive computing and big data analytics is Apache Spark [44]. Spark scales by exploiting a network of interconnected, distributed Java Virtual Machines (JVMs). Much of Spark is written in Scala. By implementing in Scala, Spark can exploit the plethora of existing Java-based packages and tooling while giving developers access to a state-of-the-art programming language that supports the compact expression of both functional and object-oriented constructs. For example, Spark's support for a wide variety of data sources, including relational databases, NoSQL systems, streaming sources, and a variety of distributed file systems, is obtained by exploiting existing Java-based packages.

In this section, we demonstrate the use of Spark in solving an important graph problem, Single Source Shortest Path (SSSP), through parallel and distributed computing. We formulate a linear algebra version of the SSSP algorithm and implement it in Scala. Using Spark, we automatically scale the problem to use multiple cores in a POWER9 server, delivering good parallel efficiency on large graph problems.

It should be noted that the graph component of Spark V2, GraphX, is based on Pregel techniques [27] and does not exploit linear algebra techniques. However, an emerging alternative, LAGRAPH [41], extensively uses matrix and vector operations in combination with custom semiring definitions to implement a variety of graph algorithms in a Spark application running in the Spark environment. Implementing graph algorithms using linear algebra techniques has several potential advantages [26], including a reduction in syntactic complexity, easier implementation, and improved performance. As discussed earlier in this paper, the improved performance is due to exploiting the separation of concerns between the algorithm developer and the underlying linear algebra implementation. For example, in an attempt to improve performance, a previous version of LAGRAPH would perform some linear algebra operations by invoking platform-optimized sparse-matrix routines packaged in a separate shared library [16].

Benchmark

Following the approach outlined by Fineman and Robinson [45], we implemented two versions of the SSSP algorithm. In the first version, we compute the shortest distance from a specified source vertex to every other vertex in the graph. For this, we use a (sparse) adjacency matrix where the weight of an edge is represented as a single-precision floating-point value. If there is no edge between source and target, then the weight is set equal to the maximum float value. The algorithm is shown in Figure 8.

Figure 8. Linear algebra formulation of the Single Source Shortest Path problem. The same code can be used for both distance and path calculations, simply by choosing the appropriate semiring.

The code used to perform the matrix initialization is shown in Lines 4–18 of Figure 8, and it uses a functor mInitDist (Figure 8, Lines 5–13)
that is passed into the GPI call hc.mZipWithIndex (Figure 8, Lines 14–18). The zip-with-index call

B = hc.mZipWithIndex(func, A [, s])

invokes the functor for each element index (i, j) of the input matrix A and assigns the result value to element (i, j) of matrix B. An optional value s of the semiring result type can be used as input to the functor for the diagonal values.

Figure 9. High-level overview of the Spark runtime.

Upon completion, the result vector d_prev contains all the shortest distances. That is, each entry d_prev(t) will contain the distance from the starting vertex to vertex t. The algorithm starts by initializing the result vector with 0.0 for the entry corresponding to the specified source vertex and Float.MaxValue for all the others. During each iteration, the algorithm refines the distances using the (min, +) semiring (Figure 8, Line 29) until there are no changes in the output result (Figure 8, Lines 30–32). For more details on the algorithm, see [45].
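For readers without the figure at hand, the distance-only iteration can be sketched against the GraphBLAS C API as follows. This is our language-neutral illustration of the same relaxation scheme, not the Scala/LAGRAPH code of Figure 8; the predefined min-plus semiring name assumes C API 1.3, and the convergence test is elided.

```c
#include <GraphBLAS.h>

/* Distance-only SSSP by repeated relaxation over the (min, +)
 * semiring. A is the sparse n x n edge-weight matrix; absent entries
 * play the role of the "maximum float" padding used in Figure 8. */
void sssp_distances(GrB_Vector d, GrB_Matrix A, GrB_Index src,
                    GrB_Index n)
{
    GrB_Vector t;
    GrB_Vector_new(&t, GrB_FP32, n);

    /* d[src] = 0; every other distance starts implicitly at infinity. */
    GrB_Vector_setElement(d, 0.0f, src);

    for (GrB_Index it = 0; it + 1 < n; it++) {  /* at most n-1 rounds */
        /* t = d min.+ A : best distances through one more edge. */
        GrB_vxm(t, GrB_NULL, GrB_NULL, GrB_MIN_PLUS_SEMIRING_FP32,
                d, A, GrB_NULL);
        /* d = min(d, t) : fold the candidates in through the MIN
         * accumulate operator. The early-exit test on "no changes"
         * (Figure 8, Lines 30-32) is omitted here. */
        GrB_apply(d, GrB_NULL, GrB_MIN_FP32, GrB_IDENTITY_FP32,
                  t, GrB_NULL);
    }
    GrB_free(&t);
}
```

The path-producing variant described next differs only in the semiring carried by the vector elements, which is exactly the point the paper makes about Figure 8.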
Each group of bars is for allocation of executor JVMs, experience has shown [16] a different configuration of graph scale and number of cores that configuring the system such that a worker is bound to (executors): 2, 8, and 32 million vertices, and 1, 4, and specific core and each worker launches a single executor 16 cores, respectively. This keeps the number of vertices (also bound to the same core) yields optimum performance. per core constant (2 million vertices/core). Each graph has

Each graph has 8 edges per vertex. We experimented with different blocking factors for the adjacency matrix and found that partitioning it as an 8 × 8 array of blocks results in the lowest total execution time. (Those are the times reported in Figure 10.)

We observe good parallel efficiency, since the execution time increases only slowly with the problem size and number of cores. Execution for a 16-fold larger graph on 16 cores is only 35% higher than the execution time on 1 core. This demonstrates the benefits of integrating the linear algebra approach with Spark for exploiting multi-core parallelism through distributed computing.

Conclusions

In this paper, we make four main contributions. First, we identified a small set of graph analytics primitives that cover almost all computations carried out in whole-graph analytics. We have shown that an optimized implementation of these primitives, including POWER9-specific optimizations, leads to an order-of-magnitude improvement in the performance of graph analytics. Furthermore, these primitives, now included in the GraphBLAS API, enable an effective separation of concerns between the analytics application developer and the system developer.

Second, we have shown that to attain good performance on graph algorithms, it is advantageous to move away from the conventional approach of minimizing the number of arithmetic/logic operations, which often ignores memory accesses in cost analysis. Decreasing the number of bad memory accesses (memory accesses that miss the last-level cache), even at the cost of increasing "good" memory accesses or arithmetic/logic operations, provides better performance. Bad memory accesses are hard to predict and are un-localized, rendering on-chip caches useless.

Third, we presented the rationale and early evidence to support the view that weight matrices for DNNs will become sparse. Then we showed that GraphBLAS/GPI can obtain a significant improvement in the performance of the forward propagation step over the current dense linear-algebra approach when the weight matrices are sparse.

Finally, we integrated our linear algebra approach to graph analytics with the Spark distributed computing platform. We showed how this combination lets us scale large graph problems to use multiple cores in a POWER9 server, with good parallel efficiency.

The overarching theme of this paper is that the computational requirements of the various tasks in a cognitive application are diverse. Specialized systems for narrow tasks such as DNN training exist [39, 40] and perform very well on them. However, POWER systems effectively cover all stages of cognitive computing illustrated in Figure 1, owing to their large shared memory multiprocessing capability and high bandwidth to memory.

Acknowledgment

We thank Joefon Jann and R. S. Burugula (Sarma) for setting up the POWER9 system and helping with the performance data collection. The work reported in this paper builds on the contributions of many of our colleagues. We would also like to thank our IBM colleagues, the coauthors of [15], particularly K. Ekanadham and Mauricio Serrano, for the specification and implementation of the GPI library. Next, we thank Prof. David Bader and his students; we used their SSCA2 V2.2 code in our experiments. Finally, we thank our GraphBLAS colleagues; working with them on the GraphBLAS specification was truly gratifying. This material is based in part on work at MIT Lincoln Laboratory supported by the NSF under grant number DMS-1312831. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.

References

1. J. Manyika, M. Chui, B. Brown et al., "Big data: The next frontier for innovation, competition, and productivity," McKinsey Global Inst., New York, NY, USA, McKinsey Global Inst. Rep., May 2011.
2. "Special issue: Graphs and networks," MIT Lincoln Lab. J., vol. 20, no. 1, 2013.
3. C. M. Bishop, Pattern Recognition and Machine Learning. New York, NY, USA: Springer, 2006.
4. T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning: Data Mining, Inference and Prediction, 2nd ed. New York, NY, USA: Springer, 2009.
5. O. Ronneberger, P. Fischer, and T. Brox, "U-Net: Convolutional networks for biomedical image segmentation," in Proc. Int. Conf. Med. Image Comput. Comput.-Assisted Intervention, 2015, pp. 234–241.
6. K. He, X. Zhang, S. Ren et al., "Deep residual learning for image recognition," in Proc. IEEE Conf. Comput. Vision Pattern Recognit., 2016, pp. 770–778.
7. L. Deng and D. Yu, "Deep learning: Methods and applications," Found. Trends Signal Process., vol. 7, nos. 3/4, pp. 197–387, 2014.
8. I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning, Adaptive Computation and Machine Learning Series. Cambridge, MA, USA: MIT Press, 2016.
9. D. Yu, F. Seide, G. Li et al., "Exploiting sparseness in deep neural networks for large vocabulary speech recognition," in Proc. IEEE Int. Conf. Acoust., Speech Signal Process., 2012, pp. 4409–4412.
10. F. N. Iandola, S. Han, M. W. Moskewicz et al., "SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size," arXiv:1602.07360, 2016.
11. S. Han, H. Mao, and W. J. Dally, "Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding," arXiv:1510.00149, 2015.
12. J. Kalagnanam, "0.968 accuracy achieved on MNIST data set using only 2.73% of DNN connections. Fully connected weight matrices give 0.978 accuracy," unpublished work, 2017.
13. J. Kepner, P. Aaltonen, D. Bader et al., "Mathematical foundations of the GraphBLAS," in Proc. IEEE High Perform. Extreme Comput. Conf., 2016, pp. 1–9.
14. A. Buluç, T. Mattson, S. McMillan et al., "The GraphBLAS C API specification: Provisional release," Version 1.0.2, Aug. 2017. [Online]. Available: http://gauss.cs.ucsb.edu/~aydin/GraphBLAS_API_C.pdf
15. K. Ekanadham, B. Horn, J. Jann et al., "Graph programming interface: Rationale and specification," IBM Res., Armonk, NY, USA, Tech. Rep. RC25508, Nov. 2014.

16. W. Horn, M. Kumar, J. Jann et al., "Graph programming interface (GPI): A linear algebra programming model for large scale graph computations," Int. J. Parallel Programm., vol. 46, pp. 412–440, 2018.
17. H. Le, J. Van Norstrand, B. Thompto et al., "IBM POWER9 processor core," IBM J. Res. & Dev., vol. 62, nos. 4/5, Paper 2, 2018 (this issue).
18. N. P. Jouppi, C. Young, N. Patil et al., "In-datacenter performance analysis of a tensor processing unit," in Proc. 44th Int. Symp. Comput. Archit., Toronto, ON, Canada, Jun. 2017, pp. 1–12.
19. P. Zhang, M. Zalewski, A. Lumsdaine et al., "GBTL-CUDA: Graph algorithms and primitives for GPUs," in Proc. Int. Parallel Distrib. Process. Symp. Workshops, 2016, pp. 912–920.
20. B. Arimilli, B. Blaner, B. C. Drerup et al., "IBM POWER9 processor and system features for computing in the cognitive era," IBM J. Res. & Dev., vol. 62, nos. 4/5, Paper 1, 2018 (this issue).
21. W. J. Starke, J. S. Dodson, J. Stuecheli et al., "IBM POWER9 memory architectures for optimized systems," IBM J. Res. & Dev., vol. 62, nos. 4/5, Paper 3, 2018 (this issue).
22. D. Mulnix, "Intel Xeon processor scalable family technical overview," Sep. 2017. [Online]. Available: https://software.intel.com/en-us/articles/intel-xeon-processor-scalable-family-technical-overview
23. M. Kumar, M. Serrano, J. Moreira et al., "Efficient implementation of scatter-gather operations for large scale graph analytics," in Proc. IEEE High Perform. Extreme Comput. Conf., Sep. 2016, pp. 1–7.
24. D. Buono, J. A. Gunnels, X. Que et al., "Optimizing sparse linear algebra for large-scale graph analytics," Computer, vol. 48, no. 8, pp. 26–34, 2015.
25. D. Buono, F. Petrini, F. Checconi et al., "Optimizing sparse matrix-vector multiplication for large-scale data analytics," in Proc. Int. Conf. Supercomput., Istanbul, Turkey, Jun. 2016, pp. 37–48.
26. J. Kepner and J. Gilbert, Graph Algorithms in the Language of Linear Algebra. Philadelphia, PA, USA: SIAM, 2011.
27. G. Malewicz, M. H. Austern, A. J. Bik et al., "Pregel: A system for large-scale graph processing," in Proc. ACM SIGMOD Int. Conf. Manage. Data, New York, NY, USA, 2010, pp. 135–146.
28. D. Nguyen, A. Lenharth, and K. Pingali, "A lightweight infrastructure for graph analytics," in Proc. ACM Symp. Oper. Syst. Principles, 2013, pp. 456–471.
29. R. Nasre, M. Burtscher, and K. Pingali, "Data-driven versus topology-driven irregular computations on GPUs," in Proc. IEEE 27th Int. Parallel Distrib. Process. Symp., 2013, pp. 463–474.
30. U. Brandes, "A faster algorithm for betweenness centrality," J. Math. Sociol., vol. 25, no. 2, pp. 163–177, 2001.
31. D. A. Bader, J. Feo, J. Gilbert et al., "HPCS scalable synthetic compact applications #2 graph analysis," V2.2, released 5 Sep. 2007. [Online]. Available: http://www.graphanalysis.org/benchmark/HPCS-SSCA2_Graph-Theory_v2.2.pdf
32. D. Chakrabarti, Y. Zhan, and C. Faloutsos, "R-MAT: A recursive model for graph mining," in Proc. 4th SIAM Int. Conf. Data Mining, 2004, vol. 4, pp. 442–446.
33. D. A. Reynolds, T. F. Quatieri, and R. B. Dunn, "Speaker verification using adapted Gaussian mixture models," Digital Signal Process., vol. 10, nos. 1–3, pp. 19–41, 2000.
34. A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Proc. 25th Int. Conf. Neural Inf. Process. Syst., 2012, pp. 1097–1105.
35. Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," Nature, vol. 521, no. 7553, pp. 436–444, 2015.
36. J. P. Campbell, "Testing with the YOHO CD-ROM voice verification corpus," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., 1995, vol. 1, pp. 341–344.
37. Y. LeCun, C. Cortes, and C. J. Burges, "The MNIST database of handwritten digits," AT&T Labs, 2018. [Online]. Available: http://yann.lecun.com/exdb/mnist
38. J. Deng, W. Dong, R. Socher et al., "ImageNet: A large-scale hierarchical image database," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2009, pp. 248–255.
39. D. Povey, X. Zhang, and S. Khudanpur, "Parallel training of deep neural networks with natural gradient and parameter averaging," arXiv:1410.7455, 2014.
40. M. Rhu, N. Gimelshein, J. Clemons et al., "Virtualizing deep neural networks for memory-efficient neural network design," arXiv:1602.08124, 2016.
41. W. Horn, G. Tanase, H. Yu et al., "A linear algebra-based programming interface for graph computations in Scala and Spark," in Proc. IEEE Int. Parallel Distrib. Process. Symp. Workshops, 2017, pp. 653–659.
42. A. Fries, "The use of Java in large scientific applications in HPC environments," Universitat de Barcelona, Barcelona, Spain, 2013.
43. G. L. Taboada, J. Touriño, and R. Doallo, "Java in the high-performance computing arena: Research, practice and experience," Sci. Comput. Programm., vol. 78, no. 5, pp. 425–444, 2013.
44. M. Zaharia, R. S. Xin, P. Wendell et al., "Apache Spark: A unified engine for big data processing," Commun. ACM, vol. 59, no. 11, pp. 56–65, 2016.
45. J. T. Fineman and E. Robinson, "Fundamental graph algorithms," Graph Algorithms Lang. Linear Algebra, vol. 22, pp. 45–58, 2011.
46. OpenJDK. [Online]. Available: http://openjdk.java.net/

Received March 31, 2018; accepted for publication April 17, 2018

Manoj Kumar IBM Research, Thomas J. Watson Research Center, Yorktown Heights, NY 10598 USA ([email protected]). Dr. Kumar is Program Director for Analytics Systems in the Scalable Systems department. He received a B.S. degree in 1979 from I.I.T. Kanpur and M.S. and Ph.D. degrees from Rice University in 1981 and 1984, respectively, all in electrical engineering. He subsequently joined IBM, where he has worked on the design and implementation of a broad range of scalable/parallel systems and processor architectures. Dr. Kumar was responsible for developing IBM's early e-Commerce offerings, for which he received two Outstanding Innovation Awards. His research interests include big data analytics and machine learning. Dr. Kumar is a member of the Institute of Electrical and Electronics Engineers (IEEE).

William P. Horn IBM Research, Thomas J. Watson Research Center, Yorktown Heights, NY 10598 USA ([email protected]). Dr. Horn is a member of the research staff at the IBM Thomas J. Watson Research Center. He received his Ph.D. degree from Cornell University. His current research interests are in distributed systems and graph computations. His most recent work has been focused on the development of graph algorithms using functional programming techniques on distributed systems.

Jeremy Kepner MIT Lincoln Laboratory, Lexington, MA 02420 USA ([email protected]). Dr. Kepner is a MIT Lincoln Laboratory Fellow. He holds a B.A. degree in astrophysics from Pomona College and a Ph.D. degree in astrophysics from Princeton University. He founded the Lincoln Laboratory Supercomputing Center and pioneered the establishment of the Massachusetts Green High Performance Computing Center. He has developed novel big data and parallel computing software used by thousands of scientists and engineers worldwide. He has led several embedded computing efforts, which earned him a 2011 R&D 100 Award. Dr. Kepner has chaired SIAM Data Mining, IEEE Big Data, and the IEEE HPEC conference. Dr. Kepner is the author of two bestselling books on Parallel MATLAB and Graph Algorithms. His peer-reviewed publications include works on abstract algebra, astronomy, astrophysics, cloud computing, cybersecurity, data mining, databases, graph algorithms, health sciences, plasma physics, signal processing, and 3D visualization. In 2014, he received Lincoln Laboratory's Technical Excellence Award.

Jose E. Moreira IBM Research, Thomas J. Watson Research Center, Yorktown Heights, NY 10598 USA ([email protected]). Dr. Moreira is a Research Staff Member in the Scalable Systems department at the IBM T. J. Watson Research Center. He received a B.S. degree in physics and B.S. and M.S. degrees in electrical engineering from the University of Sao Paulo, Brazil, in 1987, 1988, and 1990, respectively. He also received a Ph.D. degree in electrical engineering from the University of Illinois at Urbana-Champaign in 1995. Since joining IBM at the T. J. Watson Research Center, he has worked on a variety of high-performance computing projects. He was System Software Architect for the IBM Blue Gene/L supercomputer, for which he received an IBM Corporate Award, and Chief Architect of the Commercial Scale Out project. He currently leads IBM Research work on the architecture of Power processors. He is author or coauthor of over 100 technical papers and patents. Dr. Moreira is a member of the Institute of Electrical and Electronics Engineers (IEEE) and a Distinguished Scientist of the Association for Computing Machinery (ACM).

Pratap Pattnaik IBM Research, Thomas J. Watson Research Center, Yorktown Heights, NY 10598 USA ([email protected]). Dr. Pattnaik is an IBM Fellow and is currently the Senior Manager of the Scalable Systems group in IBM Research. Over the past 15 years, he and his team have developed a number of key technologies for IBM's high-end servers. His current research work includes the development and design of computer systems, including system and processor hardware, operating systems, and autonomic components for IBM's UNIX and mainframe servers, as well as the theory of computing. In the past, he has worked in the field of parallel algorithms for molecular dynamics, solutions of linear systems, quantum Monte Carlo and the theory of high-temperature superconductivity, communication subsystems, fault management subsystems, etc. He also has over 10 years of research experience in various aspects of design and fabrication, silicon processing, and condensed matter theory.
