IBM POWER9 and Cognitive Computing

IBM POWER9 and cognitive M. Kumar W. P. Horn computing J. Kepner J. E. Moreira Cognitive applications are complex and are composed of multiple P. Pattnaik components exhibiting diverse workload behavior. Efficient execution of these applications requires systems that can effectively handle this diversity. In this paper, we show that IBM POWER9TM shared memory systems have the compute capacity and memory throughput to efficiently handle the broad spectrum of computing requirements for cognitive workloads. We first review the GraphBLAS interface defined for supporting cognitive applications, particularly whole-graph analytics. We show that this application-programming interface effectively separates the concerns between the analytics application developer and the system developer and simultaneously enables good performance by permitting system developers to make platform- specific optimizations. A linear algebra formulation and execution of betweenness centrality kernel in the High-Performance Computing Scalable Graph Analysis Benchmark, for 256 million vertices and 2 billion edges graphs, delivers a sixfold reduction in execution time over a reference implementation. Following that, we present the results of benchmarking the forward propagation step of deep neural networks (DNNs) written in GraphBLAS and executed on POWER9. We present the rationale and evidence for weight matrices of large DNNs being sparse and show that for sparse weight matrices, GraphBLAS/POWERÒ has a two orders-of-magnitude performance advantage over dense implementations. Applications requiring analysis of graphs larger than several tens of billion vertices require distributed computing environments such as Apache Spark to provide resilience and parallelism. We show that when linear algebra techniques are implemented in an Apache Spark environment, we are able to leverage the parallelism available in POWER9 Servers. Introduction prevention use this large graph representation for modeling Cognitive systems create actionable knowledge from data. and analysis in Stage 3 [2]. Stage 2 also includes the data The recent growth in cognitive computing is due to the preparation (e.g., selection, curation, sampling, interpolation) availability of a large volume of relevant data, large prior to modeling in Stage 3. In this paper, we focus on whole- amounts of computational power, and the high value of the graph analytics. We do not dwell on queries to retrieve a actionable knowledge to many large businesses [1]. fraction of data, which are supported by various NoSQL The creation of actionable knowledge by a cognitive databases such as Accumulo, Apache Giraph, Cassandra, system encompasses four processing stages (Figure 1). Stage CouchDB, MongoDB, and Neo4J. 1 is primarily intra-record analysis of extremely diverse data The modeling phase of cognitive computing, Stage 3 sources such as call records, click streams, images, or videos. depicted in Figure 1, encompasses two approaches. The first The output of this stage is data tagged with metadata, the tags is driven by statistical models, primarily based on Bayesian enabling fusion or linking of data from the diverse sources methods. Various regression, classification, clustering into a large graph in Stage 2. Various applications in techniques, kernel methods in general, and support vector healthcare, social network analytics, and financial fraud machines reside in this category [3, 4]. The second approach comprises deep learning approaches, particularly Digital Object Identifier: 10.1147/JRD.2018.2846958 deep neural networks (DNNs) of various flavors [5–8]. ß Copyright 2018 by International Business Machines Corporation. Copying in printed form for private use is permitted without payment of royalty provided that (1) each reproduction is done without alteration and (2) the Journal reference and IBM copyright notice are included on the first page. The title and abstract, but no other portions, of this paper may be copied by any means or distributed royalty free without further permission by computer-based and other information-service systems. Permission to republish any other portion of this paper must be obtained from the Editor. 0018-8646/18 ß 2018 IBM IBM J. RES. & DEV. VOL. 62 NO. 4/5 PAPER 10 JULY/SEPTEMBER 2018 M. KUMAR ET AL. 10 : 1 Figure 1 Four pillars of cognitive computing: intra-source analysis, data linking, actionable knowledge (model) extraction, and model deployment. These networks currently have hundreds of stages with usually very sparse. A significant amount of the time in thousands of neurons in each stage. Success of DNNs is graph analytics is spent in multiplication of this adjacency driving consideration of even larger DNNs, and recent matrix with a vector representing a set of nodes or node research suggests that the weight matrices for large DNNs properties. As graphs become very large, parallel and can be made sparse without sacrificing their prediction distributed computing solutions are necessary to address accuracy [9–12]. This enables larger DNN models to be both storage capacity and computation time requirements. evaluated on a given hardware platform capable of taking In the fourth section of this paper (“Computational advantage of sparsity. performance in the Spark environment”), we discuss the In the next section of this paper (“Linear algebra implementation of a GraphBLAS model of computation on formulation of graph analytics”), we describe the processing the Apache Spark distributed computing framework. We and storage requirements of Stage 2 of cognitive applications analyze the scalability of an example graph algorithm and for creating and storing these large graphs, and we highlight show that we can efficiently use the multiple parallel features of IBM POWER exploited in achieving high resources in a POWER9 server. performance. We summarize GraphBLAS [13, 14], an Our key message in this paper is that the computational interface defined for the sparse-matrix linear algebra requirements of the various tasks of cognitive systems, or in approach for graph analytics, and describe the other words the workload behavior of these tasks outlined implementation of the high-performance Graph Processing in Figure 1, are diverse. This diversity is discussed again in Interface (GPI) library that currently implements an early the conclusion of this paper. Specialized systems for narrow variant of the GraphBLAS interface [15, 16] and is tasks such as DNN training exist [18, 19] and perform very optimized for POWER processors [17]. Then, we report the well on them. However, POWER systems effectively cover performance on POWER9 of representative kernels in graph all stages of cognitive computing illustrated in Figure 1 analytics, performed in Stages 2 and 3 of Figure 1, and because of their large shared memory multiprocessing illustrate the advantage of the linear algebra approach over capability and high bandwidth to memory. This breadth of conventional methods. In the third section of this paper coverage includes Stage 4 in Figure 1, where the actionable (“DNN computations on POWER9”), we present the business knowledge is deployed in business systems. performance of POWER9 on the forward propagation kernel We refer the reader to companion articles in this issue for of artificial neural networks. We show that the sparse-matrix an exploration of the features and capabilities of the algorithms outperform their dense-matrix counterparts, even POWER9 processor. In particular, Le et al. [17] describe with limited sparsity in the weight matrices. the high-performance processing cores of POWER9, while In the linear algebra formulation of graph analytics, a the article Arimilli et al. [20] describe the cache hierarchy graph is represented as an adjacency matrix, which is that supports the computing capacity of those cores and 10 : 2 M. KUMAR ET AL. IBM J. RES. & DEV. VOL. 62 NO. 4/5 PAPER 10 JULY/SEPTEMBER 2018 Figure 2 GraphBLAS primitives to support linear algebra formulation of graph analytics. Uppercase letters are matrices; lowercase b, c, and m are vectors. M and m are masks. Red and blue lettering indicates optional parameters and modifiers. Starke et al. [21] describe the memory architecture and Finally, the graph data is highly non-uniform in terms of in/ connectivity of POWER9 systems. out degree distribution of vertices and the presence of community structures, a very loose definition of clique. Linear algebra formulation of graph analytics This complicates the exploitation of parallelism on modern Attaining good performance on analysis of linked data in multi-core processor-based parallel systems [17, 22] Stage 2 of cognitive applications is tenuous for the analytics because load balancing, minimization of synchronization application developer because of the need to manage overhead, and minimization of inter-task communication performance consequences of irregular memory accesses become more difficult to manage. over a large address space and exploit complex hardware In addition to these idiosyncrasies of the graph analytics features, as explained first in this section. The GraphBLAS problems, modern multi-core processors also have their interface, described next, unburdens the application own complexities that need to be factored into the developer from the chores of managing parallelism and application programs to minimize the execution time of platform-specific optimizations. The platform-specific applications. Programmers must restructure their optimizations are factored into a library of select graph applications

Load more