

Novel Graph Processor Architecture

William S. Song, Jeremy Kepner, Vitaliy Gleyzer,

Huy T. Nguyen, and Joshua I. Kramer

Graph algorithms are increasingly used in applications that exploit large databases. However, conventional processor architectures are hard-pressed to handle the throughput and memory requirements of graph computation. Lincoln Laboratory’s graph-processor architecture represents a fundamental rethinking of architectures. It utilizes innovations that include high-bandwidth three-dimensional (3D) communication links, a sparse matrix-based graph instruction set, accelerator-based architecture, a systolic sorter, randomized communications, a cacheless memory system, and 3D packaging.

Many problems in computation and data analysis can be framed by graphs and solved with graph algorithms. A graph, which is defined as a set of vertices connected by edges, as shown on the left in Figure 1, adapts well to presenting data and relationships. Graphs take two forms: a directed graph has edges with orientation as shown in Figure 1, and an undirected graph has edges with no orientation. Graph algorithms perform operations on graphs to yield desired information. In general, graphs can also be represented as full standard and sparse matrices as shown in Figure 1 [1, 2]. The graph G(V, E) with vertices V and edges E can be represented with the sparse matrix A, where the matrix element Aij represents the edge between vertex i and vertex j. In this example, Aij is set to 1 when there is an edge from vertex i to vertex j. If there is no edge between vertices i and j, then Aij would be zero and thus would have no entry in the sparse matrix. In this example, the sparse matrix has reduced the required data points representing the graph from 16 to 7.

Increasingly, commercial and government applications are making use of graph algorithms [3]. These applications address a wide variety of tasks—finding the shortest or fastest routes on maps, routing robots, analyzing DNA, corporate scheduling, transaction processing, and analyzing social networks—as well as network optimizations for communication, transportation, water supply, electricity, and traffic. Some of the graph algorithm applications involve analyzing very large databases. These large databases could contain consumer purchasing patterns, financial transactions, social networking patterns,


[Figure 1 panels: a four-vertex directed graph with labeled vertices and edges (left), its conventional standard full-matrix representation indexed from the ith vertex to the jth vertex (center), and the corresponding sparse matrix configuration containing only the nonzero entries (right).]

FIGURE 1. A sparse matrix representation of a graph reduces the amount of computation power necessary by representing the graph with a minimum (sparse) number of data points.
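To make the sparse representation concrete, the short Python sketch below stores a small directed graph both as a full adjacency matrix and as a dictionary holding only the nonzero entries; the particular edge list is hypothetical and is chosen only to reproduce the 16-entries-versus-7 comparison described above.

```python
# A four-vertex directed graph stored two ways: as a full 4 x 4 adjacency
# matrix (16 entries) and as a sparse dictionary keyed by (from, to) holding
# only the nonzero entries. The edge set here is hypothetical.
edges = [(1, 2), (2, 1), (2, 3), (3, 2), (3, 4), (4, 2), (4, 3)]

full = [[0] * 4 for _ in range(4)]
for i, j in edges:
    full[i - 1][j - 1] = 1                # 16 stored values regardless of content

sparse = {(i, j): 1 for (i, j) in edges}  # only 7 stored values

print(sum(len(row) for row in full))      # 16
print(len(sparse))                        # 7
```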

These large databases could contain consumer purchasing patterns, financial transactions, social networking patterns, financial market information, Internet data, test data, biological data, cyber communication, or intelligence, surveillance, and reconnaissance (ISR) sensor data. For example, an analyst might be interested in spotting a cyber attack, locating a terrorist, or identifying a market niche. Some graphs may contain billions of vertices and edges requiring petabytes of memory for storage. For these large database applications, computation efficiency often falls off dramatically. Because conventional processor architectures are generally not well matched to the flow of graph computation, computation hardware cannot keep up with computation throughput requirements. For example, most modern processors utilize cache-based memory in order to take advantage of highly localized memory access patterns. However, memory access patterns associated with graph processing are often random in nature and can result in high cache miss rates. In addition, graph algorithms require significant overhead computation for dealing with the indices of vertices and edges of graphs.

For benchmarking graph computation, we often use sparse matrix operations to estimate graph algorithm performance because sparse matrix arithmetic operations have computational flow and throughput very similar to those of graph processing. Once the graphs have been converted to the sparse matrix format, the sparse matrix operations can be used to implement most graph algorithms. Figure 2 shows an example of the computational throughput differences between conventional processing and graph processing [4]. Shown in blue is a conventional matrix multiply kernel running on PowerPC and Intel Xeon processors. In contrast, shown in red is a sparse matrix multiply kernel running on identical processors. As one can see, the graph computation throughput is approximately 1000 times lower; this result is consistent with typical application codes.

Recently, multiple processor cores have become available on a single processor die. Multicore processors can speed up graph computation somewhat, but they are still limited by conventional architectures that are optimized essentially for dense matrix processing using cache-based memory.

Parallel processors have often been used to speed up large conventional computing tasks. A parallel processor generally consists of conventional multicore processors that are connected through a communication network so that different portions of the computations can be done on different processors. For many scientific computing applications, these processors provide significant speedup over a single processor.

However, large graph processing tasks often run inefficiently on conventional parallel processors. The speedup often levels off after only a small number of processors are utilized (Figure 3) because the computing patterns for graph algorithms require much more communication between processor nodes than conventional, highly localized processing requires. The limited communication bandwidth of conventional parallel processors generally cannot keep pace with the demands of graph algorithms.

In the past, numerous attempts have been made to speed up graph computations by optimizing processor architecture. Parallel processors such as the Cray XMT and Thinking Machines’ Connection Machine are example attempts to speed up large graph processing with specialized parallel architectures.
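A rough software analogue of the Figure 2 comparison can be sketched as follows; this is only an illustrative measurement harness, assuming NumPy and SciPy are available, and the sparse operation count is an approximation (roughly two operations per partial product), not the methodology used to produce Figure 2.

```python
import time
import numpy as np
import scipy.sparse as sp

def ops_per_second(fn, op_count):
    """Run fn once and return the achieved operations per second."""
    start = time.perf_counter()
    fn()
    return op_count / (time.perf_counter() - start)

n, nnz_per_row = 2000, 8
dense_a, dense_b = np.random.rand(n, n), np.random.rand(n, n)
sparse_a = sp.random(n, n, density=nnz_per_row / n, format="csr")
sparse_b = sp.random(n, n, density=nnz_per_row / n, format="csr")

dense_rate = ops_per_second(lambda: dense_a @ dense_b, 2 * n**3)
# Approximate useful work in a sparse-sparse multiply: ~2 ops per partial product.
sparse_rate = ops_per_second(lambda: sparse_a @ sparse_b,
                             2 * sparse_a.nnz * nnz_per_row)
print(f"dense matrix multiply:  {dense_rate:.2e} ops/s")
print(f"sparse matrix multiply: {sparse_rate:.2e} ops/s")
```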


[Figure 2 plot: operations per second versus number of elements/edges per row/vertex, comparing dense matrix multiply and sparse matrix (graph) multiply kernels on a 1.5 GHz PowerPC and a 3.2 GHz Intel Xeon.]

FIGURE 2. A comparison of computational throughput differences between conventional and graph processing shows that in conventional processors computational efficiency is significantly lower for graph processing compared to conventional processing.

[Figure 3 plot: graph operations per second versus number of processors for a COTS cluster model with effective bandwidths of 0.1 GB/s and 1 GB/s, an IBM p5 570 (fixed problem size), and a custom HPC2 system.]

FIGURE 3. Graph processing computational throughput in networked multiprocessors levels off at the use of a relatively small number of processors.

However, inherent difficulties associated with graph processing, including distributed memory access, indices-related computation, and interprocessor communications, have limited the performance gains.

Lincoln Laboratory has been developing a promising new processor architecture that may deliver orders of magnitude higher computational throughput and power efficiency over the best commercial alternatives for large graph problems.

Graph Processor

The Laboratory’s new graph processor architecture represents a fundamental rethinking of the architecture for optimizing graph processing. The instruction set is unique in that it is based on and optimized for sparse matrix operations. In addition, the instruction set is designed to operate on sparse matrix data distributed over multiple processors. The individual processor node—an architecture that is a great departure from the conventional von Neumann architecture—has local cacheless memory. All data computations, indices-related computations, and memory operations are handled by specialized accelerator modules rather than by the central processing unit (CPU). The processor nodes utilize new, efficient message-routing algorithms that are statistically optimized for communicating very small packets of data such as sparse matrix elements or partial products. The processor hardware design is also optimized for very-high-bandwidth three-dimensional (3D) communications. Detailed analysis and simulations have demonstrated an orders-of-magnitude increase in computational throughput and power efficiency for running complex graph algorithms on large distributed databases.

Parallel Graph Processor Architecture Based on a Sparse Matrix Algebra Instruction Set

Assume that the graph has been converted into sparse matrix format before being inputted into the processor. The sparse matrix operations are then used to implement the graph algorithms. There are a number of advantages in implementing the graph algorithms as sparse matrix operations. One advantage is that the number of lines of code is significantly reduced in comparison to the amount of code required by traditional software that directly implements graph algorithms using conventional instruction sets. However, while this advantage can increase software development efficiency, it does not necessarily result in higher computational throughput in conventional processors. Perhaps a more important advantage of embedding graph algorithms in sparse matrix operations is that it is much easier to design a parallel processor that computes sparse matrix operations rather than general graph algorithms.


The instruction set can be vastly simplified because implementing sparse matrix–based graph algorithms requires surprisingly few base instructions. Another reason sparse matrix operations facilitate the designing of a processor architecture is that it is much easier to visualize the parallel computation and data movement of sparse matrix operations running on parallel processors than it is on conventional machines. This advantage enables developers to come up with highly efficient architectures and hardware designs with much less effort.

Lincoln Laboratory’s new graph processor is a highly specialized parallel processor optimized for distributed sparse matrix operations. The processor is targeted for implementing graph algorithms (converted to sparse matrix format) for analyzing large databases. Because large matrices do not fit into a single processor’s memory and require more throughput than the single processor can provide, the approach is to distribute the large matrices over many processor nodes. Figure 4 shows the high-level architecture of the parallel processor. It consists of an array of specialized sparse matrix processors called node processors. The node processors are attached to the global communication network, and they are also attached to the global control processor through the global control bus.

[Figure 4 diagram: node processors attached to a global communication network (optimized for small communication messages), with global I/O and a global control processor on a global control bus.]

FIGURE 4. The illustration of the high-level architecture for Lincoln Laboratory’s parallel graph processor shows the connection between the specialized sparse matrix processors (node processors) and the global components.

Although the generic high-level architecture in Figure 4 appears quite similar to that of a conventional multiprocessor system, how it is implemented is significantly different from how a conventional parallel architecture is implemented. One of the main differences is that the processor’s instruction set is based on sparse matrix algebra operations [2] rather than on conventional instruction sets. Important instruction kernels include the sparse matrix multiply, addition, subtraction, and division operations shown in Table 1. Individual element-level operators within these matrix operations, such as the multiply and accumulate operators in the matrix-multiply operation, often need to be replaced with other arithmetic or logical operators, such as maximum, minimum, AND, OR, XOR, etc., in order to implement general graph algorithms. Numerous graph algorithms have already been converted to sparse matrix algorithms [2, 4].

Table 1: Sparse Matrix Algebra-Based Processor Instruction Set

OPERATION        COMMENTS
C = A +.* B      Matrix multiply operation is the throughput driver for many important benchmark graph algorithms. Processor architecture is highly optimized for this operation.
C = A .± B       Dot operations are performed within local memory.
C = A .* B
C = A ./ B
B = op(k, A)     Operation with matrix and constant. This operation can also be used to redistribute matrix and sum columns or rows.
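As a sketch of how the element-level operators in the C = A +.* B kernel can be swapped out, the following Python function (an illustration, not the processor’s implementation; matrices are plain dictionaries in coordinate form) performs a sparse matrix multiply with replaceable multiply and accumulate operators, then uses Boolean AND/OR to take one step of a reachability (breadth-first-search-like) computation on a hypothetical four-vertex graph.

```python
def semiring_matmul(A, B, add, mul):
    """Sparse C = A +.* B with user-supplied accumulate (add) and multiply (mul).

    A, B, and the result are dictionaries mapping (row, col) -> value,
    i.e., coordinate (tuple) format.
    """
    C = {}
    for (i, k), a in A.items():
        for (kk, j), b in B.items():
            if k == kk:
                p = mul(a, b)
                C[(i, j)] = p if (i, j) not in C else add(C[(i, j)], p)
    return C

# One hop of reachability from vertex 1 on a hypothetical edge set:
# the multiply becomes AND and the accumulate becomes OR.
A = {(1, 2): True, (2, 1): True, (2, 3): True, (3, 4): True}  # edges i -> j
x = {(1, 1): True}                  # frontier = {vertex 1}, as a 1 x n row vector
frontier = semiring_matmul(x, A, add=lambda p, q: p or q,
                           mul=lambda p, q: p and q)
print(frontier)                     # {(1, 2): True} -> vertex 2 is one hop away
```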
The other main differentiating feature of the new architecture is the high-bandwidth, low-power communication network that is tailored for communicating small messages. A typical message contains one matrix element or one partial product, which consists of the data value, row index, and column index. In contrast, a conventional communication network tries to maximize the message sizes in order to minimize the overhead associated with moving the data. A newly developed statistical routing algorithm with small message sizes greatly improves the communication-bandwidth availability for graph processing. In addition, the bandwidth of the network hardware itself is very large compared to the bandwidth of conventional parallel processors; this large bandwidth is needed to handle the demands of graph processing.
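The message format described above might look roughly like the following sketch; the field names are illustrative assumptions, not the hardware’s actual packet layout.

```python
from dataclasses import dataclass

@dataclass
class MatrixElementMessage:
    """One communication message: a routing header plus a single matrix
    element (or partial product) in coordinate format."""
    dest_node: int   # destination processor address (header)
    priority: int    # optional header field
    ecc: int         # error detection/correction bits (placeholder)
    row: int         # row index of the element or partial product
    col: int         # column index
    value: float     # data value

msg = MatrixElementMessage(dest_node=137, priority=0, ecc=0,
                           row=42, col=7, value=3.5)
print(msg)
```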


[Figure 5 block diagram: the control (node controller), matrix reader, matrix writer, sorter, ALU, memory, and interprocessor communication modules are attached to a common bus/communication fabric, with the interprocessor communication module connected to the communication network.]

FIGURE 5. The new node processor architecture uses specialized modules to speed up sparse matrix processing. (ALU stands for arithmetic logic unit.)

Accelerator-Based Node Processor Architecture

The architecture of the Laboratory’s individual node processor is also a great departure from conventional cache-based von Neumann machines, which perform all computations in the CPU. This new architecture consists of a number of specialized modules, including matrix reader, matrix writer, sorter, arithmetic logic unit (ALU), and communication modules, as shown in Figure 5 [4, 5]. The CPU is mainly used to provide the control and timing for the sparse matrix instructions. Most of the computation, communication, and memory operations are performed by the specialized modules that are designed to optimally perform the given tasks. There is no cache because cache misses tend to slow down graph processing. In general, multiple modules are utilized simultaneously in performing sparse matrix computations.

The architecture based on the specialized accelerator modules provides much higher computational throughput than the conventional von Neumann processor architecture by enabling highly parallel pipelined computations. In a conventional processor, the CPU is used to compute all the processing tasks, such as memory access, communication-related processing, arithmetic and logical operations, and control. These processing tasks are often done serially and take many clock cycles to perform, lowering the overall computational throughput. In the new architecture, these tasks are performed in parallel by separate specialized accelerator modules. These accelerator modules are designed for fast throughput using highly customized architectures. Ideally, they would be designed to keep up with the fastest data rate possible, which is processing one matrix element or one partial product within a single clock cycle in effective throughput. Further speedup may be gained by having multiple parallel versions of these modules to process multiple matrix elements or partial products per clock cycle.

The matrix reader and writer modules are designed to efficiently read and write the matrix data from the memory. The example formats include compressed sparse row (CSR), compressed sparse column (CSC), and coordinate (also called tuple) format (Figure 6). In the CSR format, the element data and column index are stored as pairs in an array format. An additional array stores the row start address for each row so that these pointers can be used to look up the memory locations in which the rows are stored. In the CSC format, the element data and row index are stored as pairs in an array format. An additional CSC array stores the column start address for each column. The coordinate format stores matrix element–related data, including element data, row index, and column index, together in array format. The coordinate format is also convenient for storing randomly ordered matrix elements or partial products. The matrix reader and writer modules are designed so that all the overhead operations—such as formatting matrix element data and indices for writing, generating pointer arrays for CSC and CSR for writing, and generating matrix element indices for reading—are performed automatically without requiring additional instructions. In this way, complexity associated with sparse matrix read and write operations is minimized, and memory interface operations are accelerated significantly.

The ALU module is designed to operate on the stream of sparse matrix elements or partial products instead of operating with a register file as in conventional processor architectures. The streaming method eliminates register load operations and increases the computational throughput. The ALU generally performs designated arithmetic or logical operations on the data stream, depending on the indices. For example, the ALU module may accumulate successive matrix elements only if the element indices match exactly.
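A minimal software model of that index-matching accumulate, assuming the partial products arrive already sorted by (row, column) as they would from the sorter module, might look like this (the data values are hypothetical):

```python
def accumulate_stream(stream):
    """Accumulate a sorted stream of (row, col, value) partial products.

    Successive entries are summed only when both indices match exactly,
    mirroring the streaming accumulate described for the ALU module.
    """
    out = []
    for row, col, val in stream:
        if out and out[-1][0] == row and out[-1][1] == col:
            out[-1][2] += val
        else:
            out.append([row, col, val])
    return [tuple(entry) for entry in out]

# Hypothetical partial products, already sorted by (row, col) by the sorter.
stream = [(0, 1, 2.0), (0, 1, 3.0), (2, 3, 1.0), (2, 3, 4.0), (3, 0, 5.0)]
print(accumulate_stream(stream))   # [(0, 1, 5.0), (2, 3, 5.0), (3, 0, 5.0)]
```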


4 × 4 sparse matrix example (the value of each entry encodes its position):
  row 1:   .  12   .   .
  row 2:  21   .  23   .
  row 3:   .  32   .  34
  row 4:   .  42  43   .

Compressed sparse row (CSR) storage:
  Data value:          12 21 23 32 34 42 43
  Column index:         2  1  3  2  4  2  3
  Row start address:    1  2  4  6

Compressed sparse column (CSC) storage:
  Data value:            21 12 32 42 23 43 34
  Row index:              2  1  3  4  2  4  3
  Column start address:   1  2  5  7

Coordinate format storage (row based):
  Data value:     12 21 23 32 34 42 43
  Row index:       1  2  2  3  3  4  4
  Column index:    2  1  3  2  4  2  3

FIGURE 6. Three formats for sparse matrix storage are shown for the 4 × 4 sparse matrix example in the figure.
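For reference, the same 4 × 4 example can be built in software with SciPy’s sparse containers, which serve here only as a stand-in for the hardware’s reader and writer modules; note that SciPy uses 0-based indices and pointer arrays with a terminating entry, whereas the figure uses 1-based start addresses.

```python
import numpy as np
from scipy.sparse import coo_matrix

# The 4 x 4 example from Figure 6: each entry's value encodes its (row, col).
rows = np.array([1, 2, 2, 3, 3, 4, 4]) - 1          # 0-based row indices
cols = np.array([2, 1, 3, 2, 4, 2, 3]) - 1          # 0-based column indices
vals = np.array([12, 21, 23, 32, 34, 42, 43])

A = coo_matrix((vals, (rows, cols)), shape=(4, 4))  # coordinate (tuple) format
csr = A.tocsr()                                     # compressed sparse row
csc = A.tocsc()                                     # compressed sparse column

print(csr.data)     # [12 21 23 32 34 42 43]
print(csr.indices)  # column indices, 0-based: [1 0 2 1 3 1 2]
print(csr.indptr)   # row start pointers: [0 1 3 5 7]
print(csc.data)     # [21 12 32 42 23 43 34]  (column-major order, as in Figure 6)
```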

Because these matrix operations perform computations only when the indices match, this feature is useful for sparse-matrix add and multiply operations.

The communications module handles the communication between processor nodes. It takes the matrix element or partial product and makes a communication message that includes the matrix element in coordinate format and a header that contains the destination processor address. The header may also contain error detection and correction bits and other relevant information, such as the priority of the message. The communication messages are then sent to the global communication network and are forwarded to the destination nodes. The communications module also decodes the received messages, performs error correction, and outputs the matrix element or partial product into the node in coordinate format.

The memory for the node processor can be implemented with various types of memory, including static random-access memory (SRAM), dynamic RAM, and synchronous DRAM. Nonvolatile memory such as Flash memory may be used for long-term storage and for instances when the storage requirement is high. There is no cache in the memory system since cache miss rates tend to be very high in graph processing.

The node controller module is responsible for setting up and coordinating the sparse matrix operations. Before a sparse matrix operation, the controller module loads the control variables into the control registers and control memory of the accelerator modules by using the local control bus. The control variables include the types of sparse matrix operations to be performed, matrix memory storage locations, matrix distribution mapping, and other relevant information. The controller module also performs timing and control. The node controller module can be implemented with a conventional general-purpose microprocessor. This particular microprocessor may also have a cache since the processing is mostly conventional processing. The node controller can also perform other processing tasks that are not well supported by the accelerator modules, such as creating an identity matrix and checking to see if a matrix is empty across all processor nodes. The controller module is tied to the global control bus, which is used to load the data and programs to and from the nodes, and to perform the global computation process control.

Systolic Merge Sorter


The systolic merge sorter module is used for sorting the matrix element indices for storage and for finding matching element indices during matrix operations. It is one of the most critical modules in graph processing because more than 95% of computational throughput can be associated with the sorting of indices. The sparse matrix and graph operations consist mainly of figuring out which element or partial product should be operated on. In contrast, relatively few actual element-level operations get performed. In order to meet the computational throughput requirement, the systolic k-way merge sorter architecture [4] was developed to provide significantly higher throughput than conventional merge sorters.

The conventional merge sorter sorts long sequences of numbers by using a recursive “divide and conquer” approach. It divides the sequence into two sequences that have equal, or as equal as possible, lengths. The two shorter sequences are then sorted independently and merged to produce the sorted result. The sorting of the two shorter sequences can also be divided into even shorter sequences and sorted recursively by using the same merge sort algorithm. This process is recursively repeated until the divided sequence length reaches 1. Figure 7 illustrates an example of the 2-way merge sorting in which 16 items are sorted in four steps. The merge sort algorithm requires on the order of n log2 n processor cycles and on the order of 2n locations in memory, where n is the length of the sequence. The merge sort algorithm can be readily implemented with a conventional general-purpose processor working with random-access memory.

The k-way merge sorter can perform the sorting task faster than a 2-way merge sorter when k is larger than 2. The k-way merge sorting is identical to 2-way merge sorting, except that k sequences are merged in each step, as shown in Figure 8 with a 4-way merge sort example. On the order of n logk n memory cycles and on the order of 2n locations in memory are required to sort a sequence of length n. This is log2 k times faster than the 2-way merge sort process. For example, when k = 32, the k-way merge sorter has five times greater sorter throughput than the 2-way merge sorter.
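The following Python sketch mirrors the Figure 8 example: it merges the four presorted runs from the figure in a single k-way pass and counts how many passes a full sort would need for different k. The scan over the k run heads is exactly the per-output work that the systolic sorter, described next, performs in a single clock cycle.

```python
def k_way_merge(runs):
    """Merge k presorted lists into one sorted list in a single pass."""
    heads = [0] * len(runs)
    merged = []
    total = sum(len(r) for r in runs)
    for _ in range(total):
        # Find the run whose current head element is smallest (a scan over k
        # candidates; a conventional processor spends many cycles here).
        live = [i for i in range(len(runs)) if heads[i] < len(runs[i])]
        best = min(live, key=lambda i: runs[i][heads[i]])
        merged.append(runs[best][heads[best]])
        heads[best] += 1
    return merged

def merge_passes(n, k):
    """Number of k-way merge passes needed to sort n items."""
    passes, run_count = 0, n
    while run_count > 1:
        run_count = -(-run_count // k)   # every k runs merge into one
        passes += 1
    return passes

runs = [[1, 3, 5, 12], [2, 4, 6, 9], [1, 2, 2, 8], [4, 6, 7, 9]]  # Figure 8 runs
print(k_way_merge(runs))       # the fully sorted 16 items
print(merge_passes(16, 2))     # 4 passes with 2-way merging (Figure 7)
print(merge_passes(16, 4))     # 2 passes with 4-way merging (Figure 8)
```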

[Figure 7 data: the sequence 12 5 1 3 6 2 4 9 1 2 2 8 9 4 7 6 is merged pairwise in four steps into 1 1 2 2 2 3 4 4 5 6 6 7 8 9 9 12.]

FIGURE 7. In this conventional 2-way merge sort, 16 items are sorted in four steps.

[Figure 8 data: the same 16 items, grouped into the four presorted runs (1 3 5 12), (2 4 6 9), (1 2 2 8), and (4 6 7 9), are sorted in two 4-way merge steps.]

FIGURE 8. In the 4-way merge sort, the 16 items are sorted in half the steps of the sort in Figure 7.


The main difficulty with implementing a k-way merge sorter in a conventional processor is that it takes many clock cycles to figure out the smallest (or largest) value among k entries during each step of the merge sorting process. Ideally, the smallest of the k values should be computed within one processor clock cycle for the maximum sorter throughput. The systolic merge sorter array shown in Figure 9 and Figure 10 can achieve such maximum sorter throughput during merge sorting of k presorted lists. Figure 9 shows an example of a 7-way merge sorter array. The array consists of three cells, with the cell operations shown in Figure 10. Each cell has two registers. The top register RS (for smaller) contains the smaller of the two values, and the bottom register RB (for bigger) contains the larger value. Each clock cycle, the cell passes the smaller value to the left if it is smaller than the larger value on the left. For example, the middle cell in Figure 9 at time 0 decides to pass the value 3 to the left because 3 is smaller than 7, which is the larger value on the left cell. As the result, the value 3 ends up in the left cell at time 1. Similarly, the larger value is passed to the right if it is bigger than the smaller value on the right adjacent cell. For example, the left cell in Figure 9 at time 0 decides to pass the value 7 to the right because 7 is larger than 3, which is the smaller value on the right adjacent cell. As the result, the value 7 ends up in the middle cell at time 1.

[Figure 9 diagram: a three-cell systolic array shown at times 0, 1, and 2, with an input stream entering and the sorted output leaving at the left.]

FIGURE 9. The illustration shows the systolic 7-way merge sorter array in operation.

[Figure 10 diagram: each sorter cell holds registers RS and RB, receives inputs IL and IR from its neighbors, and uses comparators and logic to update the registers each clock cycle according to the relative order of IL, RS, RB, and IR.]

FIGURE 10. The systolic merge sorter array cell operation depends on the relationships between internal variables (R) and external inputs (I).


In the continuous operation mode, the smallest value in the array is always at the register RS of the left-most cell. As a new input is presented to the array, the smaller value between the input and the RS of the left-most cell is outputted from the array in the very next clock cycle. At the same time, smaller values in the array march toward the left, and larger values march toward the right with each clock cycle. Because the array always outputs the smallest of the k values in the very next clock cycle, the k-way merge sort process can be carried out at the maximum possible throughput rate.

The systolic array works as follows. Before time 0, the array is preloaded with six values. Such loading can be achieved by preloading the registers with -∞ or a very small value. As six input values are loaded before time 0 from the left, the -∞’s are outputted to the left. At the same time, larger values march toward the right and smaller values march toward the left, guaranteeing that the smallest value in the array will always be at the RS register of the left-most cell (correspondingly, the largest value in the array will be at the RB register of the right-most cell). Toward the end of the k-way merge sort process when no more inputs are available, the systolic array can be emptied in sorted order by injecting ∞’s or very large values as inputs.

The systolic array implementation of the k-way merge sorter has numerous implementation advantages over conventional processor architectures. Because all the sorter cells are identical, developers can spend significant effort in optimizing the cell for high speed, low power consumption, and low circuit area, and can replicate the design to achieve very high performance with relatively little overall design effort. In addition, all the communication happens between neighboring cells, eliminating long communication paths and making high-speed, low-power communication between cells feasible. Because of these implementation advantages, as well as the inherently higher throughput of the k-way merge sort over a 2-way merge sort, the custom systolic sorter module can provide up to two orders of magnitude higher sorter throughput than prevailing microprocessor-based sorting.

3D Communication Network and Randomized Message Routing

The new graph processor architecture is a parallel processor interconnected in a 3D toroidal configuration using very high bandwidth links [5], as shown in Figure 11. The 3D toroid provides much higher communication performance than a two-dimensional (2D) toroid because of higher bisection bandwidth. Bisection bandwidth represents the worst-case communication bandwidth between two parts of a network partitioned to contain an equal number of nodes. Bisection bandwidth is often used to gauge the communication performance of the processor network for communication-intensive tasks.

In order to minimize communication link lengths, the 2D toroidal cluster is placed on a circuit board, and multiple circuit boards are stacked on top of each other to form the 3D toroid. The links between the circuit boards are enabled by an array of electromagnetic coupling connectors [6] that can communicate at high data rates without requiring physical conductor connections.

Custom-designed, high-speed input/output circuitries provide high-bit-rate, low-power communication for 2D links within the board and for 3D links between the boards. Multiple links share the delay lock loop (DLL) circuitry, as shown in Figure 12, because the DLL is the highest-power-consumption circuitry for the communication links.
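The bisection-bandwidth argument above can be illustrated with a simple link count; this sketch assumes a torus with even dimension sizes, cut across its longest dimension, and equal bandwidth on every link (it is a counting exercise, not a model of the actual hardware).

```python
def torus_bisection_links(dims):
    """Links crossing a balanced bisection of a torus with the given dimensions.

    The cut is taken perpendicular to the longest dimension; the factor of 2
    accounts for the torus wrap-around links.
    """
    longest = max(dims)
    cross_section = 1
    for d in dims:
        cross_section *= d
    cross_section //= longest
    return 2 * cross_section

# 512 nodes as an 8 x 8 x 8 3D torus versus a 16 x 32 2D torus
print(torus_bisection_links((8, 8, 8)))   # 128 links cross the bisection
print(torus_bisection_links((16, 32)))    # 32 links cross the bisection
```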

[Figure 11 diagram: processor boards with coupling connectors, transmit/receive (TX/RX) circuits, routing and heat-removal layers, an insulator, and a cold plate are stacked into a 3D processor chassis to form the 3D parallel processor.]

FIGURE 11. The illustration shows the structure of a 3D graph processor with electromagnetic coupling communications between processor boards. TX and RX are the transmitter and receiver.


Because all communication paths are relatively short with well-controlled lengths and impedances, such sharing is possible while maintaining the high bit rate. Each node processor is designed to be capable of a communication rate of more than one trillion bits per second to keep up with the communication demands of graph algorithms. Test chips have been designed to verify performance and power efficiency.

[Figure 12 diagram: 2D and 3D communication links, each with N:1 serializing transmitters and 1:N deserializing receivers that share clock and delay lock loop circuitry.]

FIGURE 12. In the 2D and 3D parallel high-speed communication link circuitry, multiple links share the delay lock loop (DLL) circuitry.

The 3D toroidal communication network is designed as a packet-routing network optimized to support packet sizes as small as a single sparse matrix element. The network scheduling and protocol are designed so that successive communication packets from a node would have randomized destinations in order to minimize network congestion [6]. This design is a great contrast to typical conventional multiprocessor message-routing schemes that are based on much larger message sizes and globally arbitrated routing, which are used in order to minimize the message-routing overhead. However, large message-based communications are often difficult to route and can have a relatively high message contention rate caused by the long time periods during which the involved communication links are tied up. The small message sizes, along with randomized destination routing, minimize message contentions and improve the overall network communication throughput. Figure 13 shows the 512-node (8 × 8 × 8) 3D toroidal network simulation based on randomized destination communication versus unique destination communication. Even though both routing methods are based on small message sizes, the unique destination routing has a message contention rate that is closer to the contention rate of the conventional routing that is based on large message sizes. The randomized destination routing achieved approximately six times higher data rate and network utilization efficiency in the simulation using an identical network. The relative difference between the network utilization efficiencies is the important parameter because the absolute network utilization efficiency depends on the exact communication links and routing algorithms.

Simulation and Performance Projection

A detailed simulation of the architecture was performed to verify the design and to estimate the performance. Bit-level accurate simulation models were used to simulate the entire 1024-node processor running the graph algorithm kernels. The performance projection was achieved by extrapolating the existing computation circuits to the target fabrication processes at 45 nm to 65 nm. The new custom communication circuitry was developed to provide 3D interconnection based on coupling connectors. Figure 14 shows the computational throughput projections versus the number of processor nodes, assuming that the database size scales with the number of processors. We projected that the processor would provide several orders of magnitude higher graph computational throughput compared to the commercial alternatives. For the planned initial prototype with 1024 processor nodes, the projection for computation throughput is approximately three orders of magnitude higher than that of the best commercial alternatives.

The power-efficiency projection obtained with the same method is shown in Figure 15. The power efficiency was also projected to be several orders of magnitude higher. For the prototype with 1024 processor nodes, the projected power efficiency is up to four or five orders of magnitude higher than that of the best commercial alternatives.


[Figure 13 panels: a 3D toroidal grid network; a randomized-destination packet sequence; a unique destination for all packets from one source.]

FIGURE 13. This illustration of the 3D graph processor with electromagnetic coupling communications between processor boards shows the contrast between a randomized destination and a unique destination in a simulation using a 512-node 3D toroidal network. In the randomized destination case, 87% full network efficiency is achieved. In the unique destination case, only 15% full network efficiency is achieved.

[Figure 14 plot: graph operations per second versus number of processors for the 3D graph processor, a COTS cluster model (effective bandwidths of 0.1 GB/s and 1 GB/s), an IBM p5 570 (fixed problem size), a custom HPC2 system, a 1.5 GHz PowerPC, and a 3.2 GHz Intel Xeon. Figure 15 plot: sparse matrix or graph operations per second per watt versus number of processors for the same systems.]

FIGURE 14. The computational throughput performance projections were made based on simulations. The 3D graph processor achieved orders of magnitude better computation performance than the commercial systems.

FIGURE 15. Power efficiency projections versus number of parallel processors were made based on simulations. The power-efficiency difference compared to commercial processors is expected to be even larger than the computational throughput difference as the number of processors grows because the computational throughput does not level off as in commercial processors.

In many cases, the power efficiency is even more important than the computational throughput because the computational throughput of large database processing centers often tends to be limited by the availability of power and heat dissipation that can be provided.

The speedup of graph computation over multiple processors depends closely on how well the computational load is balanced between the processors. Advanced process mapping algorithms have been developed to optimize the allocation of sparse matrix data and computations to achieve robust balancing of processing load and memory usage. A sparse matrix compiler could also be developed to enable a simplified user interface with MATLAB-like high-level matrix instructions.

Future Directions

Graph processing is of great interest to the cyber, Department of Defense (DoD), and intelligence communities. However, conventional processors are notoriously slow when running graph algorithms. This poor performance is mainly due to the inherent mismatches between graph processing flows and conventional processor architectures. Graph algorithms can run faster in parallel processors, but performance gains quickly level off after a relatively small number of compute nodes because of the enormous interprocessor communication bandwidth requirements driven by the data flow patterns. Therefore, the sizes of the problems to which graph algorithms could be applied have been severely limited, and many DoD and intelligence needs have gone unmet by conventional processors.


To address these challenges, Lincoln Laboratory developed an entirely new 3D graph processor architecture—a significant departure from the variations of the von Neumann architecture that have dominated the computing world since the inception of the architecture.

Our detailed performance projections based on simulations point to orders of magnitude improvements in computational throughput and power efficiency over the best commercial alternatives. On the basis of these improvements, there will likely be numerous system insertion opportunities in the future for cyber and intelligence applications as well as for providing post-detection knowledge extraction and decision support processing in various ISR sensor platforms.

In addition, we are currently working to extend the instruction set to include instructions that are optimized for text-based data processing. Such enhancement is expected to significantly improve the analysis of “semantic” databases. We are also currently investigating the development and integration of a high-bit-rate, low-power optical communication network into the architecture. Such communication network technology would help the 3D graph processor architecture grow from thousands of nodes up to millions of nodes and beyond to handle very large databases in the future.

Acknowledgments

We would like to acknowledge the late Dennis Healy at the Defense Advanced Research Projects Agency Microsystems Technology Office (DARPA MTO), who made this work possible. We would also like to acknowledge the following research team members: Robert A. Bond, Nadya T. Bliss, Albert H. Horst, James R. Mann, Sanjeev Mohindra, Julie Mullen, Larry L. Retherford, and Eric I. Robinson. ■

References

1. J. Kepner and J. Gilbert, Graph Algorithms in the Language of Linear Algebra. Philadelphia: SIAM Press, 2011.
2. S.H. Roosta, Parallel Processing and Parallel Algorithms, Theory and Computation. New York: Springer-Verlag, 2000.
3. T.H. Cormen, C.E. Leiserson, R.L. Rivest, and C. Stein, Introduction to Algorithms. Cambridge, Mass.: The MIT Press, 2001.
4. W.S. Song, J. Kepner, H.T. Nguyen, J.I. Kramer, V. Gleyzer, J.R. Mann, A.H. Horst, L.L. Retherford, R.A. Bond, N.T. Bliss, E.I. Robinson, S. Mohindra, and J. Mullen, “3-D Graph Processor,” Workshop on High Performance Embedded Computing, September 2010, available at http://www.ll.mit.edu/HPEC/agendas/proc10/agenda.html.
5. W.S. Song, “Processor for Large Graph Algorithm Computations and Matrix Operations,” U.S. Patent pending, no. 13153490, June 6, 2011.
6. W.S. Song, “Systolic Merge Sorter,” U.S. Patent no. 8,190,943, May 29, 2012.
7. W.S. Song, “Multiprocessor Communication Networks,” U.S. Patent pending, no. 12703938, February 11, 2010.
8. W.S. Song, “Electromagnetic Coupling Connector for Three-Dimensional Electronic Circuits,” U.S. Patent no. 6,891,447, May 10, 2005.

About the Authors

William S. Song is a senior staff member in the Embedded and Open Systems Group. Since his arrival at Lincoln Laboratory in 1990, he has been working on high-performance sensor and VLSI signal processor technologies for adaptive sensor array applications. He has developed numerous advanced signal processing algorithms, architectures, real-time embedded processors, and sensor array systems. He holds 11 U.S. patents and has 4 others pending, has submitted 23 invention disclosures, and has authored or co-authored 26 journal and conference publications. He received a 2005 MIT Lincoln Laboratory Technical Excellence Award and is an IEEE Senior Member. He received bachelor’s, master’s, and doctoral degrees from the Massachusetts Institute of Technology in 1982, 1984, and 1989, respectively.

Jeremy Kepner is a senior staff member in the Computing and Analytics Group. He earned a bachelor’s degree with distinction in astrophysics from Pomona College. After receiving a Department of Energy Computational Science Graduate Fellowship in 1994, he obtained his doctoral degree from the Department of Astrophysics at Princeton University in 1998 and then joined MIT. His research is focused on the development of advanced libraries for the application of computing to a variety of data-intensive signal processing problems. He has published two books and numerous articles on this research. He is also the co-inventor of parallel Matlab, Parallel Vector Tile Optimizing Library (PVTOL), Dynamic Distributed Dimensional Data Model (D4M), and the Massachusetts Green High Performance Computing Center (MGHPCC).


Vitaliy Gleyzer has been a staff member in the Embedded and Open Systems Group at Lincoln Laboratory for four years. Prior to joining the Laboratory, he received his master’s degree in electrical and computer engineering from Carnegie Mellon University, with a research concentration on network architecture and network modeling. His current work and research interests are primarily focused on high-performance computing systems and embedded systems engineering.

Huy T. Nguyen is a staff member in the Embedded and Open Systems Group. He has been involved with architecting and leading the development of several high-performance, power-efficient signal processor systems and specialized accelerators for nontraditional signal processing. He has published 20 papers, co-authored two book chapters, and consulted on high-performance computing technologies. He received his bachelor’s degree at the University of Delaware in 1989 and joined Lincoln Laboratory in 1998 after earning his doctoral degree from the Georgia Institute of Technology in the area of low-power, very-large-scale integration for digital signal processing applications. Prior to pursuing his doctorate, he worked on real-time radar software at the Georgia Tech Research Institute.

Joshua I. Kramer is a technical staff member in the Cyber Systems and Technology Group. He has worked on low-power, high-performance circuit design and embedded computing architecture since joining Lincoln Laboratory in 2007. His current research interests are in anti-tamper, secure, and trusted computing architectures, and hardware-accelerated cryptography. He holds a doctorate in electrical engineering and a bachelor’s degree in computer engineering from the University of Delaware.
