Breadth First Search Vectorization on the Intel Xeon Phi
Mireya Paredes, Graham Riley, Mikel Luján
University of Manchester, Oxford Road, Manchester, M13 9PL, UK

arXiv:1604.02844v1 [cs.DC] 11 Apr 2016

ABSTRACT
Breadth First Search (BFS) is a building block for graph algorithms and has recently been used for large-scale analysis of information in a variety of applications, including social networks, graph databases and web searching. Due to its importance, a number of different parallel programming models and architectures have been exploited to optimize the BFS. However, due to the irregular memory access patterns and the unstructured nature of the large graphs, its efficient parallelization is a challenge. The Xeon Phi is a massively parallel architecture available as an off-the-shelf accelerator, which includes a powerful 512-bit vector unit with optimized scatter and gather functions. Given its potential benefits, work related to graph traversing on this architecture is an active area of research.
We present a set of experiments in which we explore architectural features of the Xeon Phi and how best to exploit them in a top-down BFS algorithm, but the techniques can be applied to the current state-of-the-art hybrid, top-down plus bottom-up, algorithms. We focus on the exploitation of the vector unit by developing an improved, highly vectorized OpenMP parallel algorithm, using vector intrinsics, and understanding the use of data alignment and prefetching. In addition, we investigate the impact of hyperthreading and thread affinity on performance, a topic that appears under-researched in the literature. As a result, we achieve what we believe is the fastest published top-down BFS algorithm on the version of the Xeon Phi used in our experiments. The vectorized BFS top-down source code presented in this paper can be made available on request as free-to-use software.

Keywords
graph algorithms; BFS; large scale; irregular; graph traversing; hyperthreading; thread affinity; prefetching

1. INTRODUCTION
Today, scientific experiments in many domains and large organizations can generate huge amounts of data, e.g. social networks, health-care records and bio-informatics applications [8]. Graphs are a good match for the analysis of such large datasets as they can abstract real-world networks. To process large graphs, different techniques have been applied, including parallel programming. The main challenge in the parallelization of graph processing is that large graphs are often unstructured and highly irregular, which limits their scalability and performance when executing on off-the-shelf systems [18].
Breadth First Search (BFS) is a building block of graph algorithms. Despite its simplicity, its parallelization has been a real challenge, but it remains a good candidate for acceleration. To date, different accelerators have been targeted to improve the performance of this algorithm, such as GPGPUs [11] and FPGAs [28]. The Intel Xeon Phi, also known as MIC (Intel's Many Integrated Core architecture), is a massively parallel off-the-shelf system consisting of up to 60 cores with 4-way simultaneous multi-threading (SMT), for a maximum of 240 logical cores [21]. As well as this thread-level parallelism, each core has a 512-bit vector unit, allowing data-level parallelism to be extracted using Single Instruction Multiple Data (SIMD) programming. We believe that, by exploiting both of these forms of parallelism, the Xeon Phi is an interesting platform for exploring parallel implementations of graph algorithms.
The goal of this study is to demonstrate, through experiments and analysis, the impact of using the Xeon Phi architecture to accelerate BFS on a single Xeon Phi device, as a step towards the multi-device solutions that will be needed to tackle very large graph-based datasets. As a starting point, we took the description of the top-down BFS algorithm in [9], which is summarised in Section 3. Although this offers a good starting point on the Xeon Phi, there are still some architectural features that need to be well understood in order to be exploited. The contribution of this paper is twofold: first, we present the results of a series of experiments demonstrating the benefit of successfully exploiting architectural features of the Xeon Phi, focusing on the vector unit (programmed via vector intrinsics), with data alignment and prefetching. In addition, we present the results of investigations into the impact of hyperthreading (Intel's term for SMT) and thread affinity when the Phi is under-populated with threads.
The structure of the paper is as follows. The Xeon Phi architecture is presented in Section 2. We present the procedure of vectorizing the BFS algorithm on the Xeon Phi, starting with the serial BFS algorithm in Section 3.1, followed by an initial parallel version in Section 3.2, which we call the non-simd version. Next, we discuss the simd version, which exploits the vector unit on the Xeon Phi, in Section 4. We then present performance comparisons between the non-simd and simd versions and explore the impact of parallelism at two levels of granularity (OpenMP threading and the vector unit), achieving better results for the native BFS top-down algorithm on the Xeon Phi than those previously presented in [10] and [24]. In Section 4 we also explore the impact of data prefetching and discuss the effect of thread affinity mapping, leading us to key optimizations and motivating future work on the possibility of using helper threads to improve performance. The experimental setup and the analysis of our results are presented in Sections 5 and 6. Related work is briefly discussed in Section 7, and conclusions and future work are discussed in Section 8.

2. THE XEON PHI ARCHITECTURE
The Intel Xeon Phi coprocessor used in this work is composed of 60 4-way SMT Pentium-based cores [21] and a main memory of 8 GB. Each core contains a powerful 512-bit vector processing unit and a cache memory divided into an L1 (32KB) and an L2 (512KB), kept fully coherent by a globally distributed tag directory (TD) coordinated by the cache coherence protocol.

[Figure 1: The Intel Xeon Phi microarchitecture — cores with their L2 caches, distributed tag directories (TD) and GDDR5 memory.]

The compiler can automatically vectorize certain parts of the code. Manual vectorization can be requested using the SIMD pragma directives supported by the compiler. The compiler also supports a wide range of intrinsics, which give the programmer low-level control of the vector unit.
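To make the two routes concrete, the small sketch below (illustrative only, not taken from the BFS implementation described in this paper) vectorizes the same integer loop first with an OpenMP SIMD pragma and then with explicit 512-bit intrinsics. The array names and sizes are placeholders, and the example assumes a compiler targeting the Xeon Phi (e.g. icc -mmic -qopenmp for native execution).

    /* Sketch of compiler-directed vs. explicit vectorization on the Xeon Phi. */
    #include <stdio.h>
    #include <immintrin.h>

    #define N 1024
    /* 64-byte alignment matches the 512-bit vector registers. */
    __attribute__((aligned(64))) static int a[N], b[N], c[N];

    /* Compiler-directed vectorization: the pragma asserts the loop is safe to vectorize. */
    static void add_pragma(void) {
        #pragma omp simd
        for (int i = 0; i < N; i++)
            c[i] = a[i] + b[i];
    }

    /* Explicit vectorization with intrinsics: each iteration handles 16 32-bit integers. */
    static void add_intrinsics(void) {
        for (int i = 0; i < N; i += 16) {
            __m512i va = _mm512_load_epi32(&a[i]);   /* aligned 512-bit loads */
            __m512i vb = _mm512_load_epi32(&b[i]);
            _mm512_store_epi32(&c[i], _mm512_add_epi32(va, vb));
        }
    }

    int main(void) {
        for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2 * i; }
        add_pragma();
        add_intrinsics();
        printf("c[10] = %d\n", c[10]);   /* expect 30 */
        return 0;
    }

The pragma leaves instruction selection to the compiler, whereas the intrinsic version fixes the vector width and memory access pattern explicitly, which is the level of control exploited later for the vectorized BFS.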
3. BREADTH FIRST SEARCH
The Breadth First Search (BFS) algorithm is one of the building blocks for graph analysis algorithms, including betweenness centrality, shortest path and connected components [4]. Given a graph G and a starting vertex s, the BFS systematically explores all the edges E of G to trace all the vertices V that can be reached from s [7], where |V| is the number of vertices and |E| is the number of edges. The output is a BFS spanning tree (bfs_tree) of all the vertices encountered from s.
The simplest sequential BFS algorithm uses a queue that contains the list of vertices waiting to be processed. Enqueue and dequeue operations on the queue can be implemented in constant time Θ(1) [16]. Thus, the running time of this serial BFS algorithm is O(V + E).
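A minimal sketch of this serial queue-based BFS is given below, assuming a compressed sparse row (CSR) graph representation (row_ptr and col_idx arrays) and a parent array bfs_tree; the names are illustrative and this is not the exact implementation used in this paper.

    /* Serial queue-based BFS over a CSR graph: row_ptr[v]..row_ptr[v+1]
     * indexes the neighbours of v in col_idx. bfs_tree[v] holds the parent
     * of v in the BFS spanning tree (-1 if unreached). */
    #include <stdlib.h>

    void bfs_serial(int n, const int *row_ptr, const int *col_idx,
                    int source, int *bfs_tree)
    {
        int *queue = malloc(n * sizeof(int));
        int head = 0, tail = 0;                 /* O(1) enqueue/dequeue */

        for (int v = 0; v < n; v++) bfs_tree[v] = -1;
        bfs_tree[source] = source;              /* the root is its own parent */
        queue[tail++] = source;

        while (head < tail) {
            int u = queue[head++];              /* dequeue */
            for (int e = row_ptr[u]; e < row_ptr[u + 1]; e++) {
                int w = col_idx[e];
                if (bfs_tree[w] == -1) {        /* first visit: record parent */
                    bfs_tree[w] = u;
                    queue[tail++] = w;          /* enqueue */
                }
            }
        }
        free(queue);
    }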
Despite the queue being simple and efficient, it has some drawbacks during parallelization. The queue can be a bottleneck, since it imposes a vertex processing order: the dequeue operation typically takes out the vertex that was added first. To address this problem, vertices can be partitioned into layers. A layer consists of the set of all vertices at the same distance from the source vertex. Processing vertices by layers avoids the order restriction imposed by the queue, allowing vertices to be explored in any order as long as they are in the same layer. However, each layer has to be processed in sequence; that is, all vertices with distance k (layer Lk) are processed before those in layer Lk+1.
Traversing the graph from the vertices in the top layer down to their children in the bottom layer is the conventional approach, known as top-down, whereas traversing from the vertices in the bottom layer to find their parents in the top layer is known as the bottom-up approach. The bottom-up concept was introduced by [3] in a hybrid approach that explores the graph in both directions, first traversing the graph top-down, then switching to the bottom-up approach in the middle layers, and finally swapping back to top-down.
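The layer-by-layer formulation can be sketched as a level-synchronous top-down BFS, shown below parallelized with OpenMP over the same assumed CSR representation; vertices within a layer are claimed with a compare-and-swap so that each receives exactly one parent. This is an illustrative sketch, not the vectorized simd implementation developed later in the paper.

    /* Level-synchronous top-down BFS: the current layer (frontier) is processed
     * in parallel, and newly discovered vertices form the next layer. */
    #include <stdlib.h>

    void bfs_top_down(int n, const int *row_ptr, const int *col_idx,
                      int source, int *bfs_tree)
    {
        int *curr = malloc(n * sizeof(int));    /* current layer (frontier) */
        int *next = malloc(n * sizeof(int));    /* next layer being built   */
        int curr_size = 0, next_size = 0;

        for (int v = 0; v < n; v++) bfs_tree[v] = -1;
        bfs_tree[source] = source;
        curr[curr_size++] = source;

        while (curr_size > 0) {
            next_size = 0;
            /* Vertices inside one layer may be explored in any order. */
            #pragma omp parallel for schedule(dynamic, 64)
            for (int i = 0; i < curr_size; i++) {
                int u = curr[i];
                for (int e = row_ptr[u]; e < row_ptr[u + 1]; e++) {
                    int w = col_idx[e];
                    /* Claim w exactly once across threads. */
                    if (bfs_tree[w] == -1 &&
                        __sync_val_compare_and_swap(&bfs_tree[w], -1, u) == -1) {
                        int pos = __sync_fetch_and_add(&next_size, 1);
                        next[pos] = w;
                    }
                }
            }
            /* Layers are processed strictly in sequence: swap frontiers. */
            int *tmp = curr; curr = next; next = tmp;
            curr_size = next_size;
        }
        free(curr);
        free(next);
    }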