Half-Precision Floating-Point Formats for PageRank: Opportunities and Challenges
Amir Sabbagh Molahosseini, EEECS School, Queen's University Belfast, Belfast, United Kingdom ([email protected])
Hans Vandierendonck, EEECS School, Queen's University Belfast, Belfast, United Kingdom ([email protected])

Abstract—Mixed-precision computation has been proposed as a means to accelerate iterative algorithms as it can reduce memory bandwidth requirements and increase cache effectiveness. This paper aims for further memory traffic reduction by introducing new half-precision (16-bit) data formats customized for PageRank. We develop two formats. The first format builds on the observation that the exponents of about 99% of PageRank values are tightly distributed around the exponent of the inverse of the number of vertices. The second format builds on the observation that 6 exponent bits are sufficient to capture the full dynamic range of PageRank values. Our floating-point formats provide less precision than the standard IEEE 754 formats, but sufficient dynamic range for PageRank. Experimental results on graphs of various sizes show that the proposed formats can achieve an accuracy of 1e-4, which is an improvement over the state of the art. Due to the random memory access patterns of the algorithm, performance improvements over our highly tuned baseline are 1.5% at best.

Keywords—Graph Processing, Customized Floating-Point Data Format, PageRank, Transprecision Computing.

I. INTRODUCTION

Graph processing plays an important role in big data analysis, such as the analysis of social networks [1] and web search [2]. Processing of large graphs poses significant performance challenges. Inherent, seemingly random memory access patterns, exacerbated by skewed degree distributions, cause poor cache utilization and significant exposure to memory bandwidth limits and memory access delay [3, 4]. For instance, over 60 percent of the energy consumed in a communication-bound graph algorithm such as PageRank (PR) is due to memory [5]. For iterative algorithms like PageRank, which are based on floating-point computations, reduced-precision and mixed-precision computation has been shown to reduce memory pressure [6, 7].

For many applications, standard floating-point (FP) formats such as the IEEE 754 FP32 and FP64 formats are wider than necessary. The IEEE FP16 format, implemented by NVidia since CUDA 7.5 [8] and more recently by Intel [9], provides a suitable format for workloads such as image processing. However, FP16 sacrifices both dynamic range and precision compared to FP32. Moreover, conversion between FP16 and FP32 may incur underflow/overflow problems. BFloat16 [10] was introduced for deep learning applications [11] and has the same exponent size as FP32, which eliminates the underflow/overflow concerns during conversion. The DLFloat16 format, with 6 exponent bits and 9 mantissa bits, is customized for deep learning applications [12]. Although these 16-bit data formats have been reported to be efficient for deep learning training, and [11] predicted industry-wide adoption of the BFloat16 format for many applications, narrow floating-point formats are highly sensitive to application characteristics. For instance, [13] proposed a 16-bit format for PageRank that captures the top two bytes of FP64, which makes conversion to/from FP64 efficient. However, they could execute only the first iteration of the power iteration method using this format. We conclude that narrow floating-point formats must be customized to the application.

This paper defines novel half-precision floating-point formats customized to PageRank. This goal involves two aspects. First, we need to understand the requirements of PageRank on dynamic range and precision. Analysis of the PageRank values, as well as the PageRank values divided by vertex degree, demonstrates that the exponents of these values are tightly clustered around a specific value that relates to the size of the graph. Based on this observation, we propose a half-precision format where the exponent is encoded tightly in 3 bits, leaving 13 bits for the mantissa (a sign bit is not required for storing PageRank values). This format poses challenges, as a very small fraction of values cannot be encoded in it and require a wider representation. We also present a second format, as we demonstrate that 6 exponent bits are sufficient to encode all PageRank values, even in very large graphs.

A second concern in the use of custom floating-point formats is the efficient implementation of arithmetic. Floating-point computation is fairly complex to emulate as bit-wise operations on a general-purpose instruction set [14]. It is more efficient to convert compact storage formats to a format for which arithmetic is supported by hardware, typically FP32 or FP64 [7, 13]. Conversions include changes in the bit width of the exponent and rounding of the mantissa. Including bit shifting and masking, conversion operations can easily add 10-16 instructions on a general-purpose instruction set. Our formats are designed to support efficient conversion in as few as 4 x86-64 and AVX assembly instructions. We discuss the challenges that need to be overcome to achieve this.
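To make the storage-to-compute conversion concrete, the sketch below expands a hypothetical 16-bit value with a 3-bit exponent field and a 13-bit mantissa (no sign bit) into an IEEE 754 FP32 bit pattern using only shifts, masks and an add. The field layout, the centring of the exponent window on the exponent of 1/|V|, and the name cpfp16_to_fp32 are assumptions for illustration; they are not the exact encodings defined in Section III, and values falling outside the 3-bit exponent window would need the wider fallback representation mentioned above.

```cpp
#include <cstdint>
#include <cstring>

// Illustrative sketch only: the exact bit layout, exponent bias and rounding
// of the proposed formats are defined in Section III and may differ.
// Assumed layout of the 16-bit value (no sign bit; PageRank values are >= 0):
//   bits 15..13 : 3-bit exponent field, centred on the exponent of 1/|V|
//   bits 12..0  : most significant 13 mantissa bits
static inline float cpfp16_to_fp32(std::uint16_t v,
                                   std::int32_t base_exp /* ~ exponent of 1/|V| */) {
    std::uint32_t exp_field = (v >> 13) & 0x7u;    // 3-bit exponent field
    std::uint32_t mantissa  = v & 0x1FFFu;         // 13-bit mantissa field
    // Assumed centring: stored values 0..7 map to exponents base_exp-4 .. base_exp+3.
    std::uint32_t biased    =
        static_cast<std::uint32_t>(base_exp + static_cast<std::int32_t>(exp_field) - 4 + 127);
    std::uint32_t fp32_bits = (biased << 23) | (mantissa << 10);  // sign bit stays 0
    float out;
    std::memcpy(&out, &fp32_bits, sizeof out);     // reinterpret the bits as FP32
    return out;
}
```

The reverse direction (FP32 to the compact format) would additionally round the dropped mantissa bits and check that the exponent fits in the 3-bit window.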
The remainder of this paper is organized as follows. Section II briefly reviews the PageRank algorithm and mixed-precision PageRank. Section III introduces the novel number formats. Section IV presents an evaluation of convergence and performance. Finally, Section V concludes the paper.

II. BACKGROUND

PageRank models the behavior of a random surfer on the internet as a stochastic process whereby links from one webpage to another are followed at random [15]. This formulation results in a linear equation where the eigenvector corresponding to the largest eigenvalue of a column-stochastic matrix describes the relative importance or relevance of each web page. This eigenvector is the solution of the PageRank problem. It is most commonly computed using the power iteration method. Here, an initial estimate of the eigenvector is iteratively improved by multiplying the column-stochastic matrix with the previous estimate of the eigenvector. This iteration is repeated until the residual error (the norm-1 difference between subsequent estimates) is less than a pre-defined threshold.
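As a concrete reference for the iteration scheme just described, here is a minimal power-iteration sketch over an adjacency-list graph. The function and parameter names are illustrative, dangling vertices are ignored, and this is not the baseline Algorithm 1 of Gleich [16] discussed next, which addresses additional numerical issues.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Minimal power-iteration PageRank following the description above.
// NOT the baseline Algorithm 1 from Gleich [16]: dangling vertices and
// other numerical refinements are deliberately omitted.
std::vector<double> pagerank(const std::vector<std::vector<int>>& out_edges,
                             double d, double threshold) {
    const std::size_t n = out_edges.size();
    std::vector<double> x(n, 1.0 / n), x_next(n);

    double residual = threshold;  // enter the loop at least once
    while (residual >= threshold) {
        // Every vertex starts from the teleportation term (1 - d) / n.
        std::fill(x_next.begin(), x_next.end(), (1.0 - d) / n);
        // Each vertex scatters its current rank evenly over its out-edges.
        for (std::size_t u = 0; u < n; ++u) {
            if (out_edges[u].empty()) continue;  // dangling vertex: ignored here
            const double share = d * x[u] / out_edges[u].size();
            for (int v : out_edges[u]) x_next[v] += share;
        }
        // Residual error: norm-1 difference between subsequent estimates.
        residual = 0.0;
        for (std::size_t v = 0; v < n; ++v) residual += std::fabs(x_next[v] - x[v]);
        x.swap(x_next);
    }
    return x;
}
```

The inner scatter over out-edges is the multiplication of the column-stochastic matrix with the previous estimate, expressed directly on the adjacency list.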
Calculating the PageRank vector is highly sensitive to rounding errors due to finite-precision calculations. Gleich [16] presents an account of the challenges that occur. We use his recommended algorithm as our baseline algorithm, depicted as Algorithm 1. This algorithm takes a graph G=(V,E) as input, where V is a set of vertices and E is a set of edges. The parameters d and Threshold represent the damping factor and the convergence threshold, respectively.

... the estimate using increasingly higher precision. Mixed-precision arithmetic could potentially incrementally increase the precision by adding 1 or 2 bits to the mantissa at a time [19].

Mixed-precision arithmetic was previously investigated for PageRank [13]. Efficient arithmetic is ensured through efficient conversion to hardware-supported arithmetic, in this case FP64. Three reduced-precision formats are proposed, containing the top 16, 32 or 48 bits of the FP64 number: the sign bit, 11 exponent bits, and a slice of the most significant mantissa bits to fill up the format. It is, however, worthwhile to use this 16-bit format only during the first power iteration [13]. After this, the format switches to 32 bits. We will show that 6 exponent bits are sufficient to hold PageRank values for very large graphs. As such, the 32-bit format with 11 exponent bits is fundamentally inferior to FP32 with 8 exponent bits for the PageRank problem. As we find that convergence is possible with FP32 using Gleich's algorithm, there is no need to consider the 48-bit or FP64 formats for PageRank.

Apart from a few works which have implemented PageRank with application-specific integrated circuits (ASICs) [20] and field-programmable gate arrays
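As an illustration of the reduced-precision formats of [13] described above, the sketch below keeps only the top two bytes of an FP64 bit pattern (the sign bit, the 11 exponent bits and the top 4 mantissa bits) and restores them with the dropped bits read back as zero. Plain truncation is assumed here; the rounding applied in [13] may differ, and the function names are illustrative.

```cpp
#include <cstdint>
#include <cstring>

// Keep only the most significant 16 bits of an FP64 value: the sign bit,
// the 11 exponent bits and the top 4 mantissa bits, as in the 16-bit
// format of [13]. Plain truncation is assumed; [13] may round instead.
static inline std::uint16_t fp64_to_top16(double x) {
    std::uint64_t bits;
    std::memcpy(&bits, &x, sizeof bits);
    return static_cast<std::uint16_t>(bits >> 48);   // top two bytes of the pattern
}

static inline double top16_to_fp64(std::uint16_t t) {
    std::uint64_t bits = static_cast<std::uint64_t>(t) << 48;  // dropped bits become zero
    double x;
    std::memcpy(&x, &bits, sizeof x);
    return x;
}
```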