Accurate and Efficient Sums and Dot Products in Julia


Chris Elrod (Eli Lilly, Email: [email protected])
François Févotte (TriScale innov, Palaiseau, France, Email: [email protected])

To cite this version: Chris Elrod, François Févotte. Accurate and Efficient Sums and Dot Products in Julia. 2019. hal-02265534v2.
HAL Id: hal-02265534, https://hal.archives-ouvertes.fr/hal-02265534v2
Preprint submitted on 14 Aug 2019 (v2), last revised 21 Aug 2019 (v3).

Abstract—This paper presents an efficient, vectorized implementation of various summation and dot product algorithms in the Julia programming language. These implementations are available under an open source license in the AccurateArithmetic.jl Julia package. Besides naive algorithms, compensated algorithms are implemented: the Kahan-Babuška-Neumaier summation algorithm, and the Ogita-Rump-Oishi simply compensated summation and dot product algorithms. These algorithms effectively double the working precision, producing much more accurate results while incurring little to no overhead, especially for large input vectors. This paper also builds upon this example to make a case for a more widespread use of Julia in the HPC community. Although the vectorization of compensated algorithms is not a particularly simple task, Julia makes it relatively easy and straightforward. It also allows structuring the code in small, composable building blocks, closely matching textbook algorithms yet efficiently compiled.

Index Terms—Vectorization, Compensated Algorithms, Julia programming language
I. INTRODUCTION

Computing the dot product of two vectors and, perhaps to a lesser extent, summing the elements of a vector, are two very common basic building blocks for more complex linear algebra algorithms. As such, any change in their performance is likely to affect the overall performance of scientific computing codes; any change in their accuracy is likely to induce a loss of reproducibility in overall computed results.

Such a lack of reproducibility is becoming more and more common as the performance of supercomputers increases, due to a combination of multiple factors. First, an increase in computational power allows for ever more expensive models (such as, for example, finer grids for the discretization of partial differential equations), which in turn generates larger vectors, whose summation is often more ill-conditioned. Second, the order in which operations are performed in modern systems is far from being deterministic and reproducible: vectorization, parallelism, and compiler optimizations are only a few examples of possible causes for unexpected changes in the execution order of a given program. Since floating-point (FP) arithmetic is not associative, such non-determinism in execution order entails non-reproducibility in computed results. For these reasons, accurate summation and dot product algorithms are sometimes useful.

However, increasing the algorithms' accuracy at the price of an increase in their computing cost is not acceptable in all cases. For programs that already use double precision (64-bit FP numbers), increasing the precision is not an option, since quadruple precision (128-bit FP numbers) remains too expensive for most purposes¹. On the other hand, hitting the memory wall means that many algorithms (and in particular BLAS1-like algorithms such as summation and dot product) are limited by the memory bandwidth in a wide range of vector lengths on modern systems. In turn, this means that CPUs are idling for a fraction of the computing time, which could instead be used to perform additional operations for free, as long as these operations do not involve additional memory accesses. Much work has therefore been devoted to computing sums and dot products in a way that is both fast and accurate, leading to efficient algorithms implemented in a variety of programming languages.

The work presented here comes from the observation² that, as of November 2018, no such algorithm was implemented in an efficient way in the Julia programming language. This paper describes an attempt made to implement accurate summation and dot product algorithms in Julia, in a way that is as efficient as possible, and in particular makes use of SIMD vector units. All the code described here is available under an open source license in the AccurateArithmetic.jl package. This paper will also try to make the argument that writing such efficient implementations of compensated algorithms is a relatively easy and straightforward task thanks to the nice properties of the Julia language.

Part II of this paper takes the viewpoint of computational sciences and discusses some aspects related to the use of vector instructions in fast implementations of summation algorithms. Part III then takes the point of view of floating-point arithmetic, and reviews a subclass of accurate summation and dot product algorithms: compensated algorithms. Nothing particularly new is presented in these parts, except maybe some specific details about interactions between vectorization and compensation (paragraph III-C) that have to be dealt with in order to gain the full benefit of both worlds: speed and accuracy. Part IV gives some insight about the Julia implementation of such compensated algorithms, and discusses the performance obtained on some test systems. These results are summarized in part V, along with some closing notes and outlooks.

¹ At least on x86-64 architectures, where quadruple precision is implemented in software.
² https://discourse.julialang.org/t/floating-point-summation/17785
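As a brief illustration of the reproducibility and accuracy issues discussed in the introduction, the following short Julia snippet (a self-contained example using only standard-library functionality, independent of AccurateArithmetic.jl) shows that FP addition is not associative, and that a naive summation of an ill-conditioned vector can lose all significant digits even though each individual operation is correctly rounded:

    # Floating-point addition is not associative: the two groupings below
    # round differently, so the result depends on the evaluation order.
    (0.1 + 0.2) + 0.3 == 0.1 + (0.2 + 0.3)   # false

    # Naive left-to-right summation loses all accuracy on an ill-conditioned
    # input: the two large terms cancel exactly, so the exact sum is 2.
    x = [1.0, 1e30, 1.0, -1e30]
    foldl(+, x)              # 0.0  (naive recursive summation in Float64)
    Float64(sum(big.(x)))    # 2.0  (reference computed in extended precision)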
II. THE COMPUTATIONAL SCIENCE POINT OF VIEW

For the sake of simplicity, let us only discuss here the case of the summation algorithm:

    sum(x) = Σ_{i=1}^{N} x_i,    where x ∈ ℝ^N.

Since a dot product can also be expressed as a summation, no loss of generality is entailed by this choice:

    dot(x, y) = Σ_{i=1}^{N} x_i y_i,    where x ∈ ℝ^N and y ∈ ℝ^N.

Algorithm 1 presents the most naive possible algorithm for a summation, in its scalar version. An accumulator a stores partial sums. During each loop iteration, a new vector element x_i is added to this accumulator. Note that this accumulation is performed using the computer arithmetic addition operation ⊕, which operates on FP numbers, rounds its result, and is therefore only an approximation of the mathematical + operator, which operates on real numbers. In the remainder of this paper, similar notations will be used for other operators, such as ⊖ denoting the FP subtraction.

Algorithm 1 Scalar, naive summation
 1: {Initialization:}
 2: a ← 0
 3: {Loop on vector elements:}
 4: for i ∈ 1 : N do
 5:   a ← a ⊕ x_i
 6: end for
 7: return a

In Algorithm 2, this algorithm is refined to make use of vector instructions. In the following, the system will be assumed to efficiently implement SIMD instructions on vectors (hereafter referred to as "packs", so that the term "vector" can unambiguously refer to mathematical vectors such as the operand x) of width W. For x86_64 architectures (which are the main target of the algorithms described here), the AVX extension, for example, supports efficiently operating on 256-bit registers holding W = 4 double-precision FP numbers. All variables denoted in bold font represent such packs, and are meant to be stored and manipulated as SIMD registers.

Algorithm 2 Vectorized, naive summation
 1: {Initialization:}
 2: a ← 0
 3: {Loop on full packs:}
 4: for j ∈ 1 : ⌊N/W⌋ do
 5:   i ← W(j − 1) + 1
 6:   p ← (x_i, x_{i+1}, x_{i+2}, x_{i+3})
 7:   a ← a ⊕ p
 8: end for
 9: {Reduction of SIMD accumulator:}
10: a ← vsum(a)
11: {Loop on remaining elements:}
12: for i ∈ W⌊N/W⌋ + 1 : N do
13:   a ← a ⊕ x_i
14: end for
15: return a

At the end of the main loop in Alg. 2, the SIMD accumulator needs to be reduced to a scalar: this is the purpose of the vsum function, which sums all W components of a SIMD register:

    vsum(a) = a[1] ⊕ a[2] ⊕ … ⊕ a[W],

where a[i] represents the i-th element in a SIMD pack a.

Although such an implementation is already much more complex than its scalar version, it is not enough to obtain optimal performance on modern hardware. Because of the loop-carried dependency on the accumulator and instruction latency in superscalar hardware architectures, full performance can only be obtained if multiple instructions can be running at the same time. This means breaking that sequential dependency between one iteration and the next; a traditional way to do so consists in partially unrolling the main loop. An illustration of this technique is given in Algorithm 3, where the loop is unrolled twice (U = 2). A consequence of this technique is that there are now as many SIMD accumulators as the level of partial unrolling U (2 in this case). These accumulators need to be summed together and reduced before proceeding with the accumulation of remaining scalar elements. If the number of packs ⌊N/W⌋ is not a multiple of U, an additional type of remainder needs to be handled: the full packs that were not handled in the partially unrolled loop. In the algorithm presented here, the SIMD accumulator a1 is used to accumulate these.

Figure 1 illustrates the gains to be expected from such a technique: for various vector sizes, the total summation time (normalized by the vector size) is plotted against the
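To make the structure of Algorithms 1 and 2 concrete, here is a minimal Julia sketch that emulates them with plain scalar code: naive_sum follows Algorithm 1, while packed_sum mimics Algorithm 2 by keeping a "pack" of W independent accumulator lanes, reducing them with a vsum-like fold, and handling the remaining elements in a scalar tail loop. The function names are illustrative only; the actual AccurateArithmetic.jl implementation relies on genuine SIMD registers plus the unrolling and compensation techniques discussed in the following parts.

    # Algorithm 1: scalar, naive summation.
    function naive_sum(x::AbstractVector)
        a = zero(eltype(x))
        for xi in x
            a += xi                          # a ← a ⊕ x_i
        end
        return a
    end

    # Algorithm 2, emulated in plain Julia with a tuple of W accumulator
    # lanes standing in for a SIMD register (assumes a 1-based vector).
    function packed_sum(x::AbstractVector, W::Int = 4)
        N = length(x)
        a = ntuple(_ -> zero(eltype(x)), W)  # SIMD-like accumulator a ← 0
        nfull = W * (N ÷ W)                  # elements covered by full packs
        for j in 1:(N ÷ W)                   # loop on full packs
            i = W * (j - 1) + 1
            p = ntuple(k -> x[i + k - 1], W) # p ← (x_i, …, x_{i+W−1})
            a = a .+ p                       # a ← a ⊕ p, lane-wise
        end
        s = foldl(+, a)                      # vsum: reduce the W lanes
        for i in (nfull + 1):N               # loop on remaining elements
            s += x[i]
        end
        return s
    end

Both sketches compute the same mathematical sum as Base's sum, differing only in accumulation order; that reorganization of the loop is precisely what makes vectorization (and, later, compensation) possible.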