Western University Scholarship@Western

Electronic Thesis and Dissertation Repository

3-19-2021 1:30 PM

Parallel Arbitrary-precision Integer Arithmetic

Davood Mohajerani, The University of Western Ontario

Supervisor: Moreno Maza, Marc, The University of Western Ontario. A thesis submitted in partial fulfillment of the requirements for the Doctor of Philosophy degree in Computer Science. © Davood Mohajerani 2021

Follow this and additional works at: https://ir.lib.uwo.ca/etd

Part of the Numerical Analysis and Scientific Computing Commons, and the Theory and Algorithms Commons

Recommended Citation
Mohajerani, Davood, "Parallel Arbitrary-precision Integer Arithmetic" (2021). Electronic Thesis and Dissertation Repository. 7674. https://ir.lib.uwo.ca/etd/7674

This Dissertation/Thesis is brought to you for free and open access by Scholarship@Western. It has been accepted for inclusion in Electronic Thesis and Dissertation Repository by an authorized administrator of Scholarship@Western. For more information, please contact [email protected].

Abstract

Arbitrary-precision integer arithmetic computations are driven by applications in solving systems of polynomial equations and public-key cryptography. Such computations arise when high precision is required (with large input values that fit into multiple machine words), or to avoid coefficient overflow due to intermediate expression swell. Meanwhile, the growing demand for faster computation alongside the recent advances in hardware technology have led to the development of a vast array of many-core and multi-core processors, accelerators, programming models, and language extensions (e.g., CUDA and OpenCL for GPUs, and OpenMP and Cilk for multi-core CPUs). The massive computational power of parallel processors makes them attractive targets for carrying out arbitrary-precision integer arithmetic. At the same time, developing parallel algorithms, followed by implementing and optimizing them as multi-threaded parallel programs, imposes a set of challenges. This work explains the current state of research on parallel arbitrary-precision integer arithmetic on GPUs and CPUs, and proposes a number of solutions for some of the challenging problems related to this subject.

Keywords: Arbitrary-precision integer arithmetic, FFT, GPU, Multi-core

Summary

Arbitrary-precision integer arithmetic computations are driven by applications in solving systems of polynomial equations and public-key cryptography. Such computations arise when high precision is required. Meanwhile, the growing demand for faster computation alongside the recent advances in hardware technology have led to the development of a vast array of many-core and multi-core processors, accelerators, programming models, and language extensions. The massive computational power of parallel processors makes them attractive targets for carrying out arbitrary-precision integer arithmetic. At the same time, developing parallel algorithms, followed by implementing and optimizing them as multi-threaded parallel programs, imposes a set of challenges. This work explains the current state of research on parallel arbitrary-precision integer arithmetic on GPUs and CPUs, and proposes a number of solutions for some of the challenging problems related to this subject.

Co-Authorship Statement

• Chapter 2 is joint work with Liangyu Chen, Svyatoslav Covanov, and Marc Moreno Maza. This work is published in [1]. The contributions of the thesis author include: a new GPU implementation, optimization of arithmetic operations on GPU, and experimental results (tables, diagrams, profiling information) for comparing the presented algorithms against computationally equivalent solutions on CPUs and GPUs.
• Chapter 3 is joint work with Svyatoslav Covanov, Marc Moreno Maza, and Linxiao Wang. This work is published in [2]. The contributions of the thesis author include: adaptation of convolution code (originally developed by Svyatoslav Covanov as part of the BPAS library) for multiplying arbitrary elements of a big prime field, specialized arithmetic for the CRT, implementation and tuning of base-case DFT functions, parallelization of the code using Cilk, and collection of experimental results.
• Chapter 4 is joint work with Alexander Brandt, Marc Moreno Maza, Jeeva Paudel, and Linxiao Wang. A preprint of this work is published in [3]. The contributions of the thesis author include: processing of the annotated CUDA kernel code, generation of instrumented binary code using the LLVM Pass Framework, development of a customized profiler using NVIDIA's EVENT API within the CUPTI API, and collection of experimental results.
• Chapter 5 is joint work with Marc Moreno Maza. The contributions of the thesis author include: algorithm design and analysis, implementation, code optimization, and collection of experimental results.

Acknowledgements

First and foremost, I would like to thank my supervisor, Professor Marc Moreno Maza. I am grateful for his advice, support, and providing me the opportunity to work on multiple fascinating problems. Through our discussions, I have had the chance to learn a handful of lessons, including, but not limited to, the emphasis on first principles, clear and concise expression of ideas, and trying to systematically model problems with algebraic structures. There were multiple times that I assumed a problem could not be further studied or optimized, and more than once I was proven wrong. This, in essence, has taught me to distinguish the real barriers from the ones that are a creation of my thinking. Everything that I have worked on has taught me another important lesson that I must always remember, and Voltaire said it best: "Perfect is the enemy of the good." I am grateful for the insightful comments and questions of the thesis examiners: Professor Michael Bauer, Professor Mark Daley, Professor Arash Reyhani-Masoleh, and Professor Éric Schost. I would like to thank the members of the Ontario Research Center for Computer Algebra (ORCCA) and the Computer Science Department of the University of Western Ontario. I am thankful to my colleagues and co-authors; I have learned a little bit of what to do and what not to do from each of you. Specifically, I am thankful to Svyatoslav Covanov for laying the theoretical foundation and further contributions to our joint work on big prime field FFT. I would also like to thank Alexander Brandt for his help with proofreading Chapter 5 of this thesis as well as our ISSAC 2019 paper (FFT on multi-core CPUs), although he was not a co-author of that work. Last but not least, I am very thankful to my family and friends for their endless support.

Contents

Abstract ii

Summary iii

Co-Authorship Statement iv

Acknowledgements v

List of Figures viii

List of Tables ix

1 Introduction 1
1.1 Background and motivation ...... 1
1.2 Challenges and objectives ...... 2

2 Big Prime Field FFT on GPUs 4
2.1 Introduction ...... 4
2.2 Complexity analysis ...... 5
2.3 Generalized Fermat numbers ...... 8
2.4 FFT Basics ...... 13
2.5 Blocked FFT on the GPU ...... 14
2.6 Implementation ...... 16
2.7 Experimentation ...... 17
2.8 Conclusion ...... 22
2.9 Appendix: modular methods and unlucky primes ...... 22

3 Big Prime Field FFT on Multi-core Processors 26
3.1 Introduction ...... 26
3.2 Generalized Fermat prime fields ...... 28
3.3 Optimizing multiplication in generalized Fermat prime fields ...... 29
3.4 A generic implementation of FFT over prime fields ...... 32
3.5 Experimentation ...... 37
3.6 Conclusions and future work ...... 44

4 KLARAPTOR: Finding Optimal Kernel Launch Parameters 45

4.1 Introduction ...... 45
4.2 Theoretical foundations ...... 48
4.3 KLARAPTOR: a dynamic optimization tool for CUDA ...... 54
4.4 An algorithm to build and deploy rational programs ...... 55
4.5 Runtime selection of thread block configuration for a CUDA kernel ...... 57
4.6 The implementation of KLARAPTOR ...... 59
4.7 Experimentation ...... 62
4.8 Conclusions and future work ...... 64

5 Arbitrary-precision Integer Multiplication on GPUs 66
5.1 Introduction ...... 66
5.2 Essential definitions for designing parallel algorithms ...... 67
5.3 Choice of algorithm for parallelization ...... 67
5.4 Problem definition ...... 68
5.5 A fine-grained parallel multiplication algorithm ...... 69
5.6 Complexity analysis ...... 70
5.7 Experimentation ...... 75
5.8 Discussion ...... 80

6 Conclusion 81

Bibliography 82

Curriculum Vitae 89

List of Figures

2.1 Speedup diagram for computing the benchmark for a vector of size N = K^e (K = 16) for P3 : (2^63 + 2^34)^8 + 1 ...... 21
2.2 Speedup diagram for computing the benchmark for a vector of size N = K^e (K = 32) for P4 : (2^62 + 2^36)^16 + 1 ...... 21
2.3 Running time of computing DFT_N with N = K^4 on a GTX 760M GPU ...... 21

3.1 Ratio (t/t_gmp) of average time spent in one multiplication operation measured during computation of FFT over Z/pZ on vectors of size N = K^e ...... 42

4.1 Comparing kernel execution time (log-scaled) for the thread block configuration chosen by KLARAPTOR versus the minimum and maximum times as determined by an exhaustive search over all possible configurations. Kernels are part of the PolyBench/GPU benchmark suite and executed on (1) a GTX 1080Ti with a data size of N = 8192 (except convolution3d with N = 1024), and (2) a GTX 760M with a data size of N = 2048 (except convolution3d with N = 512 and gemm with N = 1024) ...... 46
4.2 Rational program (presented as a flow chart) for the calculation of active blocks in CUDA ...... 52
4.3 Comparing times (log-scaled) for (1) compile-time optimization steps of KLARAPTOR, (2) exhaustive search over all thread block configurations, the execution time for a kernel given (3) the best thread block configuration, and (4) the worst thread block configuration. Exhaustive search is given as a sum for values up to N = 8192 (except convolution3d with N = 1024) ...... 64

5.1 Estimated running-time ratio for various values of k and η ...... 76
5.2 Estimated degree of parallelism for various values of k with s = 32 and η = 5 ...... 76
5.3 The average running time (in milliseconds) for computing batches of k = 256 digit integer multiplications using OptMult and M5Mult ...... 77
5.4 Comparing the ratio of the average running time of OptMult to the average running time of M5Mult for computing batches of k = 256 digit integer multiplications ...... 78
5.5 Comparing the ratio of the running time of GMP to the running time of M5Mult for computing batches of N integer multiplications of 32 ≤ k ≤ 128 digits ...... 79
5.6 Comparing the ratio of the running time of GMP to the running time of M5Mult for computing batches of N integer multiplications of 256 ≤ k ≤ 1024 digits ...... 79

List of Tables

2.1 SRGFNs of practical interest ...... 9
2.2 Running time of computing the benchmark for N = K^e on GPU (timings in milliseconds) ...... 19
2.3 Profiling results for computing base-case DFT_K on a GTX 760M GPU (collected using NVIDIA nvprof) ...... 20
2.4 Definitions of NVIDIA nvprof metrics according to [4] ...... 20
2.5 Running time of computing the benchmark for N = K^e using sequential C code on CPU (timings in milliseconds) ...... 22
2.6 Speedup ratio (T_CPU/T_GPU) for computing the benchmark for N = K^e for P3 and P4 (timings in milliseconds) ...... 22

3.1 The set of big primes of different sizes which are used for experimentations ...... 39
3.2 The running time of computing 10^6 modular multiplications in Z/pZ for P8, P16, P32, and P64 (measured on Intel-i7-7700K) ...... 39
3.3 Time (in milliseconds) and percentage (%) of the total time spent in different steps of computing 10^6 GFPF multiplications of arbitrary elements in Z/pZ for primes P8, P16, P32, and P64 (measured on Intel-i7-7700K) ...... 40
3.4 The running time (in milliseconds) and ratio (t_GFPF/t_GMP) of serial and parallel computation of FFT on vectors of size N = K^e over Z/pZ for P4, P8, P16, P32, P64, and P128 (measured on Intel-i7-7700K) ...... 41
3.5 The running time (in milliseconds) and ratio (t_GFPF/t_GMP) of serial and parallel computation of FFT on vectors of size N = K^e over Z/pZ for P4, P8, P16, P32, P64, and P128 (measured on Xeon-X5650) ...... 41
3.6 Time spent (milliseconds) in different steps of serial and parallel computation of DFT of size N = K^3 over Z/pZ, for prime P32 (K = 2k = 64), measured on Intel-i7-7700K ...... 43
3.7 Ratio (t_serial/t_parallel) for serial vs. parallel execution of each implementation for N = K^e (K = 2k, e = 3), measured on both Intel-i7-7700K and Xeon-X5650 with and without hyper-threading enabled ...... 43

4.1 KLARAPTOR vs. exhaustive search for thread block configuration choice for kernels in Polybench/GPU ...... 63

5.1 Comparison of algorithms for multiplying polynomials f(x), g(x) ∈ R[x] of degree less than k ...... 66

5.2 Comparison of algorithms for multiplying k machine-word integers ...... 67
5.3 Relative cost of arithmetic operations ...... 71
5.4 Profiling results for computing a batch of N = 2^10 integers of size k = 256 digits using M5Mult with s = 256 (λ = k/s = 1), collected on NVIDIA GTX 1080Ti ...... 77
5.5 Profiling results for computing a batch of N = 2^10 integers of size k = 256 digits using OptMult from the CUMODP library, collected on NVIDIA GTX 1080Ti ...... 78

1 Introduction

1.1 Background and motivation

Arbitrary-precision integer arithmetic is driven by applications in solving systems of polynomial equations and cryptography. Those arithmetic calculations arise when high precision is required, either because of large input values that fit into multiple machine words, or because of possible coefficient overflow due to intermediate expression swell. The main difficulty with the implementation of arbitrary-precision arithmetic is to sharply control hardware resources, which translates into scheduling and parallelization challenges. Meanwhile, the growing demand for faster computation alongside the recent advances in hardware technology have led to the development of a vast array of many-core and multi-core processors, accelerators, programming models, and language extensions (e.g., CUDA and OpenCL for GPUs, and OpenMP and Cilk for multi-core CPUs). The massive computational power of parallel processors, especially GPUs, makes them viable targets for carrying out arbitrary-precision integer arithmetic. At the same time, developing parallel algorithms, followed by implementing and optimizing them as multi-threaded parallel programs, imposes a set of challenges. This thesis explains the current state of research on a number of problems in arbitrary-precision arithmetic on CUDA-enabled GPUs as well as multi-core CPUs, and proposes a number of solutions for some of the challenging problems related to this subject. The solutions include parallel algorithms, complexity analysis, experimental results, and finally, critical implementation tricks for each problem. Combining the solutions together, the goal is to maximize the performance of arbitrary-precision integer arithmetic on parallel hardware. This work is inspired by previous research papers, algorithms, and software libraries in code generation and optimization such as SPIRAL [5] and FFTW [6, 7], auto-tuning such as ATLAS [8], and mathematical libraries such as GMP [9], FLINT [10], and NTL [11]. Note that the emphasis on arbitrary-precision arithmetic is to distinguish the proposed solutions from those for fixed multi-precision arithmetic, where the implementation is specifically tuned for numbers that fit in s machine words, where s is a prescribed and small power of 2, typically between 1 and 8. In our work, this number s of machine words is
• either prescribed in advance, but the value of s can be arbitrarily large, or


• not prescribed in advance, thus implying that, for an arithmetic operation, input and output numbers may use different values of s.
In the former case, our arithmetic operations take place in a prime field Z/pZ where p fits into multiple machine words. Meanwhile, in the latter case, we work over the ring Z of integers.

1.2 Challenges and objectives

In this section, we first review the common objectives among the subjects that we have studied. Then, we provide a brief summary of the problems and the proposed solutions.

Common objectives

The primary objective is an end-to-end optimization effort for better use of hardware resources; to put it another way, maximizing performance by minimizing the running time throughout the entire system. More specifically, the main focus in each of the studied problems is to provide a set of ideas, implementation tricks, and experimental results in the following order of priority:
• to design new algorithms, or to adapt existing ones, for parallel architectures,
• to use the memory hierarchy efficiently in order to minimize the communication overhead, and finally,
• to apply device-specific optimizations to reach peak performance on a device; this includes, but is not limited to, loop unrolling, kernel decomposition, using inline assembly, and writing code with respect to the way the hardware works (e.g., taking into account the scheduler, instruction-level parallelism, and pipelining features of the device).

Big prime field FFT on GPUs

We consider prime fields of large characteristic Z/pZ where p fits on k machine words and k is a power of 2. When the characteristic of these fields is restricted to a subclass of the generalized Fermat numbers, we show that arithmetic operations in such fields offer attractive performance, both in terms of algebraic complexity and parallelism. In particular, these operations can be vectorized, leading to efficient implementation of fast Fourier transforms on graphics processing units. This work demonstrates the potential of GPUs and their huge computational capacity for tackling an essential computational algebra problem, that is, directly computing FFT over large prime fields as a competitive alternative to modular computation of FFT based on the Chinese Remainder Theorem (CRT). We explain more details in Chapter 2.

Big prime field FFT on Multi-core

This work extends the previous study realized on GPUs to multi-core processors. In this new context, we overcome the less fine control of hardware resources by successively using FFT in support of the multiplication in those fields. We obtain favorable speedup factors (up to 6.9x on a 6-core, 12-thread node, and 4.3x on a 4-core, 8-thread node) of our parallel implementation compared to the serial implementation for the overall application, thanks to the low memory footprint and the sharp control of arithmetic instructions of our implementation of generalized Fermat prime fields. We explain more details in Chapter 3.

KLARAPTOR: A Tool for Dynamically Finding Optimal Kernel Launch Parameters Targeting CUDA Programs

We present KLARAPTOR (Kernel LAunch parameters RAtional Program estimaTOR), a tool built on top of the LLVM Pass Framework and the NVIDIA CUPTI API to dynamically determine the optimal values of the kernel launch parameters of a CUDA kernel. We describe a technique to build, at the compile-time of a CUDA program, a so-called rational program. The rational program, based on some performance prediction model, and knowing particular data and hardware parameters at runtime, can be executed to automatically and dynamically determine the values of launch parameters for the CUDA program that will yield nearly optimal performance. Our underlying technique could be applied to parallel programs in general, given a performance prediction model which accounts for program and hardware parameters. We have implemented and tested our technique in the context of GPU kernels written in CUDA. We explain more details in Chapter 4.

Arbitrary-precision Integer Multiplication on GPUs

In this work, we propose a new fine-grained parallel algorithm for multiplying arbitrary-precision integers of k digits on GPUs. This solution is based on the classical O(k^2) algorithm. We explain more details in Chapter 5.

2 Big Prime Field FFT on GPUs

2.1 Introduction

Prime field arithmetic plays a central role in computer algebra by supporting computation in Galois fields. The prime fields that are used in computer algebra systems, in particular in the implementation of modular methods, are often of small characteristic, that is, based on prime numbers that fit in a machine word. Increasing precision beyond the machine word size can be done via the Chinese Remainder Theorem (CRT) or Hensel's Lemma. However, using machine-word size, thus small, prime numbers yields major issues in certain modular methods, in particular for solving systems of non-linear equations. Indeed, in such circumstances, the so-called unlucky primes are to be avoided; see for instance [12, 13] as well as Section 2.9. This makes using larger primes desirable. We consider prime fields of large characteristic, typically fitting on k machine words, where k is a power of 2. In practice, k typically ranges from 2 to 1024. When the characteristic of these fields is restricted to a subclass of the generalized Fermat numbers, we show that arithmetic operations in such fields offer attractive performance both in terms of algebraic complexity and parallelism. In particular, these operations can be vectorized, leading to efficient implementation of fast Fourier transforms on graphics processing units (GPUs). We present algorithms for arithmetic operations in a “big” prime field Z/pZ, where p is a generalized Fermat number of the form p = r^k + 1, where r fits a machine word and k is a power of 2. We report on a GPU implementation of those algorithms as well as a GPU implementation of a fast Fourier transform (FFT) over such a big prime field. Our experimental results show that
1. computing an FFT of size N over a big prime field, for p fitting on k 64-bit machine words, and
2. computing 2k FFTs of size N over a small prime field (that is, where the prime fits a 32-bit half-machine-word) followed by a combination (i.e., CRT-like) of those FFTs
are two competitive approaches in terms of running time. Since the former approach has the advantage of reducing the occurrence of unlucky primes when applying modular methods (in particular in the area of polynomial system solving), we view this experimental observation as a promising result.


The reasons for a GPU implementation are as follows. First, the model of computation and the hardware performance provide interesting opportunities for big prime field arithmetic, in particular in terms of vectorization of the program code. Secondly, highly optimized FFTs over small prime fields have been implemented on GPUs by Wei Pan [14, 15] in the CUMODP library (see www.cumodp.org), and we use them in our experimental comparison. Section 2.7 reports on various comparative experimentations. First, a comparison of the above two approaches implemented on GPU, exhibiting an advantage for the FFT over a big prime field. Second, a comparison between the same two approaches implemented on a single-core CPU, exhibiting an advantage for the CRT-based FFT over small prime fields. Third, from the two previous comparisons, one deduces a comparison of the FFT over a big prime field (resp. the CRT-based FFT over small prime fields) implemented on GPU and CPU, exhibiting a clear advantage for the GPU implementations. Overall, the big prime field FFT on the GPU is the best approach. A discrete Fourier transform (DFT) over Z/pZ, when p is a generalized Fermat prime, can be seen as a generalization of the FNT (Fermat number transform), which is a specific case of the NTT (number theoretic transform). However, the computation of a DFT over Z/pZ implies additional considerations, which are not taken into account in the literature on NTT or FNT computations [16, 17]. The computation of an NTT can be done via various methods used for a DFT, among them the radix-2 Cooley-Tukey, for example. However, the final complexity depends on the way a given DFT is computed. It appears that, in the context of generalized Fermat primes, there is a better choice than the radix-2 Cooley-Tukey. The method used in the present chapter is related to the article [18], which is derived from Fürer's algorithm [19] for the multiplication of large integers. The practicality of this latter algorithm is an open question, and, in fact, the work reported here is a practical contribution responding to this open question. The paper [16] discusses the idea of using the Fermat number transform for computing convolutions, thus working modulo numbers of the form F = 2^b + 1, where b is a power of 2. This is an effective way to avoid the round-off error caused by twiddle factor multiplication in computing DFT over the field of complex numbers. The paper [17] considers generalized Fermat Mersenne (GFM) numbers, that is, primes of the form (q^{p^n} − 1)/(q − 1) where, typically, q is 2 and both p and n are small. These numbers are different from the primes used in our work, which have the form r^k + 1, where r is typically machine-word long and k is a power of 2, so that r is a 2k-th primitive root of unity; see Section 2.3.

2.2 Complexity analysis

Consider a prime field Z/pZ and N, a power of 2, dividing p − 1. Then, the finite field Z/pZ admits an N-th primitive root of unity (see Section 2.4 for this notion). Denote by ω such an element. Let f ∈ Z/pZ[x] be of degree at most N − 1. Then, computing the DFT of f at ω via

an FFT, following the standard 2-way divide-and-conquer algorithm (see Chapter 8 in [20]), amounts to:
1. N log(N) additions in Z/pZ,
2. (N/2) log(N) multiplications by a power of ω in Z/pZ.
If the bit-size of p is k machine words, then
1. each addition in Z/pZ costs O(k) machine-word operations,
2. each multiplication by a power of ω costs O(M(k)) machine-word operations,
where n ↦ M(n) is a multiplication time as defined in [20]. Therefore, multiplication by a power of ω becomes a bottleneck as k grows. To overcome this difficulty, we consider the following trick proposed by Martin Fürer in [19, 21]. We assume that N = K^e holds for some “small” K, say K = 32, and an integer e ≥ 2. Further, we define η = ω^{N/K}, with J = K^{e−1}, and assume that multiplying an arbitrary element of Z/pZ by η^i, for any i = 0, ..., K − 1, can be done within O(k) machine-word operations. Consequently, every arithmetic operation (addition, multiplication) involved in a DFT of size K, using η as a primitive root, amounts to O(k) machine-word operations. Therefore, such a DFT of size K can be performed with O(K log(K) k) machine-word operations. As we shall see in Section 2.3, this latter result holds whenever p is a so-called generalized Fermat number. Returning to the DFT of size N at ω and using the factorization formula of Cooley and Tukey, we have

DFT_{JK} = (DFT_J ⊗ I_K) D_{J,K} (I_J ⊗ DFT_K) L_J^{JK},    (2.1)

see Section 2.4. Hence, the DFT of f at ω is essentially performed by:
1. K^{e−1} DFTs of size K (that is, DFTs on polynomials of degree at most K − 1),
2. N multiplications by a power of ω (coming from the diagonal D_{J,K}), and
3. K DFTs of size K^{e−1}.

Unrolling Formula (2.1), so as to replace DFT_J by DFT_K and the other linear operators involved (the diagonal matrix D and the permutation matrix L), one can see that a DFT of size N = K^e reduces to:
1. e K^{e−1} DFTs of size K, and
2. (e − 1) N multiplications by a power of ω.
Recall that the assumption on the cost of a multiplication by η^i, for 0 ≤ i < K, brings the cost of one DFT of size K to O(K log_2(K) k) machine-word operations. Hence, all the DFTs of size K together amount to O(e N log_2(K) k) machine-word operations, that is, O(N log_2(N) k) machine-word operations. Meanwhile, the total cost of the multiplications by a power of ω is O(e N M(k)) machine-word operations, that is, O(N log_K(N) M(k)) machine-word operations. Indeed, multiplying an arbitrary element of Z/pZ by an arbitrary power of ω requires O(M(k)) machine-word operations. Therefore, under our assumption, a

DFT of size N at ω amounts to

O(N log_2(N) k + N log_K(N) M(k))    (2.2)

machine-word operations. When using generalized Fermat primes, we have K = 2k and the above estimate becomes

O(N log_2(N) k + N log_k(N) M(k))    (2.3)

The second term in the big-O notation dominates the first one. However, we keep both terms for reasons that will appear shortly. Without our assumption, as discussed earlier, the same DFT would run in O(N log_2(N) M(k)) machine-word operations. Therefore, using generalized Fermat primes brings a speedup factor of log(K) w.r.t. the direct approach using arbitrary prime numbers. At this point, it is natural to ask what would be the cost of a comparable computation using small primes and the CRT. To be precise, let us consider the following problem. Let p_1, ..., p_k be pairwise different prime numbers of machine-word size and let m be their product. Assume that N divides each of p_1 − 1, ..., p_k − 1, such that each of the fields Z/p_1Z, ..., Z/p_kZ admits an N-th primitive root of unity, ω_1, ..., ω_k. Then, ω = (ω_1, ..., ω_k) is an N-th primitive root of unity in Z/mZ. Indeed, the ring Z/p_1Z ⊗ ··· ⊗ Z/p_kZ is a direct product of fields. Let f ∈ Z/mZ[x] be a polynomial of degree N − 1. One can compute the DFT of f at ω in three steps:

1. Compute the images f_1 ∈ Z/p_1Z[x], ..., f_k ∈ Z/p_kZ[x] of f.
2. Compute the DFT of f_i at ω_i in Z/p_iZ[x], for i = 1, ..., k.
3. Combine the results using the CRT so as to obtain a DFT of f at ω.
The first and the third steps above run within O(N × M(k) log_2(k)) machine-word operations, meanwhile the second one amounts to O(k × N log(N)) machine-word operations, yielding a total of

O(N log_2(N) k + N M(k) log_2(k))    (2.4)

These estimates yield a running-time ratio between the two approaches of log(N)/log_2^2(k), which suggests that for k large enough the big prime field approach may outperform the CRT-based approach. We believe that this analysis is part of the explanation for the observation that the two approaches are, in fact, competitive in practice, as we shall see in Section 2.7. We conclude this section by observing that, in the above, we have focused our discussion on algebraic complexity, thus not considering the question of cache complexity. We note that for small prime fields, FFTs that are optimal in terms of cache complexity can be derived easily from the results of [22]. For big prime fields, the same results could be adapted to derive cache-complexity-optimal FFTs. We leave that for future work. Nevertheless, we can already observe that when the big prime is large enough for one multiplication in the big prime field to fully occupy the L1 cache, then the naive 2-way divide-and-conquer FFT becomes essentially cache complexity optimal.
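To make the ratio explicit, one can compare the dominant second terms of (2.3) and (2.4); the following is a sketch of the arithmetic, using log_k(N) = log_2(N)/log_2(k):

\[
\frac{N \,\log_k(N)\, M(k)}{N\, M(k)\, \log_2(k)}
 \;=\; \frac{\log_k(N)}{\log_2(k)}
 \;=\; \frac{\log_2(N)}{\log_2(k)\,\log_2(k)}
 \;=\; \frac{\log(N)}{\log_2^2(k)} .
\]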

2.3 Generalized Fermat numbers

The n-th Fermat number, denoted by F_n, is given by F_n = 2^{2^n} + 1. This sequence plays an important role in number theory and, as mentioned in the introduction, in the development of asymptotically fast algorithms for integer multiplication [23, 21]. Arithmetic operations modulo a Fermat number are simpler than modulo an arbitrary positive integer. In particular, 2 is a 2^{n+1}-th primitive root of unity modulo F_n. Unfortunately, F_4 is the largest Fermat number which is known to be prime. Hence, when computations require the coefficient ring to be a field, Fermat numbers are no longer interesting. This motivates the introduction of other families of Fermat-like numbers; see, for instance, Chapter 2 in the textbook Guide to Elliptic Curve Cryptography [24]. Numbers of the form a^{2^n} + b^{2^n}, where a > 1, b ≥ 0, and n ≥ 0, are called generalized Fermat numbers. An odd prime p is a generalized Fermat number if and only if p is congruent to 1 modulo 4. The case b = 1 is of particular interest and, by analogy with the ordinary Fermat numbers, it is common to denote the generalized Fermat number a^{2^n} + 1 by F_n(a). So 3 is F_0(2). We call a the radix of F_n(a). Note that Landau's fourth problem asks if there are infinitely many generalized Fermat primes F_n(a) with n > 0. In the finite ring Z/F_n(a)Z, the element a is a 2^{n+1}-th primitive root of unity. However, when using binary representation for integers on a computer, arithmetic operations in Z/F_n(a)Z may not be as easy to perform as in Z/F_nZ. This motivates the following.

Definition 1 We call a sparse radix generalized Fermat number any integer of the form F_n(r) where r is either 2^w + 2^u or 2^w − 2^u, for some integers w > u ≥ 0. In the former case, we denote F_n(r) by F_n^+(w, u) = F_n(2^w + 2^u) and in the latter by F_n^−(w, u) = F_n(2^w − 2^u).

Table 2.1 lists sparse radix generalized Fermat numbers (SRGFNs) that are prime. For each such number p, we give the largest power of 2 dividing p − 1, that is, the maximum length N of a vector to which a radix-K FFT algorithm, where K is an appropriate power of 2, can be applied.

Notation 1 In the sequel, we consider p = F_n(r), a fixed SRGFN. We denote by 2^e the largest power of 2 dividing p − 1 and we define k = 2^n, so that p = r^k + 1 holds.

As we shall see in the sequel of this section, for any positive integer N which is a power of 2 such that N divides p − 1, one can find an N-th primitive root of unity ω ∈ Z/pZ such that multiplying an element x ∈ Z/pZ by ω^{i(N/2k)}, for 0 ≤ i < 2k, can be done in linear time w.r.t. the bit size of x. Combining this observation with an appropriate factorization of the DFT transform on N points over Z/pZ, we obtain an efficient FFT algorithm over Z/pZ.

Table 2.1: SRGFNs of practical interest.

p | max{2^e s.t. 2^e | p − 1}
(2^63 + 2^53)^2 + 1 | 2^106
(2^64 − 2^50)^4 + 1 | 2^200
(2^63 + 2^34)^8 + 1 | 2^272
(2^62 + 2^36)^16 + 1 | 2^576
(2^62 + 2^56)^32 + 1 | 2^1792
(2^63 − 2^40)^64 + 1 | 2^2560
(2^64 − 2^28)^128 + 1 | 2^3584

2.3.1 Representation of Z/pZ

We represent each element x ∈ Z/pZ as a vector x⃗ = (x_{k−1}, x_{k−2}, ..., x_0) of length k and with non-negative integer coefficients such that we have

x ≡ x_{k−1} r^{k−1} + x_{k−2} r^{k−2} + ··· + x_0 mod p.    (2.5)

This representation is made unique by imposing the following constraints:

1. either x_{k−1} = r and x_{k−2} = ··· = x_0 = 0,
2. or 0 ≤ x_i < r for all i = 0, ..., (k − 1).
We also map x to a univariate integer polynomial f_x ∈ Z[T] defined by f_x = Σ_{i=0}^{k−1} x_i T^i, such that x ≡ f_x(r) mod p. Now, given a non-negative integer x < p, we explain how the representation x⃗ can be computed. The case x = r^k is trivially handled, hence we assume x < r^k. For a non-negative integer z such that z < r^{2^i} holds for some positive integer i ≤ n = log_2(k), we denote by vec(z, i) the unique sequence of 2^i non-negative integers (z_{2^i−1}, ..., z_0) such that we have 0 ≤ z_j < r and z = z_{2^i−1} r^{2^i−1} + ··· + z_0. The sequence vec(z, i) is obtained as follows:
1. if i = 1, we have vec(z, i) = (q, s),
2. if i > 1, then vec(z, i) is the concatenation of vec(q, i − 1) followed by vec(s, i − 1),

where q and s are the quotient and the remainder in the Euclidean division of z by r^{2^{i−1}}. Clearly, vec(x, n) = x⃗ holds. We observe that the sparse binary representation of r facilitates the Euclidean division of a non-negative integer z by r, when performed on a computer. Referring to the notations in Definition 1, let us assume that r is 2^w + 2^u, for some integers w > u ≥ 0. (The case 2^w − 2^u would be handled in a similar way.) Let z_high and z_low be the quotient and the remainder in

the Euclidean division of z by 2^w. Then, we have

z = 2^w z_high + z_low = r z_high + z_low − 2^u z_high.    (2.6)

Let s = z_low − 2^u z_high and q = z_high. Three cases arise:
(S1) if 0 ≤ s < r, then q and s are the quotient and remainder of z by r,
(S2) if r ≤ s, then we perform the Euclidean division of s by r and deduce the desired quotient and remainder,
(S3) if s < 0, then (q, s) is replaced by (q − 1, s + r) and we go back to Step (S1).
Since the binary representation of r^2 can still be regarded as sparse, a similar procedure can be used for the Euclidean division of a non-negative integer z by r^2. For higher powers of r, we believe that Montgomery multiplication [25] is the way to go, though this remains to be explored.
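As an illustration, here is a minimal sketch of this division trick (the function and parameter names are ours, not from the thesis code): it divides z by r = 2^W + 2^U following Steps (S1)-(S3), assuming W < 64 (as for P3 and P4), a toolchain with 128-bit integer support, and digit-sized inputs such as those arising in Algorithm 2 below (z < 2r), for which the adjustment loops run at most a few times.

    typedef unsigned long long u64;

    // Euclidean division of z by r = 2^W + 2^U (W > U) without a hardware
    // divide: start from q = z_high = floor(z / 2^W) and s = z_low - 2^U * q,
    // so that z = q*r + s, then adjust s into the range [0, r).
    __host__ __device__ void div_by_r(unsigned __int128 z, int W, int U,
                                      u64 *q_out, u64 *s_out)
    {
        u64 r = (1ULL << W) + (1ULL << U);
        unsigned __int128 mask = (((unsigned __int128)1) << W) - 1;
        __int128 q = (__int128)(z >> W);                // z_high
        __int128 s = (__int128)(z & mask) - (q << U);   // z_low - 2^U * z_high
        while (s < 0)            { q -= 1; s += r; }    // case (S3)
        while (s >= (__int128)r) { q += 1; s -= r; }    // case (S2)
        *q_out = (u64)q;  *s_out = (u64)s;              // case (S1): z = q*r + s
    }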

2.3.2 Finding primitive roots of unity in Z/pZ

Notation 2 Let N be a power of 2, say 2^ℓ, dividing p − 1, and let g ∈ Z/pZ be an N-th primitive root of unity.

Recall that such an N-th primitive root of unity can be obtained by a simple probabilistic procedure. Write p = qN + 1. Pick a random α ∈ Z/pZ and let ω = α^q. Fermat's little theorem implies that either ω^{N/2} = 1 or ω^{N/2} = −1 holds. In the latter case, ω is an N-th primitive root of unity. In the former, another random α ∈ Z/pZ should be considered. In our various software implementations of finite field arithmetic [26, 27, 28], this procedure finds an N-th primitive root of unity after a few tries and has never been a performance bottleneck. In the following, we consider the problem of finding an N-th primitive root of unity ω such that ω^{N/2k} = r holds. The intention is to speed up the portion of the FFT computation that requires multiplying elements of Z/pZ by powers of ω.

Proposition 1 In Z/pZ, the element r is a 2k-th primitive root of unity. Moreover, the following algorithm computes an N-th primitive root of unity ω ∈ Z/pZ such that we have ω^{N/2k} = r in Z/pZ.

Proof Since g^{N/2k} is a 2k-th root of unity, it is equal to r^{i_0} (modulo p) for some 0 ≤ i_0 < 2k, where i_0 is odd. Let j be a non-negative integer. Observe that we have

g^{j 2^ℓ/2k} = (g^{2^ℓ})^q · g^{i 2^ℓ/2k} = (g^{2^ℓ/2k})^i = r^{i i_0},    (2.7)

where q and i are the quotient and the remainder of j in the Euclidean division by 2k. By definition of g, the powers g^{i 2^ℓ/2k}, for 0 ≤ i < 2k, are pairwise different. It follows from Formula (2.7) that the elements r^{i i_0} are pairwise different as well, for 0 ≤ i < 2k. Therefore, one of

Algorithm 1 Find a primitive N-th root of unity ω ∈ Z/pZ such that ω^{N/2k} = r.
input:
- Exponent N.
- Radix r and exponent k from p = r^k + 1.
- An N-th root of unity g ∈ Z/pZ.
output:
- An N-th primitive root of unity ω ∈ Z/pZ such that ω^{N/2k} = r.
procedure PrimitiveRootAsRootOf(N, r, k, g)
    α := g^{N/2k}
    β := α
    j := 1
    while β ≠ r do
        β := α β
        j := j + 1
    end while
    ω := g^j
    return ω
end procedure

those latter elements is r itself. Hence, there exists j_1 with 0 ≤ j_1 < 2k such that g^{j_1 N/2k} = r. Then, ω = g^{j_1} is as desired, and Algorithm 1 computes it. □
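As a host-side sketch of Algorithm 1 (our own names; fp_t, powMod, mulMod, and equalMod are hypothetical stand-ins for the k-digit field arithmetic of Sections 2.3.3-2.3.5, not functions of the thesis code):

    typedef unsigned long long u64;
    typedef struct { u64 digit[8]; } fp_t;     // an element of Z/pZ, here k = 8

    extern fp_t powMod(fp_t base, u64 exp);    // hypothetical Z/pZ helpers
    extern fp_t mulMod(fp_t a, fp_t b);
    extern int  equalMod(fp_t a, fp_t b);

    // Algorithm 1: scan the powers of alpha = g^(N/2k), a 2k-th root of
    // unity, until r is reached; then omega = g^j satisfies omega^(N/2k) = r.
    fp_t primitiveRootAsRootOf(u64 N, fp_t r, u64 k, fp_t g)
    {
        fp_t alpha = powMod(g, N / (2 * k));
        fp_t beta  = alpha;
        u64  j     = 1;
        while (!equalMod(beta, r)) {           // terminates within 2k - 1 steps
            beta = mulMod(alpha, beta);
            j += 1;
        }
        return powMod(g, j);                   // omega = g^j
    }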

2.3.3 Addition and subtraction in Z/pZ

Let x, y ∈ Z/pZ be represented by x⃗, y⃗; see Section 2.3.1 for this notation. Algorithm 2 computes the representation of the element (x + y) mod p.
Proof At Step (1), x⃗ and y⃗, regarded as vectors over Z, are added component-wise. At Steps (2) and (3), the carry, if any, is propagated. At Step (4), there is no carry beyond the leading digit z_{k−1}, hence (z_{k−1}, ..., z_0) represents x + y. Step (5) handles the special case where x + y = p − 1 holds. Step (6) is the overflow case, which is handled by subtracting 1 mod p from (z_{k−1}, ..., z_0), finally producing the representation of x + y. □
A similar procedure computes the vector representing the element (x − y) ∈ Z/pZ. Recall that we explained in Section 2.3.1 how to perform the Euclidean divisions at Step (3) in a way that exploits the sparsity of the binary representation of r. In practice, the binary representation of the radix r fits a machine word; see Table 2.1. Consequently, so does each of the “digits” in the representation x⃗ of every element x ∈ Z/pZ. This allows us to exploit machine arithmetic in a sharper way. In particular, the Euclidean divisions at Step (3) can be further optimized.

Algorithm 2 Computing x + y ∈ Z/pZ for x, y ∈ Z/pZ
input:
- Elements x, y ∈ Z/pZ represented by x⃗, y⃗.
- Radix r and exponent k from p = r^k + 1.
output:
- Result of the addition x + y.
procedure BigPrimeFieldAddition(x⃗, y⃗, r, k)
1: compute z_i = x_i + y_i in Z, for i = 0, ..., k − 1,
2: let z_k = 0,
3: for i = 0, ..., k − 1, compute the quotient q_i and the remainder s_i in the Euclidean division of z_i by r, then replace (z_{i+1}, z_i) by (z_{i+1} + q_i, s_i),
4: if z_k = 0 then return (z_{k−1}, ..., z_0),
5: if z_k = 1 and z_{k−1} = ··· = z_0 = 0, then let z_{k−1} = r and return (z_{k−1}, ..., z_0),
6: let i_0 be the smallest index, 0 ≤ i_0 ≤ k, such that z_{i_0} ≠ 0, then let z_{i_0} = z_{i_0} − 1, let z_0 = ··· = z_{i_0−1} = r − 1, and return (z_{k−1}, ..., z_0).
end procedure

2.3.4 Multiplication by a power of r in Z/pZ

Before considering the multiplication of two arbitrary elements x, y ∈ Z/pZ, we assume that one of them, say y, is a power of r, say y = r^i for some 0 < i < 2k. Note that the cases i = 0 and i = 2k are trivial. Indeed, recall that r is a 2k-th primitive root of unity in Z/pZ. In particular, r^k = −1 in Z/pZ. Hence, for 0 < i < k, we have r^{k+i} = −r^i in Z/pZ. Thus, let us consider first the case where 0 < i < k holds. We also assume that 0 ≤ x < r^k holds in Z, since the case x = r^k is easy to handle. From Equation (2.5) we have:

x r^i ≡ (x_{k−1} r^{k−1+i} + ··· + x_0 r^i) mod p
      ≡ Σ_{j=0}^{k−1} x_j r^{j+i} mod p
      ≡ Σ_{h=i}^{k−1+i} x_{h−i} r^h mod p
      ≡ (Σ_{h=i}^{k−1} x_{h−i} r^h − Σ_{h=k}^{k−1+i} x_{h−i} r^{h−k}) mod p

The case k < i < 2k can be handled similarly. Also, in the case i = k, we have x r^i = x(p − 1) = −x in Z/pZ. It follows that, for all 0 < i < 2k, computing the product x r^i simply reduces to computing a subtraction. This fact, combined with Proposition 1, motivates the development of FFT algorithms over Z/pZ.
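To illustrate (our own sketch; bigSub stands for the subtraction counterpart of Algorithm 2): for 0 < i < K, multiplying x⃗ by r^i shifts the digit vector up by i positions, and the digits pushed past r^{K−1} re-enter negated, since r^K = −1 in Z/pZ.

    #define K 8                          // number of digits (k = 8, as for P3)
    typedef unsigned long long u64;

    extern __device__ void bigSub(const u64 *a, const u64 *b, u64 *out, u64 r);

    // out = x * r^i in Z/pZ, 0 < i < K, via a negacyclic digit rotation.
    __device__ void mulPowR(const u64 *x, u64 *out, int i, u64 r)
    {
        u64 lo[K] = {0};   // terms x_{h-i} r^h with i <= h <= K-1
        u64 hi[K] = {0};   // terms with K <= h <= K-1+i, wrapped by r^K = -1
        for (int h = i; h < K; ++h)      lo[h]     = x[h - i];
        for (int h = K; h < K + i; ++h)  hi[h - K] = x[h - i];
        bigSub(lo, hi, out, r);          // out = lo - hi in Z/pZ
    }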

2.3.5 Multiplication in Z/pZ

Let again x, y ∈ Z/pZ be represented by x⃗, y⃗, and consider the univariate polynomials f_x, f_y ∈ Z[T] associated with x, y; see Section 2.3.1 for this notation. To compute the product x y in Z/pZ, we proceed as follows.

Algorithm 3 Computing x y ∈ Z/pZ for x, y ∈ Z/pZ
input:
- Polynomials f_x, f_y ∈ Z[T] associated with x, y ∈ Z/pZ.
- Radix r and exponent k from p = r^k + 1.
output:
- Result of the multiplication x y.
procedure BigPrimeFieldMultiplication(f_x, f_y, r, k)
1: Compute the polynomial product f_u = f_x f_y in Z[T] modulo T^k + 1.
2: Writing f_u = Σ_{i=0}^{k−1} u_i T^i, observe that for all 0 ≤ i ≤ k − 1 we have 0 ≤ u_i ≤ k r^2, and compute a representation u⃗_i of u_i in Z/pZ using the method explained in Section 2.3.1.
3: Compute u_i r^i in Z/pZ using the method of Section 2.3.4.
4: Finally, compute the sum Σ_{i=0}^{k−1} u_i r^i in Z/pZ using Algorithm 2.
end procedure

For large values of k, f_x f_y mod T^k + 1 in Z[T] can be computed by asymptotically fast algorithms (see [29, 18]). However, for small values of k (say k ≤ 8), using plain multiplication is reasonable.
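For Step 1, here is a toy sketch of the plain negacyclic convolution (our own illustration; with 128-bit accumulators it only covers toy parameters, e.g. r < 2^62 and k ≤ 4, since in general the bound 0 ≤ u_i ≤ k r^2 requires a wider accumulator; the signed results would then be normalized into that non-negative range before Step 2):

    #define K 4
    typedef unsigned long long u64;

    // f_u = f_x * f_y mod T^K + 1 over Z:
    // u[m] = sum_{i+j=m} x_i y_j - sum_{i+j=m+K} x_i y_j.
    __host__ __device__ void negacyclicConvolve(const u64 *x, const u64 *y,
                                                __int128 *u)
    {
        for (int m = 0; m < K; ++m) u[m] = 0;
        for (int i = 0; i < K; ++i)
            for (int j = 0; j < K; ++j) {
                __int128 prod = (__int128)x[i] * (__int128)y[j];
                if (i + j < K) u[i + j]     += prod;  // contributes to T^(i+j)
                else           u[i + j - K] -= prod;  // T^(i+j) = -T^(i+j-K)
            }
    }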

2.4 FFT Basics

We review the discrete Fourier transform over a finite field and its related concepts. See [21] for details.

Primitive and principal roots of unity. Let R be a commutative ring with units. Let N > 1 be an integer. An element ω ∈ R is a primitive N-th root of unity if for 1 < k ≤ N we have ω^k = 1 ⇐⇒ k = N. The element ω ∈ R is a principal N-th root of unity if ω^N = 1 and for all 1 ≤ k < N we have

Σ_{j=0}^{N−1} ω^{jk} = 0.    (2.8)

In particular, if N is a power of 2 and ω^{N/2} = −1, then ω is a principal N-th root of unity. For instance, in Z/17Z, the element 2 satisfies 2^4 = 16 = −1, so 2 is a principal (and primitive) 8-th root of unity. The two notions coincide in fields of characteristic 0. For integral domains, every primitive root of unity is also a principal root of unity. For non-integral domains, a principal N-th root of unity is also a primitive N-th root of unity, unless the characteristic of the ring R is a divisor of N.

The discrete Fourier transform (DFT). Let ω ∈ R be a principal N-th root of unity. The N-point DFT at ω is the linear function mapping the vector a⃗ = (a_0, ..., a_{N−1})^T to b⃗ = (b_0, ..., b_{N−1})^T by b⃗ = Ω a⃗, where Ω = (ω^{jk})_{0 ≤ j,k ≤ N−1}. If N is invertible in R, then the N-point DFT at ω has an inverse, which is 1/N times the N-point DFT at ω^{−1}.
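As a reference point (our own sketch, not the thesis code; fp_t, zeroFp, oneFp, addMod, mulMod, and powMod are hypothetical Z/pZ helpers built from Sections 2.3.3-2.3.5), the defining matrix-vector product b⃗ = Ω a⃗ can be computed with one thread per output coefficient; this is the quadratic baseline that the FFTs below improve on:

    typedef unsigned long long u64;
    typedef struct { u64 digit[8]; } fp_t;     // an element of Z/pZ, k = 8

    extern __device__ fp_t zeroFp(void);       // hypothetical helpers
    extern __device__ fp_t oneFp(void);
    extern __device__ fp_t addMod(fp_t a, fp_t b);
    extern __device__ fp_t mulMod(fp_t a, fp_t b);
    extern __device__ fp_t powMod(fp_t base, u64 exp);

    // b[j] = sum_t omega^(j t) a[t]: the N-point DFT by definition, O(N^2).
    __global__ void naiveDFT(const fp_t *a, fp_t *b, fp_t omega, int N)
    {
        int j = blockIdx.x * blockDim.x + threadIdx.x;
        if (j >= N) return;
        fp_t acc = zeroFp();
        fp_t wj  = powMod(omega, (u64)j);      // omega^j
        fp_t w   = oneFp();                    // running power omega^(j t)
        for (int t = 0; t < N; ++t) {
            acc = addMod(acc, mulMod(w, a[t]));
            w   = mulMod(w, wj);
        }
        b[j] = acc;
    }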

The fast Fourier transform. Let ω ∈ R be a principal N-th root of unity. Assume that N can be factorized as JK with J, K > 1. Recall the Cooley-Tukey factorization formula [30]:

DFT_{JK} = (DFT_J ⊗ I_K) D_{J,K} (I_J ⊗ DFT_K) L_J^{JK},    (2.9)

where, for two matrices A, B over R with respective dimensions m × n and q × s, we denote by A ⊗ B an mq × ns matrix over R, called the tensor product of A by B, defined by

A ⊗ B = [a_{kℓ} B]_{k,ℓ} with A = [a_{kℓ}]_{k,ℓ}.    (2.10)

In the above formula, DFT_{JK}, DFT_J, and DFT_K are respectively the N-point DFT at ω, the J-point DFT at ω^K, and the K-point DFT at ω^J. The stride permutation matrix L_J^{JK} permutes an input vector x of length JK as follows:

x[iJ + j] ↦ x[jK + i],    (2.11)

for all 0 ≤ j < J, 0 ≤ i < K. If x is viewed as a K × J matrix, then L_J^{JK} performs a transposition of this matrix. The diagonal twiddle matrix D_{J,K} is defined as

D_{J,K} = ⊕_{j=0}^{J−1} diag(1, ω^j, ..., ω^{j(K−1)}).    (2.12)

Formula (2.9) implies various divide-and-conquer algorithms for computing DFTs efficiently, often referred to as fast Fourier transforms (FFTs). See the papers [5] and [7] by the authors of the SPIRAL and FFTW projects, respectively. This formula also implies that, if K divides J, then all involved multiplications are by powers of ω^K.
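As a small illustration (our own sketch, not the thesis kernel), the stride permutation L_J^{JK} of (2.11) is exactly a matrix transposition and can be written as a one-thread-per-entry kernel:

    typedef unsigned long long u64;

    // Apply L_J^{JK}: view x as a K x J row-major matrix and transpose it,
    // i.e., x[i*J + j] -> y[j*K + i] for 0 <= i < K, 0 <= j < J.
    __global__ void stridePermute(const u64 *x, u64 *y, int J, int K)
    {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx >= J * K) return;
        int i = idx / J;       // row index, 0 <= i < K
        int j = idx % J;       // column index, 0 <= j < J
        y[j * K + i] = x[idx];
    }

In the big prime field setting, each entry is itself a k-digit vector, so the same index map is applied digit-wise, in the layout described in Section 2.6.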

2.5 Blocked FFT on the GPU

In the sequel of this section, let ω ∈ R be a principal N-th root of unity. In the factorization of the matrix DFT_{JK}, viewing the size K as a base case and assuming that J is a power of K,

Formula (2.9) translates into a recursive algorithm. This recursive formulation is, however, not appropriate for generating code targeting many-core GPU-like architectures, for which formulating algorithms iteratively facilitates the division of the work into kernel calls and thread blocks. To this end, we shall unroll Formula (2.9).

Notation 3 Assuming N = K^e, we define the following linear operators, for i = 0, ..., e − 1:

U_i(ω) = (I_{K^i} ⊗ DFT_K(ω^{K^{e−1}}) ⊗ I_{K^{e−i−1}}) · (I_{K^i} ⊗ D_{K,K^{e−i−1}}(ω^{K^i})),

V_i(ω) = I_{K^i} ⊗ L_K^{K^{e−i}},    (2.13)

W_i(ω) = I_{K^i} ⊗ (L_{K^{e−i−1}}^{K^{e−i}} · D_{K,K^{e−i−1}}(ω^{K^i})).

Remark 1 We recall two classical formulas for tensor products of matrices. If A and B are square matrices over R with respective orders a and b, then we have

A ⊗ B = L_a^{ab} · (B ⊗ A) · L_b^{ab}.    (2.14)

If C and D are two other square matrices over R with respective orders a and b, then we have

(A ⊗ B) · (C ⊗ D) = (A · C) ⊗ (B · D).    (2.15)

Our GPU implementation reported in Section 2.6 is based on the following two results. We omit the proofs, which can easily be derived from Remark 1 and the Cooley-Tukey factorization formula; see [14]. Computer program code can be generated from Proposition 3 using the techniques of [5].

Proposition 2 For i = 0, ..., e − 1, we have

U_i(ω) = V_i(ω) (I_{K^{e−1}} ⊗ DFT_K(ω^{K^{e−1}})) W_i(ω)    (2.16)

The following formula reduces the computation of a DFT on K^e points to computing e DFTs on K points.

Proposition 3 The following factorization of DFT_{K^e}(ω) holds:

DFT_{K^e}(ω) = U_0(ω) ··· U_{e−1}(ω) V_{e−1}(ω) ··· V_0(ω).    (2.17)
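Read right to left, (2.17) gives an iterative schedule of kernel launches; a host-side sketch under that reading (the launch helpers are hypothetical names standing for the kernels of Section 2.6, not the thesis API):

    typedef unsigned long long u64;
    typedef struct { u64 digit[8]; } fp_t;

    extern void launchStridePermute(fp_t *x, int i, int K, int e); // V_i pass
    extern void launchTwiddle(fp_t *x, int i, int K, int e);       // I ⊗ D pass
    extern void launchBaseDFT(fp_t *x, int i, int K, int e);       // I ⊗ DFT_K ⊗ I

    // DFT_{K^e} = U_0 ... U_{e-1} V_{e-1} ... V_0 applied to x, rightmost first.
    void blockedFFT(fp_t *x, int K, int e)
    {
        for (int i = 0; i < e; ++i)          // permutation passes V_0, ..., V_{e-1}
            launchStridePermute(x, i, K, e);
        for (int i = e - 1; i >= 0; --i) {   // passes U_{e-1}, ..., U_0
            launchTwiddle(x, i, K, e);       // diagonal twiddle factors first
            launchBaseDFT(x, i, K, e);       // then the base-case DFT_K blocks
        }
    }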

2.6 Implementation

In this section, we discuss implementation techniques; our experimental results are reported in Section 2.7. We have realized a GPU implementation, in the CUDA language, of the algorithms presented in Sections 2.3 and 2.5. We have used the third and the fourth generalized Fermat primes from Table 2.1, namely P3 : (2^63 + 2^34)^8 + 1 and P4 : (2^62 + 2^36)^16 + 1. We have tested our code and collected the experimental data on three different types of GPU cards.

Parallelization. Performing arithmetic operations on vectors of elements of Z/pZ has inherent data parallelism, which is ideal for implementation on GPUs. In our implementation, each arithmetic operation is computed by one thread. An alternative approach would be to use multiple threads for computing one operation. However, this would not improve performance, mostly due to the overhead of handling carry propagation (in the case of addition and subtraction), or increased latency because of frequent accesses to global memory (in the case of twiddle factor multiplications).

Memory-bound kernels. The performance of our GPU kernels is limited by frequent accesses to memory. Therefore, we have considered solutions for minimizing memory latency, maximizing occupancy (i.e., the number of active warps on each streaming multiprocessor) to hide latency, and maximizing IPC (instructions per clock cycle).

Location of data. At execution time, each thread needs to perform computation on at least one element of Z/pZ, meaning that it will read/write at least k digits of machine-word size. Often, in such a scenario, shared memory is utilized as an auxiliary memory, but this approach has two shortcomings. First, on a GPU, each streaming multiprocessor has a limited amount of shared memory, which might not be large enough to allow each thread to keep at least one element of Z/pZ (since the value of k can be quite large). Second, using a huge amount of shared memory will reduce occupancy. At the same time, there is no opportunity for using texture memory or constant memory when computing over Z/pZ. Consequently, the only remaining solution is to keep all data in global memory.

Maximizing global memory efficiency. Assume that, for a vector of N elements of Z/pZ, consecutive digits of each element of Z/pZ are stored in adjacent memory addresses. Therefore, such a vector can be considered as the row-major layout of a matrix with N rows and k columns. In practice, this data structure will hurt performance due to increased memory overhead, caused by non-coalesced accesses to global memory. In this case, an effective solution is to apply a stride permutation L_k^{kN} on all input vectors (if data is stored in a row-major layout, this permutation is equivalent to transposing the input to a matrix of k rows and N columns). Therefore, all kernels are written with the assumption that consecutive digits of the same element are N steps away from each other in memory. As a result, accesses to global memory will be coalesced, increasing memory load and store efficiency, and lowering the memory overhead.
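Concretely (a sketch of the indexing only, with our own names): in the transposed layout, digit j of element t lives at data[j*N + t], so the threads of a warp, which handle consecutive elements t, touch consecutive addresses for each digit index:

    typedef unsigned long long u64;

    // Transposed layout: digit j of element t is at data[j*N + t].
    // Threads with consecutive t read consecutive addresses -- one coalesced
    // transaction per digit index j, instead of k strided ones per element.
    __device__ u64 loadDigit(const u64 *data, size_t N, size_t t, int j)
    {
        return data[(size_t)j * N + t];
    }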

Decomposing computation into multiple kernels. Inside a kernel, consuming too many registers per thread can lower occupancy or, even worse, lead to register spilling. In order to prevent register spilling, register-intensive kernels are broken into multiple smaller kernels.

Size of thread blocks. Our GPU kernels do not depend on the size of a thread block. So, we choose a configuration for a thread block that will maximize the percentage of occupancy, the value of IPC (instructions per clock cycle), and bandwidth-related performance metrics such as the load and store throughput. We have achieved the best experimental results for thread blocks of 128 or 256 threads.

Effect of GPU instructions on performance. Our current implementation is optimized for the primes P3 : (2^63 + 2^34)^8 + 1 and P4 : (2^62 + 2^36)^16 + 1. Therefore, we rely on 64-bit instructions on GPUs. As explained in [31], even though 64-bit integer instructions are supported on NVIDIA GPUs, at compile time, all arithmetic and memory instructions will first be converted to a sequence of 32-bit equivalents. This might have a negative impact on the overall performance of our implementation. Especially compared to addition and subtraction, 64-bit multiplication is computed through a longer sequence of 32-bit instructions. Finally, using 32-bit arithmetic provides more opportunities for optimizations such as instruction-level parallelism.

2.7 Experimentation

We compare our implementation of FFT over a big prime field against a comparable approach based on FFTs over small prime fields. To be precise, we implement the two approaches discussed in Section 2.2. Recall that the first approach computes an FFT of size N over a big prime field of the form Z/pZ, where p is an SRGFN of size k machine words. The second approach uses s = 2k half-machine-word primes p_1, ..., p_s and proceeds as follows:

1. projection: compute the image f_i of f in Z/p_iZ[x], for i = 1, ..., s,
2. images: compute the DFT of f_i at ω_i in Z/p_iZ[x], for i = 1, ..., s (using the CUMODP library [14]),
3. combination: combine the results using the CRT so as to obtain a DFT of f at ω.
We use half-machine-word primes (instead of machine-word primes as discussed in Section 2.2) because the small prime field FFTs of the CUMODP library impose this choice. Experimental results are gathered in Section 2.7.1. We also have implemented and tested a sequential, CPU version of both approaches. For the small prime field approach, we use the NTL library [11], supporting FFT modulo machine-word-size primes of 60 bits. However, for the big prime field approach, we have implemented our own arithmetic in a sequential C program. Experimental results are gathered in Section 2.7.2.

2.7.1 Big prime vs. small prime on the GPU

The output of the two approaches is the DFT of a vector of size N over a ring R which is either a prime field or a direct product of prime fields, and for which each element spans k machine words. Hence, these two approaches are equivalent building blocks in a modular method. For realizing the benchmark, we first perform the reduction step, followed by computing s = 2k FFTs of size N over small prime fields. In the small field case, we use the highly optimized implementations of the following FFT algorithms from the CUMODP library (see [14, 15] and [28]): the Cooley-Tukey FFT algorithm (CT), the Cooley-Tukey FFT algorithm with precomputed powers of the primitive root (CT-pow), and the Stockham FFT algorithm. The above codes compute DFTs for input vectors of 2^n elements, where 20 ≤ n ≤ 26 is typical. Our CUDA implementation of the big prime field approach computes DFT over Z/pZ, for P3 : (2^63 + 2^34)^8 + 1 and P4 : (2^62 + 2^36)^16 + 1, and input vectors of size N = K^e, where K = 16 for P3 and K = 32 for P4. Furthermore, for P3, we have 2 ≤ e ≤ 5, while for P4 (due to the limited size of global memory on a GPU card), we have 2 ≤ e ≤ 4. The benchmark is computed on an NVIDIA GeForce GTX 760M (CC 3.0), an NVIDIA Tesla C2075 (CC 2.0), and an NVIDIA Tesla M2050 (CC 2.0). The first card has an effective bandwidth of 48 GB/s, 4 streaming multiprocessors, and a total of 768 CUDA cores. Figures 2.1 and 2.2 show the speedup of the big prime field FFT compared to the small prime field approach, measured on the first GPU card. Moreover, Table 2.2 presents the running times of computing the benchmark on the mentioned GPU cards. In each table, the first three columns give the running times for computing the small prime field FFT based on the Cooley-Tukey algorithm, the Cooley-Tukey FFT algorithm with precomputed powers of the primitive root, and the Stockham algorithm, respectively. Meanwhile, the last column presents the running time for computing the big prime field FFT. As reported in [14], the FFT algorithms of the CUMODP library gain speedup factors for vectors of size 2^16 and larger; therefore, the input vector should be large enough to keep the GPU device busy and thus provide a high percentage of occupancy. This explains the results displayed in Figures 2.1 and 2.2: for both primes P3 and P4, when N = K^2 and N = K^3, our big prime field FFT approach significantly outperforms the small prime field FFT approach. More importantly, for both primes P3 and P4, and with vectors of size N = K^4, our experimental results demonstrate that computing the big prime field FFT is competitive with the small prime field approach in terms of running time. For both primes P3 and P4, we can compute an FFT for an input vector of size N = K^4, which is equivalent to 2^16 and 2^20 elements, respectively, and is large enough to cover many practical applications. Eventually, for P3, and for a vector of size N = K^5, the Cooley-Tukey (with precomputation) and Stockham FFT codes are slightly faster than the big prime field FFT. Nevertheless, for each of the tested big primes, there is a bit-size range of input vectors over which the big prime

field approach outperforms the small prime approach, which is coherent with the analysis of Section 2.2. For P3 : (2^63 + 2^34)^8 + 1, this range is [2^12, 2^16], while for P4 : (2^62 + 2^36)^16 + 1, this range is [2^15, 2^20]. Our GPU implementation of the big prime field arithmetic is generic and thus can support larger SRGFNs; see Table 2.1.

Table 2.2: Running time of computing the benchmark for N = K^e on GPU (timings in milliseconds).

Computing the benchmark for N = K^e for P3 : (2^63 + 2^34)^8 + 1 (K = 16):

Measured on an NVIDIA GTX 760M GPU
e | CT | CT-pow | Stockham | Big FFT
2 | 8.30 | 2.73 | 5.29 | 0.05
3 | 10.96 | 6.49 | 8.55 | 1.24
4 | 50.49 | 30.29 | 34.37 | 26.06
5 | 820.82 | 444.07 | 490.72 | 558.22

Measured on an NVIDIA Tesla C2075 GPU
e | CT | CT-pow | Stockham | Big FFT
2 | 9.44 | 2.93 | 5.16 | 0.03
3 | 11.72 | 6.27 | 7.54 | 0.89
4 | 31.85 | 15.57 | 19.07 | 17.71
5 | 418.58 | 191.57 | 205.13 | 371.48

Measured on an NVIDIA Tesla M2050 GPU
e | CT | CT-pow | Stockham | Big FFT
2 | 12.92 | 3.12 | 5.35 | 0.03
3 | 15.35 | 6.66 | 8.00 | 0.88
4 | 35.59 | 15.93 | 19.62 | 17.41
5 | 424.98 | 198.46 | 206.71 | 364.88

Computing the benchmark for N = K^e for P4 : (2^62 + 2^36)^16 + 1 (K = 32):

Measured on an NVIDIA GTX 760M GPU
e | CT | CT-pow | Stockham | Big FFT
2 | 18.30 | 9.33 | 12.26 | 0.37
3 | 62.98 | 39.74 | 46.72 | 20.40
4 | 1772.9 | 974.01 | 1042.62 | 971.28
5 | N.A. | N.A. | N.A. | N.A.

Measured on an NVIDIA Tesla C2075 GPU
e | CT | CT-pow | Stockham | Big FFT
2 | 19.82 | 9.56 | 11.56 | 0.27
3 | 44.50 | 23.39 | 27.98 | 15.16
4 | 891.35 | 437.29 | 464.69 | 695.02
5 | N.A. | N.A. | N.A. | N.A.

Measured on an NVIDIA Tesla M2050 GPU
e | CT | CT-pow | Stockham | Big FFT
2 | 27.22 | 9.91 | 11.62 | 0.27
3 | 51.81 | 23.93 | 28.60 | 14.80
4 | 902.35 | 449.53 | 465.51 | 678.34
5 | N.A. | N.A. | N.A. | N.A.

Figure 2.3 shows the percentage of time spent in each operation in order to compute the big prime field FFT on a randomly generated input vector of size N = K^4 (measured for both primes and on the first mentioned GPU card). As illustrated, for both primes, the computation follows a similar pattern, where multiplication by twiddle factors is the main bottleneck. Finally, Table 2.3 presents the profiling data for computing the base-case DFT_K on a GTX 760M GPU. Table 2.4 presents the definitions of the nvprof metrics according to [4].

2.7.2 CPU vs. GPU implementations

Table 2.5 gathers running times for computing FFT sequentially with both the small prime and the big prime approaches, on three different CPUs (measured in milliseconds). In addition, Table 2.6 shows the speedup range for computing the small and the big prime field approaches on CPU

Table 2.3: Profiling results for computing the base-case DFT_K on a GTX 760M GPU (collected using NVIDIA nvprof).

Measured on a GTX 760M GPU:

                                 P3 = (2^63 + 2^34)^8 + 1 (K = 16)   P4 = (2^62 + 2^36)^16 + 1 (K = 32)
Metric                           Mult by r     Add/Sub               Mult by r     Add/Sub
Achieved Occupancy               74%           45%                   62%           46%
Executed IPC                     0.72          2.41                  0.78          4.56
Instruction Replay Overhead      0.47          0.13                  0.53          0.028
Global Load Throughput           24.57 GB/s    20.91 GB/s            46.39 GB/s    10.22 GB/s
Global Store Throughput          22.08 GB/s    20.74 GB/s            44.42 GB/s    9.91 GB/s
Global Memory Load Efficiency    90.44%        43.60%                48.70%        98.86%
Global Memory Store Efficiency   94.95%        43.75%                49.35%        99.99%

Table 2.4: Definitions of NVIDIA nvprof metrics according to [4].

Metric                            Description
Achieved Occupancy                "Ratio of the average active warps per active cycle to the maximum number of warps supported on a multiprocessor."
Executed IPC                      "Instructions executed per cycle."
Replayed instructions ratio       (#Instructions issued − #Instructions executed) / #Instructions issued
Instruction Replay Overhead       "Average number of replays for each instruction executed."
Global Load Throughput            "Global memory load throughput" (including effects of cache).
Global Store Throughput           "Global memory store throughput" (including effects of cache).
Global Memory Load Efficiency     "Ratio of requested global memory load throughput to required global memory load throughput expressed as percentage."
Global Memory Store Efficiency    "Ratio of requested global memory store throughput to required global memory store throughput expressed as percentage."

and GPU. For each prime, the first and second columns show the lowest and highest running times of the same approach on CPU and GPU, respectively. The last column contains the lowest and highest speedup ratios of computing the same approach on CPU relative to its counterpart on GPU.

[Bar charts: speedup of the big prime field FFT over the CT, CT-pow, and Stockham codes, for vector sizes K^2, K^3, K^4, and K^5.]

Figure 2.1: Speedup diagram for computing the benchmark for a vector of size N = K^e (K = 16) for P3 = (2^63 + 2^34)^8 + 1.

[Bar charts: speedup of the big prime field FFT over the CT, CT-pow, and Stockham codes, for vector sizes K^2, K^3, and K^4.]

Figure 2.2: Speedup diagram for computing the benchmark for a vector of size N = K^e (K = 32) for P4 = (2^62 + 2^36)^16 + 1.

[Bar chart: percentage of the running time spent in each operation (Twiddle, Permutation, Mult-by-r, Add/Sub, Memory) for P3 (K = 16) and P4 (K = 32).]

Figure 2.3: Running time of computing DFT_N with N = K^4 on a GTX 760M GPU.

Table 2.5: Running time of computing the benchmark for N = K^e using sequential C code on CPU (timings in milliseconds).

Measured on an Intel Xeon X5650 @ 2.67GHz CPU:

Computing the benchmark for N = K^e for P3 = (2^63 + 2^34)^8 + 1 (K = 16):
e   NTL Small FFT   C Big FFT
2   2.51            1.85
3   23.19           35.08
4   372.19          750.40

Computing the benchmark for N = K^e for P4 = (2^62 + 2^36)^16 + 1 (K = 32):
e   NTL Small FFT   C Big FFT
2   14.94           12.91
3   384.10          692.16
4   11303.76        33351.29

Measured on an AMD FX(tm)-8350 @ 2.40GHz CPU:

P3 (K = 16):
e   NTL Small FFT   C Big FFT
2   4.06            4.13
3   16.06           20.01
4   296.00          528.00

P4 (K = 32):
e   NTL Small FFT   C Big FFT
2   12.00           8.00
3   296.00          396.00
4   10128.00        22992.00

Measured on an Intel Core i7-4700HQ @ 2.40GHz CPU:

P3 (K = 16):
e   NTL Small FFT   C Big FFT
2   3.12            0.73
3   14.19           21.06
4   232.76          505.96

P4 (K = 32):
e   NTL Small FFT   C Big FFT
2   12.48           9.79
3   233.26          496.03
4   7573.65         26089.53

Table 2.6: Speedup ratio (T_CPU / T_GPU) for computing the benchmark for N = K^e for P3 and P4 (timings in milliseconds).

Computing the benchmark for N = K^e for P3 = (2^63 + 2^34)^8 + 1 (K = 16):

e   NTL Small FFT       Small FFT GPU      Speed-up
2   2.51 - 4.06         2.73 - 12.92       0.19X - 1.48X
3   14.19 - 23.19       6.27 - 15.35       0.92X - 3.69X
4   232.76 - 372.19     15.57 - 50.49      4.61X - 23.90X

e   C Big FFT           Big FFT GPU        Speed-up
2   0.73 - 4.13         0.03 - 0.05        14.6X - 137.6X
3   20.01 - 35.08       0.88 - 1.24        16.13X - 39.86X
4   505.96 - 750.40     17.41 - 26.06      19.41X - 43.10X

Computing the benchmark for N = K^e for P4 = (2^62 + 2^36)^16 + 1 (K = 32):

e   NTL Small FFT          Small FFT GPU       Speed-up
2   12.00 - 14.94          9.33 - 27.22        0.44X - 1.60X
3   233.26 - 384.10        23.39 - 62.98       3.70X - 16.42X
4   7573.65 - 11303.76     437.29 - 1772.92    4.27X - 25.84X

e   C Big FFT              Big FFT GPU         Speed-up
2   8.00 - 12.91           0.27 - 0.37         21.62X - 47.81X
3   396.00 - 692.16        14.80 - 20.40       19.03X - 46.76X
4   22992.00 - 33351.29    695.02 - 971.28     23.67X - 47.98X

2.8 Conclusion

Our results show the advantage of the big prime field approach. To be precise, for a range of vector sizes, one can find a suitable large prime modulo which FFTs outperform the CRT-based approach. The CUDA code presented in this chapter is part of the CUMODP library, freely available at http://www.cumodp.org.

2.9 Appendix: modular methods and unlucky primes

In computer algebra, the so-called modular methods are the main application of prime field arithmetic. Let us give a simple example of such methods.

Consider a square matrix A of order n with coefficients in the ring Z of integers. It is well-known that det(A), the determinant of A, can be computed in at most 2n^3 arithmetic operations in the field Q of rational numbers, by means of Gaussian elimination. However, the cost of each of those operations is not the same and, in fact, depends on the bit size of the rational numbers involved. It can be proved that, if B is the maximum absolute value of a coefficient in A, then computing the determinant of A directly (that is, over Z) can be done within O(n^5 (log n + log B)^2) machine-word operations, see the landmark book [20]. If a modular method is used, based on the Chinese Remainder Theorem (CRT), one can reduce the cost to O(n^4 log_2(nB) (log_2 n + log_2 B)) machine-word operations. Let us explain how this works. Let d be the determinant of A and let us choose a prime number p ∈ Z such that the absolute value |d| of d satisfies

2 |d| < p.

Let r be the determinant of A regarded as a matrix over Z/pZ and let us represent the elements of Z/pZ within the symmetric range [−(p−1)/2, ..., (p−1)/2]. Hence we have

−p/2 < r < p/2   and   −p/2 < d < p/2,   (2.18)

leading to

−p < d − r < p.   (2.19)

Observe that det(A) is a polynomial expression in the coefficients of A. For instance, with n = 2 we have

det(A) = a11 a22 − a12 a21.   (2.20)

Denoting by x^(p) the residue class in Z/pZ of any x ∈ Z, we have

(x + y)^(p) = x^(p) + y^(p)   and   (x y)^(p) = x^(p) y^(p),   (2.21)

for all x, y ∈ Z. It follows, for n = 2 and using standard notations, that we have

det(A)^(p) = a11^(p) a22^(p) − a12^(p) a21^(p).   (2.22)

More generally, we have

det(A)^(p) = det(A mod p),   (2.23)

that is, d ≡ r mod p. This, with Relation (2.19), leads to

d = r.   (2.24)

In summary, the determinant of A as a matrix over Z is equal to the determinant of A regarded as a matrix over Z/pZ, provided that 2|d| < p holds. Therefore, the computation of the determinant of A as a matrix over Z can be done modulo p, which provides a way of controlling expression swell in the intermediate computations.
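As a small illustration (with numbers of our own choosing), take n = 2 and A = [[3, 1], [4, 2]], so that d = det(A) = 3·2 − 1·4 = 2. Choosing p = 7, we have 2|d| = 4 < 7; Gaussian elimination over Z/7Z yields r = 2 in the symmetric range, and Relation (2.24) recovers d = r = 2 without any growth in intermediate values.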

But if d is what we want to compute, the condition 2|d| < p is not that helpful for choosing p. However, Hadamard's inequality tells us that, if B is the maximum absolute value of an entry of A, then we have

|d| ≤ n^(n/2) B^n.   (2.25)

One can then choose a prime number p satisfying 2 n^(n/2) B^n < p. Of course, such a prime may be very large and thus the expected benefit of controlling expression swell may be limited.

An alternative approach is to consider pairwise different prime numbers p_1, ..., p_e such that their product exceeds 2 n^(n/2) B^n, and each of them fits on a machine word. Then, computing the determinants of A regarded as a matrix over Z/p_1Z, ..., Z/p_eZ leads to values r_1, ..., r_e, respectively. Finally, applying the CRT yields d. The advantage of this alternative approach is that, for a prime number p fitting in the machine word of a computer, arithmetic operations modulo p can be implemented efficiently using hardware integer operations.

For an example of a modular method incurring unlucky primes, let us consider the simple problem of computing a Greatest Common Divisor (GCD) of two univariate polynomials with integer coefficients. Let f = f_n x^n + ··· + f_0 and g = g_m x^m + ··· + g_0 be polynomials in x, with respective degrees n and m, and with coefficients in a unique factorization domain (UFD) R. The following matrix is called the Sylvester matrix of f and g.

              ( f_0                       g_0                       )
              ( f_1    f_0                g_1    g_0                )
              (  :      :    .             :      :    .            )
Sylv(f, g) =  ( f_n    ...   ...  f_0     g_m    ...   ...  g_0     )   (2.26)
              (        f_n        f_1            g_m        g_1     )
              (              .     :                   .     :      )
              (                   f_n                       g_m     )

(the first m columns carry the shifted coefficient vectors of f, and the last n columns those of g). Its determinant is an element of R called the resultant of f and g. This determinant is usually denoted by res(f, g) and enjoys the following property: a GCD h of f and g has degree zero (that is, h is simply an element of R) if and only if res(f, g) ≠ 0 holds. In other words, f and g have a non-trivial GCD (that is, a GCD of positive degree) if and only if res(f, g) = 0 holds.

Assume now that R is the ring Z of the integer numbers and that res(f, g) ≠ 0 holds. Suppose that this latter fact is not known and that one is computing a GCD of f and g by means of a modular method based on the CRT. More precisely, we are computing GCDs of f and g modulo sufficiently many prime numbers p_1, ..., p_e, obtaining polynomials h_1, ..., h_e in Z/p_1Z[x], ..., Z/p_eZ[x]. If none of the prime numbers p_1, ..., p_e divides res(f/h, g/h),

nor the leading coefficients f_n and g_m, then combining h_1, ..., h_e by CRT yields a GCD of f and g (which, under the assumption res(f, g) ≠ 0, turns out to be a constant). However, if one of the prime numbers p_1, ..., p_e, say p_i, divides res(f/h, g/h) (even if it divides neither f_n nor g_m), then h_i has a positive degree. It follows that h_i is not a modular image of a GCD of f and g in Z[x]. Therefore, this prime p_i should not be used in our CRT scheme and for this reason is called unlucky. Note that as the coefficients of f and g grow, so does res(f, g). As a consequence, small primes are likely to be unlucky for input data with large coefficients. While there are tricks to overcome the noise introduced by unlucky primes, this can become a serious computational bottleneck. To summarize, certain modular methods, when applied to challenging problems, require the use of prime numbers that do not necessarily fit in a machine word. This observation motivated the work presented in this chapter.

3 Big Prime Field FFT on Multi-core Processors

3.1 Introduction

Prime field arithmetic plays a central role in computer algebra by supporting computation in Galois fields. The prime fields that are used in computer algebra systems, in particular in the implementation of modular methods, are often of single precision. Increasing precision beyond the machine word size can be done via the Chinese Remainder Theorem (CRT) or Hensel's Lemma. However, using machine-word size, thus small, prime numbers has serious inconveniences in certain modular methods, in particular for solving systems of non-linear equations. Indeed, in such circumstances, the so-called unlucky primes are to be avoided, see for instance [12, 13].

We consider prime fields of large characteristic, typically fitting on k machine words, where k is a power of 2. When the characteristic of these fields is restricted to a subclass of the generalized Fermat numbers, the authors of [1] have shown, in an ISSAC 2017 paper, that arithmetic operations in such fields offer attractive performance, both in terms of algebraic complexity and parallelism. In particular, these operations can be vectorized, leading to an efficient implementation of fast Fourier transforms on graphics processing units (GPUs), reported in that same paper.

In the present work, we turn our attention to the most commonly used processors of today's laptops and desktops, namely multi-core processors. These architectures are, in principle, not suitable for fine-grained parallelism, in contrast with GPUs. GPUs and multi-core processors differ in memory hierarchies as well as in the communication and synchronization mechanisms between threads. Moreover, GPU architectures offer programmers a finer control of hardware resources than multi-core processors and thus more opportunities to reach high performance. These features of GPU architectures have been essential in the implementation of arithmetic operations of generalized Fermat prime fields. Hence, the implementation techniques developed in [1] cannot be easily ported and applied to the context of multi-core processors.

This leads us to a first question: can a serial implementation (written in the C programming language) take advantage of the properties of those finite fields towards an implementation

of fast Fourier transform (FFT) over those fields? The answer is yes; however, the route that we took is, of course, quite different than in the GPU case. Instead of performing many batches of arithmetic operations (a natural way of doing things in a GPU implementation), we have focused our effort on optimizing the multiplication between two arbitrary elements of our generalized Fermat prime fields. Consider a generalized Fermat prime number of the form p = r^k + 1, where k is a power of 2 and r is of machine-word size. As mentioned in [1], multiplying by a power of r modulo p can be done in O(k) machine-word operations. However, multiplying two arbitrary elements of Z/pZ is a non-trivial operation. Note that we encode elements of Z/pZ in radix-r expansion. Thus, multiplying two arbitrary elements of Z/pZ requires computing the product of two univariate polynomials in Z[X], of degree less than k, modulo X^k + 1. In [1], this is done by using plain multiplication, thus Θ(k^2) machine-word operations. In Section 3.3, we explain how to multiply two arbitrary elements x, y of Z/pZ via FFT.

A second natural question is whether a multi-threaded implementation of big prime field FFT can deliver interesting speedup factors. While obtaining efficient multi-threaded implementations of FFTs with coefficients in single or double precision is a standard research topic [33, 34, 35, 36], the case of higher precision has received little attention so far. With coefficients in the generalized Fermat prime field Z/pZ, our FFT is in the spirit of the algorithms of Schönhage and Strassen [23] and Fürer [21], where fast multiplication is achieved by "composing" FFTs operating on different vector sizes. The practicality of Fürer's algorithm is still an open question, a question that we touch in this thesis, without fully addressing it. Several algorithms, similar to Fürer's, have been proposed since. For example, in [37, 38] De et al. gave a similar algorithm which relies on finite field arithmetic and achieves the same running time as Fürer's algorithm. Later, Harvey, van der Hoeven and Lecerf proposed, for integer multiplication, a theoretical improvement to Fürer's algorithm in [39] based on Bluestein's chirp transform. In [40], they also propose a similar algorithm for multiplication over finite fields, achieving a Fürer-like complexity. This work led to an efficient implementation in [41], using multiplication of polynomials over the special field F_(2^60). In [42], Covanov and Thomé proposed an algorithm based on generalized Fermat primes and the same scheme as Fürer's algorithm, to multiply integers with a Fürer-like complexity.

Returning to our second question, addressing the parallel execution of FFT over big prime fields on multi-cores, the answer is yes. On a 4-core processor and on a 6-core processor, both equipped with hyper-threading technology, we reached nearly linear speedup for the largest input data that we tried.

To measure the benefits of our optimized implementation of the generalized Fermat prime field Z/pZ, we have realized a naive implementation of the same field, where the radix representation is not used. In this second implementation, the sum a + b mod p and the product a × b mod p are simply computed by calling the modular sum and modular product functions from the GNU Multiple Precision Arithmetic Library (GMP) [9]. The performance of our big prime field FFT degrades substantially with this second implementation of Z/pZ.

By our measurements, the difference in performance can be attributed to the sharp management of computing resources in the optimized implementation (i.e., specialized arithmetic and minimal usage of memory). The experimental results reported in Section 3.5 support the positive answers to our two questions. Our code is part of the Basic Polynomial Algebra Subprograms, also known as the BPAS library [43], and is publicly available at http://www.bpaslib.org/.

3.2 Generalized Fermat prime fields

The residue classes modulo p, where p is a prime number, form a field (unique up to isomorphism) called the prime field with p elements, denoted by GF(p) or Z/pZ. Single-precision and multi-precision primes are referred to as small primes and big primes, respectively. Since modular methods for polynomial systems rely on polynomial arithmetic, these large prime numbers must support FFT-based algorithms, such as FFT-based polynomial multiplication. Therefore, we consider the so-called generalized Fermat prime numbers. A detailed introduction to generalized Fermat prime numbers can be found in the previous work of our research group [1]. In this chapter, we denote a generalized Fermat prime number p as p = r^k + 1, and write Z/pZ for the finite field GF(p). In particular, in the field Z/pZ, r is a 2k-th primitive root of unity.

Each element x ∈ Z/pZ is represented by a vector x = (x_{k−1}, ..., x_0) of length k. We can also use a univariate polynomial f_x ∈ Z[R] to represent x: we write f_x = Σ_{i=0}^{k−1} x_i R^i, such that x ≡ f_x(r) mod p. The basic arithmetic algorithms in Z/pZ are also introduced in [1], Section 3.

As we have mentioned above, for p = r^k + 1, r is a 2k-th primitive root of unity in Z/pZ, and Section 3.3 of [1] provides a very efficient algorithm for the multiplication of elements x, y ∈ Z/pZ where one of them is a power of r. We assume that y = r^i for some 0 ≤ i ≤ 2k. The cases i = 0 and i = 2k are trivial: since r is a 2k-th primitive root of unity in Z/pZ, we have r^0 = r^{2k} = 1. Also, we have r^k = −1 in Z/pZ, so that for i = k we have x r^k = −x, and for k < i < 2k, r^i = −r^{i−k} holds. Now let us consider only the case 0 < i < k, where we have the following equation:

x r^i ≡ (x_{k−1} r^{k−1+i} + ··· + x_0 r^i) mod p
      ≡ Σ_{j=0}^{k−1} x_j r^{j+i} mod p
      ≡ Σ_{h=i}^{k−1+i} x_{h−i} r^h mod p
      ≡ ( Σ_{h=i}^{k−1} x_{h−i} r^h − Σ_{h=k}^{k−1+i} x_{h−i} r^{h−k} ) mod p

We see that for all 0 ≤ i ≤ 2k, the product x · r^i is reduced to a shift and a subtraction. We call this process a cyclic shift.
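As a concrete illustration, the following C sketch (our own simplified rendition of the idea, not the BPAS code; big_field_sub is an assumed helper performing subtraction in Z/pZ on radix-r digit vectors) realizes the case 0 < i < k by building the shifted part and the wrapped part of the sum above, then performing a single big-field subtraction:

#include <stdint.h>
#include <string.h>

/* assumed helper: u := a - b in Z/pZ, operands given as k radix-r digits */
void big_field_sub(uint64_t *u, const uint64_t *a, const uint64_t *b,
                   int k, uint64_t r);

/* cyclic shift: u := x * r^i mod p, for p = r^k + 1 and 0 < i < k */
void mul_pow_r(uint64_t *u, const uint64_t *x, int i, int k, uint64_t r)
{
    uint64_t hi[k], lo[k];                  /* C99 variable-length arrays */
    memset(hi, 0, (size_t)k * sizeof *hi);
    memset(lo, 0, (size_t)k * sizeof *lo);
    for (int h = i; h < k; ++h) hi[h] = x[h - i];      /* shifted digits */
    for (int h = 0; h < i; ++h) lo[h] = x[k - i + h];  /* wrapped digits */
    big_field_sub(u, hi, lo, k, r);         /* u = hi - lo in Z/pZ */
}

The remaining cases (i = 0, i = k, and k < i ≤ 2k) reduce to this one via r^0 = r^{2k} = 1 and r^k = −1.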

The C implementation can be found in the BPAS library [43]; we refer to this function as MulPowR in this chapter. Our main motivation for using generalized Fermat primes is that, thanks to cyclic shifts, multiplications of elements of Z/pZ by a power of r are computationally cheap; this offers the opportunity to reduce the average time spent in multiplication operations during the execution of an FFT algorithm over such finite fields. Multiplication between two arbitrary elements in Z/pZ, however, can be complicated and expensive; our previous work [1] gave a theoretical algorithm for computing the product x y ∈ Z/pZ using polynomial multiplication (see Algorithm 3 in [1]). In the following section, we discuss the multiplication between arbitrary elements in more detail, and explain the C implementation.

3.3 Optimizing multiplication in generalized Fermat prime fields

In this section, we discuss how we can efficiently multiply two arbitrary elements in Z/pZ (when p is a generalized Fermat prime) using FFT. In Section 3.3.1, we outline an algorithm based on polynomial multiplication via FFT. In Section 3.3.2, we present an implementation of the FFT-based multiplication, then proceed to explain each subroutine.

3.3.1 Algorithms

For a generalized Fermat prime p, our approach follows the concepts from Section 3.2: it treats any two elements x and y of Z/pZ as polynomials f_x and f_y, then uses polynomial multiplication algorithms to obtain the product x y. In practice, more details must be considered in order to reach high performance; for instance, how do we efficiently convert a positive integer in the range (0, r^3) into radix-r representation?

Consider u = x y mod p with x, y, u ∈ Z/pZ. We use the polynomial representation of the elements in the field, that is, f_x(R) = x_{k−1} R^{k−1} + ··· + x_1 R + x_0 and f_y(R) = y_{k−1} R^{k−1} + ··· + y_1 R + y_0. The first step is to multiply the two polynomials f_x and f_y. Computing f_u(R) = f_x(R) · f_y(R) mod (R^k + 1) can be interpreted as a negacyclic convolution. A cyclic convolution computes f(x) · g(x) mod (x^n − 1) for two polynomials f and g with degree less than n. Fast algorithms for computing cyclic convolutions via the discrete Fourier transform (DFT) are presented, for instance, in [44]. Similar approaches can be used for computing negacyclic convolutions.

Let q be a prime, ω be an n-th primitive root of unity in Z/qZ, and θ be a 2n-th primitive root of unity in Z/qZ. Also, let f(x) and g(x) be two polynomials with degree less than n, and let a and b denote the coefficient vectors of f and g. The negacyclic convolution of f and g can be computed as follows:

A′ · InverseDFT( DFT(A · a) · DFT(A · b) )   (3.1)

where A = (1, θ, ..., θ^{n−1}) and A′ = (1, θ^{−1}, ..., θ^{1−n}). All the dots between vectors are point-wise multiplications. The InverseDFT and DFTs are all computed at k points. In our implementation, we use unrolled DFTs (similar to the base-case DFTs given in Section 3.4.3, but relying on prime field arithmetic for a single machine word).
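To make Equation (3.1) concrete, here is a C sketch of the negacyclic convolution (our own illustration, not the BPAS sources); mul_mod, inv_mod, dft, and inverse_dft are assumed helpers for single-word modular arithmetic and in-place length-n (I)DFTs over Z/qZ, the latter using the n-th root of unity ω = θ^2:

#include <stdint.h>

uint64_t mul_mod(uint64_t a, uint64_t b, uint64_t q);  /* a*b mod q      */
uint64_t inv_mod(uint64_t a, uint64_t q);              /* a^{-1} mod q   */
void dft(uint64_t *v, int n, uint64_t q);              /* in-place DFT   */
void inverse_dft(uint64_t *v, int n, uint64_t q);      /* in-place IDFT  */

/* negacyclic convolution u = a * b mod (x^n + 1) over Z/qZ */
void negacyclic_convolution(uint64_t *u, const uint64_t *a, const uint64_t *b,
                            int n, uint64_t q, uint64_t theta)
{
    uint64_t ta[n], tb[n], w = 1;
    for (int i = 0; i < n; ++i) {           /* weight inputs by theta^i   */
        ta[i] = mul_mod(a[i], w, q);
        tb[i] = mul_mod(b[i], w, q);
        w = mul_mod(w, theta, q);
    }
    dft(ta, n, q);
    dft(tb, n, q);
    for (int i = 0; i < n; ++i)             /* point-wise product         */
        ta[i] = mul_mod(ta[i], tb[i], q);
    inverse_dft(ta, n, q);
    uint64_t ti = inv_mod(theta, q);        /* unweight by theta^{-i}     */
    w = 1;
    for (int i = 0; i < n; ++i) {
        u[i] = mul_mod(ta[i], w, q);
        w = mul_mod(w, ti, q);
    }
}

Algorithm 4 below invokes such a routine (as NegacyclicConvolution) once for each of the primes p_1 and p_2.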

Notice that for f_x and f_y in Z/pZ, each coefficient is at most 63 bits wide. This implies that when we compute f_u(R) = f_x(R) · f_y(R) mod (R^k + 1), the coefficients of f_u can be as large as log_2(k) + 2 × 63 = log_2(k) + 126 bits, which is more than one machine word. We overcome this situation by means of a scheme based on the Chinese Remainder Theorem (CRT).

For k small enough, we use two machine-word size primes p_1 and p_2 satisfying R ≤ (p_1 p_2 − 1)/2, where R = k r^2 is greater than or equal to each of |u_0|, ..., |u_{k−1}|. Let m_1 and m_2 be two integers such that p_1 m_1 + p_2 m_2 = 1. Then, each coefficient u_i of f_u can be computed using the Chinese Remainder Theorem.

Now each coefficient u_i of f_u is the combination of k terms, so the absolute value of each u_i is bounded over by k · r^2, which implies that it needs at most ⌊log_2(k r^2)⌋ + 1 bits to be encoded. Since k is usually between 4 and 256, a radix-r representation of u_i of length 3 is sufficient to encode u_i. Hence, we denote by [c_i, h_i, l_i] the 3 integers uniquely given by u_i = c_i r^2 + h_i r + l_i, where 0 ≤ h_i, l_i < r and c_i ∈ [−(k − 1), k]. Then, we can rewrite:

f_u(R) = f_x(R) · f_y(R) mod (R^k + 1) = Σ_{i=0}^{k−1} (c_i R^{2+i} + h_i R^{1+i} + l_i R^i).

Now, we have all the coefficients of f_u in the form [l, h, c]. Rearranging the k triples gives us three vectors l = [l_0, ..., l_{k−1}], h = [h_0, ..., h_{k−1}], and c = [c_0, ..., c_{k−1}]. Finally, we compute l + h · r + c · r^2 to get the final result of x y ∈ Z/pZ. We refer to this approach of multiplying two arbitrary elements in Z/pZ as the FFT-based multiplication in the generalized Fermat prime field. The complete solution is presented in Algorithm 4.

3.3.2 Implementation in C

In this section, we describe our implementation of the FFT-based multiplication for two arbitrary elements of Z/pZ. We follow the ideas of Algorithm 4 and take care of implementation details. Note that Algorithm 4 heavily relies on single-precision modular multiplications, especially in the convolution step. To maximize practical performance, we use Montgomery's tricks from [45] for performing operations in Z/pZ, in particular multiplication. We use the improved Montgomery multiplication (similar to an algorithm from [46]), which we have implemented using inline assembly in C. The code can be found in the BPAS library.
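For reference, the Montgomery product itself can be sketched as follows: a textbook REDC in portable C with 128-bit intermediates, assuming a precomputed q_neg_inv = −q^{−1} mod 2^64 and a modulus q < 2^63 (the BPAS routine is an inline-assembly variant of this idea). Operands are kept in Montgomery form x · 2^64 mod q:

#include <stdint.h>

/* Montgomery product modulo q with R = 2^64; returns a*b*2^{-64} mod q */
static inline uint64_t mont_mul(uint64_t a, uint64_t b,
                                uint64_t q, uint64_t q_neg_inv)
{
    __uint128_t t = (__uint128_t)a * b;
    uint64_t m = (uint64_t)t * q_neg_inv;   /* m = t * (-q^{-1}) mod 2^64 */
    uint64_t u = (uint64_t)((t + (__uint128_t)m * q) >> 64);
    return (u >= q) ? u - q : u;            /* one conditional subtraction */
}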

Algorithm 4 FFT-based multiplication for two arbitrary elements in Z/pZ.
1: input:
   - two vectors x and y representing the two elements x and y in Z/pZ,
   - two numbers r and k such that p = r^k + 1 is a generalized Fermat number.
2: output:
   - a vector u representing the result of x · y ∈ Z/pZ.
3: constant values:
   - two machine-word size primes p_1 and p_2,
   - two numbers m_1 and m_2 such that p_1 m_1 + p_2 m_2 = 1 holds.
4: procedure FFT-basedMultiplication(x, y, r, k)
5:   z_1 := NegacyclicConvolution(x, y, p_1, k)
6:   z_2 := NegacyclicConvolution(x, y, p_2, k)
7:   for 0 ≤ i < k do
8:     [s0_i, s1_i] := CRT(p_1, p_2, m_1, m_2, z1_i, z2_i)
9:     [l_i, h_i, c_i] := LHC(s0_i, s1_i, r)
10:  end for
11:  c := MulPowR(c, 2, k, r)
12:  h := MulPowR(h, 1, k, r)
13:  u := BigPrimeFieldAddition(l, h, k, r)
14:  u := BigPrimeFieldAddition(u, c, k, r)
15:  return u
16: end procedure

Note that in Algorithm 4, both the convolution and CRT steps require a large number of modular multiplication operations. With that in mind, before performing either of the convolutions, we convert the two vectors x and y into Montgomery representation, once for p_1 and once for p_2. After that, we compute the negacyclic convolutions. Once the convolution is carried out, we need to retrieve the result from the Montgomery representation. This step is performed as part of the CRT computation:

a_2′ = (a_2 m_1) mod p_2,
a_1′ = (a_1 m_2) mod p_1.

In the next step, we compute the second part of the CRT algorithm:

a_2′ p_1 + a_1′ p_2

Note that here we need to perform two 64-bit multiplications (thus producing two 128-bit numbers), then add the results via 128-bit arithmetic. Once again, for the sake of efficiency, we turn to inline assembly in C (the implementation code can be found in the BPAS library [43]). Finally, for u_i (0 ≤ i < k) as a coefficient of f_u = f_x · f_y mod (R^k + 1) ∈ Z, the result is stored as a pair of 64-bit numbers [s0, s1] so that we have u_i = s1 · 2^64 + s0.
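A portable C sketch of this CRT combination (ours; as noted above, the BPAS version uses inline assembly) is shown below. Here m_1 is taken as a residue modulo p_2, m_2 as a residue modulo p_1, and mul_mod is an assumed single-word modular product:

#include <stdint.h>

uint64_t mul_mod(uint64_t a, uint64_t b, uint64_t q);  /* assumed helper */

/* recover u = s1*2^64 + s0 from a1 = u mod p1 and a2 = u mod p2, using
 * u = (a2*m1 mod p2)*p1 + (a1*m2 mod p1)*p2  (mod p1*p2)               */
void crt_combine(uint64_t *s0, uint64_t *s1, uint64_t a1, uint64_t a2,
                 uint64_t p1, uint64_t p2, uint64_t m1, uint64_t m2)
{
    uint64_t a2p = mul_mod(a2, m1, p2);     /* a2' = a2*m1 mod p2 */
    uint64_t a1p = mul_mod(a1, m2, p1);     /* a1' = a1*m2 mod p1 */
    __uint128_t u = (__uint128_t)a2p * p1 + (__uint128_t)a1p * p2;
    __uint128_t P = (__uint128_t)p1 * p2;
    if (u >= P) u -= P;                     /* reduce modulo p1*p2 */
    *s0 = (uint64_t)u;
    *s1 = (uint64_t)(u >> 64);
}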

At this point, as we discussed in Section 3.3.1, we need to convert the coefficients of f_u into the radix-based representation (l, h, c). Provided that the following relations are satisfied:

s0 = q0 r + m0   with q0, m0 < r,   (3.2)

s1 = q1 r + m1   with q1, m1 < r,   (3.3)
2^64 = q2 r + m2   with q2, m2 < r,   (3.4)

we proceed by computing the triple [l0, h0, c0] as follows:

[l0, h0, c0] = (q0 r + m0) + (q1 r + m1)(q2 r + m2)
             = q1 q2 r^2 + (m1 q2 + m2 q1 + q0) r + (m0 + m1 m2)
             = c0 r^2 + h0 r + l0

Notice that the triple [l0, h0, c0] is still not the final result, since either of h0 or l0 can be greater than r. Therefore, we need to compute the quotient and the remainder of h0 (resp. l0) by r. As the value of r remains constant during the whole computation, we use an adaptation of Barrett reduction [47] using 128-bit arithmetic for computing the division by r (for more details, see the function div_by_const_R_ptr in [43]).
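The idea can be sketched for a 64-bit operand as follows (a simplified version of our own; the actual routine extends it to 128-bit operands): precompute mu = ⌊2^64 / r⌋ once, so that each division by r costs one 64×64→128-bit multiplication plus a short correction loop.

#include <stdint.h>

/* Barrett-style division by the constant radix r; mu = floor(2^64 / r) */
static inline void div_by_const_r(uint64_t x, uint64_t r, uint64_t mu,
                                  uint64_t *q, uint64_t *rem)
{
    uint64_t qh = (uint64_t)(((__uint128_t)x * mu) >> 64); /* estimate <= x/r */
    uint64_t rm = x - qh * r;
    while (rm >= r) { rm -= r; ++qh; }  /* at most a couple of corrections */
    *q = qh;
    *rem = rm;
}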

Then, we have

l0 = h1 · r + l1   and   h0 = h2 · r + l2

The final result is computed by the following additions:

l + h · r + c · r^2 = [l1, h1, 0] + [l2, h2, 0] · r + [0, 0, c0]

At this point, we have explained the full implementation of the FFT-based multiplication for multiplying two arbitrary elements in Z/pZ. In Section 3.5, we present experimental results comparing our implementation against that of the GMP library [9].

3.4 A generic implementation of FFT over prime fields

In Section 3.4.1, we first review the tensor algebra formulation of FFT, following the presentation of [48]. In Section 3.4.2, we explain how one can use the recursive formulation of the six-step DFT to derive an iterative algorithm in which all DFT computations are performed via a fixed-size base-case. In the context of generalized Fermat prime fields, this reduction allows us to take advantage of the "cheap" multiplication by powers of the radix r introduced in Section 3.2. Finally, in Section 3.4.3, we discuss the implementation of efficient routines for computing the base-case DFT_K.

3.4.1 The tensor algebra formulation of FFT

In this section, we review the tensor formulation of FFT. Recall that over a commutative ring R, an n-point DFT_n is a linear map from R^n to R^n. For N = JK, we use the six-step FFT factorization presented in [48]:

DFT_N = L^N_K (I_J ⊗ DFT_K) L^N_J D_{K,J} (I_K ⊗ DFT_J) L^N_K   with N = JK   (3.5)

Definition 2 The stride permutation L^{KJ}_K permutes an input vector x of length KJ as follows, with 0 ≤ i < K and 0 ≤ j < J:

x[iJ + j] ↦ x[jK + i]   (3.6)

For an input vector x of length KJ, if we look at the vector as a row-major J × K matrix M, then the stride permutation L^{KJ}_K is equivalent to performing a transposition on M:

L^{KJ}_K (M_{J×K}) = (M_{J×K})^T   (3.7)

For example, let x_8 = [0, 1, 2, 3, 4, 5, 6, 7]; we compute L^{2×4}_2(x). We can rearrange x as a row-major 4 × 2 matrix M, then perform a transpose:

T 0 1     2 3 0 2 4 6 MT     4 2 4 5 1 3 5 7 ×   6 7   2 4(ì)  We retrieve the result by reading the consequent rows of M. Therefore, we have L2× x [0, 2, 4, 6, 1, 3, 5, 7].

Definition 3 The twiddle factor D_{K,J} is a matrix of the powers of ω:

D_{K,J} = ⊕_{j=0}^{K−1} diag(1, ω^j, ..., ω^{j(J−1)})   (3.8)

3.4.2 The BPAS implementation of the FFT

The dominant cost during the computation of FFT over Z/pZ is the time spent in the multiplication by twiddle factors (powers of the root of unity). Even though we could compute all the twiddle factor multiplications with Algorithm 4, inspired by the ideas discussed in Fürer's paper [21], our goal is to efficiently compute FFT on a vector of size N = K^e through base-case DFT_K's. We face three main challenges. First, we need an algorithm to reduce the

computation of DFT_N to base-case DFT_K's. Second, we need an efficient implementation of the base-case DFT_K which relies on cheap multiplications by a K-th primitive root of unity (as explained in Section 3.2). Finally, we need an FFT implementation which can be parallelized on a multi-core CPU; therefore the choice of the FFT algorithm is critical to achieving high performance.

In the BPAS library, and with respect to the above challenges, we decided to implement DFT over Z/pZ based on the six-step FFT factorization of [48] (see Equation (3.5) in Section 3.4.1). The six-step FFT factorization provides an easy solution to the first challenge: we simply unroll Equation (3.5) until all DFT computations are performed through a sequence of DFT_K's. The process of reduction to the base-case is as follows. For computing the product I_K ⊗ DFT_J, we can further expand it until we reach the base-case DFT_K. The derived solution is presented in Algorithm 5.

Regarding parallelization, Algorithm 5 is iterative and has no recursive calls; it only includes a number of nested for-loops. This makes the whole implementation suitable for a parallel implementation on a multi-core CPU. In fact, the inner for-loop nests at Lines 5, 10, 16, 21, and 25 can be executed in parallel; a sketch of this is given after Algorithm 5. On that basis, we have parallelized our implementations of FFT over Z/pZ using Intel CilkPlus. Experimental results comparing the parallel and serial implementations are reported in Section 3.5.

3.4.3 Efficient implementation of DFT_K

Once again, we benefit from reduction to a base-case. This time, for computing DFT_K, we reduce the whole computation to a sequence of base-case DFT_2's, which are defined in the following way:

DFT_2(x_0, x_1) = (x_0 + x_1, x_0 − x_1)   (3.9)

Then, for K = 2^n, we recursively apply the following factorization until all DFT computations are in DFT_2:

DFT_{2^n} = L^{2^n}_2 (I_{2^{n−1}} ⊗ DFT_2) L^{2^n}_{2^{n−1}} D_{2,2^{n−1}} (I_2 ⊗ DFT_{2^{n−1}}) L^{2^n}_2   (3.10)

Now, let us consider the example of the base-case DFT_8 in Z/pZ when p = r^4 + 1. Let us assume that ω_0 is an 8-th primitive root of unity (thus ω_0^8 = 1). Also, let ω_1 = ω_0^2, thus a 4-th primitive root of unity (then, ω_1^4 = ω_0^8 = 1).

DFT_8(ω_0) = L^8_2 (I_4 ⊗ DFT_2) L^8_4 D_{2,4} (I_2 ⊗ DFT_4) L^8_2   (3.11)
DFT_4(ω_1) = L^4_2 (I_2 ⊗ DFT_2) L^4_2 D_{2,2} (I_2 ⊗ DFT_2) L^4_2.   (3.12)

Algorithm 5 Computing DFT on K^e points in Z/pZ.
1: input:
   - the size of the base-case, K (8, 16, 32, 64, 128, or 256),
   - a positive integer e,
   - a vector x of size K^e,
   - ω, a K^e-th primitive root of unity in Z/pZ.
2: output:
   - the final result, stored in x.
3: procedure DFT_general(x, K, e, ω)
4:   for 0 ≤ i < e − 1 do
5:     for 0 ≤ j < K^i do   ▷ Can be replaced with Parallel-For.
6:       stride_permutation(x_{jK^{e−i}}, K, K^{e−i−1})   ▷ Step 1
7:     end for
8:   end for
9:   ω_a := ω^{K^{e−1}}
10:  for 0 ≤ j < K^{e−1} do   ▷ Can be replaced with Parallel-For.
11:    idx := jK
12:    DFT_K(x_idx, ω_a)   ▷ Step 2
13:  end for
14:  for e − 2 ≥ i ≥ 0 do
15:    ω_i := ω^{K^i}
16:    for 0 ≤ j < K^i do   ▷ Can be replaced with Parallel-For.
17:      idx := j K^{e−i}
18:      twiddle(x_idx, K^{e−i−1}, K, ω_i)   ▷ Step 3
19:      stride_permutation(x_idx, K^{e−i−1}, K)   ▷ Step 4
20:    end for
21:    for 0 ≤ j < K^{e−1} do   ▷ Can be replaced with Parallel-For.
22:      idx := jK
23:      DFT_K(x_idx, ω_a)   ▷ Step 5
24:    end for
25:    for 0 ≤ j < K^i do   ▷ Can be replaced with Parallel-For.
26:      idx := jK^{e−i}
27:      stride_permutation(x_idx, K, K^{e−i−1})   ▷ Step 6
28:    end for
29:  end for
30: end procedure
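To illustrate the parallelization noted in Algorithm 5, the following CilkPlus sketch (our own; the names are not those of the BPAS sources) shows how Step 2 can be parallelized: the K^{e−1} base-case DFTs touch pairwise disjoint blocks of the input, so the loop iterations are independent.

#include <cilk/cilk.h>
#include <stdint.h>

/* assumed: base-case DFT on K consecutive field elements of k words each */
void dft_base_K(uint64_t *block, const uint64_t *omega_a, long K, int k);

/* Step 2 of Algorithm 5: num_blocks = K^{e-1} independent base-case DFTs */
void dft_step2_parallel(uint64_t *x, long K, long num_blocks,
                        const uint64_t *omega_a, int k)
{
    cilk_for (long j = 0; j < num_blocks; ++j)
        dft_base_K(x + j * K * k, omega_a, K, k);  /* disjoint blocks */
}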

Substituting Equation (3.12) in Equation (3.11), we have:

DFT_8(ω_0) = L^8_2 (I_4 ⊗ DFT_2) L^8_4 D_{2,4} (I_2 ⊗ L^4_2)(I_4 ⊗ DFT_2)
             (I_2 ⊗ L^4_2)(I_2 ⊗ D_{2,2})(I_4 ⊗ DFT_2)(I_2 ⊗ L^4_2)(L^8_2).   (3.13)

The unrolled Equation (3.13) is a sequence of basic operations, which helps us in the following ways. First, we avoid performing the permutations and actually moving data around. Instead, we precompute the positions of the elements after each permutation and hard-code those values in the algorithm for computing the base-case. Also, we reduce the number of multiplications in the base-case. Moreover, each multiplication in the base-case can be reduced to a cyclic shift (as explained in Section 3.2).

Avoiding stride permutations in DFT_K

In our example for DFT_8, there are 4 permutation steps in Equation (3.13). We begin with the two right-most ones, (I_2 ⊗ L^4_2)(L^8_2). Rather than moving the data, we precompute the positions of the permuted elements. Let M = (0, 1, 2, 3, 4, 5, 6, 7) be the vector containing the initial positions of the elements of x. Then,

M_1 = L^8_2 M = (0, 2, 4, 6, 1, 3, 5, 7)   (3.14)
M_2 = (I_2 ⊗ L^4_2) M_1 = (0, 4, 2, 6, 1, 5, 3, 7)   (3.15)

Moving from right to left in Equation (3.13), when we reach I_4 ⊗ DFT_2 (the third factor from the right in Equation (3.13)), we apply four DFT_2's on elements of x, while we retrieve the order of elements as recorded in M_2:

DFT_2(0, 4) → DFT_2(2, 6) → DFT_2(1, 5) → DFT_2(3, 7)   (3.16)

Following this trend, we reach L^8_4 and L^8_2 on the left-most side of Equation (3.13):

M_3 = (L^8_4) M_2 = (0, 1, 4, 5, 2, 3, 6, 7)   (3.17)
M_4 = (L^8_2) M_3 = (0, 4, 2, 6, 1, 5, 3, 7)   (3.18)

At the very end, we need to swap some elements of x in order to correct their positions in the result vector. That means the positions of the elements in the result vector must be updated from what they are in M_4 to the values in M_out in the following way:

M_4   = (0, 4, 2, 6, 1, 5, 3, 7)
         ↓  ↓  ↓  ↓  ↓  ↓  ↓  ↓
M_out = (0, 1, 2, 3, 4, 5, 6, 7)

Here, rather than permuting the whole vector, we only need to swap the elements whose positions differ, namely x_1 with x_4 and x_3 with x_6. For the case of DFT_8, we end up swapping only 4 out of 8 elements.

Twiddle multiplications in the DFT_K

Recall Equation (3.8):

D_{K,J} = ⊕_{j=0}^{K−1} diag(1, ω^j, ..., ω^{j(J−1)})

Then, we have the following twiddle matrices as part of DFT_8:

D_{2,2} = diag(1, 1, ω_1^0, ω_1^1)   (3.19)
D_{2,4} = diag(1, 1, 1, 1, ω_0^0, ω_0^1, ω_0^2, ω_0^3)   (3.20)

As we are computing over Z/pZ where the prime is p = r^4 + 1, the radix r is an 8-th root of unity and can therefore be used for the computation of DFT_8. Let ω_0 = r and ω_1 = r^2; then the twiddle matrices are updated as follows:

D_{2,4} = diag(1, 1, 1, 1, 1, r, r^2, r^3)   (3.21)
D_{2,2} = diag(1, 1, 1, r^2)   (3.22)

We see that more than half of the multiplications in the DFT_8 are by 1 and do not require any actual computation. More importantly, the multiplications by powers of the radix are done by the cyclic shift of Section 3.2. In a similar way, this argument is valid for any DFT_K, as long as we are computing modulo a generalized Fermat prime of the form p = r^k + 1. In the end, putting all the optimizations together, and following Equation (3.13) from right to left, we get the unrolled algorithm presented in Algorithm 6 for computing DFT_8. The algorithm computes the DFT of a vector of size 8 over a generalized Fermat prime of the form p = r^4 + 1; note that r is an 8-th primitive root of unity modulo p. Following the above process, we have implemented base-cases for K equal to 8, 16, 32, 64, 128, and 256 in the BPAS library. We believe that the currently implemented base-case sizes are large enough for real-world applications. Thus, we have skipped prime sizes larger than 128 machine words in our current implementation.

3.5 Experimentation

In this section, first, we briefly describe the setup used in our experimentation. Then, in Section 3.5.2, we present the comparison of the two implementations of the multiplication in Z/pZ introduced in Section 3.3. Section 3.5.3 reports on the results for computing FFT over big prime fields with the BPAS library. Finally, in Section 3.5.4, we analyze the speedup that we gain by parallelizing each approach. All the experimental results have been verified using equivalent code written with GMP [9].

Algorithm 6 Unrolled base-case DFT_8 over Z/pZ for p = r^4 + 1.
1: input:
   - a vector x of 8 elements of Z/pZ,
   - the radix r from p = r^4 + 1.
2: output:
   - the final result, stored in x.
3: procedure DFT8(x, r)
4:   DFT2(x_0, x_4); DFT2(x_2, x_6);   ▷ DFT on permuted indexes.
5:   DFT2(x_1, x_5); DFT2(x_3, x_7);   ▷ DFT on permuted indexes.
6:   x_6 := x_6 r^2;   ▷ Twiddle multiplication by r^2.
7:   x_7 := x_7 r^2;   ▷ Twiddle multiplication by r^2.
8:   DFT2(x_0, x_2); DFT2(x_4, x_6);   ▷ DFT on permuted indexes.
9:   DFT2(x_1, x_3); DFT2(x_5, x_7);   ▷ DFT on permuted indexes.
10:  x_5 := x_5 r^1;   ▷ Twiddle multiplication by r^1.
11:  x_3 := x_3 r^2;   ▷ Twiddle multiplication by r^2.
12:  x_7 := x_7 r^3;   ▷ Twiddle multiplication by r^3.
13:  DFT2(x_0, x_1); DFT2(x_4, x_5);   ▷ DFT on permuted indexes.
14:  DFT2(x_2, x_3); DFT2(x_6, x_7);   ▷ DFT on permuted indexes.
15:  Swap(x_1, x_4); Swap(x_3, x_6);   ▷ Final permutation.
16:  return x;
17: end procedure
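In C, the DFT2 butterfly of Equation (3.9) amounts to one big-field addition and one big-field subtraction; a sketch (our own, with big_field_add and big_field_sub as assumed helpers implementing the arithmetic of Section 3.2) is given below. Algorithm 6 is then a fixed sequence of such butterflies, cyclic-shift twiddles (MulPowR), and two final swaps.

#include <stdint.h>
#include <string.h>

/* assumed helpers: addition/subtraction in Z/pZ on k radix-r digits */
void big_field_add(uint64_t *u, const uint64_t *a, const uint64_t *b,
                   int k, uint64_t r);
void big_field_sub(uint64_t *u, const uint64_t *a, const uint64_t *b,
                   int k, uint64_t r);

/* in-place butterfly: (a, b) := (a + b, a - b) in Z/pZ */
void dft2(uint64_t *a, uint64_t *b, int k, uint64_t r)
{
    uint64_t t[k];                        /* C99 VLA: copy of a */
    memcpy(t, a, (size_t)k * sizeof *t);
    big_field_add(a, t, b, k, r);         /* a := a + b mod p */
    big_field_sub(b, t, b, k, r);         /* b := a - b mod p */
}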


3.5.1 Experimental setup

Table 3.1 provides the set of prime numbers we use for the different base-cases. The value of k is between 4 and 128 (i.e., up to 128 machine words). We have used two node configurations for our benchmarking purposes. The first configuration, which we refer to as Intel-i7-7700K, has an Intel i7-7700K 4-core processor (with 8 threads when hyper-threading is enabled), clocking at 4.50 GHz, and is equipped with 16 GB of memory (clocking at 2133 MHz). The second configuration, which we refer to as Xeon-X5650, has an Intel Xeon X5650 processor with 6 physical cores (and 12 threads when hyper-threading is enabled), clocking at 2.66 GHz, and is equipped with 48 GB of memory (clocking at 1133 MHz).

Table 3.1: The set of big primes of different sizes used for experimentation.

prime   K (= 2k)   k     r
P4      8          4     2^59 + 2^58 + 2^11
P8      16         8     2^59 + 2^57 + 2^39
P16     32         16    2^58 + 2^55 + 2^45
P32     64         32    2^58 + 2^55 + 2^17
P64     128        64    2^57 + 2^56 + 2^11
P128    256        128   2^57 + 2^52 + 2^20

3.5.2 Multiplication in generalized Fermat prime fields

As discussed in Section 3.3, we provide an algorithm for multiplying two arbitrary elements of the generalized Fermat prime field Z/pZ (referred to as GFPF) which relies on a negacyclic convolution using DFTs over small prime fields. Our goal is to compare the running time of our approach with that of the integer arithmetic provided by GMP [9]. To this end, we provide the same input data to both multiplication functions (randomly generated data, but the same data passed to all experiments), the multiplication is carried out, and at the end, the results are verified. Table 3.2 shows the time (in milliseconds) spent in the computation of 10^6 multiplications using each of the two implementations (the number 10^6 is chosen as an input size large enough to reduce the errors in time measurement). Also, Table 3.2 shows the running-time ratio of the GFPF versus the GMP multiplication. The experimentation has been conducted on Intel-i7-7700K. We observe that the GFPF implementation is slower than the GMP multiplication; however, the GFPF multiplication becomes relatively faster as the value of k increases.

Table 3.2: The running time of computing 10^6 modular multiplications in Z/pZ for P8, P16, P32, and P64 (measured on Intel-i7-7700K).

prime   k    GFPF        GMP         Ratio (t_GFPF / t_GMP)
P8      8    645 (ms)    171 (ms)    3.77x
P16     16   1318 (ms)   417 (ms)    3.16x
P32     32   2852 (ms)   1179 (ms)   2.41x
P64     64   6101 (ms)   3452 (ms)   1.76x

Recall that the GFPF multiplication has four steps (see Section 3.3.2):
I. negacyclic convolution (includes converting the vectors into Montgomery representation),
II. Chinese remainder algorithm (includes converting the vectors out of Montgomery representation),

III. LHC algorithm (fast division of a three machine-word number by the radix r), and
IV. cyclic shift, addition, and normalization (carry handling).

Table 3.3 shows the percentage of time spent in each step of the GFPF multiplication during the multiplication of 10^6 arbitrary elements of Z/pZ, for primes P8, P16, P32, and P64, collected on Intel-i7-7700K. It also presents the actual running time (shown in milliseconds); clearly, computing the convolution is the dominant cost.

Table 3.3: Time (in milliseconds) and percentage (%) of the total time spent in different steps of computing 10^6 GFPF multiplications of arbitrary elements in Z/pZ for primes P8, P16, P32, and P64 (measured on Intel-i7-7700K).

                Convolution     CRT          LHC          Normalization
prime   k       Time    %       Time   %     Time   %     Time   %
P8      8       323     45      150    21    208    29    35     5
P16     16      851     52      288    18    425    26    64     4
P32     32      2083    57      563    15    847    23    177    5
P64     64      4751    61      1115   14    1497   19    434    6

3.5.3 FFT over big prime fields

In this section, we provide experimental data for computing FFTs over big prime fields. As explained in Section 3.4, our FFT implementations, which compute the DFT on a vector of size N = K^e over Z/pZ (with p = r^k + 1), are based on Algorithm 5. We compare the running time of our GFPF implementation versus the GMP implementation, both executed serially. Once more, we compare the running time of the two implementations, this time both executed in parallel. Table 3.4 provides the running time and running-time ratio for our generalized Fermat prime field (GFPF) based implementation versus the GMP implementation of computing FFT of size N = K^e over Z/pZ (for primes P4, P8, P16, P32, P64, and P128), in sequential and parallel mode. We skip the case of N = K^3 for P128 (K = 256), as it is too large to fit in the memory of either of our compute nodes. All measurements are completed on Intel-i7-7700K. Table 3.5 provides similar comparisons measured on Xeon-X5650. In the case of Xeon-X5650, we observe that with more cores and threads, our parallel GFPF implementation gains more speedup compared to the parallel GMP implementation. For both the serial and parallel cases, we find that our implementation using GFPF multiplication is faster than GMP in most cases.

Table 3.4: The running time (in milliseconds) and ratio (t_GFPF/t_GMP) of serial and parallel computation of FFT on vectors of size N = K^e over Z/pZ for P4, P8, P16, P32, P64, and P128 (measured on Intel-i7-7700K).

                           Serial                               Parallel
prime  k    K    e    GFPF       GMP        t_GFPF/t_GMP   GFPF      GMP       t_GFPF/t_GMP
P4     4    8    2    0.019      0.030      0.63x          0.057     0.118     0.48x
P4     4    8    3    0.314      0.363      0.86x          0.215     0.276     0.77x
P8     8    16   2    0.181      0.202      0.89x          0.117     0.143     0.81x
P8     8    16   3    5.771      5.486      1.05x          1.603     2.247     0.71x
P16    16   32   2    1.644      1.730      0.95x          0.513     0.693     0.74x
P16    16   32   3    103.423    104.620    0.98x          24.052    35.017    0.68x
P32    32   64   2    14.815     20.341     0.72x          3.507     5.411     0.64x
P32    32   64   3    1922.373   2431.867   0.79x          462.746   702.163   0.65x
P64    64   128  2    140.995    278.188    0.50x          33.507    69.879    0.47x
P128   128  256  2    580.961    3745.353   0.15x          154.064   905.799   0.17x

Table 3.5: The running time (in milliseconds) and ratio (t_GFPF/t_GMP) of serial and parallel computation of FFT on vectors of size N = K^e over Z/pZ for P4, P8, P16, P32, P64, and P128 (measured on Xeon-X5650).

                           Serial                                Parallel
prime  k    K    e    GFPF       GMP         t_GFPF/t_GMP   GFPF      GMP        t_GFPF/t_GMP
P4     4    8    2    0.051      0.071       0.71x          0.155     0.114      1.35x
P4     4    8    3    0.843      0.917       0.91x          0.452     0.577      0.78x
P8     8    16   2    0.472      0.546       0.86x          0.217     0.320      0.67x
P8     8    16   3    16.661     15.231      1.09x          2.837     4.806      0.59x
P16    16   32   2    4.444      5.085       0.87x          0.877     1.371      0.63x
P16    16   32   3    284.080    297.904     0.95x          41.012    66.635     0.61x
P32    32   64   2    39.809     64.307      0.61x          5.701     11.640     0.48x
P32    32   64   3    4674.079   6501.669    0.71x          696.311   1289.061   0.54x
P64    64   128  2    376.450    909.041     0.41x          53.578    140.610    0.38x
P128   128  256  2    1395.310   13371.369   0.10x          240.362   1811.282   0.13x

3.5.4 Performance analysis of FFT implementations

In this section, we compare the parallel speedup factors for each of the GFPF and GMP approaches against their corresponding serial implementations. From the previous section and Table 3.2, we know that the GFPF multiplication of two arbitrary elements in Z/pZ is slower than the GMP implementation. At the same time, Tables 3.4 and 3.5 indicate that computing FFT on large vectors over Z/pZ using GFPF multiplication turns out to be faster than GMP arithmetic in most cases, in both serial and parallel modes. This interesting result can be explained as follows.

When we compute a DFT using generalized Fermat prime field arithmetic (including GFPF multiplication), the majority of the multiplications are performed in the base-cases, where they are carried out in linear time through the cyclic shift (a sequence of data movement, subtraction, and carry handling; see Section 3.2). Meanwhile, in the case of GMP arithmetic, all of the multiplications are done using the same function calls, with no consideration for the cheap multiplications in the base-case DFT_K. Figure 3.1 presents the ratio of time spent in one modular multiplication operation in FFT over Z/pZ on vectors of size N = K^e between the GFPF implementation and the GMP arithmetic. We see that for the GFPF implementation, the average time spent in one modular multiplication is much lower than the time spent in the same operation using GMP arithmetic.

[Bar chart: ratio t/t_gmp of the FFT-BASED implementation versus GMP (normalized to 1.0), for K^e in {16^2, 16^3, 32^2, 32^3, 64^2, 64^3}.]

Figure 3.1: Ratio (t/t_gmp) of the average time spent in one multiplication operation, measured during the computation of FFT over Z/pZ on vectors of size N = K^e.

This result agrees with our estimation of increased performance due to Fürer's trick [21]. As demonstrated, by using the cyclic shift for performing cheap multiplications in the base-case, we can lower the average time spent in multiplications, resulting in faster computation of the base-case DFT_K's and, consequently, a speedup of the whole FFT over Z/pZ. Now, we take a closer look at the steps involved in the DFT computation. Table 3.6 provides the running-time data for every step of computing a DFT of size N = K^3 (K = 64)

over Z/pZ for the prime P32. The timings are measured for both implementations on Intel-i7-7700K. As we observe, for both implementations in the serial mode, the time spent in precomputation and stride permutation is negligible compared to the time spent in twiddle multiplications and the base-case DFT_K's. Also, parallelization has little impact on precomputation and stride permutation. In contrast, parallelization significantly improves the time spent in twiddle multiplications and the base-case DFT_K's for both approaches. Finally, Table 3.7 compares the parallel speedup ratios for each implementation on both Intel-i7-7700K and Xeon-X5650. This table indicates that the parallelization of the GFPF implementation appears to be slightly more successful than the parallelization of the GMP implementation. By our measurements, this difference in performance can be attributed to the sharp management of computing resources (i.e., specialized arithmetic and minimal usage of memory). We have repeated the same benchmarks with hyper-threading disabled; this is also shown in Table 3.7. With hyper-threading disabled, the speedup drops slightly; nevertheless, both implementations still gain nearly linear parallel speedup.

Table 3.6: Time spent (in milliseconds) in different steps of serial and parallel computation of a DFT of size N = K^3 over Z/pZ, for the prime P32 (K = 2k = 64), measured on Intel-i7-7700K.

Mode       Variant   Precomputation   Permutation   DFT_K       Twiddle
Serial     GFPF      14 (ms)          72 (ms)       444 (ms)    1406 (ms)
Serial     GMP       6 (ms)           177 (ms)      1229 (ms)   1026 (ms)
Parallel   GFPF      14 (ms)          51 (ms)       82 (ms)     330 (ms)
Parallel   GMP       6 (ms)           181 (ms)      284 (ms)    237 (ms)

Table 3.7: Ratio (t_serial/t_parallel) for serial vs. parallel execution of each implementation for N = K^e (K = 2k, e = 3), measured on both Intel-i7-7700K and Xeon-X5650, with and without hyper-threading enabled.

      Intel-i7-7700K                          Xeon-X5650
      + Hyper-threading   - Hyper-threading   + Hyper-threading   - Hyper-threading
k     GFPF     GMP        GFPF     GMP        GFPF     GMP        GFPF     GMP
4     1.37x    1.25x      1.57x    1.31x      1.87x    1.59x      2.35x    1.95x
8     3.64x    2.44x      3.36x    2.34x      5.52x    3.17x      4.99x    3.40x
16    4.31x    2.96x      3.77x    2.69x      6.93x    4.47x      5.66x    4.16x
32    4.15x    3.48x      3.67x    3.07x      6.71x    5.04x      5.65x    4.47x

3.6 Conclusions and future work

We have presented an implementation of fast Fourier transforms over generalized Fermat prime fields on multi-threaded processors. Our parallel implementations, using both specialized arithmetic and integer arithmetic from the GMP library, achieve nearly linear parallel speedup. We noticed that the parallelization of our specialized implementation is slightly more successful than that of our GMP implementation. We attribute this higher performance to the reduced number of arithmetic instructions due to using specialized arithmetic, minimal memory usage, and unrolling the base-case DFTs while hard-coding the constants.

Our results prove that developing specialized arithmetic (e.g., Montgomery multiplication, Barrett reduction, the cyclic shift introduced in Section 3.2, and using inline assembly) can be beneficial. Doing so leads to reduced overhead compared to a more generic implementation such as the large integer arithmetic functions available in GMP, or other libraries built on top of GMP. Unrolling the base-case DFTs improves the performance for two main reasons. First, by removing the majority of permutations (all except the last swap), it minimizes data movement. Second, compared to a naive implementation, a hard-coded base-case reduces the number of multiplications by a power of the radix to less than half (by simply avoiding the multiplications by 1 in the first place). Designing our implementation based on the iterative six-step FFT algorithm was crucial; it allowed for a more finely scheduled parallelization on multi-core CPUs, which obtains good speedup.

As part of our future work, we should extend our implementation to arbitrary vector sizes, that is, the cases where the size N is not of the form K^e. Also, we must consider how to apply our approach to very large input sizes, for example, when the input vectors are too large to fit into main memory. Finally, we need to address another bottleneck of the current implementation, that is, the arbitrary multiplication in the generalized Fermat prime fields. We need a better solution for the multiplication between two polynomials with 64-bit integer coefficients; indeed, such a multiplication can result in coefficients of up to 192 bits, requiring multi-precision arithmetic.

4 KLARAPTOR: A Tool for Dynamically Finding Optimal Kernel Launch Parameters Targeting CUDA Programs

4.1 Introduction

Programming for high-performance parallel computing is a notoriously difficult task. Programmers must be conscious of many factors impacting performance, including scheduling, synchronization, and data locality. Of course, program code itself impacts the program's performance; however, there are still further parameters which are independent from the code and greatly influence performance. For parallel programs, three types of parameters influence performance: (i) data parameters, such as input data and its size; (ii) hardware parameters, such as cache capacity and the number of available registers; and (iii) program parameters, such as the granularity of tasks and the quantities that characterize how tasks are mapped to processors (e.g., the dimension sizes of a thread block for a CUDA kernel).

Data and hardware parameters are independent from program parameters and are determined by the needs of the user and the available hardware resources. Program parameters, however, are intimately related to data and hardware parameters. The choice of program parameters can largely influence the performance of a parallel program, resulting in orders of magnitude difference in timings (see Section 4.7). Therefore, it is crucial to determine values of program parameters that yield the best program performance for a given set of hardware and data parameter values.

In the CUDA programming model, the kernel launch parameters, and thus the size and shape of thread blocks, greatly impact performance. This should be obvious considering that the memory access pattern of the threads in a thread block can depend on the block's dimension sizes. The same could be said about multithreaded programs on CPU, where parallel performance depends on task granularity and the number of threads. Our general technique (see Section 4.4) is applied on top of some performance model to estimate program parameters which optimize performance. This could be applied to parallel programs in general, wherever performance models using program parameters exist. However, we dedicate this chapter to the discussion of GPU programs written in CUDA.


An important consequence of the impact of kernel launch parameters on performance is that an optimal thread block format (that is, dimension sizes) for one GPU architecture may not be optimal for another, as illustrated in [49]. This emphasizes not only the impact of hardware parameters on program parameters, but also the need for performance portability, that is to say, enabling users to efficiently execute the same parallel program on different architectures that belong to the same hardware platform.

Figure 4.1: Comparing kernel execution time (log-scaled) for the thread block configuration chosen by KLARAPTOR versus the minimum and maximum times as determined by an exhaustive search over all possible configurations. Kernels are part of the PolyBench/GPU benchmark suite and executed on (1) a GTX 1080Ti with a data size of N = 8192 (except convolution3d with N = 1024), and (2) a GTX 760M with a data size of N = 2048 (except convolution3d with N = 512 and gemm with N = 1024).

In this chapter, we describe the development of KLARAPTOR (Kernel LAunch parameters RAtional Program estimaTOR), a tool for automatically and dynamically determining the values of CUDA kernel launch parameters which optimize the kernel's performance, for each kernel invocation independently. That is to say, based on the actual data and target device of a kernel invocation. The accuracy of KLARAPTOR's prediction is illustrated in Figure 4.1, where execution times are given for each kernel in the PolyBench/GPU benchmark suite [50] on two different architectures. For each kernel, execution times are shown for three different thread block configurations: one chosen by KLARAPTOR, one resulting in the minimum time, and one resulting in the maximum time. The latter two are decided by an exhaustive search. In most cases, KLARAPTOR's prediction is very close to optimal; notice that the y-axis is log-scaled. Further experimental results are reported in Section 4.7. KLARAPTOR applies to CUDA a generic technique, also described herein in Section 4.4, to statically build a so-called rational program which is then used dynamically at runtime to determine optimal program parameters for a given multithreaded program on specific data and hardware parameters. The key principle is based on an observation about most performance metrics. In most performance prediction models, high-level performance metrics, such as execution time, memory consumption, and hardware occupancy, can be seen as decision trees or flowcharts based on low-level performance metrics, such as memory bandwidth and cache miss rate. These low-level metrics are themselves piece-wise rational functions (PRFs) of program, data, and hardware parameters. This construction can be applied recursively to obtain a PRF for the high-level metric. We regard a computer program that computes such a PRF as a rational program, a technical notion defined in Section 4.2. If one could determine these PRFs, then it would be possible to estimate, for example, the running time of a program based on its program, data, and hardware parameters. Unfortunately, exact formulas for low-level metrics are often not known; instead they are estimated through empirical measures or assumptions, or collected by profiling. This is a key challenge our technique addresses. In most cases the values of the data parameters are only given at runtime, making it difficult to determine optimal values of the program parameters at an earlier stage. On the other hand, a bad choice of program parameters can have drastic consequences. Hence, it is crucial to be able to determine the optimal program parameters at runtime without adding much overhead to the program execution. This is precisely the intention of the approach proposed here.

4.1.1 Contributions

The goal of this work is to determine values of program parameters which optimize a multithreaded program's performance. Towards that goal, the method by which such values are found must be receptive to changing data and changing hardware parameters. Our contributions encapsulate this requirement through the dynamic use of a rational program. Our specific contributions include: (i) a technique for devising a mathematical expression in the form of a rational program to evaluate a performance metric from a set of program and data parameters; (ii) KLARAPTOR, a tool implementing the rational program technique to dynamically optimize CUDA kernels by choosing optimal launch parameters; and (iii) an empirical and comprehensive evaluation of our tool on kernels from the PolyBench/GPU benchmark suite.

4.1.2 Related works

The Parallel Random Access Machine (PRAM) model [51, 52], including PRAM models tailored to GPU code analysis such as TMM [53] and MCM [54], analyzes the performance of parallel programs at an abstract level. More detailed GPU performance models have been proposed, such as MWP-CWP [55, 56], which estimates the execution time of GPU kernels based on the profiling

information of the kernels. In the context of improving CUDA program performance, other research groups have used techniques such as loop transformation [57], auto-tuning [58, 59, 60, 61], dynamic instrumentation [62], or a combination of the latter two [63]. Auto-tuning techniques have achieved great results in projects such as ATLAS [8], FFTW [6], and SPIRAL [64], in which multiple kernel versions are generated off-line and then applied and refined on-line once the runtime parameters are known. Although much research has been devoted to compiler optimizations for kernel source code or PTX code, previous works such as [65] and [49] suggest that kernel launch parameters (i.e., thread block configurations) have a large impact on performance and must be considered as a target for optimization. In [66], the authors present an input-adaptive GPU code optimization framework, G-ADAPT, which uses statistical learning to find a relation between the input sizes and the thread block sizes. At linking time, the framework predicts the best block size for a given input size using the linear model obtained at compile time. This approach only considers the total size of the thread blocks and not their configuration. Meanwhile, the authors of [60] use a linear regression model to predict optimal thread block configurations (that is, dimension sizes and not just the total size). However, they assume kernel execution time scales linearly with data size. The authors in [67] have also developed a method for determining the best thread block configuration, but similarly, they assume execution time scales linearly with data size. In [68], machine learning techniques are used in combination with auto-tuning to search for optimal configurations of OpenCL kernels, but their examples are limited to stencil computations.

4.1.3 Structure of the chapter

The remainder of this chapter is organized as follows. Section 4.2 formalizes and exemplifies the notion of rational programs and their relation to piece-wise rational functions and performance prediction. Section 4.3 gives an overview of the KLARAPTOR tool, which applies our technique to CUDA kernels. The general algorithm underlying our tool, that is, building and using a rational program to predict program performance, is given in Section 4.4. Section 4.5 shows our specialization of that algorithm to CUDA programs. Our implementation is detailed in Section 4.6 and evaluated in Section 4.7. Lastly, we draw conclusions and explore future work in Section 4.8.

4.2 Theoretical foundations

Let P be a multithreaded program to be executed on a targeted multiprocessor. Parameters influencing its performance include (i) data parameters, describing size and structure of the data; (ii) hardware parameters, describing hardware resources and their capabilities; and (iii) program parameters, characterizing parallel aspects of the program (e.g., how tasks are mapped to hardware resources).

By fixing the target architecture, the hardware parameters, say H = (H1, ..., Hh), become fixed and we can assume that the performance of P depends only on data parameters D = (D1, ..., Dd) and program parameters P = (P1, ..., Pp). Moreover, an optimal choice of P naturally depends on a specific choice of D. For example, in programs targeting GPUs the parameters D are typically dimension sizes of data structures, like arrays, while P typically specifies the formats of thread blocks. Let E be a high-level performance metric for P that we want to optimize. More precisely, given the values of the data parameters D, the goal is to find values of the program parameters P such that the execution of P optimizes E. Performance prediction models attempt to estimate E from a combination of P, D, H, and some model- or platform-specific low-level metrics L = (L1, ..., Lℓ). It is natural to assume that these low-level performance metrics are themselves combinations of P, D, H. This is an obvious observation for models based on PRAM such as TMM [53] and MCM [54]. Therefore, we look to obtain values for these low- and high-level metrics given values for program and data parameters. To address our goal, we compute a mathematical expression for each metric, parameterized by data and program parameters, in the format of a rational program, at the compile-time of P. At the runtime of P, given the specific values of D and a choice of P, we can evaluate the rational programs to obtain a value for each metric and thus E. These values can be used to determine which choice of P optimizes overall program performance. This method is detailed in Section 4.4. We take this section to define the rational program itself. One could view a rational program as simply a computer program evaluating some performance-predicting model. However, as we will see in the following sections, it is more than that. Specifically, the encoding of some model as a flowchart whose nodes can then be approximated as rational functions is a powerful idea which can be used to simplify models and extrapolate results.

4.2.1 Rational programs

Let X1, ..., Xn, Y be pairwise different variables.¹ Let S be a sequence of three-address code (TAC [69]) instructions such that the set of the variables 1. that occur in S, and 2. are never assigned a value by an instruction of S, is exactly {X1, ..., Xn}.

Consider n = 2. The sequence S1 below satisfies the above property while the sequence S2 does not.

¹Variables refer to both the mathematical meaning and the programming language concept.

S1                      S2
1: Y := X1 + X2         1: Y := X1 + X2
2: Z := X1 − X2         2: X1 := X1 − X2
3: Y := Y × Z           3: Y := Y × X1

Indeed, in S2 the variable X1 is assigned.

Definition 4.2.1. We say that the sequence S is rational if every arithmetic operation used in S is either an addition, a subtraction, a multiplication, or a comparison of integer numbers in either fixed or arbitrary precision. We say that the sequence S is a rational program in X1, ..., Xn evaluating Y if the following two conditions hold: 1. S is rational, and 2. after specializing each of X1, ..., Xn to some integer value in S, the execution of the specialized sequence always terminates and the last executed instruction assigns an integer value to Y.

Consider again n = 2. The sequences S3 and S4 are rational programs in X1, X2 evaluating Y, while the sequence S5 is not. In all examples, we use the language constructs if, goto, and exit with their standard meaning, instead of more formal three-address code.

S3                   S4                      S5
1: Y := X1 + X2      1: Y := X1              1: Y := X1 + X2
2: Z := X1 − X2      2: if Y > 0 goto 4      2: Z := X1/X2
3: Y := Y × Z        3: Y := −Y              3: Y := Y × Z
4: exit              4: exit                 4: exit

Indeed, the sequence S5 cannot be specialized at X2 = 0, due to the division X1/X2. It is worth noting that the above definition can easily be extended to include Euclidean division, the integer part operations floor and ceiling, and arithmetic over rational numbers. For Euclidean division one can write a rational program evaluating the quotient q of an integer a by b, leaving the remainder r to be simply calculated as a − qb. Then, floor and ceiling can be computed via Euclidean division. Rational numbers and their associated arithmetic are easily implemented using only integer arithmetic. Therefore, by adding these operations to Definition 4.2.1, the class of rational programs does not change. We regard rational programs as such henceforth.
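To make the extension to Euclidean division concrete, the following is a minimal sketch (our illustration, not code from this thesis) of a rational program, written in C, that evaluates the quotient q of a non-negative integer a by a positive integer b using only addition, subtraction, and comparison; the remainder then follows as r = a − q·b.

#include <stdio.h>

/* A rational program in (a, b) evaluating the Euclidean quotient q,
 * using only subtraction, addition, and integer comparison, as permitted
 * by the extended definition. Assumes a >= 0 and b > 0. */
long euclidean_quotient(long a, long b) {
    long q = 0;
    while (a >= b) {   /* comparison of integers */
        a = a - b;     /* subtraction */
        q = q + 1;     /* addition */
    }
    return q;          /* the last executed instruction assigns q */
}

int main(void) {
    long a = 17, b = 5;
    long q = euclidean_quotient(a, b);
    long r = a - q * b;                  /* r = a - q*b, as in the text */
    printf("q = %ld, r = %ld\n", q, r);  /* prints q = 3, r = 2 */
    return 0;
}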

4.2.2 Rational programs as flowcharts

For any sequence S of computer program instructions, one can associate S with a control flow graph (CFG). In the CFG of S, the nodes are the basic blocks of S. Recall that a flowchart is another graphic representation of a sequence of computer program instructions. In fact, CFGs can be seen as particular flowcharts. If, in a given flowchart C, every arithmetic operation occurring in every (process or decision) node is either an addition, subtraction, multiplication, or comparison of integers in either fixed or arbitrary precision, then C is the flowchart of a rational sequence of computer program instructions. Therefore, it is meaningful to depict rational programs using flowcharts, and vice versa, flowcharts as rational programs. For example, one could consider the metric of theoretical hardware occupancy as defined by NVIDIA. The following example details its definition, its depiction as a flowchart, as well as its dependency on program, data, and hardware parameters.

Example 4.2.2. Hardware occupancy, as defined in the CUDA programming model, is a measure of a program's effectiveness in using the Streaming Multiprocessors (SMs) of a GPU. Theoretical occupancy is calculated from a number of hardware parameters, namely:

- the maximum number Rmax of registers per thread block,
- the maximum number Zmax of shared memory words per thread block,
- the maximum number Tmax of threads per thread block,
- the maximum number Bmax of thread blocks per SM, and
- the maximum number Wmax of warps per SM,

as well as low-level kernel-dependent performance metrics, namely:

- the number R of registers used per thread, and
- the number Z of shared memory words used per thread block,

and a program parameter, namely the number T of threads per thread block. The occupancy of a CUDA kernel is defined as the ratio between the number of active warps per SM and the maximum number of warps per SM, namely Wactive/Wmax, where

Wactive = min(⌊Bactive · T / 32⌋, Wmax)    (4.1)

and Bactive is given by the flowchart in Figure 4.2. This flowchart shows how one can derive a rational program computing Bactive from Rmax, Zmax, Tmax, Bmax, Wmax, R, Z, T. It follows from formula (4.1) that Wactive can also be computed by a rational program from Rmax, Zmax, Tmax, Bmax, Wmax, R, Z, T. Finally, the same is true for the theoretical occupancy of a CUDA kernel using Wactive and Wmax.

4.2.3 Piece-wise rational functions in rational programs

We begin with an observation describing the fact that a rational program can be viewed as a piece-wise rational function.²

²Here, rational function is in the sense of algebra; see Section 4.6.4.

[Figure 4.2 depicts the following decision flowchart:

1. If T·Bmax ≤ 32·Wmax and R·T·Bmax ≤ Rmax and Z·Bmax ≤ Zmax, then Bactive = Bmax.
2. Else, if 32·Wmax ≤ T·Bmax and 32·Wmax·R ≤ Rmax and 32·Wmax·Z ≤ Zmax·T, then Bactive = ⌊(32·Wmax)/T⌋.
3. Else, if Rmax ≤ R·T·Bmax and Rmax ≤ 32·Wmax·R and Rmax·Z ≤ R·T·Zmax, then Bactive = ⌊Rmax/(R·T)⌋.
4. Else, if Zmax ≤ Bmax·Z and Zmax·T ≤ 32·Wmax·Z and Zmax·R·T ≤ Z·Rmax, then Bactive = ⌊Zmax/Z⌋.
5. Else, Bactive = 0 (failure to launch).]

Figure 4.2: Rational program (presented as a flowchart) for the calculation of active blocks in CUDA.
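As an illustration, the case discussion of Figure 4.2 can be transcribed into a small C function. This is a sketch written for this presentation, not code from KLARAPTOR; it computes the same active-block count by taking the minimum of the per-resource block limits, and then occupancy via formula (4.1). Parameter names follow the text.

#include <stdio.h>

static long min2(long a, long b) { return a < b ? a : b; }

/* Number of active thread blocks per SM: bounded by the block, warp,
 * register, and shared memory limits; the minimum reproduces the
 * flowchart's case discussion. */
long active_blocks(long Bmax, long Wmax, long Rmax, long Zmax,
                   long R, long Z, long T) {
    if (T <= 0 || R <= 0) return 0;
    long by_blocks = Bmax;
    long by_warps  = (32 * Wmax) / T;                 /* warp limit */
    long by_regs   = Rmax / (R * T);                  /* register limit */
    long by_shmem  = (Z > 0) ? (Zmax / Z) : Bmax;     /* shared memory limit */
    long b = min2(min2(by_blocks, by_warps), min2(by_regs, by_shmem));
    return b > 0 ? b : 0;                             /* 0: failure to launch */
}

/* Theoretical occupancy = Wactive / Wmax, using formula (4.1). */
double occupancy(long Bmax, long Wmax, long Rmax, long Zmax,
                 long R, long Z, long T) {
    long Bactive = active_blocks(Bmax, Wmax, Rmax, Zmax, R, Z, T);
    long Wactive = min2((Bactive * T) / 32, Wmax);
    return (double)Wactive / (double)Wmax;
}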

Observation 1. Let S be a rational program in X1, ..., Xn evaluating Y. Let s be any instruction of S other than a branch or an integer part instruction. Hence, this instruction can be of the form C := −A, C := A + B, C := A − B, or C := A × B, where A and B can be any rational numbers. Let V1, ..., Vv be the variables that are defined at the entry point of the basic block of the instruction s. An elementary proof by induction yields the following fact: there exists a rational function in V1, ..., Vv, which we denote by fs(V1, ..., Vv), such that C = fs(V1, ..., Vv) for all possible values of V1, ..., Vv. From there, one derives the following observation. There exists a partition T = {T1, T2, ...} of Q^n (where Q denotes the field of rational numbers) and rational functions f1(X1, ..., Xn), f2(X1, ..., Xn), ..., such that, if X1, ..., Xn receive respectively the values x1, ..., xn, then the value of Y returned by S is fi(x1, ..., xn), where i is such that (x1, ..., xn) ∈ Ti holds. In other words, S computes Y as a piece-wise rational function (PRF). Notice that, trivially, if S contains only one basic block, then S can be given by a single rational function. Example 4.2.2 shows that the hardware occupancy of a CUDA kernel is given as a piece-wise rational function in the variables Rmax, Zmax, Tmax, Bmax, Wmax, R, Z, T. Hence, in this example, we have n = 8, and, as shown by Figure 4.2, its partition of Q^n contains 5 parts as there are 5 terminating nodes in the flowchart.

Suppose that a flowchart C representing the rational program R is partially known; to be precise, suppose that the decision nodes are known (that is, the mathematical expressions defining them are known) while the process nodes are not. Then, from Observation 1, each process node can be given by a series of one or more rational functions. Trivially, a single formula can also be seen as a flowchart with a single process node. Determining each of those rational functions can be achieved by solving an interpolation or curve fitting problem. More generally, if the sequence of instructions in a process node involves non-rational functions (e.g., log), we can apply the Stone-Weierstrass theorem [70] to approximate each of those by a PRF. It then follows that any performance metric which can be depicted as a flowchart or a formula can also be represented as a piece-wise rational function, and thus a rational program. For high-level performance metrics, which rely on low-level metrics, one could work recursively, first determining rational programs for the low-level metrics which depend on P, D, and H, and then constructing a rational program for the high-level metric from those rational programs. Hence, by this recursive construction, we can fully determine a rational program for a high-level metric depending only on P, D, and H. Of course, hardware parameters could be fixed given a target architecture to yield a rational program which depends only on P and D. Again, notice that even where formulas for low-level metrics are not known, it is still possible to estimate them as PRFs, and thus rational programs, via curve fitting. As an example, consider occupancy (Example 4.2.2). One could first determine PRFs for the number of registers used per thread and the amount of shared memory used per thread block. Then, a PRF is determined for the number of active blocks (Figure 4.2) from these two low-level metrics and a few more hardware and program parameters. Thus, by recursive construction, we have a PRF depending only on program and hardware parameters.

Lastly, we make one final remark. We assumed that the decision nodes in the flowchart of the rational program were known; however, we could relax this assumption. Indeed, each decision node is given by a series of rational functions. Hence, those could also be determined by solving curve fitting problems. However, we do not discuss this further since it is not needed in our proposed technique or the implementation presented in the remainder of this chapter.

4.3 KLARAPTOR: a dynamic optimization tool for CUDA

The theory of rational programs is put into practice for the CUDA programming model by our tool KLARAPTOR. KLARAPTOR is a compile-time tool implemented using the LLVM Pass Framework and the MWP-CWP performance model to dynamically choose a CUDA kernel's launch parameters (thread block configuration) which optimize its performance. Most high-performance computing applications require computations to be as fast as possible, and so kernel performance is simply measured as its execution time. As mentioned in Section 4.1, thread block configurations drastically affect the running time of a kernel. Determining optimal thread block configurations typically follows some heuristics, for example, constraining the block size to be a multiple of 32 [71]. However, it is known that the dimension sizes of a thread block, not only its total size, affect performance [49, 65]. Moreover, since thread block configurations are intimately tied to the size of data being operated on, it is very unlikely that a static thread block configuration optimizes the performance for all data sizes. Our tool effectively uses rational programs to dynamically determine the thread block configuration which minimizes the execution time of a particular kernel invocation, considering the invocation's particular data size and target architecture. This is achieved in two main steps.

1. At the compile-time of a CUDA program, its kernels are analyzed in order to build rational programs estimating some performance metrics for each individual kernel. Each rational program, written as code in the C language, is inserted into the code of the CUDA program so that it is called before the execution of the corresponding kernel.

2. At runtime, immediately preceding the launch of a kernel, where data parameters have specific values, the rational program is evaluated to determine the thread block configuration which optimizes the performance of the kernel. The kernel is then launched using this thread block configuration.

Not only are we concerned with kernel performance, but also programmer performance. By that, we mean the efficiency of a programmer in producing optimal code. When a programmer is attempting to optimize a kernel, choosing optimal launch parameters can either be completely ignored, performed heuristically, determined by trial and error, or determined by an exhaustive search. The latter two options quickly become infeasible as data sizes grow large. Regardless, any choice of optimal thread block configuration is likely to optimize only a single data size.

For KLARAPTOR to be practical, not only does the choice of optimal kernel launch parameters need to be correct, but its two main steps must also be performed in a manner more efficient than trial and error or exhaustive search. Namely, the compile-time analysis cannot add too much overhead to the compilation time, and the runtime decision of the kernel launch parameters cannot overwhelm the program execution time. For the former, our analysis is performed quickly by analyzing kernel performance on only small data sizes, and then extrapolating the results. The time for this process typically ranges between 30 seconds and 2 minutes (see Section 4.7). For the latter, the rational program evaluation is quick and simple, being only the evaluation of a few rational functions. Moreover, we maintain a runtime invocation history to instantly provide results for future kernel launches. Our implementation is detailed in Section 4.6. We have made use of the PolyBench/GPU benchmark suite as an empirical evaluation of the correctness of our tool on a range of CUDA programs. In Figure 4.1 we have already seen that KLARAPTOR accurately predicts the optimal or near-optimal thread block configuration. Before presenting more detailed results and experimentation in Section 4.7, we describe the steps followed by our tool to build and use rational programs for determining a thread block configuration which optimizes performance.

4.4 An algorithm to build and deploy rational programs

In this section the notations and hypotheses are the same as in Section 4.2. Namely, E is a high-level performance metric for the multithreaded program P, L is a set of low-level metrics of size ℓ, and P, D, H are sets of program, data, and hardware parameters, respectively. Recall that P has size p. Let us assume that the values of H are known at the compile-time of P while the values of D are known at runtime. Further, let us assume that P and D take integer values. Hence the values of P belong to a finite set F ⊂ Z^p. That is to say, the possible values of P are tuples of the form (π1, ..., πp) ∈ F, with each πi being an integer. Let us call such a tuple a configuration of the program parameters. Due to the nature of program parameters, those are not necessarily all independent variables. For example, in CUDA, the product of the dimension sizes of a thread block is usually a multiple of the warp size (32). Given a performance-prediction model for E, one could work recursively to determine a single rational program R, depending on only D and P, evaluating E, from a combination of rational programs constructed for each low-level metric in L and the values of D and P. Following Section 4.2.3, each of these rational programs is constructed by computing rational functions. Without loss of generality, let us assume each low-level metric is given by a single formula and thus a single rational function. Hence, we look to determine g1(D, P), ..., gℓ(D, P) for the ℓ low-level metrics. Finally, at runtime, given particular values of D, the rational program for E can be evaluated for various values of P to determine the optimal one. In the context of CUDA, where we look to optimize execution time, the selection of program parameters leading to optimal performance is a complex task. We leave the discussion of that to Section 4.5. In the remainder of this section we describe the general process to build and use rational programs to determine optimal configurations. The entire process is decomposed into six steps: the first three occur at compile-time and the next three at runtime.

1. Data collection: To perform a curve fitting of the rational functions g1(D, P), ..., gℓ(D, P) we require data points to fit. These are collected by (i) selecting a subset K of points from the space of possible values of (D, P); and (ii) executing the program P, recording the values of the low-level performance metrics L as V = (V1, ..., Vℓ), at each point in K. Data used for executing the programs is generated randomly, but could follow some scheme provided by the user.

2. Rational function approximation: For each low-level metric Li we use the set of points K and the corresponding value Vi measured at each point to approximate the rational function gi(D, P). We observe that if these values were known exactly, the rational function gi(D, P) could be determined exactly. In practice, however, these empirical values are likely to be noisy from profiling and/or numerical approximations. Consequently, we actually determine a rational function ĝi(D, P) which approximates gi(D, P).

3. Code generation: In order to generate the rational program R, we proceed as follows: (i) we convert the rational program representing E into code, essentially encoding the CFG for computing E;

(ii) we convert each ĝi(D, P) into code, specifically sub-routines, estimating Li; and (iii) we include those sub-routines into the code computing E, which yields the desired rational program R, fully constructed and depending only on D and P.

4. Rational program evaluation: At the runtime of P, the data parameters D are given particular values. For those specified values of D and for all practically meaningful values of P from the set F,³ we compute an estimate of E using R. The evaluation of E over so many different possible program parameters is feasible for three reasons: (i) the number of program parameters is small, typically p ≤ 3, see Section 4.6; (ii) the set of meaningful values for P is small (consider that in CUDA the product of thread block dimension sizes should be a multiple of 32 and at most 1024); and (iii) the program R simply evaluates a few polynomial formulae and thus runs almost instantaneously.

5. Selection of optimal values of program parameters: When the search space of values of program parameters P is large, a numerical optimization technique is required for this step. But, as just explained, the total number of evaluations is quite small and thus an exhaustive search is feasible to determine an optimal configuration. However, it is possible that several configurations, up to some margin, optimize E. In practice, one can refine results by comparing several near-optimal

³The values for P are likely to be constrained by the values of D. For example, if P1, P2 are the two dimension sizes of a two-dimensional thread block of a CUDA kernel operating on a square matrix of order D1, then requiring P1·P2 ≤ D1² is meaningful.

configurations by using secondary metrics; see Section 4.5 for the case of KLARAPTOR's implementation.

6. Program execution: Once an optimal configuration is selected, the program P can finally be executed using this configuration along with the values of D.
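To illustrate steps 4 and 5, the following C sketch (hypothetical code, not KLARAPTOR's actual driver) enumerates meaningful thread block configurations for a given data size, evaluates a generated rational program for each, and keeps the configuration minimizing the estimated metric. The type rational_program_t and the restriction to power-of-two dimensions are our assumptions for the example.

/* Hypothetical signature of a generated rational program: it estimates
 * the high-level metric E (e.g., clock cycles) from the data size N and
 * a candidate two-dimensional block configuration (bx, by). */
typedef double (*rational_program_t)(long N, int bx, int by);

/* Steps 4 and 5: evaluate R over all meaningful configurations (block
 * sizes that are multiples of 32 and at most 1024) and select the
 * configuration minimizing the estimate. */
void select_config(rational_program_t R, long N, int *best_bx, int *best_by) {
    double best = -1.0;
    for (int bx = 1; bx <= 1024; bx *= 2) {
        for (int by = 1; bx * by <= 1024; by *= 2) {
            if ((bx * by) % 32 != 0) continue;  /* respect the warp size */
            double e = R(N, bx, by);            /* almost instantaneous */
            if (best < 0.0 || e < best) {
                best = e; *best_bx = bx; *best_by = by;
            }
        }
    }
}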

4.5 Runtime selection of thread block configuration for a CUDA kernel

In the previous section we examined the general six-step process to build and use a rational program to select program parameters. We now look to specialize this process to CUDA and describe precisely the algorithm followed by KLARAPTOR to select program parameters (i.e., thread block configurations) in its attempt to minimize kernel execution time (steps 4 and 5 in the general six-step process). Here, our high-level performance metric E is execution time, and we make use of the MWP-CWP model [55, 56] to estimate it. In this model the execution time is estimated as the estimated number of clock cycles, E_CC. From the previous six-step process, it seems trivial to simply select the configuration which minimizes E_CC. However, in some practical examples, this can lead to a very poor choice of thread block configuration which does not minimize execution time. While the MWP-CWP model is a quite comprehensive performance prediction model, it mainly relies on occupancy for estimating the number of clock cycles. This has a serious drawback: the calculated occupancy is an upper bound which can be too optimistic and is not a strong predictor of performance for compute-intensive kernels (see [72] and the "Help" sheet of the NVIDIA occupancy calculator spreadsheet, under "Notes on occupancy" [73]). Therefore, we still make use of MWP-CWP but do not take E_CC directly as the performance predictor. Using the values of some low-level metrics (also defined within the MWP-CWP model), together with E_CC, we perform a case discussion in order to select a thread block configuration. At runtime, the particular value of our data parameters is known. The values for the low-level metrics are then obtained, for the data parameter values and the set of practically meaningful thread block configurations, by evaluating the previously obtained rational functions ĝi(D, P). This yields a dictionary whose keys are configurations and whose values are lists of estimated performance metrics. The low-level metrics of interest are: (i) the average number of computational instructions per thread (Comp_Inst), (ii) the average number of memory instructions per thread (Mem_Inst), (iii) the average number of clock cycles spent on global memory transactions (Mem_CC), and (iv) the amount of dynamic shared memory used. We then also compute occupancy and E_CC for each configuration. At this point, we look to define a strategy which chooses a subset of configurations where any in the set would give near-optimal performance. Our proposed strategy takes into account not only occupancy and E_CC, but also the arithmetic intensity and the efficiency of global memory read/write transactions of a kernel. We define arithmetic intensity as the ratio

Comp_Inst/Mem_Inst. The main idea is first to determine if the kernel is memory intensive or compute intensive. If the kernel is memory intensive, we look to minimize memory instructions; otherwise it is assumed to be compute intensive and we look to both minimize the time spent on memory transactions and maximize computational instructions. The latter might increase the chance of exploiting instruction-level parallelism (ILP) by the hardware (however, this is not guaranteed to happen at runtime). This strategy is detailed in Algorithm 7.

Algorithm 7 HeuristicSelection
 1: procedure HeuristicSelection(dictionary D of [key, value] pairs with block configurations as keys and lists of performance metrics as values; AME, the average memory efficiency (percentage) of the kernel)
 2:   [X, nRepeat] ← [10, number of entries in D]
 3:   D ← Stabilize(D, "ArithmeticIntensity", X, nRepeat)
 4:       . Stabilize D w.r.t. values of arithmetic intensity.
 5:   AAR ← average of arithmetic intensity for the entries of D
 6:   MET ← 25%   . Setting the memory efficiency threshold to 25%.
 7:   if AME ≤ MET and AAR ≤ 1.0 then   . Memory-intensive
 8:     [X, nRepeat] ← [10, 10]
 9:     D ← Stabilize(D, "E_CC", X, nRepeat)
10:     [X, nRepeat] ← [10, 1]
11:     D ← Stabilize(D, "Mem_Inst", X, nRepeat)
12:     D ← Optimize(D, "MIN", "Mem_Inst", X, nRepeat)
13:     D ← Stabilize(D, "Comp_Inst", X, nRepeat)
14:   else   . Compute-intensive
15:     [X, nRepeat] ← [25, 1]
16:     D ← Stabilize(D, "Occupancy", X, nRepeat)
17:     [X, nRepeat] ← [10, 1]
18:     D ← Optimize(D, "MIN", "Mem_CC", X, nRepeat)
19:     D ← Optimize(D, "MAX", "Comp_Inst", X, nRepeat)
20:     D ← Stabilize(D, "Mem_Inst", X, nRepeat)
21:   end if
22:   return D
23: end procedure

Certain subroutines are used within Algorithm 7 in an attempt to filter outliers and select candidate configurations. The first function, Stabilize, removes outliers by iteratively filtering out configurations. The iteration proceeds until the standard deviation of the values of the target metric no longer changes, or a maximum number of iterations is reached. To remove outliers, configurations whose value of the chosen metric falls in the top or bottom X% of values are removed. Note that the standard deviation should indeed eventually reach some fixed value since the metrics take finitely many values. The next function, Optimize, simply gets the subset of configurations whose value for a particular metric falls in the top X% (if "MAX" is specified) or the bottom X% (if "MIN" is specified). To decide whether a kernel is memory intensive or not, first, using a profiler (through nvprof or CUPTI), we measure the average value of global memory (load/store) efficiency for the kernel over a set of block configurations. This value, which we refer to as "Average Memory Efficiency" (AME), should be less than a certain threshold. This threshold, which we refer to as the "Memory Efficiency Threshold" (MET), can be determined through empirical study. Currently, we set MET to 25%. Using AME together with the mean arithmetic intensity of all configurations indicates if a kernel is memory intensive or not. In particular, the mean arithmetic intensity should be at most equal to 1, that is, for every memory instruction there is at most one computational instruction performed. Finally, we note that the values for X and nRepeat in Algorithm 7 have also been determined through experimentation (lines 2, 8, 10, 15, and 17).
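As an illustration of the filtering step, here is a minimal C sketch of the Stabilize routine; the record layout (config_t) and helper names are hypothetical, but the logic follows the description above: repeatedly drop configurations whose metric value falls in the top or bottom X% until the standard deviation stops changing or nRepeat iterations have run.

#include <math.h>
#include <stdlib.h>

/* Hypothetical record: a block configuration and one metric value. */
typedef struct { int bx, by; double metric; } config_t;

static int cmp_metric(const void *a, const void *b) {
    double d = ((const config_t *)a)->metric - ((const config_t *)b)->metric;
    return (d > 0) - (d < 0);
}

static double stddev(const config_t *c, size_t n) {
    double mean = 0.0, var = 0.0;
    for (size_t i = 0; i < n; i++) mean += c[i].metric;
    mean /= (double)n;
    for (size_t i = 0; i < n; i++)
        var += (c[i].metric - mean) * (c[i].metric - mean);
    return sqrt(var / (double)n);
}

/* Drop configurations whose metric is in the top or bottom X percent,
 * iterating until the standard deviation is unchanged or nRepeat
 * iterations have run. Returns the new count of configurations. */
size_t stabilize(config_t *c, size_t n, double X, int nRepeat) {
    double prev_sd = -1.0;
    for (int it = 0; it < nRepeat && n > 2; it++) {
        qsort(c, n, sizeof(config_t), cmp_metric);
        size_t cut = (size_t)((X / 100.0) * (double)n);
        if (cut == 0 || 2 * cut >= n) break;
        for (size_t i = 0; i < n - 2 * cut; i++)  /* keep the middle entries */
            c[i] = c[i + cut];
        n -= 2 * cut;
        double sd = stddev(c, n);
        if (sd == prev_sd) break;                 /* standard deviation stabilized */
        prev_sd = sd;
    }
    return n;
}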

4.6 The implementation of KLARAPTOR

In this section we give an overview of the implementation of our previously presented technique (Sections 4.4 and 4.5) specialized to CUDA in the KLARAPTOR tool. Our tool is built in the C language, making use of the LLVM Pass Framework (see Section 4.6.2) and the NVIDIA CUPTI API (see Section 4.6.3). KLARAPTOR is freely available in source at https://github.com/orcca-uwo/KLARAPTOR. In the case of a CUDA kernel, the data parameters specify the input data size. In many examples this is a single parameter, say N, describing the size of an array (or the order of a multi-dimensional array), the values of which are usually powers of 2. Program parameters describe the kernel launch parameters, i.e., grid and thread block dimension sizes, and are also typically powers of 2. For example, a possible thread block configuration may be 1024 × 1 × 1 (a one-dimensional thread block), or 16 × 16 × 2 (a three-dimensional thread block). Lastly, the hardware parameters are values specific to the target GPU, for example, memory bandwidth, the number of SMs available, and their clock frequency. We organize this section as follows. Sections 4.6.1 and 4.6.2 are specific to our implementation and do not correspond to any step of Section 4.4. The compile-time steps 1 (data collection) and 2 (rational function estimation) are reflected in Sections 4.6.3 and 4.6.4, respectively, while step 3 requires no explanation. The runtime steps 4 (rational program evaluation) and 6 (program execution) are trivial to perform, while step 5 (selection of optimal configuration) is clear from the discussion in Section 4.5. Throughout this section we refer to the notion of a driver program as the code, for each individual kernel, using a rational program to select a configuration.

4.6.1 Annotating and preprocessing source code

Beginning with a CUDA program written in C/C++, we minimally annotate the host code to make it compatible with our pre-processor. We take into account the following points:

(i) the code targets at least CUDA Compute Capability (CC) 3.x; (ii) there should be no CUDA runtime API calls, as such calls will interfere with later CUDA driver API calls used by our tool, for example, cudaSetDevice; (iii) the block dimensions and grid dimensions must be declared as the typical CUDA dim3 structs. For each kernel in the CUDA code, we add two pragmas, one specifying the dimension of the kernel (1, 2, or 3), and one defining the index of the kernel input argument corresponding to the data size N. As an example, consider the code segment below of a CUDA kernel and the associated pragmas. This kernel operates on a two-dimensional array of order N, making it a two-dimensional kernel.

#pragma kernel_info_size_param_idx_Sample = 1;
#pragma kernel_info_dim_sample_kernel = 2;
__global__ void Sample (int *A, int N) {
    int tid_x = threadIdx.x + blockIdx.x*blockDim.x;
    int tid_y = threadIdx.y + blockIdx.y*blockDim.y;
    ...
}

Lastly, for each kernel, the user must fill two formatted configuration files which follow Python syntax. One specifies the constraints on the thread block configuration while the other specifies the grid dimensions. For example, for the 2D kernel Sample above, one could specify that its thread block configuration (bx, by, bz) must satisfy bx < by², bx < N and by < N. Since the kernel dimension is given as 2, we assume bz = 1. Similarly, the grid dimensions (gx, gy, gz) could be specified as gx = ⌈N/bx⌉, gy = ⌈N/by⌉, gz = 1. Now, a preprocessor processes the annotated source code, replacing CUDA runtime API calls with driver API kernel launches. This step includes source code analysis in order to extract a list of kernels, a list of kernel calls in the host code, and finally, the body of each kernel to be used for further analysis. A so-called "PTX lookup table" is built to store kernel information and static parameters. This table will be inserted into the "instrumented binary", the compiled CUDA program augmented by the driver programs.

4.6.2 Input/Output builder

The Input/Output builder pass, or IO-builder, is a compiler pass written in the LLVM Pass Framework to build the previously mentioned "instrumented binary". This pass embeds an IO mechanism (i.e., a function call) to communicate with the driver program of a kernel for each of its invocations. Thus, at the runtime of the CUDA program being analyzed (step 6 of Section 4.4), an IO function is called before each kernel invocation to return six integers, (gx, gy, gz, bx, by, bz), the optimal kernel launch parameters. The IO-builder pass goes through the following steps: (i) obtain the LLVM intermediate representation of the instrumented source code and find

all CUDA driver API kernel calls; (ii) relying on the annotated information for each kernel, determine which variables in the IR contain the value of N for a corresponding kernel call; and (iii) insert a call to an IO function immediately before each kernel call in order to pass the runtime value of N to the corresponding driver program and retrieve the optimal kernel launch parameters.

4.6.3 Building a driver program: data collection

In order to perform the eventual rational function approximation, we must collect data and statistics regarding certain performance counters and runtime metrics (see Section 4.5, [55] and [74]). These metrics can be partitioned into three categories. Firstly, architecture-specific performance counters of a kernel, characteristics influenced by the CC of the target device. These can be obtained at compile-time, since the target CC is specified at this time. These characteristics include the number of registers used per thread, the amount of static shared memory per thread block, and the number of (arithmetic and memory) instructions per thread. Secondly, runtime-specific performance counters that depend on the behavior of the kernel at runtime. This includes values impacted by memory access patterns, namely, the number of memory accesses per warp, the number of memory instructions of each thread, and the total number of warps that are being executed. We have developed a customized profiler using NVIDIA's EVENT API within the CUPTI API to collect the required runtime performance counters. The profiler is customized to collect only the required information, allowing it to be very lightweight and avoid the huge overheads of a typical profiler (e.g., NVIDIA's nvprof [4]). Thirdly, device-specific parameters, which describe an actual GPU card, allow us to compute a more precise performance estimate. A subset of such parameters can be determined by microbenchmarking the device (see [75] and [76]); this includes the memory bandwidth and the departure delay for memory accesses. The remaining parameters can easily be obtained by consulting the vendor's guide [31], or by querying the device itself via the CUDA driver API. Such parameters include the number of SMs on the card, the clock frequency of SM cores, and the instruction delay.

4.6.4 Building a driver program: rational function approximation

Using the runtime data collected in the previous step, we look to determine the rational functions ĝi(D, P) (see Section 4.4) which estimate the low-level metrics or other intermediate values in the rational program. For ease of discussion, we replace the parameters D and P with the generic variables X1, ..., Xn.

A rational function is simply a fraction of two polynomials:

f(X1, ..., Xn) = [ α1·(X1^0 ··· Xn^0) + ... + αi·(X1^u1 ··· Xn^un) ] / [ β1·(X1^0 ··· Xn^0) + ... + βj·(X1^v1 ··· Xn^vn) ]    (4.2)

With a degree bound (an upper limit on the exponent) for each variable Xk in the numerator and the denominator, uk and vk respectively, these polynomials can be defined up to some parameters (using the language of parameter estimation), namely the coefficients of the polynomials, α1, ..., αi and β1, ..., βj. Through algebraic analysis of performance models like the MWP-CWP model, and empirical evidence, these degree bounds are known to be relatively small.

We perform a parameter estimation (for each rational function) on the coefficients α1, ..., αi, β1, ..., βj to determine the rational function precisely. This is a simple linear regression which can be solved as an over-determined system of linear equations, say by the method of linear least squares. However, the system suffers from multicollinearity (see [77, Chapter 23]) and can become rank-deficient. Solving using the typical QR-factorization is then impossible; hence we use the computationally more intensive yet more numerically stable method of singular value decomposition (SVD; for details see [78, Chapter 4]). Our implementation uses LAPACK [79] for SVD and the Basic Polynomial Algebra Subprograms (BPAS) library [43] for efficient rational function and polynomial implementations.
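For concreteness, a rank-deficient least-squares problem of this kind can be solved through LAPACK's SVD-based driver. The following C sketch (our illustration, not KLARAPTOR's actual code) fits coefficients c minimizing ||A·c − b||₂ using LAPACKE_dgelss, where each row of A holds the monomial values of one sample point and b holds the measured metric values.

#include <stdio.h>
#include <lapacke.h>

/* Fit n coefficients from m >= n noisy samples by solving
 * min ||A c - b||_2 with the SVD (dgelss), which tolerates the rank
 * deficiency caused by multicollinearity. A is m x n, row-major;
 * on success, b[0..n-1] holds the least-squares solution. */
int fit_coefficients(lapack_int m, lapack_int n, double *A, double *b) {
    double s[16];              /* singular values; assumes n <= 16 */
    lapack_int rank;
    /* rcond < 0 selects machine precision as the rank cutoff */
    lapack_int info = LAPACKE_dgelss(LAPACK_ROW_MAJOR, m, n, 1,
                                     A, n, b, 1, s, -1.0, &rank);
    return (info == 0) ? (int)rank : -1;
}

int main(void) {
    /* toy example: fit f(x) = c0 + c1*x to 4 samples of f(x) = 1 + 2x */
    double A[] = { 1, 0,   1, 1,   1, 2,   1, 3 };
    double b[] = { 1, 3, 5, 7 };
    int rank = fit_coefficients(4, 2, A, b);
    printf("rank = %d, c0 = %.3f, c1 = %.3f\n", rank, b[0], b[1]);
    return 0;
}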

4.7 Experimentation

In this section we examine the performance of KLARAPTOR by applying it to the CUDA programs of the PolyBench/GPU benchmark suite [58]. We note here that many of the kernels in this suite perform relatively low amounts of work; they are best suited to being executed many times from a loop in the host code. Data in this section was collected using a GTX 1080Ti. Table 4.1 provides experimental data for the main kernels in the benchmark suite PolyBench/GPU. Namely, this table compares the execution times of the thread block configuration chosen by KLARAPTOR against the optimal thread block configuration found through exhaustive search. The table shows a couple of data sizes in order to highlight that the best configuration can change for different input sizes. While it may appear for some examples that there are large variations between the timings of the KLARAPTOR-chosen configuration and the optimal, these should be considered within the full range of possible configurations. Recall from Figure 4.1 that, compared to the worst possible timings, the KLARAPTOR-chosen configuration and the optimal result in very similar timings. In Figure 4.3 we compare the time it takes KLARAPTOR to perform its compile-time analysis and build the rational programs for each example in the PolyBench/GPU suite. This is compared against determining the optimal thread block configuration by an exhaustive search. Since KLARAPTOR's compile-time analysis is a one-time occurrence which optimizes for all data sizes, exhaustive search times are given as a sum over data sizes up to N = 8192.

Table 4.1: KLARAPTOR vs. exhaustive search for thread block configuration choice for kernels in PolyBench/GPU.

Kernel            N     KLARAPTOR    Chosen     Optimal      Optimal
                        Time (ms)    Config.    Time (ms)    Config.
atax K1          4096       2.35     32, 4         0.85      32, 1
atax K1          8192      27.83      1, 64        4.33      16, 2
atax K2          4096       1.09     16, 2         1.04      32, 1
atax K2          8192       2.20     32, 1         2.19      64, 1
bicg K1          4096       1.05    256, 1         1.05      32, 1
bicg K1          8192       2.23    256, 1         2.21      64, 1
bicg K2          4096       1.15      8, 4         0.85      32, 1
bicg K2          8192      12.58    256, 4         4.35     512, 1
convolution2d    4096       0.79    256, 1         0.77      32, 4
convolution2d    8192       2.54    256, 4         2.35      32, 4
corr             4096    5700.65    256, 1      5075.77      32, 1
corr             8192   27846.91    256, 1     26024.94      32, 1
covar            4096    5682.96    256, 1      5076.77      32, 1
covar            8192   27865.89    256, 1     26182.65      32, 1
fdtd_step1       4096       0.56    256, 1         0.56      32, 2
fdtd_step1       8192       2.22    256, 4         2.22      32, 4
fdtd_step2       4096       0.58    256, 1         0.58     512, 1
fdtd_step2       8192       2.33     32, 16        2.30     512, 1
fdtd_step3       4096       0.77    256, 1         0.77     512, 2
fdtd_step3       8192       3.06    256, 4         3.05    1024, 1
gemm             4096     723.29    256, 1       386.76      32, 32
gemm             8192    7481.13    256, 1      3069.66      32, 16
gesummv          4096       8.19      2, 16        1.62      32, 1
gesummv          8192      82.21     32, 16       11.58      64, 1
gramschmidt K1   4096       0.09      4, 32        0.09     256, 1
gramschmidt K1   8192       0.20      8, 32        0.17      64, 1
gramschmidt K2   4096       0.01     32, 2         0.01     256, 1
gramschmidt K2   8192       0.01    512, 2         0.01     256, 1
gramschmidt K3   4096       2.15    256, 1         2.11      32, 1
gramschmidt K3   8192       4.68    256, 1         4.61      32, 1
mm2 K1           4096     695.23    256, 1       384.93      32, 32
mm2 K1           8192    7531.13    256, 1      3062.26      32, 16
mm2 K2           4096     761.49    256, 1       386.61      32, 32
mm2 K2           8192    7533.08    256, 1      3077.75      32, 16
mm3 K1           4096     749.27    256, 1       388.40      32, 32
mm3 K1           8192    7531.56    256, 1      3065.34      32, 16
mm3 K2           4096     816.08    256, 1       389.13      32, 16
mm3 K2           8192    7532.66    256, 1      3067.87      32, 16
mm3 K3           4096     737.21    256, 1       392.81      32, 16
mm3 K3           8192    7530.24    256, 1      3085.43      32, 16
mvt K1           4096       1.15      8, 4         0.86      32, 1
mvt K1           8192      12.90    256, 4         4.35      16, 2
mvt K2           4096       1.05    256, 1         1.05      32, 1
mvt K2           8192       2.23    256, 1         2.21     128, 1
syr2k            4096    7050.62      1, 64     2097.15       4, 32
syr2k            8192   18013.51     16, 64    17398.88       4, 8
syrk             4096    2973.88      2, 16     1165.24      16, 16
syrk             8192   15936.21     32, 16     9368.56      16, 16

The best and worst execution times for the main kernel in each example (for N = 8192) are also given to highlight the fact that our optimization step is sometimes faster than even a single execution of a kernel with a poor choice of thread block configuration. We note that for some kernels with very quick running times, exhaustive search is not a bad option. However, some examples, such as GRAMSCHMIDT, take an exorbitant amount of time for exhaustive search. This figure also shows that the one-time compile-time cost of optimization can often be amortized by only a few executions of the kernel.

Figure 4.3: Comparing times (log-scaled) for (1) the compile-time optimization steps of KLARAPTOR, (2) exhaustive search over all thread block configurations, and the execution time for a kernel given (3) the best thread block configuration and (4) the worst thread block configuration. Exhaustive search is given as a sum for values up to N = 8192 (except convolution3d with N = 1024).

4.8 Conclusions and future work

The performance of a single CUDA program can vary wildly depending on the target GPU device, the input data size, and the kernel launch parameters. Moreover, a thread block configuration yielding optimal performance for a particular data size or a particular target device will not necessarily be optimal for a different data size or different target device. In this chapter we have presented the KLARAPTOR tool for determining optimal CUDA thread block configurations for a target architecture, in a way which is adaptive to each kernel invocation and input data, allowing for dynamic data-dependent performance and portable performance. This tool is based upon our technique of encoding a performance prediction model as a rational program. The process of constructing such a rational program is a fast and automatic compile-time process which occurs simultaneously with compiling the CUDA program by use of the LLVM Pass Framework. Our tool was tested using the kernels of the PolyBench/GPU benchmark suite with great results. However, one of our biggest challenges (see Section 4.5) is obtaining accurate values for occupancy. Recently, the author of [80] and [72] has suggested a GPU performance model relying on Little's law; it measures concurrency as a product of latency and throughput. This model considers both warp and instruction concurrency, while previous models [31, 55, 56, 81] consider only warp concurrency. The author's analysis of those models suggests their limitation is the significant underestimation of occupancy when arithmetic intensity (the number of arithmetic instructions per memory access) is intermediate. In future work we look to apply an improved performance prediction model in order to achieve even better results. The algorithm presented in Section 4.4 is not specific to the problem of optimizing program parameters of CUDA kernels. This algorithm could also be used to help optimize program parameters of multithreaded code targeting CPUs or multi-process code targeting distributed systems. The main requirement is to have a mathematical model which expresses the metric to be optimized, in terms of quantities that can be measured, by means of a piece-wise rational function.

5 Arbitrary-precision Integer Multiplication on GPUs

5.1 Introduction

Arbitrary-precision integer arithmetic is driven by applications in solving systems of polynomial equations and cryptography. Such computations arise when high precision is required (that is, when values fit into multiple machine words), or in order to avoid coefficient overflow due to intermediate expression swell. Among the arithmetic operations, multiplication of large integers is especially important as it is at the core of many other algorithms. For the past few decades (beginning with Karatsuba-Ofman in 1962 [82]), numerous algorithms for multiplying long integers have been proposed, each with various implementations and platform-specific optimizations. Multiplication has been the subject of active research, but recent results have mostly been focused on improving the theoretical complexity bound. Table 5.1 presents the complexity estimates for some of the algorithms used for multiplying two polynomials of degree less than k. Table 5.2 presents the complexity estimates for some of the algorithms used for multiplying two large integers with k digits. For more details see [83, 44, 23, 84, 19, 21, 41, 85, 86, 87].

Table 5.1: Comparison of algorithms for multiplying polynomials f(x), g(x) ∈ R[x] of degree less than k.

Variant         Algorithm                              M(k)
Non FFT-based   Classical (Toom-1)                     O(k^2)
                Karatsuba-Ofman (Toom-2), 1962 [82]    O(k^(log2 3)) = O(k^1.58)
                Toom-Cook (Toom-3), 1963 [83]          O(k^(log3 5)) = O(k^1.46)
FFT-based       NTT/FFT multiplication [44, 23]        O((9/2) k log(k))

Meanwhile, the growing demand for faster computation alongside the recent advances in hardware technology have led to the development of a vast array of many-core and multi-core processors, programming models, and language extensions (e.g., CUDA and OpenCL for GPUs, and OpenMP and Cilk for multi-core CPUs). The massive computational power of


Table 5.2: Comparison of algorithms for multiplying k-machine-word integers.

Variant     Algorithm                                   M(k)
FFT-based   Fürer, 2009 [19, 21]                        O(k log(k) 2^(O(log* k)))
            Harvey-van der Hoeven-Lecerf, 2015 [41]     O(k log(k) 2^(3 log* k))
            Harvey-van der Hoeven, 2018 [85]            O(k log(k) 2^(2 log* k)) (proved as an unconditional bound)
            Covanov-Thomé, 2019 [86]                    O(k log(k) 4^(log* k))
            Harvey-van der Hoeven, 2019 [87]            O(k log(k))

parallel processors, especially GPUs, makes them viable targets for carrying out arbitrary-precision integer arithmetic. Nevertheless, developing parallel algorithms, followed by implementing and optimizing them as multi-threaded parallel programs, imposes a set of challenges. In this chapter we explain the current state of research on arbitrary-precision integer multiplication on CUDA-enabled GPUs and propose a parallel solution for this problem.

5.2 Essential definitions for designing parallel algorithms

We need the following definitions for the design and analysis of parallel algorithms on many-core processors such as GPUs. See [88, 89] for more details.
- Data parallelism: executing the same operation or function across a large dataset.
- Work: the total time to execute a task on one processor, denoted as T1.
- Span: the critical path in the DAG of a task, denoted as T∞. In other words, it indicates the execution time on an infinite number of processors.
- Parallelism: the ratio of work to span, T1/T∞.
- Speedup: for TP, the execution time on P processors, the speedup is computed as T1/TP.
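As a small illustrative example (ours, not from the cited references): summing n numbers with a balanced reduction tree has work T1 = n − 1 additions and span T∞ = ⌈log2 n⌉, so its parallelism is (n − 1)/⌈log2 n⌉; on P processors the speedup T1/TP can approach P only while P remains well below this parallelism.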

5.3 Choice of algorithm for parallelization

At this point, we face three questions. 1. Which algorithm should we choose for a parallel implementation? 2. How can we reduce the span and increase the degree of parallelism? In other words, at what level are the operations executed in a data-parallel fashion? 3. How can we maximize the device utilization and avoid wasting clock cycles? Improvements of the asymptotically fast algorithms, such as the variants of FFT-based multiplication, do not necessarily lead to practical outcomes for everyday computing due to the fact that such algorithms need very large input sizes to be practically efficient. Besides that, even when computing with FFT-based solutions on large input sizes (such as [1] and [2]), we might still need to multiply large integers that are not as large as the input but still require multiple machine words. The multiplication by twiddle factors as part of an FFT is a significant example of this. This makes less efficient algorithms (in terms of algebraic complexity) favorable candidates for parallelization and efficient implementation. In this work, we propose a new fine-grained parallel algorithm for multiplying arbitrary-precision integers of k digits.

5.4 Problem definition

Let X, Y be two vectors of N elements of a ring R:

X = (X0, X1, ..., X_{N−1}),
Y = (Y0, Y1, ..., Y_{N−1}).

We are interested in a fast solution for the pointwise multiplication of X by Y:

Z = (X0·Y0, X1·Y1, ..., X_{N−1}·Y_{N−1}).

Let β = 2^b, where b is the size of a machine word; R is either Z/mZ for m = β^k, or Z/pZ for p = r^k + 1 where the radix r fits into a machine word. In other words, each element of R, which we will refer to as a "large integer", fits into k machine words and is represented as a vector A = (a0, a1, ..., a_{k−1}) with each digit ai < β. The value of β depends on the architecture and is typically either β = 2^32 or β = 2^64. For the rest of this work, we assume β = 2^32, k is a multiple of 32, and N is a positive integer. Note that we can encode A, B ∈ R as polynomials A(x) = a0 + a1·x + ... + a_{k−1}·x^{k−1} and B(x) = b0 + b1·x + ... + b_{k−1}·x^{k−1} in R[x]. Then, we can compute A × B over R by evaluating A(x)·B(x) at x = β. In practice, evaluation at β does not directly yield the final result, as it leads to a 2k-digit integer that should eventually be represented in the base β. A more efficient solution is to perform a sequence of carry-handling operations which we will refer to as an ℓhc-normalization step. Now, we consider two possibilities for assigning work to threads that compute the pointwise multiplication of X by Y. Precisely, this decision will determine the level of data parallelism in our implementation in the following ways.

(i) Parallelization at the level of coefficients: assume each Xi·Yi product is computed by exactly one thread. This approach can be efficient if k is not large (e.g., between 2 and 16 digits), or when an efficient parallel algorithm is not available. (ii) Parallelization at the level of digits: this solution is based on parallel arithmetic, that is, the nontrivial process of parallelizing the operations inside a function. In this case, each

Xi.Yi product is computed by multiple threads. This approach requires some synchro- nization mechanism which can cause some overhead if handled carelessly. However, for large enough values of k (e.g., more than or equal to 32 digits) this can be a more ecient solution provided that we have a low span algorithm with a high degree of parallelism. We keep in mind the trade-o between the two approaches with respect to the cost of schedul- ing threads, global memory communications, and synchronization overhead. In [1], we have implemented the rst approach for twiddle factor multiplications. In this work, we study the second approach both theoretically and experimentally.
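Here is a minimal CUDA sketch of the two mappings (the kernel names, the layout of one integer per $k$ consecutive digits, and the stubbed bodies are illustrative assumptions, not the thesis implementation):

    #include <stdint.h>

    // (i) Coefficient-level: one thread computes one full product X_i * Y_i,
    // running a sequential k-digit multiplication entirely on its own.
    __global__ void pointwise_per_thread(const uint32_t *X, const uint32_t *Y,
                                         uint32_t *Z, int k, int N) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < N) {
            // seq_mult_k(&X[i * k], &Y[i * k], &Z[i * 2 * k], k);
            // (hypothetical per-thread k-digit multiplication, body omitted)
        }
    }

    // (ii) Digit-level: one block of threads cooperates on a single product
    // X_i * Y_i, synchronizing between phases with __syncthreads().
    __global__ void pointwise_per_block(const uint32_t *X, const uint32_t *Y,
                                        uint32_t *Z, int k) {
        int i = blockIdx.x;   // which product this block is responsible for
        int d = threadIdx.x;  // which digit(s) this thread handles
        // ... cooperative multiplication of the k-digit operands at
        // X + i * k and Y + i * k, writing 2k digits to Z + i * 2 * k ...
        (void)i; (void)d;
    }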

5.5 A fine-grained parallel multiplication algorithm

Algorithm 8 presents a naive algorithm for the multiplication of two k-digit large integers.

Algorithm 8 Naive algorithm for computing the product of large k-digit integers A and B.
 1: input: Two k-digit integers A and B stored as vectors $\vec{A}$ and $\vec{B}$.
 2: output: Large 2k-digit integer N as the product of A by B.
 3: procedure NaiveMult($\vec{A}$, $\vec{B}$, k)
 4:   M = (M_0, M_1, ..., M_{2k-1}) ← ([0,0,0], [0,0,0], ..., [0,0,0])  ▷ Vector of 2k triple digits.
 5:   for (i = 0; i < k; i = i + 1) do
 6:     for (j = 0; j < k; j = j + 1) do
 7:       M_{i+j} ← M_{i+j} + A_i B_j  ▷ Single-digit multiplication and triple-digit addition.
 8:     end for
 9:   end for
10:   N ← ConvertLHC(M, k)  ▷ ℓhc-normalization step.
11:   return N
12: end procedure
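For reference, here is a minimal sequential C sketch of Algorithm 8 (a hedged illustration assuming $\beta = 2^{32}$; the helper name add_to_triple is ours, not from the thesis):

    #include <stdint.h>

    /* Add a 64-bit value into a triple digit t = [low, high, carry]. */
    static void add_to_triple(uint32_t t[3], uint64_t v) {
        uint64_t lo = (uint64_t)t[0] + (v & 0xFFFFFFFFu);
        t[0] = (uint32_t)lo;
        uint64_t hi = (uint64_t)t[1] + (v >> 32) + (lo >> 32);
        t[1] = (uint32_t)hi;
        t[2] += (uint32_t)(hi >> 32);
    }

    /* Accumulate all single-digit products A[i]*B[j] into M[i+j];
       M must then go through the lhc-normalization step (ConvertLHC). */
    void naive_mult(const uint32_t *A, const uint32_t *B,
                    uint32_t (*M)[3], int k) {
        for (int i = 0; i < k; i++)
            for (int j = 0; j < k; j++)
                add_to_triple(M[i + j], (uint64_t)A[i] * B[j]);
    }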

Algorithm 11 presents the carry-handling procedure, which we will refer to as ℓhc-normalization. Next, Algorithm 9 presents an improved version of Algorithm 8 that takes advantage of Karatsuba intermediate products.

Karatsuba intermediate product: the Karatsuba intermediate product for machine-word-size digits $a_i$, $a_j$, $b_i$, $b_j$ is defined as follows:

$$a_i b_j + a_j b_i = (a_i - a_j)(b_j - b_i) + z_i + z_j$$

where $z_i = a_i b_i$ and $z_j = a_j b_j$. Indeed, expanding the right-hand side gives $a_i b_j - a_i b_i - a_j b_j + a_j b_i + z_i + z_j = a_i b_j + a_j b_i$. There are two main ideas at the core of Algorithm 9:

(i) we precompute the values of $z_i = a_i b_i$ for $0 \le i < k$, and
(ii) the for-loops I, II, and III are well-structured for parallel execution.

Note that in Algorithm 9 there are no dependencies between any two iterations of Loops I and III, which indicates that they are well-suited for parallel execution, and in particular for a GPU implementation. However, Loop II needs a chunking strategy for efficient usage of the global memory. As the chunking strategy, we divide the vector of coefficients of each of the operands A and B into chunks of size s. For the sake of simplicity, we assume k is a multiple of s and s is a multiple of 32. Rewriting Loop II of Algorithm 9 with this chunking strategy leads to Algorithm 10. We have two nested loops over the SingleCM and DoubleCM routines, which are shown in Algorithms 12 and 13, respectively. Procedure M5Mult in Algorithm 10 is well-suited for a parallel implementation; meanwhile, as we will show in the next section, Algorithm 10 has the same work complexity as Algorithm 9.

Algorithm 9 Sequential algorithm for computing the product of large k-digit integers A and B while taking advantage of Karatsuba intermediate products.
 1: input: Two k-digit integers A and B stored as vectors $\vec{A}$ and $\vec{B}$.
 2: output: Large 2k-digit integer N as the product of A by B.
 3: procedure SequentialMult($\vec{A}$, $\vec{B}$, k)
 4:   M = (M_0, M_1, ..., M_{2k-1}) ← ([0,0,0], ..., [0,0,0])  ▷ Vector of 2k triple digits.
 5:   Z = (Z_0, Z_1, ..., Z_{k-1}) ← ([0,0], ..., [0,0])  ▷ Vector of k double digits.
 6:   for (i = 0; i < k; i = i + 1) do  ▷ Loop I
 7:     Z_i ← A_i B_i  ▷ Single-digit multiplication.
 8:   end for
 9:   for (i = 0; i < k; i = i + 1) do  ▷ Loop II
10:     for (j = i + 1; j < k; j = j + 1) do
11:       T ← (A_i − A_j)(B_j − B_i) + Z_i + Z_j  ▷ Computing Karatsuba intermediate product.
12:       M_{i+j} ← M_{i+j} + T  ▷ One triple-digit addition.
13:     end for
14:   end for
15:   for (i = 0; i < k; i = i + 1) do  ▷ Loop III
16:     M_{2i} ← M_{2i} + Z_i  ▷ Triple-digit addition.
17:   end for
18:   N ← ConvertLHC(M, k)  ▷ ℓhc-normalization step.
19:   return N
20: end procedure
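For illustration, here is a hedged C-style sketch (also usable in CUDA device code) of one way to realize the Karatsuba intermediate product of line 11 over 32-bit digits; the sign-magnitude handling is an implementation assumption of ours, not taken from the thesis:

    #include <stdint.h>

    /* T = (ai - aj)(bj - bi) + zi + zj, stored as a triple digit.
       The differences are handled in sign-magnitude form so that the
       magnitude product fits in 64 bits; the final T is non-negative
       because it equals ai*bj + aj*bi. */
    static void karatsuba_ip(uint32_t ai, uint32_t aj,
                             uint32_t bi, uint32_t bj,
                             uint64_t zi, uint64_t zj, uint32_t T[3]) {
        uint32_t da = ai >= aj ? ai - aj : aj - ai;
        uint32_t db = bj >= bi ? bj - bi : bi - bj;
        int negative = (ai >= aj) != (bj >= bi);  /* sign of the product */
        uint64_t p = (uint64_t)da * db;           /* |ai-aj| * |bj-bi|   */

        uint64_t s = zi + zj;                     /* may exceed 64 bits  */
        uint32_t top = (s < zi);                  /* carry out of zi+zj  */

        if (negative) { top -= (s < p); s -= p; } /* borrow cannot underflow:
                                                     the total is >= 0   */
        else          { top += (s + p < s); s += p; }

        T[0] = (uint32_t)s;
        T[1] = (uint32_t)(s >> 32);
        T[2] = top;
    }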

5.6 Complexity analysis

To simplify the complexity analysis, we use Table 5.3, which presents the relative cost of the arithmetic operations used in the analysis. In this table, $T_{A1}$ is considered as the base case. Note that $\eta = T_{M1}/T_{A1}$ is defined as a measure for comparing the cost of a single-digit multiplication to that of a single-digit addition.

Algorithm 10 Chunk-based algorithm for computing the product of large k-digit integers A and B while taking advantage of Karatsuba intermediate products.
 1: input: Two k-digit integers A and B stored as vectors $\vec{A}$ and $\vec{B}$.
 2: output: Large 2k-digit integer N as the product of A by B.
 3: procedure M5Mult($\vec{A}$, $\vec{B}$, k)
 4:   M = (M_0, ..., M_{2k-1}) ← ([0,0,0], ..., [0,0,0])  ▷ Vector of 2k triple digits.
 5:   Z = (Z_0, ..., Z_{k-1}) ← ([0,0], ..., [0,0])  ▷ Vector of k double digits.
 6:   for (i = 0; i < k; i = i + 1) do  ▷ Loop I can be executed in parallel.
 7:     Z_i ← A_i B_i
 8:   end for
 9:   λ ← k/s  ▷ Number of chunks.
10:   for (ci = 0; ci < λ; ci = ci + 1) do  ▷ Can be executed in parallel.
11:     M ← SingleCM(M, A, B, Z, ci, s)
12:   end for
13:   for (ci = 0; ci < λ; ci = ci + 1) do  ▷ This loop CANNOT be executed in parallel.
14:     for (cj = ci + 1; cj < λ; cj = cj + 1) do  ▷ Can be executed in parallel.
15:       M ← DoubleCM(M, A, B, Z, ci, cj, s)
16:     end for
17:   end for
18:   for (i = 0; i < k; i = i + 1) do  ▷ Loop III can be executed in parallel.
19:     M_{2i} ← M_{2i} + Z_i
20:   end for
21:   N ← ConvertLHC(M, k)  ▷ ℓhc-normalization step.
22:   return N
23: end procedure
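To make the dependence structure concrete, the following is a hedged CUDA sketch of one possible host-side orchestration of M5Mult (the kernel names, launch shapes, and the stubbed bodies are illustrative assumptions, not the thesis implementation):

    #include <stdint.h>

    // Loop I: one thread per digit; no dependencies between iterations.
    __global__ void loop1_kernel(const uint32_t *A, const uint32_t *B,
                                 uint64_t *Z, int k) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < k) Z[i] = (uint64_t)A[i] * B[i];
    }

    // Loop III: one thread per digit; adds Z_i into the triple digit M_2i.
    __global__ void loop3_kernel(uint32_t (*M)[3], const uint64_t *Z, int k) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < k) {
            uint64_t lo = (uint64_t)M[2*i][0] + (uint32_t)Z[i];
            M[2*i][0] = (uint32_t)lo;
            uint64_t hi = (uint64_t)M[2*i][1] + (Z[i] >> 32) + (lo >> 32);
            M[2*i][1] = (uint32_t)hi;
            M[2*i][2] += (uint32_t)(hi >> 32);
        }
    }

    // Stubs standing in for Algorithms 12 and 13 (bodies omitted).
    __global__ void single_cm_kernel(uint32_t (*M)[3], const uint32_t *A,
                                     const uint32_t *B, const uint64_t *Z,
                                     int s) { /* one chunk per block */ }
    __global__ void double_cm_kernel(uint32_t (*M)[3], const uint32_t *A,
                                     const uint32_t *B, const uint64_t *Z,
                                     int ci, int s) { /* cj = ci+1+blockIdx.x */ }

    void m5mult_device(uint32_t (*dM)[3], const uint32_t *dA,
                       const uint32_t *dB, uint64_t *dZ, int k, int s) {
        int lambda = k / s;
        loop1_kernel<<<(k + 255) / 256, 256>>>(dA, dB, dZ, k);
        single_cm_kernel<<<lambda, s>>>(dM, dA, dB, dZ, s);  // all chunks at once
        for (int ci = 0; ci + 1 < lambda; ci++)              // sequential outer loop
            double_cm_kernel<<<lambda - ci - 1, s>>>(dM, dA, dB, dZ, ci, s);
        loop3_kernel<<<(k + 255) / 256, 256>>>(dM, dZ, k);
        // The lhc-normalization (Algorithm 11) follows as a final step.
    }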

Table 5.3: Relative cost of arithmetic operations.

Operation | Cost                        | Definition
A1        | $T_{A1}$ (platform-dependent) | Adding two single-digit unsigned integers.
M1        | $\eta T_{A1}$ for $\eta \ge 1$ | Multiplying two single-digit unsigned integers.
S1        | $T_{A1}$                    | Subtracting two single-digit unsigned integers.
A3        | $3 T_{A1}$                  | Adding two triple-digit unsigned integers with carries.
M2        | $2T_{S1} + T_{M1} + 2T_{A3}$ | Computing a Karatsuba intermediate product.
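For later reference, substituting $T_{S1} = T_{A1}$ and $T_{A3} = 3T_{A1}$ into the last row of Table 5.3 gives

$$T_{M2} = (\eta + 8)\,T_{A1}, \qquad T_{M2} + T_{A3} = (\eta + 11)\,T_{A1},$$

which is the source of the recurring factor $\eta + 11$ in the calculations below.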

Let $\lambda = k/s$ be the number of chunks. We begin with the work-complexity analysis of NaiveMult; it has $k^2$ iterations, each computing a single-digit multiplication and a triple-digit addition (except the first iteration):

$$T_1^{\mathrm{NaiveMult}}(k) = k^2\,T_{M1} + (k-1)^2\,T_{A3} \qquad (5.1)$$

Then, we compute the work for Loops I, II, and III of SequentialMult as follows. Loop I has $k$ iterations, each computing a single-digit multiplication. Loop II has $\frac{k(k-1)}{2}$ iterations, each computing a Karatsuba intermediate product and a triple-digit addition. Finally, Loop III has $k$ iterations, each computing a triple-digit addition. To summarize:

$$T_1^{\mathrm{I}}(k) = k\,T_{M1} \qquad (5.2)$$
$$T_1^{\mathrm{II}}(k) = \frac{k(k-1)}{2}\,(T_{M2} + T_{A3}) \qquad (5.3)$$
$$T_1^{\mathrm{III}}(k) = k\,T_{A3} \qquad (5.4)$$

In total, the work for SequentialMult is:

$$T_1^{\mathrm{SequentialMult}}(k) = T_1^{\mathrm{I}}(k) + T_1^{\mathrm{II}}(k) + T_1^{\mathrm{III}}(k) = k\,T_{M1} + \frac{k(k+1)}{2}\,T_{A3} + \frac{k(k-1)}{2}\,T_{M2} \qquad (5.5)$$

Rewriting (5.1) and (5.5) in terms of single-digit additions and multiplications, then simplifying them solely with respect to single-digit additions, we have:

$$T_1^{\mathrm{NaiveMult}}(k) = k^2\,T_{M1} + 3(k-1)^2\,T_{A1} = \big(\eta k^2 + 3(k-1)^2\big)\,T_{A1}$$
$$T_1^{\mathrm{SequentialMult}}(k) = \frac{k(k+1)}{2}\,T_{M1} + \frac{11k^2 - 5k}{2}\,T_{A1} = \frac{\eta k^2 + 11k^2 + \eta k - 5k}{2}\,T_{A1}$$
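As a concrete illustration (numbers derived from the two formulas above, not measured results), take $\eta = 3$ and large $k$: the leading terms give $T_1^{\mathrm{NaiveMult}} \approx 6k^2\,T_{A1}$ versus $T_1^{\mathrm{SequentialMult}} \approx 7k^2\,T_{A1}$. For small $\eta$, the Karatsuba-based scheme thus performs slightly more total work; its appeal lies in its parallel structure rather than in its work count.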

At this step, we compare SequentialMult and M5Mult. First, we should mention that SingleCM is a special case of DoubleCM that has the advantage of cutting the number of multiplications in half. For chunk size $s$, the execution of SingleCM leads to the computation of $\frac{s(s-1)}{2}$ Karatsuba intermediate products, whereas the execution of DoubleCM results in computing $s^2$ Karatsuba intermediate products. With that in mind, we prove that, together, the calls to SingleCM and DoubleCM are computationally as expensive as Loop II:

$$T_1^{\mathrm{SingleCM}}(k) = \frac{s(s-1)}{2}\,(T_{M2} + T_{A3}) \qquad (5.6)$$
$$T_1^{\mathrm{DoubleCM}}(k) = s^2\,(T_{M2} + T_{A3}) \qquad (5.7)$$

Summing over the $\lambda$ calls to SingleCM and the $\frac{\lambda(\lambda-1)}{2}$ calls to DoubleCM:

$$\lambda\,T_1^{\mathrm{SingleCM}}(k) + \frac{\lambda(\lambda-1)}{2}\,T_1^{\mathrm{DoubleCM}}(k) = \frac{\lambda s(\lambda s - 1)}{2}\,(T_{M2} + T_{A3}) \qquad (5.8)$$

Note that $\lambda s = k$; then:

$$\lambda\,T_1^{\mathrm{SingleCM}}(k) + \frac{\lambda(\lambda-1)}{2}\,T_1^{\mathrm{DoubleCM}}(k) = \frac{k(k-1)}{2}\,(T_{M2} + T_{A3}) = T_1^{\mathrm{II}}(k) \qquad (5.9)$$

Algorithm 11 Algorithm for applying ℓhc-normalization on a vector $\vec{M}$ of 2k triple digits.
 1: input: Vector $\vec{M}$ of 2k triple digits.
 2: output: Large 2k-digit integer N as the result of ℓhc-normalization.
 3: procedure ConvertLHC($\vec{M}$, k)
 4:   N = (N_0, ..., N_{2k-1}) ← (0, ..., 0)  ▷ Vector of 2k digits.
 5:   L = (L_0, ..., L_{2k-1}) ← (M_0[0], M_1[0], ..., M_{2k-1}[0])  ▷ Vector of 2k digits.
 6:   H = (H_0, ..., H_{2k-1}) ← (M_0[1], M_1[1], ..., M_{2k-1}[1])  ▷ Vector of 2k digits.
 7:   C = (C_0, ..., C_{2k-1}) ← (M_0[2], M_1[2], ..., M_{2k-1}[2])  ▷ Vector of 2k digits.
 8:   for (i = 2k − 1; i ≥ 1; i = i − 1) do  ▷ Shift elements of H with stride of 1.
 9:     H_i ← H_{i−1}
10:   end for
11:   H_0 ← 0  ▷ Set the first element of H equal to zero.
12:   for (i = 2k − 1; i ≥ 2; i = i − 1) do  ▷ Shift elements of C with stride of 2.
13:     C_i ← C_{i−2}
14:   end for
15:   C_0 ← 0, C_1 ← 0  ▷ Set the first two elements of C equal to zero.
16:   carry ← 0
17:   for (i = 0; i < 2k; i = i + 1) do
18:     (N_i, t) ← L_i + H_i + C_i + carry
19:     carry ← t
20:   end for
21:   return N
22: end procedure
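Here is a minimal sequential C sketch of the ℓhc-normalization (a hedged illustration of Algorithm 11; rather than materializing and shifting the L, H, and C vectors, it reads the shifted entries directly, which is functionally equivalent):

    #include <stdint.h>

    /* Collapse a vector M of 2k triple digits [low, high, carry] into a
       2k-digit integer N: the high stream is shifted by one digit and the
       carry stream by two, then the three streams are added with carries. */
    void convert_lhc(const uint32_t (*M)[3], uint32_t *N, int k) {
        uint64_t carry = 0;
        for (int i = 0; i < 2 * k; i++) {
            uint64_t l = M[i][0];
            uint64_t h = (i >= 1) ? M[i - 1][1] : 0;  /* H shifted by 1 */
            uint64_t c = (i >= 2) ? M[i - 2][2] : 0;  /* C shifted by 2 */
            uint64_t t = l + h + c + carry;
            N[i] = (uint32_t)t;
            carry = t >> 32;
        }
        /* The final carry is zero, since the product of two k-digit
           integers fits in 2k digits. */
    }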

Therefore, the work estimates for both M5Mult and SequentialMult are equal:

$$T_1^{\mathrm{M5Mult}}(k) = T_1^{\mathrm{I}}(k) + \lambda\,T_1^{\mathrm{SingleCM}}(k) + \frac{\lambda(\lambda-1)}{2}\,T_1^{\mathrm{DoubleCM}}(k) + T_1^{\mathrm{III}}(k) = T_1^{\mathrm{I}}(k) + T_1^{\mathrm{II}}(k) + T_1^{\mathrm{III}}(k) = T_1^{\mathrm{SequentialMult}}(k) \qquad (5.10)$$

Recall that in the previous section we claimed that M5Mult is well-structured for parallel execution. Indeed, the algorithm leads to a low estimated span and, therefore, a high degree of parallelism:

$$T_\infty^{\mathrm{I}}(k) = T_{M1}, \qquad T_\infty^{\mathrm{SingleCM}}(k) = \frac{s(s-1)}{2}\,(T_{M2} + T_{A3}),$$
$$T_\infty^{\mathrm{DoubleCM}}(k) = \lambda s^2\,(T_{M2} + T_{A3}), \qquad T_\infty^{\mathrm{III}}(k) = T_{A3}$$

Algorithm 12 The algorithm for computing the product of chunk c from both A and B.
 1: input:
    - Vector $\vec{M}$ of 2k triple digits for storing results.
    - Two k-digit integers A and B stored as vectors $\vec{A}$ and $\vec{B}$.
    - Vector $\vec{Z}$ storing precomputed products.
    - Chunk index c and chunk size s.
 2: output:
    - Vector $\vec{M}$ of 2k triple digits with updated results.
 3: procedure SingleCM($\vec{M}$, $\vec{A}$, $\vec{B}$, $\vec{Z}$, c, s)
 4:   for (d = 1; d ≤ s/2; d = d + 1) do
 5:     for (i = 0; i < s/2; i = i + 1) do
 6:       for (t = 0; t < 2; t = t + 1) do
 7:         if d = s/2 and t = 1 then
 8:           continue
 9:         end if
10:         j ← i + d
11:         i′ ← i + t·(s/2)
12:         j′ ← j + t·(s/2)
13:         if s ≤ i′ or s ≤ j′ then
14:           [i′, j′] ← [j′, i′]  ▷ Swap the values of i′ and j′.
15:         end if
16:         i′ ← i′ + c·s
17:         j′ ← j′ + c·s
18:         T ← (A_{i′} − A_{j′})(B_{j′} − B_{i′}) + Z_{i′} + Z_{j′}  ▷ Computing Karatsuba intermediate product.
19:         M_{i′+j′} ← M_{i′+j′} + T  ▷ One triple-digit addition.
20:       end for
21:     end for
22:   end for
23:   return M
24: end procedure

Collectively, the span of M5Mult amounts to:

$$T_\infty^{\mathrm{M5Mult}}(k) = T_\infty^{\mathrm{I}}(k) + T_\infty^{\mathrm{SingleCM}}(k) + T_\infty^{\mathrm{DoubleCM}}(k) + T_\infty^{\mathrm{III}}(k)$$
$$= T_{M1} + \frac{s(s-1)}{2}\,(T_{M2} + T_{A3}) + \lambda s^2\,(T_{M2} + T_{A3}) + T_{A3}$$
$$= T_{M1} + \frac{(2\lambda+1)s^2 - s}{2}\,(T_{M2} + T_{A3}) + T_{A3}$$
$$= \Big(\frac{(2\lambda+1)s^2 - s}{2} + 1\Big)\,T_{M1} + \Big(11\,\frac{(2\lambda+1)s^2 - s}{2} + 3\Big)\,T_{A1}$$
$$= \Big(\frac{(\eta+11)(2\lambda s^2 + s^2 - s)}{2} + (\eta+3)\Big)\,T_{A1} = \Big(\frac{(\eta+11)(2ks + s^2 - s)}{2} + (\eta+3)\Big)\,T_{A1}$$
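Combining this span with the work estimate (5.5) (an observation derived from the two formulas, not an additional claim), the leading terms for large $k$ are $\frac{\eta+11}{2}\,k^2\,T_{A1}$ for the work and $(\eta+11)\,ks\,T_{A1}$ for the span, so the parallelism behaves like

$$\frac{T_1^{\mathrm{M5Mult}}(k)}{T_\infty^{\mathrm{M5Mult}}(k)} \approx \frac{k}{2s},$$

and a batch of $N$ independent products multiplies the available parallelism by a further factor of $N$.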

Algorithm 13 The algorithm for computing the product of chunks c and c′ from both A and B.
 1: input:
    - Vector $\vec{M}$ of 2k triple digits for storing results.
    - Two k-digit integers A and B stored as vectors $\vec{A}$ and $\vec{B}$.
    - Vector $\vec{Z}$ storing precomputed products.
    - Chunk indices c and c′, and chunk size s.
 2: output:
    - Vector $\vec{M}$ of 2k triple digits with updated results.
 3: procedure DoubleCM($\vec{M}$, $\vec{A}$, $\vec{B}$, $\vec{Z}$, c, c′, s)
 4:   for 0 ≤ i < s do
 5:     for 0 ≤ j < s do
 6:       i′ ← i + c·s
 7:       j′ ← j + c′·s
 8:       T ← (A_{i′} − A_{j′})(B_{j′} − B_{i′}) + Z_{i′} + Z_{j′}  ▷ Computing Karatsuba intermediate product.
 9:       M_{i′+j′} ← M_{i′+j′} + T  ▷ One triple-digit addition.
10:     end for
11:   end for
12:   return M
13: end procedure

To conclude, even though M5Mult and SequentialMult have the same cost in terms of arithmetic operations, M5Mult is well-structured to be parallelized and has a high degree of parallelism. As shown in Figure 5.1, for all values of $2^5 \le k \le 2^{14}$ and $1 \le \eta \le 5$, $T_1^{\mathrm{NaiveMult}}(k)$ is usually less than $T_1^{\mathrm{SequentialMult}}(k)$. Obviously, this makes M5Mult unsuitable for a sequential implementation. On the other hand, we expect that an efficient parallel implementation of M5Mult on GPUs will lead to a high degree of parallelism. In fact, we can estimate the degree of parallelism by using the complexity estimates of the previous section for different values of $s$, $k$, and $\eta$. For example, Figure 5.2 presents the estimated degree of parallelism for various values of $k$ with $s = 32$ and $\eta = 5$.

5.7 Experimentation

In this section, we present our experimental results studying the performance of M5Mult. First, we compare the performance of our GPU implementation of M5Mult against another GPU implementation of the schoolbook multiplication algorithm. The work in [90] is, to our knowledge, the only other parallelization of schoolbook multiplication; the CUDA implementation of this algorithm, which we will refer to as OptMult, is part of the CUMODP library [90].

Figure 5.1: Estimated running-time ratio for various values of k and η.

Figure 5.2: Estimated degree of parallelism for various values of $k$ with $s = 32$ and $\eta = 5$.

Figure 5.3 presents the average running time (in milliseconds) for computing batches of $k = 256$-digit integer multiplications using OptMult and M5Mult; Figure 5.4 presents the ratio diagram for the same comparison. To gain a better understanding of the performance of M5Mult and OptMult, we have profiled the two implementations on an NVIDIA GTX1080Ti for multiplying batches of $N = 1024$ integers of size $k = 256$ digits. Tables 5.4 and 5.5 present the performance counters for M5Mult and OptMult, respectively.

Figure 5.3: The average running time (in milliseconds) for computing batches of $k = 256$-digit integer multiplications using OptMult and M5Mult.

Table 5.4: Profiling results for computing a batch of $N = 2^{10}$ integers of size $k = 256$ digits using M5Mult with $s = 256$ ($\lambda = k/s = 1$), collected on an NVIDIA GTX1080Ti.

Metric Name                 | Loop I kernel (computing the Z_i's) | SingleCM kernel | Loop III kernel (adding the Z_i's)
Achieved Occupancy          | 77 %      | 49 %       | 80 %
DRAM Read Throughput        | 163 GB/s  | 4.51 GB/s  | 175 GB/s
DRAM Write Throughput       | 149 GB/s  | 6.41 GB/s  | 141 GB/s
Shared Efficiency           | 0.00 %    | 64.86 %    | 0.00 %
DRAM Utilization            | High      | Low        | High
Branch Efficiency           | 100.00 %  | 95.94 %    | 100.00 %
IPC                         | 0.75      | 2.68       | 0.189
Instruction Replay Overhead | 0.016     | 0.000073   | 0.007
Shared Load Throughput      | 0.00 B/s  | 1697 GB/s  | 0.00 B/s
Shared Store Throughput     | 0.00 B/s  | 472 GB/s   | 0.00 B/s

(All values are the maxima reported by the profiler.)

Alternatively, we offer a pragmatic but unfair comparison between our GPU implementation and the highly optimized CPU implementation of the GMP library [9].

Figure 5.4: Comparing the ratio of the average running time of OptMult to the average running time of M5Mult for computing batches of $k = 256$-digit integer multiplications.

Table 5.5: Profiling results for computing a batch of $N = 2^{10}$ integers of size $k = 256$ digits using OptMult from the CUMODP library, collected on an NVIDIA GTX1080Ti.

Metric Name                 | Max
Achieved Occupancy          | 68 %
DRAM Read Throughput        | 504.23 MB/s
DRAM Write Throughput       | 2.61 MB/s
Shared Efficiency           | 37.20 %
DRAM Utilization            | Low
Branch Efficiency           | 86.71 %
IPC                         | 0.62
Instruction Replay Overhead | 0.000586
Shared Load Throughput      | 173 GB/s
Shared Store Throughput     | 73 GB/s

Our current implementation shows a significant slowdown in comparison with GMP for integers larger than 1024 machine words. However, for values of $k$ between 32 and 1024 machine words, it demonstrates promising speedup ratios. For example, Figures 5.5 and 5.6 present the ratio of the running time of GMP to that of M5Mult for computing batches of $N$ integer multiplications with $32 \le k \le 1024$. The results for GMP and M5Mult were collected on an Intel i7-7700K and an NVIDIA GTX1080Ti, respectively. As demonstrated in Figures 5.5 and 5.6, for $32 \le k \le 1024$, M5Mult gains speedup as the size of a batch increases. On the other hand, the performance of M5Mult drops as $k$ grows and gets closer to 1024.

Figure 5.5: Comparing the ratio of the running time of GMP to the running time of M5Mult for computing batches of N integer multiplications of 32 ≤ k ≤ 128 digits.

Figure 5.6: Comparing the ratio of the running time of GMP to the running time of M5Mult for computing batches of N integer multiplications of 256 ≤ k ≤ 1024 digits.

5.8 Discussion

As future work, we would like to apply this algorithm to carrying out the component-wise multiplication of twiddle factors when computing FFTs over large prime fields. This can further improve the performance of parallel implementations of FFTs over large prime fields.

6 Conclusion

Our results in Chapter 2 show the advantage of the big prime field approach. To be precise, for a range of vector sizes, one can find a suitable large prime modulus for which FFTs outperform the CRT-based approach.

In Chapter 3, we have presented an implementation of fast Fourier transforms over generalized Fermat prime fields on multi-threaded processors. Our parallel implementations, using both specialized arithmetic and integer arithmetic from the GMP library, achieve nearly linear parallel speedup. We noticed that the parallelization of our specialized implementation is slightly more successful than that of our GMP-based implementation. We attribute this higher performance to the reduced number of arithmetic instructions due to the specialized arithmetic, minimal memory usage, and unrolling base-case DFTs with hard-coded constants. More precisely, our results show that developing specialized arithmetic (e.g., Montgomery multiplication, Barrett reduction, the cyclic shift introduced in Section 3.2, and the use of inline assembly) can be beneficial. Doing so reduces overhead compared to a more generic implementation, such as the large integer arithmetic functions available in GMP or in other libraries built on top of GMP.

In Chapter 4, we have presented the KLARAPTOR tool for determining optimal CUDA thread block configurations for a target architecture, in a way which is adaptive to each kernel invocation and its input data, allowing for dynamic, data-dependent, and portable performance. This tool is based upon our technique of encoding a performance prediction model as a rational program. The process of constructing such a rational program is a fast and automatic compile-time process which occurs simultaneously with compiling the CUDA program by use of the LLVM Pass framework. Our tool was tested using the kernels of the Polybench/GPU benchmark suite.

Finally, in Chapter 5, we have presented M5Mult, which is well-structured to be parallelized and has a reasonably good degree of parallelism. Obviously, M5Mult is unsuitable for a sequential implementation. We have reported our experimental results comparing the performance of our GPU implementation of M5Mult against another GPU-based implementation of schoolbook multiplication, as well as against the integer multiplication of the GMP library, which targets CPUs.

Bibliography

[1] L. Chen, S. Covanov, D. Mohajerani, and M. Moreno Maza, "Big prime field FFT on the GPU," in Proceedings of the 2017 ACM on International Symposium on Symbolic and Algebraic Computation, ISSAC 2017, Kaiserslautern, Germany, July 25-28, 2017, pp. 85–92, 2017.

[2] S. Covanov, D. Mohajerani, M. Moreno Maza, and L. Wang, "Big prime field FFT on multi-core processors," in Proceedings of the 2019 on International Symposium on Symbolic and Algebraic Computation, ISSAC 2019, Beijing, China, July 15-18, 2019 (J. H. Davenport, D. Wang, M. Kauers, and R. J. Bradford, eds.), pp. 106–113, ACM, 2019.

[3] A. Brandt, D. Mohajerani, M. M. Maza, J. Paudel, and L. Wang, "KLARAPTOR: A tool for dynamically finding optimal kernel launch parameters targeting CUDA programs," CoRR, vol. abs/1911.02373, 2019.

[4] NVIDIA Corporation, "Profiler User's Guide." https://docs.nvidia.com/cuda/profiler-users-guide/index.html.

[5] M. Püschel, J. M. F. Moura, J. Johnson, D. Padua, M. Veloso, B. Singer, J. Xiong, F. Franchetti, A. Gacic, Y. Voronenko, K. Chen, R. W. Johnson, and N. Rizzolo, "SPIRAL: Code generation for DSP transforms," Proceedings of the IEEE, special issue on "Program Generation, Optimization, and Adaptation", vol. 93, no. 2, pp. 232–275, 2005.

[6] M. Frigo and S. G. Johnson, "FFTW: an adaptive software architecture for the FFT," in IEEE ICASSP '98, Seattle, Washington, USA, May 12-15, 1998, pp. 1381–1384, 1998.

[7] M. Frigo and S. G. Johnson, “The design and implementation of FFTW3,” in Proceedings of the IEEE, vol. 93, pp. 216–231, 2005.

[8] R. C. Whaley and J. Dongarra, "Automatically tuned linear algebra software," in Proceedings of the 1998 IEEE/ACM Conference on Supercomputing, 1998.

[9] T. Granlund and the GMP development team, GNU MP: The GNU Multiple Precision Arithmetic Library, 5.0.5 ed., 2012. http://gmplib.org/.

[10] W. B. Hart, "Fast Library for Number Theory: An Introduction," in Proceedings of the Third International Congress on Mathematical Software, ICMS'10, (Kobe, Japan), pp. 88–91, Springer-Verlag, Berlin, Heidelberg, 2010. http://flintlib.org.

[11] V. Shoup, “Number theory library (NTL) for C++,” Available at Shoup’s homepage http://www.shoup.net/ntl/, 2016.

[12] E. A. Arnold, “Modular algorithms for computing Gröbner bases,” J. Symb. Comput., vol. 35, no. 4, pp. 403–419, 2003.

[13] X. Dahan, M. Moreno Maza, É. Schost, W. Wu, and Y. Xie, "Lifting techniques for triangular decompositions," in ISSAC 2005, Proceedings (M. Kauers, ed.), pp. 108–115, ACM, 2005.

[14] M. Moreno Maza and W. Pan, "Fast polynomial arithmetic on a GPU," J. of Physics: Conference Series, vol. 256, 2010.

[15] M. Moreno Maza and W. Pan, "Solving bivariate polynomial systems on a GPU," J. of Physics: Conference Series, vol. 341, 2011.

[16] R. Agarwal and C. Burrus, "Fast convolution using Fermat number transforms with applications to digital filtering," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 22, no. 2, pp. 87–97, 1974.

[17] V. S. Dimitrov, T. V. Cooklev, and B. D. Donevsky, "Generalized Fermat-Mersenne number theoretic transform," IEEE Transactions on Circuits and Systems II: Analog and Digital Signal Processing, vol. 41, pp. 133–139, Feb 1994.

[18] S. Covanov and E. Thomé, “Fast arithmetic for faster integer multiplication,” CoRR, vol. abs/1502.02800, 2015.

[19] M. Fürer, "Faster integer multiplication," in Proceedings of the 39th Annual ACM Symposium on Theory of Computing, San Diego, California, USA, June 11-13, 2007 (D. S. Johnson and U. Feige, eds.), pp. 57–66, ACM, 2007.

[20] J. von zur Gathen and J. Gerhard, Modern Computer Algebra. New York, NY, USA: Cambridge University Press, 2 ed., 2003.

[21] M. Fürer, “Faster integer multiplication,” SIAM J. Comput., vol. 39, no. 3, pp. 979–1005, 2009.

[22] M. Frigo, C. E. Leiserson, H. Prokop, and S. Ramachandran, "Cache-oblivious algorithms," ACM Trans. Algorithms, vol. 8, no. 1, pp. 4:1–4:22, 2012.

[23] A. Schönhage and V. Strassen, “Schnelle multiplikation großer zahlen,” Computing, vol. 7, no. 3-4, pp. 281–292, 1971.

[24] D. Hankerson, A. J. Menezes, and S. Vanstone, Guide to Elliptic Curve Cryptography. Springer-Verlag New York, Inc., 2004.

[25] P. L. Montgomery, "Modular multiplication without trial division," Mathematics of Computation, vol. 44, no. 170, pp. 519–521, 1985.

[26] X. Li, M. Moreno Maza, R. Rasheed, and É. Schost, "The modpn library: Bringing fast polynomial arithmetic into Maple," J. Symb. Comput., vol. 46, no. 7, pp. 841–858, 2011.

[27] C. Chen, S. Covanov, F. Mansouri, M. Moreno Maza, N. Xie, and Y. Xie, "The basic polynomial algebra subprograms," in Hong and Yap [91], pp. 669–676.

[28] S. A. Haque, X. Li, F. Mansouri, M. Moreno Maza, W. Pan, and N. Xie, "Dense arithmetic over finite fields with the CUMODP library," in Hong and Yap [91], pp. 725–732.

[29] C. Chen, S. Covanov, F. Mansouri, M. Moreno Maza, N. Xie, and Y. Xie, “Parallel integer polynomial multiplication,” in SYNASC 2016, pp. 72–80, 2016.

[30] J. Cooley and J. Tukey, "An algorithm for the machine calculation of complex Fourier series," Mathematics of Computation, vol. 19, no. 90, pp. 297–301, 1965.

[31] NVIDIA Corporation, "CUDA C Programming Guide." https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html.

[32] D. Mohajerani, "Fast Fourier transforms over prime fields of large characteristic and their implementation on graphics processing units," Master's thesis, The University of Western Ontario, London, ON, Canada, 2016. http://ir.lib.uwo.ca/cgi/viewcontent.cgi?article=6094&context=etd.

[33] F. Franchetti, Y. Voronenko, and M. Püschel, "Tools and techniques for performance - FFT program generation for shared memory: SMP and multicore," in Proceedings of the ACM/IEEE SC2006 Conference on High Performance Networking and Computing, November 11-17, 2006, Tampa, FL, USA, p. 115, 2006.

[34] A. Ali, L. Johnsson, and J. Subhlok, "Scheduling FFT computation on SMP and multicore systems," in Proceedings of the 21st Annual International Conference on Supercomputing, ICS '07, (New York, NY, USA), pp. 293–301, ACM, 2007.

[35] L. Meng, Y. Voronenko, J. R. Johnson, M. Moreno Maza, F. Franchetti, and Y. Xie, “Spiral- generated modular FFT algorithms,” in PASCO (M. Moreno Maza and J. Roch, eds.), pp. 169–170, ACM, 2010.

[36] M. Moreno Maza and Y. Xie, “FFT-based dense polynomial arithmetic on multi-cores,” in HPCS, vol. 5976 of Lecture Notes in Computer Science, pp. 378–399, Springer, 2009.

[37] A. De, P. P. Kurur, C. Saha, and R. Saptharishi, “Fast integer multiplication using modular arithmetic,” in STOC, pp. 499–506, 2008.

[38] A. De, P. P. Kurur, C. Saha, and R. Saptharishi, "Fast integer multiplication using modular arithmetic," SIAM J. Comput., vol. 42, no. 2, pp. 685–699, 2013.

[39] D. Harvey, J. van der Hoeven, and G. Lecerf, "Even faster integer multiplication," J. Complexity, vol. 36, pp. 1–30, 2016.

[40] D. Harvey, J. v. d. Hoeven, and G. Lecerf, "Faster polynomial multiplication over finite fields," J. ACM, vol. 63, pp. 52:1–52:23, Jan. 2017.

[41] D. Harvey, J. van der Hoeven, and G. Lecerf, "Fast polynomial multiplication over F_2^60," in Proceedings of the ACM on International Symposium on Symbolic and Algebraic Computation, ISSAC '16, (New York, NY, USA), pp. 255–262, ACM, 2016.

[42] S. Covanov and E. Thomé, “Fast integer multiplication using generalized Fermat primes,” Mathematics of Computation, 2018.

[43] M. Asadi, A. Brandt, C. Chen, S. Covanov, F. Mansouri, D. Mohajerani, R. Moir, M. Moreno Maza, L. Wang, N. Xie, and Y. Xie, “Basic Polynomial Algebra Subprograms (BPAS),” 2019. http://www.bpaslib.org.

[44] J. von zur Gathen and J. Gerhard, Modern Computer Algebra (3. ed.). Cambridge University Press, 2013.

[45] P. L. Montgomery, “Modular multiplication without trial division,” Mathematics of com- putation, vol. 44, no. 170, pp. 519–521, 1985.

[46] S. Covanov, “Putting Fürer Algorithm into practice,” tech. rep., ORCCA Lab, London, 2014.

[47] P. Barrett, "Implementing the Rivest Shamir and Adleman public key encryption algorithm on a standard digital signal processor," in Conference on the Theory and Application of Cryptographic Techniques, pp. 311–323, Springer, 1986.

[48] F. Franchetti and M. Püschel, "FFT (fast Fourier transform)," in Encyclopedia of Parallel Computing, pp. 658–671, 2011.

[49] Y. Torres, A. González-Escribano, and D. R. Llanos, “ubench: exposing the impact of CUDA block geometry in terms of performance,” The Journal of Supercomputing, vol. 65, no. 3, pp. 1150–1163, 2013.

[50] S. Grauer-Gray and L.-N. Pouchet, "Implementation of PolyBench codes for GPU processing," 2012. http://web.cse.ohio-state.edu/~pouchet.2/software/polybench/GPU/.

[51] L. J. Stockmeyer and U. Vishkin, “Simulation of parallel random access machines by circuits,” SIAM J. Comput., vol. 13, no. 2, pp. 409–422, 1984.

[52] P. B. Gibbons, “A more practical PRAM model,” in Proc. of SPAA, pp. 158–168, 1989.

[53] L. Ma, K. Agrawal, and R. D. Chamberlain, "A memory access model for highly-threaded many-core architectures," Future Generation Comp. Syst., vol. 30, pp. 202–215, 2014.

[54] S. A. Haque, M. Moreno Maza, and N. Xie, “A many-core machine model for designing algorithms with minimum parallelism overheads,” in Parallel Computing: On the Road to Exascale, Proceedings of the International Conference on Parallel Computing, ParCo 2015, vol. 27 of Advances in Parallel Computing, pp. 35–44, IOS Press, 2015.

[55] S. Hong and H. Kim, “An analytical model for a GPU architecture with memory-level and thread-level parallelism awareness,” in 36th ISCA 2009, pp. 152–163, 2009.

[56] J. Sim, A. Dasgupta, H. Kim, and R. W. Vuduc, "A performance analysis framework for identifying potential benefits in GPGPU applications," in PPOPP 2012, New Orleans, LA, USA, February 25-29, 2012.

[57] M. Baskaran, U. Bondhugula, S. Krishnamoorthy, J. Ramanujam, A. Rountev, and P. Sadayappan, "A compiler framework for optimization of affine loop nests for GPGPUs," in ICS 2008, Island of Kos, Greece, June 7-12, 2008, pp. 225–234, 2008.

[58] S. Grauer-Gray, L. Xu, R. Searles, S. Ayalasomayajula, and J. Cavazos, “Auto-tuning a high-level language targeted to GPU codes,” in Proc. InPar, pp. 1–10, IEEE, 2012.

[59] M. Khan, P. Basu, G. Rudy, M. Hall, C. Chen, and J. Chame, “A script-based autotuning compiler system to generate high-performance CUDA code,” ACM Trans. Archit. Code Optim., vol. 9, no. 4, 2013.

[60] K. Sato, H. Takizawa, K. Komatsu, and H. Kobayashi, "Automatic tuning of CUDA execution parameters for stencil processing," in Software Automatic Tuning, From Concepts to State-of-the-Art Results, 2010.

[61] J. Kurzak, Y. Tsai, M. Gates, A. Abdelfattah, and J. J. Dongarra, "Massively parallel automated software tuning," in ICPP 2019, Kyoto, Japan, 2019, pp. 92:1–92:10, 2019.

[62] T. Kistler and M. Franz, "Continuous program optimization: A case study," ACM Transactions on Programming Languages and Systems (TOPLAS), vol. 25, no. 4, pp. 500–548, 2003.

[63] C. Song, L.-P. Wang, and T. J. Martínez, "Automated code engine for graphical processing units: Application to the effective core potential integrals and gradients," Journal of Chemical Theory and Computation, vol. 12, no. 1, pp. 92–106, 2015.

[64] M. Püschel, J. Moura, B. Singer, J. Xiong, J. Johnson, D. Padua, M. Veloso, and R. Johnson, "Spiral: A generator for platform-adapted libraries of signal processing algorithms," IJHPCA, vol. 18, no. 1, pp. 21–45, 2004.

[65] C. Chen, X. Chen, A. Keita, M. Moreno Maza, and N. Xie, "MetaFork: A compilation framework for concurrency models targeting hardware accelerators and its application to the generation of parametric CUDA kernels," in Proceedings of CASCON 2015, pp. 70–79, 2015.

[66] Y. Liu, E. Z. Zhang, and X. Shen, “A cross-input adaptive framework for GPU program optimizations,” in 2009 IEEE International Symposium on Parallel & Distributed Processing, pp. 1–10, IEEE, 2009.

[67] R. V. Lim, B. Norris, and A. D. Malony, "Autotuning GPU kernels via static and predictive analysis," in ICPP 2017, pp. 523–532, 2017.

[68] J. D. Garvey and T. S. Abdelrahman, "Automatic performance tuning of stencil computations on GPUs," in ICPP 2015, pp. 300–309, 2015.

[69] A. V. Aho, R. Sethi, and J. D. Ullman, Compilers: Principles, Techniques, and Tools. Addison-Wesley Longman Publishing Co., Inc., 1986.

[70] M. H. Stone, "The generalized Weierstrass approximation theorem," Mathematics Magazine, vol. 21, no. 5, pp. 237–254, 1948.

[71] NVIDIA Corporation, "CUDA C Best Practices Guide." https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html.

[72] V. Volkov, Understanding Latency Hiding on GPUs. PhD thesis, EECS Department, University of California, Berkeley, Aug 2016.

[73] NVIDIA Corporation, "CUDA Occupancy Calculator." https://docs.nvidia.com/cuda/cuda-occupancy-calculator/index.html.

[74] NVIDIA Corporation, "CUDA runtime API." http://docs.nvidia.com/cuda/pdf/CUDA_Runtime_API.pdf.

[75] X. Mei and X. Chu, “Dissecting GPU memory hierarchy through microbenchmarking,” IEEE Trans. Parallel Distrib. Syst., vol. 28, no. 1, pp. 72–86, 2017.

[76] H. Wong, M. Papadopoulou, M. Sadooghi-Alvandi, and A. Moshovos, “Demystifying GPU microarchitecture through microbenchmarking,” in IEEE ISPASS 2010, 28-30 March 2010, White Plains, NY, USA, pp. 235–246, IEEE Computer Society, 2010.

[77] J. E. Gentle, W. K. Härdle, and Y. Mori, Handbook of computational statistics: concepts and methods. Springer Science & Business Media, 2012.

[78] R. Corless and N. Fillion, A graduate introduction to numerical methods. Springer, 2013.

[79] E. Anderson, Z. Bai, C. Bischof, S. Blackford, J. Demmel, J. Dongarra, J. Du Croz, A. Greenbaum, S. Hammarling, A. McKenney, and D. Sorensen, LAPACK Users’ Guide. Philadelphia, PA: SIAM, 3rd ed., 1999.

[80] V. Volkov, "A microbenchmark to study GPU performance models," in Proc. PPoPP, pp. 421–422, 2018.

[81] S. S. Baghsorkhi, M. Delahaye, S. J. Patel, W. D. Gropp, and W.-m. W. Hwu, “An adaptive performance modeling tool for GPU architectures,” PPoPP ’10, pp. 105–114, ACM, 2010.

[82] A. A. Karatsuba and Y. P. Ofman, "Multiplication of many-digital numbers by automatic computers," in Doklady Akademii Nauk, vol. 145, pp. 293–294, Russian Academy of Sciences, 1962.

[83] D. E. Knuth, The art of computer programming, Volume II: Seminumerical Algorithms, 3rd Edition. Addison-Wesley, 1998.

[84] D. G. Cantor and E. Kaltofen, "On fast multiplication of polynomials over arbitrary algebras," Acta Informatica, vol. 28, no. 7, pp. 693–701, 1991.

[85] D. Harvey and J. van der Hoeven, "Faster integer multiplication using short lattice vectors," CoRR, vol. abs/1802.07932, 2018.

[86] S. Covanov and E. Thomé, "Fast integer multiplication using generalized Fermat primes," Math. Comput., vol. 88, no. 317, pp. 1449–1477, 2019.

[87] D. Harvey and J. Van Der Hoeven, "Integer multiplication in time O(n log n)," 2019.

[88] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein, Introduction to Algorithms, 3rd Edition. MIT Press, 2009.

[89] W. D. Hillis and G. L. S. Jr., “Data parallel algorithms,” Commun. ACM, vol. 29, no. 12, pp. 1170–1183, 1986.

[90] S. A. Haque and M. Moreno Maza, "Plain polynomial arithmetic on GPU," in Journal of Physics: Conference Series, vol. 385, p. 012014, IOP Publishing, 2012.

[91] H. Hong and C. Yap, eds., Mathematical Software - ICMS 2014 - 4th International Congress, Seoul, South Korea, August 5-9, 2014. Proceedings, vol. 8592 of Lecture Notes in Computer Science, Springer, 2014.

Curriculum Vitae

Name: Davood Mohajerani

Education: 2017-2021 University of Western Ontario London, Ontario, Canada Ph.D. in Computer Science

2015-2016 University of Western Ontario London, Ontario, Canada M.Sc. in Computer Science

2010-2015 Isfahan University of Technology Isfahan, Iran B.Sc. in Computer Engineering
