DEGREE PROJECT IN TECHNOLOGY, FIRST CYCLE, 15 CREDITS
STOCKHOLM, SWEDEN 2017

A Comparison of Performance Between a CPU and a GPU on Prime Factorization Using Eratosthenes' Sieve and Trial Division

CAROLINE W. BORG

ERIK DACKEBRO

KTH ROYAL INSTITUTE OF TECHNOLOGY
SCHOOL OF COMPUTER SCIENCE AND COMMUNICATION

A Comparison of Performance Between a CPU and a GPU on Prime Factorization Using Eratosthenes' Sieve and Trial Division

CAROLINE W. BORG
ERIK DACKEBRO

Bachelor in Computer Science
Date: June 11, 2017
Supervisor: Mårten Björkman
Examiner: Örjan Ekeberg
Swedish title: En jämförelse i prestanda mellan en CPU och en GPU avseende primtalsfaktorisering med hjälp av såll och Försöksdivision
School of Computer Science and Communication


Abstract

There has been remarkable advancement in multi-core processing units over the past decade. GPUs, originally designed as specialized graphics processors, are today used in a wide variety of other areas. Their ability to solve parallel problems is unmatched due to their massive number of simultaneously running cores. Despite this, most algorithms in use today are still fully sequential and do not utilize the processing power available. The Sieve of Eratosthenes and Trial Division are two very naive algorithms which combined can be used to find a number's unique combination of prime factors. This paper sought to compare the performance of a CPU and a GPU when tasked with prime factorization on seven different data sets. Five different programs were created: two running solely on the CPU, two running solely on the GPU and one program utilizing both. Each data set was presented multiple times to each program in different sizes ranging from one to half a million. The result was uniform in that the CPU greatly outperformed the GPU in every test case for this specific implementation.

Sammanfattning

Flerkärniga processorer har under det senaste årtiondet utvecklats markant. Grafikkorten, designade för att excellera i grafiktunga beräkningar, används idag inom flera andra områden. Kortens förmåga att lösa paralleliserbara problem är oöverträffad tack vare deras massiva antal kärnor. Trots detta är majoriteten av algoritmer idag fortfarande helt sekventiella. Eratosthenes såll och Försöksdivision är två väldigt naiva algoritmer som tillsammans kan användas för att finna ett tals unika uppsättning primtalsfaktorer. Det här arbetet strävade efter att jämföra prestandan mellan en CPU och en GPU vad gäller uppgiften att faktorisera tal från sju olika uppsättningar data. Fem implementationer skrevs, varav två var begränsade till CPU:ns processorkraft, två begränsade till GPU:ns processorkraft och en som utnyttjade båda. Varje uppsättning data förekom i olika storlekar i omfånget ental till en halv miljon. Resultatet var entydigt på så sätt att CPU:n markant överträffade GPU:n i samtliga testfall.

Contents

1 Introduction
  1.1 Purpose
  1.2 Problem Statement
    1.2.1 Constraints

2 Background
  2.1 Prime Numbers
  2.2 Finding Primes
    2.2.1 Sieve
      2.2.1.1 Sieve of Eratosthenes
  2.3 Factorization
    2.3.1 Trial Division
  2.4 CUDA

3 Method
  3.1 Approach
  3.2 Algorithms
  3.3 Implementations
    3.3.1 Sequential Algorithm with Sieve
    3.3.2 Sequential Algorithm without Sieve
    3.3.3 Parallel Algorithm with Sieve
    3.3.4 Parallel Algorithm without Sieve
    3.3.5 Algorithm with Sieve on CPU and Trial Division on GPU
  3.4 Data
  3.5 Hardware
    3.5.1 Central Processing Unit - CPU
    3.5.2 Graphical Processing Unit - GPU
  3.6 Compilation Tools and Runtime Environments

4 Results
  4.1 Small Prime Numbers
  4.2 Large Numbers that are Products of Many Small Prime Numbers
  4.3 Large Prime Numbers
  4.4 Numbers that are Products of Two Larger Prime Numbers
  4.5 Small Random Numbers
  4.6 Large Random Numbers
  4.7 Mixed Small and Large Random Numbers
  4.8 Baseline Test

5 Discussion
  5.1 The Parallel Baseline Cost
  5.2 The Growth of Algorithm D

6 Conclusion
  6.1 Future Work

Bibliography

Chapter 1

Introduction

Multi-core processing units are nowadays considered commonplace in everyday computers. Despite this, relatively little software fully utilizes their power. A reason for this could be that many algorithms in use today are either old algorithms themselves or closely related to older algorithms. Multi-core units are a fairly recent phenomenon, with the first dual-core processor introduced as late as 2001. Algorithms, on the other hand, have been developed and used for thousands of years. These algorithms, as victims of their time, are typically very sequential in nature. There exist many algorithms for determining whether a number is prime or not. Factorization, more specifically prime factorization, is however a more difficult problem, and although there exist some algorithms more efficient than exhaustive search, no currently available algorithm does better than the complexity class SUBEXPTIME [1].

1.1 Purpose

The purpose of this report is to investigate whether simpler tasks, like prime factorization, can be parallelized on GPUs with any gain in performance. It is interesting to find out if it is possible to improve performance by exchanging a more advanced and faster core for a great number of less advanced, slower cores, especially since this can sometimes be achieved without making major changes to the code.


1.2 Problem Statement

This report investigates the usage of GPUs compared to CPUs for the prime factorization of seven different data sets of numbers ranging from the tens up to a couple of trillions. The prime factorization is to be performed using the rather simple approach of Trial Division. Prime factorization by Trial Division does not necessarily require precalculated primes as input, but the exclusion of composite numbers should prove beneficial. These primes can be found in many ways, one of the more straightforward being the Sieve of Eratosthenes. Both the sieve and the actual Trial Division can be parallelized and the results then calculated on multiple cores. The parallelization could in theory give a performance gain if implemented correctly. The theory is that if enough threads run in parallel, they could make up for the difference in clock speed between a CPU and a GPU. This performance gain could grow even further when executed on a GPU, with its specialization in massive parallelism over simple calculations. This leads up to the research question:

Using the naive methods, Sieve of Eratosthenes and Trial Division, how does a parallel implementation running on a GPU fare against a sequential implementation running on a CPU?

1.2.1 Constraints

The report is limited by the following constraints.

• The integers that we will try to factorize will be smaller than 1.455e+13. The number is a middle ground between integer size and bit length. For comparison, 2^44 is approximately 1.759e+13.

• The data set is limited to seven different types of test data, each with its own way of testing the implementations. See 3.4.

• The length of each test is limited by an upper bound. This limit is 500 000 for all test cases except for 4.4, which has a limit of 1 000.

Chapter 2

Background

This chapter introduces the reader to concepts necessary to understand this report. Two major concepts, factorization and prime numbers, are introduced and then built upon with some background on finding primes. The latter includes general theory about sieves and an introduction to the two algorithms used in this paper, namely the Sieve of Eratosthenes and Trial Division. The background concludes with a description of CUDA.

2.1 Prime Numbers

The fundamental theorem of arithmetic states that every integer larger than one is either a prime itself or the product of a unique combination of primes. A prime is a positive integer which has exactly two positive divisors, one and itself [2]. Prime number theory has been around since at least 300 B.C., when Euclid described the greatest common divisor and the least common multiple [3], both of which are based on prime numbers. In the same time period Eratosthenes described a method for calculating all prime numbers up to a user-chosen upper bound, called the Sieve of Eratosthenes [4]. After the ancient Greeks, little happened in the prime number field until the 17th century, when Pierre de Fermat presented Fermat's little theorem [5], a well-known theorem underlying later primality tests. Later in the 18th and 19th centuries, more progress was accomplished in the prime number field by mathematicians such as Leonhard Euler [6], Édouard Lucas and Derrick Henry Lehmer [7]. Until the 1970s, prime numbers were believed to have little practical application


outside of the mathematical field. However, as first predicted by Jevons [8], large primes can be used in a cryptographic system to perform a one-way encryption. This is a key concept of an important and common operation in today's computer systems, namely asymmetric encryption. One of the better known asymmetric protocols is RSA, which is based on the multiplication of two large primes [9]. The largest factored RSA number to date is 768 bits long, far beyond the scope of this report [10]. Although the public and private keys in modern systems are often reused [11], the setup part of the algorithm relies on finding these two large primes to multiply together. For this part of the algorithm to have decent performance, the two prime numbers need to be found within a reasonable time.

2.2 Finding Primes

There are multiple ways of finding large prime numbers. The most common method is not to calculate them, but instead to select a number of the desired length and then perform a primality test. A new number, close to the previous one, can be chosen if the former is proven composite. Initial testing can also be used to ensure that the selected number does not contain any small prime factors. The above is however an inefficient method for finding smaller primes, up to a few million. The easiest and most efficient known way of finding primes in these ranges is to compute them, e.g. using one of many available sieves.
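To make the pick-and-test method concrete, the following is a minimal host-side sketch. It assumes trial division by a handful of small primes followed by a base-2 Fermat test as the primality check; the report does not prescribe a specific test, so the function names and the choice of test are illustrative only.

    // Hypothetical sketch of the "pick a number, then test" method
    // described above. The Fermat base-2 test is an assumption made for
    // illustration; any primality test could be substituted.
    #include <stdbool.h>
    #include <stdint.h>

    // Modular multiplication via a 128-bit intermediate (GCC/Clang extension).
    static uint64_t mulmod(uint64_t a, uint64_t b, uint64_t m) {
        return (uint64_t)((__uint128_t)a * b % m);
    }

    // Square-and-multiply modular exponentiation.
    static uint64_t powmod(uint64_t base, uint64_t exp, uint64_t m) {
        uint64_t result = 1;
        base %= m;
        while (exp > 0) {
            if (exp & 1) result = mulmod(result, base, m);
            base = mulmod(base, base, m);
            exp >>= 1;
        }
        return result;
    }

    // Probable-prime check: trial division by small primes first, then
    // Fermat's little theorem with base 2 (2^(n-1) ≡ 1 (mod n) if n is prime).
    static bool is_probable_prime(uint64_t n) {
        static const uint64_t small[] = {2, 3, 5, 7, 11, 13, 17, 19, 23};
        for (int i = 0; i < 9; i++) {
            if (n == small[i]) return true;
            if (n % small[i] == 0) return false;
        }
        if (n < 2) return false;
        return powmod(2, n - 1, n) == 1;
    }

    // Select a candidate of the desired magnitude and step through odd
    // numbers until a probable prime is found.
    uint64_t find_probable_prime(uint64_t start) {
        uint64_t n = start | 1;
        while (!is_probable_prime(n)) n += 2;
        return n;
    }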

2.2.1 Sieve

A sieve is traditionally a method, or tool, with the purpose of separating wanted elements from unwanted ones. Sieve methods have seen some usage in modern number theory, where the study of such methods is called sieve theory. They have been used to prove results such as Brun's theorem [12] but are probably more commonly associated, by outsiders, with the Sieve of Eratosthenes.

2.2.1.1 Sieve of Eratosthenes

Eratosthenes of Cyrene presented this method of finding primes more than two thousand years ago and it is still, though in a more refined

form, used today in number theory research [13]. It is a beautifully simplistic algorithm utilizing a very basic pattern to iteratively cross off integers from a set of potential primes. Its initial condition is the set of all numbers from two to a finite n, where each element has three possible states: it can either be a prime, a prime candidate or a composite number. Initially, all elements are considered prime candidates. Each iteration begins by setting the state of the lowest prime candidate to prime and then iteratively setting all other multiples of that element to composite. Eventually it will reach a state where no more prime candidates exist. All elements will then be either a definite prime or a composite number. For example, sieving up to n = 10 first marks 2 prime and 4, 6, 8, 10 composite, then marks 3 prime and 6, 9 composite, leaving 5 and 7 to be marked prime in turn.

2.3 Factorization

The topic of factorization is well-explored and there are a large number of different algorithms relevant to the problem. A common method is to combine different algorithms, since their running times might depend on different factors. One usually starts out with an algorithm dependent on the integer's smallest prime factor until those factors are depleted, following up with an algorithm dependent on the integer's size. These two types of algorithms are generally referred to as First Category and Second Category respectively. Both algorithms used in this paper, the Sieve of Eratosthenes and Trial Division, are First Category algorithms, the former even acting as a basis for the category in the first place. In addition to factorization algorithms, a probabilistic primality test can be used to confirm whether a number is prime or composite in polynomial time prior to factorization. This currently reduces the amount of calculation necessary by a large margin, since no polynomial-time factorization algorithm has yet been published. Factoring clearly lies in the complexity class NP, but as of yet no solution in P has been published apart from Shor's quantum algorithm [14]. The current state of the art for prime factorization of large numbers is the Second Category algorithm General Number Field Sieve (GNFS). It utilizes discrete mathematics, smooth numbers and squares. GNFS is generally not used for numbers less than 10^100 [15]. For smaller numbers than that, the Quadratic Sieve is still the premier algorithm to use.

2.3.1 Trial Division

Perhaps the easiest method of finding all of a number n's divisors is to attempt division with every single number it possibly could be. This naive method, appropriately named Trial Division, loses its effectiveness as n gets bigger but is more than adequate for the smaller numbers used in this paper. The set of possible divisors for a number n is defined as s = {x | x ∈ Z, 2 ≤ x ≤ √n} [16]. By iterating through these numbers, keeping n constant throughout the run, all divisors can be found. Since they are not necessarily prime, a slight change must be made to the algorithm: for every found divisor d, the number n should be set to n = n/d, and the same divisor should be tested again until it no longer divides n. For example, for n = 12 the divisor 2 must be used twice (12 → 6 → 3) before moving on. Furthermore, s should be iterated through in order from lowest to highest. This order of iteration ensures, according to the fundamental theorem of arithmetic [17], that prime divisors are found before their composite counterparts. The repeated trials of found divisors are necessary to find the unique combination of the number's prime factors; otherwise, the final remainder could be a composite of prime numbers of which the original number has duplicates.

2.4 CUDA

CUDA is a parallel computing platform and programming model released by NVIDIA in 2007. The Graphics Processing Unit, or GPU, was originally designed for intensive real-time graphics processing, yet in recent years, much thanks to tools like CUDA and OpenCL, the usage of GPUs for general-purpose processing has seen a great surge [18]. With a large number of specialized cores at its disposal, the GPU can, despite a typically lower clock speed, greatly outperform regular CPUs when tasked with a parallelizable problem. Whereas a deep understanding of graphics programming, or a low-level assembly language, was previously needed to fully utilize its power, CUDA enables C and C++ code to be executed directly on GPUs [18]. This allows for the same code snippets to run on both components with only minor changes. The CUDA model distinguishes between CPUs and GPUs by calling the former hosts and the latter devices.
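As a concrete illustration of the host/device model, below is a minimal CUDA sketch; the kernel, variable names and sizes are illustrative and not taken from the implementations benchmarked in this report.

    // Minimal CUDA sketch of the host/device model. A __global__ function
    // (a "kernel") runs on the device; each thread handles one element.
    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void increment(int *data, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] += 1;
    }

    int main() {
        const int n = 1 << 20;
        int *host = new int[n]();                 // host (CPU) memory, zeroed
        int *device;
        cudaMalloc(&device, n * sizeof(int));     // device (GPU) memory
        cudaMemcpy(device, host, n * sizeof(int), cudaMemcpyHostToDevice);

        // Launch enough 256-thread blocks to cover all n elements.
        increment<<<(n + 255) / 256, 256>>>(device, n);

        cudaMemcpy(host, device, n * sizeof(int), cudaMemcpyDeviceToHost);
        std::printf("host[0] = %d\n", host[0]);   // prints 1
        cudaFree(device);
        delete[] host;
        return 0;
    }

Chapter 3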

Method

This chapter explains the method used to arrive at the result. It includes a discussion of the approach, pseudocode for the Sieve of Eratosthenes and Trial Division and a declaration of the hardware and software used during the testing.

3.1 Approach

It can be difficult to satisfactorily compare different systems because of how they differ. Different requirements can lead to wholly different starting points, and it can therefore be beneficial to even out as many of them as possible. As mentioned in section 2.4, using CUDA allows for executing very similar code on both architectures. The two algorithms used are fairly simple. Both the Sieve of Eratosthenes and Trial Division are naive algorithms which utilize exhaustive search to solve their respective problems. The idea is that an abundance of simple calculations, which can easily be made in both parallel and sequential fashion, should fare equally well between a CPU and a GPU: the higher clock speed of a CPU versus the better parallelization of a GPU. Five different programs are to be run: two purely sequential programs running on a CPU and two purely parallel programs running on a GPU, the difference within each pair being the presence, or absence, of a sieve to find all necessary primes and reduce the number of possible divisors. Finally, there is a program where the sieve is done on a CPU and the factorization subsequently done on a GPU.


3.2 Algorithms

This section presents the pseudocode of the two algorithms used in this paper. Further introduction can be found in section 2.2.1.1 for the Sieve of Eratosthenes and in section 2.3.1 for Trial Division.

Data: An integer max_num, the highest number that needs to be factorized.
Result: An array prime of length m where prime indexes are one and the rest zero.

m ← √max_num
prime ← array of ones of length m
n ← 2
while n ≤ m do
    if prime[n] = 1 then
        eliminate ← n + n
        while eliminate ≤ m do
            prime[eliminate] ← 0
            eliminate ← eliminate + n
        end
    end
    n ← n + 1
end

Algorithm 1: Sieve of Eratosthenes
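For reference, a direct, runnable host-side C translation of Algorithm 1 might look as follows; this is a sketch, not the source code actually benchmarked in this report.

    // Host-side sketch of Algorithm 1. Entry i of the returned array is 1
    // if i is prime and 0 otherwise, for 0 <= i <= floor(sqrt(max_num)).
    #include <math.h>
    #include <stdint.h>
    #include <stdlib.h>
    #include <string.h>

    uint8_t *sieve_of_eratosthenes(uint64_t max_num, uint64_t *out_len) {
        uint64_t m = (uint64_t)sqrt((double)max_num);
        uint8_t *prime = (uint8_t *)malloc(m + 1);
        memset(prime, 1, m + 1);          // all prime candidates initially
        prime[0] = 0;
        if (m >= 1) prime[1] = 0;         // 0 and 1 are not prime
        for (uint64_t n = 2; n <= m; n++) {
            if (prime[n]) {
                // Cross off every multiple of the confirmed prime n.
                for (uint64_t e = n + n; e <= m; e += n)
                    prime[e] = 0;
            }
        }
        *out_len = m + 1;
        return prime;
    }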

Data: An array prime of length m where prime indexes are one and others zero. The number to factorize, number. An array divisors of length log number.
Result: divisors, an array where the combination of non-zero elements equals the unique prime representation of number.

m ← √number
p ← 2
index ← 0
while p ≤ m do
    if number < p then
        return divisors
    end
    if prime[p] = 1 then
        while number ≡ 0 (mod p) do
            divisors[index] ← p
            number ← number/p
            index ← index + 1
        end
    end
    p ← p + 1
end
if number > 1 then
    divisors[index] ← number
end

Algorithm 2: Trial Division
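A matching runnable host-side C translation of Algorithm 2, again as a sketch rather than the report's actual source, could be:

    // Host-side sketch of Algorithm 2, using the prime table produced by
    // the sieve above. Writes the prime factors of `number` into
    // `divisors` and returns how many factors were found.
    #include <math.h>
    #include <stdint.h>

    int trial_division(const uint8_t *prime, uint64_t number, uint64_t *divisors) {
        uint64_t m = (uint64_t)sqrt((double)number);
        int index = 0;
        for (uint64_t p = 2; p <= m; p++) {
            if (number < p) return index;   // nothing left to divide out
            if (!prime[p]) continue;        // only try confirmed primes
            while (number % p == 0) {       // divide out repeated factors
                divisors[index++] = p;
                number /= p;
            }
        }
        if (number > 1)
            divisors[index++] = number;     // the remainder is itself prime
        return index;
    }

Called with number = 84, for instance, the sketch yields the factors 2, 2, 3 and 7.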

3.3 Implementations

The data has been run through five different implementations of the basic algorithm, as described below.

3.3.1 Sequential Algorithm with Sieve

This algorithm computes which numbers between 2 and √n (where n is the largest number in the data set) are prime. The implementation thereafter utilizes the primes found during sieving to perform Trial Division on each element in the data set. The implementation runs until either all prime factors are found or all integers up to √n have been tried. The implementation is performed entirely on the CPU and only utilizes one core and one thread. This implementation is referred to as Algorithm A.

3.3.2 Sequential Algorithm without Sieve

This algorithm does not compute any prime numbers prior to performing the Trial Division, but tries all numbers instead. The implementation runs until either all prime factors are found or all integers up to √n (where n is the largest number in the data set) have been tried. The implementation is performed entirely on the CPU and only utilizes one core and one thread. This implementation is referred to as Algorithm B.

3.3.3 Parallel Algorithm with Sieve

This algorithm begins similarly to algorithm A by sieving numbers, but on the GPU. However, unlike algorithm A, it does not wait for the sieve to complete before starting the Trial Division. The implementation keeps track of how far the sieve has reached, and only performs Trial Division using confirmed prime numbers. Each Trial Division task runs on a separate thread on the GPU and runs until either all prime factors per element are found or all integers up to √n have been tried. Except for the setup stage, which is executed on the CPU, the majority of the implementation is executed on the GPU. This implementation is referred to as Algorithm C.

3.3.4 Parallel Algorithm without Sieve

Much like algorithm B, this implementation does not perform any sieve prior to the Trial Division and thus tries all numbers instead. Each Trial Division task is, like in algorithm C, performed on a separate thread on the GPU and runs until either all prime factors per element are found or all integers up to √n have been tried. The setup stage of the implementation is executed on the CPU, but the algorithm itself is solely executed on the GPU. This implementation is referred to as Algorithm D.

3.3.5 Algorithm with Sieve on CPU and Trial Division on GPU

This implementation begins much like algorithm A by sieving numbers between 2 and √n, where n is the largest number in the data set.

The sieved numbers are then transferred to the device, which computes each Trial Division task on a separate thread, only trying the sieved numbers as possible prime factors. The setup and sieve are performed on the CPU, while the Trial Division, which is performed once the sieve has completed, is executed on the GPU. This implementation is referred to as Algorithm E.
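To illustrate the division of labour that algorithm E describes, below is a hedged CUDA sketch: the sieve table is produced on the host, copied to the device, and one Trial Division task runs per device thread. The kernel and parameter names are illustrative, not the report's code.

    // Sketch of algorithm E's structure: one Trial Division task per device
    // thread, using a host-produced prime table. Illustrative names only.
    #include <cuda_runtime.h>
    #include <stdint.h>

    __global__ void factorize_kernel(const uint8_t *prime, uint64_t prime_len,
                                     const uint64_t *numbers, uint64_t count,
                                     uint64_t *divisors, int max_factors) {
        uint64_t i = blockIdx.x * (uint64_t)blockDim.x + threadIdx.x;
        if (i >= count) return;
        uint64_t n = numbers[i];
        int index = 0;
        // Only confirmed primes from the host-side sieve are tried.
        for (uint64_t p = 2; p < prime_len && p * p <= n; p++) {
            if (!prime[p]) continue;
            while (n % p == 0 && index < max_factors) {
                divisors[i * max_factors + index++] = p;  // record factor p
                n /= p;
            }
        }
        if (n > 1 && index < max_factors)
            divisors[i * max_factors + index] = n;        // prime remainder
    }

The host side would first run the sieve, copy the prime table and the input numbers to the device with cudaMemcpy, launch factorize_kernel with one thread per input number, and finally copy the divisors array back to the host.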

3.4 Data

The data consists of the Cartesian product of seven different types of constraints and twelve different lengths. The only exception is the data set in 4.4; the running time for this test case escalated too quickly and, due to its unambiguous result, we decided against further testing. In total, there are 79 data points to discuss and analyze. The data sets in 4.1 and 4.3 consist of non-composite numbers, the former exclusively made up of small primes between 2 and 97 and the latter of primes between 1e+7 and 1e+8. The set in 4.2 is comprised of numbers larger than 1.5e+11 that are products of primes less than or equal to 97. Similarly, the data set in 4.4 consists of numbers that are products of two randomly selected primes between 5e+6 and 6e+6. The data set in 4.5 is made up of randomized numbers between 100 and 2^20, while the one in 4.6 is comprised of randomized numbers between 2^20 and 2^31 − 1. The final data set in 4.7 consists of randomized numbers between 100 and 2^31 − 1.

3.5 Hardware

The hardware used for running these tests was not the most powerful available on the market. However, this makes no conceptual difference.

3.5.1 Central Processing Unit - CPU

The central processing unit of the hardware chosen is the Intel Core i7 4710HQ. The processor has a clock speed of 2.50 GHz and 6 MB of level 3 cache. The memory bus speed is 5 GT/s (gigatransfers per second).

3.5.2 Graphical Processing Unit - GPU

The graphical processing unit used is the NVIDIA GeForce GTX 850M. The GPU has 640 CUDA cores, each with a clock speed of 936 MHz. The memory bandwidth is 80 GB/s.

3.6 Compilation Tools and Runtime Environments

Relevant software used during production and testing was CUDA v8.0.60 and GCC v5.3.0. The operating system all implementations were executed on is Windows 10, build number 14393.1066.

Chapter 4

Results

This chapter presents the results of the testing. Each section consists of a graph where the x-axis shows the length of the data set and the y-axis the time in seconds that it took the program to terminate. Each graph is preceded by text introducing the data set and some observations about the graph.


4.1 Small Prime Numbers

The graph below shows the results of the five implementations when faced with the first data set, prime numbers between 2 and 97, on the various input data lengths. As visualized in the graph, the baseline cost is much higher for the parallel implementations executed on the GPU (algorithms C, D and E) compared to the sequential implementations (algorithms A and B) executed only on the CPU. As an example, the run time of algorithm C was 0.14 seconds for 10 000 numbers and 0.61 seconds for 50 000, with the baseline cost subtracted. Compared to 0.01 and 0.03 seconds for the same data sets run through algorithm A this is still higher, but the difference in percentage terms is a lot lower. When faced with the longer lengths of input data, the lower clock speed of the GPU becomes more prominent in algorithms C, D and E. At the same time, it is notable that both sequential versions also have to work harder when faced with longer input data.

[Figure: run time in seconds versus data size (10^0 to 10^6, log scale) for algorithms A–E.]

4.2 Large Numbers that are Products of Many Small Prime Numbers

The graph below shows the results of the five implementations when faced with the second data set, products of prime numbers between 2 and 97 that are larger than 150 000 000 000, on the various input data lengths. It is clear that algorithm C is slower for the tested input data. This is due to the high upper limit of the sieve in combination with the device's lower clock speed. It is also apparent that, with this particular data set, it is a lot faster not to sieve prime numbers before performing the Trial Division for lower lengths of input data, as shown by algorithms B and D. This is due to the prime factors being quite small in contrast to the high limit that needs to be sieved to. However, when faced with the longer lengths of input data, the sieving implementations (algorithms A, C and E) begin to approach the performance of the non-sieving implementations (algorithms B and D). As an example, algorithm E already outperformed algorithm D when faced with the data set of length 500 000. The difference in performance between algorithms C and D is 6.24 seconds for the data set of length 500 000; in comparison, the difference is 14.02 seconds for the data set of length 100 000.

[Figure: run time in seconds versus data size (10^0 to 10^6, log scale) for algorithms A–E.]

4.3 Large Prime Numbers

The graph below shows the results of the five implementations when faced with the third data set, prime numbers between 10 000 000 and 100 000 000, on the various input data lengths. Algorithm B, which does not sieve before performing the Trial Division, is, as shown in the graph, not much slower than algorithm A, which performs the sieve. On one hand, algorithm A has to perform a lot of work before the Trial Division can begin, but it reduces the number of Trial Divisions that have to be made. Algorithm B, on the other hand, does not have to compute the sieve but has to perform many more Trial Divisions instead. In the end these two effects seem to even each other out when faced with this data set. When comparing the parallel versions (algorithms C and D), the difference stands out more and more as the data set length increases. This is likely due to the lower clock speed of the GPU cores not being able to make up for the theoretically lower amount of work that needs to be performed when only trying prime numbers in the Trial Division, which gives algorithm D the longer run time. When comparing algorithm C and algorithm E, it is noticeable that their run times are equal independent of the length of the data set. This is a further indication that it is not the sieve that takes the most time, but rather the Trial Division.

[Figure: run time in seconds versus data size (10^0 to 10^6, log scale) for algorithms A–E.]

4.4 Numbers that are Products of Two Larger Prime Numbers

The graph below shows the results of the five implementations when faced with the fourth data set, products of two prime numbers between 5 000 000 and 6 000 000, on the various input data lengths. An important note for this test case is that the input data length is at most 1 000. This is due to the huge number of Trial Divisions that have to be made, especially in the absence of sieving, in combination with the lower clock speed of the GPU. The time taken by algorithm D increases rapidly even for these small lengths of data. It is clear that though the sequential implementations (algorithms A and B) were faster than the parallel ones utilizing sieving (algorithms C and E), it was not by a large margin in percentage terms. For example, algorithm B, the corresponding non-sieving implementation executed on the CPU, took 51.83 seconds to terminate given the data set of length 1 000. This is 1430.65 seconds (about 24 minutes) shorter than algorithm D.

[Figure: run time in seconds versus data size (10^0 to 10^3, log scale) for algorithms A–E.]

4.5 Small Random Numbers

The graph below shows the results of the five implementations when faced with the fifth data set, randomized numbers between 100 and 2^20, on the various input data lengths. It is easy to see that both sequential implementations (algorithms A and B) terminate much more quickly than the parallel implementations (algorithms C, D and E). The run time of algorithm D increases more rapidly on the data sets with longer lengths. This is due to the massive number of Trial Divisions that have to be calculated in the absence of the sieve. There is, however, not much difference in run time between the algorithms at the lower input data lengths. As an example, the run times of all implementations for the data set of length 1 000 are, in order, 0.03, 0.03, 1.91, 1.93 and 1.73 seconds. The majority of the difference between the sequential and parallel implementations is due to the baseline cost of the parallel algorithms.

[Figure: run time in seconds versus data size (10^0 to 10^6, log scale) for algorithms A–E.]

4.6 Large Random Numbers

The graph below shows the results of the five implementations when faced with the sixth data set, randomized numbers between 2^20 and 2^31 − 1, on the various input data lengths. At the lower input data lengths there is not much difference in the performance of the implementations. However, when the input data length reaches 10^4, the difference in speed between the algorithms becomes more and more obvious. The run time difference between the fastest and slowest implementations, algorithms A and D, is at this point 11.50 seconds. The run time of algorithm D increases very rapidly as the data set length grows beyond 10^4, as visualized in the graph. The final result for algorithm D comes in at about 4 500 seconds for 500 000 numbers, or 1 hour and 15 minutes. On the other hand, algorithms C and E come in at about 10 minutes, algorithm B finishes within 3 minutes and the run time of algorithm A ends up at 91.61 seconds. This means that algorithm A terminates almost 51 times faster than algorithm D on this particular data set.

[Figure: run time in seconds versus data size (10^0 to 10^6, log scale) for algorithms A–E.]

4.7 Mixed Small and Large Random Numbers

The graph below shows the results of the five implementations when faced with the seventh and final data set, randomized numbers between 100 and 2^31 − 1, on the various input data lengths. Very similarly to the results for large random numbers presented in 4.6, there is no comparable difference in speed between the algorithms at the lower input data lengths. The difference becomes more apparent as the data set size increases to the higher levels, where the run time of algorithm D once again grows far faster than that of the other implementations.

[Figure: run time in seconds versus data size (10^0 to 10^6, log scale) for algorithms A–E.]

4.8 Baseline Test

The final test is to examine the baseline cost of the implementations fully or partially executed on the device and compare these to the implementations executed on the host. The following function calls were looked at carefully:

cudaMalloc — allocating memory on the device
cudaMemcpy — copying data from host to device
cudaMemcpy — copying data from device to host
cudaFree — freeing memory on the device
malloc — allocating memory on the host

The results of the small amount of test code executed show that the first CUDA call takes about 1.8 seconds on average, independent of the type of call. The second call takes far less time, ranging from 0.0 to 0.1 seconds depending on the type of command and the size of the data processed. In comparison, malloc took between 0 and 0.1 seconds depending on the size of the data. As an example of the above, consider the case where cudaFree was called first, then cudaMalloc. The call to cudaFree takes 1.85 seconds, after which cudaMalloc takes only 0.0 seconds. This is the same kind of pattern that appeared during the previous tests, but with cudaMalloc and cudaMemcpy. From this it is possible to conclude that it is not the actual function call that takes time, but rather the initialization process of the device and the CUDA library.
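A micro-benchmark of this kind could look like the sketch below; this is an assumption of how such a test might be written, not the report's test code. It uses cudaFree(0) as the first, context-creating call.

    // Sketch of the baseline-cost measurement described above (not the
    // report's test code). The first CUDA call triggers context
    // initialization; subsequent calls are cheap by comparison.
    #include <chrono>
    #include <cstdio>
    #include <cuda_runtime.h>

    static void time_call(const char *label, void (*fn)()) {
        auto t0 = std::chrono::steady_clock::now();
        fn();
        std::chrono::duration<double> dt =
            std::chrono::steady_clock::now() - t0;
        std::printf("%-10s %.3f s\n", label, dt.count());
    }

    static void first_free() { cudaFree(0); }   // forces context creation
    static void do_malloc()  { void *p; cudaMalloc(&p, 1 << 20); cudaFree(p); }

    int main() {
        time_call("cudaFree",   first_free);    // expected: seconds (init cost)
        time_call("cudaMalloc", do_malloc);     // expected: close to zero
        return 0;
    }

Chapter 5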

Discussion

The sequential versions of the algorithms were faster, independent of the type or length of the input data. Algorithm C was, in every test case but 4.2, faster than algorithm D. The reason for this is most likely that, whilst the potential upper bound for a prime factor was upwards of a couple hundred thousand, the true upper bound was only 97. This meant that the sieve had to go through quite an excessive workload before considering itself done. The program was not able to make up for the loss of time even at data lengths of 500 000. Perhaps this could have been mitigated by introducing a more intelligent sieve, one that would regularly check whether further sieving was necessary or not. From the graphs it was also apparent that algorithm E, the one which sieves on the CPU whilst performing Trial Division on the GPU, was only slightly faster than the algorithms running solely on the GPU. This was probably the result of division being a more expensive operation than addition: the time saved by sieving on the CPU was small compared to the cost of the slower division operations executed many times over on the GPU.

5.1 The Parallel Baseline Cost

The baseline cost for all parallel algorithms was higher than for their sequential counterparts. This, according to our tests, was due to the initialization process of CUDA. As seen in 4.8, it took on average 1.85 seconds to establish the context. The cost of allocating memory on the device was comparatively low, ranging from 0.00 seconds for a byte to


0.10 seconds for a gigabyte. This is still far more than the time it takes to allocate the same amount of memory on the host. Since the tests were run in isolation from each other, this additional cost was incurred every time. For lower data set lengths it accounted for the majority of the time taken, but as can be seen in every test case, it diminished quickly for greater lengths. With most of the baseline cost being constant and only a small fraction of it dependent on the length of the data, it can be disregarded in a theoretical case where the program is not running in isolation. This would still not affect the result, as the run times for all programs utilizing the GPU diverged more quickly than those of the ones not utilizing it.

5.2 The Growth of Algorithm D

Algorithm D showed a similar growth pattern in all test cases, diverging more rapidly than all the others. This could depend on many different factors but is most likely due to the extreme number of Trial Divisions. In the case of smaller numbers, the absence or presence of sieving is not as noticeable. When the test numbers, and especially the prime factors, get larger, the performance loss is easier to notice, as the number of Trial Divisions is so much higher compared to when sieving. The lower clock speed of the GPU also contributes to the more rapid rise, like a constant factor in a mathematical function. It was expected that the implementation would have a longer run time in all cases but 4.2. But as visualized, the implementation showed the same pattern there and began to catch up to algorithm C. It was expected that it would grow faster than the four other implementations in all cases but 4.2; it was however unexpected that it would grow this rapidly. This shows that the process of sieving can pay off in cases where it substantially reduces the number of necessary Trial Divisions.

Chapter 6

Conclusion

The purely sequential implementation running solely on a CPU was faster in every tested case, and all graphs are uniform in that the algorithms utilizing the GPU diverge more quickly than the ones that do not. We can therefore draw the conclusion that, for this specific implementation of the two naive algorithms, the Sieve of Eratosthenes and Trial Division, the usage of a GPU was not beneficial. Perhaps the result would have been different had the algorithms been implemented differently.

6.1 Future Work

If anything, the result of this paper shows that there is more research to be done in this area. There are a great number of parallelizable algorithms relevant to factorization that could hopefully be run efficiently on a GPU. Alternative implementations of the algorithms used in this paper could also lead to other results.

Bibliography

[1] Hendrik W. Lenstra Jr. "Factoring integers with elliptic curves". In: Annals of Mathematics (1987), pp. 649–673.

[2] Richard Crandall and Carl Pomerance. Prime numbers: a computational perspective. Vol. 182. Springer Science & Business Media, 2006.

[3] Wladyslaw Narkiewicz. The development of prime number theory: from Euclid to Hardy and Littlewood. Springer Science & Business Media, 2013.

[4] Samuel Horsley. "ΚΟΣΚΙΝΟΝ ΕΡΑΤΟΣΘΕΝΟΥΣ. or, The Sieve of Eratosthenes. Being an Account of His Method of Finding All the Prime Numbers, by the Rev. Samuel Horsley, FRS". In: Philosophical Transactions (1683–1775) 62 (1772), pp. 327–347.

[5] Michael Sean Mahoney. The mathematical career of Pierre de Fermat, 1601–1665. Princeton University Press, 1994.

[6] William Dunham. Euler: The master of us all. Vol. 22. MAA, 1999.

[7] David V. Chudnovsky and Gregory V. Chudnovsky. "Sequences of numbers generated by addition in formal groups and new primality and factorization tests". In: Advances in Applied Mathematics 7.4 (1986), pp. 385–434.

[8] W. Stanley Jevons. The principles of science. 1877.

[9] Ronald L. Rivest, Adi Shamir, and Leonard Adleman. "A method for obtaining digital signatures and public-key cryptosystems". In: Communications of the ACM 21.2 (1978), pp. 120–126.

[10] Thorsten Kleinjung et al. "Factorization of a 768-bit RSA modulus". In: Annual Cryptology Conference. Springer, 2010, pp. 333–350.


[11] Burt Kaliski. "Privacy enhancement for internet electronic mail: Part IV: key certification and related services". In: (1993).

[12] Viggo Brun. "Über das Goldbachsche Gesetz und die Anzahl der Primzahlpaare". In: Arch. Mat. Natur. B (1915).

[13] J. J. O'Connor and E. F. Robertson. Eratosthenes Biography. 1999. URL: http://www-groups.dcs.st-and.ac.uk/history/Biographies/Eratosthenes.html (visited on 03/08/2017).

[14] Peter W. Shor. "Polynomial-time algorithms for prime factorization and discrete logarithms on a quantum computer". In: SIAM Review 41.2 (1999), pp. 303–332.

[15] Carl Pomerance. "A tale of two sieves". In: Biscuits of Number Theory 85 (2008), p. 175.

[16] Hans Riesel. Prime numbers and computer methods for factorization. Vol. 126. Springer Science & Business Media, 2012.

[17] Friedrich Gauss. "Fundamental theorem of arithmetic". In: ().

[18] CUDA Nvidia. Programming guide. 2017.