DEGREE PROJECT IN TECHNOLOGY, FIRST CYCLE, 15 CREDITS
STOCKHOLM, SWEDEN 2017

A Comparison of Performance Between a CPU and a GPU on Prime Factorization Using Eratosthenes' Sieve and Trial Division

CAROLINE W. BORG

ERIK DACKEBRO

KTH ROYAL INSTITUTE OF TECHNOLOGY
SCHOOL OF COMPUTER SCIENCE AND COMMUNICATION

A Comparison of Performance Between a CPU and a GPU on Prime Factorization Using Eratosthenes' Sieve and Trial Division

CAROLINE W. BORG
ERIK DACKEBRO

Bachelor in Computer Science
Date: June 11, 2017
Supervisor: Mårten Björkman
Examiner: Örjan Ekeberg
Swedish title: En jämförelse i prestanda mellan en CPU och en GPU avseende primtalsfaktorisering med hjälp av såll och Försöksdivision
School of Computer Science and Communication


Abstract

There has been remarkable advancement in multi-core processing units over the past decade. GPUs, originally designed as specialized graphics processors, are today used in a wide variety of other areas. Their ability to solve parallel problems is unmatched due to their massive number of simultaneously running cores. Despite this, most algorithms in use today are still fully sequential and do not utilize the processing power available. The Sieve of Eratosthenes and Trial Division are two very naive algorithms which combined can be used to find a number's unique combination of prime factors. This paper sought to compare the performance of a CPU and a GPU when tasked with prime factorization on seven different data sets. Five different programs were created: two running solely on the CPU, two running solely on the GPU and one program utilizing both. Each data set was presented multiple times to each program in different sizes ranging from one to half a million. The result was uniform in that the CPU greatly outperformed the GPU in every test case for this specific implementation.

Sammanfattning

Flerkärniga processorer har under det senaste årtiondet utvecklats markant. Grafikkorten, designade för att excellera i grafiktunga beräkningar, används idag inom flera andra områden. Kortens förmåga att lösa paralleliserbara problem är oöverträffad tack vare deras massiva antal kärnor. Trots detta är majoriteten av algoritmer idag fortfarande helt sekventiella. Eratosthenes såll och Försöksdivision är två väldigt naiva algoritmer som tillsammans kan användas för att finna ett tals unika uppsättning primtalsfaktorer. Det här arbetet strävade efter att jämföra prestandan mellan en CPU och en GPU vad gäller uppgiften att faktorisera tal från sju olika uppsättningar data. Fem implementationer skrevs, varav två var begränsade till CPU:ns processorkraft, två begränsade till GPU:ns processorkraft och en som utnyttjade båda. Varje uppsättning data förekom i olika storlekar i omfånget ental till en halv miljon. Resultatet var entydigt på så sätt att CPU:n markant överträffade GPU:n i samtliga testfall.

Contents

1 Introduction
  1.1 Purpose
  1.2 Problem Statement
    1.2.1 Constraints

2 Background
  2.1 Prime Numbers
  2.2 Finding Primes
    2.2.1 Sieve
      2.2.1.1 Sieve of Eratosthenes
  2.3 Factorization
    2.3.1 Trial Division
  2.4 CUDA

3 Method
  3.1 Approach
  3.2 Algorithms
  3.3 Implementations
    3.3.1 Sequential Algorithm with Sieve
    3.3.2 Sequential Algorithm without Sieve
    3.3.3 Parallel Algorithm with Sieve
    3.3.4 Parallel Algorithm without Sieve
    3.3.5 Algorithm with Sieve on CPU and Trial Division on GPU
  3.4 Data
  3.5 Hardware
    3.5.1 Central Processing Unit - CPU
    3.5.2 Graphical Processing Unit - GPU
  3.6 Compilation Tools and Runtime Environments

4 Results
  4.1 Small Prime Numbers
  4.2 Large Numbers that are Products of Many Small Prime Numbers
  4.3 Large Prime Numbers
  4.4 Numbers that are Products of Two Larger Prime Numbers
  4.5 Small Random Numbers
  4.6 Large Random Numbers
  4.7 Mixed Small and Large Random Numbers
  4.8 Baseline Test

5 Discussion
  5.1 The Parallel Baseline Cost
  5.2 The Growth of Algorithm D

6 Conclusion
  6.1 Future Work

Bibliography

Chapter 1

Introduction

Multi-core processing units are nowadays considered commonplace in everyday computers. Despite this, relatively little software fully utilizes their power. A reason for this could be that many algorithms in use today are either old algorithms themselves or closely related to older algorithms. Multi-core units are a fairly recent phenomenon, with the first dual-core processor introduced as late as 2001. Algorithms, on the other hand, have been developed and used for thousands of years. These algorithms, as victims of their time, are typically very sequential in nature. There exist many algorithms for determining whether a number is prime or not. Factorization, more specifically prime factorization, is however a more difficult problem, and although there exist some algorithms more efficient than exhaustive search, no currently available algorithm does better than the complexity class SUBEXPTIME [1].

1.1 Purpose

The purpose of this report is to investigate whether simpler tasks, like prime factorization, can be parallelized on GPUs with any gain in performance. It is interesting to find out if it is possible to improve performance by exchanging a more advanced and faster core for a great number of less advanced, slower cores, especially since this can sometimes be achieved without making major changes to the code.


1.2 Problem Statement

This report investigates the usage of GPUs compared to CPUs for the prime factorization of seven different data sets of numbers ranging from the tens up to a couple of trillions. The prime factorization is to be performed using the rather simple approach of Trial Division. Prime factorization by Trial Division does not necessarily require precalculated primes as input, but the exclusion of composite numbers should prove beneficial. These primes can be found in many ways, one of the more straightforward being the Sieve of Eratosthenes. Both the sieve and the actual Trial Division can be parallelized and the results then calculated on multiple cores. The parallelization could in theory give a performance gain if implemented correctly. The theory is that if enough threads run in parallel, they could make up for the difference in clock speed between a CPU and a GPU. This performance gain could grow even further when executed on a GPU, with its specialization in massive parallelism over simple calculations. This leads up to the research question:

Using the naive methods, Sieve of Eratosthenes and Trial Division, how does a parallel implementation running on a GPU fare against a sequential implementation running on a CPU?

1.2.1 Constraints

The report is limited by the following constraints.

• The integers that we will try to factorize will be smaller than 1.455e+13. The number is a middle ground between integer size and bit length. For comparison, 2^44 is approximately 1.759e+13.

• The data set is limited to seven different types of test data, each with its own way of testing the implementations. See 3.4.

• The length of each test is limited by an upper bound. This limit is 500 000 for all test cases except for 4.4, which has a limit of 1 000.

Chapter 2

Background

This chapter introduces the reader to concepts necessary to understand this report. Two major concepts, factorization and prime numbers, are introduced and then built upon with some background on finding primes. The latter includes general theory about sieves and an introduction to the two algorithms used in this paper, namely the Sieve of Eratosthenes and Trial Division. The background concludes with a description of CUDA.

2.1 Prime Numbers

The fundamental theorem of arithmetic states that every integer larger than one is either a prime itself or the product of a unique combination of primes. A prime is a positive integer which has exactly two positive divisors, one and itself [2]. Prime number theory has been around since at least 300 B.C., when Euclid described the greatest common divisor and the least common multiple [3], both of which are based on prime numbers. In the same time period Eratosthenes described a method for calculating all prime numbers up to a user-chosen upper bound, called the Sieve of Eratosthenes [4]. After the ancient Greeks, little happened in the prime number field until the 17th century, when Pierre de Fermat presented Fermat's little theorem [5], a well-known theorem underlying later primality tests. Later in the 18th and 19th centuries, more progress was accomplished in the prime number field by mathematicians such as Leonhard Euler [6], Édouard Lucas and Derrick Henry Lehmer [7]. Until the 1970s, prime numbers were believed to have little practical application


outside of the mathematical field. However, as first predicted by Jevons [8], large primes can be used in a cryptographic system to perform a one-way encryption. This is a key concept of an important and common operation in today's computer systems, namely asymmetric encryption. One of the better known asymmetric protocols is RSA, which is based on the multiplication of two large primes [9]. The largest factored RSA number to date is 768 bits long, far beyond the scope of this report [10]. Although the public and private keys in modern systems are often reused [11], the setup part of the algorithm relies on finding these two large primes to multiply together. For this part of the algorithm to have decent performance, the two prime numbers need to be found within a reasonable time.

2.2 Finding Primes

There are multiple ways of finding large prime numbers. The most common method is not to calculate them, but instead to select a number of the desired length and then perform a primality test. A new number, close to the previous one, can be chosen if the former is proven composite. Initial testing can also be used to ensure that the selected number does not contain any small prime factors. The above is however an inefficient method for finding smaller primes, up to a few million. The easiest and most efficient known way of finding primes in these ranges is to compute them, e.g. using one of many available sieves.
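To make the pick-and-test method concrete, the following is a minimal host-side sketch. It assumes trial division by a handful of small primes followed by a base-2 Fermat test as the primality check; the report does not prescribe a specific test, so the function names and the choice of test are illustrative only.

    // Hypothetical sketch of the "pick a number, then test" method
    // described above. The Fermat base-2 test is an assumption made for
    // illustration; any primality test could be substituted.
    #include <stdbool.h>
    #include <stdint.h>

    // Modular multiplication via a 128-bit intermediate (GCC/Clang extension).
    static uint64_t mulmod(uint64_t a, uint64_t b, uint64_t m) {
        return (uint64_t)((__uint128_t)a * b % m);
    }

    // Square-and-multiply modular exponentiation.
    static uint64_t powmod(uint64_t base, uint64_t exp, uint64_t m) {
        uint64_t result = 1;
        base %= m;
        while (exp > 0) {
            if (exp & 1) result = mulmod(result, base, m);
            base = mulmod(base, base, m);
            exp >>= 1;
        }
        return result;
    }

    // Probable-prime check: trial division by small primes first, then
    // Fermat's little theorem with base 2 (2^(n-1) ≡ 1 (mod n) if n is prime).
    static bool is_probable_prime(uint64_t n) {
        static const uint64_t small[] = {2, 3, 5, 7, 11, 13, 17, 19, 23};
        for (int i = 0; i < 9; i++) {
            if (n == small[i]) return true;
            if (n % small[i] == 0) return false;
        }
        if (n < 2) return false;
        return powmod(2, n - 1, n) == 1;
    }

    // Select a candidate of the desired magnitude and step through odd
    // numbers until a probable prime is found.
    uint64_t find_probable_prime(uint64_t start) {
        uint64_t n = start | 1;
        while (!is_probable_prime(n)) n += 2;
        return n;
    }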

2.2.1 Sieve

A sieve is traditionally a method, or tool, with the purpose of separating wanted elements from unwanted ones. Sieve methods have seen some usage in modern number theory, where the study of such methods is called sieve theory. They have been used to prove results such as Brun's theorem [12] but are probably more commonly associated, by outsiders, with the Sieve of Eratosthenes.

2.2.1.1 Sieve of Eratosthenes

Eratosthenes of Cyrene presented this method of finding primes more than two thousand years ago and it is still, though in a more refined

form, used today in number theory research [13]. It is a beautifully simplistic algorithm utilizing a very basic pattern to iteratively cross off integers from a set of potential primes. Its initial condition is the set of all numbers from two to a finite n, where each element has three possible states: it can either be a prime, a prime candidate or a composite number. Initially, all elements are considered prime candidates. Each iteration begins by setting the state of the lowest prime candidate to prime and then iteratively setting all other multiples of that element to composite. Eventually it will reach a state where no more prime candidates exist. All elements will then be either a definite prime or a composite number. For example, sieving up to n = 10 first marks 2 prime and 4, 6, 8, 10 composite, then marks 3 prime and 6, 9 composite, leaving 5 and 7 to be marked prime in turn.

2.3 Factorization

The topic of factorization is well-explored and there are a large number of different algorithms relevant to the problem. A common method is to combine different algorithms, since their running times might depend on different factors. One usually starts out with an algorithm dependent on the integer's smallest prime factor until those factors are depleted, following up with an algorithm dependent on the integer's size. These two types of algorithms are generally referred to as First Category and Second Category respectively. Both algorithms used in this paper, the Sieve of Eratosthenes and Trial Division, are First Category algorithms, the former even acting as a basis for the category in the first place. In addition to factorization algorithms, a probabilistic primality test can be used to confirm whether a number is prime or composite in polynomial time prior to factorization. This currently reduces the amount of calculation necessary by a large margin, since no polynomial-time factorization algorithm has yet been published. Factoring clearly lies in the complexity class NP, but as of yet no solution in P has been published apart from Shor's quantum algorithm [14]. The current state of the art for prime factorization of large numbers is the Second Category algorithm General Number Field Sieve (GNFS). It utilizes discrete mathematics, smooth numbers and squares. GNFS is generally not used for numbers less than 10^100 [15]. For smaller numbers than that, the Quadratic Sieve is still the premier algorithm to use.

2.3.1 Trial Division

Perhaps the easiest method of finding all of a number n's divisors is to attempt division with every single number it possibly could be. This naive method, appropriately named Trial Division, loses its effectiveness as n gets bigger but is more than adequate for the smaller numbers used in this paper. The set of possible divisors for a number n is defined as s = {x | x ∈ Z, 2 ≤ x ≤ √n} [16]. By iterating through these numbers, keeping n constant throughout the run, all divisors can be found. Since they are not necessarily prime, a slight change must be made to the algorithm: for every found divisor d, the number n should be set to n = n/d, and the same divisor should be tested again until it no longer divides n. For example, for n = 12 the divisor 2 must be used twice (12 → 6 → 3) before moving on. Furthermore, s should be iterated through in order from lowest to highest. This order of iteration ensures, according to the fundamental theorem of arithmetic [17], that prime divisors are found before their composite counterparts. The repeated trials of found divisors are necessary to find the unique combination of the number's prime factors; otherwise, the final remainder could be a composite of prime numbers of which the original number has duplicates.

2.4 CUDA

CUDA is a parallel computing platform and programming model released by NVIDIA in 2007. The Graphics Processing Unit, or GPU, was originally designed for intensive real-time graphics processing, yet in recent years, much thanks to tools like CUDA and OpenCL, the usage of GPUs for general-purpose processing has seen a great surge [18]. With a large number of specialized cores at its disposal, the GPU can, despite a typically lower clock speed, greatly outperform regular CPUs when tasked with a parallelizable problem. Whereas a deep understanding of graphics programming, or a low-level assembly language, was previously needed to fully utilize its power, CUDA enables C and C++ code to be executed directly on GPUs [18]. This allows for the same code snippets to run on both components with only minor changes. The CUDA model distinguishes between CPUs and GPUs by calling the former hosts and the latter devices.
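As a concrete illustration of the host/device model, below is a minimal CUDA sketch; the kernel, variable names and sizes are illustrative and not taken from the implementations benchmarked in this report.

    // Minimal CUDA sketch of the host/device model. A __global__ function
    // (a "kernel") runs on the device; each thread handles one element.
    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void increment(int *data, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] += 1;
    }

    int main() {
        const int n = 1 << 20;
        int *host = new int[n]();                 // host (CPU) memory, zeroed
        int *device;
        cudaMalloc(&device, n * sizeof(int));     // device (GPU) memory
        cudaMemcpy(device, host, n * sizeof(int), cudaMemcpyHostToDevice);

        // Launch enough 256-thread blocks to cover all n elements.
        increment<<<(n + 255) / 256, 256>>>(device, n);

        cudaMemcpy(host, device, n * sizeof(int), cudaMemcpyDeviceToHost);
        std::printf("host[0] = %d\n", host[0]);   // prints 1
        cudaFree(device);
        delete[] host;
        return 0;
    }

Chapter 3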

Method

This chapter explains the method used to arrive at the result. It includes a discussion of the approach, pseudocode for the Sieve of Eratosthenes and Trial Division and a declaration of the hardware and software used during the testing.

3.1 Approach

It can be difficult to satisfactorily compare different systems because of how they differ. Different requirements can lead to wholly different starting points, and it can therefore be beneficial to even out as many of them as possible. As mentioned in section 2.4, using CUDA allows for executing very similar code on both architectures. The two algorithms used are fairly simple. Both the Sieve of Eratosthenes and Trial Division are naive algorithms which utilize exhaustive search to solve their respective problems. The idea is that an abundance of simple calculations, which can easily be made in both parallel and sequential fashion, should fare equally well between a CPU and a GPU: the higher clock speed of a CPU versus the better parallelization of a GPU. Five different programs are to be run: two purely sequential programs running on a CPU and two purely parallel programs running on a GPU, the difference within each pair being the presence, or absence, of a sieve to find all necessary primes and reduce the number of possible divisors. Finally, there is a program where the sieve is done on a CPU and the factorization subsequently done on a GPU.


3.2 Algorithms

This section presents the pseudocode of the two algorithms used in this paper. Further introduction can be found in section 2.2.1.1 for the Sieve of Eratosthenes and in section 2.3.1 for Trial Division.

Data: An integer max_num, the highest number that needs to be factorized.
Result: An array prime of length m where prime indexes are one and the rest zero.

m ← √max_num
prime ← array of ones of length m
n ← 2
while n ≤ m do
    if prime[n] = 1 then
        eliminate ← n + n
        while eliminate ≤ m do
            prime[eliminate] ← 0
            eliminate ← eliminate + n
        end
    end
    n ← n + 1
end

Algorithm 1: Sieve of Eratosthenes
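For reference, a direct, runnable host-side C translation of Algorithm 1 might look as follows; this is a sketch, not the source code actually benchmarked in this report.

    // Host-side sketch of Algorithm 1. Entry i of the returned array is 1
    // if i is prime and 0 otherwise, for 0 <= i <= floor(sqrt(max_num)).
    #include <math.h>
    #include <stdint.h>
    #include <stdlib.h>
    #include <string.h>

    uint8_t *sieve_of_eratosthenes(uint64_t max_num, uint64_t *out_len) {
        uint64_t m = (uint64_t)sqrt((double)max_num);
        uint8_t *prime = (uint8_t *)malloc(m + 1);
        memset(prime, 1, m + 1);          // all prime candidates initially
        prime[0] = 0;
        if (m >= 1) prime[1] = 0;         // 0 and 1 are not prime
        for (uint64_t n = 2; n <= m; n++) {
            if (prime[n]) {
                // Cross off every multiple of the confirmed prime n.
                for (uint64_t e = n + n; e <= m; e += n)
                    prime[e] = 0;
            }
        }
        *out_len = m + 1;
        return prime;
    }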

Data: An array prime of length m where prime indexes are one and others zero. The number to factorize, number. An array divisors of length log number.
Result: divisors, an array where the combination of non-zero elements equals the unique prime representation of number.

m ← √number
p ← 2
index ← 0
while p ≤ m do
    if number < p then
        return divisors
    end
    if prime[p] = 1 then
        while number ≡ 0 (mod p) do
            divisors[index] ← p
            number ← number/p
            index ← index + 1
        end
    end
    p ← p + 1
end
if number > 1 then
    divisors[index] ← number
end

Algorithm 2: Trial Division
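A matching runnable host-side C translation of Algorithm 2, again as a sketch rather than the report's actual source, could be:

    // Host-side sketch of Algorithm 2, using the prime table produced by
    // the sieve above. Writes the prime factors of `number` into
    // `divisors` and returns how many factors were found.
    #include <math.h>
    #include <stdint.h>

    int trial_division(const uint8_t *prime, uint64_t number, uint64_t *divisors) {
        uint64_t m = (uint64_t)sqrt((double)number);
        int index = 0;
        for (uint64_t p = 2; p <= m; p++) {
            if (number < p) return index;   // nothing left to divide out
            if (!prime[p]) continue;        // only try confirmed primes
            while (number % p == 0) {       // divide out repeated factors
                divisors[index++] = p;
                number /= p;
            }
        }
        if (number > 1)
            divisors[index++] = number;     // the remainder is itself prime
        return index;
    }

Called with number = 84, for instance, the sketch yields the factors 2, 2, 3 and 7.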

3.3 Implementations

The data has been run through five different implementations of the basic algorithm, as described below.

3.3.1 Sequential Algorithm with Sieve

This algorithm computes which numbers between 2 and √n (where n is the largest number in the data set) are prime. The implementation thereafter utilizes the primes found during sieving to perform Trial Division on each element in the data set. The implementation runs until either all prime factors are found or all integers up to √n have been tried. The implementation is performed entirely on the CPU and only utilizes one core and one thread. This implementation is referred to as Algorithm A.

3.3.2 Sequential Algorithm without Sieve

This algorithm does not compute any prime numbers prior to performing the Trial Division, but tries all numbers instead. The implementation runs until either all prime factors are found or all integers up to √n (where n is the largest number in the data set) have been tried. The implementation is performed entirely on the CPU and only utilizes one core and one thread. This implementation is referred to as Algorithm B.

3.3.3 Parallel Algorithm with Sieve

This algorithm begins similarly to algorithm A by sieving numbers, but on the GPU. However, unlike algorithm A, it does not wait for the sieve to complete before starting the Trial Division. The implementation keeps track of how far the sieve has reached, and only performs Trial Division using confirmed prime numbers. Each Trial Division task runs on a separate thread on the GPU and runs until either all prime factors per element are found or all integers up to √n have been tried. Except for the setup stage, which is executed on the CPU, the majority of the implementation is executed on the GPU. This implementation is referred to as Algorithm C.

3.3.4 Parallel Algorithm without Sieve

Much like algorithm B, this implementation does not perform any sieve prior to the Trial Division and thus tries all numbers instead. Each Trial Division task is, like in algorithm C, performed on a separate thread on the GPU and runs until either all prime factors per element are found or all integers up to √n have been tried. The setup stage of the implementation is executed on the CPU, but the algorithm itself is solely executed on the GPU. This implementation is referred to as Algorithm D.

3.3.5 Algorithm with Sieve on CPU and Trial Division on GPU

This implementation begins much like algorithm A by sieving numbers between 2 and √n, where n is the largest number in the data set.

The sieved numbers are then transferred to the device, which computes each Trial Division task on a separate thread, only trying the sieved numbers as possible prime factors. The setup and sieve are performed on the CPU, while the Trial Division, which is performed once the sieve has completed, is executed on the GPU. This implementation is referred to as Algorithm E.
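To illustrate the division of labour that algorithm E describes, below is a hedged CUDA sketch: the sieve table is produced on the host, copied to the device, and one Trial Division task runs per device thread. The kernel and parameter names are illustrative, not the report's code.

    // Sketch of algorithm E's structure: one Trial Division task per device
    // thread, using a host-produced prime table. Illustrative names only.
    #include <cuda_runtime.h>
    #include <stdint.h>

    __global__ void factorize_kernel(const uint8_t *prime, uint64_t prime_len,
                                     const uint64_t *numbers, uint64_t count,
                                     uint64_t *divisors, int max_factors) {
        uint64_t i = blockIdx.x * (uint64_t)blockDim.x + threadIdx.x;
        if (i >= count) return;
        uint64_t n = numbers[i];
        int index = 0;
        // Only confirmed primes from the host-side sieve are tried.
        for (uint64_t p = 2; p < prime_len && p * p <= n; p++) {
            if (!prime[p]) continue;
            while (n % p == 0 && index < max_factors) {
                divisors[i * max_factors + index++] = p;  // record factor p
                n /= p;
            }
        }
        if (n > 1 && index < max_factors)
            divisors[i * max_factors + index] = n;        // prime remainder
    }

The host side would first run the sieve, copy the prime table and the input numbers to the device with cudaMemcpy, launch factorize_kernel with one thread per input number, and finally copy the divisors array back to the host.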

3.4 Data

The data consists of the Cartesian product of seven different types of constraints and twelve different lengths. The only exception is the data set in 4.4; the running time for this test case escalated too quickly and, due to its unambiguous result, we decided against further testing. In total, there are 79 data points to discuss and analyze. The data sets in 4.1 and 4.3 consist of non-composite numbers, the former exclusively made up of small primes between 2 and 97 and the latter of primes between 1e+7 and 1e+8. The set in 4.2 is comprised of numbers larger than 1.5e+11 that are products of primes less than or equal to 97. Similarly, the data set in 4.4 consists of numbers that are products of two randomly selected primes between 5e+6 and 6e+6. The data set in 4.5 is made up of randomized numbers between 100 and 2^20, while the one in 4.6 is comprised of randomized numbers between 2^20 and 2^31 − 1. The final data set in 4.7 consists of randomized numbers between 100 and 2^31 − 1.

3.5 Hardware

The hardware used for running these tests was not the most powerful available on the market. However, this makes no conceptual difference.

3.5.1 Central Processing Unit - CPU

The central processing unit of the hardware chosen is the Intel Core i7 4710HQ. The processor has a clock speed of 2.50 GHz and 6 MB of level 3 cache. The memory bus speed is 5 GT/s (gigatransfers per second).

3.5.2 Graphical Processing Unit - GPU

The graphical processing unit used is the NVIDIA GeForce GTX 850M. The GPU has 640 CUDA cores, each with a clock speed of 936 MHz. The memory bandwidth is 80 GB/s.

3.6 Compilation Tools and Runtime Environments

Relevant software used during production and testing was CUDA v8.0.60 and GCC v5.3.0. The operating system all implementations were executed on is Windows 10, build number 14393.1066.

Chapter 4

Results

This chapter presents the results of the testing. Each section consists of a graph where the x-axis shows the length of the data set and the y-axis the time in seconds that it took the program to terminate. Each graph is preceded by text introducing the data set and some observations about the graph.


4.1 Small Prime Numbers

The graph below shows the results of the five implementations when faced with the first data set, prime numbers between 2 and 97, on the various input data lengths. As visualized in the graph, the baseline cost is much higher for the parallel implementations executed on the GPU (algorithms C, D and E) compared to the sequential implementations (algorithms A and B) executed only on the CPU. As an example, the run time of algorithm C was 0.14 seconds for 10 000 numbers and 0.61 seconds for 50 000, with the baseline cost subtracted. Compared to 0.01 and 0.03 seconds for the same data sets run through algorithm A this is still higher, but the difference in percentage terms is a lot lower. When faced with the longer lengths of input data, the lower clock speed of the GPU becomes more prominent in algorithms C, D and E. At the same time, it is notable that both sequential versions also have to work harder when faced with longer input data.

[Figure: run time in seconds versus data size (10^0 to 10^6, log scale) for algorithms A–E.]

4.2 Large Numbers that are Products of Many Small Prime Numbers

The graph below shows the results of the five implementations when faced with the second data set, products of prime numbers between 2 and 97 that are larger than 150 000 000 000, on the various input data lengths. It is clear that algorithm C is slower for the tested input data. This is due to the high upper limit of the sieve in combination with the device's lower clock speed. It is also apparent that, with this particular data set, it is a lot faster not to sieve prime numbers before performing the Trial Division for lower lengths of input data, as shown by algorithms B and D. This is due to the prime factors being quite small in contrast to the high limit that needs to be sieved to. However, when faced with the longer lengths of input data, the sieving implementations (algorithms A, C and E) begin to approach the performance of the non-sieving implementations (algorithms B and D). As an example, algorithm E already outperformed algorithm D when faced with the data set of length 500 000. The difference in performance between algorithms C and D is 6.24 seconds for the data set of length 500 000; in comparison, the difference is 14.02 seconds for the data set of length 100 000.

[Figure: run time in seconds versus data size (10^0 to 10^6, log scale) for algorithms A–E.]

4.3 Large Prime Numbers

The graph below shows the results of the five implementations when faced with the third data set, prime numbers between 10 000 000 and 100 000 000, on the various input data lengths. Algorithm B, which does not sieve before performing the Trial Division, is, as shown in the graph, not much slower than algorithm A, which performs the sieve. On one hand, algorithm A has to perform a lot of work before the Trial Division can begin, but it reduces the number of Trial Divisions that have to be made. Algorithm B, on the other hand, does not have to compute the sieve but has to perform many more Trial Divisions instead. In the end these two effects seem to even each other out when faced with this data set. When comparing the parallel versions (algorithms C and D), the difference stands out more and more as the data set length increases. This is likely due to the lower clock speed of the GPU cores not being able to make up for the theoretically lower amount of work that needs to be performed when only trying prime numbers in the Trial Division, which gives algorithm D the longer run time. When comparing algorithm C and algorithm E, it is noticeable that their run times are equal independent of the length of the data set. This is a further indication that it is not the sieve that takes the most time, but rather the Trial Division.

[Figure: run time in seconds versus data size (10^0 to 10^6, log scale) for algorithms A–E.]

4.4 Numbers that are Products of Two Larger Prime Numbers

The graph below shows the results of the five implementations when faced with the fourth data set, products of two prime numbers between 5 000 000 and 6 000 000, on the various input data lengths. An important note for this test case is that the input data length is at most 1 000. This is due to the huge number of Trial Divisions that have to be made, especially in the absence of sieving, in combination with the lower clock speed of the GPU. The time taken by algorithm D increases rapidly even for these small lengths of data. It is clear that though the sequential implementations (algorithms A and B) were faster than the parallel ones utilizing sieving (algorithms C and E), it was not by a large margin in percentage terms. For example, algorithm B, the corresponding non-sieving implementation executed on the CPU, took 51.83 seconds to terminate given the data set of length 1 000. This is 1430.65 seconds (about 24 minutes) shorter than algorithm D.

[Figure: run time in seconds versus data size (10^0 to 10^3, log scale) for algorithms A–E.]

4.5 Small Random Numbers

The graph below shows the results of the five implementations when faced with the fifth data set, randomized numbers between 100 and 2^20, on the various input data lengths. It is easy to see that both sequential implementations (algorithms A and B) terminate much more quickly than the parallel implementations (algorithms C, D and E). The run time of algorithm D increases more rapidly on the data sets with longer lengths. This is due to the massive number of Trial Divisions that have to be calculated in the absence of the sieve. There is, however, not much difference in run time between the algorithms at the lower input data lengths. As an example, the run times of all implementations for the data set of length 1 000 are, in order, 0.03, 0.03, 1.91, 1.93 and 1.73 seconds. The majority of the difference between the sequential and parallel implementations is due to the baseline cost of the parallel algorithms.

[Figure: run time in seconds versus data size (10^0 to 10^6, log scale) for algorithms A–E.]

4.6 Large Random Numbers

The graph below shows the results of the five implementations when faced with the sixth data set, randomized numbers between 2^20 and 2^31 − 1, on the various input data lengths. At the lower input data lengths there is not much difference in the performance of the implementations. However, when the input data length reaches 10^4, the difference in speed between the algorithms becomes more and more obvious. The run time difference between the fastest and slowest implementations, algorithms A and D, is at this point 11.50 seconds. The run time of algorithm D increases very rapidly as the data set length grows beyond 10^4, as visualized in the graph. The final result for algorithm D comes in at about 4 500 seconds for 500 000 numbers, or 1 hour and 15 minutes. On the other hand, algorithms C and E come in at about 10 minutes, algorithm B finishes within 3 minutes and the run time of algorithm A ends up at 91.61 seconds. This means that algorithm A terminates almost 51 times faster than algorithm D on this particular data set.

[Figure: run time in seconds versus data size (10^0 to 10^6, log scale) for algorithms A–E.]

4.7 Mixed Small and Large Random Numbers

The graph below shows the results of the five implementations when faced with the seventh and final data set, randomized numbers between 100 and 2^31 − 1, on the various input data lengths. Very similarly to the results for large random numbers presented in 4.6, there is no comparable difference in speed between the algorithms at the lower input data lengths. The difference becomes more apparent as the data set size increases to the higher levels, where the run time of algorithm D once again grows far faster than that of the other implementations.

[Figure: run time in seconds versus data size (10^0 to 10^6, log scale) for algorithms A–E.]

4.8 Baseline Test

The final test is to examine the baseline cost of the implementations fully or partially executed on the device and compare these to the implementations executed on the host. The following function calls were looked at carefully:

cudaMalloc — allocating memory on the device
cudaMemcpy — copying data from host to device
cudaMemcpy — copying data from device to host
cudaFree — freeing memory on the device
malloc — allocating memory on the host

The results of the small amount of test code executed show that the first CUDA call takes about 1.8 seconds on average, independent of the type of call. The second call takes far less time, ranging from 0.0 to 0.1 seconds depending on the type of command and the size of the data processed. In comparison, malloc took between 0 and 0.1 seconds depending on the size of the data. As an example of the above, consider the case where cudaFree was called first, then cudaMalloc. The call to cudaFree takes 1.85 seconds, after which cudaMalloc takes only 0.0 seconds. This is the same kind of pattern that appeared during the previous tests, but with cudaMalloc and cudaMemcpy. From this it is possible to conclude that it is not the actual function call that takes time, but rather the initialization process of the device and the CUDA library.
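A micro-benchmark of this kind could look like the sketch below; this is an assumption of how such a test might be written, not the report's test code. It uses cudaFree(0) as the first, context-creating call.

    // Sketch of the baseline-cost measurement described above (not the
    // report's test code). The first CUDA call triggers context
    // initialization; subsequent calls are cheap by comparison.
    #include <chrono>
    #include <cstdio>
    #include <cuda_runtime.h>

    static void time_call(const char *label, void (*fn)()) {
        auto t0 = std::chrono::steady_clock::now();
        fn();
        std::chrono::duration<double> dt =
            std::chrono::steady_clock::now() - t0;
        std::printf("%-10s %.3f s\n", label, dt.count());
    }

    static void first_free() { cudaFree(0); }   // forces context creation
    static void do_malloc()  { void *p; cudaMalloc(&p, 1 << 20); cudaFree(p); }

    int main() {
        time_call("cudaFree",   first_free);    // expected: seconds (init cost)
        time_call("cudaMalloc", do_malloc);     // expected: close to zero
        return 0;
    }

Chapter 5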

Discussion

The sequential versions of the algorithms were faster, independent of the type or length of the input data. Algorithm C was, in every test case but 4.2, faster than algorithm D. The reason for this is most likely that, whilst the potential upper bound for a prime factor was upwards of a couple hundred thousand, the true upper bound was only 97. This meant that the sieve had to go through quite an excessive workload before considering itself done. The program was not able to make up for the loss of time even at data lengths of 500 000. Perhaps this could have been mitigated by introducing a more intelligent sieve, one that would regularly check whether further sieving was necessary or not. From the graphs it was also apparent that algorithm E, the one which sieves on the CPU whilst performing Trial Division on the GPU, was only slightly faster than the algorithms running solely on the GPU. This was probably the result of division being a more expensive operation than addition: the time saved by sieving on the CPU was small compared to the cost of the slower division operations executed many times over on the GPU.

5.1 The Parallel Baseline Cost

The baseline cost for all parallel algorithms was higher than for their sequential counterparts. This, according to our tests, was due to the initialization process of CUDA. As seen in 4.8, it took on average 1.85 seconds to establish the context. The cost of allocating memory on the device was comparatively low, ranging from 0.00 seconds for a byte to


0.10 seconds for a gigabyte. This is still far more than the time it takes to allocate the same amount of memory on the host. Since the tests were run in isolation from each other, this additional cost was incurred every time. For lower data set lengths it accounted for the majority of the time taken, but as can be seen in every test case, it diminished quickly for greater lengths. With most of the baseline cost being constant and only a small fraction of it dependent on the length of the data, it can be disregarded in a theoretical case where the program is not running in isolation. This would still not affect the result, as the run times for all programs utilizing the GPU diverged more quickly than those of the ones not utilizing it.

5.2 The Growth of Algorithm D

Algorithm D showed a similar growth pattern in all test cases, diverging more rapidly than all the others. This could depend on many different factors but is most likely due to the extreme number of Trial Divisions. In the case of smaller numbers, the absence or presence of sieving is not as noticeable. When the test numbers, and especially the prime factors, get larger, the performance loss is easier to notice, as the number of Trial Divisions is so much higher compared to when sieving. The lower clock speed of the GPU also contributes to the more rapid rise, like a constant factor in a mathematical function. It was expected that the implementation would have a longer run time in all cases but 4.2. But as visualized, the implementation showed the same pattern there and began to catch up to algorithm C. It was expected that it would grow faster than the four other implementations in all cases but 4.2; it was however unexpected that it would grow this rapidly. This shows that the process of sieving can pay off in cases where it substantially reduces the number of necessary Trial Divisions.

Chapter 6

Conclusion

The purely sequential implementation running solely on a CPU was faster in every tested case, and all graphs are uniform in that the algorithms utilizing the GPU diverge more quickly than the ones that do not. We can therefore draw the conclusion that, for this specific implementation of the two naive algorithms, the Sieve of Eratosthenes and Trial Division, the usage of a GPU was not beneficial. Perhaps the result would have been different had the algorithms been implemented differently.

6.1 Future Work

If anything, the result of this paper shows that there is more research to be done in this area. There are a great number of parallelizable algorithms relevant to factorization that could hopefully be run efficiently on a GPU. Alternative implementations of the algorithms used in this paper could also lead to other results.

Bibliography

[1] Hendrik W. Lenstra Jr. "Factoring integers with elliptic curves". In: Annals of Mathematics (1987), pp. 649–673.

[2] Richard Crandall and Carl Pomerance. Prime numbers: a computational perspective. Vol. 182. Springer Science & Business Media, 2006.

[3] Wladyslaw Narkiewicz. The development of prime number theory: from Euclid to Hardy and Littlewood. Springer Science & Business Media, 2013.

[4] Samuel Horsley. "ΚΟΣΚΙΝΟΝ ΕΡΑΤΟΣΘΕΝΟΥΣ. or, The Sieve of Eratosthenes. Being an Account of His Method of Finding All the Prime Numbers, by the Rev. Samuel Horsley, FRS". In: Philosophical Transactions (1683–1775) 62 (1772), pp. 327–347.

[5] Michael Sean Mahoney. The mathematical career of Pierre de Fermat, 1601–1665. Princeton University Press, 1994.

[6] William Dunham. Euler: The master of us all. Vol. 22. MAA, 1999.

[7] David V. Chudnovsky and Gregory V. Chudnovsky. "Sequences of numbers generated by addition in formal groups and new primality and factorization tests". In: Advances in Applied Mathematics 7.4 (1986), pp. 385–434.

[8] W. Stanley Jevons. The principles of science. 1877.

[9] Ronald L. Rivest, Adi Shamir, and Leonard Adleman. "A method for obtaining digital signatures and public-key cryptosystems". In: Communications of the ACM 21.2 (1978), pp. 120–126.

[10] Thorsten Kleinjung et al. "Factorization of a 768-bit RSA modulus". In: Annual Cryptology Conference. Springer, 2010, pp. 333–350.


[11] Burt Kaliski. "Privacy enhancement for internet electronic mail: Part IV: key certification and related services". In: (1993).

[12] Viggo Brun. "Über das Goldbachsche Gesetz und die Anzahl der Primzahlpaare". In: Arch. Mat. Natur. B (1915).

[13] J. J. O'Connor and E. F. Robertson. Eratosthenes Biography. 1999. URL: http://www-groups.dcs.st-and.ac.uk/history/Biographies/Eratosthenes.html (visited on 03/08/2017).

[14] Peter W. Shor. "Polynomial-time algorithms for prime factorization and discrete logarithms on a quantum computer". In: SIAM Review 41.2 (1999), pp. 303–332.

[15] Carl Pomerance. "A tale of two sieves". In: Biscuits of Number Theory 85 (2008), p. 175.

[16] Hans Riesel. Prime numbers and computer methods for factorization. Vol. 126. Springer Science & Business Media, 2012.

[17] Friedrich Gauss. "Fundamental theorem of arithmetic". In: ().

[18] CUDA Nvidia. Programming guide. 2017.