
Assessing the Performance of Optimised Primality Tests

Cameron Curry

August 19, 2016

MSc High Performance Computing
The University of Edinburgh
Year of Presentation 2016

Abstract

The performance of the Fermat Test implementation in the applications Genefer, LLR and OpenPFGW was assessed in terms of absolute speed and peak hardware capabilities. Speed was defined as the number of Fermat Test iterations completed per second, subsequently giving an indication of FFT performance. The Roofline Model was also used to determine the extent to which the applications made use of given hardware in terms of floating point operations per second and peak main memory bandwidth.

It was determined that Genefer is superior in terms of absolute speed, and that the Roofline performance increased for all applications with FFT size. It was also shown that overhead in each application was due to the FFT, and thus little scope for performance enhancement was found. Following this, the residue calculation of Genefer was modified such that it conformed to the residues of LLR and OpenPFGW.

Acknowledgements

I would like to thank Mr Iain Bethune from EPCC for his supervision and guidance of this project and EPCC as a whole for access to ARCHER, the UK National Supercomputing Service.

List of Figures

1 Theoretical Roofline Model displaying various performance regions in log-log scale.
2 Time per iteration of Fermat Test with n = 13, and b in the range 10^5 ≤ b ≤ 3 × 10^7.
3 Fermat Test of n = 13 with b in the range 10^4 ≤ b ≤ 10^6.
4 Fermat Test of n = 16 with b in the range 10^5 ≤ b ≤ 10^7.
5 Fermat Test of n = 19 with b in the range 10^5 ≤ b ≤ 10^7.
6 Measured STREAM bandwidth in terms of L3 cache size.
7 STREAM Bandwidth using 24 OpenMP threads.
8 Roofline Model of Genefer, LLR, and PFGW.
9 Roofline Model of n = 16, 19.

List of Tables

1 Stream Kernels.
2 HWPC Groups for Intel Ivy Bridge.
3 PFGW FFT lengths in terms of b for n = 13.
4 STREAM Results for recommended array size.
5 Single core STREAM Results.
6 Triad results of STREAM tracing experiment.
7 Micro benchmark trace results.
8 Sampled data from Genefer for a GFN of n = 19 and b = 10^6.
9 LLC Misses and References Hardware Counter Data.
10 TLB Misses and Walk Duration.
11 LLR sampling experiment data with GFN n = 19, b = 10^6.
12 PFGW sampling experiment data with GFN n = 19, b = 10^6.
13 Residue of all Applications for a GFN of n = 13, b = 100 for Fermat Base 2.
14 Residue of all Applications for a GFN of n = 13, b = 10^4 for Fermat Base 3.
15 Valid and Invalid Large Integer Representations.
16 Bit manipulation of binary array inserted into an unsigned integer.
17 GMP Residue and Radix to Binary Residue produced by Genefer using Fermat Base 3.

Contents

1 Introduction
1.1 Prime Numbers and their Applications
1.2 Association with HPC
1.3 Project Aims

2 Background & Literature Review
2.1 Primality Tests
2.1.1 Fermat Test
2.1.2 Modular Exponentiation
2.1.3 FFT Multiplication & The Z-Transform
2.1.4 GFN Primes & The Discrete Weighted Transform
2.2 Applications & Hardware
2.2.1 LLR
2.2.2 OpenPFGW
2.2.3 Genefer
2.2.4 Hardware
2.3 Performance Benchmarking
2.3.1 Application Speed
2.3.2 The Roofline Model
2.3.3 STREAM
2.3.4 Cray Performance Measurement and Analysis Tools

3 Performance Analysis
3.1 Application Speed
3.2 Peak Bandwidth & Flops/s
3.3 Roofline Performance
3.4 Sources of Overhead

4 Residue Modification
4.1 Current Residue Calculations
4.2 Residue Modification with the GMP Library
4.3 Radix to Binary Residue Calculation

5 Conclusion & Further Work

APPENDICES

A STREAM Hardware Counter Data

B Genefer Hardware Counter Data

C LLR Hardware Counter Data

D PFGW Hardware Counter Data

Chapter 1: Introduction

1.1 Prime Numbers and their Applications

The concept of a prime number is very simple to grasp. A number n, a positive integer greater than 1, is a prime number if, and only if, the only two positive divisors of n are 1 and n itself. Should any other positive divisor exist, the number n is not prime and is therefore labelled composite. With such a simple definition, it seems rather counter-intuitive that prime numbers reveal a seemingly inexhaustible collection of interesting mathematical properties. For instance, the Fundamental Theorem of Arithmetic states that every composite number can be expressed as a unique product of primes:

n = p_1^{α_1} p_2^{α_2} · · · p_m^{α_m}    (1)
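For instance, 360 factorises uniquely as 360 = 2^3 × 3^2 × 5; no other product of prime powers yields 360.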

Since there exist infinitely many primes (as proven by Euclid), the Fundamental Theorem of Arithmetic leads to the notion that prime numbers are the elementary units of all natural numbers. As a consequence of the Fundamental Theorem of Arithmetic and the infinitude of primes, it can also be shown that the Euler Product is equivalent to the Riemann zeta function:

ζ(s) = Σ_{n=1}^{∞} 1/n^s = Π_{p ∈ P} 1/(1 − p^{−s})    (2)

for p in the infinite set of primes P. By extension, the relationship of primes to the Riemann zeta function also establishes an association of primes with the Riemann Hypothesis, one of the Millennium Prize Problems. Some have even gone so far as to suggest a relationship between the Riemann Hypothesis and the statistical distribution of energy levels in heavy nuclei [1], thereby suggesting a deep connection between prime numbers and the field of Quantum Mechanics. Prime numbers have also found practical use in the encryption software used to encrypt and decrypt much of the information routed over the Internet today. One example of such an encryption strategy is RSA encryption [2], which relies on the product of two prime numbers to encrypt and decrypt a message with public and private keys. The security of such an operation, then, depends on the time required to factorise the product:

n = p × q (3)

where p and q are two large “random” prime numbers, the factorisation of which should take an impractically long time on today's computer hardware.
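As a toy illustration (with numbers far too small to be secure), taking p = 61 and q = 53 gives n = 61 × 53 = 3233. Recovering the factors of 3233 is trivial, but for the moduli of 2048 bits and more used in practice, no known classical algorithm completes the factorisation in feasible time.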

1.2 Association with HPC

Common to all aspects of prime numbers discussed above is the implied knowledge of some infinite list of primes. While there exist infinitely many primes, rendering such a complete list impossible to compile, the knowledge of ever more prime numbers remains useful. This is especially true for RSA encryption, where new, larger prime numbers could enable the composition of more secure public and private keys, as the factorisation of large numbers becomes faster with advances in hardware. Computational techniques for finding new primes are now commonplace. While in the past the search for prime numbers took place on supercomputers, in recent history focus has shifted to distributed computing and task parallelism. Two projects in particular are leveraging the power of distributed computing in their search for prime numbers. The Great Internet Mersenne Prime Search (GIMPS) [3] collectively searches for prime numbers with 160,000 volunteer users. Dedicated to the search for a specific form of prime, Mersenne Primes 2^p − 1 (where p is also prime), GIMPS is responsible for finding the largest known prime as of January 2016 (2^74,207,281 − 1). To search for Mersenne primes, volunteer users are delegated tasks in a master-worker style. Using software called Prime95, tasks for testing Mersenne Prime candidates for primality are sent to users and take on the order of one month to complete. In competition with GIMPS, PrimeGrid [4] implements the same concept of distributed computing using the open source distributed computing framework BOINC [5]. Volunteer users contribute CPU cycles through this framework to many different projects which operate through BOINC, including the popular search for extraterrestrial intelligence: SETI@home. PrimeGrid uses BOINC in much the same way GIMPS uses Prime95. It delegates tasks to its 91,000 users, who contribute to many different prime number search areas, including a search for Generalised Fermat Primes. With these volunteer users, PrimeGrid's distributed computing scheme achieves an average performance of 1.28 PetaFlops with 269,000 CPU hosts [6]. By comparison, this performance figure would rival the 66th place of the Top 500 list of fastest supercomputers as of June 2016 [7].

1.3 Project Aims

With PrimeGrid's vast computing resource through BOINC, it is essential that the software conducting the search for prime numbers achieves high performance, to make best use of the volunteer CPU cycles it is afforded. In this project, the performance of three software applications (Genefer [8], LLR [9] and OpenPFGW [10]) used by PrimeGrid will be assessed. Specifically, the performance of the Fermat Test implementation (discussed in Section 2.1) in all three applications will be analysed. This performance analysis will measure the absolute speed of each application in comparison with the others. Further performance analysis will also determine the extent to which each application makes use of the maximum hardware performance using a commonplace HPC performance model, the Roofline Model [11].

Following the performance analysis of each application, one application, Genefer, will be modified. This modification targets the residue calculation (discussed in Section 4) in order to achieve congruency with LLR and OpenPFGW, as at present these applications both report a residue different to that of Genefer. While some emphasis of this project is placed on original software creation, the main body of work is concerned with performance profiling of the applications. This requires building the applications themselves and any dependencies, as well as the use of external tools. As a result, a mass of original code will not necessarily be produced; rather, the content of this project is the original analysis of the performance data gathered and discussed in the following chapters.

Chapter 2: Background & Literature Review

2.1 Primality Tests

The purpose of a primality test is to produce a result of prime or composite given an arbitrary integer n. Many forms of primality test exist, the simplest of which is trial division. Using the definition of a prime number, trial division simply tests all integers greater than 1 and less than n for any divisors. Should a trial division test the full range of integers without discovering a divisor, the integer n is proven to be prime. Certain optimisations to trial division exist, such as testing integers only from 2 to √n, or testing only primes less than √n; however, the method still remains slow. While trial division suggests an arithmetic running time of O(√N), in terms of bitwise operations the time complexity is exponential. For large integers such as the largest known prime 2^74,207,281 − 1, trial division becomes extremely impractical.
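As a concrete illustration, a minimal trial division sketch (plain C++, limited to 64-bit integers, so far below the sizes of interest here) looks as follows:

#include <cstdint>

// Trial division with the sqrt(n) optimisation: any factor pair (d, n/d)
// has one member no greater than sqrt(n), so testing d*d <= n suffices.
bool is_prime(std::uint64_t n) {
    if (n < 2) return false;
    for (std::uint64_t d = 2; d * d <= n; ++d)
        if (n % d == 0) return false;  // found a divisor: composite
    return true;
}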

2.1.1 Fermat Test

The Fermat Test [12, p. 73] is a primality test which is algorithmically superior to trial division; however, it is not deterministic. While trial division can determine the primality of an integer without a doubt, the Fermat Test indicates only a probable prime. The exact probability of primality is indeed quantifiable; however, it is not relevant within the scope of this project. The test itself makes use of Fermat's Little Theorem:

a^{p−1} ≡ 1 (mod p)    (4)

to determine if p is a probable prime for a given a in the range 2 ≤ a < p. The notation in (4) indicates that a^{p−1} is congruent to 1 modulo p. Simply put, if a^{p−1} modulo p is equal to 1, then p is a probable prime. Some technicalities associated with the Fermat Test described above are worthy of discussion. Should a Fermat Test of base a reveal a result other than 1, a is labelled a Fermat Witness of p. Fermat Witnesses are not a technicality as such; however, there exist cases where the Fermat Test wrongly identifies p as a probable prime for base a. In this case, base a should witness the compositeness of p but instead passes the Fermat Test. Should this occur, Fermat base a is known as a Fermat Liar. For example, consider the integer 341, which is factorisable as 341 = 11 × 31 and hence composite. In a base 2 Fermat Test of 341, the result is 2^340 mod 341 = 1, indicating 341 is a probable prime. Base 2 is therefore a Fermat Liar of 341 [12, p. 74]. A further technicality of note is that there exist composite integers for which every possible base a is a Fermat Liar. These are known as Carmichael Numbers [12, p. 76]. These technicalities illustrate that though the Fermat Test is an indicative test of primality, there are cases where it fails. In practice, any successful Fermat Test is immediately followed by a deterministic test for any probable prime p.

2.1.2 Modular Exponentiation

To perform the Fermat Test, one could naïvely compute the exponentiation a^{p−1}, requiring p − 1 multiplications of a, and then proceed with the modulo operation. However, as prime number candidates become large, the naïve multiplication method becomes cumbersome. It is possible to optimise this using a square-and-multiply algorithm. In what is known as binary ladder exponentiation [13, p. 458], the computation of x^y mod n is implemented by looping over a bitwise representation of the exponent y and successively squaring and multiplying the base x depending on the bits of y. Assuming the bitwise representation of y in little endian (least significant digit in the first position):

y = Σ_{i=0}^{D−1} y_i 2^i    (5)

For a binary representation of length D, x^y mod n is computed specifically by a left-to-right binary ladder. For each bit, starting from the second to last (or the second bit in big endian representation), the base x is repeatedly squared, and multiplied if bit y_i is equal to 1. This is demonstrated in the following pseudocode:

z = x
for i from D-2 to 0:
    z = z*z mod n
    if y_i == 1:
        z = x*z mod n
return z

Observing the following distributive property of the modulo operation:

(ab) mod n = [(a mod n)(b mod n)] mod n (6)

a modulo is inserted at each iteration rather than computed following the exponentiation. This keeps intermediate values small, as the exponentiation x^y is never fully computed; rather, the algorithm returns exactly x^y mod n. Clearly this exponentiation scheme is much more efficient than naïve multiplication. This conclusion can be proven in earnest by observing that a loop from D − 2 to 0 requires bitlength(y) − 1 iterations with at least 1 multiplication per iteration, given that the number of bits required to represent an integer y is:

bitlength(y) = ⌊log_2(y)⌋ + 1    (7)

In the worst case, where y is represented entirely by bits of 1, the exponentiation of x^y would require 2 multiplications per iteration and therefore 2⌊log_2(y)⌋ total multiplications. One could overlay a plot of f1(y) = 2⌊log_2(y)⌋ with f2(y) = y and observe that modular exponentiation is superior to naïve multiplication; however, this result is considered in the realm of common knowledge.
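A runnable rendering of the left-to-right ladder for word-sized operands is sketched below (an illustration only: the applications studied here apply the same ladder to multi-thousand-digit integers via FFT multiplication, and __builtin_clzll is a GCC/Clang intrinsic). It also reproduces the Fermat Liar example of Section 2.1.1:

#include <cassert>
#include <cstdint>

// Left-to-right binary ladder: computes x^y mod n for y >= 1.
// The modulo is applied after every square and multiply, so intermediate
// values never exceed n^2 (held in a 128-bit temporary).
std::uint64_t powmod(std::uint64_t x, std::uint64_t y, std::uint64_t n) {
    std::uint64_t z = x % n;
    int D = 64 - __builtin_clzll(y);           // bit length of y
    for (int i = D - 2; i >= 0; --i) {         // from the second-highest bit down
        z = (unsigned __int128)z * z % n;      // square
        if ((y >> i) & 1)
            z = (unsigned __int128)z * x % n;  // multiply
    }
    return z;
}

int main() {
    assert(powmod(2, 340, 341) == 1);  // base 2 is a Fermat Liar for 341
}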

2.1.3 FFT Multiplication & The Z-Transform

Prime candidates of interest can scale to several thousand, possibly millions, of decimal digits. Representation of integers with thousands of decimal digits is beyond the capabilities of most programming languages' primitive integer types, as these are limited to at most 64 bits. Therefore, to achieve a representation of integers above this limit, a positional notation system is used. This system requires an array of integers which represent the digits of a large integer in an arbitrary base B:

x = Σ_i d_i B^i    (8)

With this positional notation system, basic mathematical operations now introduce a non-trivial overhead. One could imagine a protocol for the multiplication of two large integers in much the same way as multiplication is taught in grammar school. This so-called “grammar school multiply” would introduce an O(N^2) asymptotic run time and hence be detrimental to the performance of the left-to-right binary ladder previously discussed. Instead, a Fast Fourier Transform (FFT) multiplication scheme is implemented for increased algorithmic performance. In the following sections regarding FFT multiplication, a loose but not definite distinction is made between the Discrete Fourier Transform (DFT) and FFTs. While DFTs usually refer to an O(N^2) Fourier Transform algorithm and FFTs to an O(N log N) one, certain optimised DFTs are presented as FFTs to emphasise their superior time complexity. The generic DFT and its inverse are computed, respectively, as follows:

X_k = Σ_{j=0}^{D−1} x_j g^{−jk} ,   x_j = (1/D) Σ_{k=0}^{D−1} X_k g^{jk}    (9)

where most commonly the “twiddle factors” g refer to e^{2πi/D}, with i = √−1. In an FFT, these twiddle factors are only calculated once, followed by a divide-and-conquer algorithm which calculates each sum. Hence FFTs achieve a superior algorithmic complexity of O(N log N).

The multiplication of large integers via FFTs is implemented in its most fundamental form using the Schönhage-Strassen algorithm [14] (see [13, p. 490] in English). The Schönhage-Strassen algorithm multiplies two large integers x and y, both of which contain D digits, as follows:

Schönhage-Strassen multiplication for product z = xy
1. Zero-pad arrays x and y to length 2D
2. Compute FFTs: X = FFT(x), Y = FFT(y)
3. Compute dyadic product: Z = X · Y
4. Take inverse FFT: z = FFT^{−1}(Z)
5. Round digits: z = round(z)
6. Add and carry digit adjustment

The dyadic product indicates an element-by-element multiplication Z_n = X_n Y_n, and the rounding step is necessary as FFTs employ floating point arithmetic and the result z must be integral. Note that rounding does not equate to a truncation of the mantissa's fractional bits (e.g. an integer cast). Specific strategies for rounding floating point values are usually associated with specific FFT multiplication algorithms. The zero-padding step in the algorithm described above indicates a group of 0s appended to the arrays x and y. This is required to satisfy a property of cyclic and acyclic convolutions imperative to the Schönhage-Strassen algorithm. For arrays x and y of length D representing large integers, each element of the cyclic convolution of length D is defined as:

(x ∗ y)_n = Σ_{j+k ≡ n (mod D)} x_j y_k    (10)

With similar x and y, the acyclic convolution is also defined as:

(x ∗_A y)_n = Σ_{j+k=n} x_j y_k    (11)

where the acyclic convolution has length 2D (n ∈ {0, ..., 2D − 1}). It is straightforward to show that the product of large integers xy is equivalent to the acyclic convolution of length 2D (this is indeed the grammar school multiply previously discussed):

xy = Σ_{n=0}^{2D−1} (x ∗_A y)_n B^n    (12)

It follows that the cyclic convolution of arrays x and y zero-padded to length 2D is equivalent to the acyclic convolution; however, this operation remains O(N^2). FFT multiplication is then applied using the Convolution Theorem, which reduces multiplication to O(N log N). The Convolution Theorem states that the cyclic convolution of x and y is equivalent to the following FFT operation:

(x ∗ y) = FFT^{−1}( FFT(x) · FFT(y) )    (13)

where · indicates a dyadic product. Indeed, this implementation of the Convolution Theorem describes steps 1 through 4 of the algorithm previously given. Using this strategy of FFT multiplication for large integers x and y, the Schönhage-Strassen algorithm is considered to have O(N log N) time complexity. An enhancement to the FFT multiplication algorithm defined above is the replacement of the classical divide-and-conquer FFT with a Z-Transform FFT [15]. The Z-Transform, used for discrete time signal analysis, is in general represented by the following power series:

F[z] = Σ_{k=0}^{∞} f[k] z^{−k}    (14)

where z is a complex number and f[k] is any function of k. For the purposes of large integer multiplication, f[k] is represented as a series of digits x[0], x[1], . . . , x[n]. An FFT can be implemented using the Z-Transform by applying a specific filter which produces a tree-based decomposition of the DFT. While the classical FFT also employs a tree decomposition, it requires complex multiplication, which is equivalent to two floating point multiplications when implemented on a computer. The filter applied in the Z-Transform, however, arranges multiplications such that the floating point multiplications are halved. Therefore the Z-Transform can implement FFT multiplication in O((N/2) log(N/2)), effectively halving the time required for each large integer multiplication.
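To make steps 1 through 6 concrete, the following sketch implements classical FFT multiplication via the Convolution Theorem (not the Z-Transform variant) for little-endian base-10 digit arrays. It is deliberately unoptimised; production codes use far larger digit bases and iterative, cache-aware FFTs:

#include <algorithm>
#include <cmath>
#include <complex>
#include <vector>
using cd = std::complex<double>;

// Recursive radix-2 FFT. invert selects the inverse transform (conjugated
// twiddle factors); the 1/D scaling is applied by the caller.
void fft(std::vector<cd>& a, bool invert) {
    const std::size_t D = a.size();
    if (D == 1) return;
    std::vector<cd> even(D / 2), odd(D / 2);
    for (std::size_t i = 0; i < D / 2; ++i) {
        even[i] = a[2 * i];
        odd[i] = a[2 * i + 1];
    }
    fft(even, invert);
    fft(odd, invert);
    const double ang = (invert ? -2 : 2) * std::acos(-1.0) / D;
    for (std::size_t k = 0; k < D / 2; ++k) {
        cd w = std::polar(1.0, ang * k) * odd[k];
        a[k] = even[k] + w;
        a[k + D / 2] = even[k] - w;
    }
}

// Steps 1-6: zero-pad to length >= 2D, forward FFTs, dyadic product,
// inverse FFT, round, then add-and-carry in base B = 10.
std::vector<int> fft_multiply(const std::vector<int>& x, const std::vector<int>& y) {
    std::size_t D = 1;
    while (D < 2 * std::max(x.size(), y.size())) D *= 2;
    std::vector<cd> X(x.begin(), x.end()), Y(y.begin(), y.end());
    X.resize(D);
    Y.resize(D);
    fft(X, false);
    fft(Y, false);
    for (std::size_t i = 0; i < D; ++i) X[i] *= Y[i];         // dyadic product
    fft(X, true);
    std::vector<int> z(D);
    long long carry = 0;
    for (std::size_t i = 0; i < D; ++i) {
        long long v = std::llround(X[i].real() / D) + carry;  // round
        z[i] = static_cast<int>(v % 10);                      // keep digit
        carry = v / 10;                                       // carry up
    }
    return z;  // little-endian digits of xy, possibly with trailing zeros
}

For example, fft_multiply({3, 2, 1}, {5, 4}), i.e. 123 × 45, returns the digits {5, 3, 5, 5, 0, ...}, i.e. 5535.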

2.1.4 GFN Primes & The Discrete Weighted Transform

It has been shown that when testing large prime candidates, the Fermat Test can be optimised with modular exponentiation and FFT multiplication. While the Fermat Test can be applied to all integers (not necessarily reserved to large integer representation), further optimisations become feasible when testing integers of a certain form. Generalised Fermat Numbers (GFNs) are integers which have the form:

F_n = b^{2^n} + 1    (15)

GFNs are a generalised form of the Fermat Numbers F_n = 2^{2^n} + 1, and require b ≥ 2 and n ≥ 1, where b and n are both integers. A further generalised form of GFN has also been defined [16, p. 102-103], having the form F_n = a^{2^n} + b^{2^n}. However, this form of GFN is irrelevant to the primality test under consideration in this project; it is stated simply for completeness, and further references to GFNs will refer to the form stated in (15). When considering a GFN in the Fermat Test, it is possible to halve the run time of a classical FFT multiplication with a Discrete Weighted Transform (DWT) [17]. The DWT itself is analogous to the DFT, however with the addition of a weight signal a[0], a[1], ..., a[D − 1]. The DWT and its inverse have the form:

X_k = Σ_{j=0}^{D−1} a_j x_j g^{−jk} ,   x_j = (1/(D a_j)) Σ_{k=0}^{D−1} X_k g^{jk}    (16)

The significant algorithmic speedup is due to a property of this weight signal which allows the neglect of zero-padding. With zero-padding neglected, the FFT length is therefore halved, as is run time. Multiplication via a DWT can be implemented in a similar fashion to classical FFT multiplication. Analogous to the cyclic convolution is the weighted cyclic convolution defined as:

(x ∗_a y) = (1/a) ((ax) ∗ (ay))    (17)

to which the Convolution Theorem can be applied. This leads to:

(x ∗_a y) = DWT^{−1}(D, a)[ DWT(D, a)x · DWT(D, a)y ]    (18)

It follows that a similar DWT multiplication algorithm can then be derived:

1. Define the digit representation of large integers x and y together with a weight signal a of length D.
2. Compute DWTs: X = DWT(D, a)x , Y = DWT(D, a)y
3. Compute dyadic product: Z = X · Y
4. Compute inverse DWT: z = DWT^{−1}(D, a)Z
5. Round digits of z
6. Add and carry digit adjustment.

As discussed, for a specific choice of weight signal a, the need for zero-padding can be alleviated. In the case of DWT multiplication modulo a GFN, large integers x and y may be represented in base W = b^{2^n/D} with chosen weight signal a_j = e^{−πij/D}. This leads to the following weighted cyclic convolution:

(x ∗_a y)_n = D^{−1} e^{πin/D} Σ_{k=0}^{D−1} X_k Y_k e^{2πikn/D}    (19)

Given that base W divides F_n − 1 = b^{2^n}, it is derived in [17] that the product xy is equivalent to the negacyclic convolution (x • y) mod F_n, which in turn is equivalent to the weighted cyclic convolution described above, where the negacyclic convolution is defined as:

(x • y)_n = Σ_{i+j=n} x_i y_j − Σ_{i+j=D+n} x_i y_j    (20)

Hence:

xy = Σ_{k=0}^{D−1} (x • y)_k W^k mod F_n    (21)

It follows that with this specific weighted cyclic convolution, zero-padding is not required, thus halving the run time. Also, given the properties of the negacyclic convolution above, the modulo operation is in a sense “free”, as the dyadic product in base W computes this operation as a consequence. Another case of optimised FFT multiplication exists for large integer representations of non-uniform base. Used generally for the FFT multiplication of Mersenne numbers p = 2^q − 1 (though possibly generalised to p = k·b^q + c), such a representation is based on a generalised binary representation:

x = Σ_{j=0}^{D−1} x_j 2^{qj/D}    (22)

For such large integers of generalised base, the product in terms of the cyclic convolution is:

xy = Σ_{n=0}^{D−1} (x ∗ y)_n 2^{qn/D} (mod p)    (23)

To satisfy practicalities of digit representation in hardware, the above representation is computed in practice as:

x = Σ_{j=0}^{D−1} x_j 2^{⌈qj/D⌉}    (24)

It is then derived that the Irrational Base Discrete Weighted Transform (IBDWT) adopts the following weight signal and weighted cyclic convolution:

a_j = 2^{⌈qj/D⌉ − qj/D}    (25)

xy = Σ_{n=0}^{D−1} (x ∗_a y)_n 2^{⌈qn/D⌉} (mod p)    (26)

where the digits of the weighted cyclic convolution (x ∗_a y) become the digits of xy (mod p) in the irrational base. It follows that FFT multiplication implementing the IBDWT also requires no zero-padding, and is therefore faster than classical FFT multiplication for Mersenne numbers.
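The zero-padding-free property can be checked numerically. The sketch below (an illustration only, with both convolutions evaluated directly in O(D^2) rather than via FFTs) verifies for a small D that the weighted cyclic convolution with weight signal a_j = e^{−πij/D} reproduces the negacyclic convolution of (20):

#include <cmath>
#include <complex>
#include <cstdio>
#include <vector>
using cd = std::complex<double>;

int main() {
    const double PI = std::acos(-1.0);
    std::vector<double> x = {1, 2, 3, 4}, y = {5, 6, 7, 8};  // D = 4, no padding
    const std::size_t D = x.size();

    // Negacyclic convolution directly from its definition (eq. 20).
    std::vector<double> nega(D, 0.0);
    for (std::size_t i = 0; i < D; ++i)
        for (std::size_t j = 0; j < D; ++j) {
            if (i + j < D) nega[i + j] += x[i] * y[j];
            else           nega[i + j - D] -= x[i] * y[j];
        }

    // Weighted cyclic convolution (eq. 17): weight, convolve cyclically, unweight.
    std::vector<cd> wx(D), wy(D), wc(D);
    for (std::size_t j = 0; j < D; ++j) {
        cd a = std::polar(1.0, -PI * j / D);  // a_j = e^{-pi i j / D}
        wx[j] = a * x[j];
        wy[j] = a * y[j];
    }
    for (std::size_t i = 0; i < D; ++i)
        for (std::size_t j = 0; j < D; ++j)
            wc[(i + j) % D] += wx[i] * wy[j];

    for (std::size_t n = 0; n < D; ++n) {
        cd v = wc[n] * std::polar(1.0, PI * n / D);  // divide by a_n
        std::printf("n=%zu negacyclic=%6.1f weighted=%6.1f\n",
                    n, nega[n], v.real());           // imaginary parts are ~0
    }
}

In a real DWT implementation the middle step is of course performed with length-D FFTs, which is precisely where the factor-of-two saving over a zero-padded length-2D transform arises.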

2.2 Applications & Hardware

The implementations of the Fermat Test in the applications Genefer [8], LLR [9], and OpenPFGW (PFGW) [10] will be used. The question posed is: using these three applications, how fast is it possible to test GFNs for primality? Details of the inner workings of the applications are presented in order to give context to the subsequent performance of each.

2.2.1 LLR

LLR is capable of many other primality tests in addition to the Fermat Test. Written in the C programming language, LLR's code in general creates the structure of each primality test algorithm while relying on an external library for large integer arithmetic. In the case of the Fermat Test, LLR creates a modular exponentiation loop using a left-to-right binary ladder and then makes subsequent calls to the gwnum library [18]. The gwnum library implements highly optimised IBDWT arithmetic routines particularly tuned for arithmetic modulo numbers of the form k·b^n + c. While gwnum has a C interface, its implementation of this highly optimised arithmetic is written in Assembly. The use of this library can be thought of as a group of functions made available to operate on a gwnum type (in reality a gwnum is a pointer to FFT data). Functions of interest for implementing the Fermat Test are namely gwsquare(gwnum x) and gwsetnormroutine().
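The resulting Fermat Test loop in LLR therefore has roughly the shape sketched below. This is schematic only: gwsquare and gwsetnormroutine are the functions named above, but every identifier marked hypothetical is an invented stand-in and does not reproduce the actual gwnum interface.

/* Schematic sketch of LLR's Fermat Test loop; not real gwnum code. */
gw_setup(k, b, n, c);               /* hypothetical: fix modulus k*b^n + c  */
gwnum z = gw_alloc();               /* hypothetical: allocate FFT data      */
gw_set_small(z, a);                 /* hypothetical: z = Fermat base a      */
gwsetnormroutine();                 /* named in text: select normalisation  */
for (int i = D - 2; i >= 0; --i) {  /* left-to-right ladder over exponent   */
    gwsquare(z);                    /* named in text: z = z^2 mod k*b^n + c */
    if (exponent_bit(i))            /* hypothetical: test bit i of exponent */
        gw_mul_small(z, a);         /* hypothetical: multiply by the base   */
}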

2.2.2 OpenPFGW

PFGW also implements multiple primality tests and is written in C++. Its Fermat Test is similar to LLR's, since it uses the same gwnum library. However, for modular exponentiations of fewer than 1000 iterations, it uses the GMP library [19] rather than gwnum. As a result of this somewhat dynamic use of libraries, PFGW implements class wrappers for both. The C++ classes Integer and GWInteger provide wrappers for GMP's mpz_t type and the gwnum type, respectively. These wrappers make use of C++ friend functions to conform to a common interface and to enable conversion between the types.

2.2.3 Genefer

In contrast to LLR and PFGW, Genefer's sole purpose is to implement the Fermat Test for GFNs. Genefer does not require the use of external libraries for computation, with the exception of BOINC to enable the distributed computing properties previously discussed. Instead it contains original C++ code which uses the Z-Transform to compute FFTs rather than the IBDWT. CPU as well as GPU versions of the Z-Transform exist in Genefer, specifically versions which make use of CUDA and OpenCL for GPU computation. The GPU code will not be assessed in this project. Within the CPU code are different implementations of the Z-Transform for specific instruction sets: AVX, FMA, SSE2, SSE4, and x87, as well as a default plain C++ version. Depending on hardware support, one of the available vector instruction sets is dynamically selected, and in certain cases the non-vectorised x87 implementation is selected for extended precision in order to reduce error due to the FFT.

2.2.4 Hardware

These applications will be built and run on the ARCHER supercomputer [20]. The applications are serial in implementation, since they are task based and parallelised through distributed computing, and hence require only a single ARCHER compute node. The primary reason for using ARCHER, though, is the availability of the Cray Performance Analysis Tools software, which is discussed in Section 2.3.4. An ARCHER compute node contains two 12-core Intel Xeon E5-2697 v2 (Ivy Bridge) processors [21]. Given that the applications are serial, performance will be assessed using only a single processor of a compute node. Each processor core supports AVX SIMD instructions with a vector length of 256 bits, capable of 4 double precision floating point operations per cycle. With an AVX add and an AVX multiply Floating Point Unit and a clock speed of 2.7 GHz, the peak floating point performance is therefore:

2.7 GHz × 4 AVX operations/cycle × 2 FPUs = 21.6 GFlops/s

Also of interest on a single processor is the shared L3 cache of size 20480 KB, along with a 256 KB L2 cache and 32 KB L1 cache per core. These cache sizes will come into effect when considering FFT length. Some ambiguity is associated with the notion of Flops. For the purposes of this project, hereafter, Flops will refer to the plural of “floating point operation” and Flops/s will refer to floating point operations per second.

12 2.3 Performance Benchmarking

2.3.1 Application Speed

The most compelling measurement of performance for each application is simply the speed of the primality test itself. Given an arbitrary GFN, how much time does each application require to determine the primality of the GFN in question? Generally, the total time of a primality test scales with the size of the GFN, regardless of the application. This is due to the increasing number of iterations (and hence time) required in modular exponentiation, as discussed in Section 2.1. Therefore the total time per primality test is not a meaningful measurement of performance, as it is more a property of the GFN than of the application under consideration. It is also highly impractical, as GFNs of a high order (e.g. n = 19 and above) require several hours (or days) of computation. To provide an inter-application speed comparison, the time per iteration will be measured for common GFNs and compared on an application-by-application basis. The time per iteration provides a much more meaningful measurement of performance, as it is a direct indicator of the FFT performance in each application. As discussed in Section 2.1.2, each iteration of the Fermat Test's exponentiation requires an FFT multiplication operation. The time per iteration will reveal the speed of each FFT multiplication, and hence any performance advantages or disadvantages of the Z-Transform and IBDWT, as well as the speed of the applications as a whole.

2.3.2 The Roofline Model

In addition to an assessment of application speed, the Roofline Model [11] will be used to observe each application's use of hardware capabilities. This will determine in which cases each application makes most efficient use of peak machine performance. The Roofline Model is based upon the assumption that off-chip memory traffic and floating point computation are the limiting constraints of application performance. One could find this assumption plausible by comparing the effects of Moore's Law with the relative increase in DRAM bandwidth in today's hardware. To associate application performance with memory traffic, the Roofline Model defines a term, operational intensity, equivalent to the number of Flops per byte of DRAM traffic. One detail of note is that operational intensity is concerned only with traffic from main memory to cache, and not cache to processor. This is to enforce that the Roofline Model should contain results pertinent to the programmer of the application and not to the implementation of cache protocols. In particular, a programmer has little capacity to affect cache-to-processor bandwidth, whereas memory access patterns within the programmer's application garner control of traffic into cache. The Roofline Model itself is a two-dimensional plot of the highest achievable performance for given hardware. Divided into two regions, the roofline curve shows the attainable performance of an application which is limited either by floating point performance or by memory bandwidth.

Figure 1: Theoretical Roofline Model displaying various performance regions in log-log scale.

Shown in Figure 1 is the roofline of a theoretical system with peak floating point performance of 48 GFlops/s and peak memory bandwidth of 10 GB/s. Consider first the bold region labelled by “peak floating point performance” and “peak memory bandwidth”. The horizontal part of the roofline indicates the limit of hardware floating point performance, which applications executed on such hardware cannot physically surpass. The sloped part indicates a limit in bandwidth, where the processor is capable of faster computation than memory traffic can supply. This can be justified by noting the units of operational intensity and attainable performance. Given operational intensity in terms of GFlops/GByte and attainable performance in terms of GFlops/s, it follows that bandwidth (GB/s) is the unit of slope:

Rate of Change = (GFlops/s) / (GFlops/GB) = GB/s    (27)

Observing that peak bandwidth is equivalent to the rate of change of the Roofline Model, the attainable performance is therefore defined as (where OI indicates operational intensity):

Attainable GFlops/s = min( peak GFlops/s , peak memory bandwidth × OI )    (28)

Thus for measured bandwidth and floating point performance of a given application, the Roofline Model demonstrates a limiting factor of either floating point performance or memory bandwidth.
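As a small worked sketch of (28), using the theoretical machine of Figure 1 (the ridge point at OI = 48/10 = 4.8 Flops/byte separates the two regimes):

#include <algorithm>
#include <cstdio>

// Attainable performance per eq. (28): the lower of the compute roof
// and the bandwidth roof at a given operational intensity (Flops/byte).
double attainable(double peak_gflops, double bw_gbs, double oi) {
    return std::min(peak_gflops, bw_gbs * oi);
}

int main() {
    for (double oi : {1.0, 4.8, 16.0})
        std::printf("OI = %4.1f -> %4.1f GFlops/s\n",
                    oi, attainable(48.0, 10.0, oi));
    // OI below 4.8 is bandwidth-bound (10 GFlops/s at OI = 1.0);
    // above it the 48 GFlops/s compute roof applies.
}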

Certain instances exist where a Roofline Model assembled solely from peak floating point performance and bandwidth would not justifiably represent the performance limitations of a given application. For instance, as previously discussed, Genefer uses a collection of vectorised instruction sets as well as non-vectorised x87 extended precision. A Roofline comparison of x87 performance against vectorised AVX, for example, would not be justified. The Roofline Model can be extended to compensate for such complexities. Shown in Figure 1 are two additional rooflines: one labelled “non-vectorised performance limit”, whose sloping bandwidth is shared with the peak memory bandwidth, and another labelled “prefetching disabled”, with floating point performance equivalent to that of the machine's peak. The non-vectorised limit demonstrates that an application implementing non-vectorised code will naturally exhibit a lesser potential for attainable Flops/s. In the theoretical case shown in Figure 1, peak floating point performance was assumed to derive from four vector operations per cycle, therefore leaving a non-vectorised roofline of one quarter of peak performance (appearing as half in a log scale). An example of a memory bandwidth limitation was taken to be the lack of cache prefetching. With this limitation, bandwidth is reduced, causing a lesser slope in the roofline; hence this is shown as a right shift in a log-log scale. Using the Roofline Model as described above, the operational intensity and Flops/s of each primality test application will be shown as a scatter plot under the roofline of an ARCHER processor. This will reveal the performance of the applications in question relative to the capabilities of the Intel Xeon E5-2697 v2 processor.

2.3.3 STREAM

The popular bandwidth benchmark STREAM [22] will be used to find the peak bandwidth of an ARCHER processor, which is required by the Roofline Model. STREAM tests the speed of four different kernels in a loop over a predefined number of elements, as shown in Table 1. The kernels copy, scale, add, and triad are tested in separate loops; considering the bytes/iteration and Flops/iteration together with elapsed time, it is possible to determine the bandwidth performance and attained Flops/s.

Name    Kernel                 Bytes/iter  Flops/iter
copy    a(i) = b(i)            16          0
scale   a(i) = q*b(i)          16          1
add     a(i) = b(i) + c(i)     24          1
triad   a(i) = b(i) + q*c(i)   24          2

Table 1: Stream Kernels.

STREAM was developed specifically for the purpose of measuring a sustainable bandwidth effectively equivalent to the machine's peak. Therefore the results from the STREAM benchmark will be considered as peak bandwidth when benchmarking the primality test applications. However, the attained Flops/s determined by STREAM will not be used in a comparison of the applications; rather, the processor specifications will be considered as peak floating point performance.
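A minimal triad-style measurement has the shape below (a sketch only: STREAM itself repeats each kernel, 10 times by default, reports the best rate, and takes care to defeat compiler elision; compile with OpenMP enabled):

#include <cstddef>
#include <cstdio>
#include <vector>
#include <omp.h>

int main() {
    const std::size_t N = 10485760;  // 4x the 20480 KB L3 cache, in doubles
    std::vector<double> a(N), b(N, 1.0), c(N, 2.0);
    const double q = 3.0;
    double t0 = omp_get_wtime();
    #pragma omp parallel for
    for (std::size_t i = 0; i < N; ++i)
        a[i] = b[i] + q * c[i];      // triad: 24 bytes moved, 2 Flops
    double t = omp_get_wtime() - t0;
    std::printf("triad: %.1f MB/s\n", 24.0 * N / t / 1e6);
    return (int)a[N - 1];            // use the result so the loop survives
}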

2.3.4 Cray Performance Measurement and Analysis Tools

To determine where on the Roofline Model the applications under consideration lie requires a measurement of their Flops/s and memory bandwidth. Measurements of these will be made using the Cray Performance Measurement and Analysis Tools (CrayPat) [23]. CrayPat enables the analysis of performance data through available hardware counters. By instrumenting an executable of an application under consideration, the data from performance counters can be revealed through certain environment variables. Specifically, when instrumenting with CrayPat on ARCHER, the Cray compiler wrapper is used (with the appropriate programming environment selected) to compile the source code of an application. Following this, the executable is then instrumented using pat_build. This is completed by the following commands:

module load perftools
cc source.c -o exe
pat_build exe -o exe+pat

Of note is that the -o exe+pat option is not required; it is simply shown in order to demonstrate that a new executable exe+pat is created after instrumentation. CrayPat also requires the perftools module to be loaded prior to the compilation stage, and in some cases requires the separation of compilation and linking:

cc -c source.c -o source.o
cc source.o -o exe

With the instrumented executable created, performance data can be gathered from the application by running the executable exe+pat. This data is then analysed, once the executable has terminated successfully, using the output data files produced by CrayPat in the form .xf. The .xf files are identified with the executable's name, along with the processor id and node number which the executable was run on, and are viewed using pat_report:

pat_report exe+pat+{PID}-{node}t.xf

By default, the progression described above performs what is known as an Automatic Profiling Analysis (APA) experiment. This experiment, which will be referred to as a sampling experiment, collects performance data at specific time intervals throughout the execution of the application. This experiment is favourable as it results in minimum overhead when profiling an application. Upon viewing performance data with pat_report, the sampled data is shown in text. This consists of an assessment of the number of samples per function in the application, and indicates sources of overhead in addition to the metrics from relevant hardware counters. Two additional files are created: a .ap2 file, which contains portable data for visualisation of performance, and a .apa file. The .apa file is an assessment of the Automatic Profiling Analysis sampling experiment. It contains pat_build arguments that CrayPat suggests for further analysis of the performance of the application. Such suggestions appear in plain text and are editable by the user. With these pat_build arguments it is possible to re-instrument the executable for further performance analysis. This is achieved by:

pat_build -O exe+pat+[PID]-[node]t.apa -o exe+apa

The resulting executable exe+apa, when executed, now produces a tracing experiment. A tracing experiment, rather than collecting data at specific time intervals, accumulates performance data within specified functions of the application. Hence these functions are “traced”. This yields somewhat more accurate results, however it produces a non-negligible overhead in some cases. The APA produces a tracing experiment suggested by CrayPat. It is therefore usually referred to if the performance behaviour of an application is unknown. It is possible, however, to define a tracing experiment exclusive to the user's intentions, and hence not rely on the APA. Specific functions within an application can be traced using the following pat_build syntax:

pat_build -w -T func1,func2 exe

Where func1 and func2 are two functions that exist within the application which the user wishes to trace. Another means of creating an exclusive tracing experiment is the use of the CrayPat API. This provides tracing capabilities beyond the function level. Using specific function calls in the API, a user can create trace regions within source code which is then compiled and hence instrumented. In an effort to maintain source code portability, each API call must however be made conditional upon the perftools module at compile time:

#ifdef CRAYPAT
#endif

where the compiler macro CRAYPAT is created following the load of the perftools module. Within source code, the API header file pat_api.h must be included in such a conditional directive, as well as the following API function calls to create and end a trace region:

PAT_region_begin(int id, const char *label);
PAT_region_end(int id);
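Put together, a traced region might look as follows (a sketch; compute_iteration and the region label are illustrative placeholders):

#ifdef CRAYPAT
#include "pat_api.h"   /* available once the perftools module is loaded */
#endif

void compute_iteration(void);  /* illustrative placeholder */

void run(int iters) {
    for (int i = 0; i < iters; ++i) {
#ifdef CRAYPAT
        PAT_region_begin(1, "fermat_iteration");  /* id must be unique */
#endif
        compute_iteration();
#ifdef CRAYPAT
        PAT_region_end(1);                        /* same id closes the region */
#endif
    }
}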

Additional CrayPat API functions exist for more detailed tracing experiments, but will not be discussed. As mentioned, the performance data made available for observation by CrayPat is accessible through environment variables. For the purposes of the Roofline Model, data from specific hardware counters is of interest and is selected using the environment variable PAT_RT_PERFCTR. This variable can be set to an integer, which specifies a hardware counter group, or to a comma separated list of specific hardware counter names. Given that CrayPat will in this project be used on an ARCHER compute node, the hardware counter groups of interest pertain to the groups available on Intel Ivy Bridge processors. These are found via the man page of hwpc:

Intel Ivy Bridge Event Sets
Group  Description
0      D1 with instruction counts
1      Summary with floating-point and cache metrics (default)
2      D1, D2, and L3 metrics
6      Micro-op queue stalls
7      Back-end stalls
8      Instructions and branches
9      Instruction cache
10     Cache hierarchy
19     Prefetches
23     Summary with floating-point and cache metrics (same as 1)

Table 2: HWPC Groups for Intel Ivy Bridge.

The selection of a group described above will produce data from the individual hardware counters within each respective group. As mentioned, individual hardware counters can also be monitored rather than complete groups. These are found via the Performance API commands papi_avail and papi_native_avail. However, as these individual hardware counters pertain very specifically to the processor under consideration, on a system like ARCHER these commands must be executed on the “back end”. This is possible by submitting a script to an ARCHER compute node similar to the following:

#!/bin/bash
module load perftools
papi_avail
papi_native_avail

With these tools, it is possible to monitor to a varying degree the Flops/s and main memory bandwidth of a given application. The applications Genefer, LLR, and PFGW will subsequently be instrumented in a similar fashion using CrayPat to develop a Roofline Model.

Chapter 3: Performance Analysis

3.1 Application Speed

The application speed is measured by the time required per iteration of the Fermat Test. Each iteration has been shown to require FFT multiplication, hence this timing is an indication of the FFT performance within each application. In the discussion of FFT multiplication, emphasis was placed on reducing the number of operations required to compute the FFT. This was achieved by eliminating zero-padding with an IBDWT and by applying specific filters with a Z-Transform, both halving the number of operations required. However, the number of operations required for each FFT strategy is not necessarily equivalent. Genefer performs the Z-Transform with an array length based upon the given GFN. For a GFN b^{2^n} + 1, Genefer uses a large integer representation of base b and length m = 2^n. Therefore the iteration rate of Genefer is largely dependent on the length m, but time will also increase with b for reasons of numerical precision. The FFT lengths applied in LLR and PFGW are both determined internally by the gwnum library. The relationship between a GFN b^{2^n} + 1 and FFT length is non-trivial; however, length, as well as time, will consistently rise with increasing b and n to retain an accurate FFT computation. The gwnum library will determine an FFT length based on both the supplied GFN and considerations for accuracy; however, the FFT itself is also changed in certain cases. The IBDWT used in gwnum is implemented either as an all-complex FFT or as a generic modular reduction, the details of which concern complex number multiplication and are internal to gwnum. The generic reduction is used in cases where components of the GFN are too large for an internal threshold of gwnum. Specifically, when k × a or c × a are too large, where a is the Fermat base of a^{p−1} ≡ 1 (mod p), and k and c correspond to gwnum's generalised integer representation k·b^n + c, which is made to represent given GFNs. In a similar fashion to gwnum, Genefer will also change its implementation of the Z-Transform to retain FFT accuracy. As discussed, Genefer employs various vectorised instruction sets depending on hardware capabilities, and in cases where greater precision is required uses x87 extended precision instructions. The point at which Genefer uses x87 rather than vector instructions is also non-trivial: it is indeed possible to change intermittently between vector instructions and x87 while performing a single Fermat Test. The instruction set used depends only on accumulated rounding errors in the FFT. To demonstrate the effects on application speed described above, Genefer, LLR and PFGW will be tested for GFNs of n = 13, 16, 19. This range is chosen with some consideration of the total execution time of the Fermat Test. GFNs of n = 13 are in general computed in reasonable time (less than 1 minute), n = 16 requires several minutes, and n = 19 a sufficiently long time of several hours. This range of total execution times will demonstrate the necessary performance factors of each application.

Since n is the decisive factor in the time per iteration, application performance will be compared while varying b for each separate value of n. The applications are modified such that they compute only several minutes of the Fermat Test, rather than running to completion. This is achieved by breaking out of the modular exponentiation loop once each application's inherent measurement of time surpasses a predetermined number of seconds; relatively small GFNs are still tested to completion should they finish in less than the “cut-off” threshold. Each application is supplied a list of GFNs to compute in an input file, such that they are all computed sequentially in a single batch job on ARCHER. Once the cut-off time for each GFN has been reached, the applications are further modified to output the number of completed iterations as well as the total elapsed time at the point of the cut-off. The applications, once broken out of the exponentiation loop, then continue on to the next GFN specified in the input file. The time per iteration for the Fermat Test of n = 13 GFNs is as follows:

Figure 2: Time per iteration of Fermat Test with n = 13, and b in the range 10^5 ≤ b ≤ 3 × 10^7.

From Figure 2, it is clear where each application implements a change in FFT type, by the sharp rise in time. The steep rise in Genefer's time per iteration is, as discussed, the change from AVX instructions to x87 extended precision. The x87 range of Genefer's performance shows a time per iteration a factor of 10 slower than the AVX range (AVX: ∼ 3 × 10^−5 s, x87: ∼ 3 × 10^−4 s). However, this slow-down is necessary for the accuracy of the computation and cannot be avoided. Genefer's performance is otherwise fairly stable for all values of b, which shows that the speed of Genefer's Fermat Test implementation is independent of b. The increases in time shown for LLR and PFGW are, as discussed, a change from an all-complex FFT to generic reduction. PFGW shows a change to generic reduction at much smaller values of b than LLR, however. This is perhaps a consequence of PFGW's class wrapping of the gwnum library, where GFNs under consideration are not directly interfaced with the library, causing gwnum to make sub-optimal decisions. For large values of b, though, PFGW does perform as expected, showing a gradual rise in time with an increase in FFT length. Specifically, the gradual increments in time per iteration correspond to the following FFT lengths used in PFGW for the generic reduction (quoted only for values of b used in the benchmark, not FFT length in general):

b range (millions)   FFT length
1 ≤ b ≤ 2            16K
3 ≤ b ≤ 12           18K
12 ≤ b ≤ 30          20K

Table 3: PFGW FFT lengths in terms of b for n = 13.

For reasons unknown, LLR shows consistently poor performance in the generic reduction range. The FFT length rises in a similar fashion to PFGW, therefore the same effects on time per iteration should have been seen. However, LLR's times are much higher than PFGW's and much more volatile. This is possibly due to the fact that Fermat Tests for GFNs of n = 13 are simply too fast, and therefore time measurements become unreliable. The performance of LLR in the generic reduction range will be investigated with higher values of n to determine if these measurements are unreliable. Figure 2 covers a range of b dominated by x87 and generic reduction. Prior to investigating higher values of n, a smaller range of b for n = 13 is shown in Figure 3. This range demonstrates the performance of the vectorised AVX instructions as well as the fast all-complex FFT.

Figure 3: Fermat Test of n = 13 with b in the range 10^4 ≤ b ≤ 10^6.

It is clear in Figure 3 that PFGW under-utilises the gwnum library, as LLR exclusively uses the all-complex FFT for the range 10^4 ≤ b ≤ 10^6 while PFGW uses the generic reduction. The performance of Genefer with AVX instructions is also similar to that of LLR in this range. This suggests that the generic reduction measurements of LLR in the higher range of b may be unreliable, since LLR is otherwise able to match the performance of Genefer.

Figure 4: Fermat Test of n = 16 with b in the range 10^5 ≤ b ≤ 10^7.

Figure 5: Fermat Test of n = 19 with b in the range 10^5 ≤ b ≤ 10^7.

The performance of Genefer, LLR, and PFGW for n = 16, 19 shows a similar trend to n = 13 (see Figures 4 and 5). Since the time per iteration for n = 13 was fairly consistent up to 3 × 10^7, larger values of n are only plotted to 10^7.

Noticeably, LLR is well-behaved for these values of n, showing almost identical performance to PFGW and Genefer in the generic reduction range of b. The volatility associated with n = 13 is thus attributed to the small total run time of n = 13 Fermat Tests. Analysis of the Roofline performance of LLR will attempt to prove this conclusively. PFGW, again, shows the use of generic reduction for small b in both n = 16 and n = 19, where the faster all-complex FFT is used in LLR. Modification of PFGW to ensure it uses gwnum to the same extent as LLR could show significant performance improvement for small b. The performance of Genefer remains predictable for n = 16 and n = 19, showing an increase in time due only to x87 instructions at b ∼ 2 × 10^6 for both n = 16, 19. In both the AVX and x87 ranges, the time per iteration is also extremely stable, again showing that performance is independent of b and that any small variations shown for n = 13 are simply due to the generally fast total run time. LLR and PFGW show good agreement for n = 16 in the generic reduction range, showing that the FFT length increase is in fact responsible for the slow-down in speed. For n = 19, however, the increase in time is not as predictable. This is due not only to increasing FFT length, but possibly also to hardware effects such as an increased number of cache misses for this size of FFT. Effects such as this will be investigated in Section 3.3. It is concluded, however, that Genefer is the best performing application in terms of time per iteration, as it is in most cases faster and much more stable than both LLR and PFGW.

3.2 Peak Bandwidth & Flops/s

To assess the performance of the applications with respect to the hardware capabilities, the peak memory bandwidth and Flops/s are subsequently derived. As discussed, the peak floating point performance of a single core within an ARCHER compute node is 21.6 GFlops/s. This performance is attributed to AVX vectorised instructions as well as the capabilities of multiple FPUs, and will represent the peak performance of Genefer's AVX Z-Transform as well as gwnum's IBDWT. However, as x87 instructions in Genefer are not vectorised, they must be held to a separate performance standard. In addition to the lack of vectorisation, x87 instructions are also incapable of using multiple FPUs within a single CPU cycle; peak x87 performance is therefore limited to 1 Flop per cycle. This amounts to 2.7 GFlops/s on an ARCHER compute node, given the clock speed of 2.7 GHz. As discussed, the STREAM benchmark is used to measure peak memory bandwidth. Specifically, the bandwidth of the triad kernel will be used; however, the others will also be investigated. Since it is the most Flop intensive kernel, requiring both a floating point add and multiply per iteration, triad is the most suitable kernel to describe the capabilities of a processor within an ARCHER compute node.

The Roofline Model is concerned with bandwidth from main memory into L3 cache. Since this level of cache is shared among the 12 cores in ARCHER's Intel E5 processor, multithreading effects will be accounted for when running the STREAM benchmark. STREAM can enable OpenMP threads to measure peak bandwidth while employing all cores of a processor. For the purposes of this Roofline Model, 12 OpenMP threads will be used to quote the maximum bandwidth of a single processor. The full 24 cores of an ARCHER compute node are not used, as this would incorporate bandwidth into two separate L3 caches as well as include possible latency due to the QPI interconnect. For a sustainable memory bandwidth measurement, STREAM recommends employing an array size of 4 times the size of shared cache. In this case, the shared L3 cache on ARCHER is 20480 KB. Given that STREAM will use arrays of double precision floats (8 bytes per double), the number of required array elements is therefore:

4 × (20480 KB × 1024 bytes/KB) / (8 bytes/element) = 10,485,760 elements    (29)

The purpose of this recommendation is to eliminate cache effects on the measured bandwidth. As the purpose of this benchmark is to measure bandwidth from main memory into L3 cache, any caching of array elements would speed up the kernel calculation and hence measure some sort of inter-cache bandwidth. Therefore, to ensure all array elements are written to and from main memory, an array size of 4 times L3 cache is used. To illustrate these cache effects on the STREAM benchmark, bandwidth as a function of array size using 12 OpenMP threads is shown in Figure 6 for the kernels scale, add, and triad.

Figure 6: Measured STREAM bandwidth in terms of L3 cache size.

Array size in Figure 6 is shown in units of MiB (1024^2 bytes) to associate it easily with specific multiples of L3 cache. Bandwidth is subsequently quoted in units of GiB/s, which requires the unit conversion GiB/s = (1000^2 / 1024^3) MB/s, since STREAM quotes bandwidth in MB/s. A clear peak in bandwidth is observed for array sizes smaller than L3 cache. This is followed by a series of consistent bandwidth measurements for arrays larger than cache, showing that cache effects do indeed skew STREAM bandwidth. Though measured bandwidth becomes fairly stable at an array size roughly equivalent to L3 cache, the STREAM recommendation will still be employed when quoting peak memory bandwidth. For the recommended array size of 10,485,760 elements using 12 OpenMP threads, the STREAM benchmark shows the following peak main memory bandwidth:

Kernel  Best Rate MB/s  Avg time  Min time  Max time
copy    32519.4         0.005174  0.005159  0.005193
scale   31619.3         0.005325  0.005306  0.005396
add     36194.2         0.006968  0.006953  0.006981
triad   36846.1         0.006837  0.006830  0.006848

Table 4: STREAM Results for recommended array size.

As STREAM runs each kernel multiple times, the minimum and maximum times required to compute each kernel are quoted under Min time and Max time. By default STREAM repeats each kernel 10 times, with Avg time corresponding to the average time required for each kernel. The bandwidth measurement is that of total memory traffic divided by the minimum time. The triad kernel shows a measured bandwidth of 36,846.1 MB/s. Using the conversion to GiB/s, the peak main memory bandwidth for a single processor of an ARCHER compute node is 34.3 GiB/s.
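Written out explicitly, the conversion is:

\[
36{,}846.1\ \mathrm{MB/s} \times \frac{10^6\ \mathrm{bytes/MB}}{2^{30}\ \mathrm{bytes/GiB}} \approx 34.3\ \mathrm{GiB/s}
\]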

Figure 7: STREAM Bandwidth using 24 OpenMP threads.

To give context to the measured peak memory bandwidth of 34.3 GiB/s, STREAM results for an entire compute node are shown in Figure 7. Since both processors of an ARCHER compute node are used in this STREAM benchmark of 24 threads, the effective cache size is doubled. The benchmark shows a similar trend to the single processor results, becoming stable at roughly the size of cache. The measured bandwidth for the STREAM recommended 4 times L3 cache (20,971,520 elements, 160 MiB) is 69.1 GiB/s. This result is approximately double that of the 12 thread bandwidth, as one would expect. The single processor peak memory bandwidth of 34.3 GiB/s is therefore considered credible.

3.3 Roofline Performance

With a peak floating point performance of 21.6 GFlops/s (2.7 GFlops/s non-vectorised x87 limit) and a peak main memory bandwidth of 34.3 GiB/s, the Roofline performance of Genefer, LLR, and PFGW can be assessed. These applications will be instrumented with CrayPat, and subsequent measurements of Flops/s and memory bandwidth are made using a combination of individual hardware counters and HWPC groups. Floating point performance measurements are made trivial with CrayPat, as MFlops/s is produced in the default HWPC group (group 1: Summary with floating-point and cache metrics). The number of MFlops/s is derived in this group from the sum of the individual floating point operation counters, specifically the FP_COMP_OPS_EXE counters and their SIMD equivalents. The sum of floating point counters is then divided by total time and hence quoted as aggregate MFlops/s.

Measuring main memory bandwidth is slightly more involved. A direct hardware counter for memory bandwidth does not exist; however, it can be derived manually using the OFFCORE_REQUESTS:ALL_DATA_RD counter. This counter is described as "demand and prefetch read requests sent to uncore", which is simply a measurement of all main memory traffic due to cache misses and prefetched data. The absolute number of offcore requests corresponds to the number of cache line loads, therefore total memory traffic is derived by multiplying by 64 bytes (the size of a cache line). Bandwidth is hence derived by dividing by the total elapsed time and converting to units of GiB/s by a division of 1024³.

To assess the validity of this method for deriving bandwidth, the STREAM benchmark is instrumented with CrayPat and the offcore requests while running STREAM are measured. Should the bandwidth derived from the offcore requests agree with results from STREAM, this method of measuring main memory bandwidth is considered valid.
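To make the derivation concrete, the following sketch applies the arithmetic described above. The counter value passed in main() is an illustrative placeholder (the Flop count and time loosely echo the Genefer figures in Appendix B), and the doubling of read requests introduced later in this section is not applied:

#include <algorithm>
#include <cstdio>

// One Roofline data point from CrayPat measurements (placeholder inputs)
struct RooflinePoint {
    double gib_per_s;  // derived main memory bandwidth
    double gflops;     // achieved floating point rate
    double intensity;  // operational intensity, Flops per byte
    double roof;       // attainable ceiling at this intensity
};

RooflinePoint derive(double offcore_reads, double flops, double seconds) {
    const double bytes = offcore_reads * 64.0;  // 64-byte cache lines
    RooflinePoint p;
    p.gib_per_s = bytes / seconds / (1024.0 * 1024.0 * 1024.0);
    p.gflops    = flops / seconds / 1e9;
    p.intensity = flops / bytes;
    // Roofline: performance is capped by the lesser of the compute peak and
    // the memory peak scaled by intensity (GiB and GB treated as equal here)
    p.roof = std::min(21.6, p.intensity * 34.3);
    return p;
}

int main() {
    RooflinePoint p = derive(1.7e10, 4.4e12, 306.5);  // placeholder counters
    printf("%.2f GiB/s, %.2f GFlops/s, OI %.2f, roof %.2f GFlops/s\n",
           p.gib_per_s, p.gflops, p.intensity, p.roof);
    return 0;
}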

Kernel   Best Rate MB/s   Avg time    Min time    Max time
copy     13170.0          0.012773    0.012739    0.012875
scale    13386.2          0.012548    0.012533    0.012570
add      13021.1          0.019349    0.019327    0.019371
triad    13162.1          0.019174    0.019120    0.019278

Table 5: Single core STREAM Results.

Table 5 shows the STREAM bandwidth for a single core, with a triad bandwidth of approximately 13,000 MB/s. It is this bandwidth figure that the offcore counter will attempt to replicate. A single core is used for the comparison to STREAM, as the offcore counter did not produce reliable results when multiple cores were used. When instrumenting STREAM, the specific functions for the add and triad kernels were traced to demonstrate the offcore requests, and hence the bandwidth, of each. The source code of STREAM by default runs all kernels within a single function, but can enable the separation of kernels into functions for tuning purposes. It is these functions that are traced with CrayPat using the following:

pat_build -w -T tuned_STREAM_Triad,tuned_STREAM_Add

Note that although the purpose of these functions is to tune the performance of STREAM for specific hardware, in this case no tuning took place and the separate functions simply implemented the generic kernels. In the case of the triad kernel, the tracing experiment revealed the following offcore requests data:

USER / tuned_STREAM_Triad
  Time%                          24.5%
  Time                           0.191797 secs
  Imb. Time                      -- secs
  Imb. Time%                     --
  Calls                          52.125 /sec    10.0 calls
  OFFCORE_REQUESTS:ALL_DATA_RD   26,841,407
  User time (approx)             0.192 secs     518,182,137 cycles
  Average Time per Call          0.019180 secs
  CrayPat Overhead : Time        0.0%

Table 6: Triad results of STREAM tracing experiment.

As discussed, the offcore requests and total time can be used to derive bandwidth. In this case, however, since agreement with the STREAM bandwidth in MB/s is desired, the following operation is used to derive bandwidth:

\[
\mathrm{Bandwidth\ (MB/s)} = \frac{\mathrm{OFCR} \times 64\ \mathrm{bytes}}{10^6\ \mathrm{bytes/MB} \times \mathrm{time}} \tag{30}
\]

Where OFCR is the number of offcore read requests. This conversion to MB/s reveals a triad bandwidth of 8947.1 MB/s, in clear disagreement with the STREAM triad bandwidth of 13,162.1 MB/s. The add kernel data (shown in Appendix A) also shows large disagreement with STREAM, with 26,741,550 read requests in 0.194 seconds and hence 8821.9 MB/s. This discrepancy is likely due to only read requests contributing to the bandwidth derivation, while memory writes are excluded. Also shown in Appendix A is the total number of offcore read requests for the STREAM trace experiment: 86,005,104 offcore requests in 0.782 seconds, hence a bandwidth of 7038.8 MB/s. This is approximately half of the STREAM bandwidth, which would be expected if the bandwidth derivation includes only half of the memory traffic by excluding writes. Adjustments must therefore be made to derive a correct memory bandwidth given only the offcore read requests.

The assumption is made that, to correctly measure both read and write memory bandwidth, each offcore read request must lead to an "offcore write request". Hence the number of offcore read requests will be doubled to measure total bandwidth, given the lack of a memory write hardware counter. The hardware counter MEM_UOPS_RETIRED:ALL_STORES was attempted, but resulted in a CrayPat error.

To further demonstrate this assumption, a micro benchmark is created which forces the offcore counter to trigger upon a memory write operation. This is achieved through the use of two simple loops, the first of which initialises an array of doubles to an arbitrary value and sums each element into an accumulation variable. This loop triggers the offcore counter due to a compulsory cache miss of each array element. The second loop then writes the value of the accumulated sum back to each array element, starting from the 0th element. With an array length longer than the cache size, this again triggers the offcore counter for each memory write due to a capacity cache miss, and hence measures memory write bandwidth. These two loops can then be traced using the CrayPat API. This is demonstrated in the following code:

#ifdef CRAYPAT
PAT_region_begin(1, "read requests");
#endif

/* initialise and sum: each element incurs a compulsory miss (offcore read) */
for (int i = 0; i < N; i++) {
    a[i] = 1.0;
    sum += a[i];
}

#ifdef CRAYPAT
PAT_region_end(1);
PAT_region_begin(2, "write requests");
#endif

/* write back: capacity misses force each store out to main memory */
for (int i = 0; i < N; i++) {
    a[i] = sum;
}

#ifdef CRAYPAT
PAT_region_end(2);
#endif

Offcore requests in these two trace regions can subsequently be measured by instrumenting this micro benchmark and running a trace experiment. Should the offcore requests show agreement in both regions, the assumption that write data is neglected, and therefore that the offcore data must be doubled, is considered correct. Shown in Table 7 is such trace region data for an array length of 4 times L3 cache, equivalent to the STREAM array length of 10,485,760 doubles.

USER / #1.read requests
  Time%                          81.9%
  Time                           0.043484 secs
  Imb. Time                      -- secs
  Imb. Time%                     --
  Calls                          22.994 /sec    1.0 calls
  OFFCORE_REQUESTS:ALL_DATA_RD   2,734
  User time (approx)             0.043 secs     117,465,282 cycles
  Average Time per Call          0.043484 secs
  CrayPat Overhead : Time        0.0%

USER / #2.write requests
  Time%                          17.9%
  Time                           0.009523 secs
  Imb. Time                      -- secs
  Imb. Time%                     --
  Calls                          104.952 /sec   1.0 calls
  OFFCORE_REQUESTS:ALL_DATA_RD   3,141
  User time (approx)             0.010 secs     25,735,668 cycles
  Average Time per Call          0.009523 secs
  CrayPat Overhead : Time        0.0%

Table 7: Micro benchmark trace results.

These results show good agreement between the read offcore requests and the write offcore requests, at 2,734 and 3,141 requests respectively. It is therefore concluded that, in the absence of an appropriate memory write hardware counter, the offcore read requests will be doubled to approximate total main memory bandwidth.

With a measurement for memory bandwidth established, as well as peak Flops/s, the Roofline Model for Genefer, LLR, and PFGW can be plotted. With all applications instrumented with CrayPat, the offcore read requests and aggregate MFlops/s are desired. As MFlops/s is attained through HWPC group 1 and bandwidth through the offcore counter, multiple runs of each application must be made, one run for each metric of interest. In similar fashion to the application speed benchmark, common GFNs will be compared on the Roofline Model. The GFNs under consideration will again have n = 13, 16, 19, with b varied to resolve the x87 and AVX regions of Genefer and the generic reduction and all-complex regions of LLR and PFGW. Shown in Figure 8 is the Roofline data of all applications with b of 8 million, 4 million, 1 million, and 500,000.

Figure 8: Roofline Model of Genefer, LLR, and PFGW.

There is a clear trend in Figure 8 that Roofline performance increases with n. This trend is most notably shown in Genefer, where a consistently decreasing operational intensity is labelled with corresponding values of n. As the decreasing operational intensity moves towards, but not beyond, the region of memory-bound performance, Roofline performance is indeed increasing. With the exception of Genefer's x87 region, and two points within n = 13 LLR, the absolute floating point performance shown in Figure 8 is also fairly consistent.

For Genefer's AVX region, floating point performance is ∼70% of peak hardware performance for n = 16, 19. Adjusting for the x87 limit, Genefer's x87 region also shows ∼60% of peak x87 performance for n = 13, 16, 19. With such consistent floating point performance, the decrease in operational intensity, and hence increase in Roofline performance, is due to an increase in bandwidth, as one would expect.

The same trend is somewhat true of LLR, where the operational intensity of n = 13 is much greater than that of n = 16, 19, showing increased performance for large n. However, n = 16 and 19 are clustered together, as are all of PFGW's points, therefore a further increase of n may not yield increased performance. The points of high operational intensity for LLR and Genefer are both n = 13, in the all-complex and AVX range respectively. The unusually high operational intensity is due to small bandwidth, which is itself simply a consequence of a small GFN and may not be an accurate representation of performance. Shown in Figure 9 is the Roofline Model with n = 13 removed.

Figure 9: Roofline Model of n = 16, 19.

With the removal of n = 13, the near identical Roofline performance of LLR and PFGW is shown. Additional points of b = 50,000 are added for PFGW for n = 16, 19, which implement the all-complex FFT. The separation of the generic reduction (shown as GR in Figure 9) and all-complex regions is also shown. For both LLR and PFGW, the generic reduction shows lower operational intensity than the all-complex FFT, yet the all-complex FFT demonstrates higher MFlops/s. This is expected, as the all-complex FFTs were much faster than the generic reduction in application speed. Genefer's AVX performance is also comparable to LLR and PFGW; however, when implementing the x87 instructions, absolute floating point performance decreases substantially due to the non-vectorised instruction limit.

3.4 Sources of Overhead

It is possible to observe the sources of overhead which limit the Roofline data shown in Figures 8 and 9 using a CrayPat sampling experiment. For the better performing AVX region of Genefer, a sampling experiment for a GFN of n = 19 and b = 10^6 is recorded. Genefer also remains modified such that computation terminates after several minutes. Shown in Table 8 is the data from such an experiment.

Samp%   Samp       Group / Function
100%    30,397.0   Total
99.8%   30,336.0   USER
46.3%   14,088.0   ZTransformAVXIntel::Transform
28.0%   8,496.0    ZTransformAVXIntel::BackwardForward16
25.1%   7,637.0    _ZN18ZTransformAVXIntel{...}

Table 8: Sampled data from Genefer for a GFN of n = 19 and b = 10^6.

It is clear that the Z-Transform is the source of overhead in Genefer, as 99.4% (46.3% + 28.0% + 25.1%) of CrayPat samples occurred in the ZTransform class. As the Z-Transform implementation is highly optimised with vector instructions, there is likely little scope for performance improvement.

This conclusion is also reached by observing cache metrics of Genefer. Using HWPC group 2, which provides D1, D2, and L3 metrics, the ratio of L3 cache hits and misses is derived by recording the total Last Level Cache (LLC) misses and references, and hence calculating:

\[
\mathrm{hit\%} = 100\% - \mathrm{miss\%}, \qquad
\mathrm{miss\%} = \frac{\mathrm{LLC\ misses}}{\mathrm{LLC\ references}} \times 100\% \tag{31}
\]

Shown in Table 9 are the LLC misses and references recorded as part of a sampling experiment for the same GFN of n = 19, b = 10^6 (full results of HWPC group 2 are shown in Appendix B). Since only the total LLC misses and references are desired, a sampling experiment rather than a trace will suffice, reducing CrayPat overhead.

Counter           Total
LLC_MISSES        297,639
LLC_REFERENCES    12,973,324,952

Table 9: LLC Misses and References Hardware Counter Data.
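Substituting the counter values from Table 9 into (31):

\[
\mathrm{miss\%} = \frac{297{,}639}{12{,}973{,}324{,}952} \times 100\% \approx 0.0023\%,
\qquad \mathrm{hit\%} \approx 99.9977\%
\]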

These results show that the LLC hits 99.9977% of the time, which is rounded to 100% in the HWPC group 2 results shown in Appendix B. This shows that cache misses are not a source of overhead in Genefer for this size of GFN, and indeed for all smaller GFNs, since the FFT length will reduce in size. The high ratio of cache hits also reiterates the conclusion drawn from the Roofline Model, where performance increases with n (essentially the size of the FFT).

In the absence of overhead due to cache, the Translation Lookaside Buffer (TLB) is investigated as another plausible source of memory overhead. The TLB is itself a cache of translations from virtual to physical memory addresses and is potentially a larger source of memory overhead than cache misses. Each TLB miss causes what is known as a page walk, which determines these translations and requires multiple reads of memory, hence a high overhead. The absolute number of TLB misses, particularly those that cause a page walk, is included in HWPC group 1. A further sampling experiment, again with a GFN of n = 19, b = 10^6, is run using HWPC group 1. With full results displayed in Appendix B, it is shown that the absolute numbers of TLB load misses and store misses are approximately 2 × 10^8 and 2 × 10^6, respectively. These absolute counts do not, however, by themselves reveal the total overhead due to TLB misses. To give context to the overhead of the TLB misses, a walk duration hardware counter is used in conjunction with the absolute number of TLB misses in the following sampling experiment:

Counter                                 Total
DTLB_LOAD_MISSES:WALK_DURATION          3,472,699,228
DTLB_LOAD_MISSES:MISS_CAUSES_A_WALK     199,410,617
DTLB_STORE_MISSES:WALK_DURATION         57,001,691
DTLB_STORE_MISSES:MISS_CAUSES_A_WALK    3,771,982
User time (approx)                      308.252 secs   832,587,696,798 cycles

Table 10: TLB Misses and Walk Duration.

The walk duration shows for how many cycles the Page Miss Handler (PMH) is "busy" performing a page walk. This, when compared with total CPU cycles, gives context to the TLB overhead. By taking the sum of both load and store walk durations, the overhead is quantifiable as a percentage of total CPU cycles:

\[
\mathrm{TLB\ overhead} = \frac{\mathrm{Load\ Duration} + \mathrm{Store\ Duration}}{\mathrm{Total\ Cycles}} \times 100\% \tag{32}
\]
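Substituting the walk durations and total cycles from Table 10:

\[
\frac{3{,}472{,}699{,}228 + 57{,}001{,}691}{832{,}587{,}696{,}798} \times 100\% \approx 0.42\%
\]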

The TLB overhead is therefore approximately 0.4%. It can also be shown by simple division that each load miss and store miss require on average 17 and 15 CPU cycles, respectively. This overhead is therefore considered negligible. In fact, when running Genefer an estimate of total time is produced throughout the Fermat Test. In the case of n = 19, b = 10^6, the estimate of total time on an ARCHER front end node is 9 hours; with a TLB overhead of 0.4%, the PMH would contribute only about 2 minutes of runtime.

While sampling LLR and PFGW with the GFN n = 19, b = 10^6, CrayPat issues a warning that a .apa file is not generated in both cases. This is due to no samples occurring within "USER" functions, and is a strong indication that the gwnum library is the largest source of overhead within these applications. This seems a logical conclusion: as Genefer's overhead was due to the FFT, it follows that the FFT in LLR and PFGW would also produce high overhead. Although a .apa file was not created, the sampling experiment remains functional, the results of which are shown in Tables 11 and 12. Both LLR and PFGW sampling data show unrecognisable function calls, presumably within the gwnum library. This further illustrates that the FFT is again the largest source of overhead. Analysis of cache behaviour (displayed in Appendices C and D) also shows an L3 cache hit ratio of 100% and 94.7% for LLR and PFGW, respectively. Without further rigorous memory performance analysis, it is determined that the sources of overhead for LLR and PFGW are similar to those of Genefer and reside within the FFT.

Samp%   Samp       Group / Function
100%    20,421.0   Total
99.9%   20,391.0   ETC
45.0%   9,185.0    ypass2_r4dwpn_12_levels_CORE
32.7%   6,667.0    yfft_r4dwpn_512K_ac_12_2_CORE
6.3%    1,280.0    yr3ecbCORE
4.0%    817.0      yr3ebCORE
2.2%    455.0      cftmdl1
1.7%    339.0      addsignal
1.3%    269.0      cftmdl2

Table 11: LLR sampling experiment data with GFN n = 19, b = 10^6.

Samp%   Samp       Group / Function
100%    24,818.0   Total
99.9%   24,805.0   ETC
47.9%   11,892.0   yfft_r4dwpn_1120K_1280_2_CORE
35.4%   8,796.0    ypass2_r4dwpn_1280_CORE
4.0%    989.0      gwysubr3
2.7%    673.0      gwycopyzero3
2.4%    590.0      yr3CORE
1.5%    378.0      yr3cCORE
1.3%    322.0      yr3zCORE
1.0%    237.0      cftmdl1

Table 12: PFGW sampling experiment data with GFN n = 19, b = 10^6.

Chapter 4: Residue Modification

4.1 Current Residue Calculations

As discussed, each application computes the modular exponentiation a^(p−1) mod p to conduct the Fermat Test. Should the result of this operation equate to 1, the GFN p is a probable prime; otherwise p is definitely composite. For the purposes of error checking, in the case where p is composite the residue is produced, which is the numerical result of a^(p−1) mod p. Since a large integer representation is used, the complete residue is unrealisable as output from the applications given the limitations of primitive integer types. Each application therefore computes a representation of the residue using the least significant digits of the large integer, subsequently packed into a 64-bit unsigned type.

Error checking through residues is needed because the Fermat Test of p has p − 1 possible composite results, whereas a probable prime p uniquely produces a^(p−1) mod p = 1. The residue is therefore used as a unique identifier for the compositeness of a Fermat Test. This identifier is then used to assess the correctness of Fermat Test results computed on a distributed computing network, where differing residues for a composite result of common GFNs indicate either a hardware or software flaw.

LLR and PFGW both compute residues in agreement, while Genefer's differs. The difference in Genefer's residue is not an indication of incorrect results; it is simply a different representation of the least significant 64 bits. However, it would be advantageous to produce a residue in agreement with LLR and PFGW for inter-application error checking.

LLR uses the giants library, which is embedded with gwnum, to compute the residue. It uses a conversion function from the gwnum type to the giants type, which is essentially a base conversion from gwnum's irrational base to the base 2^32 used in the giants library. A further modulo operation with p is then computed using the giants library. For the case of a GFN Fermat Test, this modulo operation has no effect, as it was already computed in the modular exponentiation; however, it remains in LLR for generality, as gwnum uses a general representation k·b^n + c rather than a strict GFN b^(2^n) + 1. Following the additional modulo operation, the residue is quoted in hexadecimal as follows:

gwtogiant(gwdata, x, tmp);  // x = gwnum, tmp = giant
modg(N, tmp);               // N is p represented as a giant
printf("%08lX%08lX", (unsigned long)tmp->n[1],
       (unsigned long)tmp->n[0]);

Where tmp->n[i] are the large integer coefficients in base 2^32, internal to the giants library, of type uint32_t.

PFGW computes a residue in agreement with LLR while using the GMP library. Once the Fermat Test is complete, PFGW uses its class wrappers to convert from gwnum to a GMP representation of type mpz_t, which is internal to PFGW's Integer class wrapper of GMP. PFGW also computes a modulo operation with p for the general case of k·b^n + c, which again has no effect in this case. The least significant bits are then found using a bitwise AND operation with ULLONG_MAX. As ULLONG_MAX in binary is simply a series of 1s, the AND operation has the effect of extracting the least significant 64 bits of the large integer. Of course, the bitwise AND operation is internal to the GMP library and is not simply a binary operation on primitive types. Following these operations, the 64-bit residue is quoted in hexadecimal as two 32-bit halves, using a bit shift and a further bitwise AND as follows:

X = gwX;   // X = GMP wrapper, gwX = gwnum wrapper
X %= *N;   // N is a pointer to p in GMP form
uint64_t res = (X & ULLONG_MAX);
printf("%08X%08X", (uint32_t)(res >> 32), (uint32_t)(res & 0xFFFFFFFF));

Note that though the operations shown above seem primitive, they are all in fact function calls to PFGW's class wrappers using C++ operator overloads. For instance, X = gwX is not a simple assignment; it is a call to a constructor which converts from gwnum to the mpz_t type. With this hexadecimal representation of the residue, LLR and PFGW show agreement.

It is of interest to modify the residue of Genefer such that it mimics either of the residue calculations defined above, rather than modifying LLR and PFGW for agreement with Genefer. Genefer currently calculates the residue by extracting bits from the last 8 digits of the large integer representation and constructing a 64-bit unsigned integer as follows:

unsigned long long res = 0;
for (size_t i = 8; i != 0; --i) {
    res = (res << 8) | (unsigned char) z[2*n - i];
}

Where z is the array of digits of the large integer representation and n is half of the array length. The large integer representation employed in Genefer uses base b and length m = 2^n for a given GFN p = b^(2^n) + 1. Since the large integer base is dependent upon p (essentially an arbitrary base), the least significant digits of Genefer's representation are not necessarily the least significant bits of the completely formed large integer in binary. Therefore this residue will not agree with LLR and PFGW, as it is fundamentally of a different base.
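A small decimal example (not taken from the applications) illustrates the mismatch. Consider the value 123456 stored in base 100, little endian:

\[
123456 = [56, 34, 12]_{100}, \qquad 123456 = \mathtt{0x1E240}
\]

The least significant base-100 digit packs as 56 = 0x38, whereas the least significant byte of the binary value is 0x40: the low digits of an arbitrary radix are not, in general, the low bits of the integer.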

4.2 Residue Modification with the GMP Library

To circumvent the arbitrary base of Genefer, the array of digits can simply be accumulated into a large integer library. Once converted, the residue calculation can mimic exactly that of LLR and PFGW. Within Genefer, the array of digits z is little endian, therefore the sum can be computed as:

\[
\mathrm{sum} = z_0 + z_1\,\mathrm{base} + z_2\,\mathrm{base}^2 + \cdots \tag{33}
\]

To avoid the computation of large powers, however, Horner's method is used to evaluate the polynomial shown in (33). Horner's method is essentially a clever rearrangement of the multiplications required to evaluate a polynomial. In this case, with the array of digits z and base b, Horner's method represents the sum as follows:

\[
\mathrm{sum} = z_0 + b\,(z_1 + b\,(z_2 + \cdots + b\,(z_{m-2} + b\,z_{m-1})\cdots)) \tag{34}
\]

This alternative representation of the polynomial sum proves much less cumbersome in code than the naïve computation of all powers, as the following pseudocode demonstrates:

sum = 0
for i in digits:
    sum = sum * base + i
return sum

Where digits is an array of digits in big-endian order. For the residue computation of Genefer, the GMP library is selected, as the residue calculation can then mimic that of PFGW following the Horner sum of digits. Using the GMP library, the sum of digits (for an array of m digits) is computed as follows:

mpz_t sum;
mpz_init(sum);

for (size_t i = 0; i != m; i++) {
    mpz_mul_ui(sum, sum, base);
    mpz_add_ui(sum, sum, z[m-1-i]);
}

Where Genefer, of course, employs a little-endian representation, therefore the digits are indexed in reverse. Once the summation is complete with GMP, the residue calculation can follow the same fashion as PFGW. Neglecting PFGW's class wrappers, the residue is calculated as follows:

mpz_t llongmax;
mpz_init_set_str(llongmax, "FFFFFFFFFFFFFFFF", 16);

mpz_and(sum, sum, llongmax);
unsigned long res = mpz_get_ui(sum);
printf("%08lX%08lX", res >> 32, res & 0xFFFFFFFF);

Clearly this follows the residue calculation of PFGW, with the exception of the modulo operation. As previously discussed, the modulo operation in PFGW and LLR exists for the general case and has no effect in Genefer, which employs strict GFNs b^(2^n) + 1.
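Collecting the fragments above into a minimal self-contained sketch (the digit array, base, and length are illustrative stand-ins for Genefer's internal z array and GFN parameters, not the actual patch):

#include <cstdio>
#include <gmp.h>

int main() {
    // Illustrative little-endian digit array and base
    const unsigned long z[] = {56, 34, 12};  // 56 + 34*100 + 12*100^2 = 123456
    const unsigned long base = 100;
    const size_t m = sizeof z / sizeof z[0];

    // Horner evaluation of the digit polynomial, most significant digit first
    mpz_t sum;
    mpz_init(sum);
    for (size_t i = 0; i != m; ++i) {
        mpz_mul_ui(sum, sum, base);
        mpz_add_ui(sum, sum, z[m - 1 - i]);
    }

    // Extract the least significant 64 bits, as PFGW does with ULLONG_MAX
    mpz_t llongmax;
    mpz_init_set_str(llongmax, "FFFFFFFFFFFFFFFF", 16);
    mpz_and(sum, sum, llongmax);

    // unsigned long is 64 bits on ARCHER, so it holds the full residue
    unsigned long res = mpz_get_ui(sum);
    printf("%08lX%08lX\n", res >> 32, res & 0xFFFFFFFFUL);

    mpz_clear(sum);
    mpz_clear(llongmax);
    return 0;
}

This sketch is compiled and linked against GMP in the usual way, for example g++ residue.cpp -lgmp.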

To illustrate the correctness of this residue, the results from Genefer are compared with PFGW and LLR. It is essential, however, that the Fermat Base used in each application is identical (where Fermat Base a is used in the Fermat Test a^(p−1) mod p). In Genefer this base is hard-coded as 2 when instantiating the FFT (InitFFT(m, b, 2), where m is 2^n and b is the GFN base). LLR and PFGW must therefore use a Fermat Base of 2 to conform to a common residue value. This is achieved through command line arguments to the applications: ./llr -oFBase=2 and ./pfgw -b2. For a Fermat Base of 2, the residue for a Fermat Test with a GFN of n = 13, b = 100 is as follows:

Application         Residue
PFGW                FDDE0CD432BD6D8C
LLR (RES64)         FDDE0CD432BD6D8C
Genefer (GMP)       FDDE0CD432BD6D8C
Genefer (default)   561a094a410d4c46

Table 13: Residue of all Applications for a GFN of n = 13, b = 100 for Fermat Base 2.

LLR computes two forms of residue, one which it defines as "old64" and the other "res64". It is res64 which is in agreement with PFGW, and it is hence labelled as such in Table 13, while old64 is discarded. The default residue calculation of Genefer is also retained for comparative purposes. The newly calculated residue in Genefer conforms to LLR and PFGW not only for Fermat Base 2, but for an arbitrary Fermat Base. Using the default Fermat Base of 3 in LLR and PFGW, and hence re-compiling Genefer with a hard-coded base of 3, the residue for a GFN of n = 13, b = 10^4 is as follows:

Application         Residue
PFGW                A874DC2BD3F1B9C8
LLR (RES64)         A874DC2BD3F1B9C8
Genefer (GMP)       A874DC2BD3F1B9C8
Genefer (default)   4c20c42dc433e0b9

Table 14: Residue of all Applications for a GFN of n = 13, b = 10^4 for Fermat Base 3.

Further GFNs and Fermat Bases have been tested rigorously, and it is concluded that the GMP implementation of the Genefer residue conforms to that of LLR and PFGW.

4.3 Radix to Binary Residue Calculation

The residue calculations which employ a large integer library share a fundamental commonality. In LLR, the conversion from gwnum to giants is essentially a base conversion from gwnum's representation to base 2^32 in the giants library. The same occurs in PFGW, where the gwnum representation is converted to the GMP library, which internally uses base 2^64. This commonality was further emphasised when the arbitrary base of Genefer was also converted to base 2^64 using GMP. The residue calculation is therefore fundamentally a base conversion problem, with a truncation of least significant bits.

It is indeed possible to bypass the need for a large integer library and compute the base conversion manually. For the purposes of residue calculation, such a manual base conversion could convert from the arbitrary base to binary rather than base 2^64, and hence the least significant digits of the derived binary array would conform to the least significant bits of the large integer. A radix conversion algorithm of this kind is described in The Art of Computer Programming Vol. 2 [24, sec. 4.4], the GMP library documentation [19, ch. 15.6.2], and a publicly available code implementation [25]: an arbitrary-base array may be converted to a binary array using simple add-and-carry adjustment of digits multiplied by the arbitrary base. Note that the resultant binary array is quite literally an array of primitive integer types with the values 0 or 1, not a lower-level binary representation of digits. Add-and-carry was briefly discussed in the digit adjustment for FFT multiplication. As a consequence of the acyclic convolution, digit values may become larger than the base they represent, which results in an invalid array representation. For example:

Decimal value   2^0   2^1   2^2   2^3   2^4   2^5
33              1     0     0     0     0     1      Valid
33              1     0     2     3     0     0      Invalid

Table 15: Valid and Invalid Large Integer Representations.

While both base 2 representations of 33 in Table 15 sum to the correct decimal value, the latter representation is invalid, as the digits at 2^2 and 2^3 do not conform to 0 or 1. Values such as these are rectified using the following add-and-carry algorithm:

carry = 0
for i in digits {
    tmp = i + carry
    i = tmp % base
    carry = tmp / base
}

Where base in this case is 2, tmp / base denotes integer division, and tmp % base is a modulo operation, for digits in little endian. This add-and-carry algorithm, however, does not convert base; it simply adjusts digits. To achieve a radix conversion, a multiplication stage by the original base is added, together with two stages of a slightly enhanced add-and-carry. Each add-and-carry stage is required to adjust the digits themselves and to adjust an accumulation of multiplications. The following implementation shows a radix conversion from a base-b array of digits z to a binary array res, with multiplication accumulator acc:

z[base b digits]      // already initialised values
res[base 2 digits]    // initially set to 0
acc[base 2 digits]    // initially set to 0

acc[0] = 1
for i in range 0 <= i < base b digits {

    // accumulate digit i at every set bit of acc (acc holds b^i in binary)
    for j in range 0 <= j < base 2 digits {
        res[j] += acc[j] * z[i]
        tmp = res[j]
        k = j
        do {   // add-and-carry
            res[k] = tmp % 2
            k++
            res[k] += tmp / 2
            tmp = res[k]
        } while (tmp >= 2)
    }

    // accumulate multiplications: acc *= b
    for j in range 0 <= j < base 2 digits {
        acc[j] *= b
    }
    for j in range 0 <= j < base 2 digits {
        tmp = acc[j]
        k = j
        do {   // add-and-carry
            acc[k] = tmp % 2
            k++
            acc[k] += tmp / 2
            tmp = acc[k]
        } while (tmp >= 2)
    }
}

Once the code above is complete, the array res in base 2 will contain digits of either 0 or 1, which in turn represent the arbitrary base representation z. This code can be interfaced with Genefer by supplying the z array, the base b of the GFN, and the array length m = 2^n, where "base b digits" = m. To complete the residue calculation, only the least significant 64 bits are required. Therefore "base 2 digits" can be set to 64, where all higher bits are thrown away. With the addition of cleverly placed break statements (within the add-and-carry do-while loops), array index errors are avoided with this "throwing away" of bits. The res array will therefore contain the least significant 64 bits of the large integer, which can subsequently be used to find the residue.

The challenge remains of converting the binary array res to a single unsigned integer for conversion to hexadecimal. This is easily completed with bit shift and bitwise OR operations. Consider the five-bit array [1,0,1,1,1] in little endian and a hypothetical 5-bit unsigned integer initially set to 0. It is possible to insert each digit of the array into the unsigned integer with a bitwise OR, then shift the unsigned integer left by 1 bit for the insertion of the next digit. This is demonstrated in the following table:

Array index   Value   residue   residue | val   (residue | val) << 1
0             1       00000     00001           00010
1             0       00010     00010           00100
2             1       00100     00101           01010
3             1       01010     01011           10110
4             1       10110     10111           01110

Table 16: Bit manipulation of binary array inserted into an unsigned integer.

Notice that the result of (residue | val) in the last iteration is identical to the array of bits, but packed into a 5-bit unsigned integer. Since the array was in little endian, the bits must be reversed to achieve the proper binary representation. This is easily achieved by reversing the array index order. The following code demonstrates the previous binary manipulation for a 64-bit unsigned long:

unsigned long residue = 0;
for (size_t i = 0; i != 63; i++) {
    residue = (residue | digits[63-i]) << 1;
}
residue = (residue | digits[0]);
printf("%08lX%08lX", residue >> 32, residue & 0xFFFFFFFF);

The 63-iteration loop has the effect of pushing 63 digits from the array into the unsigned long, in the same fashion as iterations 0 through 3 in Table 16. The final bitwise OR is left outwith the loop so as not to perform a final bit shift. This step is crucial for correctness, as a final shift would displace every bit by one position and hence show disagreement with the other derived residues. With the final bitwise OR removed from the loop, the residue quoted in hexadecimal is considered correct in every case.
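Putting the complete path together, the following is a minimal self-contained sketch of the radix to binary residue (the names are illustrative, not Genefer's actual interface). It uses a single forward add-and-carry pass per stage in place of the do-while ripple above, which has the same effect, and drops carries beyond bit 63 by the loop bound, standing in for the break statements described earlier:

#include <cstdint>
#include <cstdio>
#include <vector>

// z is little endian in base b; only the least significant 64 binary
// digits are kept, so higher carries are discarded
uint64_t radix_to_binary_residue(const std::vector<uint64_t>& z, uint64_t b) {
    const size_t bits = 64;
    std::vector<uint64_t> res(bits, 0), acc(bits, 0);
    acc[0] = 1;  // acc holds b^i in binary, starting from b^0 = 1

    for (size_t i = 0; i < z.size(); ++i) {
        // res += z[i] * acc, with an add-and-carry pass to restore
        // binary digits
        uint64_t carry = 0;
        for (size_t j = 0; j < bits; ++j) {
            uint64_t t = res[j] + z[i] * acc[j] + carry;
            res[j] = t % 2;
            carry  = t / 2;
        }
        // acc *= b, with the same add-and-carry adjustment
        carry = 0;
        for (size_t j = 0; j < bits; ++j) {
            uint64_t t = acc[j] * b + carry;
            acc[j] = t % 2;
            carry  = t / 2;
        }
    }

    // Pack the binary array into a 64-bit integer, reversing the index
    // order as in Table 16; the final OR deliberately has no trailing shift
    uint64_t residue = 0;
    for (size_t i = 0; i != bits - 1; ++i)
        residue = (residue | res[bits - 1 - i]) << 1;
    residue |= res[0];
    return residue;
}

int main() {
    // 123456 in base 100, little endian: expect 000000000001E240
    std::vector<uint64_t> z = {56, 34, 12};
    printf("%016llX\n", (unsigned long long)radix_to_binary_residue(z, 100));
    return 0;
}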

The newly calculated residue using GMP in Genefer was determined to show agreement with LLR and PFGW. Therefore, to assess the correctness of the radix to binary residue, results need only be compared with Genefer itself. Shown below are a select few residue values quoted by Genefer: one set from the previously derived GMP residue, the other from the radix to binary residue.

GFN                   GMP Residue        Radix to Binary Residue
n = 9,  b = 10^3      520183D3D4F11F1D   520183D3D4F11F1D
n = 13, b = 10^2      C5FF6A4A68324D5A   C5FF6A4A68324D5A
n = 13, b = 10^4      A874DC2BD3F1B9C8   A874DC2BD3F1B9C8
n = 14, b = 5 × 10^4  D9722563BB8D4612   D9722563BB8D4612
n = 16, b = 10^5      8210B93FE6498C42   8210B93FE6498C42

Table 17: GMP Residue and Radix to Binary Residue produced by Genefer using Fermat Base 3.

Table 17 shows a subset of all test cases observed with Genefer. Each case is in agreement with the GMP residue, therefore an exhaustive list is considered redundant. The GFNs selected in Table 17 display a wide range of b and n in an effort to show that the correctness of the residue can be extrapolated to all other GFNs. With the radix to binary residue showing agreement with the GMP residue, it is concluded that, by extension, the radix to binary residue also conforms to LLR and PFGW.

The GMP library was introduced as a dependency for the purposes of calculating a residue conforming to LLR and PFGW. With the residue now derived in Genefer without the need of the GMP library, the portability of the application is not affected. Further tests for the correctness of the residue calculation could be implemented in future, for instance a residue calculation given a random z array, not necessarily related to specific GFNs. The addition of a command line argument for the Fermat Base would also enable further comparison with LLR and PFGW. Following any further correctness tests, the addition of the radix to binary residue in Genefer is potentially production ready for use in PrimeGrid.

Chapter 5: Conclusion & Further Work

The performance of the Fermat Test of Generalised Fermat Numbers implemented in Genefer, LLR, and PFGW was assessed in terms of absolute speed and peak hardware capabilities. This was achieved by varying n and b for test GFNs and comparing the performance of each application. Application speed was defined as the number of Fermat Test iterations completed per second. As each iteration requires an FFT multiplication, this gave an indication of FFT performance within each application. It was determined that Genefer was superior in terms of application speed for GFNs of n = 13, 16, 19 and 10^4 ≤ b ≤ 3 × 10^7.

The Roofline Model was used to relate the performance of the applications to peak hardware capabilities. Using a single processor of an ARCHER compute node (Intel Xeon E5-2697 v2), the Roofline Model showed increasing performance as n increased. It also showed that the applications' floating point performance could consistently achieve ∼70% of peak performance. The source of overhead which limited the Roofline performance was shown to be the FFT computation. As the FFTs were already written with vector instructions in Genefer, and optimised assembly in the gwnum library, no performance improvement was achieved.

Following the performance assessment, the residue calculation of Genefer was modified such that it conformed to LLR and PFGW. This was completed in order to improve the scope for error checking of Fermat Test results, and was first achieved using the GMP library. The GMP library, though, introduced an unnecessary dependency to the Genefer code, and therefore a second solution to the residue modification was implemented. This solution, using a radix conversion of the large integer representation, replicated exactly the results of the LLR and PFGW residue without impacting Genefer's portability. Following further tests of correctness, this addition to Genefer is possibly qualified for use in PrimeGrid.

Further work could entail the performance analysis of a GPU implementation of Genefer with CUDA and OpenCL. PrimeGrid has recently discontinued the use of GeneferCUDA in favour of the better performing GeneferOCL; performance analysis of both could definitively determine which version is superior. A command line argument for the Fermat Base of Genefer could also be implemented in future, as it is presently hard-coded in the FFT. This would ease the comparison of residues between Genefer, LLR, and PFGW, as they each use differing default Fermat Bases. Further correctness testing of the residue calculation presented in Chapter 4 could also certify the claims that the radix conversion code is correct.

References

[1] Kelly Devine Thomas. From Prime Numbers to Nuclear Physics and Beyond. Institute for Advanced Study, 2013. https://www.ias.edu/ideas/2013/primes-random-matrices.
[2] Ronald L. Rivest et al. A Method for Obtaining Digital Signatures and Public-Key Cryptosystems. Communications of the ACM, 21(2):120–126, February 1978.
[3] Mersenne Research Inc. Great Internet Mersenne Prime Search. http://www.mersenne.org/, 1996–2016.
[4] Rytis Šlatkevičius & PrimeGrid community. PrimeGrid. http://www.primegrid.com, 2005–2016.
[5] University of California. BOINC Open-Source Software for Volunteer Computing. http://boinc.berkeley.edu/.
[6] BOINC Stats. PrimeGrid detailed stats. http://boincstats.com/en/stats/11/project/detail/overview.
[7] Top500.org. Top500 List - June 2016. https://www.top500.org/list/2016/06/.
[8] Iain Bethune et al. Genefer Source Code. https://www.assembla.com/spaces/genefer/subversion/source.
[9] Jean Penné. LLR Source Code. http://jpenne.free.fr/index2.html.
[10] Mark Rodenkirch. OpenPFGW Source Code. https://sourceforge.net/projects/openpfgw/.
[11] Samuel Williams et al. Roofline: An Insightful Visual Performance Model for Multicore Architectures. Communications of the ACM, 52(4):65–76, April 2009.
[12] Martin Dietzfelbinger. Primality Testing in Polynomial Time. Lecture Notes in Computer Science 3000. Springer-Verlag Berlin Heidelberg, first edition, 2004.
[13] Richard Crandall & Carl Pomerance. Prime Numbers: A Computational Perspective. Springer Science+Business Media, second edition, 2005.
[14] Arnold Schönhage & Volker Strassen. Schnelle Multiplikation großer Zahlen. Computing, 7:281–292, 1971.
[15] Georg Bruun. z-Transform DFT Filters and FFT's. IEEE Transactions on Acoustics, Speech and Signal Processing, 26(1):56–63, February 1978.
[16] Hans Riesel. Prime Numbers and Computer Methods for Factorization. Birkhäuser, second edition, 1994.
[17] Richard Crandall & Barry Fagin. Discrete Weighted Transforms and Large-Integer Arithmetic. Mathematics of Computation, 62(205):305–324, January 1994.
[18] George Woltman. gwnum library. https://github.com/rudimeier/mprime/tree/master/gwnum.
[19] Free Software Foundation. The GNU Multiple Precision Arithmetic Library. https://gmplib.org/, 2000–2016.
[20] The ARCHER Supercomputer. http://archer.ac.uk/.
[21] Intel ARK. Intel Xeon Processor E5-2697 v2 (30M Cache, 2.70 GHz). http://ark.intel.com/products/75283/Intel-Xeon-Processor-E5-2697-v2-30M-Cache-2_70-GHz.
[22] John D. McCalpin. Memory Bandwidth and Machine Balance in Current High Performance Computers. IEEE Computer Society Technical Committee on Computer Architecture (TCCA) Newsletter, pages 19–25, December 1995.
[23] Cray Inc. Cray Performance Measurement and Analysis Tools S-2376-63. http://docs.cray.com/books/S-2376-63/S-2376-63.pdf, September 2015.
[24] Donald Knuth. The Art of Computer Programming, volume 2. Addison-Wesley, third edition, 1998.
[25] Base Conversion of Very Long Positive Integers. http://www.codeproject.com/Articles/16035/Base-Conversion-of-Very-Long-Positive-Integers, October 2006.

A: STREAM Hardware Counter Data

Total
  Time%                          100%
  Time                           0.781822 secs
  Imb. Time                      -- secs
  Imb. Time%                     --
  Calls                          28.139 /sec    22.0 calls
  OFFCORE_REQUESTS:ALL_DATA_RD   86,005,104
  User time (approx)             0.782 secs     2,111,734,254 cycles
  Average Time per Call          0.035537 secs
  CrayPat Overhead : Time        0.0%

USER / tuned_STREAM_Add
  Time%                          24.8%
  Time                           0.193543 secs
  Imb. Time                      -- secs
  Imb. Time%                     --
  Calls                          51.667 /sec    10.0 calls
  OFFCORE_REQUESTS:ALL_DATA_RD   26,741,550
  User time (approx)             0.194 secs     522,770,472 cycles
  Average Time per Call          0.019354 secs
  CrayPat Overhead : Time        0.0%

USER / tuned_STREAM_Triad
  Time%                          24.5%
  Time                           0.191797 secs
  Imb. Time                      -- secs
  Imb. Time%                     --
  Calls                          52.125 /sec    10.0 calls
  OFFCORE_REQUESTS:ALL_DATA_RD   26,841,407
  User time (approx)             0.192 secs     518,182,137 cycles
  Average Time per Call          0.019180 secs
  CrayPat Overhead : Time        0.0%

B: Genefer Hardware Counter Data

HWPC Group 2 - Cache Metrics

Total
  CPU_CLK_UNHALTED:THREAD_P      967,170,468,322
  CPU_CLK_UNHALTED:REF_P         28,698,893,502
  L1D:REPLACEMENT                50,269,125,147
  L2_RQSTS:ALL_DEMAND_DATA_RD    48,017,242,934
  L2_RQSTS:DEMAND_DATA_RD_HIT    36,303,698,207
  LLC_MISSES                     297,639
  LLC_REFERENCES                 12,973,324,952
  User time (approx)             306.979 secs   829,149,150,738 cycles
  CPU_CLK                        3.37 GHz
  D2 cache hit,miss ratio        76.7% hits   23.3% misses
  L3 cache hit,miss ratio        100.0% hits  0.0% misses
  D2 to D1 bandwidth             9,547.050 MiB/sec   3,073,103,547,776 bytes

HWPC Group 1 - TLB Misses

Total
  CPU_CLK_UNHALTED:THREAD_P               968,372,553,886
  CPU_CLK_UNHALTED:REF_P                  28,733,801,325
  DTLB_LOAD_MISSES:MISS_CAUSES_A_WALK     200,024,036
  DTLB_STORE_MISSES:MISS_CAUSES_A_WALK    2,882,535
  L1D:REPLACEMENT                         50,627,487,854
  L2_RQSTS:ALL_DEMAND_DATA_RD             48,143,092,305
  L2_RQSTS:DEMAND_DATA_RD_HIT             36,269,341,762
  FP_COMP_OPS_EXE:SSE_SCALAR_DOUBLE       12,291,437
  FP_COMP_OPS_EXE:X87                     16,031,967
  FP_COMP_OPS_EXE:SSE_FP_PACKED_DOUBLE    91,112
  SIMD_FP_256:PACKED_DOUBLE               1,099,167,795,264
  PM_ENERGY:NODE                          126.788 /sec   38,860 J
  User time (approx)                      306.497 secs   827,848,550,544 cycles
  CPU_CLK                                 3.37 GHz
  HW FP Ops / User time                   14,344.998M/sec   4,396,699,686,684 ops
  MFLOPS (aggregate)                      14,345.00M/sec
  D2 cache hit,miss ratio                 76.5% hits   23.5% misses
  D2 to D1 bandwidth                      9,587.111 MiB/sec   3,081,157,907,520 bytes

C: LLR Hardware Counter Data

HWPC Group 2 - Cache Metrics

Total
  CPU_CLK_UNHALTED:THREAD_P      650,586,421,420
  CPU_CLK_UNHALTED:REF_P         19,304,118,742
  L1D:REPLACEMENT                29,451,720,055
  L2_RQSTS:ALL_DEMAND_DATA_RD    34,704,089,445
  L2_RQSTS:DEMAND_DATA_RD_HIT    29,325,367,989
  LLC_MISSES                     955,925
  LLC_REFERENCES                 5,691,279,706
  User time (approx)             204.701 secs   552,900,783,429 cycles
  CPU_CLK                        3.37 GHz
  D2 cache hit,miss ratio        81.7% hits   18.3% misses
  L3 cache hit,miss ratio        100.0% hits  0.0% misses
  D2 to D1 bandwidth             10,347.563 MiB/sec   2,221,061,724,480 bytes

D: PFGW Hardware Counter Data

HWPC Group 2 - Cache Metrics

Total
  CPU_CLK_UNHALTED:THREAD_P      800,303,787,894
  CPU_CLK_UNHALTED:REF_P         23,745,665,310
  L1D:REPLACEMENT                64,529,991,079
  L2_RQSTS:ALL_DEMAND_DATA_RD    60,251,296,428
  L2_RQSTS:DEMAND_DATA_RD_HIT    45,877,803,473
  LLC_MISSES                     886,310,601
  LLC_REFERENCES                 16,689,336,606
  User time (approx)             252.247 secs   681,319,258,044 cycles
  CPU_CLK                        3.37 GHz
  D2 cache hit,miss ratio        77.7% hits   22.3% misses
  L3 cache hit,miss ratio        94.7% hits   5.3% misses
  D2 to D1 bandwidth             14,578.753 MiB/sec   3,856,082,971,392 bytes
