
Getting to know the LLVM

Guobin YE

August 19, 2011

MSc in High Performance Computing
The University of Edinburgh
Year of Presentation: 2011

Abstract

Fast compilation, reducing the run time of a program, providing useful error messages and reducing the size of the files generated are the main purposes of a compiler. Following the development of compiler technologies, compilers have become more and more powerful and efficient. The performance of some open-source compilers is now good enough to attract our attention.

This thesis introduces an open-source compiler called LLVM (Low Level Virtual Machine) [1, 2, 3] and discusses how the LLVM multiple language front-ends collaborate with the LLVM back-end to provide effective lifetime optimization for a program. This thesis compares the performance of the LLVM compiler (version 2.9) with two other compilers, which are the Portland Group compiler (PGI) [4] and the GNU Compiler Collection (GCC) [5] version 4.1.2 and version 4.5.2. We also experiment to find which optimisation flags of each compiler can provide the best performance for a program, and compare as many relevant aspects of the compilers as possible, such as compile time, run-time and the size of the code generated.

Three scientific benchmark codes are chosen for the performance testing. They are GADGET-2 [6], LAMMPS [7] and HELIUM [8]. The details of the benchmarks are introduced in Chapter 3.

Contents

Chapter 1 Introduction ...... 1

Chapter 2 Background Theory ...... 3

2.1 LLVM compiler infrastructure ...... 3

2.2 LLVM Intermediate Representation (IR) ...... 3

2.3 LLVM multi-stage optimisation ...... 4

2.3.1 Compile-time optimisation ...... 4

2.3.2 Link-time optimisation ...... 5

2.3.3 Run-time optimisation ...... 5

2.3.4 Idle time optimisation ...... 5

2.4 LLVM front-ends ...... 6

2.4.1 LLVM GCC 4.2.1 ...... 6

2.4.2 Clang ...... 6

2.5 PGI compiler and GNU GCC compiler ...... 7

2.5.1 PGI compiler ...... 7

2.5.2 GNU GCC compiler ...... 8

2.6 GOLD Linker and LLVM gold plugin ...... 8

2.7 LibDispatch library ...... 9

Chapter 3 Tools and methods for benchmarking ...... 10

3.1 Testing platform ...... 10

3.2 Benchmarks...... 10

3.2.1 GADGET-2 ...... 10

3.2.2 LAMMPS ...... 11

3.2.3 HELIUM ...... 12

3.3 Profiling tools ...... 12

3.4 Correctness checking...... 13

3.5 MPI wrapper ...... 13

3.6 Tools of LLVM compiler ...... 14

3.7 Issues and solutions for building LLVM and Clang on Ness ...... 15

Chapter 4 Performance Testing and Analysis ...... 16

4.1 GADGET-2 testing result ...... 17

4.1.1 Compile time comparisons ...... 17

4.1.2 Execution time comparisons ...... 18

4.1.3 Executable size comparisons ...... 19

4.1.4 Sum up ...... 20

4.2 LAMMPS testing result ...... 20

4.2.1 Compile time comparisons ...... 20

4.2.2 Lennard-Jones problem execution time comparisons ...... 21

4.2.3 Rhodopsin problem execution time comparisons ...... 22

4.2.4 Executable size comparisons ...... 23

4.2.5 Sum up ...... 24

4.3 HELIUM testing result ...... 24

4.3.1 Compile time comparisons ...... 24

4.3.2 Execution time comparisons ...... 25

4.3.3 Executable size comparisons ...... 27

4.3.4 Sum up ...... 27

Chapter 5 Further Works ...... 28

Chapter 6 Conclusions ...... 29

Appendix A How to build LLVM and Clang on Ness ...... 30

A.1 Preparation jobs ...... 30

A.1.1 Set up GCC version 4.5.2 on Ness...... 31

A.1.2 Configure and build LLVM and Clang on Ness ...... 31

A.2 Build LLVM-GCC (version 4.2.1) on Ness ...... 33

A.2.1 Preparation jobs ...... 33

A.2.2 Configure and build LLVM-GCC on Ness ...... 33

Appendix B Performance timing results ...... 35

B.1 GADGET-2 simulation results ...... 35

B.2 LAMMPS simulation results ...... 38

B.3 HELIUM simulation results ...... 43

Appendix C Benchmarks input file setting ...... 46

C.1 GADGET-2 simulation input file setting ...... 46

C.2 LAMMPS – Lennard-Jones problem input file setting ...... 49

C.3 LAMMPS – Rhodopsin problem input file setting ...... 50

C.4 HELIUM simulation setting ...... 50

Appendix D Profiling information ...... 51

D.1 GADGET-2 Profiling ...... 51

D.1.1 PGI Profiling ...... 51

D.1.2 GCC-4.1.2 Profiling ...... 52

D.1.3 GCC-4.5.2 Profiling ...... 53

D.1.4 LLVM-GCC Profiling ...... 54

D.2 LAMMPS Profiling ...... 55

D.2.1 PGI Lennard-Jones Profiling ...... 55

D.2.2 GCC-4.1.2 Lennard-Jones Profiling ...... 56

D.2.3 GCC-4.5.2 Lennard-Jones Profiling ...... 57

D.2.4 LLVM-GCC Lennard-Jones Profiling ...... 58

D.2.5 PGI Rhodopsin Profiling ...... 59

D.2.6 GCC-4.1.2 Rhodopsin Profiling ...... 60

D.2.7 GCC-4.5.2 Rhodopsin Profiling ...... 61

D.2.8 LLVM-GCC Rhodopsin Profiling ...... 62

D.3 HELIUM Profiling ...... 63

D.3.1 PGI Profiling (O3 flag enable) ...... 63

D.3.2 PGI Profiling (fast flag enable) ...... 64

D.3.3 GCC-4.1.2 Profiling ...... 65

D.3.4 GCC-4.5.2 Profiling ...... 66

D.3.5 LLVM-GCC Profiling ...... 67

Appendix E Project work plan and risk analysis ...... 68

References ...... 70

List of Tables

Table 2.1: Some standard PGI compiler flags ...... 7

Table 2.2: Some standard GCC compiler flags ...... 8

Table 3.1: Benchmarks correctness checking methods ...... 13

Table 3.2: MPI wrappers ...... 13

Table 3.3: Several tools of LLVM ...... 14

Table 3.4: Several LLVM front-ends ...... 14

Table 3.5: Several LLVM compiler flags ...... 14

Table 4.1: The optimisation flags of each compiler for comparisons ...... 16

Table 4.2: GADGET-2 profiling comparisons ...... 18

Table 4.3: LAMMPS compile time comparisons ...... 20

Table 4.4: Lennard-Jones problem profiling comparisons ...... 21

Table 4.5 Rhodopsin problem profiling comparisons ...... 23

Table 4.6: PGI profiling comparison in HELIUM ...... 25

Table 4.7: HELIUM profiling comparisons ...... 26

Table B.1 GADGET-2 compile time results...... 35

Table B.2 GADGET-2 execution time results ...... 35

Table B.3 GADGET-2 executable size results ...... 35

Table B.4 PGI compile time on GADGET-2...... 36

Table B.5 PGI execution time on GADGET-2 ...... 36

Table B.6 GCC-4.1.2 compile time on GADGET-2 ...... 36

Table B.7 GCC-4.1.2 execution time on GADGET-2 ...... 36

Table B.8 GCC-4.5.2 compile time on GADGET-2 ...... 37

Table B.9 GCC-4.5.2 execution time on GADGET-2 ...... 37

Table B.10 LLVM-GCC compile time on GADGET-2 ...... 37

Table B.11 LLVM-GCC execution time on GADGET-2 ...... 37

Table B.12 Clang compile time on GADGET-2 ...... 38

Table B.13 Clang execution time on GADGET-2 ...... 38

Table B.14 LAMMPS compile time results ...... 38

Table B.15 LAMMPS – Lennard-Jones execution time results ...... 38

Table B.16 LAMMPS – Rhodopsin execution time results ...... 39

Table B.17 LAMMPS executable size results ...... 39

Table B.18 PGI compile time on LAMMPS ...... 39

Table B.19 GCC-4.1.2 compile time on LAMMPS ...... 39

Table B.20 GCC-4.5.2 compile time on LAMMPS ...... 40

Table B.21 LLVM-GCC compile time on LAMMPS ...... 40

Table B.22 Clang compile time on LAMMPS ...... 40

Table B.23 PGI execution time on LAMMPS – Lennard-Jones problem ...... 40

Table B.24 GCC-4.1.2 execution time on LAMMPS – Lennard-Jones problem ...... 41

Table B.25 GCC-4.5.2 execution time on LAMMPS – Lennard-Jones problem ...... 41

Table B.26 LLVM-GCC execution time on LAMMPS – Lennard-Jones problem ...... 41

Table B.27 Clang execution time on LAMMPS – Lennard-Jones problem ...... 41

Table B.28 PGI execution time on LAMMPS – Rhodopsin problem ...... 42

Table B.29 GCC-4.1.2 execution time on LAMMPS – Rhodopsin problem ...... 42

Table B.30 GCC-4.5.2 execution time on LAMMPS – Rhodopsin problem ...... 42

Table B.31 LLVM-GCC execution time on LAMMPS – Rhodopsin problem ...... 42

Table B.32 Clang execution time on LAMMPS – Rhodopsin problem ...... 43

Table B.33 HELIUM compile time results ...... 43

Table B.34 HELIUM execution time results ...... 43

Table B.35 HELIUM executable size results ...... 43

Table B.36 PGI compile time on HELIUM ...... 44

Table B.37 PGI execution time on HELIUM ...... 44

Table B.38 GCC-4.1.2 compile time on HELIUM ...... 44

Table B.39 GCC-4.1.2 execution time on HELIUM ...... 44

Table B.40 GCC-4.5.2 compile time on HELIUM ...... 45

Table B.41 GCC-4.5.2 execution time on HELIUM ...... 45

Table B.42 LLVM-GCC compile time on HELIUM...... 45

Table B.43 LLVM-GCC execution time on HELIUM ...... 45

List of Figures

Figure 2.1: LLVM IR replaces GCC IR...... 4

Figure 4.1: GADGET-2 compile time comparisons ...... 17

Figure 4.2: GADGET-2 execution time comparisons ...... 18

Figure 4.3 GADGET-2 executable size comparisons ...... 19

Figure 4.4: LAMMPS Lennard-Jones problem execution time comparisons ...... 21

Figure 4.5: LAMMPS Rhodopsin problem execution time comparisons ...... 22

Figure 4.6: LAMMPS executable size comparisons ...... 23

Figure 4.7: HELIUM compile time comparisons ...... 24

Figure 4.8: HELIUM execution time comparisons ...... 25

Figure 4.9: HELIUM executable size comparisons ...... 27

Figure A.1 Preparation jobs before building LLVM and Clang ...... 30

Figure A.2 Load GCC version 4.5.2 to the Ness ...... 31

Figure A.3: How to configure LLVM and Clang on Ness ...... 32

Figure A.4: Preparation jobs for building LLVM-GCC on Ness ...... 33

Figure A.5: How to configure the LLVM-GCC on the Ness ...... 34

Acknowledgements

First, I am thankful to my supervisor Mr Iain Bethune for his guidance and full support throughout this project. Without great help from Mr Iain Bethune, this project would have been a mission impossible for me.

Choosing to study high performance computing at EPCC was the biggest challenge in my life. Fortunately, all the tutors at EPCC had very high-standard teaching skills, rich professional knowledge and expert experience. They provided a very good study environment and gave full support when I had problems. All of this helped me to overcome all the barriers in my studies.

I would like to thank my classmate Jay Chetty, who installed GNU GCC (version 4.5.2) on Ness, so I could use GCC 4.5.2 straightaway in my project for testing. This also helped me to solve a key problem when I built the LLVM-GCC and Clang front-ends, because the native GCC version 4.1.2 on Ness is too old to build clang++.

Last, I have to thank my dear parents, my lovely wife and daughter for their unconditional love. With their love and encouragement, I am able to go through all the difficult times.

Chapter 1

Introduction

In high performance computing, the performance of supercomputers keeps increasing every year. The fastest massively parallel supercomputer in 2011 is the Japanese K supercomputer, which has 548352 cores and a peak performance of 8.77 petaflops [9]. The performance increase in supercomputer hardware can shorten the execution time of scientific simulations, which reduces the research time needed to find answers. However, improving the performance of computer hardware is not the only way; the improvement of compiler technology also provides a lot of benefits.

The development of software and applications is changing fast, but compiler technology has not changed as much. Old code generation technologies are still being used, the design of compilers is not modular and the extensibility of compilers is poor. A modern and flexible compiler is needed to catch up with the changing needs of new applications. An open-source research project called LLVM (Low Level Virtual Machine) [1, 2, 3] began in 2000 under the direction of Vikram Adve and Chris Lattner at the University of Illinois, whose main purpose was to "provide a modern, SSA-based compilation strategy capable of supporting both static and dynamic compilation of arbitrary programming languages" [1]. From the first release of LLVM in 2003 until 2011, twenty versions were released in eight years. In 2005, Chris Lattner was hired by Apple Inc. and the LLVM compiler became a part of Apple's development environment.

One of the advantages of the LLVM compiler is that it provides lifetime-long optimisation for a program. Lifetime-long means that the LLVM compiler tries to provide optimisations for a program at every possible optimisation stage, including compile-time (static optimisation), link-time (interprocedural optimisation), run-time (dynamic optimisation), install-time (machine-dependent optimisation) and idle-time (profile-guided optimisation). Chapter 2 of this thesis introduces the unique technologies used in the LLVM compiler and how the lifetime-long optimisation strategies are implemented.

A wide variety of front-ends is another advantage of the LLVM compiler infrastructure, which increases its language support and its ability to be extended. The LLVM compiler is able to support many different languages, such as Ada, C, C++, Objective-C, Ruby and Python. In section 2.4, we introduce two front-ends of the LLVM compiler: one is LLVM-GCC (version 4.2.1), and the other is Clang.

Chapter 3 introduces the details of the three benchmarks used for performance testing in this project, which are GADGET-2 [6, 10], LAMMPS [7] and HELIUM [8]. These three codes are used for scientific research simulation, and can also be used for benchmarking in high performance computing. Chapter 3 also discusses the tools and methods used for testing and analysis, and how to correctly build the LLVM compiler on Ness.

In Chapter 4, we investigate and compare the performance of LLVM-GCC and Clang with the PGI compiler and the GNU compiler. The three measurements are compile time, execution time and the size of the executable generated. Chapter 5 discusses further work which could be done in this project. The final chapter contains some conclusions about the LLVM compiler.

Chapter 2

Background Theory

2.1 LLVM compiler infrastructure

LLVM is a compiler infrastructure, or back-end, which is able to provide lifetime-long optimisation for programs. In 2002, Chris Lattner stated a definition of LLVM in his master's thesis: "a design and implementation of a compiler infrastructure which supports a unique multi-stage optimization system. This system is designed to support extensive interprocedural and profile-driven optimizations, while being efficient enough for use in commercial compiler systems" [2].

Vikram Adve and Chris Lattner provided a further description of the LLVM compiler in 2004, which is "a compiler framework that aims to make lifelong program analysis and transformation available for arbitrary software, and in a manner that is transparent to programmers" [3]. This statement points out the main purpose of the LLVM compiler infrastructure project and presents the trend of its future development.

2.2 LLVM Intermediate Representation (IR)

One of the goals of the LLVM compiler framework is achieving multi-stage optimisation. This means that the LLVM compiler tries to provide aggressive optimisations throughout the lifetime of a program in five stages, covering compile-time, link-time, run-time, install-time and idle time. To achieve this goal, LLVM needs a powerful Intermediate Representation (IR). Compared to traditional compiler IRs, Chris Lattner et al [3] described that the LLVM IR not only contains a low-level representation as traditional compilers have, but also has high-level type information which is used to support optimisation analyses and transformations. This design is implemented by the LLVM virtual instruction set and the LLVM code representation.

"The use of the LLVM virtual instruction set allows work to be offloaded from link-time to compile-time, speeding up incremental recompilations. Also, because all the components operate on the same representation, they can share implementations of transformations" [3]. The LLVM code representation is one of the features that distinguish the LLVM compiler from other compilers.
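As a minimal illustration of how this IR can be inspected in practice (the file names used here are placeholders, not part of the benchmark builds), a source file can be converted to either the human-readable LLVM assembly or the binary bitcode form with the standard LLVM tools:

# Emit human-readable LLVM assembly (.ll) instead of native assembly
$> clang -O2 -emit-llvm -S hello.c -o hello.ll

# Emit LLVM bitcode (.bc), the binary form consumed by the later stages
$> clang -O2 -emit-llvm -c hello.c -o hello.bc

# The LLVM assembler and disassembler convert between the two forms
$> llvm-as hello.ll -o hello.bc
$> llvm-dis hello.bc -o hello.ll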

2.3 LLVM multi-stage optimisation

This section introduces the multi-stage optimisation of the LLVM compiler. As there is not enough information available about LLVM install-time optimisation, we focus only on the compile-time, link-time, run-time and idle time optimisations of LLVM.

2.3.1 Compile-time optimisation

Compile-time is the amount of time the compiler spends on translating source code into machine code. During this process, the compiler optimizer has opportunities to optimise the code and to gather the information necessary for further optimisation. LLVM-GCC 4.2.1 is fully compatible with GCC 4.2, but the difference is that LLVM-GCC replaces the intermediate representation of GCC 4.2 with the LLVM intermediate representation. Figure 2.1 shows how the LLVM GCC front-end (static compiler) replaces the GCC optimizer and code generator, and uses its own optimizer and code generator to translate source code to the LLVM intermediate representation and generate machine code.

[Figure: GCC 4.2 front-end → GCC optimizer → GCC code generator, versus GCC 4.2 front-end → LLVM-GCC optimizer → LLVM code generator]

Figure 2.1: LLVM IR replaces GCC IR. (Reproduced from Chris Lattner et al [11], 2008.)

During compile time, "each static compiler can perform three key tasks, of which the first and third are optional: (1) Perform language-specific optimizations, e.g., optimizing closures in languages with higher-order functions. (2) Translate source programs to LLVM code, synthesizing as much useful LLVM type information as possible, especially to expose pointers, structures, and arrays. (3) Invoke LLVM passes for global or interprocedural optimizations at the module level. The LLVM optimizations are built into libraries, making it easy for front-ends to use them" [3].

This procedure improves the modularity of the LLVM compiler and provides a better representation for optimising transformations. And also, "this allows the static compiler to perform substantial optimizations at compile time, while still communicating high-level information to the linker" [3].

2.3.2 Link-time optimisation

The main job of the linker is to combine native object files with libraries to generate an executable program. As the LLVM bytecode, which contains high-level information for optimisation, is generated along with the object files at compile-time, the LLVM link-time optimizer can use this information to improve its interprocedural optimisations at link-time.

As Vikram Adve and Chris Lattner stated in 2004: "The design of the compile- and link-time optimizers in LLVM permit the use of a well-known technique for speeding up interprocedural analysis. At compile-time, interprocedural summaries can be computed for each function in the program and attached to the LLVM bytecode. The link-time interprocedural optimizer can then process these interprocedural summaries as input instead of having to compute results from scratch. This technique can dramatically speed up incremental compilation when a small number of translation units are modified." [3].

2.3.3 Run-time optimisation

Some "hot" functions may not be able to be identified and optimised at compile-time, but only once the program is running. The main purpose of the run-time optimizer is to overcome this disadvantage. The run-time optimiser monitors the running program in real time; if "hot" functions are detected, the run-time optimiser will reoptimise them. Chris Lattner described the mechanism in 2004: "when a hot loop region is detected at runtime, a runtime instrumentation library instruments the executing native code to identify frequently-executed paths within that region. Once hot paths are identified, we duplicate the original LLVM code into a trace, perform LLVM optimizations on it, and then regenerate native code into a software-managed trace cache. We then insert branches between the original code and the new native code" [3].

2.3.4 Idle time optimisation

"Some transformations are too expensive to perform at run-time, even given an efficient representation to work with. For these transformations, the run-time optimizer gathers profile information, serializing it to disk. … When idle time is detected on the user's computer, an offline reoptimizer is used to perform the most aggressive profile-driven optimizations to the application." [3].

Chris Lattner et al [2, 3] described that, compared to traditional profile-guided optimizers, the difference of the LLVM offline, idle-time reoptimiser is that it can use the profile information gathered by the run-time optimiser together with the LLVM bytecode to reoptimise and recompile the application, performing aggressive profile-driven interprocedural optimizations without competing with the application for processor cycles.

2.4 LLVM front-ends

This section introduces two front-ends of the LLVM compiler: LLVM-GCC (version 4.2.1) and Clang. The main job of the LLVM front-ends is to translate source code written in different languages into the LLVM IR.

2.4.1 LLVM-GCC 4.2.1

LLVM-GCC is a C, Objective-C and C++ front-end of the LLVM compiler. It is compatible with the GCC compiler and supports the same optimisations as GCC. The LLVM-GCC front-end is also an extension of the current GCC architecture which has a powerful mid-level optimizer and extra link-time interprocedural analysis and optimization capabilities [12]. This feature allows the LLVM-GCC compiler to use the -O4 link-time optimisation, which GCC (version 4.2) does not support.

In this project, we compare the performance of LLVM-GCC (version 4.2.1) with PGI, Clang and GCC (version 4.1.2 and version 4.5.2), and investigate whether the LLVM-GCC compiler can provide better performance in compile time and execution time.

2.4.2 Clang

Clang is a native front-end of the LLVM compiler and a new-generation development tool, which supports the C, C++ and Objective-C languages. As a new compiler front-end, Clang has many new features, such as fast compile times, low memory use, better support for Apple's IDE, expressive diagnostics (which provide very useful error and warning messages, point out exactly where the problem is and highlight the erroneous code in colour), a modular design, and code that is easy to understand and maintain [1, 13].

Fast compile times and low memory use are two of the major goals of Clang. A testing result published on the Clang performance testing website [14] shows that the time taken by Clang is 2.5 times faster than GCC (version 4.0) on Mac OS X when parsing "Carbon.h", and the Abstract Syntax Tree (AST) [15] memory use of Clang is 5 times less than GCC's syntax trees in the space test (for more detail, see [14]).

In this project, we are going to test Clang using the GADGET-2 and LAMMPS benchmarks; more details of these two benchmarks are given in Chapter 3.

2.5 PGI compiler and GNU GCC compiler

2.5.1 PGI compiler

The PGI [4] compiler is developed by the Portland Group and supports the C, C++, Fortran 77, Fortran 90/95 and High Performance Fortran (HPF) programming languages; the newest PGI compiler (version 11.7, 2011) can also support CUDA Fortran and Fortran 2003. As we can see, the PGI compiler fully supports the Fortran programming language, a programming language widely used in high performance computing. HPF is an extension of Fortran 90, which has many advantages for parallel programming compared to C, C++ and Java; for example, useful array operation syntax, improved facilities for numeric computation and data mapping directives. These specific features of HPF make parallel programs easier to write, understand and maintain. This is one of the reasons why the PGI compiler is popular for high performance computing. The PGI compiler provides rich optimisation options, which allow the user to experiment and find suitable optimisations that provide the best performance for a program on different computer architectures and platforms.

The version of the PGI compiler on Ness is 7.0-7, and the optimisation flags used for testing in this project are listed in table 2.1 below:

-pg: Enable profiling with the gprof profiler.

-O1: Enable scheduling within extended basic blocks and some register allocation.

-O2: Enable all the optimisations of level -O1, and add traditional scalar optimisations.

-O3: Enable all the optimisations of levels -O1 and -O2, and add more aggressive code optimisations and scalar replacement optimisations.

-fast: Enable generally optimal flags and set the optimisation level to a minimum of -O2.

Table 2.1: Some standard PGI compiler flags
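As a hedged usage sketch (pgcc and pgf90 are the PGI C and Fortran drivers; the source and output file names are placeholders), these flags are simply passed on the compile line:

# Optimised C build using the PGI "fast" option
$> pgcc -fast -o prog prog.c

# Fortran build at -O3 with gprof instrumentation for profiling
$> pgf90 -O3 -pg -o prog prog.f90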

2.5.2 GNU GCC compiler

The GNU Compiler Collection (GCC) [5] is a free compiler which provides a wide variety of front-ends, including C, C++, Objective-C, Fortran, Java, Ada and Go. The GCC compiler is the default compiler of GNU/Linux, and is also a standard compiler of BSD and similar Unix operating systems.

The two versions of the GCC compiler used in this project are version 4.1.2 and version 4.5.2. The GCC compiler flags used for testing are listed in table 2.2 below:

-pg: Enable profiling with the gprof profiler.

-std=gnu89: Used to select the language standard, such as gnu99 or c99. The default language standard for GCC version 4.1.2 is gnu89. When compiling GADGET-2 with Clang, the -std=gnu89 flag needs to be turned on, because the default setting of Clang is gnu99, which fails to compile GADGET-2.

-O1: Turns on some optimisations which do not greatly increase the compile time.

-O2: Enable all the optimisations of level -O1 plus most further optimisations, except the loop unrolling and function inlining optimisations.

-O3: Enable all the optimisations of levels -O1 and -O2, and also turn on some loop optimisation flags such as -finline-functions and -funswitch-loops.

-ffast-math: Turns on the pre-processor macro __FAST_MATH__, which reduces the strictness of the floating-point semantics; this option may affect the correctness of the testing result.

Table 2.2: Some standard GCC compiler flags
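For comparison, a sketch of the corresponding GCC and Clang invocations (file names are placeholders, and the MPI/GSL/FFTW library flags needed by the real benchmarks are omitted for brevity):

# GCC build at the "fast" level used in Chapter 4
$> gcc -O3 -ffast-math -o bench bench.c -lm

# Clang needs the older gnu89 language standard to compile GADGET-2
$> clang -std=gnu89 -O3 -ffast-math -o bench bench.c -lm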

2.6 GOLD Linker and LLVM gold plugin

The job of a linker is to combine the object files generated by the compiler with libraries to form an executable program. GOLD [16, 17] is a new linker written in C++, developed by Ian Lance Taylor and added to the GNU Binary Utilities (binutils) in 2008 [16]. The goals of the gold linker are to be faster than the GNU linker and to enable link-time (global) optimisation (LTO) [18]. The gold linker can only be used on ELF (Executable and Linkable Format) [19, 20] operating systems such as GNU/Linux [16].

LLVM supports the gold linker via a plugin. The LLVM gold plugin allows LLVM-GCC and Clang to implement the -O4 level of optimisation. In this project, we investigate whether the gold linker can improve the compile time and link-time optimisation performance of LLVM-GCC and Clang.
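A rough sketch of how this might look on the command line (an assumption rather than a tested recipe, since the gold linker was not installed on Ness, as discussed in Chapter 5; the plugin path and file names are placeholders, and the system linker must be gold for the plugin option to be honoured):

# At -O4 the front-end keeps LLVM bitcode inside the object files
$> clang -O4 -c main.c -o main.o
$> clang -O4 -c util.c -o util.o

# Link through the driver, telling the gold linker to load the LLVM plugin
# so that link-time optimisation runs over the whole program
$> clang main.o util.o -o prog -Wl,-plugin,/path/to/LLVMgold.so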

2.7 LibDispatch library

Grand Central Dispatch (GCD) [48] is a technology used in the Mac OS X operating system, which optimises a program to run in parallel on multi-core platforms and speeds up the performance of the program.

A function or a loop in a program can be defined as a "block" [48] in GCD; a "block" is an extension of the C programming language and is implemented via the libdispatch library. When a program is running, GCD manages the execution of "blocks" (e.g. submits blocks to a dispatch queue); if GCD detects available resources it can use in the system, the blocks in the dispatch queue can be submitted to execute concurrently.

Chapter 3

Tools and methods for benchmarking

This chapter introduces the three benchmarks used for performance testing and the testing platform Ness [21]. Section 3.7 describes how to correctly configure and build LLVM-GCC and Clang on Ness.

3.1 Testing platform

The platform used in this project is the EPCC (Edinburgh Parallel Computing Centre) HPC service called Ness. Ness is a shared-memory machine, which has 32 AMD Opteron processors (2.6 GHz) at the back-end and 2 at the front-end. The front-end processors are used for handling system operation, I/O and job requirements; the back-end processors are in charge of computation only. The operating system on Ness is Scientific Linux. All the testing jobs in this project are submitted to the back-end of Ness through the Sun Grid Engine (SGE); this procedure minimises the system noise which would affect the accuracy of the timing.
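A minimal sketch of a back-end job script for Ness (the SGE directives and the executable name are assumptions; the actual Ness job templates should be followed):

#!/bin/bash
# run_bench.sge - submitted with: qsub run_bench.sge
#$ -cwd          # run the job from the current working directory
#$ -V            # export the current environment (compiler/library paths) to the job
./gadget2 param.txt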

3.2 Benchmarks

A benchmark in computing is a program or code, and at the same time it is a tool used to test the performance of software or hardware. The three benchmarks chosen for this project are GADGET-2, LAMMPS and HELIUM; they are freely available codes and are used for scientific research simulation and study. In this section, we introduce these benchmarks in more detail.

3.2.1 GADGET-2

GADGET (GAlaxies with Dark matter and Gas intEracT) [10] was developed by Volker Springel. The first public version of GADGET was released in 2000 and written in ANSI C; the version GADGET-2 used for testing in this project was released in 2005.

GADGET is a cosmological simulation code (TreeSPH [6] code), which can be used for simulations such as cosmological expansion of space, galaxy collision, isolated systems and gas dynamics [6]. The main algorithm of GADGET is called TreePM [6], which combines the hierarchical Tree algorithm [22] and the Particle-Mesh (PM) [23, 24] algorithm to improve the performance of the program. The Tree algorithm is used for computing short-range gravitational forces, and the PM method is used for long-range forces computation [6].

The GADGET-2 simulation needs the Message Passing Interface (MPI) [25] library, the GNU Scientific Library (GSL) [26] and the Fastest Fourier Transform in the West (FFTW) [27] library for compilation. MPI is an application programming interface and standard for message communication on parallel machines. GSL is a numerical library; GADGET-2 uses it for cosmological integrations at the beginning of the simulation. The FFTW library is used for the TreePM algorithm in the GADGET-2 simulation.

GADGET supports the MPI interface, which allows it to run both on serial and parallel machines. As the main purpose of this project is to investigate the performance of the compilers, we only run the simulation on one processor. The settings of the GADGET-2 simulation in this project are provided in appendix C.1.

3.2.2 LAMMPS

LAMMPS (Large-scale Atomic/Molecular Massively Parallel Simulator) is a classical molecular dynamics code; the current LAMMPS is written in C++ and was released in 2004 [7]. Classical molecular dynamics (MD) is a commonly used computational tool for simulating the properties of liquids, solids, and molecules [28, 29, 30]. The LAMMPS code is designed to support the MPI interface, so it can run on a single processor or on massively parallel machines.

Two benchmark problems of LAMMPS are used for testing: the Lennard-Jones (LJ) problem and the Rhodopsin problem. The Lennard-Jones problem simulates an atomic fluid with the Lennard-Jones potential [31]. In this simulation, 32000 atoms are included and each atom has 55 neighbours; the volume (V) and the energy (E) of the system are set to be constant (NVE integration); and the number of timesteps is set to 2600. The setting details are provided in appendix C.2.

The Rhodopsin problem simulates an all-atom rhodopsin protein in a solvated lipid bilayer, which uses the CHARMM (Chemistry at HARvard Macromolecular Mechanics) [32] force field and computes the long-range Coulombics by using the particle-particle particle-mesh (PPPM) [33] method. The total number of atoms in the simulation is 32000, and each atom has 440 neighbours. The number of timesteps is set to 100. The setting details of the Rhodopsin problem simulation are provided in appendix C.3.

3.2.3 HELIUM

The HELIUM [8] code "simulates the interaction of the HELIUM atom with incident laser radiation. It does so by numerically solving the time-dependent Schrodinger equation for the full system." [34]. The HELIUM simulation is one of the benchmarks used to test the hardware performance of UK supercomputers such as HECToR [38]; it is written in Fortran and can run on numbers of processors from 1 to 16000. The HELIUM simulation code can be used for many scientific research topics, such as the interaction of coherent x-rays with the K-shell electrons of matter [35], intense-field interactions with XUV radiation generated by free-electron lasers, and atom-laser interactions at Titanium-Sapphire laser wavelengths [36, 37]. The settings for the HELIUM simulation are provided in appendix C.4.

3.3 Profiling tools

To understand how a program is running, and what factors affect the execution time or performance of a code, we need a tool to gather useful information and analyse what happens inside the code (e.g. how much time is spent running a particular routine). We can then optimise the code based on the profile information to improve its performance. In this project, the purpose of using profiling tools is not to improve the performance of the benchmarks, but to investigate how different compiler optimisations affect the performance of the benchmarks and which compiler can provide better optimisations for them.

Two standard Unix profilers are used for analysis: prof and gprof [39]. These two profilers can both provide information which tells the user how often functions are called, how the total execution time is spent in each section of the code, and hardware-specific information such as cache misses and flop rate. The difference is that the gprof profiler can provide some extra information which shows how the routines call each other.

To use the profiling tools for a program, a compiler flag needs to be added at compile time, which is -pg for gprof. When this flag is enabled, the profiling tools insert sample points into the code under test, which stop the running program briefly to gather the necessary information. This compiler flag is disabled when we gather the compile time and the execution time for the benchmark comparisons, because the profiling tools can affect the accuracy of the testing results.

Clang did not support the gprof profiling tool at the time this thesis was written.
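A minimal sketch of the gprof workflow described above, using GCC (file names are placeholders):

# 1. Build with profiling instrumentation enabled
$> gcc -O3 -pg -o prog prog.c

# 2. Run the program; the samples are written to gmon.out in the working directory
$> ./prog

# 3. Generate the flat profile and call graph from the samples
$> gprof ./prog gmon.out > profile.txt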

3.4 Correctness checking

Correctness checking is an important process for scientific simulation benchmarks, as some high-level compiler optimisations or specific compiler flags may cause incorrect results. Different benchmarks have different methods to check their correctness. Table 3.1 lists the methods for checking the correctness of the three benchmarks used in this project.

BENCHMARKS: Methods of correctness checking

GADGET-2: With the same simulation settings, the result verification is done by checking whether the energy values in the "energy.txt" file generated by the GADGET simulation are the same after each run.

LAMMPS: With the same simulation settings (e.g. the same number of timesteps), the total energy of the whole system should be the same, or the error should be small enough, after each run, no matter which compiler or compiler optimisation flags are used.

HELIUM: The result verification is done by checking whether the total population is very close to 1 after the simulation.

Table 3.1: Benchmarks correctness checking methods
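For example, the GADGET-2 check can be carried out by comparing the energy statistics file produced by two builds (a sketch; the run directory names are placeholders and the location of energy.txt depends on the parameter file):

# Exact comparison of the energy statistics from two runs
$> diff run_gcc/energy.txt run_llvm/energy.txt

# Or inspect the final lines side by side if small rounding differences are acceptable
$> tail -n 1 run_gcc/energy.txt run_llvm/energy.txt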

3.5 MPI wrapper

An MPI wrapper is a useful tool when we are dealing with MPI programs such as GADGET-2, LAMMPS and HELIUM. The MPI environment variables (e.g. the header file path of the MPI library) need to be specified in the Makefile of an MPI program, and on different platforms the setting of the MPI library can be different. The main job of an MPI wrapper is to wrap the correct environment settings of the MPI library around the compiler. Table 3.2 lists the MPI wrappers used in this project.

MPI Wrappers: Languages

mpicc: for C compilers

mpicxx: for C++ compilers

mpif90: for Fortran 90 compilers

Table 3.2: MPI wrappers
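A usage sketch (output names are placeholders, and the library flags required by each benchmark are omitted); the wrapper simply forwards the options to the underlying compiler selected by the MPI environment on the platform:

$> mpicc -O3 -o gadget2 *.c           # C benchmark (GADGET-2)
$> mpicxx -O3 -o lmp_ness *.cpp       # C++ benchmark (LAMMPS)
$> mpif90 -O3 -o helium helium.f90    # Fortran 90 benchmark (HELIUM)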

3.6 Tools of the LLVM compiler

This section lists some tools and compiler flags of LLVM we used in this project.

Tools: Descriptions

llc: LLVM static compiler, used to convert LLVM bitcode into native assembly code.

lli: LLVM interpreter and JIT, which can directly execute programs in LLVM bitcode format.

llvm-link: LLVM linker, used to link LLVM bitcode files together to form a single LLVM bitcode file.

llvm-ld: The main linker for LLVM, which can produce an executable from bitcode.

llvm-bcanalyzer: LLVM bitcode analyser, used to analyse LLVM bitcode and sometimes for debugging.

Table 3.3: Several tools of LLVM

LLVM front-end: Descriptions

llvm-gcc: LLVM-GCC front-end for C

llvm-g++: LLVM-GCC front-end for C++

llvm-gfortran: LLVM-gfortran front-end for Fortran

clang: LLVM native C front-end

clang++: LLVM native C++ front-end

Table 3.4: Several LLVM front-ends

LLVM compiler flags: Descriptions

-emit-llvm: Convert source code to LLVM bitcode or LLVM assembly code.

-c: Only compile or assemble the source code; do not link libraries.

-O: Compiler optimisation level.

Table 3.5: Several LLVM compiler flags
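A hedged sketch of how these tools fit together for a small two-file C program (file names are placeholders):

# Compile each source file to LLVM bitcode rather than a native object
$> clang -O2 -emit-llvm -c a.c -o a.bc
$> clang -O2 -emit-llvm -c b.c -o b.bc

# Merge the bitcode files into a single module
$> llvm-link a.bc b.bc -o prog.bc

# Lower the bitcode to native assembly and let the system compiler finish the link
$> llc prog.bc -o prog.s
$> gcc prog.s -o prog

# Alternatively, execute the merged bitcode directly under the LLVM interpreter/JIT
$> lli prog.bc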

3.7 Issues and solutions for building LLVM and Clang on Ness

LLVM, Clang and LLVM-GCC can be built separately, or Clang can be built with LLVM as a tool. In this project, Clang was built with LLVM, and LLVM-GCC was built separately. There are two key factors for correctly building LLVM on Ness.

1. Host C++ compiler for building LLVM.

2. LLVM environment variable setting.

The difficult part of building LLVM in this project is that the default GCC compiler on Ness (version 4.1.2) could not correctly build LLVM, because LLVM depends heavily on the host C++ compiler for building. The PGI compiler (version 7.0-7) was also tried, but could not correctly build LLVM. The solution in this project was to use GCC 4.5.2 to build LLVM.

The second problem was that LLVM could not correctly find the compiler library paths on Ness. We needed to specify several environment variables for LLVM, which allowed LLVM to find out which host C++ compiler was being used and where the compiler libraries are.

The details of how to correctly install LLVM, Clang and LLVM-GCC on Ness have been provided in appendix A.

Chapter 4

Performance Testing and Analysis

This chapter compares the performance of LLVM-GCC (version 4.2.1), Clang, PGI and GCC (version 4.1.2 and version 4.5.2) on each benchmark. We compare the compile time, execution time and executable size for each compiler with different optimisation levels, and analyse and discuss the factors which affect the performance. Each benchmark was compiled three times by each compiler under the different optimisation flags and run three times. The testing result used for comparison is the average of the three runs, since this reduces the error produced by system noise. A lower time means better performance, and the fastest time is highlighted.

The common compiler optimisation flags are -O1, -O2 and -O3, but different compilers have their own specific optimisation flags for different purposes. In order to make the comparison easier, the "fast" optimisation in the comparison figures means the compiler-specific optimisation flags, or combinations of several compiler flags, which may provide better optimisation for the benchmarks. Table 4.1 lists the compiler flags of each compiler used for benchmarking.

Optimisation level   PGI     GCC (4.1.2 & 4.5.2)   LLVM-GCC          Clang

O1                   -O1     -O1                   -O1               -O1

O2                   -O2     -O2                   -O2               -O2

O3                   -O3     -O3                   -O3               -O3

Fast                 -fast   -O3 -ffast-math       -O3 -ffast-math   -O3 -ffast-math

Table 4.1: The optimisation flags of each compiler for comparisons

4.1 GADGET-2 testing result

4.1.1 Compile time comparisons

Figure 4.1: GADGET-2 compile time comparisons

At O1 optimisation, the compile time of GCC 4.1.2 is the fastest, which is about 2 times faster than PGI and nearly 1.6 times faster than Clang. The compile time of GCC 4.1.2 and 4.5.2 increases when the optimisation level increases. The reason could be that the -O3 flag turns on more optimisations, and the time spent on these extra optimisations makes the compile time longer. The interesting thing is that while the compile time of GCC varies with the optimisation flags, with more time taken at higher optimisation levels, LLVM-GCC and Clang behave differently from GCC: the time of LLVM-GCC at O3 and fast is even slightly faster than the time at O2.

At high-level optimisation (O3 and fast), LLVM-GCC is faster than all the other compilers; it is about 9% faster than GCC 4.1.2, and nearly 31% faster than GCC 4.5.2. The parser LLVM-GCC uses is the GCC 4.2 parser, but the optimiser and code generator are replaced by LLVM's own optimiser and code generator. This means that the LLVM-GCC optimiser and code generator work more efficiently than GCC 4.1.2 and 4.5.2 when compiling GADGET-2 with the -O3 flag.

4.1.2 Execution time comparisons

Figure 4.2: GADGET-2 execution time comparisons

In general, the best performance of each compiler is achieved at the -O3 or fast optimisation. The fastest execution time is achieved by GCC 4.1.2 at the fast optimisation. The second fastest is LLVM-GCC at the fast optimisation, which is approximately 3.7% slower than GCC 4.1.2. The performance of PGI and GCC 4.5.2 is close, with PGI slightly faster than GCC 4.5.2. The fastest execution time of Clang is about 7.3% slower than the fastest execution time of GCC 4.1.2.

For further analysis, profiling information from the executables generated by each compiler was collected. As the profiling information is extremely detailed, several critical pieces of information were selected for comparison. We only focus on comparing the optimisation level with the fastest execution time for each compiler, and the routines which occupy a large amount of the execution time. The "% time" in table 4.2 shows the percentage of the total execution time spent in the routine; the "self seconds" is the number of seconds spent running the routine.

Routine name                    PGI (fast)       GCC-4.1.2 (fast)  GCC-4.5.2 (O3)   LLVM-GCC (fast)
                                % time / self s  % time / self s   % time / self s  % time / self s
force_treeevaluate_shortrange   50.38% / 88.59   52.52% / 80.35    53.36% / 88.35   53.94% / 87.22
ngb_treefind_variable           12.89% / 22.67   14.44% / 22.09    13.91% / 23.03   14.11% / 22.81
ngb_treefind_pairs              11.91% / 20.95   11.64% / 17.81    11.43% / 18.93   11.36% / 18.37

Table 4.2: GADGET-2 profiling comparisons

According to the profiling information in table 4.2, over 50% of the execution time is spent in the force_treeevaluate_shortrange subroutine, which is the kernel function of the TreePM algorithm for computing short-range gravitational forces. The total time of ngb_treefind_variable and ngb_treefind_pairs together is nearly 25% of the total execution time. Obviously, these three subroutines are the "hot" targets we would like the compilers to optimise.

GCC 4.1.2 compiler provides the best optimisations on the force_treeevaluate_shortrange subroutine, which is approximately 9% faster than other compilers. GCC 4.1.2 also provides slightly better performance on the ngb_treefind_variable and the ngb_treefind_pairs subroutines compared to other compilers. LLVM-GCC is the second fastest, the optimisations for force_treeevaluate_shortrange routine are slightly better than GCC 4.5.2 and PGI. As Clang does not support gprof, it is not included in the profile comparisons.

4.1.3 Executable size comparisons

Figure 4.3 GADGET-2 executable size comparisons

In figure 4.3, the file size tends to increase with higher optimisation levels, because the extra optimisations enabled at higher optimisation levels (e.g. loop unrolling) may increase the file size.

The executable size of GCC 4.5.2 at O1 is the smallest, but GCC 4.1.2 is slightly better than GCC 4.5.2 at O2, O3 and fast. Both GCC compilers give good results compared to the other compilers. The executable sizes of LLVM-GCC and Clang are the biggest; the sizes at O1 of LLVM-GCC and Clang are both nearly 3.5 times bigger than the size of GCC 4.5.2, and the others are about 1.6 times larger than GCC 4.5.2. The reason why the file size of LLVM-GCC and Clang at O1 is so high is not clear. The file sizes of PGI lie between GCC and LLVM.

4.1.4 Sum up

In the compile time testing, LLVM-GCC is the best at the O2, O3 and fast options, and Clang is slower than LLVM-GCC and GCC 4.1.2. The performances of the compilers in the execution time test are nearly the same, but the best execution time is achieved by GCC 4.1.2. LLVM-GCC and Clang both produce larger executables than the other compilers.

4.2 LAMMPS testing result

4.2.1 Compile time comparisons

Table 4.3: LAMMPS compile time comparisons

In the test, GCC 4.5.2 is the fastest at O1 optimisation. However, GCC 4.5.2 is nearly 43 seconds slower than LLVM-GCC at O3, which means LLVM-GCC is about 14.7% faster. LLVM-GCC is about 33% faster than PGI at O3 and fast. The performance of Clang at the O3 and fast optimisations is slower than LLVM-GCC, but faster than PGI, GCC 4.1.2 and GCC 4.5.2. The PGI compiler is the slowest of the compilers; its compile time at the fast option is nearly 1.9 times slower than LLVM-GCC.

The test result shows that the compile speed of LLVM-GCC and Clang is better than PGI and GCC (4.1.2 and 4.5.2) at the O2, O3 and fast optimisations. The performances of GCC 4.1.2 and GCC 4.5.2 are similar.

4.2.2 Lennard-Jones problem execution time comparisons

Figure 4.4: LAMMPS Lennard-Jones problem execution time comparisons

The fastest execution time is achieved by LLVM-GCC at the fast option; there is about a 2.6% performance increase compared to GCC 4.5.2 (the second fastest). The performances of GCC 4.5.2 and GCC 4.1.2 are close, with GCC 4.5.2 slightly faster than GCC 4.1.2 at the fast option. The result of Clang is not good compared to LLVM-GCC, GCC 4.1.2 and GCC 4.5.2, but it is overall faster than PGI except for the execution time at the fast option. Table 4.4 below shows the profiling result for each compiler.

Routine name                            PGI (fast)        GCC-4.1.2 (O3)    GCC-4.5.2 (fast)  LLVM-GCC (fast)
                                        % time / self s   % time / self s   % time / self s   % time / self s
LAMMPS_NS::PairLJCut::compute           75.95% / 127.06   81.67% / 121.88   81.15% / 118.97   79.44% / 113.53
LAMMPS_NS::Neighbor::half_bin_newton    6.78% / 11.16     7.56% / 11.28     7.82% / 11.09     8.59% / 12.28

Table 4.4: Lennard-Jones problem profiling comparisons

One thing that needs to be mentioned here is that the PGI profiling information produced by gprof is different from the other compilers in the LAMMPS simulation; the reason is not clear. The timing of the LAMMPS_NS::PairLJCut::compute routine is spread over its sub-functions, so we need to sum up the execution time of each sub-function as the execution time of the LAMMPS_NS::PairLJCut::compute routine to compare with the other compilers in table 4.4. In table 4.5, the timing result of the PGI compiler on the LAMMPS_NS::PairLJCharmmCoulLong::compute routine uses the same approach as in table 4.4. The timing of each sub-function is highlighted in the PGI profiling results in appendices D.2.1 and D.2.5.

The LAMMPS_NS::PairLJCut::compute subroutine is the main function used for the Lennard-Jones problem; over 75% of the entire execution time is spent in it. The reason LLVM-GCC is faster than the other compilers is that it provides better optimisation of the LAMMPS_NS::PairLJCut::compute subroutine, which is about 4.5% faster than GCC 4.5.2 (the second fastest). On the subroutine LAMMPS_NS::Neighbor::half_bin_newton, however, GCC 4.5.2 provides slightly better performance than LLVM-GCC.

4.2.3 Rhodopsin problem execution time comparisons

Figure 4.5: LAMMPS Rhodopsin problem execution time comparisons

Overall, the performance of each compiler is close at the fast option. The best performance is provided by GCC 4.5.2 at O2, and GCC 4.1.2 is the second fastest. LLVM-GCC, Clang and PGI are approximately 2.2% slower than GCC 4.5.2. It is interesting that the performances of GCC (4.1.2 and 4.5.2) at the O2 option are both slightly faster than the performance at O3 and fast; the extra optimisations enabled at O3 and fast do not improve the performance.

Routine name                                 PGI (fast)       GCC-4.1.2 (O2)   GCC-4.5.2 (O2)   LLVM-GCC (fast)
                                             % time / self s  % time / self s  % time / self s  % time / self s
LAMMPS_NS::PairLJCharmmCoulLong::compute     63.32% / 59.12   72.95% / 67.38   74.78% / 66.97   70.38% / 64.12
LAMMPS_NS::Neighbor::half_bin_newton         9.56% / 8.92     10.50% / 9.7     10.67% / 9.56    11.42% / 10.40

Table 4.5 Rhodopsin problem profiling comparisons

The subroutine LAMMPS_NS::PairLJCharmmCoulLong::compute is the kernel function for computing the long-range Coulombics in the Rhodopsin problem; up to 63% of the execution time is spent in it. GCC 4.5.2 and GCC 4.1.2 provide nearly the same performance on the LAMMPS_NS::PairLJCharmmCoulLong::compute subroutine, but they are both slower than LLVM-GCC. The LAMMPS_NS::Neighbor::half_bin_newton subroutine is the second "hot" target in the Rhodopsin problem simulation; PGI does better optimisation of it compared to GCC (4.1.2 and 4.5.2) and LLVM-GCC.

4.2.4 Executable size comparisons

Figure 4.6: LAMMPS executable size comparisons

The most striking aspect of figure 4.6 is the very large executables produced by PGI, which are about 2 times bigger than all the others. The results of GCC 4.1.2, GCC 4.5.2 and LLVM-GCC are close, but Clang produces the smallest file size of all.

4.2.5 Sum up

Overall, Clang is the best in the compile time and executable size testing. In the Lennard-Jones problem simulation, the best execution time is provided by LLVM-GCC, but Clang is worse than GCC (4.1.2 and 4.5.2). In the Rhodopsin problem simulation, the best execution times of the compilers are very close, and the fastest execution time is achieved by GCC 4.5.2 at the O2 option.

4.3 HELIUM testing result

4.3.1 Compile time comparisons

Figure 4.7: HELIUM compile time comparisons

In the test, the performance of GCC 4.1.2 is generally better than the other compilers. The performance of PGI at O1, O2 and O3 is nearly the same as the performance of LLVM-GCC, whereas PGI is about 26% slower than LLVM-GCC at the fast option. GCC 4.5.2 takes more time at the O3 and fast options compared to all the other compilers.

4.3.2 Execution time comparisons

Figure 4.8: HELIUM execution time comparisons

Overall, PGI is the best in this test; its execution time dramatically reduces from 194.457 seconds (at O3) to 138.203 seconds when the PGI -fast compiler flag is enabled, which is approximately a 28.8% performance increase. The behaviour of GCC 4.5.2 and LLVM-GCC is similar: the execution times are over 260 seconds at O1, O2 and O3, but only about 200 seconds at the fast option (when the -ffast-math flag is enabled). Even though the performance of GCC 4.5.2 and LLVM-GCC improves dramatically at the fast option, they are still about 31% slower than PGI. We will investigate why the -fast compiler flag has such a deep impact on the performance of PGI, and also compare what causes the difference between PGI and the other compilers in the HELIUM simulation. First, we compare the profiling information of PGI and analyse which routines are affected when the -fast flag is enabled.

Compiler: PGI

Routine name                                          O3: % time / self s   fast: % time / self s
local_ham_matrix_incr_result_w_1_over_r12_terms_      21.91% / 46.54        29.66% / 46.12
local_ham_matrix_incr_with_2nd_deriv_in_r2_           11.31% / 24.02        2.19% / 4.65
local_ham_matrix_incr_with_1st_deriv_op_in_r2_        10.25% / 21.78        7.66% / 11.91
local_ham_matrix_incr_with_1st_deriv_op_in_r1_        10.25% / 21.76        11.79% / 18.33
local_ham_matrix_incr_result_w_1_over_r12_ibm_        4.71% / 10.01         0.63% / 0.98

Table 4.6: PGI profiling comparison in HELIUM

The time spent in the local_ham_matrix_incr_result_w_1_over_r12_terms subroutine is the highest in table 4.6. The execution time of this subroutine is nearly the same at both levels, but its share of the whole execution time at the fast option is about 8% higher than at O3. This means the fast option provides better optimisations for the other routines compared to O3. The performance of the local_ham_matrix_incr_with_2nd_deriv_in_r2 and local_ham_matrix_incr_result_w_1_over_r12_ibm routines both increase dramatically at the fast option. The fast option is about 4 times faster than O3 on local_ham_matrix_incr_with_2nd_deriv_in_r2, whose share of the execution time drops from 11.31% (O3) to 2.19% (fast). The fast option is nearly 9 times faster than O3 on local_ham_matrix_incr_result_w_1_over_r12_ibm, whose share drops from 4.71% (O3) to only 0.63% (fast). This result shows that the -fast compiler flag of PGI can dramatically improve the performance of the HELIUM simulation. The profiling information of the other routines is not listed in table 4.6, but is provided in appendices D.3.1 and D.3.2.

In table 4.7, we compare the profiling information of each compiler further. The local_ham_matrix__incr_result_w_1_over_r12_terms subroutine does not appear in the LLVM-GCC profiling result. By examining the HELIUM code, we found that the subroutine hamiltonians__init_w_atomicham_x_psi calls the subroutine local_ham_matrix__incr_result_w_1_over_r12_terms; the reason it does not show up is probably that the local_ham_matrix__incr_result_w_1_over_r12_terms subroutine is being inlined by the LLVM compiler.

Routine name                                        PGI (fast)       GCC-4.1.2 (fast)  GCC-4.5.2 (fast)  LLVM-GCC (fast)
                                                    % time / self s  % time / self s   % time / self s   % time / self s
local_ham_matrix__incr_result_w_1_over_r12_terms    29.66% / 46.12   32.58% / 65.08    41.88% / 81.54    N/A
local_ham_matrix_incr_with_1st_deriv_op_in_r1_      11.79% / 18.33   12.99% / 25.96    11.63% / 22.64    11.56% / 21.81
local_ham_matrix_incr_with_1st_deriv_op_in_r2_      7.66% / 11.91    9.85% / 19.67     7.87% / 15.32     7.88% / 14.87
hamiltonians__init_w_atomicham_x_psi                2.99% / 4.65     3.01% / 6.02      2.46% / 4.79      42.74% / 80.63
propagators__arnoldi_propagate                      5.72% / 8.89     7.37% / 14.73     6.82% / 13.27     23.17% / 43.72

Table 4.7: HELIUM profiling comparisons

Most of the execution time of PGI, GCC 4.1.2 and GCC 4.5.2 is spent in the subroutine local_ham_matrix__incr_result_w_1_over_r12_terms, and PGI provides better optimisation of it. On local_ham_matrix_incr_with_1st_deriv_op_in_r1 and local_ham_matrix_incr_with_1st_deriv_op_in_r2, PGI is still better than the others, and LLVM-GCC is the second fastest. LLVM-GCC spends 42.74% of the total execution time in the hamiltonians__init_w_atomicham_x_psi routine, where the other compilers spend only about 3% of the total execution time. LLVM-GCC spends 23.17% of the total execution time in the propagators__arnoldi_propagate routine, where the other compilers spend only a few per cent of the time. Overall, the PGI compiler definitely provides the best optimisation for this code.

4.3.3 Executable size comparisons

Figure 4.9: HELIUM executable size comparisons

Figure 4.9 shows that the executable sizes tend to increase with the level of optimisation as seen before. The executable size generated by PGI is still the biggest. GCC 4.1.2 and GCC 4.5.2 have similar results. LLVM-GCC is slightly better than PGI.

4.3.4 Sum up

Even though the compile time and executable size produced by PGI are not the best in the HELIUM testing, the execution time of the PGI-compiled executable is much faster than that of the other compilers when the -fast compiler flag is turned on. The performances of GCC 4.1.2 and LLVM-GCC are close in the compile time and execution time tests.

Chapter 5

Further Works

One of the tests we wanted to do in this project was to investigate the performance of LLVM-GCC and Clang using link-time optimisation. To allow LLVM to use LTO, the "gold" linker needed to be installed on Ness to replace the GNU default linker. However, the GNU default linker on Ness cannot simply be replaced, so we would need to compile the source code of the benchmarks and then manually invoke the gold linker to link all the object files and libraries into an executable. This requires us to modify the Makefiles of the benchmarks and make some large changes. This is difficult to implement and would need extra time.

As Clang does not support the gprof profiler, we cannot gather profiling information from Clang-compiled programs for further comparison and analysis with the other compilers. One alternative option is to install a profiler called google-perftools cpuprofiler [40] on Ness, which supports Clang. However, the time required for setting up new software was not allowed for by the project schedule, so we had to forgo the profiling of Clang.

In the project, we also wanted to test the Just-In-Time (JIT) [47] optimisation of the LLVM compiler. As the HELIUM code has only a single source file, it was chosen for this test. The HELIUM code can be successfully converted from source code to LLVM bitcode by using the -emit-llvm and -c flags, but it could not be executed by the LLVM interpreter "lli", as the "main" entry function could not be found. This issue was solved by adding the option "-entry-function=MAIN__", whereas another unknown issue then occurred.
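A sketch of the attempted workflow (assuming llvm-gfortran accepts the same -emit-llvm and -c flags as the C front-ends, and that lli's -entry-function option is given the Fortran main symbol; file names are placeholders):

# Convert the single HELIUM source file into LLVM bitcode
$> llvm-gfortran -emit-llvm -c helium.f90 -o helium.bc

# Execute the bitcode under lli; Fortran programs have no C-style "main",
# so the entry point must be named explicitly
$> lli -entry-function=MAIN__ helium.bc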

Grand Central Dispatch (GCD) was one of the targets we wanted to investigate. Because we only spared one week for it in the original work plan and extra time was needed, it had to be cancelled.

Chapter 6

Conclusions

This dissertation investigated and compared the performance of LLVM-GCC and Clang with PGI, GCC 4.1.2 and GCC 4.5.2 via three scientific benchmarks used in high performance computing.

In the GADGET-2 test, LLVM-GCC shows its advantage in the compile time comparison, which benefits from the combination of the GCC parser and the LLVM code generator. Clang does not show its potential in this test. However, Clang performs best in the compile time and executable size comparisons in the LAMMPS benchmarking. The performance of LLVM-GCC for the Lennard-Jones problem simulation is good. For the Rhodopsin problem simulation, the execution time of each compiler is very close, but GCC 4.5.2 is slightly better than the others.

In the HELIUM benchmarking, PGI provides much better performance than the other compilers. The reason is that the performance of HELIUM is dominated by one key routine, which PGI optimises particularly well.

The LLVM compiler is able to reduce compile time and produces small executables; the run-time performance of scientific simulation programs compiled with LLVM is good, but it still has some way to go to match existing HPC compilers. Link-time optimisation (-O4) may improve the results for LLVM; however, we have not been able to test it in this project.

Appendix A

How to build LLVM and Clang on Ness

This section introduces how to correctly build the LLVM compiler and the Clang front-end together on Ness.

A.1 Preparation jobs

• Get the LLVM 2.9 source code and the Clang source code from the LLVM download webpage (http://www.llvm.org/releases/download.html#2.9). We strongly recommend using the LLVM release version rather than a version checked out via SVN, because the SVN version is under development and may cause unpredictable issues.

• Copy and unpack the LLVM source code and the Clang source code to your working directory. Here, we assume the path of the working directory on Ness is "/work/s02/s1000/".

$> cd /work/s02/s1000/

$> tar -vxf llvm-2.9.tgz

$> mv llvm-2.9 llvm

$> tar -vxf clang-2.9.tgz

$> mv clang-2.9 clang

$> mv ./clang /work/s02/s1000/llvm/tools/

$> mkdir llvm-build

$> mkdir local

Figure A.1 Preparation jobs before building LLVM and Clang

After the preparation jobs in figure A.1 have been done, there are three folders in the path "/work/s02/s1000/": "llvm", "llvm-build" and "local".

• "llvm" folder – contains the LLVM source files.

• "llvm-build" folder – the folder in which we configure and build LLVM.

• "local" folder – used for installing the libraries and tools of the locally built software.

As we want to build the Clang front-end as one of the tools of the LLVM compiler, the "clang" folder is moved into the "/work/s02/s1000/llvm/tools/" directory (the folder must be named "clang"). When the LLVM compiler is configured and built, it picks up Clang automatically.

A.1.1 Set up GCC version 4.5.2 on Ness

In this project, the GCC version 4.5.2 compiler is used for building LLVM and Clang. We assume GCC 4.5.2 is installed at the path "/work/s02/s1000/gcc-4.5.2-install".

$> export PATH=/work/s02/s1000/gcc-4.5.2-install/bin:$PATH

$> export LD_LIBRARY_PATH=/work/s02/s1000/gcc-4.5.2-install/lib64:$LD_LIBRARY_PATH

$> export LD_LIBRARY_PATH=/work/s02/s1000/cloog-ppl/lib:$LD_LIBRARY_PATH

$> export LD_LIBRARY_PATH=/work/s02/s1000/ppl/lib:$LD_LIBRARY_PATH

$> export LD_LIBRARY_PATH=/work/s02/s1000/gmp/lib:$LD_LIBRARY_PATH

$> export LD_LIBRARY_PATH=/work/s02/s1000/mpc/lib:$LD_LIBRARY_PATH

$> export LD_LIBRARY_PATH=/work/s02/s1000/mpfr/lib:$LD_LIBRARY_PATH

Figure A.2 Load GCC version 4.5.2 on Ness

Figure A.2 shows how to load GCC 4.5.2 and the libraries it needs on Ness. As GCC 4.5.2 is not the default compiler on Ness, we need to load it by hand before configuring LLVM and Clang. CLooG-PPL [41], PPL [42], GMP [43], MPC [44] and MPFR [45] are the libraries needed by GCC 4.5.2. The GMP and MPFR libraries are also needed if we want the LLVM-GCC compiler to support Fortran.

A.1.2 Configure and build LLVM and Clang on Ness

Before configuring and building LLVM and Clang, check which version of GCC is in use with the commands below:

$> which g++

$> g++ -v

The first command returns the path of the compiler; in this case, the path of the GCC 4.5.2 compiler, "/work/s02/s1000/gcc-4.5.2-install/bin/g++", should be shown. The second command shows which version of GCC is in use.
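The output should look roughly as follows (the exact version banner depends on how GCC 4.5.2 was configured, and most of the verbose output is omitted here):

$> which g++
/work/s02/s1000/gcc-4.5.2-install/bin/g++
$> g++ -v
...
gcc version 4.5.2 (GCC)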

Now we can configure and build LLVM and Clang.

1. First, LLVM needs to know which compiler will be used for building.

$> export CXX=/work/s02/s1000/gcc-4.5.2-install/bin/g++

2. Second, the location of the bitcode libraries needs to be provided; otherwise the LLVM linking tools may not be able to find them. This environment variable setting is very important when building LLVM on a platform with several compilers installed, such as Ness.

$> export LLVM_LIB_SEARCH_PATH=/work/s02/s1000/gcc-4.5.2-install/lib

3. Go into the build directory and configure LLVM. The LLVM source is in the directory "/work/s02/s1000/llvm".

$> cd /work/s02/s1000/llvm-build

$> ./../llvm/configure --prefix=/work/s02/s1000/local --enable-assertions \
   --with-cxx-include-root=/work/s02/s1000/gcc-4.5.2-install/include/c++/4.5.2 \
   --enable-optimized --enable-shared --enable-jit

Figure A.3: How to configure LLVM and Clang on Ness

The "--with-cxx-include-root" variable needs to be specified. If the full path is not provided (e.g. only "/work/s02/s1000/gcc-4.5.2-install/include"), LLVM cannot find the right GCC header files, such as "iostream" in this case. The "--prefix" variable specifies the path where LLVM is installed. The "--enable-assertions" option is for error checking. The "--enable-optimized" option reduces the build time and the size of the generated files.

4. Build the LLVM

$> make

5. Install the LLVM

$> make install

As Clang was copied into the LLVM tools directory before building, Clang is automatically built and installed along with LLVM.
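As a quick check that the installation works (a sketch only; hello.c is a placeholder test file), the freshly installed tools can be put on the PATH and exercised:

$> export PATH=/work/s02/s1000/local/bin:$PATH
$> llvm-config --version                  # should report 2.9
$> clang --version
$> clang -O2 hello.c -o hello && ./hello  # simple smoke test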

A.2 Build LLVM-GCC (version 4.2.1) on Ness

This section introduces how to configure and build the LLVM-GCC on the Ness.

A.2.1 Preparation jobs

• Get the LLVM-GCC 4.2 front-end source code from the LLVM download webpage (http://www.llvm.org/releases/download.html#2.9). Here again, we strongly recommend using the release version, not a version checked out via SVN.

• Copy and unpack the LLVM-GCC source code to your working directory. In this project, the path of the working directory on Ness is "/work/s02/s1000/".

$> cd /work/s02/s1000/

$> tar -vxf llvm-gcc-4.2-2.9.source.tgz

$> mv llvm-gcc-4.2-2.9.source llvm-gcc

$> mkdir llvm-gcc-build

Figure A.4: Preparation jobs for building LLVM-GCC on Ness

In figure A.4, the LLVM-GCC source code is unpacked and moved to the "llvm-gcc" folder. The "llvm-gcc-build" folder is created for building the LLVM-GCC.

A.2.2 Configure and build LLVM-GCC on Ness

To avoid the issues caused by the older GCC compiler on Ness, we again use GCC 4.5.2 to build LLVM-GCC. See section A.1.1 for details of how to set up GCC 4.5.2 on Ness.

1. First, LLVM-GCC needs to know which compiler will be used for building.

$> export CXX=/work/s02/s1000/gcc-4.5.2-install/bin/g++

$> export CC=/work/s02/s1000/gcc-4.5.2-install/bin/gcc

2. To allow LLVM-GCC to support Fortran, three extra libraries need to be installed on Ness: GMP, MPFR and LIBICONV [46]. We assume the libraries are installed in the directories "/work/s02/s1000/gmp", "/work/s02/s1000/mpfr" and "/work/s02/s1000/libiconv" on Ness. Figure A.5 shows how to configure the LLVM-GCC.

$> cd llvm-gcc-build

$> ./../llvm-gcc/configure --prefix=/work/s02/s1000/local --program-prefix=llvm- \
   --enable-llvm=/work/s02/s1000/llvm-build --enable-languages=c,c++,fortran \
   --enable-checking --disable-bootstrap --disable-multilib \
   --with-gmp=/work/s02/s1000/gmp --with-mpfr=/work/s02/s1000/mpfr \
   --libiconv-prefix=/work/s02/s1000/libiconv

Figure A.5: How to configure LLVM-GCC on Ness

The "--enable-llvm" variable is an important factor in building LLVM-GCC: it specifies the LLVM path (the path where LLVM was built, not the source code path or the installation path of LLVM) to the LLVM-GCC compiler.

3. Build the LLVM-GCC

$> make

4. Install the LLVM-GCC

$> make install
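Because of the --program-prefix=llvm- option, the installed drivers should be named llvm-gcc, llvm-g++ and llvm-gfortran. A quick smoke test could look like the following (hello.f90 is a placeholder test file):

$> export PATH=/work/s02/s1000/local/bin:$PATH
$> llvm-gcc --version
$> llvm-gfortran -O3 hello.f90 -o hello_f && ./hello_f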

Appendix B

Performance Timing Results

B.1 GADGET-2 simulation results

Benchmark: GADGET-2 - Average Compile Time Comparisons (sec)
Optimisation Level   PGI-7.0-7   GCC-4.1.2   GCC-4.5.2   LLVM-GCC   Clang-2.9
O1                   11.561      5.838       6.661       6.691      9.680
O2                   12.306      7.164       8.365       7.128      9.522
O3                   11.342      7.559       10.102      6.858      9.610
fast                 13.082      7.579       10.013      6.890      9.595
Table B.1 GADGET-2 compile time results

Benchmark: GADGET-2 - Average Execution Time Comparisons (sec)
Optimisation Level   PGI-7.0-7   GCC-4.1.2   GCC-4.5.2   LLVM-GCC   Clang-2.9
O1                   229.400     177.905     184.814     167.489    180.352
O2                   178.317     167.766     171.152     170.166    174.593
O3                   175.411     166.524     170.947     169.781    172.935
fast                 170.258     161.095     170.991     167.333    173.051
Table B.2 GADGET-2 execution time results

Benchmark: GADGET-2 - Size of Executable Comparisons (kb)
Optimisation Level   PGI-7.0-7   GCC-4.1.2   GCC-4.5.2   LLVM-GCC   Clang-2.9
O1                   2148.005    1713.812    1710.483    6113.444   6113.233
O2                   2151.730    1714.068    1722.291    2823.856   2816.574
O3                   2151.730    1727.003    1766.186    2824.912   2819.934
fast                 2173.034    1727.253    1759.406    2825.297   2819.934
Table B.3 GADGET-2 executable size results


Benchmark: GADGET-2   Compiler: PGI   Compile Time (sec)
Optimisation Flags   Run 1    Run 2    Run 3    Average
O1                   11.567   11.303   11.813   11.561
O2                   12.001   12.435   12.481   12.306
O3                   11.214   11.241   11.572   11.342
fast                 13.143   13.043   13.061   13.082
Table B.4 PGI compile time on GADGET-2

Benchmark: GADGET-2   Compiler: PGI   Execution Time (sec)
Optimisation Flags   Run 1     Run 2     Run 3     Average
O1                   229.654   228.973   229.59    229.400
O2                   178.184   178.344   178.424   178.317
O3                   175.712   175.288   175.232   175.411
fast                 171.211   168.273   171.289   170.258
Table B.5 PGI execution time on GADGET-2

Benchmark: GADGET-2   Compiler: GCC-4.1.2   Compile Time (sec)
Optimisation Flags   Run 1   Run 2   Run 3   Average
O1                   5.89    5.797   5.827   5.838
O2                   7.175   7.203   7.113   7.164
O3                   7.535   7.567   7.575   7.559
O3 & ffast-math      7.555   7.529   7.654   7.579
Table B.6 GCC-4.1.2 compile time on GADGET-2

Benchmark: GADGET-2   Compiler: GCC-4.1.2   Execution Time (sec)
Optimisation Flags   Run 1     Run 2     Run 3     Average
O1                   177.292   180.401   176.023   177.905
O2                   168.394   169.12    165.777   167.766
O3                   165.645   167.226   166.701   166.524
O3 & ffast-math      160.703   161.271   161.31    161.095
Table B.7 GCC-4.1.2 execution time on GADGET-2

Benchmark: GADGET-2   Compiler: GCC-4.5.2   Compile Time (sec)
Optimisation Flags   Run 1    Run 2    Run 3    Average
O1                   6.807    6.567    6.609    6.661
O2                   8.353    8.355    8.387    8.365
O3                   10.086   10.117   10.102   10.102
O3 & ffast-math      10.034   10.024   9.982    10.013
Table B.8 GCC-4.5.2 compile time on GADGET-2

Benchmark: GADGET-2   Compiler: GCC-4.5.2   Execution Time (sec)
Optimisation Flags   Run 1     Run 2     Run 3     Average
O1                   186.972   183.277   184.192   184.814
O2                   175.149   169.305   169.001   171.152
O3                   173.408   168.267   171.166   170.947
O3 & ffast-math      171.142   170.901   170.931   170.991
Table B.9 GCC-4.5.2 execution time on GADGET-2

Benchmark: GADGET-2   Compiler: LLVM-GCC   Compile Time (sec)
Optimisation Flags   Run 1   Run 2   Run 3   Average
O1                   6.671   6.639   6.764   6.691
O2                   7.391   6.976   7.016   7.128
O3                   6.77    6.961   6.842   6.858
O3 & ffast-math      6.686   7.100   6.884   6.890
Table B.10 LLVM-GCC compile time on GADGET-2

Benchmark: GADGET-2   Compiler: LLVM-GCC   Execution Time (sec)
Optimisation Flags   Run 1     Run 2     Run 3     Average
O1                   166.772   169.057   166.639   167.489
O2                   171.335   168.654   170.510   170.166
O3                   168.515   170.424   170.405   169.781
O3 & ffast-math      165.451   168.351   168.197   167.333
Table B.11 LLVM-GCC execution time on GADGET-2

Benchmark: GADGET-2   Compiler: Clang-2.9   Compile Time (sec)
Optimisation Flags   Run 1   Run 2   Run 3   Average
O1                   9.82    9.589   9.631   9.680
O2                   9.578   9.452   9.535   9.522
O3                   9.655   9.595   9.58    9.610
O3 & ffast-math      9.576   9.63    9.58    9.595
Table B.12 Clang compile time on GADGET-2

Benchmark: GADGET-2   Compiler: Clang-2.9   Execution Time (sec)
Optimisation Flags   Run 1     Run 2     Run 3     Average
O1                   179.270   182.437   179.348   180.352
O2                   173.064   176.925   173.791   174.593
O3                   172.895   173.006   172.903   172.935
O3 & ffast-math      173.294   173.581   172.277   173.051
Table B.13 Clang execution time on GADGET-2

B.2 LAMMPS simulation results

Benchmark: LAMMPS - Average Compile Time Comparisons (sec)
Optimisation Level   PGI-7.0-7   GCC-4.1.2   GCC-4.5.2   LLVM-GCC   Clang
O1                   360.079     240.658     239.834     272.539    253.686
O2                   371.476     276.144     274.798     252.421    258.696
O3                   371.689     283.920     291.896     248.489    261.789
fast                 483.214     284.087     293.170     249.411    259.389
Table B.14 LAMMPS compile time results

Benchmark: LAMMPS - Lennard Jones - Execution Time Comparisons (sec)
Optimisation Level   PGI-7.0-7   GCC-4.1.2   GCC-4.5.2   LLVM-GCC   Clang
O1                   180.051     165.863     153.409     148.438    165.269
O2                   176.655     147.531     149.093     149.637    163.989
O3                   175.489     147.403     147.869     154.862    167.880
fast                 154.688     150.932     142.928     139.133    168.104
Table B.15 LAMMPS - Lennard Jones execution time results

Benchmark: LAMMPS - Rhodopsin - Execution Time Comparisons (sec)
Optimisation Level   PGI-7.0-7   GCC-4.1.2   GCC-4.5.2   LLVM-GCC   Clang-2.9
O1                   119.276     110.286     100.671     98.388     96.714
O2                   111.588     90.822      89.675      95.571     92.395
O3                   111.724     92.043      90.842      96.634     92.383
fast                 92.407      92.791      93.753      93.684     92.768
Table B.16 LAMMPS - Rhodopsin execution time results

Benchmark: LAMMPS - Size of Executable Comparisons (kb)
Optimisation Level   PGI-7.0-7   GCC-4.1.2   GCC-4.5.2   LLVM-GCC   Clang
O1                   9014.256    5076.806    4631.398    4960.679   4554.278
O2                   9126.224    5050.802    4825.635    5207.043   4682.890
O3                   9126.224    5222.719    5203.294    5231.858   4707.667
fast                 10715.699   5158.315    5098.636    5223.264   4707.667
Table B.17 LAMMPS executable size results

Benchmark: LAMMPS   Compiler: PGI   Compile Time (sec)
Optimisation Flags   Run 1     Run 2     Run 3     Average
O1                   362.712   358.732   358.792   360.079
O2                   371.106   371.236   372.086   371.476
O3                   370.943   372.992   371.133   371.689
fast                 476.821   494.836   477.986   483.214
Table B.18 PGI compile time on LAMMPS

Benchmark: LAMMPS   Compiler: GCC-4.1.2   Compile Time (sec)
Optimisation Flags   Run 1     Run 2     Run 3     Average
O1                   240.676   240.713   240.586   240.658
O2                   275.704   276.225   276.503   276.144
O3                   283.697   284.247   283.817   283.920
fast & fast-math     284.87    282.995   284.396   284.087
Table B.19 GCC-4.1.2 compile time on LAMMPS

Benchmark: LAMMPS   Compiler: GCC-4.5.2   Compile Time (sec)
Optimisation Flags   Run 1     Run 2     Run 3     Average
O1                   240.435   239.395   239.672   239.834
O2                   275.736   277.186   271.472   274.798
O3                   291.067   292.027   292.594   291.896
fast & fast-math     293.904   293.127   292.479   293.170
Table B.20 GCC-4.5.2 compile time on LAMMPS

Benchmark: LAMMPS   Compiler: LLVM-GCC   Compile Time (sec)
Optimisation Flags   Run 1     Run 2     Run 3     Average
O1                   239.58    286.251   291.785   272.539
O2                   250.311   253.24    253.712   252.421
O3                   248.303   247.401   249.763   248.489
fast & fast-math     248.318   249.864   250.052   249.411
Table B.21 LLVM-GCC compile time on LAMMPS

Benchmark: LAMMPS   Compiler: Clang   Compile Time (sec)
Optimisation Flags   Run 1     Run 2     Run 3     Average
O1                   254.706   253.664   252.688   253.686
O2                   258.48    258.426   259.181   258.696
O3                   259.926   264.807   260.633   261.789
fast & fast-math     258.876   258.666   260.625   259.389
Table B.22 Clang compile time on LAMMPS

Benchmark: LAMMPS - Lennard-Jones Problem   Compiler: PGI   Execution Time (sec)
Optimisation Flags   Run 1     Run 2     Run 3     Average
O1                   178.8     180.465   180.888   180.051
O2                   176.681   176.105   177.18    176.655
O3                   175.286   176.212   174.97    175.489
fast                 153.075   152.043   158.946   154.688
Table B.23 PGI execution time on LAMMPS - Lennard-Jones problem

Benchmark: LAMMPS - Lennard-Jones Problem   Compiler: GCC-4.1.2   Execution Time (sec)
Optimisation Flags   Run 1     Run 2     Run 3     Average
O1                   165.706   166.619   165.265   165.863
O2                   148.116   146.792   147.686   147.531
O3                   145.365   147.831   149.012   147.403
fast & fast-math     156.096   148.344   148.355   150.932
Table B.24 GCC-4.1.2 execution time on LAMMPS - Lennard-Jones problem

Benchmark: LAMMPS - Lennard-Jones Problem   Compiler: GCC-4.5.2   Execution Time (sec)
Optimisation Flags   Run 1     Run 2     Run 3     Average
O1                   152.652   153.192   154.384   153.409
O2                   151.609   147.862   147.809   149.093
O3                   147.779   147.966   147.861   147.869
fast & fast-math     139.036   150.491   139.257   142.928
Table B.25 GCC-4.5.2 execution time on LAMMPS - Lennard-Jones problem

Benchmark: LAMMPS - Lennard-Jones Problem   Compiler: LLVM-GCC   Execution Time (sec)
Optimisation Flags   Run 1     Run 2     Run 3     Average
O1                   146.647   149.158   149.509   148.438
O2                   149.372   149.685   149.853   149.637
O3                   146.594   159.931   158.061   154.862
fast & fast-math     139.29    139.231   138.877   139.133
Table B.26 LLVM-GCC execution time on LAMMPS - Lennard-Jones problem

Benchmark: LAMMPS - Lennard-Jones Problem   Compiler: Clang   Execution Time (sec)
Optimisation Flags   Run 1     Run 2     Run 3     Average
O1                   166.232   167.348   162.226   165.269
O2                   168.582   154.744   168.64    163.989
O3                   169.35    169.379   164.912   167.880
fast & fast-math     170.072   165.129   169.112   168.104
Table B.27 Clang execution time on LAMMPS - Lennard-Jones problem

Benchmark: LAMMPS - Rhodopsin Problem   Compiler: PGI   Execution Time (sec)
Optimisation Flags   Run 1     Run 2     Run 3     Average
O1                   119.376   117.081   121.371   119.276
O2                   111.049   112.104   111.61    111.588
O3                   111.618   112.488   111.066   111.724
fast                 92.269    93.047    91.906    92.407
Table B.28 PGI execution time on LAMMPS - Rhodopsin problem

Benchmark: LAMMPS - Rhodopsin Problem   Compiler: GCC-4.1.2   Execution Time (sec)
Optimisation Flags   Run 1     Run 2     Run 3     Average
O1                   109.969   109.621   111.267   110.286
O2                   90.165    91.996    90.305    90.822
O3                   92.369    92.116    91.643    92.043
fast & ffast-math    93.269    92.558    92.545    92.791
Table B.29 GCC-4.1.2 execution time on LAMMPS - Rhodopsin problem

Benchmark: LAMMPS - Rhodopsin Problem   Compiler: GCC-4.5.2   Execution Time (sec)
Optimisation Flags   Run 1     Run 2     Run 3     Average
O1                   100.781   100.531   100.702   100.671
O2                   87.482    91.022    90.521    89.675
O3                   90.178    91.26     91.087    90.842
fast & ffast-math    93.859    93.785    93.562    93.753
Table B.30 GCC-4.5.2 execution time on LAMMPS - Rhodopsin problem

Benchmark: LAMMPS - Rhodopsin Problem   Compiler: LLVM-GCC   Execution Time (sec)
Optimisation Flags   Run 1    Run 2    Run 3    Average
O1                   97.847   98.008   99.31    98.388
O2                   95.187   94.849   96.677   95.571
O3                   94.839   97.364   97.7     96.634
fast                 93.682   93.76    93.61    93.684
Table B.31 LLVM-GCC execution time on LAMMPS - Rhodopsin problem

Benchmark: LAMMPS - Rhodopsin Problem   Compiler: Clang-2.9   Execution Time (sec)
Optimisation Flags   Run 1    Run 2    Run 3    Average
O1                   97.562   96.363   96.218   96.714
O2                   93.355   92.057   91.773   92.395
O3                   91.734   93.86    91.555   92.383
fast & ffast-math    93.312   92.013   92.979   92.768
Table B.32 Clang execution time on LAMMPS - Rhodopsin problem

B.3 HELIUM simulation results

Benchmark: HELIUM - Average Compile Time Comparisons (sec)
Optimisation Level   PGI-7.0-7   GCC-4.1.2   GCC-4.5.2   LLVM-GCC
O1                   6.079       5.265       6.675       6.269
O2                   6.643       6.347       8.043       6.570
O3                   6.522       6.504       11.115      6.730
fast                 9.138       6.485       10.247      6.755
Table B.33 HELIUM compile time results

Benchmark: HELIUM - Average Execution Time Comparisons (sec)
Optimisation Level   PGI-7.0-7   GCC-4.1.2   GCC-4.5.2   LLVM-GCC
O1                   260.025     262.964     273.812     277.256
O2                   211.931     213.777     262.833     276.744
O3                   194.457     207.781     282.406     281.638
fast                 138.203     203.319     200.556     213.213
Table B.34 HELIUM execution time results

Benchmark: HELIUM - Size of Executable Comparisons (kb)
Optimisation Level   PGI-7.0-7   GCC-4.1.2   GCC-4.5.2   LLVM-GCC
O1                   2058.009    1177.801    1174.349    1676.509
O2                   2047.708    1177.733    1182.541    1689.658
O3                   2047.708    1185.925    1280.917    1695.642
fast                 2132.067    1187.185    1246.497    1690.323
Table B.35 HELIUM executable size results

Benchmark: HELIUM   Compiler: PGI   Compile Time (sec)
Optimisation Flags   Run 1   Run 2   Run 3   Average
O1                   6.096   6.089   6.052   6.079
O2                   6.683   6.633   6.612   6.643
O3                   6.478   6.501   6.587   6.522
fast                 9.138   9.114   9.161   9.138
Table B.36 PGI compile time on HELIUM

Benchmark: HELIUM   Compiler: PGI   Execution Time (sec)
Optimisation Flags   Run 1     Run 2     Run 3     Average
O1                   288.355   245.854   245.867   260.025
O2                   216.903   205.82    213.071   211.931
O3                   188.514   205.798   189.059   194.457
fast                 137.512   137.42    139.676   138.203
Table B.37 PGI execution time on HELIUM

Benchmark: HELIUM   Compiler: GCC-4.1.2   Compile Time (sec)
Optimisation Flags   Run 1   Run 2   Run 3   Average
O1                   5.29    5.246   5.259   5.265
O2                   6.353   6.354   6.334   6.347
O3                   6.469   6.526   6.518   6.504
fast                 6.487   6.507   6.462   6.485
Table B.38 GCC-4.1.2 compile time on HELIUM

Benchmark: HELIUM   Compiler: GCC-4.1.2   Execution Time (sec)
Optimisation Flags   Run 1     Run 2     Run 3     Average
O1                   282.432   253.235   253.224   262.964
O2                   220.773   220.901   199.657   213.777
O3                   203.046   220.244   200.054   207.781
fast                 207.038   201.725   201.194   203.319
Table B.39 GCC-4.1.2 execution time on HELIUM

Benchmark: HELIUM   Compiler: GCC-4.5.2   Compile Time (sec)
Optimisation Flags   Run 1    Run 2    Run 3    Average
O1                   6.564    6.796    6.664    6.675
O2                   8.057    8.096    7.977    8.043
O3                   11.153   11.138   11.053   11.115
fast                 10.337   10.147   10.256   10.247
Table B.40 GCC-4.5.2 compile time on HELIUM

Benchmark: HELIUM   Compiler: GCC-4.5.2   Execution Time (sec)
Optimisation Flags   Run 1     Run 2     Run 3     Average
O1                   274.936   272.587   273.912   273.812
O2                   251.114   268.321   269.064   262.833
O3                   288.136   286.63    272.453   282.406
fast                 206.323   187.653   207.693   200.556
Table B.41 GCC-4.5.2 execution time on HELIUM

Benchmark: HELIUM   Compiler: LLVM-GCC   Compile Time (sec)
Optimisation Flags   Run 1   Run 2   Run 3   Average
O1                   6.436   6.251   6.119   6.269
O2                   6.527   6.557   6.625   6.570
O3                   6.698   6.741   6.751   6.730
fast                 6.748   6.761   6.755   6.755
Table B.42 LLVM-GCC compile time on HELIUM

Benchmark: HELIUM   Compiler: LLVM-GCC   Execution Time (sec)
Optimisation Flags   Run 1     Run 2     Run 3     Average
O1                   280.248   285.131   266.389   277.256
O2                   279.091   272.453   278.689   276.744
O3                   271.239   280.02    293.655   281.638
fast                 213.301   213.519   212.819   213.213
Table B.43 LLVM-GCC execution time on HELIUM

Appendix C

Benchmarks Input File Setting

C.1 GADGET-2 simulation input file setting

% Relevant files

InitCondFile ../ResultsDir/ICs/ICS_tiny

OutputDir ../ResultsDir/

EnergyFile energy.txt

InfoFile info.txt

TimingsFile timings.txt

CpuFile cpu.txt

RestartFile restart

SnapshotFileBase snapshot

OutputListFilename parameterfiles/outputs_lcdm_gas.txt

% CPU time -limit

TimeLimitCPU 36000 % = 10 hours

ResubmitOn 0

ResubmitCommand my-scriptfile

% Code options

ICFormat 1

SnapFormat 1

ComovingIntegrationOn 1

TypeOfTimestepCriterion 0

OutputListOn 0

PeriodicBoundariesOn 1

% Characteristics of run

TimeBegin 0.02

TimeMax 1.0

Omega0 0.3

OmegaLambda 0.7

OmegaBaryon 0.04

HubbleParam 0.7

BoxSize 30000.0

CoolingOn 0

StarformationOn 0

% Output frequency

TimeBetSnapshot 1.1

TimeOfFirstSnapshot 1.1

CpuTimeBetRestartFile 36000.0 ; here in seconds

TimeBetStatistics 0.05

NumFilesPerSnapshot 1

NumFilesWrittenInParallel 1

% Accuracy of time integration

ErrTolIntAccuracy 0.02

MaxRMSDisplacementFac 0.2

CourantFac 0.15

MaxSizeTimestep 0.01

MinSizeTimestep 0.01

% Tree algorithm, force accuracy, domain update frequency

ErrTolTheta 0.5

TypeOfOpeningCriterion 0

ErrTolForceAcc 0.005

TreeDomainUpdateFrequency 0.1

% Further parameters of SPH

DesNumNgb 33

MaxNumNgbDeviation 1

ArtBulkViscConst 1.0

InitGasTemp 150.0 % always ignored if set to 0

MinGasTemp 25.0

% Memory allocation

PartAllocFactor 1.3

BufferSize 250 % in MByte

% System of units

UnitLength_in_cm 3.085678e21 ; 1.0 kpc

UnitMass_in_g 1.989e43 ; 1.0e10 solar masses

UnitVelocity_in_cm_per_s 1e5 ; 1 km/sec

GravityConstantInternal 0

% Softening lengths

MinGasHsmlFractional 0.25

SofteningGas 15.0

SofteningHalo 15.0

SofteningDisk 0

SofteningBulge 0

SofteningStars 0

SofteningBndry 0

SofteningGasMaxPhys 15.0

SofteningHaloMaxPhys 15.0

SofteningDiskMaxPhys 0

SofteningBulgeMaxPhys 0

SofteningStarsMaxPhys 0

SofteningBndryMaxPhys 0

C.2 LAMMPS - Lennard-Jones problem input file setting

# 3d Lennard-Jones melt

variable x index 1

variable y index 1

variable z index 1

variable xx equal 20*$x

variable yy equal 20*$y

variable zz equal 20*$z

units lj

atom_style atomic

lattice fcc 0.8442

region box block 0 ${xx} 0 ${yy} 0 ${zz}

create_box 1 box

create_atoms 1 box

mass 1 1.0

velocity all create 1.44 87287 loop geom

pair_style lj/cut 2.5

pair_coeff 1 1 1.0 1.0 2.5

neighbor 0.3 bin

neigh_modify delay 0 every 20 check no

fix 1 all nve

run 2600

C.3 LAMMPS - Rhodopsin problem input file setting

# Rhodopsin model

units real

neigh_modify delay 5 every 1

atom_style full

bond_style harmonic

angle_style charmm

dihedral_style charmm

improper_style harmonic

pair_style lj/charmm/coul/long 8.0 10.0

pair_modify mix arithmetic

kspace_style pppm 1e-4

read_data data.rhodo

fix 1 all shake 0.0001 5 0 m 1.0 a 232

fix 2 all npt temp 300.0 300.0 100.0 &

z 0.0 0.0 1000.0 mtk no pchain 0 tchain 1

special_bonds charmm

thermo 50

thermo_style multi

timestep 2.0

run 100

C.4 HELIUM simulation setting

Three parameters need to be set before compiling and running HELIUM.

1. Parameter g_X_Last controls the amount of work for each processor. It is set to 22 in this project.

2. Parameter g_No_of_blocks_in_row is set to 1, because we only test on one processor.

3. Parameter Steps_per_run controls the number of time steps in the simulation. In this project it is set to 80, and results are printed every 20 steps.

Appendix D

Profiling information

D.1 GADGET-2 Profiling

D.1.1 PGI Profiling

% cumulative self self total

time seconds seconds calls s/call s/call name

50.38 88.59 88.59 3072000 0.00 0.00 force_treeevaluate_shortrange

12.89 111.26 22.67 2428420 0.00 0.00 ngb_treefind_variable

11.91 132.21 20.95 1536000 0.00 0.00 ngb_treefind_pairs

3.22 137.87 5.66 2428420 0.00 0.00 density_evaluate

2.84 142.87 5.00 3 1.67 2.31 pmforce_periodic

2.58 147.41 4.54 1536000 0.00 0.00 hydro_evaluate

1.92 150.78 3.37 _mcount2

1.43 153.29 2.51 __rouexit

1.33 155.63 2.33 1 2.33 2.33 distribute_file

1.15 157.64 2.01 3 0.67 2.98 long_range_force

1.03 159.46 1.81 3 0.60 0.60 msort_pmperiodic_with_tmp

0.71 160.70 1.24 __c_bzero

0.51 161.61 0.91 __msort_pmperiodic_with_tmpEND

0.39 162.29 0.69 4 0.17 0.17 force_update_node_recursive

0.38 162.97 0.68 __distribute_fileEND

0.37 163.62 0.65 3 0.22 30.15 gravity_tree

D.1.2 GCC-4.1.2 Profiling

% cumulative self self total

time seconds seconds calls s/call s/call name

52.52 80.35 80.35 3072000 0.00 0.00 force_treeevaluate_shortrange

14.44 102.44 22.09 2428420 0.00 0.00 ngb_treefind_variable

11.64 120.25 17.81 1536000 0.00 0.00 ngb_treefind_pairs

3.68 125.88 5.63 3 1.88 1.88 msort_pmperiodic_with_tmp

3.29 130.91 5.03 2428420 0.00 0.00 density_evaluate

3.24 135.86 4.95 3 1.65 3.56 pmforce_periodic

3.08 140.57 4.71 1536000 0.00 0.00 hydro_evaluate

0.86 141.88 1.31 4 0.33 7.11 density

0.52 142.67 0.79 4 0.20 0.20 force_update_node_recursive

0.41 143.30 0.63 4 0.16 0.48 force_treebuild_single

0.37 143.86 0.56 4 0.14 0.14 msort_domain_with_tmp

0.33 144.37 0.51 6 0.09 0.09 reconstruct_timebins

0.33 144.88 0.51 3584000 0.00 0.00 peano_and_morton_key

0.29 145.33 0.45 4 0.11 0.11 reorder_gas

0.27 145.75 0.42 3 0.14 27.41 gravity_tree

0.27 146.16 0.41 4 0.10 0.30 peano_hilbert_order

0.25 146.54 0.38 3 0.13 0.46 advance_and_find_timesteps

0.24 146.90 0.36 4 0.09 0.09 domain_check_for_local_refine

0.24 147.26 0.36 4 0.09 0.09 domain_sumCost

0.23 147.60 0.35 8 0.04 0.04 msort_peano_with_tmp

0.22 147.93 0.33 7 0.05 0.05 empty_read_buffer

0.21 148.25 0.32 4096160 0.00 0.00 peano_hilbert_key

0.21 148.57 0.32 13312000 0.00 0.00 get_hydrokick_factor

0.18 148.85 0.28 4 0.07 0.66 domain_decompose

0.18 149.12 0.27 4 0.07 0.07 do_box_wrapping

D.1.3 GCC-4.5.2 Profiling

% cumulative self self total

time seconds seconds calls s/call s/call name

53.36 88.35 88.35 3072000 0.00 0.00 force_treeevaluate_shortrange

13.91 111.38 23.03 2428420 0.00 0.00 ngb_treefind_variable

11.43 130.31 18.93 1536000 0.00 0.00 ngb_treefind_pairs

3.91 136.79 6.48 2428420 0.00 0.00 density_evaluate

3.29 142.23 5.44 1536000 0.00 0.00 hydro_evaluate

2.83 146.92 4.69 3 1.56 3.15 pmforce_periodic

2.80 151.56 4.64 3 1.55 1.55 msort_pmperiodic_with_tmp

0.94 153.12 1.56 4 0.39 7.77 density

0.49 153.93 0.81 4 0.20 0.20 force_update_node_recursive

0.42 154.62 0.69 4 0.17 0.48 force_treebuild_single

0.39 155.27 0.65 3 0.22 30.15 gravity_tree

0.36 155.87 0.60 6 0.10 0.10 reconstruct_timebins

0.36 156.46 0.59 4 0.15 0.15 reorder_gas

0.34 157.03 0.57 3 0.19 0.51 advance_and_find_timesteps

0.33 157.58 0.55 4 0.14 0.14 msort_domain_with_tmp

0.25 158.00 0.42 3584000 0.00 0.00 peano_and_morton_key

0.25 158.42 0.42 4 0.11 0.11 domain_sumCost

0.25 158.84 0.42 4 0.11 0.11 domain_check_for_local_refine

0.23 159.22 0.38 4 0.10 0.77 domain_decompose

0.22 159.59 0.37 8 0.05 0.05 msort_peano_with_tmp

0.21 159.94 0.36 15360012 0.00 0.00 get_gravkick_factor

0.21 160.29 0.35 4 0.09 0.09 domain_findExtent

0.21 160.64 0.35 4 0.09 0.09 reorder_particles

0.19 160.96 0.32 4096160 0.00 0.00 peano_hilbert_key

0.19 161.28 0.32 4 0.08 0.08 do_box_wrapping

0.18 161.59 0.31 13312000 0.00 0.00 get_hydrokick_factor

0.16 161.86 0.27 3 0.09 0.09 find_dt_displacement_constraint

0.16 162.13 0.27 3 0.09 8.21 hydro_force

D.1.4 LLVM-GCC Profiling

% cumulative self self total

time seconds seconds calls Ts/call Ts/call name

53.94 87.22 87.22 force_treeevaluate_shortrange

14.11 110.03 22.81 ngb_treefind_variable

11.36 128.40 18.37 ngb_treefind_pairs

3.51 134.08 5.68 msort_pmperiodic_with_tmp

3.44 139.64 5.56 density_evaluate

3.42 145.17 5.53 pmforce_periodic

2.67 149.48 4.31 hydro_evaluate

0.83 150.82 1.34 density

0.46 151.56 0.74 force_treebuild_single

0.45 152.29 0.73 force_update_node_recursive

0.33 152.83 0.54 reconstruct_timebins

0.32 153.35 0.52 msort_domain_with_tmp

0.28 153.80 0.45 gravity_tree

0.27 154.23 0.43 reorder_gas

0.25 154.63 0.41 msort_peano_with_tmp

0.24 155.02 0.39 peano_hilbert_key

0.24 155.41 0.39 peano_hilbert_order

0.22 155.76 0.35 advance_and_find_timesteps

0.21 156.10 0.34 domain_check_for_local_refine

0.21 156.44 0.34 domain_sumCost

0.19 156.75 0.31 empty_read_buffer

0.19 157.06 0.31 get_gravkick_factor

0.18 157.35 0.29 peano_and_morton_key

0.17 157.63 0.28 domain_decompose

0.14 157.85 0.23 get_timestep

0.14 158.07 0.22 hydro_force

D.2 LAMMPS Profiling

D.2.1 PGI Lennard-Jones Profiling

% cumulative self self total

time seconds seconds calls s/call s/call name

34.93 57.52 57.52 1 57.52 57.52 LAMMPS_NS::PairLJCut::coeff( (int, char **))

19.55 89.72 32.20 2601 0.01 0.01 compute__Q2_9LAMMPS_NS9PairLJCutFiT1

8.22 103.24 13.53 1 13.53 13.53 LAMMPS_NS::PairLJCut::settings( (int, char **))

7.96 116.36 13.11 1 13.11 13.11 LAMMPS_NS::PairLJCut::__dt( (void))

6.78 127.52 11.16 131 0.09 0.09 LAMMPS_NS::Neighbor::half_bin_newton( (LAMMPS_NS::NeighList *))

2.94 132.36 4.84 1 4.84 4.84 LAMMPS_NS::PairLJCut::__ct( (LAMMPS_NS::LAMMPS *))

2.52 136.51 4.15 1 4.15 56.67 LAMMPS_NS::Verlet::run( (int))

2.36 140.39 3.89 2600 0.00 0.00 LAMMPS_NS::FixNVE::initial_integrate( (int))

2.35 144.25 3.86 LAMMPS_NS::PairLJCutCoulLongTIP4P::__dt( (void))

1.52 146.75 2.50 2600 0.00 0.00 LAMMPS_NS::FixNVE::final_integrate( (void))

0.85 148.15 1.40 __map__Q2_9LAMMPS_NS4AtomFiEND

0.84 149.54 1.39 pack_comm_vel__Q2_9LAMMPS_NS13AtomVecAtomicFiPiPdT1T2

0.61 150.54 1.00 14820 0.00 0.00 pack_comm__Q2_9LAMMPS_NS13AtomVecAtomicFiPiPdT1T2

0.51 151.38 0.83 1 0.83 0.83 LAMMPS_NS::FixNVE::init( (void))

0.44 152.10 0.73 initial_integrate_respa__Q2_9LAMMPS_NS6FixNVEFiN21

0.43 152.82 0.71 pack_reverse__Q2_9LAMMPS_NS13AtomVecAtomicFiT1Pd

0.43 153.53 0.71 unpack_comm__Q2_9LAMMPS_NS13AtomVecAtomicFiT1Pd

0.41 154.21 0.68 LAMMPS_NS::Atom::map( (int))

0.34 154.76 0.56 786 0.00 0.00 pack_border__Q2_9LAMMPS_NS13AtomVecAtomicFiPiPdT1T2

0.33 155.30 0.54 unpack_comm_vel__Q2_9LAMMPS_NS13AtomVecAtomicFiT1Pd

0.29 155.79 0.48 final_integrate_respa__Q2_9LAMMPS_NS6FixNVEFiT1

0.27 156.22 0.44 131 0.00 0.01 LAMMPS_NS::Comm::borders( (void))

0.26 156.65 0.42 LAMMPS_NS::FixNVE::reset_dt( (void))

0.25 157.06 0.41 131 0.00 0.00 LAMMPS_NS::Comm::exchange( (void))

0.17 157.34 0.28 __compute__Q2_9LAMMPS_NS9PairLJCutFiT1END

0.14 157.58 0.24 LAMMPS_NS::Neighbor::half_bin_no_newton( (LAMMPS_NS::NeighList *))

0.14 157.81 0.23 131 0.00 0.00 LAMMPS_NS::Neighbor::bin_atoms( (void))

0.11 158.00 0.19 1 0.19 0.19 LAMMPS_NS::FixNVE::setmask( (void))

0.11 158.18 0.18 1 0.18 0.18 ____ct__Q2_9LAMMPS_NS6FixNVEFPQ2_9LAMMPS_NS6LAMMPSiPPcEND


D.2.2 GCC-4.1.2 Lennard-Jones Profiling

% cumulative self self total

time seconds seconds calls s/call s/call name

82.03 124.83 124.83 2601 0.05 0.05 LAMMPS_NS::PairLJCut::compute(int, int)

7.39 136.08 11.25 131 0.09 0.09 LAMMPS_NS::Neighbor::half_bin_newton(LAMMPS_NS::NeighList*)

3.96 142.10 6.02 2600 0.00 0.00 LAMMPS_NS::FixNVE::initial_integrate(int)

1.97 145.10 3.00 2600 0.00 0.00 LAMMPS_NS::FixNVE::final_integrate()

1.84 147.90 2.80 2601 0.00 0.00 LAMMPS_NS::Verlet::force_clear()

1.04 149.49 1.59 15606 0.00 0.00 LAMMPS_NS::AtomVecAtomic::unpack_reverse(int, int*, double*)

1.04 151.07 1.58 14820 0.00 0.00 LAMMPS_NS::AtomVecAtomic::pack_comm(int, int*, double*, int, int*)

0.30 151.53 0.46 131 0.00 0.00 LAMMPS_NS::Comm::borders()

0.12 151.72 0.19 131 0.00 0.00 LAMMPS_NS::Neighbor::bin_atoms()

0.06 151.81 0.09 131 0.00 0.00 LAMMPS_NS::Comm::exchange()

0.05 151.88 0.07 786 0.00 0.00 LAMMPS_NS::AtomVecAtomic::unpack_border(int, int, double*)

0.05 151.95 0.07 786 0.00 0.00 LAMMPS_NS::AtomVecAtomic::pack_border(int, int*, double*, int, int*)

0.04 152.01 0.06 4192000 0.00 0.00 LAMMPS_NS::Neighbor::coord2bin(double*)

0.04 152.07 0.06 131 0.00 0.00 LAMMPS_NS::Domain::pbc()

0.02 152.10 0.03 93757 0.00 0.00 LAMMPS_NS::AtomVecAtomic::copy(int, int, int)

0.01 152.12 0.02 1741318 0.00 0.00 LAMMPS_NS::Pair::ev_tally(int, int, int, int, double, double, double, double, double, double)

0.01 152.13 0.01 48668 0.00 0.00 LAMMPS_NS::Lattice::lattice2box(double&, double&, double&)

0.01 152.14 0.01 32000 0.00 0.00 LAMMPS_NS::AtomVecAtomic::create_atom(int, double*)

0.01 152.15 0.01 2601 0.00 0.00 LAMMPS_NS::Comm::reverse_comm()

0.01 152.16 0.01 3 0.00 0.01 LAMMPS_NS::Atom::sort()

0.01 152.17 0.01 1 0.01 151.99 LAMMPS_NS::Verlet::run(int)

0.00 152.17 0.00 96000 0.00 0.00 LAMMPS_NS::RanPark::uniform()

0.00 152.17 0.00 32000 0.00 0.00 LAMMPS_NS::RanPark::reset(int, double*)

0.00 152.17 0.00 7931 0.00 0.00 LAMMPS_NS::Timer::stamp(int)

0.00 152.17 0.00 5202 0.00 0.00 LAMMPS_NS::Compute::matchstep(long)

0.00 152.17 0.00 5201 0.00 0.00 LAMMPS_NS::Timer::stamp()

0.00 152.17 0.00 2601 0.00 0.00 LAMMPS_NS::Integrate::ev_set(long)

0.00 152.17 0.00 2600 0.00 0.00 LAMMPS_NS::Modify::final_integrate()

0.00 152.17 0.00 2600 0.00 0.00 LAMMPS_NS::Modify::initial_integrate(int)


D.2.3 GCC-4.5.2 Lennard-Jones Profiling

% cumulative self self total

time seconds seconds calls s/call s/call name

78.17 118.97 118.97 2601 0.05 0.05 LAMMPS_NS::PairLJCut::compute(int, int)

7.29 130.06 11.09 131 0.08 0.09 LAMMPS_NS::Neighbor::half_bin_newton(LAMMPS_NS::NeighList*)

5.13 137.87 7.81 2600 0.00 0.00 LAMMPS_NS::FixNVE::initial_integrate(int)

2.75 142.05 4.18 2600 0.00 0.00 LAMMPS_NS::FixNVE::final_integrate()

2.49 145.84 3.79 2601 0.00 0.00 LAMMPS_NS::Verlet::force_clear()

1.68 148.40 2.56 14820 0.00 0.00 LAMMPS_NS::AtomVecAtomic::pack_comm(int, int*, double*, int, int*)

1.47 150.63 2.23 15606 0.00 0.00 LAMMPS_NS::AtomVecAtomic::unpack_reverse(int, int*, double*)

0.38 151.21 0.58 131 0.00 0.01 LAMMPS_NS::Comm::borders()

0.17 151.47 0.26 10876269 0.00 0.00 LAMMPS_NS::Neighbor::coord2bin(double*)

0.11 151.64 0.17 131 0.00 0.00 LAMMPS_NS::Comm::exchange()

0.11 151.80 0.16 786 0.00 0.00 LAMMPS_NS::AtomVecAtomic::pack_border(int, int*, double*, int, int*)

0.06 151.89 0.09 131 0.00 0.00 LAMMPS_NS::Domain::pbc()

0.05 151.97 0.08 786 0.00 0.00 LAMMPS_NS::AtomVecAtomic::unpack_border(int, int, double*)

0.05 152.04 0.07 131 0.00 0.00 LAMMPS_NS::Neighbor::bin_atoms()

0.03 152.08 0.04 93757 0.00 0.00 LAMMPS_NS::AtomVecAtomic::copy(int, int, int)

0.03 152.12 0.04 1 0.04 152.00 LAMMPS_NS::Verlet::run(int)

0.01 152.14 0.02 2470 0.00 0.00 LAMMPS_NS::Comm::forward_comm(int)

0.01 152.16 0.02 1 0.02 0.02 LAMMPS_NS::Neighbor::set(int, char**)

0.01 152.17 0.01 96000 0.00 0.00 LAMMPS_NS::RanPark::uniform()

0.01 152.18 0.01 2600 0.00 0.00 LAMMPS_NS::Modify::initial_integrate(int)

0.01 152.19 0.01 3 0.00 0.00 LAMMPS_NS::ComputeTemp::compute_scalar()

0.00 152.19 0.00 1741318 0.00 0.00 LAMMPS_NS::Pair::ev_tally(int, int, int, int, double, double, double, double, double, double)

0.00 152.19 0.00 48676 0.00 0.00 LAMMPS_NS::Lattice::lattice2box(double&, double&, double&)

0.00 152.19 0.00 32000 0.00 0.00 LAMMPS_NS::AtomVecAtomic::create_atom(int, double*)

0.00 152.19 0.00 32000 0.00 0.00 LAMMPS_NS::RanPark::reset(int, double*)

0.00 152.19 0.00 7931 0.00 0.00 LAMMPS_NS::Timer::stamp(int)

0.00 152.19 0.00 5202 0.00 0.00 LAMMPS_NS::Compute::matchstep(long)

0.00 152.19 0.00 5201 0.00 0.00 LAMMPS_NS::Timer::stamp()

0.00 152.19 0.00 2601 0.00 0.00 LAMMPS_NS::Comm::reverse_comm()


D.2.4 LLVM-GCC Lennard-Jones Profiling

% cumulative self self total

time seconds seconds calls Ts/call Ts/call name

79.44 113.53 113.53 LAMMPS_NS::PairLJCut::compute(int, int)

8.59 125.81 12.28 LAMMPS_NS::Neighbor::half_bin_newton(LAMMPS_NS::NeighList*)

3.94 131.44 5.63 LAMMPS_NS::FixNVE::initial_integrate(int)

2.35 134.80 3.36 LAMMPS_NS::FixNVE::final_integrate()

2.32 138.11 3.31 LAMMPS_NS::Verlet::force_clear()

1.34 140.02 1.91 LAMMPS_NS::AtomVecAtomic::pack_comm(int, int*, double*, int, int*)

1.14 141.65 1.63 LAMMPS_NS::AtomVecAtomic::unpack_reverse(int, int*, double*)

0.26 142.02 0.37 LAMMPS_NS::Comm::borders()

0.24 142.36 0.34 LAMMPS_NS::Neighbor::coord2bin(double*)

0.13 142.55 0.19 LAMMPS_NS::AtomVecAtomic::pack_border(int, int*, double*, int, int*)

0.11 142.71 0.16 LAMMPS_NS::Comm::exchange()

0.03 142.75 0.04 LAMMPS_NS::AtomVecAtomic::unpack_border(int, int, double*)

0.03 142.79 0.04 LAMMPS_NS::Domain::pbc()

0.02 142.82 0.03 LAMMPS_NS::Neighbor::bin_atoms()

0.01 142.84 0.02 LAMMPS_NS::AtomVecAtomic::copy(int, int, int)

0.01 142.86 0.02 LAMMPS_NS::Atom::sort()

0.01 142.88 0.02 LAMMPS_NS::Verlet::run(int)

0.01 142.89 0.01 LAMMPS_NS::Pair::ev_tally(int, int, int, int, double, double, double, double, double, double)

0.01 142.90 0.01 LAMMPS_NS::Compute::matchstep(long)

0.01 142.91 0.01 LAMMPS_NS::RanPark::reset(int, double*)

0.01 142.92 0.01 LAMMPS_NS::Neighbor::bin_distance(int, int, int)

D.2.5 PGI Rhodopsin Profiling

% cumulative self self total

time seconds seconds calls s/call s/call name

36.60 34.18 34.18 101 0.34 0.34 compute__Q2_9LAMMPS_NS20PairLJCharmmCoulLongFiT1

11.72 45.12 10.94 1 10.94 10.94 LAMMPS_NS::PairLJCharmmCoulLong::__dt( (void))

11.59 55.94 10.82 68 0.16 0.16 LAMMPS_NS::PairLJCharmmCoulLong::coeff( (int, char **))

9.56 64.86 8.92 12 0.74 0.75 LAMMPS_NS::Neighbor::half_bin_newton( (LAMMPS_NS::NeighList *))

3.41 68.04 3.18 1 3.18 3.18 LAMMPS_NS::PairLJCharmmCoulLong::settings( (int, char **))

3.18 71.01 2.97 LAMMPS_NS::PPPM::compute_rho_coeff( (void))

2.77 73.60 2.59 101 0.03 0.03 LAMMPS_NS::PPPM::brick2fft( (void))

2.21 75.66 2.06 101 0.02 0.02 LAMMPS_NS::PPPM::fieldforce( (void))

1.95 77.48 1.82 __rouexit

1.22 78.62 1.14 compute_rho1d__Q2_9LAMMPS_NS4PPPMFdN21

1.18 79.72 1.10 101 0.01 0.01 poisson__Q2_9LAMMPS_NS4PPPMFiT1

1.10 80.75 1.02 101 0.01 0.01 compute__Q2_9LAMMPS_NS14DihedralCharmmFiT1

0.97 81.66 0.91 _mcount2

0.88 82.48 0.82 101 0.01 0.01 LAMMPS_NS::PPPM::make_rho( (void))

0.72 83.15 0.67 procs2grid2d__Q2_9LAMMPS_NS4PPPMFiN21PiT4

0.46 83.58 0.43 101 0.00 0.00 LAMMPS_NS::PPPM::setup( (void))

0.39 83.94 0.36 1272 0.00 0.00 unpack_3d_permute1_2__FPdT1P12pack_plan_3d

0.37 84.28 0.34 __rouinit

0.35 84.61 0.33 __dla(void *)

0.35 84.94 0.33 __nwa(unsigned long)

0.35 85.27 0.32 101 0.00 0.00 compute__Q2_9LAMMPS_NS11AngleCharmmFiT1

0.32 85.56 0.29 LAMMPS_NS::PPPM::compute_gf_denom( (void))

0.30 85.84 0.28 101 0.00 0.00 LAMMPS_NS::PPPM::particle_map( (void))

0.28 86.11 0.26 gf_denom__Q2_9LAMMPS_NS4PPPMFdN21

0.26 86.35 0.24 1 0.24 53.40 LAMMPS_NS::Verlet::run( (int))

0.23 86.56 0.21 366933 0.00 0.00 LAMMPS_NS::FixShake::shake3( (int))

0.21 86.76 0.20 LAMMPS_NS::FixNH::nhc_press_integrate( (void))

0.20 86.95 0.19 LAMMPS_NS::Neighbor::half_bin_no_newton( (LAMMPS_NS::NeighList *))

0.20 87.13 0.18 __fmth_i_dpowd

0.20 87.31 0.18 __compute__Q2_9LAMMPS_NS20PairLJCharmmCoulLongFiT1END

0.19 87.49 0.18 453 0.00 0.00 LAMMPS_NS::DihedralCharmm::coeff( (int, char **))


D.2.6 GCC-4.1.2 Rhodopsin Profiling

% cumulative self self total

time seconds seconds calls s/call s/call name

72.95 67.38 67.38 101 0.67 0.67 LAMMPS_NS::PairLJCharmmCoulLong::compute(int, int)

10.50 77.08 9.70 12 0.81 0.81 LAMMPS_NS::Neighbor::half_bin_newton(LAMMPS_NS::NeighList*)

2.80 79.67 2.59 101 0.03 0.03 LAMMPS_NS::PPPM::fieldforce()

1.78 81.31 1.64 101 0.02 0.02 LAMMPS_NS::DihedralCharmm::compute(int, int)

1.55 82.74 1.43 101 0.01 0.02 LAMMPS_NS::PPPM::make_rho()

1.10 83.76 1.02 6464000 0.00 0.00 LAMMPS_NS::PPPM::compute_rho1d(double, double, double)

0.81 84.51 0.75 101 0.01 0.01 LAMMPS_NS::AngleCharmm::compute(int, int)

0.55 85.02 0.51 1272 0.00 0.00 unpack_3d_permute1_2(double*, double*, pack_plan_3d*)

0.53 85.51 0.49 40211534 0.00 0.00 LAMMPS_NS::Domain::minimum_image(double&, double&, double&)

0.50 85.97 0.46 101 0.00 0.01 LAMMPS_NS::PPPM::setup()

0.45 86.39 0.42 101 0.00 0.01 LAMMPS_NS::PPPM::poisson(int, int)

0.41 86.77 0.38 1373 0.00 0.00 pack_3d(double*, double*, pack_plan_3d*)

0.38 87.12 0.35 200 0.00 0.00 LAMMPS_NS::FixNH::nve_v()

0.38 87.47 0.35 427533 0.00 0.00 LAMMPS_NS::FixShake::shake3angle(int)

0.34 87.78 0.31 101 0.00 0.00 LAMMPS_NS::Pair::virial_fdotr_compute()

0.32 88.08 0.30 101 0.00 0.00 LAMMPS_NS::FixShake::unconstrained_update()

0.30 88.36 0.28 366933 0.00 0.00 LAMMPS_NS::FixShake::shake3(int)

0.27 88.61 0.25 fftw_no_twiddle_32

0.22 88.81 0.20 101 0.00 0.00 LAMMPS_NS::PPPM::fillbrick()

0.19 88.99 0.18 200 0.00 0.00 LAMMPS_NS::FixNH::nh_v_temp()

0.18 89.16 0.17 534 0.00 0.00 LAMMPS_NS::AtomVecFull::pack_comm(int, int*, double*, int, int*)

0.18 89.33 0.17 200 0.00 0.00 LAMMPS_NS::FixNH::nh_v_press()

0.18 89.50 0.17 100 0.00 0.00 LAMMPS_NS::FixNH::nve_x()

0.17 89.66 0.16 25997784 0.00 0.00 LAMMPS_NS::Pair::ev_tally(int, int, int, int, double, double, double, double, double, double)

0.17 89.82 0.16 5739729 0.00 0.00 LAMMPS_NS::Dihedral::ev_tally(int, int, int, int, int, int, double, double*, double*, double*, double, double, double, double, double, double, double, double, double)

0.17 89.98 0.16 606 0.00 0.00 LAMMPS_NS::AtomVecFull::unpack_reverse(int, int*, double*)

0.14 90.11 0.13 101 0.00 0.00 LAMMPS_NS::Verlet::force_clear()

0.13 90.23 0.12 200 0.00 0.00 LAMMPS_NS::Domain::x2lamda(int)

0.13 90.35 0.12 12 0.01 0.01 LAMMPS_NS::Neighbor::dihedral_all()


D.2.7 GCC-4.5.2 Rhodopsin Profiling

% cumulative self self total

time seconds seconds calls s/call s/call name

74.78 66.97 66.97 101 0.66 0.66 LAMMPS_NS::PairLJCharmmCoulLong::compute(int, int)

10.67 76.53 9.56 12 0.80 0.80 LAMMPS_NS::Neighbor::half_bin_newton(LAMMPS_NS::NeighList*)

2.52 78.79 2.26 101 0.02 0.03 LAMMPS_NS::PPPM::fieldforce()

1.66 80.28 1.49 101 0.01 0.02 LAMMPS_NS::DihedralCharmm::compute(int, int)

1.18 81.34 1.06 6464000 0.00 0.00 LAMMPS_NS::PPPM::compute_rho1d(double, double, double)

1.16 82.38 1.04 101 0.01 0.02 LAMMPS_NS::PPPM::make_rho()

0.65 82.96 0.58 101 0.01 0.01 LAMMPS_NS::AngleCharmm::compute(int, int)

0.64 83.53 0.57 1272 0.00 0.00 unpack_3d_permute1_2(double*, double*, pack_plan_3d*)

0.56 84.03 0.50 101 0.00 0.01 LAMMPS_NS::PPPM::setup()

0.51 84.49 0.46 101 0.00 0.01 LAMMPS_NS::PPPM::poisson(int, int)

0.40 84.85 0.36 40211534 0.00 0.00 LAMMPS_NS::Domain::minimum_image(double&, double&, double&)

0.35 85.16 0.31 1373 0.00 0.00 pack_3d(double*, double*, pack_plan_3d*)

0.31 85.44 0.28 366933 0.00 0.00 LAMMPS_NS::FixShake::shake3(int)

0.27 85.68 0.24 101 0.00 0.00 LAMMPS_NS::Pair::virial_fdotr_compute()

0.26 85.91 0.23 200 0.00 0.00 LAMMPS_NS::FixNH::nh_v_press()

0.25 86.13 0.22 427533 0.00 0.00 LAMMPS_NS::FixShake::shake3angle(int)

0.22 86.33 0.20 200 0.00 0.00 LAMMPS_NS::FixNH::nve_v()

0.21 86.52 0.19 56 0.00 0.00 LAMMPS_NS::Neighbor::check_distance()

0.20 86.70 0.18 534 0.00 0.00 LAMMPS_NS::AtomVecFull::pack_comm(int, int*, double*, int, int*)

0.17 86.85 0.15 101 0.00 0.00 LAMMPS_NS::FixShake::unconstrained_update()

0.17 87.00 0.15 fftw_no_twiddle_32

0.16 87.14 0.14 101 0.00 0.00 LAMMPS_NS::ComputeTemp::compute_vector()

0.16 87.28 0.14 200 0.00 0.00 LAMMPS_NS::FixNH::nh_v_temp()

0.13 87.40 0.12 200 0.00 0.00 LAMMPS_NS::Domain::x2lamda(int)

0.13 87.52 0.12 100 0.00 0.00 LAMMPS_NS::FixNH::nve_x()

0.12 87.63 0.11 75447 0.00 0.00 LAMMPS_NS::FixShake::shake4(int)

0.12 87.74 0.11 101 0.00 0.00 LAMMPS_NS::Verlet::force_clear()

0.10 87.83 0.09 606 0.00 0.00 LAMMPS_NS::AtomVecFull::unpack_reverse(int, int*, double*)

0.10 87.92 0.09 200 0.00 0.00 LAMMPS_NS::Domain::lamda2x(int)

0.10 88.01 0.09 101 0.00 0.00 LAMMPS_NS::PPPM::fillbrick()


D.2.8 LLVM-GCC Rhodopsin Profiling

% cumulative self self total

time seconds seconds calls s/call s/call name

73.22 66.84 66.84 101 0.66 0.66 LAMMPS_NS::PairLJCharmmCoulLong::compute(int, int)

10.52 76.44 9.60 12 0.80 0.80 LAMMPS_NS::Neighbor::half_bin_newton(LAMMPS_NS::NeighList*)

2.66 78.87 2.43 101 0.02 0.03 LAMMPS_NS::PPPM::fieldforce()

2.14 80.82 1.95 101 0.02 0.03 LAMMPS_NS::DihedralCharmm::compute(int, int)

1.49 82.18 1.36 101 0.01 0.02 LAMMPS_NS::PPPM::make_rho()

1.18 83.26 1.08 6464000 0.00 0.00 LAMMPS_NS::PPPM::compute_rho1d(double, double, double)

0.85 84.04 0.78 101 0.01 0.01 LAMMPS_NS::AngleCharmm::compute(int, int)

0.66 84.65 0.61 40211534 0.00 0.00 LAMMPS_NS::Domain::minimum_image(double&, double&, double&)

0.58 85.18 0.53 1272 0.00 0.00 unpack_3d_permute1_2(double*, double*, pack_plan_3d*)

0.54 85.67 0.49 101 0.00 0.00 LAMMPS_NS::PPPM::setup()

0.41 86.04 0.37 1373 0.00 0.00 pack_3d(double*, double*, pack_plan_3d*)

0.38 86.39 0.35 101 0.00 0.00 LAMMPS_NS::FixShake::unconstrained_update()

0.37 86.73 0.34 25997784 0.00 0.00 LAMMPS_NS::Pair::ev_tally(int, int, int, int, double, double, double, double, double, double)

0.33 87.03 0.30 427533 0.00 0.00 LAMMPS_NS::FixShake::shake3angle(int)

0.32 87.32 0.29 101 0.00 0.01 LAMMPS_NS::PPPM::poisson(int, int)

0.25 87.55 0.23 200 0.00 0.00 LAMMPS_NS::Domain::x2lamda(int)

0.24 87.77 0.22 101 0.00 0.00 LAMMPS_NS::BondHarmonic::compute(int, int)

0.23 87.98 0.21 fftw_no_twiddle_32

0.22 88.18 0.20 200 0.00 0.00 LAMMPS_NS::FixNH::nve_v()

0.21 88.37 0.19 5739729 0.00 0.00 LAMMPS_NS::Dihedral::ev_tally(int, int, int, int, int, int, double, double*, double*, double*, double, double, double, double, double, double, double, double, double)

0.21 88.56 0.19 101 0.00 0.00 LAMMPS_NS::Pair::virial_fdotr_compute()

0.19 88.73 0.17 366933 0.00 0.00 LAMMPS_NS::FixShake::shake3(int)

0.19 88.90 0.17 606 0.00 0.00 LAMMPS_NS::AtomVecFull::unpack_reverse(int, int*, double*)

0.14 89.03 0.13 101 0.00 0.00 LAMMPS_NS::Verlet::force_clear()

0.14 89.16 0.13 534 0.00 0.00 LAMMPS_NS::AtomVecFull::pack_comm(int, int*, double*, int, int*)

0.12 89.27 0.11 200 0.00 0.00 LAMMPS_NS::FixNH::nh_v_temp()

0.11 89.37 0.10 200 0.00 0.00 LAMMPS_NS::Domain::lamda2x(int)

0.11 89.47 0.10 101 0.00 0.00 LAMMPS_NS::PPPM::fillbrick()

0.10 89.56 0.09 4812246 0.00 0.00 LAMMPS_NS::Domain::minimum_image(double*)


D.3 HELIUM Profiling

D.3.1 PGI Profiling (O3 flag enabled)

% cumulative self self total

time seconds seconds calls s/call s/call name

21.91 46.54 46.54 31130933 0.00 0.00 local_ham_matrix_incr_result_w_1_over_r12_terms_

11.31 70.57 24.02 957942 0.00 0.00 local_ham_matrix_incr_with_2nd_deriv_in_r2_

10.25 92.34 21.78 3217330 0.00 0.00 local_ham_matrix_incr_with_1st_deriv_op_in_r2_

10.25 114.11 21.76 3217330 0.00 0.00 local_ham_matrix_incr_with_1st_deriv_op_in_r1_

4.71 124.11 10.01 5 2.00 2.00 local_ham_matrix_incr_result_w_1_over_r12_ibm_

4.13 132.88 8.77 84 0.10 2.11 propagators_arnoldi_propagate_

3.13 139.52 6.64 957942 0.00 0.00 local_ham_matrix_incr_with_2nd_deriv_in_r1_

2.82 145.52 6.00 924 0.01 0.01 global_linear_algebra_self_inner_product_

2.57 150.98 5.46 844 0.01 0.06 hamiltonians_incr_w_intham_x_psi_

2.40 156.08 5.10 924 0.01 0.01 global_linear_algebra_self_local_inner_product_

2.29 160.94 4.86 844 0.01 0.11 hamiltonians_init_w_atomicham_x_psi_

2.14 165.49 4.55 1596 0.00 0.00 global_linear_algebra_decrement_v_with_cx_

2.01 169.77 4.28 844 0.01 0.01 mpi_communications_get_fresh_remote_bndries_

1.83 173.65 3.88 844 0.00 0.00 hamiltonians_get_missing_2nd_deriv_bndries_

1.68 177.23 3.57 953400 0.00 0.00 global_linear_algebra_increment_psi_with_zx_

1.65 180.73 3.50 __local_ham_matrix_test_2nd_deriv_in_r1_or_r2_END

1.41 183.73 3.00 _mcount2

1.31 186.52 2.79 844 0.00 0.01 global_linear_algebra_real_inner_product_

1.19 189.06 2.54 global_linear_algebra_decrement_v_with_zx_

1.18 191.56 2.50 2 1.25 1.25 local_ham_matrix_test_2nd_deriv_in_r1_or_r2_

0.93 193.53 1.97 844 0.00 0.00 hamiltonians_get_missing_1st_deriv_bndries_

0.91 195.45 1.93 844 0.00 0.00 global_linear_algebra_real_local_inner_product_

0.88 197.31 1.86 global_linear_algebra_local_inner_product_

0.71 198.82 1.50 __local_ham_matrix_incr_result_w_1_over_r12_terms_END

0.65 200.19 1.38 957940 0.00 0.00 local_ham_matrix_incr_with_laplacian_in_r2_

0.54 201.34 1.15 957940 0.00 0.00 local_ham_matrix_incr_with_laplacian_in_r1_

0.48 202.36 1.02 __global_linear_algebra_self_local_inner_product_END

0.38 203.17 0.81 f90io_close


D.3.2 PGI Profiling (fast flag enabled)

% cumulative self self total

time seconds seconds calls s/call s/call name

29.66 46.12 46.12 31130933 0.00 0.00 local_ham_matrix_incr_result_w_1_over_r12_terms_

11.79 64.45 18.33 3217330 0.00 0.00 local_ham_matrix_incr_with_1st_deriv_op_in_r1_

8.34 77.41 12.96 1596 0.01 0.01 global_linear_algebra_decrement_v_with_cx_

7.66 89.32 11.91 3217330 0.00 0.00 local_ham_matrix_incr_with_1st_deriv_op_in_r2_

5.72 98.21 8.89 84 0.11 1.70 propagators_arnoldi_propagate_

5.01 106.00 7.79 844 0.01 0.01 global_linear_algebra_real_local_inner_product_

3.04 110.73 4.72 953400 0.00 0.00 global_linear_algebra_increment_psi_with_zx_

2.99 115.37 4.65 844 0.01 0.07 hamiltonians_init_w_atomicham_x_psi_

2.96 119.98 4.61 844 0.01 0.04 hamiltonians_incr_w_intham_x_psi_

2.92 124.52 4.54 924 0.00 0.00 global_linear_algebra_self_local_inner_product_

2.23 127.98 3.46 844 0.00 0.00 mpi_communications_get_fresh_remote_bndries_

2.19 131.39 3.41 957942 0.00 0.00 local_ham_matrix_incr_with_2nd_deriv_in_r2_

2.13 134.69 3.30 957942 0.00 0.00 local_ham_matrix_incr_with_2nd_deriv_in_r1_

1.97 137.75 3.06 844 0.00 0.00 hamiltonians_get_missing_2nd_deriv_bndries_

1.63 140.29 2.54 _mcount2

1.43 142.51 2.22 844 0.00 0.00 hamiltonians_get_missing_1st_deriv_bndries_

1.29 144.52 2.01 __rouexit

0.75 145.69 1.17 957940 0.00 0.00 local_ham_matrix_incr_with_laplacian_in_r1_

0.70 146.78 1.09 5208047 0.00 0.00 list_of_r12_ham_nbhrs_read_item_whose_code_is_

0.63 147.76 0.98 5 0.20 0.20 local_ham_matrix_incr_result_w_1_over_r12_ibm_

0.58 148.67 0.91 957940 0.00 0.00 local_ham_matrix_incr_with_laplacian_in_r2_

0.49 149.43 0.76 30168448 0.00 0.00 list_of_r12_ham_nbhrs_get_r12_ham_

0.47 150.17 0.74 35576 0.00 0.00 list_of_r12_ham_nbhrs_append_

0.39 150.77 0.60 __rouinit

0.36 151.33 0.56 __global_linear_algebra_self_local_inner_product_END

0.35 151.87 0.54 2 0.27 0.27 local_ham_matrix_test_2nd_deriv_in_r1_or_r2_

0.25 152.26 0.39 __local_ham_matrix_test_2nd_deriv_in_r1_or_r2_END

0.24 152.63 0.37 __c_mzero8

0.21 152.96 0.33 f90io_close

0.16 153.21 0.25 4540 0.00 0.00 hamiltonians_get_local_correlation_


D.3.3 GCC-4.1.2 Profiling

% cumulative self self total

time seconds seconds calls s/call s/call name

32.58 65.08 65.08 31130933 0.00 0.00 __local_ham_matrix__incr_result_w_1_over_r12_terms

12.99 91.04 25.96 3217330 0.00 0.00 __local_ham_matrix__incr_with_1st_deriv_op_in_r1

9.85 110.70 19.67 3217330 0.00 0.00 __local_ham_matrix__incr_with_1st_deriv_op_in_r2

7.81 126.30 15.60 1596 0.01 0.01 __global_linear_algebra__decrement_v_with_cx

7.37 141.03 14.73 84 0.18 2.35 __propagators__arnoldi_propagate

3.66 148.34 7.31 844 0.01 0.07 __hamiltonians__incr_w_intham_x_psi

3.43 155.19 6.85 844 0.01 0.01 __global_linear_algebra__real_local_inner_product

3.08 161.34 6.15 957942 0.00 0.00 __local_ham_matrix__incr_with_2nd_deriv_in_r1

3.01 167.36 6.02 844 0.01 0.11 __hamiltonians__init_w_atomicham_x_psi

2.96 173.28 5.92 924 0.01 0.01 __global_linear_algebra__self_local_inner_product

2.96 179.19 5.92 957942 0.00 0.00 __local_ham_matrix__incr_with_2nd_deriv_in_r2

2.32 183.82 4.63 844 0.01 0.01 __mpi_communications__get_fresh_remote_bndries

2.31 188.43 4.61 953400 0.00 0.00 __global_linear_algebra__increment_psi_with_zx

1.29 191.01 2.58 844 0.00 0.00 __hamiltonians__get_missing_2nd_deriv_bndries

1.06 193.13 2.12 844 0.00 0.00 __hamiltonians__get_missing_1st_deriv_bndries

0.70 194.53 1.40 957940 0.00 0.00 __local_ham_matrix__incr_with_laplacian_in_r2

0.64 195.81 1.28 957940 0.00 0.00 __local_ham_matrix__incr_with_laplacian_in_r1

0.58 196.96 1.16 30168448 0.00 0.00 __list_of_r12_ham_nbhrs__get_r12_ham

0.50 197.96 1.00 5208047 0.00 0.00 __list_of_r12_ham_nbhrs__read_item_whose_code_is

0.09 198.14 0.18 3849920 0.00 0.00 __list_of_r12_ham_nbhrs__r12_ham_list_index_of_last_item

0.09 198.31 0.17 4540 0.00 0.00 __hamiltonians__get_local_correlation

0.07 198.45 0.14 4 0.04 0.04 __hamiltonians__acceleration_x_state

0.06 198.56 0.11 5121526 0.00 0.00 __full_ham_matrix__reduced_mat_element_of_c

0.05 198.67 0.11 5152900 0.00 0.00 __list_of_int_ham_nbhrs__read_item_whose_code_is

0.05 198.77 0.10 6465152 0.00 0.00 __list_of_int_ham_nbhrs__get_int_ham

0.05 198.86 0.09 3927020 0.00 0.00 __full_ham_matrix__factorial

0.05 198.95 0.09 957940 0.00 0.00 __local_ham_matrix__get_centripetal_pot_in_r1

0.04 199.02 0.07 957940 0.00 0.00 __local_ham_matrix__get_centripetal_pot_in_r2

0.04 199.09 0.07 1704374 0.00 0.00 __full_ham_matrix__rational_factorial

0.03 199.15 0.07 5121527 0.00 0.00 __full_ham_matrix__three_j_with_0_m


D.3.4 GCC-4.5.2 Profiling

% cumulative self self total

time seconds seconds calls s/call s/call name

41.88 81.54 81.54 31130928 0.00 0.00 __local_ham_matrix_MOD_incr_result_w_1_over_r12_terms

11.63 104.18 22.64 3217330 0.00 0.00 __local_ham_matrix_MOD_incr_with_1st_deriv_op_in_r1

7.87 119.50 15.32 3217330 0.00 0.00 __local_ham_matrix_MOD_incr_with_1st_deriv_op_in_r2

6.82 132.77 13.27 84 0.16 2.28 __propagators_MOD_arnoldi_propagate

5.96 144.37 11.60 1596 0.01 0.01 __global_linear_algebra_MOD_decrement_v_with_cx

2.83 149.87 5.50 844 0.01 0.01 __global_linear_algebra_MOD_real_inner_product

2.60 154.92 5.06 953400 0.00 0.00 __global_linear_algebra_MOD_increment_psi_with_zx

2.46 159.71 4.79 844 0.01 0.12 __hamiltonians_MOD_init_w_atomicham_x_psi

2.43 164.45 4.74 844 0.01 0.05 __hamiltonians_MOD_incr_w_intham_x_psi

2.32 168.97 4.52 957942 0.00 0.00 __local_ham_matrix_MOD_incr_with_2nd_deriv_in_r2

2.32 173.48 4.51 957942 0.00 0.00 __local_ham_matrix_MOD_incr_with_2nd_deriv_in_r1

2.30 177.96 4.48 924 0.00 0.00 __global_linear_algebra_MOD_self_inner_product

2.19 182.23 4.27 844 0.01 0.01 __mpi_communications_MOD_get_fresh_remote_bndries

1.16 184.48 2.25 844 0.00 0.00 __hamiltonians_MOD_get_missing_2nd_deriv_bndries

1.08 186.58 2.10 844 0.00 0.00 __hamiltonians_MOD_get_missing_1st_deriv_bndries

1.06 188.65 2.07 957940 0.00 0.00 __local_ham_matrix_MOD_incr_with_laplacian_in_r1

0.81 190.23 1.58 957940 0.00 0.00 __local_ham_matrix_MOD_incr_with_laplacian_in_r2

0.73 191.66 1.43 30168448 0.00 0.00 __list_of_r12_ham_nbhrs_MOD_get_r12_ham

0.47 192.57 0.91 5208047 0.00 0.00 __list_of_r12_ham_nbhrs_MOD_read_item_whose_code_is

0.16 192.88 0.31 4540 0.00 0.00 __hamiltonians_MOD_get_local_correlation

0.12 193.11 0.23 1 0.23 0.53 __full_ham_matrix_MOD_test_6j

0.07 193.25 0.14 1 0.14 1.12 __full_ham_matrix_MOD_test_dielectric_ham_matrices

0.07 193.38 0.14 5121527 0.00 0.00 __full_ham_matrix_MOD_three_j_with_0_m

0.07 193.51 0.13 957940 0.00 0.00 __local_ham_matrix_MOD_get_centripetal_pot_in_r1

0.06 193.64 0.13 980615 0.00 0.00 __full_ham_matrix_MOD_triangle_val

0.06 193.76 0.12 5152900 0.00 0.00 __list_of_int_ham_nbhrs_MOD_read_item_whose_code_is

0.06 193.87 0.11 3849920 0.00 0.00 __list_of_r12_ham_nbhrs_MOD_r12_ham_list_index_of_last_item

0.05 193.97 0.11 6465152 0.00 0.00 __list_of_int_ham_nbhrs_MOD_get_int_ham

0.05 194.07 0.10 5152900 0.00 0.00 __int_ham_matrix_MOD_int_ham_coupling

0.05 194.17 0.10 5 0.02 38.93 MAIN__

D.3.5 LLVM-GCC Profiling

% cumulative self self total
time seconds seconds calls Ts/call Ts/call name
42.74 80.63 80.63 __hamiltonians__init_w_atomicham_x_psi
23.17 124.35 43.72 __propagators__arnoldi_propagate
11.56 146.16 21.81 __local_ham_matrix__incr_with_1st_deriv_op_in_r1
7.88 161.03 14.87 __local_ham_matrix__incr_with_1st_deriv_op_in_r2
3.54 167.70 6.67 __hamiltonians__incr_w_intham_x_psi
2.38 172.19 4.49 __mpi_communications__get_fresh_remote_bndries
2.24 176.41 4.22 __local_ham_matrix__incr_with_2nd_deriv_in_r1
2.22 180.60 4.19 __local_ham_matrix__incr_with_2nd_deriv_in_r2
1.06 182.60 2.00 __hamiltonians__get_missing_2nd_deriv_bndries
0.94 184.38 1.78 __hamiltonians__get_missing_1st_deriv_bndries
0.53 185.39 1.01 __full_ham_matrix__test_dielectric_ham_matrices
0.41 186.17 0.78 _gfortrani_internal_pack_c8
0.29 186.71 0.54 _gfortran_internal_pack
0.25 187.19 0.48 __hamiltonians__get_local_correlation
0.10 187.38 0.19 _gfortran_system_clock_8
0.09 187.54 0.17 __full_ham_matrix__get_dielectric_ham
0.08 187.70 0.16 __local_ham_matrix__get_centripetal_pot_in_r1
0.07 187.83 0.13 __hamiltonians__acceleration_x_state
0.05 187.93 0.10 __full_ham_matrix__test_6j
0.05 188.03 0.10 __full_ham_matrix__triangle_val
0.05 188.12 0.09 _gfortrani_internal_pack_c4
0.04 188.20 0.08 __full_ham_matrix__make_full_ham
0.04 188.27 0.07 __full_ham_matrix__three_j_with_0_m
0.04 188.34 0.07 __local_ham_matrix__get_centripetal_pot_in_r2
0.03 188.39 0.05 MAIN__
0.02 188.43 0.04 _gfortran_pow_i4_i4
0.02 188.46 0.03 __hamiltonians__ham_x_vector
0.02 188.49 0.03 __full_ham_matrix__get_six_j
0.02 188.52 0.03 __full_ham_matrix__test_make_int_ham
0.02 188.55 0.03 __hamiltonians__get_global_acceleration
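The flat profiles above were collected with the GNU gprof profiler described in Section 3.3. As a rough illustration only, the shell commands below sketch how such a flat profile can be produced; the compiler wrapper (mpif90) and the source and output file names are placeholders, not the exact build commands used on Ness.

    # Build with gprof instrumentation enabled (-pg) in addition to the usual flags
    mpif90 -O3 -pg -c helium.f90
    mpif90 -O3 -pg -o helium helium.o

    # A run of the instrumented executable writes gmon.out into the working directory
    ./helium

    # gprof -p prints only the flat profile (self/cumulative seconds, call counts, names)
    gprof -p ./helium gmon.out > flat_profile.txt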

Appendix E

Project work plan and risk analysis

Week 1 (ending 06/05/2011): a) Install LLVM version 2.9, LLVM-GCC, Clang and the Fortran front-end on Ness; b) try to compile some simple C, C++ and Fortran code.

Weeks 2-3 (ending 13/05/2011 and 20/05/2011): a) Investigate the performance of the GADGET-2 code compiled by the LLVM-GCC 4.2, GCC and PGI compilers; b) experiment with the best optimisation flags from each compiler.

Weeks 4-5 (ending 27/05/2011 and 03/06/2011): a) Investigate the performance of the LAMMPS code compiled by the LLVM-GCC 4.2, GCC and PGI compilers; b) experiment with the best optimisation flags from each compiler.

Weeks 6-7 (ending 10/06/2011 and 17/06/2011): a) Investigate the performance of the HELIUM code compiled by the LLVM-GCC 4.2, GCC and PGI compilers; b) experiment with the best optimisation flags from each compiler.

Week 8 (ending 24/06/2011): a) Review the project, analyse the collected data and document the results; b) discuss issues and completed work with the supervisor.

Week 9 (ending 01/07/2011): Install the libdispatch library and write a test program.

Week 10 (ending 08/07/2011): Investigate the performance of the libdispatch library.

Week 11 (ending 15/07/2011): Analyse the collected data and documentation.

Week 12 (ending 22/07/2011): Review the project and discuss further work with the supervisor.

Week 13 (ending 29/07/2011): Further work.

Week 14 (ending 05/08/2011): Spare week for handling unpredictable risks and issues.

Week 15 (ending 12/08/2011): Final project review and documentation.

Week 16 (ending 19/08/2011): Dissertation deadline.

As shown in the work plan above, the first and second weeks were allocated to installing LLVM-GCC and Clang on Ness and to beginning the GADGET-2 investigation. However, the second-semester examinations fell in those two weeks, so the schedule slipped slightly. Six weeks had been planned for investigating compiler performance, but this was affected by building the Clang compiler on Ness, because the default compiler on Ness was too old to build Clang. With the direction and help of my supervisor, Mr Iain Bethune, we solved the problem by using the GCC 4.5 installation built by Jay Chetty; nearly three extra weeks were spent getting Clang working on Ness. In total we lost about five weeks, which forced us to drop the planned investigation of the libdispatch library. Although the original plan slipped somewhat, the main purpose of the project was still achieved with Iain's full support and help.

The original risk analysis is summarised below, giving for each risk its impact, the expected size of loss in weeks, and the planned solution.

Risk: A new version of LLVM is due for release on 03 April 2011 and could contain bugs we do not know about. Impact: Moderate. Loss: 1 week. Solution: Reinstall the previous version.

Risk: The testing codes do not work well with the LLVM compiler. Impact: Moderate. Loss: 1 week. Solution: Prepare other testing codes at an early stage.

Risk: Examinations on 07 May, 13 May and 17 May. Impact: Moderate. Loss: 0.5 weeks. Solution: N/A.

Risk: Data loss. Impact: Severe. Loss: 1 week. Solution: Back up the data regularly.

Risk: Sickness (cold and flu). Impact: Severe. Loss: 1 week. Solution: Spare one week in case this happens.

In the original risk analysis above, one week was spared for installing the new version of LLVM and Clang, but in practice the impact of this risk was much higher than expected. The second risk also occurred: the HELIUM code could not be compiled correctly by the GCC and LLVM compilers. After a slight modification by Iain Bethune the HELIUM code compiled correctly, so we did not need to choose an entirely different benchmark code.

Data loss did happen once in this project, when the LLVM-GCC installation was deleted by mistake and there was no back-up. LLVM-GCC had to be reinstalled, which took about two days to get back on track. As one week had been set aside for this risk before the project started, the time lost was covered. To sum up, the project work plan and the risk analysis were very useful methods for keeping the project on track. Even though some issues were outside our preparation, the whole project was still finished successfully.
