MB3 D7.17 - Final report on Arm-optimized Fortran compiler and mathematics libraries Version 1.0

Document Information

Contract Number: 671697
Project Website: www.montblanc-project.eu
Contractual Deadline: PM39
Dissemination Level: PU
Nature: Report
Authors: Chris Goodyer (Arm), Paul Osmialowski (Arm) and Francesco Petrogalli (Arm)
Contributors: Chris Goodyer (Arm), Paul Osmialowski (Arm) and Francesco Petrogalli (Arm)
Reviewers:
Keywords: HPC, Fortran, OpenMP, Performance Libraries

Notices: This project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 671697.

Mont-Blanc 3 Consortium Partners. All rights reserved.

Change Log

Version Description of Change

v0.1 Initial version of the deliverable

v1.0 Final version

Contents

Executive Summary

1 Arm-optimized Fortran compiler
  1.1 The importance of the Fortran programming language in the modern HPC world
  1.2 The significance of open source Fortran compilers
  1.3 LLVM as an innovative open source infrastructure for developing modern compilers
    1.3.1 LLVM Intermediate Representation
  1.4 PGI Flang project
  1.5 How Flang fits into LLVM
    1.5.1 Fortran 90 Modules
  1.6 The Flang source code
  1.7 libpgmath - the mathematical intrinsics library
  1.8 Licensing
  1.9 Compatibility and performance
    1.9.1 Compatibility issues
  1.10 Vectorization and SVE
  1.11 Flang as a part of Arm Compiler for HPC suite
    1.11.1 Public contribution
  1.12 The future of Flang: F18 project

2 OpenMP
  2.1 Public contribution

3 Current status of maths libraries on AArch64
  3.1 libm functionality
  3.2 Vector math routines
    3.2.1 Limitations
  3.3 Vendor BLAS, LAPACK and FFT libraries
    3.3.1 Performance
  3.4 Additional HPC libraries

A The ofc-test report

B Livermore benchmark report
  B.1 Flang results
  B.2 GFortran results

C Examples of vectorized and non-vectorized LLVM IR code

Acronyms and Abbreviations

Executive Summary

This final report summarises the current state of the art for the following topics:

• The current status and future development of a new Arm-optimized Fortran compiler.

• The current status of the various categories of mathematical libraries in the Armv8 architecture software ecosystem.

Overall it is shown that the range of packages available to users looks very healthy for the deployment of real High Performance Computing (HPC) applications.

The report emphasizes the importance of the Fortran programming language for HPC. It also explains how open source compilers ease the adaptation of the Fortran code base to new hardware and operating systems. The goal of the 'PGI Flang' project (and its successor, the 'F18' project) is to provide an LLVM-compliant open source Fortran compiler. Thanks to a Contributor License Agreement (CLA) between PGI and Arm, we were able to participate actively in the development of the Flang compiler and its successor. Since it was made public, the Flang project has gained a lot of attention and has been tested for conformance with the Fortran standards and for the performance of the generated code. This document summarises our effort in ensuring good standards conformance and compatibility with the AArch64 architecture, including suitability for SVE vectorization.

The second half of the report focuses on the latest developments in the support for the different numerical libraries available on Arm systems. This is split into four sections. In the first we focus on the optimizations that have been upstreamed this year for higher-performing versions of various transcendental functions, and on the improvements these bring to real HPC applications. Second, we discuss the ability to vectorize loops that call vector versions of mathematical functions. This is followed by an update on the Arm Performance Libraries development that has taken place, with a particular focus on FFT performance. Finally, the ongoing work with community codes, especially as provided through OpenHPC, is outlined.

1 Arm-optimized Fortran compiler

1.1 The importance of the Fortran programming language in the modern HPC world

Through the decades of its existence, the Fortran programming language has proven to be an ideal language for numerical and scientific computing [Bra03][Loh10]. The key reasons Fortran is still so prevalent today are:

• A huge number of applications created over decades of investment in scientific software written in Fortran.

• Wide acceptance in scientific subject areas resulting in a large number of available experts.

• Good expressiveness for describing numerical algorithms.

• High efficiency of compiled code.

• Portability across many platforms with little need for conditional compilation.

These features guarantee continued development and further extension of the Fortran code base.

In this report we summarise the current effort to provide a new open source Arm-optimized Fortran compiler with the following features:

• Capability to build optimized code for the AArch64 architecture, including the ability to utilize SIMD units and the SVE extension.

• Compatibility with Fortran 95/2003 as well as with FORTRAN 77; partial compatibility with Fortran 2008.

• Utilization of the LLVM compiler suite’s infrastructure for advanced optimizations and binary code generation (including possible future integration with the LLVM ecosystem).

• Reasonable command line compatibility with GFortran from the GCC compiler suite.

1.2 The significance of open source Fortran compilers

Commercial Fortran compilers tend to be relatively expensive, causing software developers to look for free open source alternatives. The most widely used open source Fortran compiler today is GFortran, which is covered by the GNU Public Licence; this prevents basing a proprietary closed-source product upon it. These factors slow the adaptation of the Fortran code base to new hardware or platforms and inhibit experimentation with new optimization techniques. By contrast, open source compilers with permissive licensing are amenable to modification and to fast prototyping of new ideas. In the rest of this section we focus our discussion on LLVM, the main open source alternative to the GCC compiler suite.

Figure 1: Three-phase design of LLVM compilers. (Diagram: frontends performing lexical analysis for C, C++ and Fortran emit LLVM IR; the middle-end is the common optimizer; backends generate binary code for AArch64, POWER and MIPS64.)

1.3 LLVM as an innovative open source infrastructure for de- veloping modern compilers

LLVM is a relatively young compiler infrastructure, originally started at the University of Illinois in 2000 as a research project of Chris Lattner. Despite being an open source project, it gained the attention of commercial enterprises, which has contributed to its rapid growth and wide recognition. This is mostly due to its permissive BSD/MIT-style license, which does not impose limitations on including open source projects as part of larger proprietary products.

LLVM offers a modern design, with a modular architecture, reusable libraries and well-defined interfaces [LA04]. It implements a three-phase design that decouples parsing, optimization and binary code emission into three independent stages, as shown in Figure 1. This means that aggressive loop and interprocedural optimizations are possible regardless of the language in which the program was written, and independent of the target architecture on which the compiled program will be executed.

1.3.1 LLVM Intermediate Representation

A typical LLVM-compliant compiler frontend is responsible for parsing, validating and diagnosing errors in the input code written in a given programming language. It then translates the parsed code into LLVM IR (Intermediate Representation), typically by building an Abstract Syntax Tree (AST) and then converting the AST to LLVM IR. The IR syntax and structure are independent of the input programming language, and the IR is the only interface between the compiler frontend and the optimizer.

Since LLVM IR has both binary and human-readable textual representations, one can observe how the input code is presented to the optimizer and what transformations were performed on the IR. This is because the optimization passes that transform the code take LLVM IR as input and produce (optimized) LLVM IR as output. Such an approach, combined with the optional ability to obtain the IR before and after each optimization pass, provides good insight into both the effect and the correctness of the optimization steps.

1.4 PGI Flang project

In November 2015 the U.S. Department of Energy's National Nuclear Security Administration (NNSA) and its three national laboratories announced an agreement with NVIDIA's Portland Group, Inc. (PGI), for the creation of an open source Fortran compiler integrated with the LLVM compiler infrastructure1. This enterprise is a part of the joint 'Collaboration of Oak Ridge, Argonne and Lawrence Livermore laboratories' (CORAL).

In May 2017, PGI released Flang2 3, an open source Fortran frontend for LLVM, along with a complementary runtime library. This frontend is capable of creating LLVM IR and is derived from the proprietary PGI Fortran compiler, which has been used in HPC environments for more than 25 years.
PGI is a leading supplier of software compilers and tools for parallel computing; their compiler is known for its excellent level of support for the Fortran 2003 standard4. The GitHub flang-compiler account5 holds all of the source code and documentation and is open for external contributions; it is also used for bug tracking, which proved to be a good communication channel with the developers. To improve communication even further, the flang-compiler channel was opened on the Slack on-line service6. This is an invitation-only service, but anyone can join7.

Thanks to an agreement between PGI and Arm, we were able to test several binary pre-releases preceding the official release of the sources. Later on, Arm was invited to join a group of early source code reviewers of both the compiler and the PGI Fortran runtime library. Our main concern during these reviews was to ensure compatibility with the AArch64 architecture and that the LLVM IR code produced is of high enough quality, including suitability for vectorization.

1.5 How Flang fits into LLVM

Because its source code is based on PGI's commercial Fortran compiler, Flang inherited most of its internal concepts: a two-pass parser, a semantic analyzer, two levels of internal representation (ILM, ILI) and an internal optimizer. Because it is part of the LLVM ecosystem and uses its backend, Flang can target multiple hardware architectures. Support is present for the AArch64, 64-bit POWER and x86_64 architectures, but it should be fairly easy to add any other 64-bit hardware architecture to this list.

The Flang project is split into two major parts. The first part is the frontend driver, which comprises a series of patches applied on top of a fork of Clang's git repository on GitHub8. This part is responsible for parsing Flang- and Fortran-specific command line options and for establishing the whole pipeline. The second part is the frontend itself; its sources are held in a separate git repository on GitHub9. When compiled, it produces two executables: flang1 and flang2. The first reads Fortran code and produces ILM (Flang's first Intermediate Representation); the second reads ILM and internally produces ILI (Flang's second Intermediate Representation, which also undergoes optimization steps), from which the LLVM IR is finally generated.

1 https://www.llnl.gov/news/nnsa-national-labs-team-nvidia-develop-open-source-fortran-compiler-technology
2 http://lists.llvm.org/pipermail/llvm-dev/2017-May/113131.html
3 https://www.phoronix.com/scan.php?page=news_item&px=LLVM-NVIDIA-Fortran-Flang
4 http://fortranwiki.org/fortran/show/Fortran+2003+status
5 https://github.com/flang-compiler
6 https://flang-compiler.slack.com/messages
7 https://join.slack.com/t/flang-compiler/shared_invite/MjExOTEyMzQ3MjIxLTE0OTk4NzQyNzUtODQzZWEyMjkwYw
8 https://github.com/flang-compiler/clang
9 https://github.com/flang-compiler/flang

Figure 2: The Flang pipeline. (Diagram: the Clang driver (clang, opt, llc) parses arguments; flang1 translates Fortran to ILM; flang2 translates ILM to ILI, optimizes the ILI and emits LLVM IR; LLVM optimizes the LLVM IR and generates an AArch64 object file; the system linker (ld) turns object files into an ELF executable.)

Figure 2 presents a broad view of the Fortran program compilation process. A more detailed list of the functional blocks follows:

• Flang1 phases:
  – scanner: turns Fortran code into tokens.
  – parser: turns tokens into an Abstract Syntax Tree (AST) and a symbol table.
  – transformer: turns the AST into a canonical AST.
  – output: turns the canonical AST into ILM.

• Flang2 phases:
  – expander: turns ILM into ILI.
  – optimizer: turns ILI into optimized ILI.
  – the bridge: turns optimized ILI into LLVM IR.

The frontend driver starts the flang1 executable, which reads the Fortran program from an input file. flang1's scanner turns it into tokens. The parser turns these into an Abstract Syntax Tree (AST) and a symbol table (symtab). The AST is then transformed into canonical form, which is used to generate a temporary text file containing ILM code, the first Intermediate Representation used by Flang.

The ILM code is read from the temporary text file by the flang2 executable. During the import stage its text content is fed into data structures held in process memory. This form is turned into ILI, the second Intermediate Representation used by Flang, which is then optimized by the internal optimizer. The optimizations performed on the ILI are: constant folding, evaluation of identities, evaluation of comparisons with zero, and branch optimizations (changing a compare instruction followed by a branch instruction into a single instruction that combines both). The optimized ILI is then used during the LLVM bridge phase, where the final output text file with LLVM IR is generated. The LLVM IR generation is performed solely via explicit operations on text strings; no LLVM API is used for this process.

The LLVM IR output by Flang is further processed by LLVM optimization and code generation passes. The optimizations at that stage are typical of LLVM's middle-end. These include dead code elimination, loop unrolling and auto-vectorization.

1.5.1 Fortran 90 Modules

For each module defined in a Fortran 90 source file, a .mod file is created along with the binary .o file produced for that source file. If a source file defines more than one module, a single .o file is created, accompanied by several .mod files. The role of Fortran 90 modules is similar to that of C/C++ header files (except that no preprocessor is involved and they are imported by Fortran 90 programs with a use statement), and they are typically installed into include directories along with C/C++ header files (while the compiled library binaries containing the actual executable code are held in lib directories). The .mod files are text files containing exported symbol definitions produced by flang1 (and not processed any further by flang2). Due to differences in the file format, modules compiled with one compiler (e.g. Flang) cannot be used by code compiled with another compiler (e.g. GFortran).

1.6 The Flang source code

The Flang compiler was initially written in ANSI C98, although some C++ files have started to appear during the later stages of development. There are also Fortran routines (e.g. ftn_transpose_real, ftn_transpose_cmplx, vmmul_real, vmmul_cmplx) in the runtime library. Various definition files are provided in nroff format (.n files), which are transformed into C headers (.h files). The documentation source files are also provided in nroff format (.n files); various command-line utilities (e.g. groff or a2ps) can be used to generate documentation pages from these source files.

As declared in point 1.3 of the tools/flang2/docs/coding.n document, the established coding convention is as follows:

• ANSI C98 with some K&R code that will eventually be removed.

• The use of the static keyword for limiting the visibility of file globals and file functions.

• Naming convention for files: files that define external symbols have a df suffix, and headers that declare external symbols also have a df suffix, e.g. ilidf.c, flgdf.h.

• Naming convention for code: variable names are lower case; macros and typedefs are typically upper case; enum names begin with a capital letter and are then lower case (no explicit rule is stated for function names, though!).

• The first non-commented line of each .c file is #include "gbldefs.h" (note that there are three gbldefs.h files in the directory tree!).

• All dynamic storage allocation and freeing is done through the macros NEW, NEED and FREE defined in gbldefs.h.

The listed items seem reasonable; however, they do not prevent surprising constructs from appearing in the Flang code. One example is the re-defined assert macro, which not only reuses the usual assert macro name but also changes its interface by introducing additional parameters.

1.7 libpgmath - the mathematical intrinsics library

An important part of the Flang project is the Fortran runtime library. It implements all of the Fortran instructions that could not be expressed directly in the Intermediate Representation. Of all of those instructions, the mathematical functions (also called Fortran intrinsics) gained special attention. A new library called libpgmath was developed, containing all the mathematical routines (implementing the mathematical Fortran functions) extracted from Flang's Fortran runtime library.

Although still part of the Flang project, this library is built separately, before Flang itself (the Flang build process now has a defined dependency on libpgmath). Initially an x86_64-optimized implementation was developed, with a generic implementation added soon after. The generic implementation acts as a wrapper around an underlying mathematical library (usually libm); note that it does not include specialized instructions or architecture-specific extensions of any kind. In July 2018 Cavium contributed an AArch64-optimized implementation, which provides scalar and vectorized versions of the mathematical functions. A patch for LLVM was provided by PGI that enables LLVM's Loop Vectorizer to proceed with loops containing calls to those mathematical functions.

1.8 Licensing

Flang and its runtime library are distributed under the Apache License version 2.010. This is a permissive open source licensing scheme which allows use of this software for developing both commercial and non-commercial projects. The LLVM project is distributed under the University of Illinois Open Source License11, another permissive licensing scheme. There are plans12 to relicense LLVM and distribute it under the Apache License version 2.0.

1.9 Compatibility and performance

Since it was made public, Flang has gained a lot of attention and has been tested for conformance with the Fortran standards and for the performance of the generated code. Figure 3 depicts how the choice of Fortran compiler (between GFortran version 8.2.0 and Flang) affects the performance of compiled workloads.

Appendix B contains the output messages from running the Livermore benchmark compiled by GFortran and by Flang. This benchmark was specifically designed for testing the floating point computation rate (the number of floating point operations per second). Note that Flang performs most of its optimizations within LLVM passes, and this arrangement (a Fortran frontend combined with the LLVM middle-end and backend) is still very new, with room for further improvement.

Appendix A includes a test report from the Open Fortran Compiler test suite13 developed by Codethink. Although the test suite was originally designed for the purpose of testing their own Fortran compiler, we reworked it for the purpose of testing Flang. The suite contains mostly FORTRAN 77 test cases imported from the NIST F77 test suite14, as well as Fortran 90 cases added by Codethink. Test cases were compiled by both Flang and GFortran, then both executables were run and their output compared (the Behaviour column). Some of the differences denoted as failures are caused by slight differences in output formatting between code compiled by the two compilers.

1.9.1 Compatibility issues

Being a new open source compiler, Flang is perceived as an alternative to GFortran, which can lead to the expectation that all software that can be built by GFortran can also be built by Flang, possibly with the same command line options. Unfortunately, doing this can

10 http://www.apache.org/licenses/LICENSE-2.0
11 https://otm.illinois.edu/disclose-protect/illinois-open-source-license
12 http://lists.llvm.org/pipermail/llvm-dev/2018-October/126991.html
13 https://github.com/CodethinkLabs/ofc-tests
14 http://www.fortran-2000.com/ArnaudRecipes/fcvs21_f95.html

(a) Polyhedron benchmarks (b) PolyBench/Fortran

Figure 3: Execution time differences between workloads compiled by Flang and workloads compiled by GFortran. Workloads are (a) the Polyhedron suite of benchmarks (downloaded from http://www.fortran.uk), and (b) the PolyBench/Fortran suite of benchmarks. All tests were executed on a single core of an Armv8 machine equipped with Cortex-A57 cores. Both compilers were invoked with the flags -mcpu=native -O3 -ffp-contract=fast -funroll-loops.


benchmark        gfortran (sec)   flang (sec)
rnflow               96.016          54.668
gas-dyn2            376.470         358.551
ac                   43.041          44.622
aermod               49.455          49.078
air                  16.037          15.606
capacita             50.955          38.262
channel2            204.390         227.479
doduc                61.272          64.467
fatigue2            257.309         403.286
induct2             398.882         388.594
linpk                14.138          13.837
mdbx                 21.530          24.327
mp-prop-design      932.540         537.245
nf                   21.320          23.871
protein              48.605          58.167
test-fpu2           191.847         194.076
tfft2               149.885         152.924

Table 1: Execution times of workloads compiled by Flang and GFortran (raw results in seconds). Workloads are from the Polyhedron suite of benchmarks (downloaded from http://www.fortran.uk), executed on a single core of an Armv8 machine equipped with Cortex-A57 cores. Both compilers were invoked with the flags -mcpu=native -O3 -ffp-contract=fast -funroll-loops.


benchmark        gfortran (sec)   flang (sec)
correlation           3.977           4.005
covariance            3.924           3.939
2mm                  24.205          48.586
3mm                  36.091          71.303
atax                  0.072           0.064
bicg                  0.053           0.045
cholesky              0.544           0.445
doitgen               0.713           1.400
gemm                 16.960          33.149
gemver                0.233           0.367
gesummv               0.062           0.060
mvt                   0.152           0.233
symm                 19.078          15.721
syr2k                 6.717           5.642
syrk                  3.214           2.613
trisolv               0.026           0.022
trmm                  1.638           1.347
durbin                0.587           0.576
dynprog               0.525           0.642
gramschmidt           2.633           2.546
lu                    0.499           0.453
ludcmp                2.064           2.063
floyd-warshall        2.110           2.647
reg-detect            0.018           0.015
adi                   5.801           5.753
fdtd-2d               0.697           0.706
fdtd-ampl             1.828           1.824
jacobi-1d-imper       0.002           0.002
jacobi-2d-imper       0.110           0.129
seidel-2d             0.725           0.783

Table 2: Execution times of workloads compiled by Flang and GFortran (raw results in seconds). Workloads are from the PolyBench/Fortran suite of benchmarks, executed on a single core of an Armv8 machine equipped with Cortex-A57 cores. Both compilers were invoked with the flags -mcpu=native -O3 -ffp-contract=fast -funroll-loops.

result in compilation errors, unexpected runtime behaviour or even performance loss. Some of the options that are typical of GFortran (e.g. -fimplicit-none or -fno-second-underscore) are ignored by Flang; others have different default values. This situation changes dynamically, as the Flang developers are still working on fulfilling community expectations.

A significant source of incompatibilities comes from the open source build configuration systems, GNU Autotools and CMake. The configure script shipped with Autotools-based projects cannot recognize Flang as a distinct Fortran compiler: it is wrongly recognized as PGI's commercial compiler, resulting in the use of invalid link-time flags. To work around this, the libtool script generated by configure needs to be modified before use (see Listing 1 for the sed shell commands that can be used to modify the generated libtool file).

sed -i -e 's#wl=""#wl="-Wl,"#g' libtool
sed -i -e 's#pic_flag=""#pic_flag="-fPIC -DPIC"#g' libtool

Listing 1: Shell commands needed to modify the generated libtool script.

With existing software that contains the configure script, not much can be done (except for instructing users to run the sed commands between the configure and make steps). This problem may be solved in future releases.

There used to be a similar problem with CMake, which also recognized Flang as PGI's compiler and was putting incorrect flags for debug and release-with-debug-info builds into the generated Makefiles (the PGI-specific -gopt instead of just -g). This issue has been fixed in the more recent versions of CMake.

1.10 Vectorization and SVE

One of the most important features of the Flang compiler for HPC-focused applications is the ability to generate Vector Length Agnostic (VLA) code [Pet16], which allows the utilization of the SVE extension of the Armv8 architecture. Vectorization is wholly handled by the LLVM optimization passes in the middle-end. Since Flang does not support explicit vectorization pragmas, generation of binary code utilizing SIMD units or the SVE extension relies on auto-vectorization. In this process, loops are detected and the legality of their possible vectorization is proven. Loops that are proven safe to vectorize have their IR transformed by extending it with additional basic blocks containing instructions that operate on vector data types. These instructions are translated in the backend into binary instructions specific to the given SIMD unit.

Listing 2 presents an example Fortran subroutine containing a do loop that can easily be auto-vectorized. Listing 8 in Appendix C presents the non-vectorized LLVM IR code for that subroutine, consisting of the two characteristic basic blocks of this loop in its canonical form. It also contains a preheader (labeled L.LB1_317.preheader) which ensures that there is only one entry to the loop body (labeled L.LB1_317).

Listing 9 (Appendix C) presents the LLVM IR code vectorized for SIMD units. Compared to the non-vectorized IR, it contains additional basic blocks (labeled min.iters.checked, vector.memcheck, vector.body and middle.block) that prepare for the execution of a number of loop iterations that is a multiple of the vector length. For the remaining iterations, the original loop basic blocks (labeled L.LB1_317.preheader and L.LB1_317) are still used. The main basic block of the vectorized loop (vector.body) contains instructions that operate on vector types. These types are defined with the syntax <n x type>, where n is the number of elements, of the given type, in a vector.
Listing 10 (Appendix C) presents the LLVM IR code vectorized for the SVE extension. Compared to the IR vectorized for SIMD units, it contains only two additional (relative to the non-vectorized

IR) basic blocks, labeled vector.ph and vector.body. Both of these basic blocks contain instructions that operate on scalable vector types. These types are defined with the syntax15 <[n x] m x type>, where n is a symbol indicating that the vector is scalable. For fixed-length vectors the 'n x' part is omitted, which guarantees backward compatibility with the old syntax. The value m denotes the minimum number of elements (of the given type), and the 'n x' indicates that the total number of elements is an unknown multiple of m.

15 http://lists.llvm.org/pipermail/llvm-dev/2017-June/113587.html

subroutine sub1(length, a, b, c, d, e)

  implicit none

  integer :: i
  integer :: length
  real :: x
  real :: a(length)
  real :: b(length)
  real :: c(length)
  real :: d(length)
  real :: e(length)

  do i = 1, length
    x = b(i)**2 - 4.0 * a(i) * c(i)
    d(i) = sqrt(x)
    e(i) = (-d(i) - b(i)) * 0.5 / a(i)
    d(i) = (d(i) - b(i)) * 0.5 / a(i)
  end do
end subroutine sub1

Listing 2: Example Fortran subroutine with a loop that is trivial to auto-vectorize.

Listing 3 presents an example Fortran subroutine containing a do loop that cannot be auto-vectorized for SIMD units but can be auto-vectorized for SVE, as shown in Listing 11 (Appendix C). This subroutine was derived from the previous one (see Listing 2); the only difference is the addition of the if condition. This version leaves the LLVM loop vectorization pass unable to prove the legality of the vectorization:

LV: Checking a loop in "sub2" from /tmp/1a1299.ll
LV: Loop hints: force=? width=0 unroll=0
LV: Found a loop: L.LB1_317
LV: Can't if-convert the loop.
LV: Not vectorizing: Cannot prove legality.

In order to make vectorization of such a loop possible, the LLVM auto-vectorization pass tries to flatten the if statement into a single stream of instructions. This requires that all of the basic blocks of the conditional statement can be predicated. Unfortunately, load/store operations on arrays prevent that. In the case of SVE, masked load/store operations can be utilized (see Listing 4), enabling successful vectorization.

subroutine sub2(length, a, b, c, d, e)

  implicit none

  integer :: i
  integer :: length
  real :: x
  real :: a(length)
  real :: b(length)
  real :: c(length)
  real :: d(length)
  real :: e(length)

  do i = 1, length
    x = b(i)**2 - 4.0 * a(i) * c(i)
    if (x .ge. 0.0) then
      d(i) = sqrt(x)
      e(i) = (-d(i) - b(i)) * 0.5 / a(i)
      d(i) = (d(i) - b(i)) * 0.5 / a(i)
    end if
  end do
end subroutine sub2

Listing 3: Example Fortran subroutine with a loop that can only be auto-vectorized for SVE.

%wide.masked.load43 = call <n x 4 x float> @llvm.masked.load.nxv4f32.p0nxv4f32(<n x 4 x float>* %39, i32 4, <n x 4 x i1> %predicate, <n x 4 x float> undef), !alias.scope !5
%40 = fmul <n x 4 x float> %wide.masked.load43, shufflevector (<n x 4 x float> insertelement (<n x 4 x float> undef, float 4.000000e+00, i32 0), <n x 4 x float> undef, <n x 4 x i32> zeroinitializer)
%41 = fmul <n x 4 x float> %wide.masked.load39, %40
%42 = fsub <n x 4 x float> %35, %41
%43 = call <n x 4 x float> @llvm.sqrt.nxv4f32(<n x 4 x float> %42)
%44 = getelementptr float, float* %15, i64 %offset.idx
%45 = bitcast float* %44 to <n x 4 x float>*
call void @llvm.masked.store.nxv4f32.p0nxv4f32(<n x 4 x float> %43, <n x 4 x float>* %45, i32 4, <n x 4 x i1> %predicate), !alias.scope !7, !noalias !9

Listing 4: Example of using masked load/store SVE intrinsics in LLVM IR.

1.11 Flang as a part of the Arm Compiler for HPC suite

The Flang project gained huge attention within Arm, which resulted in both contributions and bug reports. The main concern was to ensure valid execution of compiled Fortran code on AArch64 machines. This attention emerged from a more general strategy of promoting the 64-bit Arm architecture in HPC. The major manifestation of this strategy was the release of the Arm Compiler for HPC suite, targeted at the major Linux distributions. Based on LLVM, it offers Clang as the C/C++ compiler, Flang as the Fortran compiler, and the Arm Performance Libraries.

A single license was donated to the OpenHPC project, which can now use the suite as yet another compiler family, next to GNU, Intel and generic LLVM, along with the Arm Performance Libraries, similarly to Intel's MKL. In effect, the Arm Compiler for HPC suite was successfully installed on the public OBS server16 used for building official OpenHPC releases. OBS users are allowed to use the Arm compiler family for building software packages for AArch64 within both public and private projects.

1.11.1 Public contribution

Arm's effort in improving Flang and making it suitable for Arm Compiler for HPC has already resulted in public contributions to open source and a number of published modifications. To have our modifications accepted, a Contributor License Agreement (CLA) was signed between Arm and PGI. The purpose of this agreement is to ensure that the Flang project has the necessary grants of rights over all contributions to allow distribution under the Apache 2.0 license. The full list of accepted changes (patches) is given below:

• https://github.com/flang-compiler/flang/pull/427 AArch64: fix for failing ieee17 test case: introduce correct values for Rounding Mode selection.
  – This patch fixes a bug causing the ieee17 test case to fail.
  – The valid values for the Rounding Mode control field for AArch64 (bits [23:22] of the FPCR register) are listed in the Processor Technical Reference Manual17.

• https://github.com/flang-compiler/flang/pull/460 Runtime: remove locks around malloc()/free() in mpmalloc.c for glibc-based systems.
  – Having locks around those functions in the Flang runtime library can ruin optimization effort when tcmalloc is preloaded to replace the standard malloc()/free() implementation with one optimized for reducing lock contention.

• https://github.com/flang-compiler/flang/pull/488 AArch64: implement Flush-To-Zero get/set.
  – This change adds a missing feature by filling the empty fenv fesetzerodenorm() and fenv fegetzerodenorm() functions with the required implementation.

• https://github.com/flang-compiler/flang/pull/511 Use zeroinitializer instead of hard to handle stream of zeros.
  – This change amends the way arrays are initialized with zeros in the generated LLVM IR.
  – Instead of emitting a stream of zeros, the zeroinitializer idiom is used.
  – This prevents hard-to-identify errors when the number of emitted zeros does not match the array dimensions.

Apart from the accepted ones, 16 of our changes are still waiting for the project maintainers' approval. These changes were applied in the Flang compiler shipped within the Arm Compiler for HPC suite.

16https://build.openhpc.community
17http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0488d/CIHCACFF.html

1.12 The future of Flang: F18 project

In July 2018 NVIDIA's Portland Group, Inc. (PGI) scheduled two webinars18 to announce and discuss their work on the new Fortran compiler frontend for LLVM. The new open source project, called F1819, has been developed publicly since its beginning. It is intended to replace the existing frontend as soon as it reaches a mature state. Even after that, support for the current Flang project is expected to continue. The new F18 compiler is written entirely in C++ (the latest C++17 dialect, with extensive use of modern design patterns) and should use the LLVM API for generating LLVM IR. Currently the F18 compiler is capable of generating Abstract Syntax Trees for compiled programs and performing source code rewrites. It features a recursive descent parser with localized backtracking20, which is typical of modern compilers. The compiler source code is accompanied by a rich set of unit tests, another manifestation of a modern approach to software development. It is expected that within two years the F18 project will result in a fully operational Fortran compiler of production quality.

18http://clang-developers.42468.n3.nabble.com/Webinar-A-New-Flang-Interactive-Code-Review-and-Design-Discussion-July-11-td4061028.html
19https://github.com/flang-compiler/f18
20https://github.com/flang-compiler/f18/blob/master/documentation/parsing.md

2 OpenMP

OpenMP is a programming language extension21 [WC00] designed for developing parallel applications on shared memory systems. It is available for the C, C++ and Fortran languages and enables the use of shared-memory parallelism by annotating the existing source code with directives describing how individual parts, often loops, should be parallelised.

In order to handle threading and synchronization, all programs utilizing OpenMP directives must be linked against an OpenMP runtime library, which is responsible for providing all necessary means of parallelism. Programs compiled by compilers from the GCC suite usually use the GOMP runtime library, whilst programs compiled by LLVM compilers can only use LLVM's OpenMP runtime library. Programs compiled using GCC compilers (e.g. GFortran) can also use LLVM's OpenMP runtime library as a drop-in replacement for GOMP. Contrary to this, programs compiled using LLVM compilers cannot use GOMP as a replacement for the LLVM OpenMP runtime. This is because both of these runtime libraries extend the official OpenMP runtime API with their own functions. The LLVM OpenMP runtime library offers compatibility with GOMP by implementing a compatibility layer that exports all GOMP functions. However, GOMP does not implement the extended functionality provided by the LLVM OpenMP runtime library. In order to fully utilize the runtime library's capabilities, the binary code generated by the LLVM compilers usually contains calls to functions from the extended API of the LLVM OpenMP runtime library, causing incompatibility with GOMP.

With the growing number of cores available on Armv8 chips, it is important for the OpenMP runtime to scale well with the number of utilized CPU cores.
In order to compare how the two runtimes (GOMP and the LLVM OpenMP runtime) behave when handling the same workload processed by a different number of threads (with each thread bound to a separate processor core via thread affinity), we performed a series of experiments on a 96-core Armv8.1 machine (two 48-core processors on a single board). Figure 4 presents results from running the LULESH benchmark (written in C++) with problem size 80. The experiment was repeated 96 times, each time adding one more thread, with all the threads pinned to separate CPU cores. The following variants were tested:

• gcc gomp – the benchmark was compiled with the system–provided GCC (g++) compiler (version 5.4.0) and the system–provided GOMP runtime was used at execution time.

• gcc libomp – the benchmark was compiled with the system–provided GCC (g++) compiler (version 5.4.0) but executed with the LLVM OpenMP runtime (built from the source code available on the git repository's master branch).

• gitgcc gitgomp – the benchmark was compiled with the GCC (g++) compiler built from the source code available on the git repository’s master branch and the GOMP runtime also built from the source code held on the same git repository.

• gitgcc libomp – the benchmark was compiled with the GCC (g++) compiler built from the source code available on the git repository’s master branch and executed with the LLVM OpenMP runtime (built from the source code available on the git repository’s master branch).

• llvm libomp – the benchmark was compiled with LLVM’s Clang (clang++) compiler built from the source code available on the git repository master branch and executed with the

21http://www.openmp.org

Figure 4: Performance of the LULESH benchmark, problem size 80: execution time improvement with growing number of computation cores involved.

LLVM OpenMP runtime (also built from the source code available on the git repository’s master branch).

We can observe that the LLVM OpenMP runtime offers consistently better performance with a growing number of threads, regardless of the compiler used for building the workload. When using the GOMP runtime we can notice that, beyond a certain number of cores, the performance increase not only stops but actually starts to deteriorate. The glitch at 48 threads is caused by crossing the CPU socket boundary; communication between the two sockets starts to take time at that point. Figure 5 presents results from running the same LULESH benchmark, this time with problem size 20, which is too small to fully benefit from multi-threaded execution. We can observe that GOMP starts to deteriorate much faster than the LLVM OpenMP runtime. Figure 6 presents results from another benchmark, called SNAP (version 1.04), which is written in Fortran 90. The experiment was repeated 96 times, each time one more thread was used, with all the threads pinned to separate CPU cores. The following variants were tested:

• gitgnu gitgomp – the benchmark was compiled with the GFortran compiler built from the source code available on the git repository’s master branch and the GOMP runtime also built from the source code held on the same git repository.

• gitgnu libomp – the benchmark was compiled with the GFortran compiler built from the source code available on the git repository’s master branch and executed with the LLVM OpenMP runtime (built from the source code available on the git repository’s master branch).

• llvm libomp – the benchmark was compiled with LLVM's Flang compiler from its git repository and executed with the LLVM OpenMP runtime (built from the source code

Figure 5: Performance of the LULESH benchmark, problem size 20: execution time improvement with growing number of computation cores involved.

available on the git repository’s master branch).

Similarly to LULESH, in the case of SNAP we can also observe that the LLVM OpenMP runtime offers consistently better performance with a growing number of threads, regardless of the compiler used for building the workload.
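Per-core pinning of the kind used in these experiments can be requested with the standard OpenMP 4.0 affinity environment variables (a sketch; the text does not specify the exact mechanism used for these runs, and the binary name is illustrative):

```shell
# One OpenMP thread per core, spread across both sockets
# (standard OpenMP 4.0 environment variables).
export OMP_NUM_THREADS=96
export OMP_PROC_BIND=spread
export OMP_PLACES=cores
# ./snap ...   (launch the OpenMP binary in this environment)
echo "$OMP_PROC_BIND / $OMP_PLACES"
```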

2.1 Public contribution

As part of the Mont-Blanc 3 work on runtime support, our major concern has been to ensure stable execution of programs utilizing the LLVM OpenMP runtime library on AArch64 hardware. This effort has already resulted in public contributions to open source, as we published our modifications to the LLVM OpenMP runtime project. The full list of changes (patches) is given below:

• https://reviews.llvm.org/D19319 ARM Limited license agreement from the copyright/patent holder.
  – This was a prerequisite for contributions by ARM.

• https://reviews.llvm.org/D19629 Clean up issues around KMP USE FUTEX and kmp lock.h.
  – In the preprocessor condition checking, some locking mechanisms had been accidentally disabled on ARMv8.

• https://reviews.llvm.org/D19628 New hwloc API compatibility.

Figure 6: Performance of the SNAP 1.04 benchmark: execution time improvement with growing number of computation cores involved.

– The latest development snapshots of hwloc (topology discovery helper) started to use a new API for its library; in order to keep up with these changes, some pieces of hwloc-dependent topology discovery code in the LLVM OpenMP runtime had to be upgraded to work with both old and new hwloc.

• https://reviews.llvm.org/D19878 Use C++11 atomics for the implementation of ticket locking.

– This is an important fix for the ticket locking mechanism which did not previously perform well on ARMv8. The most visible outcome of the problem was a race condition in work stealing routines. This posed a major flaw as it effectively made OpenMP 4.x tasking unusable on ARMv8 machines. – This patch replaces use of compiler builtin atomics with C++11 atomics for the ticket locking implementation.

• https://reviews.llvm.org/D19879 Solve “Too many args to microtask” problem.

– This patch solves the “Too many args to microtask” problem which occurred while executing the LULESH benchmark on ARMv8.
– The problem did not occur on the Intel architecture, which has a separate routine, written in x86 assembly language, to eagerly populate the stack with an arbitrary number of arguments for microtask execution. Our contribution adds a similar routine for ARMv8, written in AArch64 assembly language.

• https://reviews.llvm.org/D19880 Fine tuning of TC* macros.

– This patch allows easy replacement of the current non-operational implementation of the TC* macros with alternative implementations (e.g. written in assembly language).
– It also fixes usage of those macros, which turned out to be wrong in some places when an example operational implementation was tested.

• https://reviews.llvm.org/D22365 Make balanced affinity work on AArch64.

– This patch enables balanced affinity (contributed by Intel for better support of their architecture) on ARMv8 machines. Unlike x86_64, ARMv8 cores are not split into hardware threads. Also, certain ARMv8 SoCs (like the AMD Opteron™ A-Series) have cores clustered into packages that share the same L2 cache. The implementation contributed by Intel explicitly refused to work with this kind of core arrangement. It turned out that it is possible to generalize their implementation to work for any arrangement of cores with at least two levels of hierarchy.

• https://reviews.llvm.org/D30644 Add AArch64 support.

– This patch enables AArch64 support for libomptarget library responsible for offloading workloads to hardware targets (e.g. accelerators). – This piece of code allows offloading-to-self on AArch64 machines.

• https://reviews.llvm.org/D31071 GOMP compatibility: add missing OMP4.0 taskdeps handling code.

– This patch adds OpenMP 4.0 task dependencies handling code that is missing in LLVM libomp's GOMP compatibility layer.
– The absence of this code was responsible for the erratic behaviour of the taskdep benchmarks from the Kastors suite whenever they were compiled using GCC and linked against the LLVM OpenMP runtime instead of GOMP.

• https://reviews.llvm.org/D36510 OMP_PROC_BIND: better spread.

– This change improves the way threads are spread across cores when OMP_PROC_BIND=spread is set and no unusual affinity masks are in use.

• https://reviews.llvm.org/D41000 Fix cpuinfo issues.

– This patch fixes an issue with the older /proc/cpuinfo layout on AArch64 Linux systems.
– There are two /proc/cpuinfo layouts in use for AArch64: old and new; the old one has all ’processor : n’ lines in one section.

• https://reviews.llvm.org/D41482 Add required AArch64 specific code for running OMPT test cases.

• https://reviews.llvm.org/D41542 OMPT: Add missing initialization in the nested_lwt.c test case.

– Without the initialization provided by this patch, the OMPT test case nested_lwt.c tends to fail.

• https://reviews.llvm.org/D41854 OMPT: Fix a type mismatch in the omp_control_tool() implementation that made it run incorrectly on 32-bit Arm machines.

3 Current status of maths libraries on AArch64

In this report we summarise the current status of several categories of mathematical libraries, namely:

• libm functionality

• vendor maths libraries for BLAS and LAPACK

• additional libraries commonly used in the solution of HPC workloads

Each of these topics is explored in greater detail over the rest of this section. Clearly there is interaction between the performance of the higher level libraries and that of the underlying libraries on which they are built. For example, a high performing BLAS implementation is important for LAPACK to perform well. This, in turn, helps increase the performance of libraries such as PETSc [BAA+16a, BAA+16b, BGMS97] or Elemental [PMvdG+13]. Similarly, the performance of the compilers used is very important to any libraries not written in hand-coded assembly, such as just about all of the last category. For the work covered in this section the standard GCC open-source compiler has been used. Up-to-date porting results for both HPC libraries and applications are maintained for both GCC and the Arm Compiler. They can be found on the Arm HPC GitLab page, https://gitlab.com/arm-hpc/packages/wikis/home.

Note that none of the libraries discussed in this section have had any development or optimization work done, to date, using funding from the Mont-Blanc 3 project. This section is therefore a summary of the state of the art for Arm functionality.

3.1 libm functionality

On all Linux systems, glibc provides the “libm” functionality that covers the standard mathematical operations included within the “math.h” include file (in C). glibc's libm aims for correctness first, performance second, and size of compiled code third. It already implements complete POSIX 2008 mathematics support (including all of C99 with Annex F [ISO99a]) along with a few extensions, and it also plans to implement TS 18661 [ISO99b]. Detailed documentation is available [gli] that includes information on accuracy (section 19.7) and ‘fenv’ (floating point rounding and exception) behaviour (section 20.5.1). It is mainly written in portable C code, although some assembly variants do exist, and it has separate code for variants like 128-bit vs 64-bit long double. Issues are publicly tracked22 and currently only 2 of the 41 entries are Arm specific, neither of which are on accuracy grounds.

glibc's libm is not designed for optimal performance, since several functions are implemented with correct rounding, which is hard and expensive. It also ensures floating point exceptions are set correctly, which users often do not need, but which requires extra code and branches. For HPC users such behaviour can be very important, especially for debugging and validation.

Over the past twelve months Arm have been working to open source high performing versions of certain commonly used maths functions. These have been released as open source, under a permissive licence, through the Arm Optimized Routines23 GitHub repository. They have also been upstreamed and accepted into glibc. Since many HPC users may not wish to compile such functions from source themselves, and there is a long delay between an update being available

22https://sourceware.org/bugzilla/buglist.cgi?component=math&list_id=31625&product=glibc&resolution=---
23https://github.com/ARM-software/optimized-routines

Code         GCC time (s)   Arm Compiler time (s)   Arm Compiler + libamath (s)
WRF          316.86         286.98                  191.63
Cloverleaf   4918.58        4193.84                 4193.84
OpenMX       -              604.24                  469.00
Branson      622.955        604.24                  505.08

Table 3: Comparison timings for various HPC applications using GCC, the Arm Compiler, and the Arm Compiler with libamath

in glibc and it reaching end-users via their Linux distributions, we have also included them in the Arm Performance Libraries product, discussed below, as a standalone library called libamath. This gives users of Arm's commercial HPC tools a direct route to a precompiled and optimized library that gives the best performance available. The difference it makes to real HPC codes is highlighted in Table 3. Overall, the performance improvement in a code will be very dependent on the number of basic mathematical calls made, but in these cases a typical speed-up of 20% to 35% on real HPC codes is possible. Arm intends to keep improving both the open source and the commercial versions of these functions, and the list of optimized functions is expected to continue to grow over the coming years.

3.2 Vector math routines

The current version of armclang supports the vectorization of loops that invoke math routines from libm. This is a different approach from that used by armflang (initiated by invoking the compiler with the -fveclib option, which is now the default behaviour), where for every scalar Libpgmath math routine a mapping exists to the vectorized version of that routine (the auto-vectorization pass can make use of this mapping and proceed with vectorization of a loop that invokes math routines). Note that currently Libpgmath does not support SVE and does not call SLEEF functions; both features are planned. Any C loop using functions from math.h (or from cmath in the case of C++, see Listings 5 and 6) can be vectorized by invoking the compiler with the new option -fsimdmath, together with the usual options that are needed to activate the auto-vectorizer (-O2 in the example):

$ armclang -fsimdmath -c -O2 source.c
$ armclang -fsimdmath -c -O2 source.cpp

/* C code example: source.c */
#include <math.h>

void do_something(double *a, double *b, unsigned N) {
  for (unsigned i = 0; i < N; ++i) {
    /* some computation */
    a[i] = sin(b[i]);
    /* some computation */
  }
}

Listing 5: Example C code with a loop using a function provided via math.h.


// C++ code example: source.cpp
#include <cmath>

void do_something(float *a, float *b, unsigned N) {
  for (unsigned i = 0; i < N; ++i) {
    // some computation
    a[i] = std::pow(a[i], b[i]);
    // some computation
  }
}

Listing 6: Example C++ code with a loop using a function provided via cmath.

The vector versions of the routines are provided via libsleefgnuabi, a library providing a vector implementation of the routines available in libm. The shared library libsleefgnuabi.so needed to link the code generated with -fsimdmath is shipped with the compiler, and needs to be installed, with LD_LIBRARY_PATH visibility, on the machine where the program is run. The linking flag -lsleefgnuabi is implicit when -fsimdmath is used:

$ armclang -fsimdmath main.c -O2 -o program

The library is built out of SLEEF, an open source vector libm replacement library available at http://sleef.org.

3.2.1 Limitations

This is an experimental feature; this paragraph describes its current limitations. Vector math routines are enabled only for Advanced SIMD vectorization. The feature works with no user intervention only for sin, cos, pow, exp, log, log10, and for sinf, cosf, powf, expf, logf, log10f. Math functions not mentioned in the previous list can be added by redeclaring them in the compilation unit after the math.h (or cmath) inclusion. In particular, double precision functions need to be decorated with #pragma omp declare simd simdlen(2) notinbranch, and single precision functions need to be decorated with #pragma omp declare simd simdlen(4) notinbranch, as in Listing 7.

#pragma omp declare simd simdlen(2) notinbranch
extern double atan2(double x, double y);

#pragma omp declare simd simdlen(4) notinbranch
extern float coshf(float x);

Listing 7: Example that shows how to enable additional functions from libm.

Note that the declaration needs to be specified with extern "C" language linkage (C++ code should pick this up automatically via the standard system header inclusions). Note also that the feature described in the last example has undergone limited testing; any bugs detected when using it should be promptly reported. The amount of performance gain obtainable by using the feature depends on the code surrounding the math routine invocation in the loop body. Loops with a lot of computation are more likely to benefit from the vectorization of the routine.

When debugging the output code, or disassembling the object files generated using -fsimdmath, the symbols with a _ZGVnN prefix are the symbols of the vector math routines provided via SLEEF.24 Notice that the scalar versions of the math routines are still provided via the system libm. Any value discrepancies between the scalar execution and the vector execution are due to the different algorithms used in the two versions. When comparing the values generated with libm with those generated with libsleefgnuabi, anything with more than 1 ULP error should be reported.

3.3 Vendor BLAS, LAPACK and FFT libraries

The BLAS (Basic Linear Algebra Subprograms) [LHKK79, DCHH88b, DCHH88a, DCDH90b, DCDH90a] and LAPACK (Linear Algebra PACKage) [ABB+99] libraries are de facto standards in scientific computing and date back to the first “level 1 BLAS” implementations of the late 1970s. These libraries cover the following sets of functionality:

• BLAS

– Level 1 – Vector operations, e.g. dot products, scaling
– Level 2 – Matrix-vector operations, e.g. matrix-vector multiplication
– Level 3 – Matrix-matrix operations, e.g. matrix-matrix multiplication

• LAPACK

– Solving systems of linear equations
– Eigenvalue and eigenvector problems
– Singular value problems

These routines involve both real and complex variants in single and double precision. Interfaces are typically provided for both C and Fortran, and this is no different on Armv8. The Arm Performance Libraries also include an implementation of a set of FFT (Fast Fourier Transform) routines. FFTs do not have the same standard interface as BLAS and LAPACK; instead, the interface provided by FFTW [FJ05] is supported.

Users of HPC systems and developers of higher level libraries rely on the functionality of the vendor-provided BLAS and LAPACK implementations. They therefore need to be highly performant in order to be used as key building blocks for their own applications. Vendor libraries must also be trusted as giving the highest accuracy possible, in order to ensure that confidence in any results obtained using them is absolute.

For AArch64 the vendor library is the ‘Arm Performance Libraries’. This product was first released in November 2015 and has had updates typically every two or three months since then. Its development is happening alongside the bring-up of the first serious production-level Arm-powered server hardware, with feedback from the early test users helping shape where development work goes. The Arm Performance Libraries are developed by a team in Arm specifically tasked with ensuring that the product has high performance on all microarchitectures. Optimization work is focused on those microarchitectures where Arm HPC machines are being deployed. Since Arm do not, themselves, have any details of partner microarchitectures, in order to provide the best experience for users we incorporate tuned versions that silicon partners have contributed

24The Vector ABI specifications will be published in September. The document explains the relation between the scalar functions decorated with the OpenMP declare simd directive and the associated vector functions.

to open-source projects, such as OpenBLAS. On top of this, Arm add architectural, and especially parallel performance, tuning.

As part of the Arm Compiler for HPC package, tailored builds for the following microarchitectures are currently available:

• Arm Cortex-A72

• Cavium ThunderX2 CN99

• Generic AArch64

with specific builds done against the following Linux operating systems:

• Red Hat Enterprise Linux 7.2 and 7.3

• Ubuntu 16.04

• SUSE 12

These versions have been chosen to ensure complete coverage of all the systems that vendors are currently shipping, in order to maximise the libraries' uptake within the Arm HPC ecosystem. The accuracy of the library is ensured through the use of a test and validation suite purchased from NAG, the Numerical Algorithms Group25. NAG have a worldwide reputation for the high accuracy of their numerical results, and their technology has been used by many of the other vendor libraries over the past few decades. Testing of the correctness of the library, both in terms of numerical accuracy and of the handling of incorrect input parameters, is carried out on every build released. This guarantee is not possible with open source libraries, and it therefore adds extra confidence for anyone using the library that their calculations are correct. Every shipped library provides eight builds, which are combinations of the following variants:

• GCC or Arm Compiler compatibility

• Serial or with OpenMP

• 32-bit or 64-bit integers

in order to give end users the choice of the appropriate version for use in their applications. Functionality and accuracy testing must therefore encompass all these build types, with shared-memory runs tested on increasing thread counts.

3.3.1 Performance

The Arm Performance Libraries in total comprise about 1800 numerical subroutines alongside a handful of utility functions. They are intended to be used on a variety of problem sizes and thread counts, hence the potential optimization work to be done on the library is very broad in scope. Each routine typically has a variety of different computational cases that it needs to consider, with potentially different code paths inside, in order to achieve maximum performance.

Much of the work that has taken place over the past year has focused on increasing performance on the Cavium ThunderX2 systems, such as are provided in the Mont-Blanc Dibona cluster. In particular, a heavy emphasis has been placed on increasing the performance of the FFT routines. In work presented by MB3 partners at SC18, the gulf in performance between the FFTW and Arm Performance Libraries implementations was highlighted. An example of this

Figure 7: Comparison of FFT performance between FFTW and Arm Performance Libraries 18.1

Figure 8: Comparison of FFT performance between FFTW and Arm Performance Libraries 18.2

gap is shown in Figure 7, where the x-axis shows the length of the FFT in question, and the y-axis gives the relative performance between FFTW and the 18.1 release of the Arm Performance Libraries. In this graph, points above the line at y = 1 mean that FFTW was faster than the Arm Performance Libraries. It is clear that the results seen by Barcelona were not an isolated case of poor performance, but a common trend. Note that there are extra coloured markers to denote the FFT lengths seen in testing of various real HPC applications calling FFT functions.

The development work done over the year by the Arm Performance Libraries team has more than remedied this situation. The results for 1-d transforms in the 18.2 release are compared against FFTW in Figure 8, although this time the y-axis is inverted so that above 1 is better for the Arm Performance Libraries. It is clear that for most cases the Arm Performance Libraries implementation was faster, sometimes significantly so. The FFT work has continued through the year, with subsequent releases both increasing performance still further and extending the cases covered from just the complex-to-complex basic interface case of the 18.2 release to now also include real-to-complex and complex-to-real cases, multidimensional cases, and the advanced and guru interfaces. In addition, the 19.0 release also includes the FFTW MPI interface.

Overall, across the microarchitectures we test, the Arm Performance Libraries are exhibiting a strong level of performance which, when coupled with the high accuracy standards demanded of a vendor library, gives users and ecosystem partners confidence in using them for real HPC deployments.

25http://www.nag.co.uk

3.4 Additional HPC libraries

The HPC world is awash with open source libraries, which are almost all written in high level languages (Fortran, C, C++) without any assembly. As such, porting them to AArch64 is normally a straightforward matter of ensuring that configuration files work and that any #ifdef sections do not accidentally fall through to the final #else, for example one assuming the architecture is Windows. Arm and their partners have been working hard to ensure that as many of these packages as possible install and build without problems from the open source downloads. A major initiative has been in providing recipes for many common HPC libraries, tools and applications to users, for both GCC and the Arm Compiler. We have endeavoured to do this in a collaborative fashion, so a ‘Packages Wiki’, https://gitlab.com/arm-hpc/packages/wikis/home, was set up to house these instructions. Whilst Arm has provided the bulk of the recipes, contributions have been received from various organisations across the globe.

Another part of this work has involved becoming ‘Silver members’ of the OpenHPC initiative26. This is a cross-architecture, HPC-specific, community effort to provide a common stack that may be seen as a ‘base’ level for a modern HPC service. The main goal of the OpenHPC project is to ease installation and configuration of a typical HPC working environment, with its open-sourced utilities, scientific libraries, resource managers and cluster provisioning tools. For Arm, compilation of these packages exposes the typical challenges experienced while working with all other open-source software: installation dependencies, configuration peculiarities, maintenance disturbances.
The OpenHPC project is evolving into a mature development environment for providing HPC software packages that are well specified (in terms of the packaging system), properly tested and ready to install on popular Linux distributions built around the RPM package manager: SLES, CentOS and RHEL. Initially, x86_64 was the only supported hardware architecture. During the work on the 1.2 version, Arm joined the effort, which resulted in an AArch64 Tech Preview release. A full list of packages for the current (1.3.1) release is given in Table 4. The only differences from the x86_64 version are that no Intel compiler builds or Intel MPI builds of the performance packages are included.

26http://openhpc.community

Table 4: Packages supported in OpenHPC for Arm

Administrative Tools:
clustershell      http://clustershell.sourceforge.net
conman            http://dun.github.io/conman
docs              https://github.com/openhpc/ohpc
examples          https://github.com/openhpc/ohpc
ganglia           http://ganglia.sourceforge.net
genders           https://github.com/chaos/genders
lmod              https://github.com/TACC/Lmod
losf              https://github.com/hpcsi/losf
mrsh              https://github.com/chaos/genders
nagios            http://www.nagios.org
nagios-plugins    https://www.nagios-plugins.org
ndoutils          http://www.nagios.org/download/addons
nrpe              http://www.nagios.org
pdsh              http://sourceforge.net/projects/pdsh
prun              https://github.com/openhpc/ohpc
test-suite        https://github.com/openhpc/ohpc/tests

Compiler Families:
Gnu Compiler Suite    http://gcc.gnu.org

Development Tools:
autoconf     http://www.gnu.org/software/autoconf
automake     http://www.gnu.org/software/automake
EasyBuild    http://hpcugent.github.com/easybuild
hwloc        http://www.open-mpi.org/projects/hwloc
libtool      http://www.gnu.org/software/libtool
numpy        http://sourceforge.net/projects/numpy
scipy        http://www.scipy.org
R            http://www.r-project.org
spack        https://github.com/LLNL/spack
valgrind     http://www.valgrind.org

Updates of Distro-provided packages:
lua-bit           http://bitop.luajit.org
lua-filesystem    http://keplerproject.github.com/luafilesystem
lua-posix         https://github.com/luaposix/luaposix

IO Libraries:
adios             http://www.olcf.ornl.gov/center-projects/adios
hdf5              http://www.hdfgroup.org/HDF5
netcdf-cxx        http://www.unidata.ucar.edu/software/netcdf
netcdf-fortran    http://www.unidata.ucar.edu/software/netcdf
netcdf            http://www.unidata.ucar.edu/software/netcdf
phdf5             http://www.hdfgroup.org/HDF5

Lustre packages:
lustre-client    https://wiki.hpdd.intel.com
shine            http://lustre-shine.sourceforge.net/

MPI Runtime Families:
mpich       http://www.mpich.org
mvapich2    http://mvapich.cse.ohio-state.edu/overview/mvapich2
openmpi     http://www.open-mpi.org

Parallel Libraries:
boost           http://www.boost.org
fftw            http://www.fftw.org
hypre           http://www.llnl.gov/casc/hypre
mumps           http://mumps.enseeiht.fr
petsc           http://www.mcs.anl.gov/petsc
superlu_dist    http://crd-legacy.lbl.gov/~xiaoye/SuperLU
trilinos        http://trilinos.sandia.gov/index.html

Performance Tools:
imb          https://software.intel.com/en-us/articles/intel-mpi-benchmarks
mpiP         http://mpip.sourceforge.net
papi         http://icl.cs.utk.edu/papi
pdtoolkit    http://www.cs.uoregon.edu/Research/pdt
scalasca     http://www.scalasca.org
scorep       http://www.vi-hps.org/projects/score-p
tau          http://www.cs.uoregon.edu/research/tau/home.php

Provisioning Tools:
warewulf    http://warewulf.lbl.gov

Resource Management:
munge      http://dun.github.io/munge
PBSPro     https://github.com/pbspro/pbspro
slurm      http://slurm.schedmd.com

Runtimes:
ocr            https://xstack.exascale-tech.com/wiki
singularity    http://singularity.lbl.gov

Serial Libraries:
gsl         http://www.gnu.org/software/gsl
metis       http://glaros.dtc.umn.edu/gkhome/metis/metis/overview
openblas    http://www.openblas.net
superlu     http://crd.lbl.gov/~xiaoye/SuperLU

The Arm architecture is now fully supported on all of the supported Linux distributions. On top of this, Arm served on the OpenHPC Technical Steering Committee for 2016–17; Arm's interests for 2017–18 are now represented by Linaro. Ongoing work will introduce Arm Compiler builds akin to the existing Intel compiler builds.

A The ofc-test report


Open Fortran Compiler Test Report

This is the test report for Open Fortran Compiler (OFC).

Branch: master

SHA1: 7d913108a5897afaaa6b2118419e02eba495c7fc

Test run started: Wed Oct 10 20:32:22 2018 UTC

programs/

Columns: Semantic, Standard Behaviour, Reingest, Valgrind, Valgrind (Debug); Valgrind results were not recorded (-) for this run.

Passed all recorded stages (Semantic, Standard Behaviour and Reingest):
ANSI_0052.f, ANSI_0079.f, ANSI_0090.f, ANSI_0096.f, ANSI_0124.f, ANSI_0169.f, ANSI_0327.f, ANSI_0395.f, ANSI_0522.f, ANSI_0771.f, ANSI_0787.f, ANSI_0799.f, ANSI_0899.f, ANSI_1097.f, ANSI_1259.f, ANSI_1569.f, ANSI_ASSIGN.f, ANSI_BYTE_TYPE.f, ANSI_F77_D.f, ANSI_F77_SUBSCRIPT.f, ANSI_NON_STANDARD_ARG.f, ANSI_SLASH_INIT.f, ANSI_VAX_STRUCTURE.f, ERROR_0306.f, FACS46M.FOR, array.f, bang_test.f, empty_string.f, enddo.f, fixed_bang.f, hollerith_ambig.f, idv.f, if_block_format.f, implicit_program.f, int_str.f, label_end_do.f, label_end_if.f, nothing.f, open_rt_status.f, pragma.f, scope_label.f, semicolon.f, substring.f, var_implicit_do.f

Passed Semantic and Reingest but failed Standard Behaviour with FAIL (1):
array_slice.f, array_slice_ns.f, as_array.f, byte.f, dimension.f, do_block.f, ds_array.f, f90_array.f, func_type.f, goto_init.f, hex_range.f, hollerith_constant_test.f, holstr.f, implicit_sub_arg.f, inc.f, init.f, int8.f, intrinsic_func.f, intrinsic_ishft.f, log_str.f, name_keyword.f, non_exec_label.f, parameter.f, record.f, reshape.f, star_in_LHS.f, type_basic.f

Failed Semantic with FAIL (2) (no further stages run):
ANSI_1253.f, ANSI_1568.f, ANSI_IMPLICIT.f, ANSI_TYPE_AS_PRINT.f, automatic.f, char_int_eq.f, freeform132.f90, implicit_static.f, non_exec_goto.f, while_do.f

Total: Semantic 71 / 81, Standard Behaviour 44 / 71, Reingest 71 / 71

programs/negative/

Columns: Semantic only (Standard Behaviour, Reingest and Valgrind not run).

Passed Semantic: ANSI_1323.f, double_program.f, label_invalid.f, select-case-neg-test.f, sfunc.f, split_ident.f, star_array.f

Failed Semantic with FAIL (1): byte_star.f, cast_log_real.f, intrinsic_decl.f, lit8.f, print8.f

Total: Semantic 7 / 12

programs/nist/

Columns: Semantic, Standard Behaviour, Reingest (Valgrind not run).

Tests run (all .FOR): FM001–FM014, FM016–FM026, FM028, FM030–FM045, FM050, FM056, FM060–FM062, FM080, FM097–FM111, FM200–FM205, FM251–FM261, FM300–FM302, FM306–FM308, FM311, FM317, FM328, FM351–FM357, FM359–FM364, FM368–FM379, FM401–FM407, FM411, FM413, FM500, FM503, FM506, FM509, FM514, FM517, FM520, FM700, FM701, FM710, FM711, FM715, FM718, FM719, FM722, FM800–FM834, FM900, FM901, FM903, FM905–FM910, FM912, FM914–FM917, FM919–FM923.

All 192 NIST tests passed the Semantic and Reingest stages. All passed Standard Behaviour except FM905.FOR and FM907.FOR, which failed with FAIL (1).

Total: Semantic 192 / 192, Standard Behaviour 190 / 192, Reingest 192 / 192

programs/sema/

Columns: Semantic and Reingest (Standard Behaviour and Valgrind not run).

Passed Semantic and Reingest:
ANSI_0155.f, ANSI_0157.f, ANSI_F77_SUBSTRING.f, EDIT_DESCRIPTOR_WITHOUT_W.f, decl_star_array.f, if_not_logical.f, illogical.f, implicit_rhs.f, int7.f, keyword_abuse.f, type_data.f

Failed Semantic with FAIL (2): jump_to_format.f, select-case-test.f, star_len.f

Total: Semantic 11 / 14, Reingest 11 / 11

programs/todo/

Columns: Semantic, Standard Behaviour, Reingest.

Passed all recorded stages: int4_str.f, volterra_627.f

Passed Semantic and Reingest but failed Standard Behaviour: ANSI_0376.f (FAIL (2)), ANSI_1160.f (FAIL (2)), str_array.f (FAIL (1)), subarray.f (FAIL (2))

Failed Semantic with FAIL (2): ANSI_0161.f, char_int4_eq.f, char_kind.f, define_test.f, type.f

Total: Semantic 6 / 11, Standard Behaviour 2 / 6, Reingest 6 / 6
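The per-suite "Total" rows above can be aggregated into an overall pass rate for a given stage. The following Python sketch does this for the Semantic stage, using the (passed, total) counts transcribed from the report; the helper name is ours, not part of ofc-test.

```python
# Per-suite (passed, total) counts for the Semantic stage, transcribed
# from the report's "Total" rows.
semantic_totals = {
    "programs/":          (71, 81),
    "programs/negative/": (7, 12),
    "programs/nist/":     (192, 192),
    "programs/sema/":     (11, 14),
    "programs/todo/":     (6, 11),
}

def overall_rate(totals):
    """Aggregate pass rate: passes summed over tests summed, across suites."""
    passed = sum(p for p, _ in totals.values())
    total = sum(t for _, t in totals.values())
    return passed, total, passed / total

passed, total, rate = overall_rate(semantic_totals)
# 287 of the 310 tests pass the Semantic stage, i.e. roughly 93%.
```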

B Livermore benchmark report

This Livermore benchmark27 was executed on a single core of an Armv8 machine equipped with Cortex-A57 cores. Both compilers (Flang and GFortran) were invoked with the flags: -mcpu=native -O3 -ffp-contract=fast -funroll-loops.

B.1 Flang results

*********************************************
 THE LIVERMORE FORTRAN KERNELS "MFLOPS" TEST:
*********************************************

VERIFY: 200  0.1055E-05 = Time Resolution of Cpu-timer
        6800 Repetition Count = MULTI * Loops2 = 50.000

CLOCK CALIBRATION TEST OF INTERNAL CPU-TIMER:
SECOND MONOPROCESS THIS TEST, STANDALONE, NO TIMESHARING.
VERIFY TIMED INTERVALS SHOWN BELOW USING EXTERNAL CLOCK
START YOUR STOPWATCH NOW !

Verify T or DT observe external clock:

    ---------   ---------   --------   -----
    Total T ?   Delta T ?   Mflops ?   Flops
    ---------   ---------   --------   -----
 1      9.89        9.89     987.70    0.97647E+10
 2     19.77        9.89     987.67    0.19529E+11
 3     29.66        9.89     987.69    0.29294E+11
 4     39.54        9.89     987.71    0.39059E+11
    ---------   ---------   --------   -----
END CALIBRATION TEST.
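As a sanity check on the calibration rows above, each Mflops figure is simply the cumulative flop count divided by the cumulative time. A small Python sketch, with the values transcribed from the first row of the Flang run:

```python
# First calibration row from the Flang run: 9.89 s total, 0.97647E+10 flops.
total_time_s = 9.89
total_flops = 0.97647e10

mflops = total_flops / total_time_s / 1e6
# Agrees with the reported 987.70 Mflops to within the rounding of the
# printed time and flop counts.
```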

ESTIMATED TOTAL JOB CPU-TIME:= 41.408 sec. ( Nruns= 7 Trials)

Trial= 1  ChkSum= 914  Pass= 0  Fail= 0
Trial= 2  ChkSum= 914  Pass= 1  Fail= 0
Trial= 3  ChkSum= 914  Pass= 2  Fail= 0
Trial= 4  ChkSum= 914  Pass= 3  Fail= 0
Trial= 5  ChkSum= 914  Pass= 4  Fail= 0
Trial= 6  ChkSum= 914  Pass= 5  Fail= 0
Trial= 7  ChkSum= 914  Pass= 6  Fail= 0

NET CPU TIMING VARIANCE (Terr); A few % is ok:

        AVERAGE   STANDEV   MINIMUM   MAXIMUM
Terr      0.47%     1.39%     0.03%     7.13%

27http://www.netlib.org/benchmark/livermore

NET CPU TIMING VARIANCE (Terr); A few % is ok:

        AVERAGE   STANDEV   MINIMUM   MAXIMUM
Terr      0.65%     2.06%     0.06%    10.44%

NET CPU TIMING VARIANCE (Terr); A few % is ok:

        AVERAGE   STANDEV   MINIMUM   MAXIMUM
Terr      0.59%     0.82%     0.03%     3.52%

********************************************
 THE LIVERMORE FORTRAN KERNELS: * SUMMARY *
********************************************

Computer : Cortex-A57   System : Linux   Compiler : Flang   Date :   Testor :

MFLOPS RANGE:  REPORT ALL RANGE STATISTICS:
Mean DO Span = 167    Code Samples = 72

Maximum Rate   = 2468.2390 Mega-Flops/Sec.
Quartile Q3    =  841.1185 Mega-Flops/Sec.
Average Rate   =  775.8688 Mega-Flops/Sec.
GEOMETRIC MEAN =  620.9705 Mega-Flops/Sec.
Median Q2      =  583.8345 Mega-Flops/Sec.
Harmonic Mean  =  510.8773 Mega-Flops/Sec.
Quartile Q1    =  343.4946 Mega-Flops/Sec.
Minimum Rate   =  185.4776 Mega-Flops/Sec.

Standard Dev.  =  567.4987 Mega-Flops/Sec.
Avg Efficiency =  25.16%  Program & Processor
Mean Precision =  12.69 Decimal Digits
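The summary reports several different means of the 72 per-kernel rates; they are not interchangeable, since for any set of positive rates the harmonic mean is at most the geometric mean, which is at most the arithmetic average (matching their ordering in the listing above). A minimal Python sketch of how these figures are computed (the per-kernel rates themselves are not reproduced in the report, so the sample values here are illustrative only):

```python
import math

def summary_stats(rates):
    """Arithmetic average, geometric mean and harmonic mean of kernel rates."""
    n = len(rates)
    average = sum(rates) / n
    geometric = math.exp(sum(math.log(r) for r in rates) / n)
    harmonic = n / sum(1.0 / r for r in rates)
    return average, geometric, harmonic

# Illustrative sample drawn from three of the reported quantiles.
avg, geo, har = summary_stats([2468.239, 583.8345, 185.4776])
```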

Version: 22/DEC/86  mf528  6094
CHECK FOR CLOCK CALIBRATION ONLY:
Total Job Cpu Time    = 4.14117E+01 Sec.
Total 24 Kernels Time = 1.29659E+00 Sec.
Total 24 Kernels Flops= 8.47454E+08 Flops
Warning: ieee underflow is signaling
FORTRAN STOP

B.2 GFortran results

 *********************************************
  THE LIVERMORE FORTRAN KERNELS "MFLOPS" TEST:
 *********************************************

 VERIFY:   200  0.1060E-05 = Time Resolution of Cpu-timer
          6800  Repetition Count = MULTI * Loops2 = 50.000

 CLOCK CALIBRATION TEST OF INTERNAL CPU-TIMER: SECOND
 MONOPROCESS THIS TEST, STANDALONE, NO TIMESHARING.
 VERIFY TIMED INTERVALS SHOWN BELOW USING EXTERNAL CLOCK
 START YOUR STOPWATCH NOW !

Verify T or DT observe external clock:

          Total T?   Delta T?   Mflops?   Flops
          -------    -------    ------    -----
   1        9.10       9.10    1117.96    0.10171E+11
   2       18.20       9.10    1117.96    0.20343E+11
   3       27.29       9.10    1117.96    0.30514E+11
   4       36.39       9.10    1117.96    0.40685E+11
          -------    -------    ------    -----
 END CALIBRATION TEST.

 ESTIMATED TOTAL JOB CPU-TIME:=  38.100 sec. ( Nruns= 7 Trials)

 Trial= 1  ChkSum= 914  Pass= 0  Fail= 0
 Trial= 2  ChkSum= 914  Pass= 1  Fail= 0
 Trial= 3  ChkSum= 914  Pass= 2  Fail= 0
 Trial= 4  ChkSum= 914  Pass= 3  Fail= 0
 Trial= 5  ChkSum= 914  Pass= 4  Fail= 0
 Trial= 6  ChkSum= 914  Pass= 5  Fail= 0
 Trial= 7  ChkSum= 914  Pass= 6  Fail= 0

 NET CPU TIMING VARIANCE (Terr); A few % is ok:

          AVERAGE   STANDEV   MINIMUM   MAXIMUM
 Terr      0.27%     0.41%     0.00%     2.13%

 NET CPU TIMING VARIANCE (Terr); A few % is ok:

          AVERAGE   STANDEV   MINIMUM   MAXIMUM
 Terr      0.20%     0.18%     0.03%     0.79%

 NET CPU TIMING VARIANCE (Terr); A few % is ok:

          AVERAGE   STANDEV   MINIMUM   MAXIMUM
 Terr      0.18%     0.12%     0.02%     0.43%

 ********************************************
  THE LIVERMORE FORTRAN KERNELS:  * SUMMARY *
 ********************************************

 Computer: Cortex-A57
 System:   Linux
 Compiler: gfortran
 Date:
 Testor:

 MFLOPS RANGE:            REPORT ALL RANGE STATISTICS:
 Mean DO Span   =   167   Code Samples =  72

 Maximum   Rate =  2407.9233 Mega-Flops/Sec.
 Quartile  Q3   =  1016.4227 Mega-Flops/Sec.
 Average   Rate =   833.5390 Mega-Flops/Sec.
 GEOMETRIC MEAN =   671.6541 Mega-Flops/Sec.
 Median    Q2   =   629.6430 Mega-Flops/Sec.
 Harmonic  Mean =   552.5307 Mega-Flops/Sec.
 Quartile  Q1   =   461.9650 Mega-Flops/Sec.
 Minimum   Rate =   219.0730 Mega-Flops/Sec.

 Standard  Dev. =   587.0114 Mega-Flops/Sec.
 Avg Efficiency =  27.89%    Program & Processor
 Mean Precision =  12.69     Decimal Digits

 Version:  22/DEC/86  mf528                 6097

 CHECK FOR CLOCK CALIBRATION ONLY:
 Total Job    Cpu Time =   3.81014E+01 Sec.
 Total 24 Kernels Time =   1.16649E+00 Sec.
 Total 24 Kernels Flops=   8.47454E+08 Flops

C Examples of vectorized and non-vectorized LLVM IR code

define void @sub1(i64* nocapture readonly %length, i64* nocapture readonly %a,
                  i64* nocapture readonly %b, i64* nocapture readonly %c,
                  i64* nocapture %d, i64* nocapture %e) local_unnamed_addr #0 {
L.entry:
  %0 = bitcast i64* %length to i32*
  %1 = load i32, i32* %0, align 4
  %2 = icmp slt i32 %1, 1
  br i1 %2, label %L.LB1_318, label %L.LB1_317.preheader

L.LB1_317.preheader:                              ; preds = %L.entry
  %3 = bitcast i64* %b to i8*
  %4 = getelementptr i8, i8* %3, i64 -4
  %5 = bitcast i8* %4 to float*
  %6 = bitcast i64* %c to i8*
  %7 = getelementptr i8, i8* %6, i64 -4
  %8 = bitcast i8* %7 to float*
  %9 = bitcast i64* %a to i8*
  %10 = getelementptr i8, i8* %9, i64 -4
  %11 = bitcast i8* %10 to float*
  %12 = bitcast i64* %d to i8*
  %13 = getelementptr i8, i8* %12, i64 -4
  %14 = bitcast i8* %13 to float*
  %15 = bitcast i64* %e to i8*
  %16 = getelementptr i8, i8* %15, i64 -4
  %17 = bitcast i8* %16 to float*
  br label %L.LB1_317

L.LB1_317:                                        ; preds = %L.LB1_317.preheader, %L.LB1_317
  %indvars.iv = phi i64 [ 1, %L.LB1_317.preheader ], [ %indvars.iv.next, %L.LB1_317 ]
  %.dY0001_319.0 = phi i32 [ %1, %L.LB1_317.preheader ], [ %43, %L.LB1_317 ]
  %18 = getelementptr float, float* %5, i64 %indvars.iv
  %19 = load float, float* %18, align 4
  %20 = fmul float %19, %19
  %21 = getelementptr float, float* %8, i64 %indvars.iv
  %22 = load float, float* %21, align 4
  %23 = getelementptr float, float* %11, i64 %indvars.iv
  %24 = load float, float* %23, align 4
  %25 = fmul float %24, 4.000000e+00
  %26 = fmul float %22, %25
  %27 = fsub float %20, %26
  %28 = tail call float @llvm.sqrt.f32(float %27)
  %29 = getelementptr float, float* %14, i64 %indvars.iv
  store float %28, float* %29, align 4
  %30 = fsub float -0.000000e+00, %28
  %31 = load float, float* %18, align 4
  %32 = fsub float %30, %31
  %33 = fmul float %32, 5.000000e-01
  %34 = load float, float* %23, align 4
  %35 = fdiv float %33, %34
  %36 = getelementptr float, float* %17, i64 %indvars.iv
  store float %35, float* %36, align 4
  %37 = load float, float* %29, align 4
  %38 = load float, float* %18, align 4
  %39 = fsub float %37, %38
  %40 = fmul float %39, 5.000000e-01
  %41 = load float, float* %23, align 4
  %42 = fdiv float %40, %41
  store float %42, float* %29, align 4
  %indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
  %43 = add nsw i32 %.dY0001_319.0, -1
  %44 = icmp sgt i32 %.dY0001_319.0, 1
  br i1 %44, label %L.LB1_317, label %L.LB1_318

L.LB1_318:                                        ; preds = %L.LB1_317, %L.entry
  ret void
}

Listing 8: Example non-vectorized LLVM IR code for a Fortran subroutine with a loop.

define void @sub1(i64* nocapture readonly %length, i64* nocapture readonly %a,
                  i64* nocapture readonly %b, i64* nocapture readonly %c,
                  i64* nocapture %d, i64* nocapture %e) local_unnamed_addr #0 {
L.entry:
  %0 = bitcast i64* %length to i32*
  %1 = bitcast i64* %d to i8*
  %2 = bitcast i64* %e to i8*
  %3 = bitcast i64* %b to i8*
  %4 = bitcast i64* %c to i8*
  %5 = bitcast i64* %a to i8*
  %6 = load i32, i32* %0, align 4
  %7 = icmp slt i32 %6, 1
  br i1 %7, label %L.LB1_318, label %L.LB1_317.preheader

L.LB1_317.preheader:                              ; preds = %L.entry
  %8 = getelementptr i8, i8* %3, i64 -4
  %9 = bitcast i8* %8 to float*
  %10 = getelementptr i8, i8* %4, i64 -4
  %11 = bitcast i8* %10 to float*
  %12 = getelementptr i8, i8* %5, i64 -4
  %13 = bitcast i8* %12 to float*
  %14 = getelementptr i8, i8* %1, i64 -4
  %15 = bitcast i8* %14 to float*
  %16 = getelementptr i8, i8* %2, i64 -4
  %17 = bitcast i8* %16 to float*
  %18 = xor i32 %6, -1
  %19 = icmp sgt i32 %18, -2
  %smax = select i1 %19, i32 %18, i32 -2
  %20 = add i32 %6, %smax
  %21 = add i32 %20, 1
  %22 = zext i32 %21 to i64
  %23 = add nuw nsw i64 %22, 1
  %min.iters.check = icmp ult i64 %23, 4
  br i1 %min.iters.check, label %L.LB1_317, label %min.iters.checked

min.iters.checked:                                ; preds = %L.LB1_317.preheader
  %24 = add i32 %20, 2
  %25 = and i32 %24, 3
  %n.mod.vf = zext i32 %25 to i64
  %n.vec = sub nsw i64 %23, %n.mod.vf
  %cmp.zero = icmp eq i64 %n.vec, 0
  br i1 %cmp.zero, label %L.LB1_317, label %vector.memcheck

vector.memcheck:                                  ; preds = %min.iters.checked
  %26 = xor i32 %6, -1
  %27 = icmp sgt i32 %26, -2
  %smax2 = select i1 %27, i32 %26, i32 -2
  %28 = add i32 %6, %smax2
  %29 = add i32 %28, 1
  %30 = zext i32 %29 to i64
  %31 = shl nuw nsw i64 %30, 2
  %32 = add nuw nsw i64 %31, 4
  %uglygep = getelementptr i8, i8* %1, i64 %32
  %uglygep3 = getelementptr i8, i8* %2, i64 %32
  %uglygep4 = getelementptr i8, i8* %3, i64 %32
  %uglygep5 = getelementptr i8, i8* %4, i64 %32
  %uglygep6 = getelementptr i8, i8* %5, i64 %32
  %bound0 = icmp ugt i8* %uglygep3, %1
  %bound1 = icmp ugt i8* %uglygep, %2
  %found.conflict = and i1 %bound0, %bound1
  %bound07 = icmp ugt i8* %uglygep4, %1
  %bound18 = icmp ugt i8* %uglygep, %3
  %found.conflict9 = and i1 %bound07, %bound18
  %conflict.rdx = or i1 %found.conflict, %found.conflict9
  %bound010 = icmp ugt i8* %uglygep5, %1
  %bound111 = icmp ugt i8* %uglygep, %4
  %found.conflict12 = and i1 %bound010, %bound111
  %conflict.rdx13 = or i1 %conflict.rdx, %found.conflict12
  %bound014 = icmp ugt i8* %uglygep6, %1
  %bound115 = icmp ugt i8* %uglygep, %5
  %found.conflict16 = and i1 %bound014, %bound115
  %conflict.rdx17 = or i1 %conflict.rdx13, %found.conflict16
  %bound018 = icmp ugt i8* %uglygep4, %2
  %bound119 = icmp ugt i8* %uglygep3, %3
  %found.conflict20 = and i1 %bound018, %bound119
  %conflict.rdx21 = or i1 %conflict.rdx17, %found.conflict20
  %bound022 = icmp ugt i8* %uglygep5, %2
  %bound123 = icmp ugt i8* %uglygep3, %4
  %found.conflict24 = and i1 %bound022, %bound123
  %conflict.rdx25 = or i1 %conflict.rdx21, %found.conflict24
  %bound026 = icmp ugt i8* %uglygep6, %2
  %bound127 = icmp ugt i8* %uglygep3, %5
  %found.conflict28 = and i1 %bound026, %bound127
  %conflict.rdx29 = or i1 %conflict.rdx25, %found.conflict28
  %ind.end = add nsw i64 %n.vec, 1
  %cast.crd = trunc i64 %n.vec to i32
  %ind.end31 = sub i32 %6, %cast.crd
  br i1 %conflict.rdx29, label %L.LB1_317, label %vector.body

vector.body:                                      ; preds = %vector.memcheck, %vector.body
  %index = phi i64 [ %index.next, %vector.body ], [ 0, %vector.memcheck ]
  %offset.idx = or i64 %index, 1
  %33 = getelementptr float, float* %9, i64 %offset.idx
  %34 = bitcast float* %33 to <4 x float>*
  %wide.load = load <4 x float>, <4 x float>* %34, align 4, !alias.scope !0
  %35 = fmul <4 x float> %wide.load, %wide.load
  %36 = getelementptr float, float* %11, i64 %offset.idx
  %37 = bitcast float* %36 to <4 x float>*
  %wide.load36 = load <4 x float>, <4 x float>* %37, align 4, !alias.scope !3
  %38 = getelementptr float, float* %13, i64 %offset.idx
  %39 = bitcast float* %38 to <4 x float>*
  %wide.load37 = load <4 x float>, <4 x float>* %39, align 4, !alias.scope !5
  %40 = fmul <4 x float> %wide.load37, <float 4.000000e+00, float 4.000000e+00, float 4.000000e+00, float 4.000000e+00>
  %41 = fmul <4 x float> %wide.load36, %40
  %42 = fsub <4 x float> %35, %41
  %43 = call <4 x float> @llvm.sqrt.v4f32(<4 x float> %42)
  %44 = getelementptr float, float* %15, i64 %offset.idx
  %45 = bitcast float* %44 to <4 x float>*
  store <4 x float> %43, <4 x float>* %45, align 4, !alias.scope !7, !noalias !9
  %46 = fsub <4 x float> <float -0.000000e+00, float -0.000000e+00, float -0.000000e+00, float -0.000000e+00>, %43
  %47 = bitcast float* %33 to <4 x float>*
  %wide.load38 = load <4 x float>, <4 x float>* %47, align 4, !alias.scope !0
  %48 = fsub <4 x float> %46, %wide.load38
  %49 = fmul <4 x float> %48, <float 5.000000e-01, float 5.000000e-01, float 5.000000e-01, float 5.000000e-01>
  %50 = bitcast float* %38 to <4 x float>*
  %wide.load39 = load <4 x float>, <4 x float>* %50, align 4, !alias.scope !5
  %51 = fdiv <4 x float> %49, %wide.load39
  %52 = getelementptr float, float* %17, i64 %offset.idx
  %53 = bitcast float* %52 to <4 x float>*
  store <4 x float> %51, <4 x float>* %53, align 4, !alias.scope !11, !noalias !12
  %54 = bitcast float* %44 to <4 x float>*
  %wide.load40 = load <4 x float>, <4 x float>* %54, align 4, !alias.scope !7, !noalias !9
  %55 = bitcast float* %33 to <4 x float>*
  %wide.load41 = load <4 x float>, <4 x float>* %55, align 4, !alias.scope !0
  %56 = fsub <4 x float> %wide.load40, %wide.load41
  %57 = fmul <4 x float> %56, <float 5.000000e-01, float 5.000000e-01, float 5.000000e-01, float 5.000000e-01>
  %58 = bitcast float* %38 to <4 x float>*
  %wide.load42 = load <4 x float>, <4 x float>* %58, align 4, !alias.scope !5
  %59 = fdiv <4 x float> %57, %wide.load42
  %60 = bitcast float* %44 to <4 x float>*
  store <4 x float> %59, <4 x float>* %60, align 4, !alias.scope !7, !noalias !9
  %index.next = add i64 %index, 4
  %61 = icmp eq i64 %index.next, %n.vec
  br i1 %61, label %middle.block, label %vector.body, !llvm.loop !13

middle.block:                                     ; preds = %vector.body
  %cmp.n = icmp eq i32 %25, 0
  br i1 %cmp.n, label %L.LB1_318, label %L.LB1_317

L.LB1_317:                                        ; preds = %L.LB1_317.preheader, %min.iters.checked, %vector.memcheck, %middle.block, %L.LB1_317
  %indvars.iv = phi i64 [ %indvars.iv.next, %L.LB1_317 ], [ 1, %vector.memcheck ], [ 1, %min.iters.checked ], [ 1, %L.LB1_317.preheader ], [ %ind.end, %middle.block ]
  %.dY0001_319.0 = phi i32 [ %87, %L.LB1_317 ], [ %6, %vector.memcheck ], [ %6, %min.iters.checked ], [ %6, %L.LB1_317.preheader ], [ %ind.end31, %middle.block ]
  %62 = getelementptr float, float* %9, i64 %indvars.iv
  %63 = load float, float* %62, align 4
  %64 = fmul float %63, %63
  %65 = getelementptr float, float* %11, i64 %indvars.iv
  %66 = load float, float* %65, align 4
  %67 = getelementptr float, float* %13, i64 %indvars.iv
  %68 = load float, float* %67, align 4
  %69 = fmul float %68, 4.000000e+00
  %70 = fmul float %66, %69
  %71 = fsub float %64, %70
  %72 = tail call float @llvm.sqrt.f32(float %71)
  %73 = getelementptr float, float* %15, i64 %indvars.iv
  store float %72, float* %73, align 4
  %74 = fsub float -0.000000e+00, %72
  %75 = load float, float* %62, align 4
  %76 = fsub float %74, %75
  %77 = fmul float %76, 5.000000e-01
  %78 = load float, float* %67, align 4
  %79 = fdiv float %77, %78
  %80 = getelementptr float, float* %17, i64 %indvars.iv
  store float %79, float* %80, align 4
  %81 = load float, float* %73, align 4
  %82 = load float, float* %62, align 4
  %83 = fsub float %81, %82
  %84 = fmul float %83, 5.000000e-01
  %85 = load float, float* %67, align 4
  %86 = fdiv float %84, %85
  store float %86, float* %73, align 4
  %indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
  %87 = add nsw i32 %.dY0001_319.0, -1
  %88 = icmp sgt i32 %.dY0001_319.0, 1
  br i1 %88, label %L.LB1_317, label %L.LB1_318, !llvm.loop !16

L.LB1_318:                                        ; preds = %L.LB1_317, %middle.block, %L.entry
  ret void
}

Listing 9: Example LLVM IR code for a Fortran subroutine with a loop vectorized for SIMD units.


define void @sub1(i64* nocapture readonly %length, i64* nocapture readonly %a,
                  i64* nocapture readonly %b, i64* nocapture readonly %c,
                  i64* nocapture %d, i64* nocapture %e) local_unnamed_addr #0 {
L.entry:
  %0 = bitcast i64* %length to i32*
  %1 = bitcast i64* %d to i8*
  %2 = bitcast i64* %e to i8*
  %3 = bitcast i64* %b to i8*
  %4 = bitcast i64* %c to i8*
  %5 = bitcast i64* %a to i8*
  %6 = load i32, i32* %0, align 4
  %7 = icmp slt i32 %6, 1
  br i1 %7, label %L.LB1_318, label %L.LB1_317.preheader

L.LB1_317.preheader:                              ; preds = %L.entry
  %8 = getelementptr i8, i8* %3, i64 -4
  %9 = bitcast i8* %8 to float*
  %10 = getelementptr i8, i8* %4, i64 -4
  %11 = bitcast i8* %10 to float*
  %12 = getelementptr i8, i8* %5, i64 -4
  %13 = bitcast i8* %12 to float*
  %14 = getelementptr i8, i8* %1, i64 -4
  %15 = bitcast i8* %14 to float*
  %16 = getelementptr i8, i8* %2, i64 -4
  %17 = bitcast i8* %16 to float*
  %18 = xor i32 %6, -1
  %19 = xor i32 %6, -1
  %20 = icmp sgt i32 %19, -2
  %smax2 = select i1 %20, i32 %19, i32 -2
  %21 = add i32 %6, %smax2
  %22 = add i32 %21, 1
  %23 = zext i32 %22 to i64
  %24 = shl nuw nsw i64 %23, 2
  %25 = add nuw nsw i64 %24, 4
  %uglygep = getelementptr i8, i8* %1, i64 %25
  %uglygep3 = getelementptr i8, i8* %2, i64 %25
  %uglygep4 = getelementptr i8, i8* %3, i64 %25
  %uglygep5 = getelementptr i8, i8* %4, i64 %25
  %uglygep6 = getelementptr i8, i8* %5, i64 %25
  %bound0 = icmp ugt i8* %uglygep3, %1
  %bound1 = icmp ugt i8* %uglygep, %2
  %found.conflict = and i1 %bound0, %bound1
  %bound07 = icmp ugt i8* %uglygep4, %1
  %bound18 = icmp ugt i8* %uglygep, %3
  %found.conflict9 = and i1 %bound07, %bound18
  %conflict.rdx = or i1 %found.conflict, %found.conflict9
  %bound010 = icmp ugt i8* %uglygep5, %1
  %bound111 = icmp ugt i8* %uglygep, %4
  %found.conflict12 = and i1 %bound010, %bound111
  %conflict.rdx13 = or i1 %conflict.rdx, %found.conflict12
  %bound014 = icmp ugt i8* %uglygep6, %1
  %bound115 = icmp ugt i8* %uglygep, %5
  %found.conflict16 = and i1 %bound014, %bound115
  %conflict.rdx17 = or i1 %conflict.rdx13, %found.conflict16
  %bound018 = icmp ugt i8* %uglygep4, %2
  %bound119 = icmp ugt i8* %uglygep3, %3
  %found.conflict20 = and i1 %bound018, %bound119
  %conflict.rdx21 = or i1 %conflict.rdx17, %found.conflict20
  %bound022 = icmp ugt i8* %uglygep5, %2
  %bound123 = icmp ugt i8* %uglygep3, %4
  %found.conflict24 = and i1 %bound022, %bound123
  %conflict.rdx25 = or i1 %conflict.rdx21, %found.conflict24
  %bound026 = icmp ugt i8* %uglygep6, %2
  %bound127 = icmp ugt i8* %uglygep3, %5
  %found.conflict28 = and i1 %bound026, %bound127
  %conflict.rdx29 = or i1 %conflict.rdx25, %found.conflict28
  br i1 %conflict.rdx29, label %L.LB1_317, label %vector.ph

vector.ph:                                        ; preds = %L.LB1_317.preheader
  %26 = icmp sgt i32 %18, -2
  %smax = select i1 %26, i32 %18, i32 -2
  %27 = add i32 %6, %smax
  %28 = add i32 %27, 1
  %29 = zext i32 %28 to i64
  %30 = add nuw nsw i64 %29, 1
  %wide.end.idx.splatinsert = insertelement <vscale x 4 x i64> undef, i64 %30, i32 0
  %wide.end.idx.splat = shufflevector <vscale x 4 x i64> %wide.end.idx.splatinsert, <vscale x 4 x i64> undef, <vscale x 4 x i32> zeroinitializer
  %31 = icmp ugt <vscale x 4 x i64> %wide.end.idx.splat, stepvector
  %predicate.entry = call <vscale x 4 x i1> @llvm.propff.nxv4i1(<vscale x 4 x i1> shufflevector (<vscale x 4 x i1> insertelement (<vscale x 4 x i1> undef, i1 true, i32 0), <vscale x 4 x i1> undef, <vscale x 4 x i32> zeroinitializer), <vscale x 4 x i1> %31)
  br label %vector.body

vector.body:                                      ; preds = %vector.body, %vector.ph
  %index = phi i64 [ 0, %vector.ph ], [ %index.next, %vector.body ]
  %predicate = phi <vscale x 4 x i1> [ %predicate.entry, %vector.ph ], [ %predicate.next, %vector.body ]
  %32 = icmp ult i64 %index, 2147483647
  call void @llvm.assume(i1 %32)
  %offset.idx = or i64 %index, 1
  %33 = getelementptr float, float* %9, i64 %offset.idx
  %34 = bitcast float* %33 to <vscale x 4 x float>*
  %wide.masked.load = call <vscale x 4 x float> @llvm.masked.load.nxv4f32.p0nxv4f32(<vscale x 4 x float>* %34, i32 4, <vscale x 4 x i1> %predicate, <vscale x 4 x float> undef), !alias.scope !0
  %35 = fmul <vscale x 4 x float> %wide.masked.load, %wide.masked.load
  %36 = getelementptr float, float* %11, i64 %offset.idx
  %37 = bitcast float* %36 to <vscale x 4 x float>*
  %wide.masked.load39 = call <vscale x 4 x float> @llvm.masked.load.nxv4f32.p0nxv4f32(<vscale x 4 x float>* %37, i32 4, <vscale x 4 x i1> %predicate, <vscale x 4 x float> undef), !alias.scope !3
  %38 = getelementptr float, float* %13, i64 %offset.idx
  %39 = bitcast float* %38 to <vscale x 4 x float>*
  %wide.masked.load43 = call <vscale x 4 x float> @llvm.masked.load.nxv4f32.p0nxv4f32(<vscale x 4 x float>* %39, i32 4, <vscale x 4 x i1> %predicate, <vscale x 4 x float> undef), !alias.scope !5
  %40 = fmul <vscale x 4 x float> %wide.masked.load43, shufflevector (<vscale x 4 x float> insertelement (<vscale x 4 x float> undef, float 4.000000e+00, i32 0), <vscale x 4 x float> undef, <vscale x 4 x i32> zeroinitializer)
  %41 = fmul <vscale x 4 x float> %wide.masked.load39, %40
  %42 = fsub <vscale x 4 x float> %35, %41
  %43 = call <vscale x 4 x float> @llvm.sqrt.nxv4f32(<vscale x 4 x float> %42)
  %44 = getelementptr float, float* %15, i64 %offset.idx
  %45 = bitcast float* %44 to <vscale x 4 x float>*
  call void @llvm.masked.store.nxv4f32.p0nxv4f32(<vscale x 4 x float> %43, <vscale x 4 x float>* %45, i32 4, <vscale x 4 x i1> %predicate), !alias.scope !7, !noalias !9
  %46 = fsub <vscale x 4 x float> shufflevector (<vscale x 4 x float> insertelement (<vscale x 4 x float> undef, float -0.000000e+00, i32 0), <vscale x 4 x float> undef, <vscale x 4 x i32> zeroinitializer), %43
  %47 = bitcast float* %33 to <vscale x 4 x float>*
  %wide.masked.load47 = call <vscale x 4 x float> @llvm.masked.load.nxv4f32.p0nxv4f32(<vscale x 4 x float>* %47, i32 4, <vscale x 4 x i1> %predicate, <vscale x 4 x float> undef), !alias.scope !0
  %48 = fsub <vscale x 4 x float> %46, %wide.masked.load47
  %49 = fmul <vscale x 4 x float> %48, shufflevector (<vscale x 4 x float> insertelement (<vscale x 4 x float> undef, float 5.000000e-01, i32 0), <vscale x 4 x float> undef, <vscale x 4 x i32> zeroinitializer)
  %50 = bitcast float* %38 to <vscale x 4 x float>*
  %wide.masked.load48 = call <vscale x 4 x float> @llvm.masked.load.nxv4f32.p0nxv4f32(<vscale x 4 x float>* %50, i32 4, <vscale x 4 x i1> %predicate, <vscale x 4 x float> undef), !alias.scope !5
  %51 = fdiv <vscale x 4 x float> %49, %wide.masked.load48
  %52 = getelementptr float, float* %17, i64 %offset.idx
  %53 = bitcast float* %52 to <vscale x 4 x float>*
  call void @llvm.masked.store.nxv4f32.p0nxv4f32(<vscale x 4 x float> %51, <vscale x 4 x float>* %53, i32 4, <vscale x 4 x i1> %predicate), !alias.scope !11, !noalias !12
  %54 = bitcast float* %44 to <vscale x 4 x float>*
  %wide.masked.load52 = call <vscale x 4 x float> @llvm.masked.load.nxv4f32.p0nxv4f32(<vscale x 4 x float>* %54, i32 4, <vscale x 4 x i1> %predicate, <vscale x 4 x float> undef), !alias.scope !7, !noalias !9
  %55 = bitcast float* %33 to <vscale x 4 x float>*
  %wide.masked.load53 = call <vscale x 4 x float> @llvm.masked.load.nxv4f32.p0nxv4f32(<vscale x 4 x float>* %55, i32 4, <vscale x 4 x i1> %predicate, <vscale x 4 x float> undef), !alias.scope !0
  %56 = fsub <vscale x 4 x float> %wide.masked.load52, %wide.masked.load53
  %57 = fmul <vscale x 4 x float> %56, shufflevector (<vscale x 4 x float> insertelement (<vscale x 4 x float> undef, float 5.000000e-01, i32 0), <vscale x 4 x float> undef, <vscale x 4 x i32> zeroinitializer)
  %58 = bitcast float* %38 to <vscale x 4 x float>*
  %wide.masked.load54 = call <vscale x 4 x float> @llvm.masked.load.nxv4f32.p0nxv4f32(<vscale x 4 x float>* %58, i32 4, <vscale x 4 x i1> %predicate, <vscale x 4 x float> undef), !alias.scope !5
  %59 = fdiv <vscale x 4 x float> %57, %wide.masked.load54
  %60 = bitcast float* %44 to <vscale x 4 x float>*
  call void @llvm.masked.store.nxv4f32.p0nxv4f32(<vscale x 4 x float> %59, <vscale x 4 x float>* %60, i32 4, <vscale x 4 x i1> %predicate), !alias.scope !7, !noalias !9
  %index.next = add nuw nsw i64 %index, mul (i64 vscale, i64 4)
  %61 = add nuw nsw i64 %index, mul (i64 vscale, i64 4)
  %.splatinsert55 = insertelement <vscale x 4 x i64> undef, i64 %61, i32 0
  %.splat56 = shufflevector <vscale x 4 x i64> %.splatinsert55, <vscale x 4 x i64> undef, <vscale x 4 x i32> zeroinitializer
  %62 = add <vscale x 4 x i64> %.splat56, stepvector
  %63 = icmp ult <vscale x 4 x i64> %62, %wide.end.idx.splat
  %predicate.next = call <vscale x 4 x i1> @llvm.propff.nxv4i1(<vscale x 4 x i1> %predicate, <vscale x 4 x i1> %63)
  %64 = extractelement <vscale x 4 x i1> %predicate.next, i64 0
  br i1 %64, label %vector.body, label %L.LB1_318, !llvm.loop !13

L.LB1_317:                                        ; preds = %L.LB1_317.preheader, %L.LB1_317
  %indvars.iv = phi i64 [ %indvars.iv.next, %L.LB1_317 ], [ 1, %L.LB1_317.preheader ]
  %.dY0001_319.0 = phi i32 [ %90, %L.LB1_317 ], [ %6, %L.LB1_317.preheader ]
  %65 = getelementptr float, float* %9, i64 %indvars.iv
  %66 = load float, float* %65, align 4
  %67 = fmul float %66, %66
  %68 = getelementptr float, float* %11, i64 %indvars.iv
  %69 = load float, float* %68, align 4
  %70 = getelementptr float, float* %13, i64 %indvars.iv
  %71 = load float, float* %70, align 4
  %72 = fmul float %71, 4.000000e+00
  %73 = fmul float %69, %72
  %74 = fsub float %67, %73
  %75 = tail call float @llvm.sqrt.f32(float %74)
  %76 = getelementptr float, float* %15, i64 %indvars.iv
  store float %75, float* %76, align 4
  %77 = fsub float -0.000000e+00, %75
  %78 = load float, float* %65, align 4
  %79 = fsub float %77, %78
  %80 = fmul float %79, 5.000000e-01
  %81 = load float, float* %70, align 4
  %82 = fdiv float %80, %81
  %83 = getelementptr float, float* %17, i64 %indvars.iv
  store float %82, float* %83, align 4
  %84 = load float, float* %76, align 4
  %85 = load float, float* %65, align 4
  %86 = fsub float %84, %85
  %87 = fmul float %86, 5.000000e-01
  %88 = load float, float* %70, align 4
  %89 = fdiv float %87, %88
  store float %89, float* %76, align 4
  %indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
  %90 = add nsw i32 %.dY0001_319.0, -1
  %91 = icmp sgt i32 %.dY0001_319.0, 1
  br i1 %91, label %L.LB1_317, label %L.LB1_318, !llvm.loop !16

L.LB1_318:                                        ; preds = %vector.body, %L.LB1_317, %L.entry
  ret void
}

Listing 10: Example LLVM IR code for a Fortran subroutine with a loop vectorized for the SVE extension.

define void @sub2(i64* nocapture readonly %length, i64* nocapture readonly %a,
                  i64* nocapture readonly %b, i64* nocapture readonly %c,
                  i64* nocapture %d, i64* nocapture %e) local_unnamed_addr #0 {
L.entry:
  %0 = bitcast i64* %length to i32*
  %1 = bitcast i64* %d to i8*
  %2 = bitcast i64* %e to i8*
  %3 = bitcast i64* %b to i8*
  %4 = bitcast i64* %c to i8*
  %5 = bitcast i64* %a to i8*
  %6 = load i32, i32* %0, align 4
  %7 = icmp slt i32 %6, 1
  br i1 %7, label %L.LB1_318, label %L.LB1_317.preheader

L.LB1_317.preheader:                              ; preds = %L.entry
  %8 = getelementptr i8, i8* %3, i64 -4
  %9 = bitcast i8* %8 to float*
  %10 = getelementptr i8, i8* %4, i64 -4
  %11 = bitcast i8* %10 to float*
  %12 = getelementptr i8, i8* %5, i64 -4
  %13 = bitcast i8* %12 to float*
  %14 = getelementptr i8, i8* %1, i64 -4
  %15 = bitcast i8* %14 to float*
  %16 = getelementptr i8, i8* %2, i64 -4
  %17 = bitcast i8* %16 to float*
  %18 = xor i32 %6, -1
  %19 = xor i32 %6, -1
  %20 = icmp sgt i32 %19, -2
  %smax4 = select i1 %20, i32 %19, i32 -2
  %21 = add i32 %6, %smax4
  %22 = add i32 %21, 1
  %23 = zext i32 %22 to i64
  %24 = shl nuw nsw i64 %23, 2
  %25 = add nuw nsw i64 %24, 4
  %uglygep = getelementptr i8, i8* %1, i64 %25
  %uglygep5 = getelementptr i8, i8* %2, i64 %25
  %uglygep6 = getelementptr i8, i8* %3, i64 %25
  %uglygep7 = getelementptr i8, i8* %4, i64 %25
  %uglygep8 = getelementptr i8, i8* %5, i64 %25
  %bound0 = icmp ugt i8* %uglygep5, %1
  %bound1 = icmp ugt i8* %uglygep, %2
  %found.conflict = and i1 %bound0, %bound1
  %bound09 = icmp ugt i8* %uglygep6, %1
  %bound110 = icmp ugt i8* %uglygep, %3
  %found.conflict11 = and i1 %bound09, %bound110
  %conflict.rdx = or i1 %found.conflict, %found.conflict11
  %bound012 = icmp ugt i8* %uglygep7, %1
  %bound113 = icmp ugt i8* %uglygep, %4
  %found.conflict14 = and i1 %bound012, %bound113
  %conflict.rdx15 = or i1 %conflict.rdx, %found.conflict14
  %bound016 = icmp ugt i8* %uglygep8, %1
  %bound117 = icmp ugt i8* %uglygep, %5
  %found.conflict18 = and i1 %bound016, %bound117
  %conflict.rdx19 = or i1 %conflict.rdx15, %found.conflict18
  %bound020 = icmp ugt i8* %uglygep6, %2
  %bound121 = icmp ugt i8* %uglygep5, %3
  %found.conflict22 = and i1 %bound020, %bound121
  %conflict.rdx23 = or i1 %conflict.rdx19, %found.conflict22
  %bound024 = icmp ugt i8* %uglygep7, %2
  %bound125 = icmp ugt i8* %uglygep5, %4
  %found.conflict26 = and i1 %bound024, %bound125
  %conflict.rdx27 = or i1 %conflict.rdx23, %found.conflict26
  %bound028 = icmp ugt i8* %uglygep8, %2
  %bound129 = icmp ugt i8* %uglygep5, %5
  %found.conflict30 = and i1 %bound028, %bound129
  %conflict.rdx31 = or i1 %conflict.rdx27, %found.conflict30
  br i1 %conflict.rdx31, label %L.LB1_317, label %vector.ph

vector.ph:                                        ; preds = %L.LB1_317.preheader
  %26 = icmp sgt i32 %18, -2
  %smax = select i1 %26, i32 %18, i32 -2
  %27 = add i32 %6, %smax
  %28 = add i32 %27, 1
  %29 = zext i32 %28 to i64
  %30 = add nuw nsw i64 %29, 1
  %wide.end.idx.splatinsert = insertelement <vscale x 4 x i64> undef, i64 %30, i32 0
  %wide.end.idx.splat = shufflevector <vscale x 4 x i64> %wide.end.idx.splatinsert, <vscale x 4 x i64> undef, <vscale x 4 x i32> zeroinitializer
  %31 = icmp ugt <vscale x 4 x i64> %wide.end.idx.splat, stepvector
  %predicate.entry = call <vscale x 4 x i1> @llvm.propff.nxv4i1(<vscale x 4 x i1> shufflevector (<vscale x 4 x i1> insertelement (<vscale x 4 x i1> undef, i1 true, i32 0), <vscale x 4 x i1> undef, <vscale x 4 x i32> zeroinitializer), <vscale x 4 x i1> %31)
  br label %vector.body

vector.body:                                      ; preds = %vector.body, %vector.ph
  %index = phi i64 [ 0, %vector.ph ], [ %index.next, %vector.body ]
  %predicate = phi <vscale x 4 x i1> [ %predicate.entry, %vector.ph ], [ %predicate.next, %vector.body ]
  %32 = icmp ult i64 %index, 2147483647
  call void @llvm.assume(i1 %32)
  %offset.idx = or i64 %index, 1
  %33 = getelementptr float, float* %9, i64 %offset.idx
  %34 = bitcast float* %33 to <vscale x 4 x float>*
  %wide.masked.load = call <vscale x 4 x float> @llvm.masked.load.nxv4f32.p0nxv4f32(<vscale x 4 x float>* %34, i32 4, <vscale x 4 x i1> %predicate, <vscale x 4 x float> undef), !alias.scope !0
  %35 = fmul <vscale x 4 x float> %wide.masked.load, %wide.masked.load
  %36 = getelementptr float, float* %11, i64 %offset.idx
  %37 = bitcast float* %36 to <vscale x 4 x float>*
  %wide.masked.load41 = call <vscale x 4 x float> @llvm.masked.load.nxv4f32.p0nxv4f32(<vscale x 4 x float>* %37, i32 4, <vscale x 4 x i1> %predicate, <vscale x 4 x float> undef), !alias.scope !3
  %38 = getelementptr float, float* %13, i64 %offset.idx
  %39 = bitcast float* %38 to <vscale x 4 x float>*
  %wide.masked.load45 = call <vscale x 4 x float> @llvm.masked.load.nxv4f32.p0nxv4f32(<vscale x 4 x float>* %39, i32 4, <vscale x 4 x i1> %predicate, <vscale x 4 x float> undef), !alias.scope !5
  %40 = fmul <vscale x 4 x float> %wide.masked.load45, shufflevector (<vscale x 4 x float> insertelement (<vscale x 4 x float> undef, float 4.000000e+00, i32 0), <vscale x 4 x float> undef, <vscale x 4 x i32> zeroinitializer)
  %41 = fmul <vscale x 4 x float> %wide.masked.load41, %40
  %42 = fsub <vscale x 4 x float> %35, %41
  %43 = fcmp oge <vscale x 4 x float> %42, zeroinitializer
  %44 = call <vscale x 4 x float> @llvm.sqrt.nxv4f32(<vscale x 4 x float> %42)
  %45 = getelementptr float, float* %15, i64 %offset.idx
  %46 = and <vscale x 4 x i1> %43, %predicate
  %47 = bitcast float* %45 to <vscale x 4 x float>*
  call void @llvm.masked.store.nxv4f32.p0nxv4f32(<vscale x 4 x float> %44, <vscale x 4 x float>* %47, i32 4, <vscale x 4 x i1> %46), !alias.scope !7, !noalias !9
  %48 = fsub <vscale x 4 x float> shufflevector (<vscale x 4 x float> insertelement (<vscale x 4 x float> undef, float -0.000000e+00, i32 0), <vscale x 4 x float> undef, <vscale x 4 x i32> zeroinitializer), %44
  %49 = bitcast float* %33 to <vscale x 4 x float>*
  %wide.masked.load49 = call <vscale x 4 x float> @llvm.masked.load.nxv4f32.p0nxv4f32(<vscale x 4 x float>* %49, i32 4, <vscale x 4 x i1> %predicate, <vscale x 4 x float> undef), !alias.scope !0
  %50 = fsub <vscale x 4 x float> %48, %wide.masked.load49
  %51 = fmul <vscale x 4 x float> %50, shufflevector (<vscale x 4 x float> insertelement (<vscale x 4 x float> undef, float 5.000000e-01, i32 0), <vscale x 4 x float> undef, <vscale x 4 x i32> zeroinitializer)
  %52 = bitcast float* %38 to <vscale x 4 x float>*
  %wide.masked.load50 = call <vscale x 4 x float> @llvm.masked.load.nxv4f32.p0nxv4f32(<vscale x 4 x float>* %52, i32 4, <vscale x 4 x i1> %predicate, <vscale x 4 x float> undef), !alias.scope !5
  %53 = fdiv <vscale x 4 x float> %51, %wide.masked.load50
  %54 = getelementptr float, float* %17, i64 %offset.idx
  %55 = and <vscale x 4 x i1> %43, %predicate
  %56 = bitcast float* %54 to <vscale x 4 x float>*
  call void @llvm.masked.store.nxv4f32.p0nxv4f32(<vscale x 4 x float> %53, <vscale x 4 x float>* %56, i32 4, <vscale x 4 x i1> %55), !alias.scope !11, !noalias !12
  %57 = and <vscale x 4 x i1> %43, %predicate
  %58 = bitcast float* %45 to <vscale x 4 x float>*
  %wide.masked.load54 = call <vscale x 4 x float> @llvm.masked.load.nxv4f32.p0nxv4f32(<vscale x 4 x float>* %58, i32 4, <vscale x 4 x i1> %57, <vscale x 4 x float> undef), !alias.scope !7, !noalias !9
  %59 = bitcast float* %33 to <vscale x 4 x float>*
  %wide.masked.load55 = call <vscale x 4 x float> @llvm.masked.load.nxv4f32.p0nxv4f32(<vscale x 4 x float>* %59, i32 4, <vscale x 4 x i1> %predicate, <vscale x 4 x float> undef), !alias.scope !0
  %60 = fsub <vscale x 4 x float> %wide.masked.load54, %wide.masked.load55
  %61 = fmul <vscale x 4 x float> %60, shufflevector (<vscale x 4 x float> insertelement (<vscale x 4 x float> undef, float 5.000000e-01, i32 0), <vscale x 4 x float> undef, <vscale x 4 x i32> zeroinitializer)
  %62 = bitcast float* %38 to <vscale x 4 x float>*
  %wide.masked.load56 = call <vscale x 4 x float> @llvm.masked.load.nxv4f32.p0nxv4f32(<vscale x 4 x float>* %62, i32 4, <vscale x 4 x i1> %predicate, <vscale x 4 x float> undef), !alias.scope !5
  %63 = fdiv <vscale x 4 x float> %61, %wide.masked.load56
  %64 = and <vscale x 4 x i1> %43, %predicate
  %65 = bitcast float* %45 to <vscale x 4 x float>*
  call void @llvm.masked.store.nxv4f32.p0nxv4f32(<vscale x 4 x float> %63, <vscale x 4 x float>* %65, i32 4, <vscale x 4 x i1> %64), !alias.scope !7, !noalias !9
  %index.next = add nuw nsw i64 %index, mul (i64 vscale, i64 4)
  %66 = add nuw nsw i64 %index, mul (i64 vscale, i64 4)
  %.splatinsert57 = insertelement <vscale x 4 x i64> undef, i64 %66, i32 0
  %.splat58 = shufflevector <vscale x 4 x i64> %.splatinsert57, <vscale x 4 x i64> undef, <vscale x 4 x i32> zeroinitializer
  %67 = add <vscale x 4 x i64> %.splat58, stepvector
  %68 = icmp ult <vscale x 4 x i64> %67, %wide.end.idx.splat
  %predicate.next = call <vscale x 4 x i1> @llvm.propff.nxv4i1(<vscale x 4 x i1> %predicate, <vscale x 4 x i1> %68)
  %69 = extractelement <vscale x 4 x i1> %predicate.next, i64 0
  br i1 %69, label %vector.body, label %L.LB1_318, !llvm.loop !13

L.LB1_317:                                        ; preds = %L.LB1_317.preheader, %L.LB1_320
  %indvars.iv = phi i64 [ %indvars.iv.next, %L.LB1_320 ], [ 1, %L.LB1_317.preheader ]
  %.dY0001_319.0 = phi i32 [ %96, %L.LB1_320 ], [ %6, %L.LB1_317.preheader ]
  %70 = getelementptr float, float* %9, i64 %indvars.iv
  %71 = load float, float* %70, align 4
  %72 = fmul float %71, %71
  %73 = getelementptr float, float* %11, i64 %indvars.iv
  %74 = load float, float* %73, align 4
  %75 = getelementptr float, float* %13, i64 %indvars.iv
  %76 = load float, float* %75, align 4
  %77 = fmul float %76, 4.000000e+00
  %78 = fmul float %74, %77
  %79 = fsub float %72, %78
  %80 = fcmp ult float %79, 0.000000e+00
  br i1 %80, label %L.LB1_320, label %L.LB1_344

L.LB1_344:                                        ; preds = %L.LB1_317
  %81 = tail call float @llvm.sqrt.f32(float %79)
  %82 = getelementptr float, float* %15, i64 %indvars.iv
  store float %81, float* %82, align 4
  %83 = fsub float -0.000000e+00, %81
  %84 = load float, float* %70, align 4
  %85 = fsub float %83, %84
  %86 = fmul float %85, 5.000000e-01
  %87 = load float, float* %75, align 4
  %88 = fdiv float %86, %87
  %89 = getelementptr float, float* %17, i64 %indvars.iv
  store float %88, float* %89, align 4
  %90 = load float, float* %82, align 4
  %91 = load float, float* %70, align 4
  %92 = fsub float %90, %91
  %93 = fmul float %92, 5.000000e-01
  %94 = load float, float* %75, align 4
  %95 = fdiv float %93, %94
  store float %95, float* %82, align 4
  br label %L.LB1_320

L.LB1_320:                                        ; preds = %L.LB1_344, %L.LB1_317
  %indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
  %96 = add nsw i32 %.dY0001_319.0, -1
  %97 = icmp sgt i32 %.dY0001_319.0, 1
  br i1 %97, label %L.LB1_317, label %L.LB1_318, !llvm.loop !16

L.LB1_318:                                        ; preds = %vector.body, %L.LB1_320, %L.entry
  ret void
}

Listing 11: Example LLVM IR code for a Fortran subroutine with a loop that could not be vectorized for SIMD units, yet could be vectorized for the SVE extension.

Acronyms and Abbreviations

• AArch64: the 64-bit execution state of the Armv8 architecture, including its exception model, memory model, programmers' model and instruction set

• ANSI – American National Standards Institute: the administrator and coordinator of the United States private-sector voluntary standardization system

• API – Application Programming Interface: a set of definitions, protocols and tools required for building software applications

• AST – Abstract Syntax Tree: a tree-like representation of the abstract syntactic structure of the source code

• ATLAS – Automatically Tuned Linear Algebra Software: an open source BLAS implementation

• Autotools – An open source infrastructure for managing and configuring the build process of software projects; also known as the GNU Build System

• BLAS – Basic Linear Algebra Subprograms: the standard API for basic matrix and vector operations

• BSD – Berkeley Software Distribution: originator of the BSD license

• CLA – Contributor License Agreement: an agreement which defines the terms under which intellectual property has been contributed to an open source project

• CMake – An open source infrastructure developed by Kitware, Inc. for managing and configuring the build process of software projects

• CORAL – Collaboration of Oak Ridge, Argonne and Lawrence Livermore laboratories

• CPU – Central Processing Unit

• F18 – An open source Fortran compiler, currently at an early stage of development, intended to replace the Flang compiler in the future

• Flang – An open source Fortran compiler built on the LLVM compiler infrastructure, derived from the PGI Fortran compiler

• FFT – Fast Fourier Transform: an efficient algorithm for computing Fourier transforms

• FFTW – the Fastest Fourier Transform in the West: an open source library for performing FFTs

• FLOPS – Floating-point Operations: a measure of how much floating-point work has been done

• FPCR – Floating-point Control Register: a CPU register which controls FPU behavior

• FPU – Floating–point Unit

• FTZ – Flush To Zero: an FPU mode in which denormalized numbers are replaced with zeros

• GCC – GNU Compiler Collection: a widely used open source compiler suite

• GFLOPS – Number of thousand million floating point operations performed

• GFortran – GNU Fortran: an open source Fortran compiler belonging to the GCC compiler suite

• glibc – GNU C library: an open source implementation of the standard C library

• GNU – GNU's Not Unix (recursive acronym): a free-software mass-collaboration project, originator of the GNU Compiler Collection and the GNU General Public License

• GOMP – GNU OpenMP: the OpenMP runtime library shipped with the GNU Compiler Collection

• GPU – Graphics Processing Unit

• HPC – High Performance Computing

• ILI – Intermediate Language Instructions: the second intermediate representation of a source program generated internally by the Flang compiler

• ILM – Intermediate Language Mnemonics: the first intermediate representation of a source program generated internally by the Flang compiler

• IR – Intermediate Representation: a data structure or code used by a compiler or virtual machine to represent source code

• K&R – Kernighan and Ritchie: the book [KR78], known to C programmers as K&R, which served for years as an informal specification of the C language

• LAPACK – Linear Algebra Package: the standard API for solving systems of linear equations and eigenvalue, eigenvector and singular value decomposition problems

• libpgmath – A part of Flang’s Fortran runtime library providing serial and vectorized implementations of mathematical functions

• LLNL – Lawrence Livermore National Laboratory

• LLVM – Low Level Virtual Machine: a collection of modular and reusable compiler and toolchain technologies

• LULESH – Livermore Unstructured Lagrangian Explicit Shock Hydrodynamics

• Makefile – A computer file containing a set of directives describing how to build a given software project from its source code

• MFLOPS – Millions of floating point operations performed

• MIT – Massachusetts Institute of Technology: originator of the MIT license

• MKL – Math Kernel Library: a library of optimized math routines for science, engineering, and financial applications

• MPI – Message Passing Interface: the parallel communications standard for distributed memory computing

• NAS – NASA Advanced Supercomputing: a division in NASA that specializes in enabling advances in high-end computing technologies

• NASA – National Aeronautics and Space Administration

• NIST – National Institute of Standards and Technology: U.S. federal technology agency that works with industry to develop and apply technology, measurements, and standards

• NNSA – National Nuclear Security Administration: the U.S. agency responsible for enhancing national security through the military application of nuclear science

• NPB – NAS Parallel Benchmark: a small set of programs designed to help evaluate the performance of parallel supercomputers

• nroff – "new roff": a part of the Unix documentation system used to format manual pages for print or display

• OBS – Open Build Service: a generic system to build and distribute binary packages from sources in an automatic, consistent and reproducible way

• OpenBLAS – An open source BLAS and LAPACK library with tuned kernels for many architectures and routines

• OpenMP – An Application Programming Interface for multi-platform shared memory multiprocessing programming in C, C++ and Fortran languages

• PGI – Portland Group, Inc.: a company that produced commercially available Fortran, C and C++ compilers for high-performance computing systems; acquired by NVIDIA in July 2013, it currently exists as a brand

• SIMD – Single Instruction Multiple Data: a class of parallelism with multiple processing elements that perform the same operation on multiple portions of data simultaneously

• SNAP – SN Application Proxy: a proxy application to model the performance of a modern discrete ordinates neutral particle transport application

• SoC – System on a Chip: an integrated circuit that integrates all components of a computer, including CPU, GPU and various peripherals

• SPACK – A package management tool designed to support multiple versions and configurations of software on a wide variety of platforms and environments

• SVE – Scalable Vector Extension: an HPC-focused SIMD instruction extension to the AArch64 architecture

• tcmalloc – Thread-Caching Malloc: an alternative implementation of the malloc()/free() functions intended to reduce lock contention in multi-threaded programs; distributed as part of the Google Performance Tools suite

• VLA – Vector Length Agnostic: a programming model that can adapt to the available vector length

References

[ABB+99] E. Anderson, Z. Bai, C. Bischof, S. Blackford, J. Demmel, J. Dongarra, J. Du Croz, A. Greenbaum, S. Hammarling, A. McKenney, and D. Sorensen. LA- PACK Users’ Guide. Society for Industrial and Applied Mathematics, Philadel- phia, PA, third edition, 1999.

[BAA+16a] Satish Balay, Shrirang Abhyankar, Mark F. Adams, Jed Brown, Peter Brune, Kris Buschelman, Lisandro Dalcin, Victor Eijkhout, William D. Gropp, Dinesh Kaushik, Matthew G. Knepley, Lois Curfman McInnes, Karl Rupp, Barry F. Smith, Stefano Zampini, Hong Zhang, and Hong Zhang. PETSc Web page. http://www.mcs.anl.gov/petsc, 2016.

[BAA+16b] Satish Balay, Shrirang Abhyankar, Mark F. Adams, Jed Brown, Peter Brune, Kris Buschelman, Lisandro Dalcin, Victor Eijkhout, William D. Gropp, Dinesh Kaushik, Matthew G. Knepley, Lois Curfman McInnes, Karl Rupp, Barry F. Smith, Stefano Zampini, Hong Zhang, and Hong Zhang. PETSc users manual. Technical Report ANL-95/11 - Revision 3.7, Argonne National Laboratory, 2016.

[BGMS97] Satish Balay, William D. Gropp, Lois Curfman McInnes, and Barry F. Smith. Efficient management of parallelism in object oriented numerical software libraries. In E. Arge, A. M. Bruaset, and H. P. Langtangen, editors, Modern Software Tools in Scientific Computing, pages 163–202. Birkhäuser Press, 1997.

[Bra03] Walt Brainerd. The importance of Fortran in the 21st century. Journal of Modern Applied Statistical Methods, 2, 2003.

[DCDH90a] J. J. Dongarra, J. Du Croz, I. S. Duff, and S. Hammarling. Algorithm 679: A set of level 3 Basic Linear Algebra Subprograms. ACM Trans. Math. Soft., 16:18–28, 1990.

[DCDH90b] J. J. Dongarra, J. Du Croz, I. S. Duff, and S. Hammarling. A set of level 3 Basic Linear Algebra Subprograms. ACM Trans. Math. Soft., 16:1–17, 1990.

[DCHH88a] J. J. Dongarra, J. Du Croz, S. Hammarling, and R. J. Hanson. Algorithm 656: An extended set of FORTRAN Basic Linear Algebra Subprograms. ACM Trans. Math. Soft., 14:18–32, 1988.

[DCHH88b] J. J. Dongarra, J. Du Croz, S. Hammarling, and R. J. Hanson. An extended set of FORTRAN Basic Linear Algebra Subprograms. ACM Trans. Math. Soft., 14:1–17, 1988.

[FJ05] Matteo Frigo and Steven G. Johnson. The design and implementation of FFTW3. In Proceedings of the IEEE: Special Issue on Program Generation, Optimization, and Platform Adaptation, pages 216–231, 2005.

[gli] glibc libm documentation. https://www.gnu.org/software/libc/manual/html_mono/libc.html#Mathematics. Accessed: 2016-08-23.

[ISO99a] ISO/IEC 9899:1999. Programming languages, their environments, and system software interfaces – Floating-point extensions for C – Part 2: Decimal floating-point arithmetic. Technical report, 1999.

[ISO99b] ISO/IEC TS 18661-1:2014. Programming languages, their environments, and system software interfaces – Floating-point extensions for C – Part 1: Binary floating-point arithmetic. Technical report, 1999.

[KR78] Brian Kernighan and Dennis Ritchie. The C Programming Language. Prentice Hall, 1978.

[LA04] Chris Lattner and Vikram Adve. LLVM: A compilation framework for lifelong program analysis & transformation. In Proceedings of 2004 International Symposium on Code Generation and Optimization, CGO '04, Mar 2004.

[LHKK79] C. L. Lawson, R. J. Hanson, D. Kincaid, and F. T. Krogh. Basic Linear Algebra Subprograms for FORTRAN usage. ACM Trans. Math. Soft., 5:308–323, 1979.

[Loh10] Eugene Loh. The ideal HPC programming language. Maybe it's Fortran. Or maybe it just doesn't matter. ACM Queue, 8, 2010.

[Pet16] Francesco Petrogalli. A sneak peek into SVE and VLA programming. Technical report, Arm Limited, Nov 2016.

[PMvdG+13] Jack Poulson, Bryan Marker, Robert A. van de Geijn, Jeff R. Hammond, and Nichols A. Romero. Elemental: A new framework for distributed memory dense matrix computations. ACM Trans. Math. Softw., 39(2):13:1–13:24, February 2013.

[WC00] Kevin Wadleigh and Isom Crawford. Software Optimization for High Performance Computing. Prentice Hall, New Jersey, 2000.
