High-Performance Computing and Freebsd

SEE TEXT ONLY & FreeBSD By Johannes M. Dieterich igh-performance computing encompasses a the source of important basic technologies widely variety of fields and domains, from science used in more general computing and a cauldron of H and engineering to finance and social stud- innovative technologies, e.g., basic numerical ies. The common denominator is the practice of libraries and more recently deep-learning applica- aggregating computing power in a way that deliv- tions. Second, a more in-depth view would reveal a ers much higher performance than one could get much more complex ecosystem also containing communication and the aforementioned numerical from a typical desktop computer or workstation to libraries, toolchains, and programming languages, solve large problems [1]. More concrete examples as well as auxiliary hardware solutions like network include, in order of descending compute share on switches and storage. Lastly, there are plenty of civil installations: materials design, drug discovery, systems, such as developer workstations, that are climate modeling, materials design, computational more diverse, have different capability require- fluid dynamics, economic simulations, and big data ments, and are more accessible to OS alternatives; analysis. e.g., Ubuntu Linux is a prime choice for developer Arguably, we are living in a consolidated HPC workstations, not so much for cluster installations. world in terms of operating system and microarchi- tectures employed, and on a cursory glance, con- 38.93% materials science temporary HPC installations look very much alike. 15.49% computational A typical installation will feature a Beowulf cluster fluid dynamics build from thousands of nodes connected by a fast 11.79% climate & ocean interconnect. As of this writing, the TOP500 list of the fastest supercomputers in the world contains 4.56% biochemistry 91.5% amd64 processors and 99.6% Linux operat- 1.44% molecular chemistry ing systems (OSs) [3]. 0.77% combustion So where is the space for FreeBSD in this mono- culture and why should the FreeBSD community Fig.1 FLOP usage by domain in HPC on a care about HPC? First, HPC continues to be both typical supercomputer. Ref.[2] 30 FreeBSD Journal To some extent, one may even argue that cloud all possible architectures (among them amd64) use computing is repackaged or broken-down HPC for the LLVM-based clang compiler as the base com- customers not traditionally having or being able to piler for licensing reasons and, hence, by extension, afford a stake in or access to large installations. as the default ports compiler. Instead of sorting out Even though FreeBSD already excels as an appliance mixed-compiler environments, it is then quite often solution, other capabilities (hypervisor choice, stor- easier to simply rely fully on the GNU compilers for age, security, network, etc.) matter more than the mixed-language programs. This is a mildly unsatisfy- compute OS. I will focus here on capabilities for ing situation. HPC, either the compute or developer OS, how Is there a more fundamental improvement possi- they map onto more general computing needs, and ble? Maybe. flang is an open, LLVM-based, new try to highlight how FreeBSD would benefit from Fortran compiler. [4] Its frontend is derived from the • some HPC-targeted improvements. well-established Portland Group’s compiler and was As FreeBSD users and developers, we are aware open-sourced and continues to be maintained by of and value the inherent advantages of FreeBSD: a NVIDIA. It since has found support and use by AMD consistent experience of applications and libraries in in its AOCC package as well. flang supports only the ports system, the easy build and deployment of up to Fortran 2003 language features and is limited custom packages potentially with architecture-spe- to 64-bit architectures. FreeBSD now contains cific optimizations, and, in general, a very stable devel/flang as a preview. Certainly, some time operating system experience. Most importantly, the and effort are required to make this port competi- harmonized ports experience allows us to propa- tive with gfortran and vet it properly for HPC gate changes across all parts of the OS interacting uses. Just very recently, plans from NVIDIA became with the user or developer consistently. public to rewrite large parts of flang to improve its feature set and likelihood of being accepted Languages under the official LLVM umbrella. Uptake within the HPC community will be interesting to observe, but What are the tools of HPC—the applications used the added competition should prove beneficial to by researchers to transform FLOPS into knowl- the GNU compiler either way. edge—made from? If we remember that HPC was Let us think beyond the single core. As noted one of the original purposes of computers, the above, a typical cluster contains tens of thousands answer that the majority of computing time is used of cores, and jobs are typically required to use tens by Fortran codes is less of a surprise. More than to hundreds of them in parallel. Two major 65% of compute cycles are spent on Fortran codes approaches exist for scaling out in HPC: OpenMP on a typical installation. This share increases if one [5] and the Message Passing Interface (MPI) [6]. considers the numerical Fortran libraries used in OpenMP mostly targets shared-memory mixed-language codes. Unfortunately, this shows machines with later standards having support for that Fortran is alive and unlikely to go anywhere in accelerator offloading and is discussed below. It is a the near-term to mid-term due to the size and com- relatively simple, pragma-based approach, supports plexity of existing code bases. Hence, HPC support C/C++ and Fortran, and is commonly used to paral- absolutely requires Fortran support. Due to the lelize over loops in a data-parallel fashion. effective absence of the typically used commercial Currently, our ports tree will again default to the compilers on FreeBSD, we fall back to the open- lang/gcc port if a USES= compiler:openmp source alternatives. is encountered. For this reason, ports typically will Currently, if a port specifies USES=fortran all not enable OpenMP by default (including some com- architectures will use the gfortran compiler from monly used ones like graphics/ImageMagick) and lang/gcc. gfortran is a good Fortran compil- hence will be limited to a single core. er that produces stable, fast binaries and supports A better alternative exists for at least amd64 and all relevant Fortran standards. Currently, using i386; the libomp library associated with the gfortran requires explicit specification of the LLVM project since the initial open-source release by rpath to the linker to search for GNU libraries for Intel. It has not been imported into base yet, but a both that port and, if it is a library, all dependent review for an older library version exists. In its hope- ports (or out-of-ports tree applications, for that fully temporary absence, the ports system should matter). While trivial, it significantly increases the use one of the devel/llvm ports that do include maintenance burden of porters and developers. libomp. Multiple integration tests have been done Remedies are being discussed currently, but no and the feature should soon be ready to land. patches have landed in the tree yet. Also recall that Hopefully this will also incentivize fellow developers July/August 2018 31 to add the necessary FreeBSD bits for other archi- provide vendor-tuned, assembly-optimized BLAS tectures supported upstream. My tests indicate that and LAPACK libraries. For FreeBSD, we have multi- libomp integration in LLVM is not ideal from a ple options exposed through the blaslapack performance perspective; in particular, LLVM’s vec- selector. By default, this selector will use torizer does not work (well) if OpenMP’s simd math/blas and math/lapack. These are pragma is used in conjunction with standard Fortran-source, reference implementations and, as OpenMP thread parallelization (e.g., parallel such, not competitive with optimized alternatives for). But certainly, suboptimal parallelization is such as math/blis and math/libflame. Both preferable to no parallelization. the reference implementations and math/open- MPI on the other hand operates through func- blas have a Fortran dependency unlike the FLAME tion calls into the MPI communication library. project’s BLIS and libflame [7]. It features both simple send/receive/broadcast As we can see from Figure 3, both the features as well as more advanced operations OpenBLAS or BLIS implementations show a com- like reductions. MPI is used both for process- manding lead in performance, starting from small based intra-node, as well as inter-node, paral- to medium-sized matrix-matrix multiplications if lelization, and on the hardware level, typically their CPU architecture optimized kernels are used. uses a fast interconnect such as InfiniBand. More importantly, even if BLIS’s source-only refer- Typically, MPI-enabled software packages are ence implementation is used, the implementation compiled using mpicc/mpif90 wrapper scripts and blocking scheme results in an up to 80% per- which configure the underlying compilers to find formance advantage. Additionally, BLIS provides MPI headers and link against the MPI library. the ability to use pthread parallelization, which is Within the ports tree, multiple MPI choices exist of less impact in this test case. The FLAME project and we are only limited by the underlying com- has expressed interest in working with us, accept- piler toolchain for Fortran. ing pull requests of FreeBSD changes and imple- menting features such as runtime kernel selection Numerical Libraries which we need for generic packages. Hence, BLIS is a good candidate to be our default BLAS imple- A large part of the high performance in HPC does mentation and rids us of a low-level Fortran not result from application code, but instead, in dependency.

High-Performance Computing and Freebsd

ARM HPC Ecosystem

0 BLIS: a Modern Alternative to the BLAS

18 Restructuring the Tridiagonal and Bidiagonal QR Algorithms for Performance

0 BLIS: a Framework for Rapidly Instantiating BLAS Functionality

DD2358 – Introduction to HPC Linear Algebra Libraries & BLAS

The BLAS API of BLASFEO: Optimizing Performance for Small Matrices

Accelerating the “Motifs” in Machine Learning on Modern Processors

Introduchon to Arm for Network Stack Developers

Algorithms and Optimization Techniques for High-Performance Matrix-Matrix Multiplications of Very Small Matrices

BLAS-On-Flash: an Alternative for Training Large ML Models?

A Systematic Approach for Obtaining Performance on Matrix-Like Operations Submitted in Partial Fulfillment of the Requirements

Flexiblas - a ﬂexible BLAS Library with Runtime Exchangeable Backends