Flexiblas - a ﬂexible BLAS Library with Runtime Exchangeable Backends

FlexiBLAS - A flexible BLAS library with runtime exchangeable backends Martin Köhler Jens Saak December 20, 2013 Abstract The BLAS library is one of the central libraries for the implementation of numerical algorithms. It serves as the basis for many other numerical libraries like LAPACK, PLASMA or MAGMA (to mention only the most obvious). Thus a fast BLAS implementation is the key ingredient for efficient applications in this area. However, for debugging or benchmarking purposes it is often necessary to replace the underlying BLAS implementation of an application, e.g. to disable threading or to include debugging symbols. In this paper we present a novel framework that allows one to exchange the BLAS implementation at run-time via an environment variable. Our concept neither requires relinkage, nor recompilation of the application. Numerical experiments show that there is no notable overhead introduced by this new approach. For only a very little overhead the framework naturally extends to a minimal profiling setup that allows one to count numbers of calls to the BLAS routines used and measure the time spent therein. Contents 1 Introduction2 2 Implementation Details5 3 Usage 11 4 Numerical Test 15 5 Conclusions 16 6 Acknowledgment 17 1 libblas.so Application liblapack.so libblas.so libumfpack.so Figure 1: Shared library dependencies of an example application. 1 Introduction The BLAS library [2,3,6] is one of the most important basic software libraries in sci- entific computing. The Netlib reference implementation defines the standard Fortran interface and its general behavior. Besides this, optimized variants for nearly every computer architecture exist. These implementations are mostly provided by the CPU or hardware vendors to get the maximum performance out of their specific devices. In addition to the vendor implementations powerful Open Source implementations like OpenBLAS [7] and the Automatically Tuned Linear Algebra Software (ATLAS) [8] exist. Besides the standard Fortran interface Netlib provides a C interface too. The so called CBLAS is normally only a wrapper to the Fortran interface which converts some data types and make the calling sequences more convenient for C programmers in the end. Although one is only interested in using the most efficient BLAS implementation on each hardware, it is often helpful to exchange the BLAS implementation during development. This is, for example, necessary in the debugging process to turn off multithreading or to include additional debug symbols. Benchmarks, comparing different BLAS versions to find the best one for the current hardware, are other reasons to switch the BLAS implementation. Our FlexiBLAS concept accelerates this process and resolves the issues often encountered. The three main categories of these issues, the authors are aware of, are explained below. Normally the exchange of a library is performed by relinking the program or even recompiling the entire software project. This can be a difficult and time consuming process. Unfortunately this procedure does not always solve all the problems regarding the change of the BLAS library in the whole application. The main problems are the following: Problem 1: Dynamically Linked Libraries and their dependencies. To explain this issue, we consider an example application which uses BLAS directly for some com- putations and relies on LAPACK and UMFPACK [1] for more complex operations. If LAPACK and UMFPACK are linked dynamically they are usually linked against BLAS, 2 Symbol Resultion Conflicts libopenblas.so Application liblapack.so libblas.so libumfpack.so Figure 2: Wrong symbol resolution after relinking the example application. as well, since this is necessary to resolve their internal symbols. We assume that the example application is compiled and linked dynamically using gcc Applicaton.c -o Applicaton -lblas -llapack -lumfpack with LAPACK and UMFPACK as shared libraries too. The dependencies between the application and the libraries are shown in Figure1. We observe that the BLAS library is referenced several times. First it is directly connected to the application and second it is an indirect dependency through LAPACK and UMFPACK. If one now wants to use an alternative BLAS implementation, e.g. to check if the application performs well with threading, one might link the above program against OpenBLAS via: gcc Applicaton.c -o Applicaton -lopenblas -llapack -lumfpack If we now take a look at the dependencies of the example application we see in Figure2 that we integrated two different BLAS libraries. The program now tries to resolve all symbols by looking in both libraries and we can not influence which one is used. Pro- gram crashes, segmentations faults or other unwanted effects are possible and indeed often observed in this situation. Problem 2: Incompatibilities Even though the Netlib BLAS implementation defines the standard interface and the behavior, not all available BLAS variants are 100% compatible with this interface. For example the auxiliary functions SCABS1/DZABS1 are only used as tools by some BLAS routines, i.e., they are not part of the official BLAS subroutines [6], still they are provided by the Netlib reference implementation and might be called by users unaware of this fact. In order to end up with an 100% Netlib compatible interface we have to provide these functions even if they are not available in some BLAS implementations like ATLAS or Apple Accelerate. Other examples for incompatibilities are differing interfaces for functions returning complex values inside the Intel® MKL [5]. This affects ZDOTC, ZDOTU, CDOTC and CDOTU. However, there is a GNU compatible interface of the MKL in which these functions behave like the 3 Backends liblapack.so Netlib Application libumfpack.so OpenBLAS libflexiblas.so Intel MKL Environment Variables ... Profiling Figure 3: Application with integrated FlexiBLAS. ones from the reference. If we use the icc/ifort compatible interface the signature of these functions changes, though. Intel® introduced an additional parameter to return the result and not as a normal return value. In turn it changes the Fortran function to a Fortran subroutine. If a program uses at least one of those functions when recompiling against different BLAS version, the source code needs to be modified to match the varying BLAS interfaces. Problem 3: Easy Profiling In many cases it is desirable to count how often a certain BLAS function is called in a given context and how much time is spent inside this function. This type of profiling is normally enabled by special compiler options. If the BLAS library was not compiled with these options and we can not recompile it ourselves to enable them, profiling can be complicated or impossible. This is often the case when we use a vendor provided BLAS implementation like Intel® MKL or AMD ACML. Besides the recompilation, additional tools like gprof are necessary. Also enabling the profiling flags can considerably slow down the entire application because the compiler integrates additional code and information. With FlexiBLAS we try to overcome all three problems by providing a wrapper library which acts as an interface between the application and the real BLAS implementation. The API of the wrapper looks exactly like the standard Netlib Fortran reference implementation and behaves like that independent of the actual implementation used. Additionally FlexiBLAS can also provide the CBLAS reference interface defined by Netlib. The underlying BLAS, the so called backend, is selected via config- uration files or an environment variable. Figure3 shows how the application and its dependencies connect to FlexiBLAS. Inside our wrapper a simple profiling system to analyze the function calls can be integrated very naturally and at very limited cost. The remainder of our article is structured as follows. In the following section we describe the basic techniques behind the FlexiBLAS implementation and its usage. The computational results in Section4 will show that the additional overhead caused by 4 FlexiBLAS is negligible. 2 Implementation Details The main idea behind our wrapper library is to use the dlopen-mechanism specified in POSIX.1-2001 [4]. The mechanism allows one to open shared libraries or other elf-binaries and search for functions or symbols inside of them. This technique is often used for plugin infrastructures where parts of a program are loaded at runtime instead of linking the entire program against them. This mechanism gives us great flexibility to enhance the features of a program or exchange parts of the program without recompiling it. This corresponds to our requirements from Section1. The dlopen-mechanism consists of four functions that we will use in our implementation: dlopen The dlopen function is an interface to the dynamic loader. It loads a shared object or an arbitrary elf-binary in the address space of the current program and returns a handle to deal with the loaded object. The shared object can be addressed either by a relative or an absolute path. If it is an absolute path it is opened directly, otherwise it searches in the default linker paths (as they are defined in ld.so.conf or LD LIBRARY PATH. Additionally the behavior of the symbol resolution can be influenced by a flag. Basically there are two options. The first one, RTLD NOW, resolves all symbols when a shared object is opened. The second one, RTLD LAZY allows one to resolve only the necessary symbols when a function out of the shared object is called. Other flags decide if the symbols of the library are integrated into the global address space or not. dlsym The dlsym function is the look up function of the dlopen mechanism. It searches a given shared object handle for a desired symbol. In this way one can extract pointers to functions or variables inside the previously loaded shared library. If a symbol is not found inside the specified handle it returns a NULL pointer. Using two predefined library handles RTLD NEXT and RTLD DEFAULT one can search inside the current address space for a symbol too.

Flexiblas - a ﬂexible BLAS Library with Runtime Exchangeable Backends

Recursive Approach in Sparse Matrix LU Factorization

Abstract 1 Introduction

Overview of Iterative Linear System Solver Packages

Accelerating the LOBPCG Method on Gpus Using a Blocked Sparse Matrix Vector Product

ARM HPC Ecosystem

Over-Scalapack.Pdf

0 BLIS: a Modern Alternative to the BLAS

Prospectus for the Next LAPACK and Scalapack Libraries

Jack Dongarra: Supercomputing Expert and Mathematical Software Specialist

Distibuted Dense Numerical Linear Algebra Algorithms on Massively Parallel Architectures: DPLASMA

LAPACK WORKING NOTE 195: SCALAPACK's MRRR ALGORITHM 1. Introduction. Since 2005, the National Science Foundation Has Been Fund

18 Restructuring the Tridiagonal and Bidiagonal QR Algorithms for Performance