arXiv:1807.01775v1 [cs.MS] 3 Jul 2018

UP JORS software Latex paper template version 0.1

Software paper for submission to the Journal of Open Research Software

The paper has three main sections: (1) Overview; (2) Availability; (3) Reuse potential.

(1) Overview

Title

FluidFFT: common API (C++ and Python) for Fast Fourier Transform HPC libraries

Paper Authors

1. MOHANAN Ashwin Vishnu (a)
2. BONAMY Cyrille (b)
3. AUGIER Pierre (b)

(a) Linné Flow Centre, Department of Mechanics, KTH, 10044 Stockholm, Sweden.
(b) Univ. Grenoble Alpes, CNRS, Grenoble INP [1], LEGI, 38000 Grenoble, France.

[1] Institute of Engineering Univ. Grenoble Alpes

Paper Author Roles and Affiliations

1. Ph.D. student, Linné Flow Centre, KTH Royal Institute of Technology, Sweden;
2. Research Engineer, LEGI, Université Grenoble Alpes, CNRS, France;
3. Researcher, LEGI, Université Grenoble Alpes, CNRS, France

Abstract

The Python package fluidfft provides a common Python API for performing Fast Fourier Transforms (FFT) sequentially, in parallel and on GPUs with different FFT libraries (FFTW, P3DFFT, PFFT, cuFFT). fluidfft is a comprehensive FFT framework which allows Python users to easily and efficiently perform FFT and the associated tasks, such as computing linear operators and energy spectra. We describe the architecture of the package, which is composed of C++ and Cython FFT classes, Python "operator" classes and Pythran functions. The package supplies utilities to easily test itself and to benchmark the different FFT solutions for a particular case and on a particular machine. We present a performance scaling analysis on three different computing clusters and a microbenchmark showing that fluidfft is an interesting solution for writing efficient Python applications using FFT.

Keywords

Free and open-source library; Python; Fast Fourier Transform; Distributed; MPI; GPU; High performance computing

Introduction

Fast Fourier Transform (FFT) is a class of algorithms used to calculate the discrete Fourier transform, which trace their origin back to the groundbreaking work by Cooley & Tukey (1965). Ever since then, FFT as a computational tool has been applied in multiple facets of science and technology, including digital signal processing, image compression, spectroscopy, numerical simulations and scientific computing in general. There are many good libraries to perform FFT, in particular the de-facto standard FFTW (Frigo & Johnson 2005). A challenge is to efficiently scale FFT on clusters with the memory distributed over a large number of cores using Message Passing Interface (MPI). This is imperative to solve big problems faster and when the arrays do not fit in the memory of a single computational node. A problem is that for a one-dimensional FFT, all the data have to be located in the memory of the process that performs the FFT, so a lot of communication between processes is needed for 2D and 3D FFT.

There are two strategies to distribute an array in the memory: the 1D (or slab) decomposition and the 2D (or pencil) decomposition. The 1D decomposition is more efficient when only a few processes are used, but suffers from an important limitation in terms of the number of MPI processes that can be used. Utilizing the 2D decomposition overcomes this limitation.

Some of the well-known libraries are written in C, C++ and Fortran. The classical FFTW library supports MPI using the 1D decomposition and hybrid parallelism using MPI and OpenMP. Other libraries now implement the 2D decomposition: PFFT (Pippig 2013), P3DFFT (Pekurovsky 2012), 2decomp&FFT and so on.
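The limitation of the slab decomposition mentioned above can be made concrete with a small sketch. The function names below are hypothetical and do not belong to fluidfft or to any of the libraries cited; the sketch only considers the initial data layout of a 3D array and assumes that each axis is split evenly, whereas real libraries impose further constraints (for instance on the transposed layout).

```python
def local_shape_slab(shape, nb_proc):
    """Local block shape for a 1D (slab) decomposition.

    Hypothetical helper: only the first axis is split, so at most
    n0 processes can receive data.
    """
    n0, n1, n2 = shape
    if n0 % nb_proc != 0:
        raise ValueError("n0 must be divisible by nb_proc in this simplified sketch")
    return (n0 // nb_proc, n1, n2)


def local_shape_pencil(shape, nb_proc0, nb_proc1):
    """Local block shape for a 2D (pencil) decomposition.

    Hypothetical helper: the first two axes are split over a
    nb_proc0 x nb_proc1 process grid, so far more processes can be used.
    """
    n0, n1, n2 = shape
    if n0 % nb_proc0 != 0 or n1 % nb_proc1 != 0:
        raise ValueError("n0 and n1 must be divisible by the process grid in this sketch")
    return (n0 // nb_proc0, n1 // nb_proc1, n2)


shape = (512, 512, 512)
# Slab: 512 processes already leave a single plane per process.
print(local_shape_slab(shape, 512))
# Pencil: 64 x 64 = 4096 processes still hold a block of points each.
print(local_shape_pencil(shape, 64, 64))
```

In this simplified picture, asking for more slab processes than there are planes along the first axis raises an error, while the pencil layout keeps working on a much larger process grid.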
These parallel FFT libraries rely on MPI for communication between processes, are optimized for supercomputers and scale well to hundreds of thousands of cores. However, since there is no common API, it is not simple to write applications that are able to use these libraries and to compare their performance. As a result, developers face a hard decision: they have to choose a library before the code is implemented.

Apart from CPU-based parallelism, General Purpose computing on Graphics Processing Units (GPGPU) is also gaining traction in scientific computing. Scalable libraries written for GPGPU frameworks such as OpenCL and CUDA have emerged, with their own FFT implementations, namely clFFT and cuFFT respectively.

Python can easily link these libraries through compiled extensions. For a Python developer, the following packages leverage this approach to perform FFT:

• sequential FFT, using:
  - numpy.fft and scipy.fftpack, which are essentially C and Fortran extensions for the FFTPACK library.
  - pyFFTW, which wraps the FFTW library and provides interfaces similar to the numpy.fft and scipy.fftpack implementations.
  - mkl_fft, which wraps Intel's MKL library and exposes Python interfaces to act as drop-in replacements for numpy.fft and scipy.fftpack.
• FFT with MPI, using:
  - mpiFFT4py and mpi4py-fft, built on top of pyFFTW and numpy.fft.
  - pfft-python, which provides extensions for the PFFT library.
• FFT with GPGPU, using:
  - Reikna, a pure Python package which depends on PyCUDA and PyOpenCL.
  - pytorch-fft, which provides C extensions for cuFFT and is meant to work with PyTorch, a tensor library similar to NumPy.

Although these Python packages are under active development, they suffer from certain drawbacks:

• No effort has been made so far to consolidate all possible FFT libraries (sequential, MPI-based and GPGPU-based) under a single package with similar syntax.
• They are quite complicated to use, even in the simplest scenarios. To understand how to use them, a novice user has to read at least the FFTW documentation.
• There are no benchmarks comparing the libraries with each other, or comparing the Python solutions with solutions based only on a compiled language (such as C, C++ or Fortran).
• They provide only the FFT and inverse FFT functions, with no associated mathematical operators.

The Python package fluidfft fills this gap by providing C++ classes and their Python wrapper classes for performing simple and common tasks with different FFT libraries. It has been written to make things easy while being as efficient as possible. It provides:

• tests,
• documentation and tutorials,
• benchmarks,
• operators for simple tasks (for example, computing the energy or the gradient of a field).

In the present article, we start by describing the implementation of fluidfft, including its design aspects and the code organization. Thereafter, we compare the performance of different classes in fluidfft on three computing clusters, and also describe, using microbenchmarks, how a Python function can be optimized to be as fast as a Fortran implementation. Finally, we show how we test and maintain the quality of the code base through continuous integration and mention some possible applications of fluidfft.

Implementation and architecture

The two major design goals of fluidfft are:

• to support multiple FFT libraries under the same umbrella and expose the interface for both C++ and Python code development.
• to keep the design of the interfaces as human-centric and easy to use as possible, without sacrificing performance.

Both the C++ and Python APIs provided by fluidfft currently support linking with the FFTW (with and without MPI and OpenMP support enabled), MKL, PFFT, P3DFFT and cuFFT libraries. The classes in fluidfft offer an API for performing double-precision [2] computation with real-to-complex FFT, complex-to-real inverse FFT, and additional helper functions.
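To make the terms real-to-complex FFT, complex-to-real inverse FFT and "operator" concrete, here is a minimal sketch written with NumPy only. It does not use the fluidfft API: the function names, the normalization convention (forward transform divided by the number of points) and the energy definition (half the mean of the squared field) are assumptions chosen for illustration.

```python
import numpy as np


def fft2d(field):
    """Real-to-complex 2D FFT, normalized by the number of points (assumed convention)."""
    n0, n1 = field.shape
    return np.fft.rfft2(field) / (n0 * n1)


def ifft2d(field_fft, shape):
    """Complex-to-real inverse 2D FFT matching the normalization of fft2d."""
    n0, n1 = shape
    return np.fft.irfft2(field_fft * (n0 * n1), s=shape)


def compute_energy_from_spectral(field_fft, n1):
    """Energy 0.5 * mean(field**2), computed from the half spectrum.

    Columns of the real-to-complex output stand for two conjugate modes,
    except the zero column and, for even n1, the Nyquist column.
    """
    weights = np.full(field_fft.shape[1], 2.0)
    weights[0] = 1.0
    if n1 % 2 == 0:
        weights[-1] = 1.0
    return float(np.sum(0.5 * np.abs(field_fft) ** 2 * weights))


rng = np.random.default_rng(0)
field = rng.standard_normal((8, 12))
field_fft = fft2d(field)

# The round trip recovers the physical field.
print(np.allclose(ifft2d(field_fft, field.shape), field))
# The energy agrees between physical and spectral space (Parseval's theorem).
print(np.isclose(compute_energy_from_spectral(field_fft, 12), 0.5 * np.mean(field**2)))
```

The spectral array has shape (8, 7) here because a real-to-complex transform stores only half of the last axis, which is why the energy sum needs the column weights.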
C++ API

The C++ API is implemented as a hierarchy of classes, as shown in Fig. 1. The naming convention used for the classes (<Type of FFT>With<Name of Library>) hints at how each class works internally. By utilizing inheritance, the classes share the same function names and syntax defined in the base classes, shown in white boxes in Fig. 1. Some examples of such functions are:

• alloc_array_X: Allocates an array to store a physical array with real datatype for the current process.
• alloc_array_K: Allocates an array to store a spectral array with complex datatype for the current process.
• init_array_X_random: Allocates and initializes a physical array with random values.
• test: Runs tests for a class by comparing mean and mean energy values in an array before and after a set of fft and ifft calls.
• bench: Benchmarks the fft and ifft methods for a certain number of iterations.

[2] Most C++ classes also support single-precision.

[Figure 1: Class hierarchy demonstrating object-oriented approach. The sequential classes are shown in red, the CUDA-based classes in magenta and the MPI-based classes in green. The arrows represent inheritance from parent to child class. Classes shown: BaseFFT, BaseFFT2D, BaseFFTMPI, BaseFFT3D, BaseFFT2DMPI, BaseFFT3DMPI, FFT2DWithFFTW1D, FFT2DWithFFTW2D, FFT2DWithCUFFT, FFT3DWithFFTW3D, FFT3DWithCUFFT, FFT2DMPIWithFFTW1D, FFT2DMPIWithFFTWMPI2D, FFT3DMPIWithFFTWMPI3D, FFT3DMPIWithFFTW1D, FFT3DMPIWithPFFT, FFT3DMPIWithP3DFFT.]

The remaining methods, which are specific to a library, are defined in the corresponding child classes, depicted in coloured boxes in Fig. 1, for example:

• are_parameters_bad: Verifies whether the global array shape can be decomposed with the number of MPI processes available or not. If the parameters are compatible, the method returns false. This method is called prior to initializing the class.
• fft and ifft: Forward and inverse FFT methods.

Let us illustrate with a trivial example, wherein we initialize
