
SOFTWARE METAPAPER

FluidFFT: Common API (C++ and Python) for Fast Fourier Transform HPC Libraries

Ashwin Vishnu Mohanan1, Cyrille Bonamy2 and Pierre Augier2

1 Linné Flow Centre, Department of Mechanics, KTH, Stockholm, SE
2 University of Grenoble Alpes, CNRS, Grenoble INP, LEGI, Grenoble, FR

Corresponding author: Pierre Augier ([email protected])

The Python package fluidfft provides a common Python API for performing Fast Fourier Transforms (FFT) in sequential, in parallel and on GPU with different FFT libraries (FFTW, P3DFFT, PFFT, cuFFT). fluidfft is a comprehensive FFT framework which allows Python users to easily and efficiently perform FFT and the associated tasks, such as computing linear operators and energy spectra. We describe the architecture of the package, composed of C++ and Cython FFT classes, Python "operator" classes and Pythran functions. The package supplies utilities to easily test itself and benchmark the different FFT solutions for a particular case and on a particular machine. We present a performance scaling analysis on three different computing clusters and a microbenchmark showing that fluidfft is an interesting solution to write efficient Python applications using FFT.

Keywords: fluid dynamics; Python; FFT; simulations

Funding statement: This project has indirectly benefited from funding from the foundation Simone et Cino Del Duca de l'Institut de France, the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation program (grant agreement No 647018-WATU and Euhit consortium) and the Swedish Research Council (Vetenskapsrådet): 2013–5191. We have also been able to use supercomputers of CIMENT/GRICAD, CINES/GENCI (grant 2018-A0040107567) and the Swedish National Infrastructure for Computing (SNIC).

(1) Overview

Introduction

Fast Fourier Transform (FFT) is a class of algorithms used to calculate the discrete Fourier transform, which traces back its origin to the groundbreaking work by [3]. Ever since then, FFT as a computational tool has been applied in multiple facets of science and technology, including digital signal processing, image compression, spectroscopy, numerical simulations and scientific computing in general. There are many good libraries to perform FFT, in particular the de-facto standard FFTW [4]. A challenge is to efficiently scale FFT on clusters with the memory distributed over a large number of cores using the Message Passing Interface (MPI). This is imperative to solve big problems faster and when the arrays do not fit in the memory of a single computational node. A problem is that for one-dimensional FFT, all the data have to be located in the memory of the process that performs the FFT, so a lot of communications between processes are needed for 2D and 3D FFT.

To elaborate, there is only one way to apply domain decomposition for 2D FFT, which is to split the arrays into narrow strips across one dimension. However, for 3D FFT, there are two strategies to distribute an array in the memory: the 1D (or slab) decomposition and the 2D (or pencil) decomposition. The 1D decomposition is more efficient when only a few processes are used, but suffers from an important limitation in terms of the number of MPI processes that can be used. Utilizing 2D decomposition overcomes this limitation.
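To make the distinction concrete, the following minimal sketch (not fluidfft code; the array size and process counts are arbitrary) computes the local array shape held by each MPI process under the two decompositions.

# Illustration only: local array shapes under slab and pencil decompositions
# for a global 3D array of shape (N0, N1, N2).
N0 = N1 = N2 = 128

# 1D ("slab") decomposition: split along the first axis only.
# The number of processes cannot exceed N0.
nb_proc_slab = 16
shape_loc_slab = (N0 // nb_proc_slab, N1, N2)

# 2D ("pencil") decomposition: split along the first two axes,
# so up to N0 * N1 processes can be used.
p0, p1 = 8, 8  # process grid, nb_proc = p0 * p1 = 64
shape_loc_pencil = (N0 // p0, N1 // p1, N2)

print(shape_loc_slab, shape_loc_pencil)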

Some of the well-known libraries are written in C, C++ and Fortran. The classical FFTW library supports MPI using 1D decomposition and hybrid parallelism using MPI and OpenMP. Other libraries now implement the 2D decomposition for FFT over 3D arrays: PFFT [8], P3DFFT [7], 2decomp&FFT and so on. These libraries rely on MPI for the communications between processes, are optimized for supercomputers and scale well to hundreds of thousands of cores. However, since there is no common API, it is not simple to write applications that are able to use these libraries and to compare their performances. As a result, developers are met with a hard decision, which is to choose a library before the code is implemented.

Apart from CPU-based parallelism, General Purpose computing on Graphical Processing Units (GPGPU) is also gaining traction in scientific computing. Scalable libraries written for GPGPU such as OpenCL and CUDA have emerged, with their own FFT implementations, namely clFFT and cuFFT respectively.

Python can easily link these libraries through compiled extensions. For a Python developer, the following packages leverage this approach to perform FFT:

• sequential FFT, using:
  – numpy.fft and scipy.fftpack, which are essentially C and Fortran extensions for the FFTPACK library.
  – pyFFTW, which wraps the FFTW library and provides interfaces similar to the numpy.fft and scipy.fftpack implementations.
  – mkl_fft, which wraps Intel's MKL library and exposes Python interfaces to act as drop-in replacements for numpy.fft and scipy.fftpack.
• FFT with MPI, using:
  – mpiFFT4py, built on top of pyFFTW and numpy.fft, and its successor mpi4py-fft.
  – pfft-python, which provides extensions for the PFFT library.
• FFT with GPGPU, using:
  – Reikna, a pure Python package which depends on PyCUDA and PyOpenCL.
  – pytorch-fft, which provides C extensions for cuFFT, meant to work with PyTorch, a tensor library similar to NumPy.

Although these Python packages are under active development, they suffer from certain drawbacks:

• No effort so far to consolidate sequential, MPI and GPGPU based FFT libraries under a single package with similar syntax.
• Quite complicated even for the simplest use case scenarios. To understand how to use them, a novice user has to, at least, read the FFTW documentation.
• No benchmarks between libraries and between the Python solutions and solutions based only on a compiled language (such as C, C++ or Fortran).
• Provide just the FFT and inverse FFT functions, no associated mathematical operators.

The Python package fluidfft fills this gap by providing C++ classes and their Python wrapper classes for performing simple and common tasks with different FFT libraries. It has been written to make things easy while being as efficient as possible. It provides:

• tests,
• documentation and tutorials,
• benchmarks,
• operators for simple tasks (for example, compute the energy or the gradient of a field).

In the present article, we shall start by describing the implementation of fluidfft, including its design aspects and the code organization. Thereafter, we shall compare the performance of different classes in fluidfft on three computing clusters, and also describe, using microbenchmarks, how a Python function can be optimized to be as fast as a Fortran implementation. Finally, we show how we test and maintain the quality of the code base through continuous integration and mention some possible applications of fluidfft.

Implementation and architecture

The two major design goals of fluidfft are:

• to support multiple FFT libraries under the same umbrella and expose the interface for both C++ and Python code development.
• to keep the design of the interfaces as human-centric and easy to use as possible, without sacrificing performance.

Both the C++ and Python APIs provided by fluidfft currently support linking with FFTW (with and without MPI and OpenMP support enabled), MKL, PFFT, P3DFFT and cuFFT libraries. The classes in fluidfft offer an API for performing double-precision1 computation with real-to-complex FFT, complex-to-real inverse FFT, and additional helper functions.

C++ API

The C++ API is implemented as a hierarchy of classes as shown in Figure 1. The naming convention used for the classes (for example, FFT3DMPIWithFFTWMPI3D) is a cue for how these are functioning internally. By utilizing inheritance, the classes share the same function names and syntax defined in the base classes, shown in white boxes in Figure 1. Some examples of such functions are:

• alloc_array_X: Allocates an array to store a physical array with real datatype for the current process.
• alloc_array_K: Allocates an array to store a spectral array with complex datatype for the current process.
• init_array_X_random: Allocates and initializes a physical array with random values.
• test: Runs tests for a class by comparing mean and mean energy values in an array before and after a set of fft and ifft calls.
• bench: Benchmarks the fft and ifft methods for a certain number of iterations.

Remaining methods which are specific to a library are defined in the corresponding child classes, depicted in coloured boxes in Figure 1, for example:

• are_parameters_bad: Verifies whether the global array shape can be decomposed with the number of MPI processes available or not. If the parameters are compatible, the method returns false. This method is called prior to initializing the class.
• fft and ifft: Forward and inverse FFT methods.

Let us illustrate with a trivial example, wherein we initialize the FFT with a random physical array, and perform a set of fft and ifft operations.

Figure 1: Class hierarchy demonstrating object-oriented approach. The sequential classes are shown in red, the CUDA- based classes in magenta and the MPI-based classes in green. The arrows represent inheritance from parent to child class.

#include <iostream>
using namespace std;

#include <fft3dmpi_with_fftwmpi3d.h>
// #include <fft3dmpi_with_p3dfft.h>

int main(int argc, char **argv) {
    int N0 = 32, N1 = 32, N2 = 32;
    // MPI-related
    int nb_procs = 4;
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &(nb_procs));

    myreal* array_X;
    mycomplex* array_K;

    FFT3DMPIWithFFTWMPI3D o(N0, N1, N2);
    // FFT3DMPIWithP3DFFT o(N0, N1, N2);

    o.init_array_X_random(array_X);
    o.alloc_array_K(array_K);
    o.fft(array_X, array_K);
    o.ifft(array_K, array_X);
    MPI_Finalize();
    return 0;
}

As suggested through the comments above, in order to switch the FFT library, the user only needs to change the header file and the class name. An added advantage is that the user does not need to bother about the domain decomposition while declaring and allocating the arrays. A few more helper functions are available with the FFT classes, such as functions to compute the mean value and energies in the array. These are demonstrated with examples in the documentation.2 Detailed information regarding the C++ classes and their member functions is also included in the online documentation.3

Python API

Similar to other packages in the FluidDyn project, fluidfft also uses an object-oriented approach, providing FFT classes. This is in contrast with the approach adopted by numpy.fft and scipy.fftpack, which provide functions instead, with which the user has to figure out the procedure to design the input values and to use the return values, from the documentation. In fluidfft, the Python API wraps all the functionalities of its C++ counterpart and offers a richer experience through an accompanying operator class.

As a short example, let us try to calculate the gradient of a plane sine-wave using spectral methods, mathematically described as follows:

u(x, y) = sin(x + y)    ∀ x, y ∈ [0, L]

û(k_x, k_y) = (1/L²) ∫₀^L ∫₀^L u(x, y) exp(−i k_x x − i k_y y) dx dy

∇u(x, y) = Σ_{k_x} Σ_{k_y} i k û(k_x, k_y) exp(i k_x x + i k_y y)

where k_x, k_y represent the wavenumbers corresponding to the x and y directions, and k is the wavenumber vector.

The equivalent pseudo-spectral implementation in fluidfft is as follows:

from fluidfft.fft2d.operators import OperatorsPseudoSpectral2D, pi
from numpy import sin

nx = ny = 100
lx = ly = 2 * pi

oper = OperatorsPseudoSpectral2D(nx, ny, lx, ly, fft="fft2d.with_fftw2d")

u = sin(oper.XX + oper.YY)
u_fft = oper.fft(u)
px_u_fft, py_u_fft = oper.gradfft_from_fft(u_fft)
px_u = oper.ifft(px_u_fft)
py_u = oper.ifft(py_u_fft)
grad_u = (px_u, py_u)
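Continuing the example above, the spectral derivative can also be written out by hand with the operator's wavenumber arrays (oper.KX and oper.KY, mentioned below). This is only an illustrative sketch of what the gradfft_from_fft call amounts to in terms of the formula above, not fluidfft's actual implementation.

# Illustration only: the gradient in spectral space is a multiplication by i*k,
# so the library call above is expected to be equivalent to:
px_u_fft_manual = 1j * oper.KX * u_fft
py_u_fft_manual = 1j * oper.KY * u_fft
# px_u_fft_manual and py_u_fft_manual should match px_u_fft and py_u_fft.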

A parallelized version of the code above will work out of the box, simply by replacing the FFT class with an MPI-based FFT class, for instance fft2d.with_fftwmpi2d. One can also let fluidfft automatically choose an appropriate FFT class by instantiating the operator class with fft=None or fft="default". Even if one finds the methods in the operator class to be lacking, one can inherit the class and easily create a new method, for instance using the wavenumber arrays, oper.KX and oper.KY. Arguably, a similar implementation with other available packages would require the know-how on how FFT arrays are allocated in the memory, normalized, decomposed in parallel and so on. Moreover, the FFT and the operator classes contain objects describing the shapes of the real and complex arrays and how the data is shared between processes. A more detailed introduction on how to use fluidfft and the available functions can be found in the tutorials.4

Thus, we have demonstrated how, by using fluidfft, a developer can easily switch between FFT libraries. Let us now turn our attention to how the code is organized. We shall also describe how the source code is built and linked with the supported libraries.

Code organization

The internal architecture of fluidfft can be visualized as layers. Through Figure 2, we can see how these layers are linked together, forming the API for C++ and Python development. For simplicity, only one FFT class is depicted in the figure, namely FFT2DMPIWithFFTWMPI2D, which wraps FFTW's parallelized 2D FFT implementation. The C++ API is accessed by importing the appropriate header file and building the user code with a Makefile, an example of which is available in the documentation.

The Python API is built automatically when fluidfft is installed.5 It first generates the Cython source code as a pair of .pyx and .pxd files containing a class wrapping its C++ counterpart.6 The Cython files are produced from template files (specialized for the 2D and 3D cases) using the template library mako. Thereafter, Cython [2] generates C++ code with the necessary Python bindings, which are then built in the form of extensions or dynamic libraries importable in Python code. All the built extensions are then installed as a Python package named fluidfft.

A helper function fluidfft.import_fft_class is provided with the package to simply import the FFT class. However, it is more convenient and recommended to use an operator class, as described in the example for the Python API (see the sketch below). Although the operator classes can function as pure Python code, some of their critical methods can be compiled, if Pythran [5] is available during installation of fluidfft. We will show towards the end of this section that by using Pythran, we reach the performance of the equivalent Fortran code.
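As a brief, hedged sketch of the two entry points mentioned above: the helper function and the class string come from the text, while the constructor arguments of the returned FFT class are an assumption (the global array shape, by analogy with the C++ example).

from fluidfft import import_fft_class
from fluidfft.fft2d.operators import OperatorsPseudoSpectral2D, pi

# Low-level entry point: import and instantiate an FFT class directly
# (constructor arguments assumed to be the global shape, as in the C++ example).
FFTClass = import_fft_class("fft2d.with_fftwmpi2d")
o = FFTClass(128, 128)

# Recommended entry point: an operator class; fft=None (or "default") lets
# fluidfft pick an available FFT implementation.
oper = OperatorsPseudoSpectral2D(128, 128, 2 * pi, 2 * pi, fft=None)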

Figure 2: Flowchart illustrating how the C++ and Python APIs are built and used for one particular class, viz. FFT2DMPIWithFFTWMPI2D. The dotted arrows in the C++ part stand for include statements, demonstrating the class hierarchy, and in the Python part indicate how different codes are imported. At the bottom, a smaller flowchart demonstrates how to use the API by writing user code.

To summarize, fluidfft consists of the following layers:

• One C++ class per FFT library, derived from a hierarchy of C++ classes as shown in Figure 1.
• Cython wrappers of the C++ classes with their unit test cases.
• Python operator classes (2D and 3D) to write code independently of the library used for the computation of the FFT and with some mathematical helper methods. These classes are accompanied by unit test cases.
• Pythran functions to speed up critical methods in the Python operator classes.

Command-line utilities (fluidfft-bench and fluidfft-bench-analysis) are also provided with the fluidfft installation to run benchmarks and plot the results. In the next subsection, we shall look at some results by making use of these utilities on three computing clusters.

Performance

Scalability tests using fluidfft-bench

Scalability of fluidfft is measured in the form of strong scaling speedup, defined in the present context as:

S_α(n_p) = [Time elapsed for N iterations with n_p,min processes]_fastest × n_p,min / [Time elapsed for N iterations with n_p processes]_α

where n_p,min is the minimum number of processes employed for a specific array size and hardware, the subscript α denotes the FFT class used, and "fastest" corresponds to the fastest result among the various FFT classes.

To compute strong scaling, the utility fluidfft-bench is launched as scheduled jobs on HPC clusters, ensuring no interference from background processes. No hyperthreading was used. We have used N = 20 iterations for each run, with which we obtain sufficiently repeatable results.
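As a toy numerical reading of this definition (the timings below are invented, purely for illustration): if the fastest class needs 8.0 s for N iterations with n_p,min = 128 processes, and class α needs 2.5 s with n_p = 512 processes, the speedup of class α is

# Made-up numbers illustrating the speedup formula above.
t_fastest_npmin = 8.0   # seconds, fastest class with n_p,min processes
n_p_min = 128
t_alpha_np = 2.5        # seconds, class alpha with n_p = 512 processes

speedup = t_fastest_npmin * n_p_min / t_alpha_np
print(speedup)  # 409.6, to be compared with the ideal value n_p = 512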
For a particular choice of array size, every FFT class available is benchmarked for the two tasks, forward and inverse FFT. Three different function variants are compared (see the legend in subsequent figures):

• fft_cpp, ifft_cpp (continuous lines): benchmark of the C++ function from the C++ code. An array is passed as an argument to store the result. No memory allocation is performed inside these functions.
• fft_as_arg, ifft_as_arg (dashed lines): benchmark of a Python method from Python. Similar to the C++ code, the second argument of this method is an array to contain the result of the transform, so no memory allocation is needed.
• fft_return, ifft_return (dotted lines): benchmark of a Python method from Python. No array is provided to the function to contain the result, and therefore a numpy array is created and then returned by the function.

On big HPC clusters, we have only focussed on 3D array transforms as benchmark problems, since these are notoriously expensive to compute and require massive parallelization. The physical arrays used in all four 3D MPI-based FFT classes are identical in structure. However, there are subtle differences in terms of how the domain decomposition and the allocation of the transformed array in the memory are handled.7

Hereafter, for the sake of brevity, the FFT classes will be named in terms of the associated library (for example, the class FFT3DMPIWithFFTW1D is named fftw1d).8 Let us go through the results plotted using fluidfft-bench-analysis.

Benchmarks on Occigen

Occigen is a GENCI-CINES HPC cluster which uses Intel Xeon CPU E5-2690 v3 (2.6 GHz) processors with 24 cores per node. The installation was performed using the Intel C++ 17.2 compiler, Python 3.6.5, and Open-MPI 2.0.2.

Figure 3 demonstrates the strong scaling performance of a cuboid array sized 384 × 1152 × 1152. This case is particularly interesting since for FFT classes implementing 1D domain decomposition (fftw1d and fftwmpi3d), the processes are spread on the first index for the physical input array. This restriction is a result of some FFTW library internals and design choices adopted in fluidfft. This limits fftw1d (our own MPI implementation using MPI types and 1D transforms from FFTW) to 192 cores and fftwmpi3d to 384 cores. The latter can utilize more cores since it is capable of working with empty arrays, while sharing some of the computational load. The fastest methods for relatively low and high numbers of processes are fftw1d and p3dfft respectively for the present case.

The benchmark is not sufficiently accurate to measure the cost of calling the functions from Python (difference between continuous and dashed lines, i.e. between pure C++ and the as_arg Python method) and even the creation of the numpy array (difference between the dashed and the dotted lines, i.e. between the as_arg and the return Python methods).

Figure 4 demonstrates the strong scaling performance of a cubical array sized 1152 × 1152 × 1152. For this resolution as well, fftw1d is the fastest method when using only a few cores, and it cannot be used for more than 192 cores. The fastest library when using more cores is also p3dfft. This also shows that fluidfft can effectively scale to over 10,000 cores with a significant increase in speedup.

Benchmarks on Beskow

Beskow is a Cray machine maintained by SNIC at PDC, Stockholm. It runs on Intel Xeon CPU E5-2695 v4 (2.1 GHz) processors with 36 cores per node. The installation was done using the Intel C++ 18 compiler, Python 3.6.5 and CRAY-MPICH 7.0.4.

In Figure 5, the strong scaling results of the cuboid array can be observed. In this set of results we have also included intra-node scaling, wherein there is no latency introduced due to typically slower node-to-node communication. The fastest library for very low (below 16) and very high (above 384) numbers of processes in this configuration is p3dfft. For moderately high numbers of processes (16 and above), the fastest library is fftwmpi3d. Here too, we notice that fftw1d is limited to 192 cores and fftwmpi3d to 384 cores, for reasons mentioned earlier.

Figure 3: Speedup computed from the median of the elapsed times for 3D fft (384 × 1152 × 1152, left: fft and right: ifft) on Occigen.

Figure 4: Speedup computed from the median of the elapsed times for 3D fft (1152 × 1152 × 1152, left: fft and right: ifft) on Occigen.

Figure 5: Speedup computed from the median of the elapsed times for 3D fft (384 × 1152 × 1152, left: fft and right: ifft) on Beskow.

A striking difference when compared with Figure 3 is that fftw1d is not the fastest of the four classes on this machine. One can only speculate that this could be a consequence of the differences in the MPI library and hardware which have been employed. This also emphasises the need to perform benchmarks when using an entirely new configuration.

The strong scaling results of the cubical array on Beskow are displayed in Figure 6, wherein we restrict ourselves to inter-node computation. We observe that the fastest method for a low number of processes is again fftwmpi3d. When a high number of processes (above 1000) is utilized, initially p3dfft is the faster method as before, but with 3000 and above processes, pfft is comparable in speed and sometimes faster.

Benchmarks on a LEGI cluster

Let us also analyse how fluidfft scales on a computing cluster maintained at an institutional level, named Cluster8 at LEGI, Grenoble. This cluster functions using Intel Xeon CPU E5-2650 v3 (2.3 GHz) processors with 20 cores per node, and fluidfft was installed using a toolchain which comprises gcc 4.9.2, Python 3.6.4 and OpenMPI 1.6.5 as key software components.

In Figure 7 we observe that the strong scaling for an array shape of 320 × 640 × 640 is not far from the ideal linear trend. The fastest library is fftwmpi3d for this case. As expected from FFT algorithms, there is a slight drop in speedup when the array size is not exactly divisible by the number of processes, i.e. with 12 processes. The speedup declines rapidly when more than one node is employed (above 20 processes). This effect can be attributed to the latency introduced by inter-node communications, a hardware limitation of this cluster (10 Gb/s).

We have also analysed the performance of the 2D MPI-enabled FFT classes on the same machine using an array shaped 2160 × 2160 in Figure 8. The fastest library is fftwmpi2d. Both the fftw1d and fftwmpi2d libraries display near-linear scaling, except when more than one node is used and the performance tapers off.

Figure 6: Speedup computed from the median of the elapsed times for 3D fft (1152 × 1152 × 1152, left: fft and right: ifft) on Beskow.

Figure 7: Speedup computed from the median of the elapsed times for 3D fft (320 × 640 × 640) at LEGI on cluster8.

Figure 8: Speedup computed from the median of the elapsed times for 2D fft (2160 × 2160) at LEGI on cluster8.

Figure 9: Elapsed time (smaller is better) for the projection function for different implementations and tools. The shape of the arrays is (128, 128, 65). The dotted lines indicate the times for Fortran for better comparison.

As a conclusive remark on scalability, a general rule of thumb should be to use 1D domain decomposition when only very few processors are employed. For massive parallelization, 2D decomposition is required to achieve good speedup without being limited by the number of processors at disposal. We have thus shown that the overall performance of the libraries interfaced by fluidfft is quite good, and there is no noticeable drop in speedup when the Python API is used. This benchmark analysis also shows that the fastest FFT implementation depends on the size of the arrays and on the hardware. Therefore, an application built upon fluidfft can be efficient for different sizes and machines.

Microbenchmark of critical "operator" functions

As already mentioned, we use Pythran [5] to compile some critical "operator" functions. In this subsection, we present a microbenchmark for one simple task used in pseudo-spectral codes: projecting a velocity field on a non-divergent velocity field. It is performed in spectral space, where it can simply be written as:

# pythran export proj_out_of_place(
#     complex128[][][], complex128[][][], complex128[][][],
#     float64[][][], float64[][][], float64[][][], float64[][][])
def proj_out_of_place(vx, vy, vz, kx, ky, kz, inv_k_square_nozero):
    tmp = (kx * vx + ky * vy + kz * vz) * inv_k_square_nozero
    return vx - kx * tmp, vy - ky * tmp, vz - kz * tmp
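As a quick sanity check (not from the paper), one can verify numerically that the projected field is indeed divergence-free in spectral space. Everything except the projection function itself is an illustrative assumption: the wavenumber arrays are built with plain NumPy, and inv_k_square_nozero is constructed as its name suggests, i.e. the inverse of k² with the k = 0 mode replaced to avoid a division by zero.

import numpy as np

def proj_out_of_place(vx, vy, vz, kx, ky, kz, inv_k_square_nozero):
    # same projection function as above
    tmp = (kx * vx + ky * vy + kz * vz) * inv_k_square_nozero
    return vx - kx * tmp, vy - ky * tmp, vz - kz * tmp

n = 8
k1d = 2 * np.pi * np.fft.fftfreq(n)
KZ, KY, KX = np.meshgrid(k1d, k1d, k1d, indexing="ij")
k_square = KX**2 + KY**2 + KZ**2
# inverse of k^2 with the k = 0 mode replaced to avoid dividing by zero
inv_k_square_nozero = 1.0 / np.where(k_square == 0, 1.0, k_square)

rng = np.random.default_rng(0)
vx, vy, vz = (rng.standard_normal((n, n, n)) + 1j * rng.standard_normal((n, n, n))
              for _ in range(3))
vx, vy, vz = proj_out_of_place(vx, vy, vz, KX, KY, KZ, inv_k_square_nozero)

divergence = KX * vx + KY * vy + KZ * vz
print(np.abs(divergence).max())  # close to machine precision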

Note that this implementation is "out-of-place", meaning that the result is returned by the function and that the input velocity field (vx, vy, vz) is unmodified. The comment above the function definition is a Pythran annotation, which serves as a type-hint for the variables used within the function, all arguments being Numpy arrays in this case. Pythran needs such an annotation to be able to compile this code into efficient machine instructions via C++ code. Without Pythran, the annotation has no effect and, of course, the function defaults to using Python with Numpy to execute.

The array notation is well adapted and less verbose to express this simple vector calculus. Since explicit loops with indexing are not required, the computation with Python and Numpy is not extremely slow. Despite this being quite a favourable case for Numpy, the computation with Numpy is not optimized because, internally, it involves many loops (one per arithmetic operator) and the creation of temporary arrays.

In the top axis of Figure 9, we compare the elapsed times for different implementations of this function. For this out-of-place version, we used three different codes:

1. a Fortran code (not shown)9 written with three nested explicit loops (one per dimension). Note that, as in the Python version, we also allocate the memory where the result is stored.
2. the simplest Python version shown above.
3. a Python version with three nested explicit loops:

import numpy as np

# pythran export proj_out_of_place_loop(
#     complex128[][][], complex128[][][], complex128[][][],
#     float64[][][], float64[][][], float64[][][], float64[][][])
def proj_out_of_place_loop(vx, vy, vz, kx, ky, kz, inv_k_square_nozero):
    rx = np.empty_like(vx)
    ry = np.empty_like(vx)
    rz = np.empty_like(vx)

    n0, n1, n2 = kx.shape

    for i0 in range(n0):
        for i1 in range(n1):
            for i2 in range(n2):
                tmp = (kx[i0, i1, i2] * vx[i0, i1, i2]
                       + ky[i0, i1, i2] * vy[i0, i1, i2]
                       + kz[i0, i1, i2] * vz[i0, i1, i2]
                       ) * inv_k_square_nozero[i0, i1, i2]
                rx[i0, i1, i2] = vx[i0, i1, i2] - kx[i0, i1, i2] * tmp
                ry[i0, i1, i2] = vy[i0, i1, i2] - ky[i0, i1, i2] * tmp
                rz[i0, i1, i2] = vz[i0, i1, i2] - kz[i0, i1, i2] * tmp

    return rx, ry, rz

For the version without explicit loops, we present the elapsed time for two cases: (i) simply using Python (yellow bar) and (ii) using the Pythranized function (first cyan bar). For the Python version with explicit loops, we only present the results for (i) the Pythranized function (second cyan bar) and (ii) the result of Numba (blue bar). We do not show the result for Numba for the code without explicit loops because it is slower than Numpy. We have also omitted the result for Numpy for the code with explicit loops because it is very inefficient. The timing is performed after tuning the computer using the package perf.

We see that Numpy is approximately three times slower than the Fortran implementation (which, as already mentioned, contains the memory allocation). Just using Pythran without changing the code (first cyan bar), we save nearly 50% of the execution time, but we are still significantly slower than the Fortran implementation. We reach the Fortran performance (even slightly faster) only by using Pythran with the code with explicit loops. With this code, Numba is nearly as fast (but still slower) without requiring any type annotation.

Note that the exact performance differences depend on the hardware, the software versions,10 the compilers and the compilation options. We use gfortran -O3 -march=native for Fortran and clang++ -O3 -march=native for Pythran.11

Since allocating memory is expensive and we do not need the non-projected velocity field after the call of the function, an evident optimization is to put the output in the input arrays. Such an "in-place" version can be written with Numpy as:

# pythran export proj_in_place(
#     complex128[][][], complex128[][][], complex128[][][],
#     float64[][][], float64[][][], float64[][][], float64[][][])
def proj_in_place(vx, vy, vz, kx, ky, kz, inv_k_square_nozero):
    tmp = (kx * vx + ky * vy + kz * vz) * inv_k_square_nozero
    vx -= kx * tmp
    vy -= ky * tmp
    vz -= kz * tmp

As in the first version, we have included the Pythran annotation. We also consider an "in-place" version with explicit loops:

# pythran export proj_in_place_loop(
#     complex128[][][], complex128[][][], complex128[][][],
#     float64[][][], float64[][][], float64[][][], float64[][][])
def proj_in_place_loop(vx, vy, vz, kx, ky, kz, inv_k_square_nozero):

    n0, n1, n2 = kx.shape

    for i0 in range(n0):
        for i1 in range(n1):
            for i2 in range(n2):
                tmp = (kx[i0, i1, i2] * vx[i0, i1, i2]
                       + ky[i0, i1, i2] * vy[i0, i1, i2]
                       + kz[i0, i1, i2] * vz[i0, i1, i2]
                       ) * inv_k_square_nozero[i0, i1, i2]
                vx[i0, i1, i2] -= kx[i0, i1, i2] * tmp
                vy[i0, i1, i2] -= ky[i0, i1, i2] * tmp
                vz[i0, i1, i2] -= kz[i0, i1, i2] * tmp

Note that this code is much longer and clearly less readable than the version without explicit loops. This is however the version which is used in fluidfft, since it leads to faster execution.

The elapsed times for these in-place versions and for an equivalent Fortran implementation are displayed in the bottom axis of Figure 9. The ranking is the same as for the out-of-place versions and Pythran is also the fastest solution. However, Numpy is even slower (7.8 times slower than Pythran with the explicit loops) than for the out-of-place versions.

From this short and simple microbenchmark, we can infer four main points:

• Memory allocation takes time! In Python, memory management is automatic and we tend to forget it. An important rule to write efficient code is to reuse the buffers already allocated as much as possible.
• Even for this very simple case, quite favorable for Numpy (no indexing or slicing), Numpy is three to eight times slower than the Fortran implementations. As long as the execution time is small or the function represents a small part of the total execution time, this is not an issue. However, in other cases, Python-Numpy users need to consider other solutions.
• Pythran is able to speed up the Numpy code without explicit loops and is as fast as Fortran (even slightly faster in our case) for the codes with explicit loops.
• Numba is unable to speed up the Numpy code. It gives very interesting performance for the version with explicit loops without any type annotation (see the sketch after this list), but the result is significantly slower than with Pythran and Fortran.
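For reference, a minimal sketch (not from the paper) of how the Numba measurement described above can be reproduced; it reuses the proj_out_of_place_loop function and the arrays defined earlier and assumes nothing beyond Numba's standard njit decorator. The Pythran variants, by contrast, are compiled ahead of time from the annotated module.

from numba import njit

# JIT-compile the pure-Python loop version defined above; no type annotation needed.
proj_numba = njit(proj_out_of_place_loop)

# The first call triggers compilation; subsequent calls run as native code.
rx, ry, rz = proj_numba(vx, vy, vz, KX, KY, KZ, inv_k_square_nozero)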
For the aforementioned reasons, we have preferred Pythran to compile optimized "operator" functions that complement the FFT classes. Although with this we obtain remarkable performance, there is still room for some improvement, in terms of logical implementation and allocation of arrays. For example, applications such as CFD simulations often deal with non-linear terms which require dealiasing. The FFT classes of fluidfft currently allocate the same number of modes in the spectral array as needed to transform the physical array. Thereafter, we apply dealiasing by setting to zero the wavenumbers which are larger than, say, two-thirds of the maximum wavenumber. Instead, we could take into account dealiasing in the FFT classes to save some memory and computation time.12

Quality control

The package fluidfft currently supplies unit tests covering 90% of its code. These unit tests are run regularly through continuous integration on Travis CI with the most recent releases of fluidfft's dependencies and on Bitbucket Pipelines inside a static Docker container. The tests are run using the standard Python interpreter with all supported versions.

For fluidfft, the code coverage results are displayed at Codecov. Using the third-party packages coverage and tox, it is straightforward to bootstrap the installation with dependencies, test with multiple Python versions and combine the code coverage report, ready for upload. It is also possible to run similar isolated tests using tox, or coverage analysis using coverage, on a local machine. Up-to-date build status and coverage status are displayed on the landing page of the Bitbucket repository. Instructions on how to run unit tests, coverage and lint tests are included in the documentation.

We also try to follow a consistent code style as recommended by PEP (Python Enhancement Proposals) 8 and 257. This is also inspected using lint checkers such as flake8 and pylint among the developers. The Python code is regularly cleaned up using the code formatter black.

(2) Availability

Operating system

Windows and any POSIX based OS, such as GNU/Linux and macOS.

Programming language

Python 2.7, 3.5 or above. For the next versions, we will drop Python 2.7 support and Python >= 3.6 will be required. Note that while Cython and Pythran both use the C API of CPython, fluidfft has been successfully tested on PyPy 6.0. A C++11 supporting compiler, while not mandatory for the C++ API or Cython extensions of fluidfft, is recommended to be able to use Pythran extensions.

Dependencies

C++ API:

• Optional: OpenMPI or equivalent, FFTW, P3DFFT, PFFT and cuFFT libraries.

Python API:

• Minimum: fluiddyn, Numpy, Cython, and mako or Jinja2; FFTW library.
• Optional: mpi4py and Pythran; P3DFFT, PFFT and cuFFT libraries.

List of contributors

• Pierre Augier (LEGI): creator of the FluidDyn project [1] and of fluidfft.
• Cyrille Bonamy (LEGI): C++ code and some methods in the operator classes.
• Ashwin Vishnu Mohanan (KTH): command-line utilities, benchmarks, unit tests, continuous integration, and bug fixes.

Software location

Name: PyPI
Persistent identifier: https://pypi.org/project/fluidfft
Licence: CeCILL, a free software license adapted to both international and French legal matters, in the spirit of and retaining compatibility with the GNU General Public License (GPL).
Publisher: Pierre Augier
Version published: 0.2.4
Date published: 02/07/2018

Code repository

Name: Bitbucket
Persistent identifier: https://bitbucket.org/fluiddyn/fluidfft
Licence: CeCILL
Date published: 2017

Emulation environment

Name: Docker
Persistent identifier: https://hub.docker.com/r/fluiddyn/python3-stable
Licence: CeCILL-B, a BSD compatible French licence.
Date published: 02/10/2017

Language

English

(3) Reuse potential

fluidfft is used by the Computational Fluid Mechanics framework fluidsim [6]. It could be used by any C++ or Python project where real-to-complex 2D or 3D FFTs are performed.

There is no formal support mechanism. However, bug reports can be submitted at the Issues page on Bitbucket. Discussions and questions can be aired on instant messaging channels in Riot (or equivalent with Matrix protocol)13 or via the IRC protocol on Freenode at #fluiddyn-users. Discussions can also be exchanged via the official mailing list.14

Notes

1 Most C++ classes also support single-precision.
2 https://fluidfft.readthedocs.io/en/latest/examples/cpp.html.
3 https://fluidfft.readthedocs.io/en/latest/doxygen/index.html.
4 https://fluidfft.readthedocs.io/en/latest/tutorials.html.
5 Detailed steps for installation are provided in the documentation.
6 Uses an approach similar to the guidelines "Using C++ in Cython" in the Cython documentation.
7 Detailed discussion in the "FFT 3D parallel (MPI): Domain decomposition" tutorial.
8 Saved at https://bitbucket.org/fluiddyn/fluidfft-bench-results.
9 The codes and a Makefile used for this benchmark study are available in the repository of the article.
10 Here, we use Python 3.6.4 (packaged by conda-forge), Numpy 1.13.3, Pythran 0.8.5, Numba 0.38, gfortran 6.3 and clang 6.0.
11 The results with g++ -O3 -march=native are very similar but tend to be slightly slower.
12 See fluidfft issue 21.
13 https://matrix.to/#/#fluiddyn-users:matrix.org.
14 https://www.freelists.org/list/fluiddyn.

Acknowledgements

Ashwin Vishnu Mohanan could not have been as involved in this project without the kindness of Erik Lindborg. We are grateful to Bitbucket for providing us with a high quality forge compatible with Mercurial, free of cost.

Competing Interests

The authors have no competing interests to declare.

References

1. Augier, P, Mohanan, A V and Bonamy, C 2019 'FluidDyn: A Python open-source framework for research and teaching in fluid dynamics by simulations, experiments and data processing'. Journal of Open Research Software (DOI).
2. Behnel, S, Bradshaw, R, Citro, C, Dalcin, L, Seljebotn, D S and Smith, K 2011 'Cython: The Best of Both Worlds'. Computing in Science & Engineering, 13(2): 31–39. DOI: https://doi.org/10.1109/MCSE.2010.118
3. Cooley, J W and Tukey, J W 1965 'An Algorithm for the Machine Calculation of Complex Fourier Series'. Mathematics of Computation, 19(90): 297–301. DOI: https://doi.org/10.1090/S0025-5718-1965-0178586-1
4. Frigo, M and Johnson, S G 2005 'The design and implementation of FFTW3'. Proceedings of the IEEE, 93(2): 216–231. DOI: https://doi.org/10.1109/JPROC.2004.840301
5. Guelton, S 2018 'Pythran: Crossing the Python Frontier'. Computing in Science & Engineering, 20(2): 83–89. DOI: https://doi.org/10.1109/MCSE.2018.021651342
6. Mohanan, A V, Bonamy, C and Augier, P 2019 'FluidSim: Modular, object-oriented Python package for high-performance CFD simulations'. Journal of Open Research Software (DOI).
7. Pekurovsky, D 2012 'P3DFFT: A framework for parallel computations of Fourier transforms in three dimensions'. SIAM Journal on Scientific Computing, 34(4): C192–C209. DOI: https://doi.org/10.1137/11082748X
8. Pippig, M 2013 'PFFT: An extension of FFTW to massively parallel architectures'. SIAM Journal on Scientific Computing, 35(3): C213–C236. DOI: https://doi.org/10.1137/120885887

How to cite this article: Mohanan, A V, Bonamy, C and Augier, P 2019 FluidFFT: Common API (C++ and Python) for Fast Fourier Transform HPC Libraries. Journal of Open Research Software, 7: 10. DOI: https://doi.org/10.5334/jors.238

Submitted: 02 July 2018 Accepted: 18 February 2019 Published: 01 April 2019

Copyright: © 2019 The Author(s). This is an open-access article distributed under the terms of the Creative Commons Attribution 4.0 International License (CC-BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. See http://creativecommons.org/licenses/by/4.0/.

Journal of Open Research Software is a peer-reviewed open access journal published by Ubiquity Press. OPEN ACCESS