Constructing Parallel Algorithms for Discrete Transforms
Constructing Parallel Algorithms for Discrete Transforms: From FFTs to Fast Legendre Transforms

(Constructie van Parallelle Algoritmen voor Discrete Transformaties: Van FFTs tot Snelle Legendre Transformaties)

(with a summary in Dutch)

Thesis submitted for the degree of doctor at Utrecht University, on the authority of the Rector Magnificus, Prof. dr. H. O. Voorma, pursuant to the decision of the Doctorate Board (College voor Promoties), to be defended in public on a Wednesday afternoon in March by Marcia Alves de Inda, born in August in Porto Alegre, Brazil.

Supervisor (promotor): Prof. dr. Henk A. van der Vorst
Co-supervisor (copromotor): Dr. Rob H. Bisseling
Faculty of Mathematics and Computer Science, Utrecht University

Mathematics Subject Classification: T Y C

Inda, Marcia Alves de. Constructing Parallel Algorithms for Discrete Transforms: From FFTs to Fast Legendre Transforms. Thesis, Utrecht University. With a summary in Dutch. ISBN

The work described in this thesis was carried out at the Mathematics Department of Utrecht University, The Netherlands, with financial support from the Fundação Coordenação de Aperfeiçoamento de Pessoal de Nível Superior (CAPES).

Aos carvoeiros. In my grandmother Ega's words, "a family that is always together": my family.

Preface

The initial target of my doctoral research on parallel discrete transforms was to develop a parallel fast Legendre transform (FLT) algorithm based on the sequential Driscoll-Healy algorithm. To do this, I had to study their algorithm in depth. This task was greatly simplified thanks to previous work by David K. Maslen, with whom Rob H. Bisseling and I worked to crack this nut.

After understanding and implementing the sequential FLT algorithm, I aimed at developing a parallel distributed-memory version. Using the bulk synchronous parallel (BSP) model, and assuming that a parallel fast cosine transform (FCT) algorithm was available, it was easy to devise a basic parallel FLT algorithm. Such a basic parallel algorithm was already known, though to
my knowledge it had never been implemented. With the basic parallel algorithm at hand, I still had two things to do: develop a parallel FCT algorithm, and investigate the possibility of improving the basic parallel FLT algorithm.

Developing a parallel FCT algorithm involved searching for a suitable sequential FCT algorithm, i.e., an algorithm that could be parallelized with a minimum of communication overhead. Since there is substantial knowledge of parallel fast Fourier transforms (FFTs), we chose to restrict our search to the class of FCT algorithms that are based on FFTs, i.e., algorithms which pack the input data, transform them using an FFT, and then extract the cosine transform from the transformed data. After going through many of those algorithms, I finally decided to implement Narasimha and Peterson's algorithm, which I had found in Van Loan's book. The transform phase of this FCT algorithm consists of a real FFT, i.e., an FFT for real input data. In turn, a real FFT can be carried out using a complex FFT.

At this point I had to deal with four different parallel discrete transform algorithms: FFTs, real FFTs, FCTs, and FLTs. After implementing basic versions of those four transforms, we turned to the problem of improving the basic parallel FLT algorithm. With the introduction of a new data distribution, which we call the zigzag cyclic distribution, we were able to reduce the communication cost of the parallel real FFT and of the parallel FCT to that of a parallel complex FFT of half the size, thus obtaining the data packing and extraction phases at no extra communication cost. Breaking open the FCT module inside the FLT algorithm further reduced the communication cost of the basic parallel FLT algorithm by a factor of three. This last optimization step was greatly facilitated by the introduction of the FCT2 algorithm, an algorithm that computes two discrete cosine transforms simultaneously.

A project that started with the aim of developing a parallel fast Legendre
transform ended up creating a collection of useful parallel transforms that can be used throughout the computational sciences. From solving differential equations numerically to signal processing, FFTs are among the most used numerical tools in the computational sciences. As real versions of the FFT, the RFFT and the FCT are also widely used. The FLT is still young and has yet to earn its place among its better-known cousins. The discrete Legendre transform, however, is widely used as part of two-dimensional Legendre transforms and three-dimensional spherical harmonic transforms. My parallel FLT is only a first step in developing parallel two-dimensional FLTs and parallel fast spherical harmonic transforms.

Together with this thesis, I intend to release a version of the Bulk Synchronous Parallel Fast Transform package (BSPFTpack) that I developed, in the hope of bringing parallel transforms within easy reach of anyone interested in parallel computing. This package was written in ANSI C and uses BSPlib, which is freely available, as its communication library. An alternative would be to adapt the program to use another communication library, such as MPI. Together with the parallel package, I also intend to release its sequential version.

My thesis has the following structure. In Chapter 1, I describe the BSP model and discuss relevant aspects of parallel computing. In Chapter 2, I derive a parallel FFT algorithm, which serves as the basis for the rest of my thesis. In Chapter 3, I derive parallel algorithms for the RFFT, the FCT, and the FCT2. In Chapter 4, I introduce the Driscoll-Healy algorithm and derive a parallel FLT algorithm. Appendix A describes the BSP parameters for the Cray T3E, which is the parallel computer used for the numerical experiments. Appendix B presents the sine-cosine table used in my programs, and Appendices C and D contain material supplementary to the FLT chapter.

Contents

Preface
List of Algorithms

1. Basic Concepts in Parallel Computation
  Terminology
  Bulk synchronous parallel model
  Measuring the performance of a parallel algorithm: the BSP cost function
  Data distributions and permutations
  BSP algorithms
  Implementation issues

2. Fast Fourier Transform
  Introduction
  Background
  Brief introduction to parallel radix-2 FFTs
  The parallel algorithm
  Basic idea
  Group-cyclic distribution family
  Fourier matrix decomposition
  Implementation of the parallel algorithm
  Generalized butterflies
  Parallel bit reversal
  Permutations involving the group-cyclic family
  Permutation from block to cyclic distribution
  BSP cost
  Variants of the algorithm
  Parallel FFT using other data distributions
  Generalized butterfly phase with adjustable size
  Cache-friendly parallel FFT
  Experimental results and discussion
  Accuracy
  Performance of the sequential implementation: the need for cache-friendly algorithms
  Scalability of the parallel implementation
  Alternative algorithms
  Six-pass algorithms and transpose algorithms
  Parallel mixed-radix FFT: a generalized pass algorithm
  Conclusions and future work

3. Real Fast Fourier Transform and Fast Cosine Transform
  Introduction
  Parallel CFFT-based algorithms and the zigzag cyclic distribution
  Selection of a suitable sequential algorithm
  Zigzag cyclic distribution
  Permutation from block to zigzag cyclic distribution
  Pairwise operations using the zigzag cyclic distribution
  Parallel algorithms and their inverses
  Parallel real fast Fourier transform algorithm
  Parallel fast cosine transform algorithm
  Inverse transforms
  Simultaneous fast cosine transform of two vectors
  Derivation
  Parallel forward transform
  Parallel backward transform
  Results and discussion
  Accuracy
  Performance of the real FFT
  Performance of the FCT and FCT2
  Conclusions

4. Fast Legendre Transform
  Introduction
  Driscoll-Healy algorithm
  Orthogonal polynomials
  Derivation of the Driscoll-Healy algorithm
  Data representation and recurrence procedure
  Early termination
  Complexity of the algorithm
  Basic parallel algorithm and its implementation
  Data structures and data distributions
  Basic parallel template
  Improvements of the parallel algorithm
  Optimization of the main loop
  Optimized template
  Parallel termination
  Experimental results and discussion
  Accuracy
  Efficiency of the sequential implementation
  Scalability of the parallel implementation
  Conclusions and future work

Final Remarks and Future Work
Appendix A: BSP Parameters on the Cray T3E
Appendix B: Lookup Table
Appendix C: Related Polynomial Transforms and Algorithms
Appendix D: Precomputation Algorithm for the FLT
Bibliography
Summary
Samenvatting (summary in Dutch)
Curriculum Vitae
List of Publications
Acknowledgments

List of Algorithms

Sequential radix-2 FFT algorithm
Template for the parallel fast Fourier transform
Template for the sequential generalized butterfly operations
Template for the parallel bit reversal
Template for the parallel permutation from r cyclic to r cyclic distribution
Template for the parallel permutation from cyclic to block distribution
Template for the parallel permutation from block to cyclic distribution
Template for the parallel mixed-radix FFT
Template for the parallel permutation from block to zigzag cyclic distribution
Template for the sequential generalized zigzag butterfly operations
Template for the sequential extract-pack phase of the RFFT
Template for the sequential extract phase of the FCT
Template for the parallel forward real fast Fourier transform