A comparative analysis of the performance and deployment overhead of parallelized Finite Difference Time Domain (FDTD) algorithms on a selection of high performance multiprocessor computing systems

by

RG Ilgner

Dissertation presented in fulfilment of the requirements for the degree Doctor of Philosophy in the Faculty of Engineering at Stellenbosch University

Promoter: Prof DB Davidson, Department of Electrical & Electronic Engineering

December 2013

Declaration

By submitting this thesis electronically, I declare that the entirety of the work contained therein is my own, original work, that I am the sole author thereof (save to the extent explicitly otherwise stated), that reproduction and publication thereof by Stellenbosch University will not infringe any third party rights and that I have not previously in its entirety or in part submitted it for obtaining any qualification.

December 2013

Copyright © 2013 Stellenbosch University. All rights reserved.

Abstract

The parallel FDTD method as used in computational electromagnetics is implemented on a variety of different high performance computing platforms. These parallel FDTD implementations have regularly been compared in terms of performance or purchase cost, but very little systematic consideration has been given to the effort required to create the parallel FDTD for a specific computing architecture. This deployment effort has changed dramatically with time: creating an FDTD implementation in the 1980s could take months, whereas a parallel FDTD method can now be implemented in a matter of hours. This thesis compares the effort required to deploy the parallel FDTD on selected computing platforms in terms of the constituents that make up that effort, such as coding complexity and coding time. It uses the deployment and performance of the serial FDTD method on a single personal computer as a benchmark and examines deployments of the parallel FDTD using different parallelisation techniques. These FDTD deployments are then analysed and compared against one another in order to determine the characteristics common to FDTD implementations on various computing platforms with differing parallelisation techniques. Although subjective in some instances, these characteristics are quantified and compared in tabular form, using the research information produced by the parallel FDTD implementations. The deployment effort is of interest to scientists and engineers considering the creation or purchase of an FDTD-like solution on a high performance computing platform. Although the FDTD method has in the past been considered a brute force approach to solving computational electromagnetic problems, this was very probably a consequence of the relatively weak computing platforms of the time, which took very long periods to process even small model sizes. This thesis describes current implementations of the parallel FDTD method, made up of a combination of several techniques. These techniques can be deployed easily and in a relatively short time frame on computing architectures ranging from IBM’s Bluegene/P to the amalgamation of multicore processor and graphics processing unit known as an accelerated processing unit.


Opsomming

The parallel Finite Difference Time Domain (FDTD) method is used in computational electromagnetics and can be implemented on a variety of high performance computers. These parallel FDTD implementations are regularly compared in terms of performance or purchase cost, but the effort required to create the parallel FDTD for a specific computer architecture is rarely considered systematically. The deployment effort has changed dramatically over time: in the 1980s deployment typically took months, whereas today it can be done in a matter of hours. This thesis compares the effort required to deploy the parallel FDTD on selected computing platforms by considering factors such as the complexity of the coding and the time it takes to implement the code. The performance of the serial FDTD method, implemented on a single personal computer, is used as a benchmark against which deployments of the parallel FDTD with various parallelisation techniques are evaluated. By analysing and comparing these FDTD deployments with different parallelisation techniques, the common characteristics across the various computing platforms are determined. Although subjective in some instances, these characteristics are quantified and compared in tabular form using the research information produced by the parallel FDTD implementations. The deployment effort is of interest to scientists and engineers who must decide between developing or purchasing an FDTD-type solution on a high performance computer. Although the FDTD method has in the past been regarded as a brute force approach to solving electromagnetic problems, this was probably due to the relatively weak computing platforms that took a long time to process small models. This thesis describes the modern implementation of the parallel FDTD method, consisting of a combination of several techniques. These techniques can easily be deployed in a relatively short time frame on computer architectures ranging from IBM’s BlueGene/P to the amalgamation of multicore processors and graphics processing units, better known as an accelerated processing unit.


Acknowledgments

I would like to thank my supervisor, Professor David Davidson, for his assistance and motivation in the writing of this dissertation and for the lengthy path of enlightenment that has led me here. Thank you to Neilen Marais, Evan Lezar, and Kevin Colville for their valuable reviews and input. Very many thanks to the helpful technical support crew at the Centre for High Performance Computing for their assistance and for the interesting discussions about High Performance Computing.

My gratitude to my family for giving me the time to do the thesis; in particular, my thanks to Marie-Claire for her constant support and for helping with the formatting and checking of the thesis.

There are many others that I am grateful to, but am avoiding writing a lengthy list in case I omit someone.

My final acknowledgment is to all of the people who have assisted via articles and discussions in internet forums, blogs and newsgroups. I have been both encouraged by these interactions and humbled by the selfless contributions and experience of others. I have found this a source of motivation in completing several parts of this thesis.


Glossary & Acronyms

The reader is reminded that this thesis contains many domain-specific acronyms and that these are written out in full at their first mention in every chapter. The glossary compiled below should serve as a fall-back to identify all the acronyms, some of which are used in the chapter descriptions below.

ABC Absorbing Boundary Conditions
ALU Arithmetic Logic Unit
AMD Advanced Micro Devices; manufacturer of computer processors
API Application Programming Interface
APU Accelerated Processing Unit
ARM Advanced RISC Machines; designer of RISC based computer processors
ASP Associative String Processor
ASIC Application Specific Integrated Circuit
ATI ATI Technologies; the GPU division of AMD
AUTOCAD Commercial Computer Aided Design software
AVX Advanced Vector eXtensions
Blade A high performance processing board consisting of closely coupled processors
C Higher level computing language
C++ Higher level computing language using class structures
CAD Computer Aided Design
CHPC Centre for High Performance Computing in Cape Town
CISC Complex Instruction Set Computing
CMP Chip-Multiprocessors
CMU Computer Memory Unit: a high performance board containing several processors and associated memory; used in the SUN M9000
Core A physical component of a processor capable of executing at least one program
CPU Central Processing Unit
CUDA Compute Unified Device Architecture
DirectCompute Application programming interface that supports general purpose computing on graphics processors
DirectX11 Application programming interface for graphics and games programming
DRAM Dynamic Random Access Memory
DSMP Distributed Shared Memory Processor
Dxf Drawing eXchange Format; an AUTOCAD data exchange file format
FDTD Finite Difference Time Domain
FEKO Electromagnetic modelling software based on the MOM
FEM Finite Element Method
FMA Fused Multiply Add; a hardware instruction combining a multiplication and an addition in a single operation


FORTRAN FORmula TRANslation; a higher level computing language
FPGA Field Programmable Gate Array
FPU Floating Point Unit
GB Gigabyte
GEMS Electromagnetic modelling software based on the FDTD
GFLOP Giga FLoating point OPerations
GPGPU General Purpose GPU
GPS Global Positioning System; a system of satellites used for global navigation
GPU Graphical Processing Unit
GPUDirect A technology from Nvidia that allows the exchange of data between dedicated GPU memories without having to go through the host computer
Hadoop A framework for the distributed storage and processing of large data sets
HPC High Performance Computing
IC Integrated Circuit
IEEE Institute of Electrical and Electronics Engineers
Intel Manufacturer of computer processors and other computing hardware and software
LISP LISt Processing; one of the original higher level programming languages
LTE Long Term Evolution; a fourth generation mobile phone communications technology
Matlab A high level computer language for numerical modelling
MCPs Million Cells Per second
MIC Many Integrated Core
MIMD Multiple Instruction Multiple Data
Moab A job scheduling mechanism used by supercomputing centres to schedule the running of jobs on their machines
MOM Method of Moments
MPI Message Passing Interface
MTL Manycore Test Lab
Node A processing unit within a cluster, such as a blade, a multicore processor or even a single CPU
NUMA Non Uniform Memory Architecture
numPy Library of numerical functions for the Python language
Nvidia Manufacturer of a popular brand of GPUs
Octave A high level computer language for numerical modelling
OSI Open Systems Interconnection
OTS Off-The-Shelf
PCI Peripheral Component Interconnect
PCIe Peripheral Component Interconnect express
PEC Perfect Electric Conductor
PMC Perfect Magnetic Conductor
PML Perfectly Matched Layer
POSIX Portable Operating System Interface; a set of standards from the IEEE for maintaining compatibility between operating systems
PVM Parallel Virtual Machine
QPI Quick Path Interconnect


RISC Reduced Instruction Set Computing
SIMD Single Instruction Multiple Data
SIMT Single Instruction Multiple Thread
SM Streaming Multiprocessor
SMP Shared Memory Processor or Platform
SP Stream Processor
SSE Streaming SIMD Extensions
TE Transverse Electric field
Thread A stream of instructions executing on a physical processor
TM Transverse Magnetic field
UMA Uniform Memory Architecture


Contents

1 Introduction
   1.1 The objective
   1.2 The FDTD for electromagnetics in high performance computing
   1.3 High performance computing
   1.4 Research approach
   1.5 Chapter overview

2 Literature Review of the FDTD on HPC platforms
   2.1 Overview
   2.2 History of the FDTD in the context of computational electromagnetics
   2.3 FDTD software development
   2.4 Previous examples of FDTD deployments on a variety of HPC platforms
      2.4.1 NUMA (Non-uniform memory architectures)
      2.4.2 UMA (Uniform Memory Architecture)
      2.4.3 Acceleration devices
   2.5 Summary

3 The FDTD algorithm
   3.1 Overview
   3.2 Maxwell’s equations
   3.3 The FDTD modelling scenarios
   3.4 Cell sizing, algorithmic stability
   3.5 Boundary conditions
   3.6 EM sources and reflectors
   3.7 Validation of the FDTD code
   3.8 Computing languages and APIs
   3.9 Performance profiling the serial FDTD
   3.10 Estimate of the parallel FDTD coding effort
   3.11 Summary

4 Architectures of HPC systems
   4.1 Overview
   4.2 Hardware architectural considerations
      4.2.1 The processing engine / the compute engine
      4.2.2 Memory systems
      4.2.3 Communication channels
      4.2.4 Hyper-threading (HT) / Symmetric Multithreading (SMT)
   4.3 Examples of HPC systems
      4.3.1 Discrete GPUs
      4.3.2 Accelerated processing units
      4.3.3 SUN M9000 Shared Memory Processor (SMP)
      4.3.4 Intel multicore at the Manycore Test Lab (MTL)
      4.3.5 Nehalem cluster computer and Westmere cluster computer
      4.3.6 Bluegene/P (IBM’s massively parallel processing system)
      4.3.7 Intel’s Many Integrated Core (MIC) processor
   4.4 Other considerations
      4.4.1 Power consumption of high performance computer systems
   4.5 Summary

5 Implementation of parallelisation methods for the FDTD
   5.1 Overview
   5.2 Amdahl’s law, Koomey’s law
   5.3 Parallel FDTD processing dependencies
   5.4 Parallel FDTD with openMP
   5.5 Parallel FDTD with MPI
   5.6 Acceleration of the FDTD with vector ALU/Streaming SIMD Extensions and cache prefetching
   5.7 Parallel FDTD for the GPU using CUDA (Compute Unified Device Architecture)
   5.8 Parallel FDTD with low level threading
   5.9 Parallel FDTD on a cluster of two GPUs
   5.10 Parallel FDTD on the GPU using the DirectCompute interface
   5.11 Accelerated Processing Units and Heterogeneous Systems
   5.12 Summary

6 A comparison of the FDTD method on different HPC platforms
   6.1 Overview
   6.2 Comparison of FDTD implementations
      6.2.1 GPGPU and SMP
      6.2.2 GPGPU and multicore SIMD
      6.2.3 SMP and Bluegene/P
      6.2.4 Cluster and SMP
      6.2.5 Integrated GPGPU and discrete GPGPU
      6.2.6 Other systems
   6.3 General classification table
      6.3.1 Scalability of process
      6.3.2 Scalability of hardware
      6.3.3 Memory access efficiency
      6.3.4 Openness
      6.3.5 Algorithm encoding
      6.3.6 Performance in speed
      6.3.7 Performance in terms of model size
      6.3.8 Cost of purchase
      6.3.9 Power consumption
      6.3.10 Implementation time
      6.3.11 Future proofing from the aspect of coding
   6.4 Summary

7 Conclusions
   7.1 Contributions
   7.2 Publications
   7.3 Trends in FDTD and HPC
   7.4 Possible directions for future work
   7.5 Conclusions

8 Bibliography

9 Appendix - Verification of the FDTD implementations
   9.1 Objectives of the verification
   9.2 Far field modelling with the FDTD
   9.3 The FDTD model
   9.4 Modelling with FEKO
   9.5 A comparison of the FDTD and FEKO derived MOM results


Chapter 1

1 Introduction

1.1 The objective
1.2 The FDTD for electromagnetics in high performance computing
1.3 High performance computing
1.4 Research approach
1.5 Chapter overview

1.1 The objective

This thesis compares the deployment effort of the parallel FDTD, as used for electromagnetic modelling, on a variety of High Performance Computing (HPC) platforms. It is a maxim of hardware and software development that the two will always evolve together, and this thesis will show that the evolution of the FDTD corresponds with developments in HPC. The parallelisation of the FDTD usually provides two advantages over the serial FDTD: a reduction in the processing time of the FDTD, and the ability to process extremely large data sets.

The method of converting from the serial operation of the FDTD to the parallel FDTD differs from one platform to the next. A comparison is made of the deployment mechanics and the resultant performance achieved. An indication is given of the complexity involved in converting the FDTD to a particular HPC platform in the form of a tabulated coding effort. Other considerations include issues relating to accessibility of the system and dedicated usage.

The examination of the deployment effort involves a comparison of:

1) A description of the architecture and features of HPC platforms, and the suitability of that platform for the FDTD.
2) The mechanics of communication between parallel FDTD components.
3) How the serial FDTD code is decomposed and reconstituted in parallel form for a specific HPC platform, i.e. the implementation strategy.
4) The relative performance achieved by the parallel FDTD on a particular HPC platform.
5) Implementation effort, skills required, future proofing.
6) Accessibility to the development system, development tools, etc.

The comparison of the performance of one computing platform against another is a subjective issue in the sense that the range of hardware defining a platform varies greatly. While it may be obvious from the detail shown in this thesis that the Graphical Processing Unit (GPU) can process FDTD cells at a much greater rate than a single core CPU, it may not be immediately evident that the FDTD processing power of different types of GPUs varies greatly. This aspect needs to be borne in mind when one makes a comparison of the parallel FDTD implemented on a GPU against a larger platform such as the SUN M9000 Shared Memory Processor (SMP).

One very notable aspect of this work is that it was written during a period of intense development in the field of HPC. Hardware and software platforms are being revised and improved at an ever faster rate, as will be described in more detail in chapter 2. One personal experience is a good example of the rapid evolution of HPC platforms. A microprocessor development platform was purchased to investigate the FDTD using the Streaming SIMD Extensions (SSE) of a multicore processor, only to find that an enhanced form of this functionality, the Advanced Vector eXtensions (AVX), was being released sooner than expected and would require the acquisition of a new microprocessor as a trial development platform. Unfortunately this new microprocessor also required a new motherboard, as its pin configuration was not the same as that of the previous generation: the new microprocessor used a motherboard socket with 1155 pins, as opposed to the older microprocessor’s socket with 1156 pins. The moral of this little example is that unless one has unlimited resources to continuously purchase new hardware, one needs to “future-proof” the FDTD deployment. This concept is one of the contributing factors to the deployment effort and is addressed in this thesis.

The GPU is another example of the rapid evolution of the FDTD and its hardware deployment. The hardware efficiency and computing strength of GPUs have grown rapidly over the last decade, and the introduction of relatively friendly application programming interfaces, such as Nvidia’s CUDA and the Khronos Group’s OpenCL, has made the GPU accessible to the Computational Electromagnetics (CEM) market.

1.2 The FDTD for electromagnetics in high performance computing

The FDTD is a method of approximating Maxwell’s equations for electromagnetics using finite difference equations. In the practical sense the FDTD algorithm is used to model diverse electromagnetic scenarios such as:

1) The effect of radiation on human tissue from a variety of electromagnetic sources.
2) Modelling of the earth’s geological subsurface in the oil and gas industry.
3) Modelling of electromagnetic waves for antenna and transmitter design.

Electromagnetic scenarios such as complex antenna systems or the reflection of laser beams from the surface of DVDs can be modelled in detail at ever finer resolution, and this in turn increases the demand for computing power when the FDTD method is used. It will be shown that research and development in the FDTD method and in HPC are very closely related.

The basic computational mechanics of the FDTD can be traced back to the seminal paper by Kane Yee [3] and a series of publications by Taflove and Hagness [2, 3] in the early 1980s. The FDTD algorithm is computationally intense, and in the emerging years of the method the FDTD modelling space was limited by the meagre memory available at that time. Longer computations have been known to run for a matter of days, and HPC hardware was not readily accessible during that period.


The improvement in HPC capacity since the end of the 20th century has promoted the development of the FDTD as an electromagnetic modelling method.

1.3 High performance computing

The FDTD has been implemented for this thesis on computers that represent a cross section of the high performance computers in the world. Supercomputing hardware has been made available by the Centre for High Performance Computing (CHPC). These platforms include IBM’s Bluegene/P, the SUN M9000, and several types of large cluster computers based on multicore processors. Intel has made available a group of tightly coupled multicore processors as an online facility known as the Manycore Test Lab (MTL). An assortment of other multicore processors and GPUs has been purchased specifically for the deployment of the parallel FDTD.

1.4 Research approach

The research approach focused on converting the serial form of the FDTD method into a parallel FDTD implementation for a specific platform. The intention of the research was to repeat this implementation exercise for a wide variety of different high performance hardware platforms and to compare the deployment effort for each of these. For each of the implementations the method of deployment is similar:

1) Perform a background search on previous implementations of the FDTD or similar algorithms for the chosen platform.
2) Convert the serial FDTD into parallel form by simulating the target architecture on the PC. For some architectures, such as the discrete GPU, this simulation is not necessary as the hardware is available throughout the deployment.
3) Implement the parallel FDTD on the chosen platform and verify correct operation by comparing the results with those from a serial version of the FDTD.
4) Benchmark the performance of the parallel implementation on a set of FDTD models.
5) Analyse the features of the deployment effort, such as coding time and usefulness of documentation.
6) Compare the parallel implementation against other implementations undertaken for this thesis and those from other publications.


1.5 Chapter overview

Chapter 1 Introductory chapter describing the content of the thesis. It introduces the FDTD as an algorithm, and describes some aspects of HPC.

Chapter 2 Literature Review. This chapter is a review of some of the current and historical literature about the parallelisation of the FDTD method and the high performance hardware that is available for the processing of the FDTD. It examines the emergence of the HPC platforms and the use of the FDTD. The history of the FDTD and the influence on its deployment with parallel programming is analysed using data from IEEE publications.

Chapter 3 The FDTD algorithm. Chapter 3 is an examination of the mathematical foundation of the FDTD method and its conversion into a programmable algorithm. It describes the creation of an electromagnetic modelling space using Yee’s formulation and the features used in the model, such as electromagnetic sources and reflectors. The formulation of the far field is described and the electric far field of an electric dipole is compared against that computed with FEKO’s MOM program. A performance profile of the serial FDTD is created as a precursor to the parallelisation of the FDTD. A tabular comparison is made of the coding effort required to produce various parallel FDTD programs.

Chapter 4 The architecture of HPC systems. The various types of HPC hardware used to parallelise the FDTD for this thesis are reviewed in some detail. Some of the components that make up an HPC system, such as the processors, memory and communication subsystems, are described and compared. A description is given of how each of these component groups affects the parallelisation of the FDTD. The HPC systems used to parallelise the FDTD for this thesis are described in the context of the aforementioned component groups and their use by the FDTD.

Chapter 5 Parallelism strategies for the FDTD. This chapter deals with the different methods used to parallelise the FDTD method. It examines the application of a variety of threading techniques and forms of data manipulation used for the parallelisation process. A description is given of how the parallelised FDTD is applied to a variety of HPC platforms with these threading and data manipulation techniques.

Chapter 6 A comparison of the parallel FDTD on various HPC systems. The parallelised FDTD implementations on various HPC platforms are compared so as to analyse some of the deployment features of these systems. A series of characteristics representing the parallel FDTD deployment effort is compared in tabular form.

Chapter 7 Conclusion and Future Work. This chapter provides a summary of the content of the thesis and the conclusions drawn from it. The contributions derived from the research are listed here.


Chapter 2

2 Literature review of the FDTD on HPC platforms

2.1 Overview
2.2 History of the FDTD in the context of computational electromagnetics
2.3 FDTD software development
2.4 Previous examples of FDTD deployments on a variety of HPC platforms
   2.4.1 NUMA (Non-uniform memory architectures)
   2.4.2 UMA (Uniform Memory Architecture)
   2.4.3 Acceleration devices
2.5 Summary

2.1 Overview

This literature review provides an overview of the history of the Finite Difference Time Domain (FDTD) method and its relationship to High Performance Computing (HPC). The review is a summary of some of the publications and technology concerning the deployment of the FDTD on high performance computers and is categorised according to the hardware architecture that the FDTD is deployed on. Most of the topics discussed in this literature review are treated in more detail in the chapters that follow. The discussion below introduces technologies that will be described in this thesis. The deployment of the FDTD on the Graphical Processing Unit (GPU), for example, requires substantial background technical detail but is in itself not the main focus of this thesis.

2.2 History of the FDTD in the context of computational electromagnetics

It is necessary to review existing implementations of the FDTD on HPC platforms in order to determine and understand the factors used in the comparison of these deployments.

The FDTD method is used in a wide variety of applications. It is used in electromagnetic modelling of antennas [38], medical diagnosis of cancers [61], geophysical earth models [62], seismics [28], and military simulations of tanks engaging each other on the battlefield [29]. The FDTD method for electromagnetics has resulted from the culmination of several key formulations:

1) Maxwell’s equations [7]. These are a set of differential equations that describe the behaviour of electromagnetic fields: how a time varying electric field generates a magnetic field, and vice versa.

2) Determination of a suitable algorithmic framework by Yee [1]. The electric field vector and magnetic field vector components of the electromagnetic field are solved as finite difference equations in a leap-frog manner: the electric field vectors are computed at one instant in time and the magnetic field vectors half a time step later, and this process is repeated until the simulation is complete. This formulation was devised by Yee in 1966. (A minimal code sketch of this leap-frog update is given after this list.)

3) Computational power provided by HPC platforms. The FDTD computation requires large resources in terms of memory and processing capacity, and it only becomes practical when implemented on HPC platforms.

4) Sophisticated Graphical User Interfaces (GUIs) and output graphics for the manipulation of large and complex data models. To avoid hard coding input modelling geometries, or the use of complex configuration languages, all commercial electromagnetic modelling software has a sophisticated graphical data entry mechanism. The volume of data generated by the FDTD may be very difficult to analyse in raw numerical form, and it is more practical and intuitive to display the results in some form of graphical representation.
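As a minimal illustration of the leap-frog update described in point 2 above, the following C fragment sketches a one-dimensional FDTD loop in normalised units, with a hypothetical hard Gaussian source and no absorbing boundaries. It is an illustrative sketch only and is not taken from the thesis implementations.

    #include <math.h>
    #include <stdio.h>

    #define NX 200      /* number of spatial cells   */
    #define NSTEPS 500  /* number of time iterations */

    int main(void)
    {
        /* 1D Yee grid: ez[i] and hy[i] are offset by half a cell. */
        double ez[NX] = {0.0};
        double hy[NX] = {0.0};

        for (int n = 0; n < NSTEPS; n++) {
            /* Update H from the spatial difference of E (first half of the leap-frog). */
            for (int i = 0; i < NX - 1; i++)
                hy[i] += 0.5 * (ez[i + 1] - ez[i]);   /* 0.5 = Courant number c*dt/dx */

            /* Update E from the spatial difference of H (second half, half a time step later). */
            for (int i = 1; i < NX; i++)
                ez[i] += 0.5 * (hy[i] - hy[i - 1]);

            /* Hypothetical hard source: a Gaussian pulse injected at the grid centre. */
            ez[NX / 2] += exp(-(n - 30.0) * (n - 30.0) / 100.0);
        }

        printf("ez[%d] after %d steps: %f\n", NX / 2, NSTEPS, ez[NX / 2]);
        return 0;
    }

In a full three-dimensional implementation the same pattern is applied to all six field components on the Yee lattice, and it is these nested update loops that the parallelisation techniques discussed in later chapters target.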

Owing to the large demand made on computing power and capacity by the FDTD process, the FDTD and High Performance Computing are very closely related, as is shown in figure 2.1. As will be illustrated by some of the results in later chapters, the FDTD can consume extremely large computing resources. Although the FDTD method for electromagnetics was formulated in the 1960s, it has required the processing capability and capacity of High Performance Computing to make this simulation technique a viable proposition.

The number of annual Institute of Electrical and Electronics Engineers (IEEE) publications shown in figure 2.1 below gives an indication of the relationship between the FDTD, HPC and the keywords “parallel program”. The graphic shows the steady increase in FDTD and parallel programming publications from the IEEE per year. The depression in “parallel program” publications between 1998 and 2003 could be a result of the global economic downturn in that period. The general upward trend of these curves over time indicates that there may be some positive relationship between these topics, i.e. it is very likely that the development of HPC and parallelisation techniques stimulated the development of the FDTD method.


Figure 2.1: The number of IEEE publications found with specific keywords.

The FDTD has been implemented on many different computing platforms [4, 5, 6] and, owing to its demand for computing capacity, it is a suitable algorithm for a comparison that examines the deployment of this type of software on High Performance Computing platforms. The FDTD has been implemented on diverse multicore processing systems such as the Bluegene/P and the GPU [2, 3, 4]. The parallel FDTD has been described as an “explicitly parallel” implementation and also as “embarrassingly parallel”. An embarrassingly parallel implementation is one where the concurrent processing of the algorithm requires no communication between compute nodes and can be scaled nearly linearly. An explicitly parallel implementation is one where sections of the algorithm have been devised to perform independently in parallel, yet require constant communication and process synchronisation between all instances of the deployment. Embarrassingly parallel implementations, such as those algorithms using the Monte Carlo technique, are used by hardware manufacturers to benchmark the peak performance of their products. One cannot equate the performance achieved by an embarrassingly parallel process with that of an explicitly parallel process on the same computing platform, i.e. an embarrassingly parallel implementation has the potential to achieve the peak performance stated for a specific piece of hardware, whereas an explicitly parallel one does not. The implication of this is that the performance of the FDTD on HPC systems cannot be extrapolated from benchmarking software such as the LINPACK standard alone, as the communication between the processing elements of the FDTD plays a major part in the computation.

2.3 FDTD software development

The implementation of the FDTD on any high performance hardware platform requires three essential functional components to ensure the practical use of the FDTD. These are:


1) A graphical input facility that allows the creation of sophisticated models to be processed or simulated. For very large models it is not practical to set up the model data manually; this has to be performed within the controlled and ordered environment of a computer aided design environment. A detailed account of a sophisticated GUI for the FDTD has been produced by Yu and Mittra [135]. It describes the creation of an input facility for the physical model, mesh creation, and other electromagnetic parameters. An intermediate store of the FDTD model is made in the dxf standard, a popular graphic data format used by Computer Aided Design products such as AutoCad. This intermediate FDTD model is then used as input to an FDTD solver mechanism, as discussed in the next point. It is also of interest that a dxf formatted input model can be used as the input for other electromagnetic solvers such as the Finite Element Method (FEM) or the MOM [136, 137].

2) The FDTD processing engine, which performs the simulation itself. The processing engine needs to be coupled with a high performance data storage mechanism that records the output of the simulation. High performance data storage mechanisms such as Hadoop [138] or the Message Passing Interface (MPI) are available to record the output from the electromagnetic solver.

3) A data display system that allows the rendering of the simulated data. The importance of this simple requirement cannot be overstated, as the result of a large simulation in plain numerical form is not easy to visualise without substantial post processing. The GPU is a hardware processing board dedicated to the display of high resolution graphical information. Primarily deployed for computer gaming, it can also be used to display the results of electromagnetic computations. These results are normally displayed as part of the post processing of the electromagnetic modelling process. One publication that describes the creation of a single program for both the processing of the FDTD and the graphical display of the resultant information is [91]. Also of interest in this publication [91] is the use of the GPU graphics as a low level program debugging tool, an integrated feature that is usually only found in high level languages such as Matlab or Python.

The conversion of the serial FDTD into a parallel form is dependent on the architecture of the HPC hardware. Some authors [9] have categorised the processing of the parallel FDTD according to the number of data and instruction streams, i.e. on Single Instruction Multiple Data (SIMD), Single Program Multiple Data (SPMD) and Multiple Instruction Multiple Data (MIMD) architectures. The implementation of the FDTD on some of these architectures can offer valuable implementation lessons for other hardware platforms. Data manipulation techniques used in the SIMD processing of the FDTD with SSE [132], for example, can be applied to the Single Instruction Multiple Thread (SIMT) processing of the FDTD on the GPU, in that both use a data parallel approach.

The capabilities of the SIMD, SPMD and MIMD processing schemes are not restricted to the FDTD processing method; they will also apply to other modelling techniques based on principles such as the MOM or the FEM. Products such as FEKO and GEMS are commercial examples of where the model creation, processing engine and result visualisation have been bundled into one integrated electromagnetic modelling system.

Higher level software languages such as Matlab, Octave or Python provide convenient higher level functionality to implement the FDTD processing engine, graphical input method and visualisation mechanism within one computational system. These software products have an advantage over conventional lower level language tools such as C or Fortran in that all of the functionality to experiment and produce results with an FDTD engine is available within one user environment. A disadvantage of such high level languages is that, although high performance processing tools are made available to the user, they do not provide ready access to low level interfaces such as CUDA or DirectX11. As with all computing development systems, the convenience of platform abstraction offered by scripting environments such as Matlab is offset by the lack of programming flexibility for dedicated high performance platforms such as GPU or Field Programmable Gate Array (FPGA) accelerators.

The focus of this work lies in the comparison of the FDTD engine as deployed on a variety of High performance processing platforms. To this end the graphical input and rendering components have been hard coded independently for most deployments and are not considered to be part of this evaluation.

Special mention must be made of the development of the first parallel FDTD implementation by Taflove et al [130] on a CM1 computer (which initially only had a LISP compiler). When a Fortran compiler was made available for this platform, it took approximately a year to develop the first parallel FDTD algorithm, six months of which were taken up waiting for bug fixes in the Fortran compiler. In the twenty years since these first efforts at the parallelisation of the FDTD, the high performance hardware, software techniques and FDTD knowledge base have matured to the extent that contemporary attempts at parallelisation of the FDTD take only a fraction of this time.

2.4 Previous examples of FDTD deployments on a variety of high performance computing platforms

The FDTD has been implemented in parallel on several types of HPC architecture. These can be classified as:

1) NUMA (Non Uniform Memory Architecture) or distributed architectures.
2) UMA (Uniform Memory Architecture).
3) Use of accelerators or co-processors.

2.4.1 NUMA (Non-uniform memory architectures)

Examples: Bluegene/P, Nehalem cluster, IBM e1350 cluster.

NUMA architectures include systems such as Beowulf clusters and other systems of distributed computers. These systems are generally composed of a distributed “cluster” of Central Processing Units (CPUs) which are afforded inter-process communication by local area network connections [42]. A simplistic view of the NUMA architecture is shown in figure 2.2, which illustrates the distributed relationship of memory and processors.

The Non-Uniform Memory Architecture can also be referred to as a distributed computing platform, encompassing a loosely networked system of individual computers ranging from systems such as the Simple Network of Workstations [22, 49] to the formidable arrangement of individual processors and network interconnect comprising the Bluegene/P supercomputer [17, 19].

Figure 2.2: Simplistic view of the NUMA architecture

A benchmark implementation of the FDTD with the Message Passing Interface (MPI) [9] serves well as a basic implementation framework for the NUMA architecture. Although published as early as 2001, the task parallel implementation of the FDTD using MPI by Guiffaut and Mahdjoubi [9] can still be implemented on most multicore clusters with very little change to the parallelisation method.

As will also become apparent from subsequent chapters, the parallel FDTD for distributed systems requires regular communication of data between the parallel FDTD components or chunks [9]. This communication enforces processing synchronisation of the parallel FDTD components between each iteration cycle.
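As a hedged illustration of this per-iteration communication, the C fragment below sketches how one FDTD chunk might exchange its boundary (halo) planes with its left and right neighbours using MPI before each update. The function name, buffer layout and plane size are assumptions made for the example; this is not the implementation used in this thesis.

    #include <mpi.h>
    #include <stddef.h>

    /* Hypothetical halo exchange for a 1D domain decomposition of the FDTD grid.
     * 'ez' holds nplanes interior y-z planes plus one ghost plane at each end;
     * 'plane' is the number of doubles in one plane. Ranks at the edge of the
     * domain pass MPI_PROC_NULL as the missing neighbour. Calling this once per
     * time step is what enforces the iteration-by-iteration synchronisation. */
    static void exchange_halo(double *ez, int plane, int nplanes, int left, int right)
    {
        MPI_Status status;

        /* Send the rightmost interior plane to the right neighbour and
         * receive the left neighbour's plane into the left ghost plane. */
        MPI_Sendrecv(ez + (size_t)nplanes * plane, plane, MPI_DOUBLE, right, 0,
                     ez,                           plane, MPI_DOUBLE, left,  0,
                     MPI_COMM_WORLD, &status);

        /* Send the leftmost interior plane to the left neighbour and
         * receive the right neighbour's plane into the right ghost plane. */
        MPI_Sendrecv(ez + plane,                         plane, MPI_DOUBLE, left,  1,
                     ez + (size_t)(nplanes + 1) * plane, plane, MPI_DOUBLE, right, 1,
                     MPI_COMM_WORLD, &status);
    }

Because every chunk must complete this exchange before it can start the next update, the communication acts as an implicit barrier at each iteration, which is the synchronisation referred to above.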

Cluster computing platforms such as the “Beowulf cluster” originated from a need to provide a low cost HPC platform for methods such as the parallel FDTD. The first Beowulf clusters consisted of several off-the-shelf servers connected together via a high speed Ethernet connection [5] and could provide performance equivalent to a shared memory computer costing 10 to 100 times as much [93]. It was found that this cluster arrangement worked optimally for parallel FDTD implementations of up to 30 servers [19]. The limiting effect of inter-process communication [119] on the speedup of the FDTD was identified and formulated in 1994, using a cluster of workstations. A typical processing speed of a 3D FDTD model on 8 workstations in 1994 was approximately 0.12 million cells per second (MCPs). It is noteworthy that although the performance of the FDTD on the cluster architecture has improved radically since 1994, the programming paradigm, and hence the deployment effort of the FDTD on a cluster using the Parallel Virtual Machine (PVM) or MPI, has not. From the perspective of the investment required to build computing algorithms, this is highly desirable.

Larger implementations required better inter-processor communication, and this was only possible with the introduction of network interconnects such as Infiniband [75]. The introduction of ten gigabit Ethernet not only improved the bandwidth between processing cores compared to the one gigabit standard, it also reduced the average communication latency by a factor of ten over one gigabit Ethernet.


Whereas the regular gigabit Ethernet network connection operates via a regular OSI network stack, Infiniband has a dedicated communication protocol with a reduced protocol stack and therefore provides much lower inter-processor communication latency.

An implementation of the parallel FDTD on a cluster system using the MPI libraries [50] has shown very close agreement between the results obtained from the serial form of the FDTD and the parallel FDTD. This comparison is of note as the parallel FDTD implemented for this work uses a similar deployment strategy.

The NUMA approach has been implemented on 7000 cores of the Tianhe-1A supercomputer, a computing platform that consumes 4 MW of power and has a staggering 229,376 GB of storage [131]. The FDTD with MPI on the Tianhe-1A has achieved an efficiency of 80% on these 7000 cores [131], although a rating of the FDTD computing speed in millions of cells per second is not available. In terms of scaling it is also interesting to note that the FDTD was only implemented on a fraction (7000 of the 112,000) of the Xeon cores available. In addition, the Tianhe-1A has 7000 GPUs, none of which were deployed for the FDTD simulation.

A good entry point to deploying an application on the Bluegene/P supercomputer is provided by Morozov [83], despite this publication not being specifically directed at the FDTD. It makes the programmer aware of all the features available to an FDTD application and gives insight into FDTD specific concerns such as the memory latency between nodes.

The FDTD has also been implemented on GPU clusters [28, 129], a configuration of several GPUs controlled by a host processor. The nature of this FDTD deployment differs from the task parallel FDTD parallelisation on multicore clusters in that the parallelisation is done on a data parallel basis.

2.4.2 UMA (Uniform Memory Architecture)

Examples: SUN M9000, Intel Nehalem with Quick Path Interconnect (QPI).

Commercial high performance shared memory systems from SUN Microsystems have been around since the early 1980s and have been extensively used in industries such as seismic data processing [28, 29]. As all the processes in a UMA device use the same program address space, the inter-process communication required by the FDTD can be reduced to the movement of data pointers in memory [10, 12]. The schematic in figure 2.3 illustrates the relationship the processors have with the memory in a UMA, or shared memory, system. This greatly reduces the inter-process communication latency and allows many processing cores to be included in the one address space. Although this architecture allows the component processors to communicate very effectively via the memory, it is unfortunately not possible to continuously add compute cores, owing to physical restrictions such as pin availability on memory chips. To this end Shared Memory Processors are tied together by a functionally transparent memory fabric called the memory interconnect.


Figure 2.3: Simplistic view of the Uniform Memory (or Shared Memory) Architecture

In order to facilitate the rapid development of parallel applications on shared memory platforms, the openMP application programming interface (API) was devised by a collection of hardware and software vendors. The openMP API works on the basis of compiler directives that allow the work performed by programming loops to be shared (or spawned) among several programming threads [140].
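As a brief, hypothetical illustration of such a compiler directive (a sketch rather than the thesis code), the C fragment below shares the iterations of an FDTD-style electric field update loop among the threads of an openMP team with a single pragma; the array and parameter names are assumptions.

    #include <omp.h>

    /* Hypothetical 1D update of the electric field: 'ez' and 'hy' are field
     * arrays and 'c' a precomputed update coefficient. The pragma asks the
     * compiler to divide the loop iterations among the available threads. */
    void update_ez(double *ez, const double *hy, double c, int nx)
    {
        #pragma omp parallel for
        for (int i = 1; i < nx; i++)
            ez[i] += c * (hy[i] - hy[i - 1]);
    }

Because consecutive iterations write to disjoint elements of ez, the loop can be shared safely among threads; the corresponding magnetic field loop is parallelised in the same way.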

The MPI based parallel FDTD implementation, created for the NUMA environment [9], performs equally well on Uniform Memory Architecture (UMA) platforms. MPI implementations also provide a special operating mode in which inter-process communication happens only in memory, known as Nemesis mode [10].

The FDTD has been deployed on UMAs by several institutions using the openMP threading library, the PVM [48, 119] and MPI. This indicates that the parallel implementation on the UMA platform is a mature technology, as is illustrated by the speedup of the FDTD on a Shared Memory Processor (SMP) platform dating back to the year 2000. An example of FDTD performance [41] on an earlier SMP platform using eight Silicon Graphics processors achieved a peak FDTD throughput of approximately 2.6 million cells per second.

The inter-processor communication latency of clustered NUMA systems has been reduced to values that allow a distributed system architecture to behave as if it were a UMA. The Quick Path Interconnect (QPI) technology used by Intel allows a collection of discrete processing nodes, each a physically separate integrated circuit, to be accessed as one unified address space. Although each multicore processor is attached to its own physical memory via its input-output handler, an interconnect fabric between the processing cores and the memory is provided by several QPI links.

Although highly successful, earlier implementations of the FDTD on Distributed Shared Memory Processing platforms [41] were limited in the performance they could achieve by the inter-process communication latency and memory access bottlenecking. The latency problem can be greatly reduced with the use of multicore technology using the QPI fabric as an interconnect, as will be shown later in this work.

2.4.3 Acceleration devices

Examples: GPU, GPGPU, FPGA (custom hardware), Intel Xeon Phi.

Two of the main reasons why the parallel FDTD has not been implemented more frequently on supercomputers or other HPC platforms are that these devices were originally very expensive and largely inaccessible to the average academic. Accelerator devices include Intel’s 8087 maths co-processor, one of the earlier examples of an acceleration device used in conjunction with algorithms running on an associated processing system, in this case Intel’s 8086 line of processors. The functionality performed by the 8087 co-processor was eventually migrated onto the main processing circuit, and this process appears to be repeating itself in that the functionality provided by accelerator cards such as the GPU is being migrated onto the die of the microprocessor IC. The introduction of the APU, or Fusion (an AMD trademark) technology, bears witness to this. Figure 2.4 shows the relationship between the host processor and the accelerator. Note that the accelerator may have its own dedicated memory.

Figure 2.4: Accelerator (example: GPU or secondary processor) attached to host processor.

GPUs are breaking through this cost and availability barrier by providing off-the-shelf High Performance Computing solutions for less than the price of a conventional computer workstation. Similar experiments were conducted using small clusters of Transputers [22, 23] which, although as cheap as GPUs are today, were limited in the memory available per Transputer and were prone to long communication latency between the host and the Transputer array.

One of the first mentions in the academic literature of a GPU-like device used for the processing of the FDTD [42] is the Associative String Processor (ASP), a SIMD device used for graphics programming incorporated in the early DEC VAX of 1993. The ASP was configured as a co-processor to a host machine and was therefore used as an accelerator board in the same way as an off-the-shelf GPU. Even the description of the SIMD processing performed on the ASP is very similar to what is performed on the GPU [42].

There are two main approaches to implementing the FDTD for the GPU. Although both use data parallelism as the basis for the parallelisation, they are entirely different programming paradigms [116, 25]. The original FDTD implementations for the GPU involved coding directly against the graphics hardware, also referred to as the graphical pipeline model [91].


GPU programming is a different paradigm from that used for the conventional CPU architecture. Programming the GPU in this manner [121] allows direct interaction with the graphical pipeline functionality provided by the GPU.

The alternative method has been termed GPGPU by the computing industry as it uses a more conventional computing model incorporating compute cores, registers, and shared memory (cache) to parallelise the data. The GPGPU approach to parallelising the FDTD has been used for this work.

Since the start of this work the volume of information available for FDTD implementations on GPUs has increased at an enormous rate. The term GPGPU has emerged to describe GPU applications that are directed at processing numerical algorithms more efficiently and are not necessarily concerned with the graphical output used in the visualisation of these algorithms, as the use of the term GPU would suggest. The website www.gpgpu.org has been established as a central resource for GPU related information concerning the programming of numerical algorithms.

The deployment of the FDTD algorithm on the GPU has been described in several publications [116, 117, 27, 44] and involves the decomposition of the multidimensional FDTD grid into a format suitable for processing as a Single Instruction Multiple Thread (SIMT) based algorithm on the GPU. Many FDTD deployments for the GPU are implemented on an Nvidia platform using the CUDA interface, and this has special significance for the objectives of this thesis. The question needs to be asked as to why these FDTD implementations have been built predominantly with CUDA as opposed to some other interface. Intel Corporation is currently the world’s leading GPU supplier and AMD also produces high performance GPUs, yet implementations of the FDTD on these platforms are scarce in the academic literature. How can the lack of academic literature on FDTD deployment on an Intel or AMD GPU be meaningful in the comparison of the FDTD deployment on other HPC platforms? The answer to this is directly related to the availability of CUDA GPU implementation examples, CUDA development forums and easy to understand documentation, something not available for the other GPU development platforms at the onset of the GPGPU development revolution.
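As a rough sketch of the grid decomposition mentioned above (an assumption for illustration, not the CUDA kernels used in this work), each GPU thread is typically given a single linear index which it converts back into the (i, j, k) position of the cell it must update. The plain C fragment below shows that index arithmetic for an nx by ny by nz grid stored as one contiguous array.

    /* Hypothetical mapping between a flat index and the (i, j, k) coordinates
     * of an FDTD cell. On the GPU each thread derives 'id' from its block and
     * thread indices and then performs this conversion before updating a cell. */
    static void flat_to_ijk(long id, int nx, int ny, int *i, int *j, int *k)
    {
        *i = (int)(id % nx);
        *j = (int)((id / nx) % ny);
        *k = (int)(id / ((long)nx * ny));
    }

    static long ijk_to_flat(int i, int j, int k, int nx, int ny)
    {
        return ((long)k * ny + j) * nx + i;
    }

The same flattening places neighbouring cells in the x direction in adjacent memory locations, which is what makes coalesced memory access possible on the GPU.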

Although descriptions of GPU based FDTD implementations have been presented in various publications, the publication [27] is of particular note as it addresses the implementation of various PMLs on the GPU. The implementation of the boundary conditions becomes particularly complex when one mixes arrays, and this publication illustrates a solution to the complex coding by using boundary calculations that are independent of the calculations of the main grid.

The Field Programmable Gate Array (FPGA), an integrated circuit designed to be configured by the programmer, has also been considered as an accelerator for the FDTD method [114, 57, 58, 59]. The FPGA is a hardware device that allows the software configuration of hardware directed at performing mathematical algorithms. FPGAs are configured as accelerator boards attached to the host computer, usually via a PCI bus. Similar to the manner in which the GPU is used as an accelerator, the FPGA implemented FDTD process experiences data communication bottleneck problems [122] when communicating with the host while performing FDTD operations distributed across the host and the accelerator. These devices have a large development cost and are also relatively complex to program. Although speed-ups of up to 22 and 97 times the speed of a conventional PC have been recorded [57], these are all for very small FDTD datasets in the order of several thousand grid points [59].

Contemporary processors, such as Intel’s Haswell or AMD’s Piledriver, have been customised for high performance numerical processing with the inclusion of dedicated hardware facilities such as the Fused Multiply Add (FMA) [141] and the Advanced Vector eXtensions (AVX) [99] on the processing die. In the case of AVX, FMA, or SSE, these facilities are programmed in a data parallel sense.
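As a hedged example of this data parallel style (a sketch, not the thesis implementation), the C fragment below updates eight single-precision field values per iteration using AVX registers and a fused multiply-add intrinsic. It assumes the arrays are long enough, that a scalar loop handles any remaining tail elements, and that the code is compiled with AVX and FMA support enabled.

    #include <immintrin.h>

    /* Hypothetical AVX/FMA update of the electric field:
     * ez[i] += c * (hy[i] - hy[i-1]), eight cells at a time. */
    void update_ez_avx(float *ez, const float *hy, float c, int nx)
    {
        const __m256 coef = _mm256_set1_ps(c);

        for (int i = 1; i + 8 <= nx; i += 8) {
            __m256 h1   = _mm256_loadu_ps(&hy[i]);      /* hy[i]   .. hy[i+7] */
            __m256 h0   = _mm256_loadu_ps(&hy[i - 1]);  /* hy[i-1] .. hy[i+6] */
            __m256 curl = _mm256_sub_ps(h1, h0);
            __m256 e    = _mm256_loadu_ps(&ez[i]);

            /* Fused multiply-add: e + coef * curl in a single instruction. */
            e = _mm256_fmadd_ps(coef, curl, e);
            _mm256_storeu_ps(&ez[i], e);
        }
    }

In practice a scalar remainder loop and careful data alignment would accompany such a kernel.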

2.5 Summary

A review of the deployment of the FDTD method on high performance computers is not a review of the FDTD technology alone. The improvement in the cost-performance of HPC since the start of the 21st century allows the FDTD to be considered as a solver for large electromagnetic problems. FDTD solvers that used to run for several days a decade ago now run in just a fraction of that time. FDTD problem sizes that used to be considered phenomenally large, such as model sizes of billions of cells, are now considered mainstream. Compared to frequency domain solvers such as the FEM and the MOM, the FDTD has been considered a “brute force” approach [116] to modelling electromagnetic scenarios, and although this still holds true, the low cost/performance aspect of HPC allows the FDTD to be deployed with less consideration for computing cost and time.

In this chapter, the parallel FDTD method for electromagnetics has been reviewed according to the basic architecture that it has been deployed on. Although these were well defined architectures ten years ago, the advent of new processing platforms such as the multicore processor, the GPU, and customised processor functionality (such as the SSE) allows the architectures described in this chapter to be combined into hybrid processing platforms, such as the Accelerated Processing Unit (APU) and the cluster-designated blade processors. Although it may be convenient to express the different parallel deployment styles as task parallel or data parallel, contemporary parallel FDTD implementations are a mix of both data and task parallelism.


Chapter 3

3 The FDTD algorithm

3.1 Overview
3.2 Maxwell's equations
3.3 The FDTD modelling scenarios
3.4 Cell sizing, algorithmic stability
3.5 Boundary conditions
3.6 EM sources and reflectors
3.7 Validation of the FDTD code
3.8 Computing languages and APIs
3.9 Performance profiling the serial FDTD
3.10 Estimate of the parallel FDTD coding effort
3.11 Summary

3.1 Overview

There are many different techniques and methods to calculate or model the electromagnetic (EM) field. For the full-wave modelling of electromagnetic problems, there are three primary techniques, all of which use Maxwell’s equations to determine the behaviour of the electromagnetic field [7, 108]. These are:

1) The FDTD method, based on a finite differences approximation of Maxwell's equations. The computation is performed on a regular lattice of electric grid values offset by one half a grid spacing from the magnetic grid values. The FDTD is a time domain based CEM solver.

2) The Finite Element Method (FEM) is a computational method used to find approximate solutions to differential equations. Unlike the FDTD, which requires a rigorous rectangular computational cell structure, the FEM can accommodate a variety of differently shaped gridding base structures.

3) The MOM (or Boundary Element Method) is a system for solving Maxwell's equations in integral form, usually as a boundary integral [2]. It requires the calculation of boundary values only and does not need to compute the entire modelling space as the FDTD does. The MOM usually operates in the frequency domain.

In the context of an EM simulation, these methods are part of a solver or computational stage of the simulation, as was identified in the foundation days of the FDTD method [90]; that is, the electromagnetic modelling process requires that the following three actions be performed:

1) The input to the simulation requires the establishment of a modelling scenario indicating the dimensions of the object to be modelled, and the configuration of details such as conductivity, susceptibility, establishment of boundary conditions, and electromagnetic sources.

2) The computation of the modelling scenario or solution, also known as the solver or computational engine. A major thrust of this work is the comparison of the deployment effort using FDTD as the model solver on a variety of High Performance platforms.


3) Visualisation of the output from the solver. This is a vitally important aspect in the functional sense, both because the massive numerical output from the solver must be visualised and because the visualisation itself can generate a large amount of computing load.

3.2 Maxwell’s equations

The two Maxwell curl equations describe the physical behaviour of electromagnetic waves passing through free space and their interaction with physical matter [7]. Although the principal equations used in this work were formulated by Maxwell, it should be noted that other contributors of that period, such as Faraday, Ampere, Gauss, Coulomb, and Lorentz [145], established the foundations for the formulation of Maxwell's equations.

Maxwell’s fundamental curl equations [3] describing the behaviour of electric and magnetic waves in an isotropic, nondispersive, and lossy material are:

\[
\frac{\partial \mathbf{H}}{\partial t} = -\frac{1}{\mu}\,\nabla \times \mathbf{E} - \frac{1}{\mu}\left(\mathbf{M}_{\mathrm{source}} + \sigma^{*}\mathbf{H}\right)
\tag{3.1}
\]

\[
\frac{\partial \mathbf{E}}{\partial t} = \frac{1}{\varepsilon}\,\nabla \times \mathbf{H} - \frac{1}{\varepsilon}\left(\mathbf{J}_{\mathrm{source}} + \sigma\mathbf{E}\right)
\tag{3.2}
\]

where:

\(\mathbf{E}\) and \(\mathbf{H}\) are the electric and magnetic field vectors

\(\mathbf{J}_{\mathrm{source}}\) and \(\mathbf{M}_{\mathrm{source}}\) are independent sources of electric and magnetic energy

σ is the electrical conductivity

σ* is the equivalent magnetic loss

µ is the magnetic permeability

ε is the electrical permittivity

t is time

For the FDTD implementations undertaken in this thesis the coefficients σ, σ*, µ, and ε, are assumed to have constant values. Lossy dielectrics may be described over a band of frequencies [156, 157] and the representation of these coefficients by a complex term may contribute additional computational load.

To devise a solution for these equations using the FDTD method, a model structure is discretised over a three dimensional computational volume [1, 47]. The continuous partial differential equations 3.1 and 3.2 are approximated using centred finite differences for both the time and space derivatives [3]. A spatial convention for the formulation of these approximations is made according to the schematic cell structure shown in figure 3.1. The entire computational volume is comprised of these cells.


Figure 3.1: Magnetic and electric cell conventions in the Yee cell [Reproduced from Wikipedia].

By convention, the electric and magnetic field components are offset by one half of a cell spacing, as shown in the two dimensional arrangement of electric and magnetic vectors of the TE field in figure 3.2 below. The electric field values are in the plane of the page and the magnetic field vectors point into the page:

Figure 3.2: The Half-cell offset of magnetic and electric cells in the Yee lattice.

Using the coordinate convention as shown in the figures 3.1 and 3.2, the electric and magnetic field equations 3.1 and 3.2 can be expressed as a system of three dimensional finite difference equations [3] as:


\[
H_x\big|^{\,n+1/2}_{i,\,j+1/2,\,k+1/2} = D_a\,H_x\big|^{\,n-1/2}_{i,\,j+1/2,\,k+1/2}
+ D_b\!\left[\frac{E_y\big|^{\,n}_{i,\,j+1/2,\,k+1}-E_y\big|^{\,n}_{i,\,j+1/2,\,k}}{\Delta z}
-\frac{E_z\big|^{\,n}_{i,\,j+1,\,k+1/2}-E_z\big|^{\,n}_{i,\,j,\,k+1/2}}{\Delta y}\right] \tag{3.3a}
\]

\[
H_y\big|^{\,n+1/2}_{i+1/2,\,j,\,k+1/2} = D_a\,H_y\big|^{\,n-1/2}_{i+1/2,\,j,\,k+1/2}
+ D_b\!\left[\frac{E_z\big|^{\,n}_{i+1,\,j,\,k+1/2}-E_z\big|^{\,n}_{i,\,j,\,k+1/2}}{\Delta x}
-\frac{E_x\big|^{\,n}_{i+1/2,\,j,\,k+1}-E_x\big|^{\,n}_{i+1/2,\,j,\,k}}{\Delta z}\right] \tag{3.3b}
\]

\[
H_z\big|^{\,n+1/2}_{i+1/2,\,j+1/2,\,k} = D_a\,H_z\big|^{\,n-1/2}_{i+1/2,\,j+1/2,\,k}
+ D_b\!\left[\frac{E_x\big|^{\,n}_{i+1/2,\,j+1,\,k}-E_x\big|^{\,n}_{i+1/2,\,j,\,k}}{\Delta y}
-\frac{E_y\big|^{\,n}_{i+1,\,j+1/2,\,k}-E_y\big|^{\,n}_{i,\,j+1/2,\,k}}{\Delta x}\right] \tag{3.3c}
\]

\[
E_x\big|^{\,n+1}_{i+1/2,\,j,\,k} = C_a\,E_x\big|^{\,n}_{i+1/2,\,j,\,k}
+ C_b\!\left[\frac{H_z\big|^{\,n+1/2}_{i+1/2,\,j+1/2,\,k}-H_z\big|^{\,n+1/2}_{i+1/2,\,j-1/2,\,k}}{\Delta y}
-\frac{H_y\big|^{\,n+1/2}_{i+1/2,\,j,\,k+1/2}-H_y\big|^{\,n+1/2}_{i+1/2,\,j,\,k-1/2}}{\Delta z}\right] \tag{3.3d}
\]

\[
E_y\big|^{\,n+1}_{i,\,j+1/2,\,k} = C_a\,E_y\big|^{\,n}_{i,\,j+1/2,\,k}
+ C_b\!\left[\frac{H_x\big|^{\,n+1/2}_{i,\,j+1/2,\,k+1/2}-H_x\big|^{\,n+1/2}_{i,\,j+1/2,\,k-1/2}}{\Delta z}
-\frac{H_z\big|^{\,n+1/2}_{i+1/2,\,j+1/2,\,k}-H_z\big|^{\,n+1/2}_{i-1/2,\,j+1/2,\,k}}{\Delta x}\right] \tag{3.3e}
\]

\[
E_z\big|^{\,n+1}_{i,\,j,\,k+1/2} = C_a\,E_z\big|^{\,n}_{i,\,j,\,k+1/2}
+ C_b\!\left[\frac{H_y\big|^{\,n+1/2}_{i+1/2,\,j,\,k+1/2}-H_y\big|^{\,n+1/2}_{i-1/2,\,j,\,k+1/2}}{\Delta x}
-\frac{H_x\big|^{\,n+1/2}_{i,\,j+1/2,\,k+1/2}-H_x\big|^{\,n+1/2}_{i,\,j-1/2,\,k+1/2}}{\Delta y}\right] \tag{3.3f}
\]

where the coefficients C and D are:

\[
C_a = \frac{1-\dfrac{\sigma\,\Delta t}{2\varepsilon}}{1+\dfrac{\sigma\,\Delta t}{2\varepsilon}} \tag{3.4a}
\]

\[
C_b = \frac{\dfrac{\Delta t}{\varepsilon}}{1+\dfrac{\sigma\,\Delta t}{2\varepsilon}} \tag{3.4b}
\]

\[
D_a = \frac{1-\dfrac{\sigma^{*}\Delta t}{2\mu}}{1+\dfrac{\sigma^{*}\Delta t}{2\mu}} \tag{3.4c}
\]

\[
D_b = \frac{\dfrac{\Delta t}{\mu}}{1+\dfrac{\sigma^{*}\Delta t}{2\mu}} \tag{3.4d}
\]


The formulations comprising equation 3.3 can then be easily encoded into higher level computing languages such as C, Python, or Fortran.
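As an illustration of how directly equation 3.3 maps onto code, the sketch below shows one of the six updates (the Ex update of equation 3.3d) written in C. The array names, the flattened index macro, and the loop bounds are illustrative assumptions rather than the thesis code, and the material coefficients Ca and Cb are assumed to be precomputed per cell.

/* Minimal sketch of the Ex update of equation 3.3d in C.
   Arrays are stored flat; IDX maps (i,j,k) to a linear index. */
#define IDX(i, j, k) ((i) * NY * NZ + (j) * NZ + (k))

void update_ex(float *ex, const float *hy, const float *hz,
               const float *ca, const float *cb,
               int NX, int NY, int NZ, float dy, float dz)
{
    for (int i = 0; i < NX; i++)
        for (int j = 1; j < NY; j++)
            for (int k = 1; k < NZ; k++) {
                int n = IDX(i, j, k);
                /* centred difference approximation of the curl of H */
                float curl_h = (hz[n] - hz[IDX(i, j - 1, k)]) / dy
                             - (hy[n] - hy[IDX(i, j, k - 1)]) / dz;
                /* Ex = Ca*Ex + Cb*(curl of H), as in equation 3.3d */
                ex[n] = ca[n] * ex[n] + cb[n] * curl_h;
            }
}

The other five updates of equation 3.3 follow the same pattern, differing only in which field components and cell offsets appear in the curl term.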

3.3 The FDTD modelling scenarios

The serial FDTD has been converted into parallel form on several different high performance platforms for this work. The following two electromagnetic scenarios were used to compare the deployment effort between the different platforms. The same scenarios were used on every platform, not only to keep the amount of coding effort used in converting from one platform to another approximately the same, but also to validate the correctness of the serial and parallel FDTD implementations.

3.3.1 Scenario 1

The two dimensional model shown in figure 3.3 below is of a rectangular area with a wire as a source, shown as a small circle on the left hand side, passing through the area perpendicular to the page. A conductive solid circular cylinder, also perpendicular to the page, has been modelled on the right hand side of the rectangular cavity. The rectangular shaped area is surrounded by an absorbing boundary, which is not shown. Various shaped current waveforms such as a Gaussian pulse, or a continuous sine wave, are transmitted by the source wire. The rectangular cross is a fictitious template indicating which data values are involved in the calculation of a single Yee cell, also referred to as a computational stencil. Each dot in the area shown in figure 3.3 is the centre of an FDTD cell.

Figure 3.3: A schematic representation of the model components used for the 2D FDTD modelling.

3.3.2 Scenario 2

For the three dimensional form of the FDTD, the 3D modelling space consisted of a cubic free space cavity with a point source or dipole transmitter at the centre of the cube. The cube is surrounded by an absorbing boundary area. The source is modelled using a continuous sine wave or Gaussian pulse.


3.4 Cell sizing, algorithmic stability

To maintain computational stability when using the FDTD method, the electromagnetic wave must not be advanced through the computational domain faster than the speed of light allows. If the EM wave is propagated in computational space faster than the speed of light, that is, if the computed spatial advance of the EM wave per time step exceeds the grid spacing, then the algorithm will become unstable. In programmatic terms this will result in nonsensical data being created, or a program crash. Some authors have proposed formulations permitting the Courant limit to be exceeded [105], although these come with restrictions such as being able to calculate solutions for only one frequency.

Instability is avoided by adhering to the Courant condition [2, 3], which can be expressed as equation 3.6, assuming a uniform cell size \( \Delta x = \Delta y = \Delta z = \Delta \):

\[
u\,\Delta t \le \frac{\Delta}{\sqrt{n}}
\tag{3.6}
\]

where n is the number of dimensions and u is the velocity of propagation of the EM wave. In practical terms it is best to operate just underneath the Courant limit in order to guarantee computational stability while limiting dispersion. Although operating at a small time increment may ensure stability, it may also lead to an increase in numerical dispersion [2]. Even though it may be possible to operate exactly on the Courant limit, a minute increase over the Courant limit will lead to numerical instability, and floating point truncation or rounding errors introduce a further risk of instability when operating exactly on the limit.

The Courant limit may also be calculated using equation 3.7 below, where c is the speed of light in free space and \( \Delta x \), \( \Delta y \) and \( \Delta z \) are the dimensions of the Yee cell used for the modelling.

\[
\Delta t \le \frac{1}{c\sqrt{\dfrac{1}{\Delta x^{2}}+\dfrac{1}{\Delta y^{2}}+\dfrac{1}{\Delta z^{2}}}}
\tag{3.7}
\]
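The sketch below applies equation 3.7 directly; the function name and the 0.99 safety factor (reflecting the advice above to stay just under the limit) are illustrative choices rather than values prescribed by the thesis.

/* Return a stable time step for the given Yee cell dimensions (eq. 3.7). */
#include <math.h>

double courant_dt(double dx, double dy, double dz)
{
    const double c0 = 299792458.0;               /* speed of light in free space, m/s */
    double limit = 1.0 / (c0 * sqrt(1.0 / (dx * dx)
                                  + 1.0 / (dy * dy)
                                  + 1.0 / (dz * dz)));
    return 0.99 * limit;                         /* operate just under the Courant limit */
}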

3.5 Boundary conditions

To simulate an infinite or extended computational space while still limiting the computational domain, boundary conditions need to be used to absorb the electromagnetic radiation.

The creation and the computational management of the boundary conditions are a critical part of the serial and parallel FDTD simulation, owing to the physical significance and computational complexity of what happens when the FDTD computation reaches the edge of the calculation grid and boundary space. The computation of the FDTD cannot be simply terminated at the edges as this will create computational artefacts owing to the incorporation of invalid values in the FDTD computational stencil.


If the requirement is to simulate a space larger than that defined by the Yee lattice, then the boundary conditions are required to absorb the electromagnetic radiation. The Absorbing Boundary Conditions (ABC) initially defined by Mur [106] perform well at absorbing radiation perpendicular to the boundary [2], but are less efficient when the radiation is oblique to the boundary. Berenger's boundary conditions resulted in the formulation of the Perfectly Matched Layer (PML), a mathematical mechanism that is better suited to absorbing radiation oblique to the intersection between the main computational lattice and the boundary lattice [107]. For this thesis the FDTD boundary conditions were implemented using the original split-field Berenger PML [107]. Each electromagnetic vector field is split into two orthogonal components. By choosing loss parameters consistent with a dispersionless medium, a perfectly matched interface is created. The FDTD update equations in the PML boundary area need to be modified to accommodate the new formulation and associated parameter values [2, 3].

From a computational aspect, the boundary condition can influence the parallel FDTD computation in the following way:

1) Compared to the processing load of the main computational lattice, the PML boundaries require minimal processing load, as shown by the performance profile in table 3.2. The PML boundaries do, however, require regular data synchronisation with the main grid for each iteration of FDTD processing.

2) The coding of the PML boundary conditions using data parallelisation techniques is not simple. The PML boundaries are calculated in their own computational lattice and share data with the main computational lattice. The flattening of the computational arrays, as required for GPGPU processing, makes this alignment of boundary and main lattice particularly prone to coding errors. It is critical to have a good data visualisation tool to assist with the debugging when developing this form of software.

From the aspect of coding effort, the implementation of the split field PML was undertaken on all of the computing platforms to gauge the relative effort required to implement the same algorithm on every high performance platform. The coding effort for the split field PML in the task parallel sense, using threading methods such as MPI and openMP, is compared with that of the data parallel implementations in the chapters that follow.
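To make the extra bookkeeping concrete, the sketch below shows a minimal 2D split-field update for the Ez component in a PML region. It is a generic illustration of the split-field technique rather than the thesis code: all array and coefficient names are hypothetical, and the loss-graded coefficients cax, cbx, cay and cby are assumed to have been precomputed for the PML profile.

/* Minimal 2D TMz split-field PML sketch: Ez is kept as two sub-components
   (ezx, ezy) in their own arrays, which must stay aligned with the
   main-grid Hx and Hy arrays. */
void update_pml_ez(float **ezx, float **ezy, float **ez,
                   float **hx, float **hy,
                   const float *cax, const float *cbx,
                   const float *cay, const float *cby,
                   int nx, int ny)
{
    for (int i = 1; i < nx; i++)
        for (int j = 1; j < ny; j++) {
            /* each split component carries its own loss profile */
            ezx[i][j] = cax[i] * ezx[i][j] + cbx[i] * (hy[i][j] - hy[i - 1][j]);
            ezy[i][j] = cay[j] * ezy[i][j] - cby[j] * (hx[i][j] - hx[i][j - 1]);
            /* the field seen by the main grid is the sum of the two parts */
            ez[i][j] = ezx[i][j] + ezy[i][j];
        }
}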

3.6 EM Sources and reflectors

The FDTD modelling process requires that the EM source is defined in terms of the following attributes:

1) A physical size and shape approximation in space: in a manner very similar to the approximation of the reflectors, or other artefacts in the modelling space, the physical dimensions of the source are required to approximate the actual physical dimensions and properties as closely as possible. The physical size of the FDTD cell must be conditioned to the size of the source or reflector so that the model can be represented accurately. The approximation of a cylindrical source with a circular cross section by rectangular FDTD cell structures will lead to stair-casing, or sizing, discrepancies in the model surface, which can in turn lead to a wide variation in modelled responses [20, 12].


2) Maintaining control of the physical location of the source is important to some forms of the parallel FDTD, such as those implemented in individual memory spaces using MPI. The source must be available to a continuous modelling space even when the modelling space has been divided into a series of individual chunks. Using MPI threading requires the correct positioning of the source structure in the appropriate program chunk or thread (a sketch of this index translation follows the list below). The openMP model, in contrast, does not experience this problem as the excitation of the source occurs in a grid that is held in a single unified memory space and hence uses a single set of coordinate indices to identify grid points.

3) The nature of the excitation of the source. This can be either a Gaussian pulse or a sine wave as the input waveform shape, which in turn is dependent on the objectives of the simulation.
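The sketch below illustrates the index translation described in point 2, assuming a hypothetical one dimensional decomposition of the grid, with each MPI rank owning a contiguous chunk of equal size; the names and the decomposition are illustrative only.

/* Translate a global source index into the local index of the owning chunk. */
#include <mpi.h>

void excite_source(float *ez_local, int nz_local, int src_k_global,
                   float value, MPI_Comm comm)
{
    int rank;
    MPI_Comm_rank(comm, &rank);

    int k_start = rank * nz_local;              /* first global index owned by this rank */
    if (src_k_global >= k_start && src_k_global < k_start + nz_local) {
        int k_local = src_k_global - k_start;   /* translate to the chunk's local index */
        ez_local[k_local] += value;             /* only the owning rank excites the source */
    }
}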

3.7 Validation of the FDTD code

In order to ensure that the FDTD algorithm as used in this thesis is functioning correctly, the electromagnetic far field of a vertical dipole antenna was calculated and the results compared to those from a commercial electromagnetic modelling product. The commercial product used for the modelling is based on the Method of Moments and is called Feko [137]. Results of the FDTD modelling of the far field and the comparison to the commercial results are available in Appendix A.

3.8 Computing Languages and APIs

When comparing the parallelised form of the FDTD on different computing platforms, one needs to examine the suitability of the chosen language for the purpose of parallelisation and the portability of that language to other computing platforms. It is common practice for software developers to reuse computational frameworks and templates to create new computational solutions from existing ones. The implication of this is that the experience gained by a large group of users of a language can be transferred from other numerical disciplines to the coding of the FDTD. A specific example of this is the reuse of SSE techniques from video encoding to enhance the run time of the FDTD: code fragments used to improve the performance of video encoding were adapted to the FDTD algorithm to speed up its processing, as will be described in section 5.6.

Although the serial form of the FDTD in Computational Electromagnetics (CEM) was originally coded in Fortran [67], some early mention is made of implementations in C, such as a description on a Transputer array [22], an early form of RISC based High Performance Architecture. The parallel FDTD has been implemented on several different hardware platforms with a variety of computing languages. However, applications written in C and C++ are portable across most architectures, allow direct access to low level device architectures such as GPUs, and provide ample coding frameworks from other disciplines. Python, a higher level scripting language, allows parallel FDTD functionality [68] on several platforms and provides easy access to high level functionality such as visualisation products. Although customised language elements need to be created for Python to access specific hardware devices, such as GPUs, the language definition provides advantages over C++ by providing much simpler access to mathematical definitions such as complex numbers and arrays. Python also provides integrated access to the high end visualisation methods required to view the resultant output from the FDTD.

The FDTD has most commonly been implemented on GPGPUs supported by the Compute Unified Device Architecture (CUDA), but publications listing FDTD deployments with openCL are becoming more frequent. For this work an FDTD deployment was created with the Direct Compute API from Microsoft, as will be described in section 5.9.

A summary of these different GPGPU interfaces and their applicability to the FDTD is given below:

1) CUDA was the innovator in the development of numerical algorithms for GPGPUs and, as such, has a competitive advantage in the number of systems deployed over other interfaces such as openCL or Direct Compute.

The advantages are:

• Fairly simple API
• Mature numerical library set such as Cublas, CUFFT
• First mover deployment in the numerical algorithm market
• Bindings to many languages
• Excellent documentation
• Large user community
• Good language toolset, such as profilers
• Regular performance and software updates

Disadvantages of CUDA are:

• For Nvidia GPUs only
• Long learning curve

2) openCL also supports the implementation of FDTD code on GPGPU hardware [111, 112] provided by different manufacturers, yet it arrived on the scene much later than CUDA. Although the fundamentals of the CUDA and openCL interfaces are quite similar, the openCL interface is driven from C/C++.

The advantages of openCL are:

• Wide range of hardware supported, not only Nvidia
• Same code can operate on both GPU and CPU
• Can share resources with openGL
• Can produce a single code set for the APU


Disadvantages of openCL are:

• Not well supported by all manufacturers
• Lacks mature libraries and code examples/templates
• Very few language bindings
• Limited documentation

3) Direct Compute is an interface provided by Microsoft and works in conjunction with the DirectX11 product suite on Windows platforms [113]. The only known FDTD implementation using the Direct Compute Interface has been created for this work, although there are earlier implementations using the DirectX graphics pipeline.

The advantages of Direct Compute are:

• Works on GPUs from most vendors using the Windows platform; well supported by manufacturers
• Direct access to the graphics of the DirectX APIs
• Allows integration of a single code set for both cores and GPU on APUs

Disadvantages of Direct Compute are:

• Very few examples for GPGPU
• Complex to code
• Few language bindings

Visual Basic is a Rapid Application Development (RAD) language designed by Microsoft for the rapid prototyping of computing models. Although not a very efficient language for numerical computations, it is a versatile language that allows the conversion of resultant FDTD output into graphical formats and assists with the creation of FDTD models [66]. Visual Basic is very useful for creating numerical utilities, such as those that can compare three dimensional grids or images. For this work the Visual Basic language was used to create utilities to compare data sets during debugging exercises and also, at times, to do a quick visualisation of the FDTD results. In particular it provided a tool to stitch together segments of images output from the parallel FDTD deployments. The image composites were then numerically compared to the image results from the serial FDTD. Visual Basic is also used as a scripting tool in some commercial electromagnetic modelling products such as CST Microwave Studio.

Matlab is a higher level language that provides an integrated development, processing and visualisation environment under the control of a customised operating framework. It is useful for the development of FDTD-like numerical algorithms that can only be conveniently developed and debugged with the use of some form of visual aid. Matlab makes use of a command interpreter that has language elements specifically created for parallel or high performance functionality.

For most of the work described in this document, algorithms have been coded in C or C++ as this allows the portability of code between different platforms. It was also very convenient to implement compiler intrinsics in plain C as the documentation was geared towards this. A good example of this is the implementation of the FDTD with SSE. Early implementations of the FDTD with SSE in x86 assembly language embedded in the FDTD code were inefficient compared to implementing the FDTD with SSE using compiler intrinsics. It is also simpler and quicker to code with compiler intrinsics than with assembly language.
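The sketch below illustrates the compiler-intrinsics style referred to above, updating four single precision field values per SSE instruction. The array names are hypothetical, the arrays are assumed to be 16-byte aligned and padded to a multiple of four, and the curl term is assumed to have been computed beforehand.

/* SSE intrinsics sketch: ex = ca*ex + cb*curl_h, four cells per instruction. */
#include <xmmintrin.h>

void update_ex_sse(float *ex, const float *curl_h,
                   const float *ca, const float *cb, int n)
{
    for (int i = 0; i < n; i += 4) {
        __m128 e  = _mm_load_ps(&ex[i]);       /* four Ex values */
        __m128 ch = _mm_load_ps(&curl_h[i]);   /* four precomputed curl values */
        __m128 a  = _mm_load_ps(&ca[i]);
        __m128 b  = _mm_load_ps(&cb[i]);
        e = _mm_add_ps(_mm_mul_ps(a, e), _mm_mul_ps(b, ch));
        _mm_store_ps(&ex[i], e);
    }
}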

3.9 Performance profiling the serial FDTD

In order to convert the serial FDTD into a parallel FDTD it is necessary to understand precisely which part of the FDTD code consumes the largest amount of processing time.

The serial FDTD process is made up of three discrete computational stages:

1) A serial processing stage consisting of a model preparation stage and other serial functionality, such as process control and status messages. The preparation stage calculates the initial FDTD modelling parameters and assigns the configuration of the calculation lattice to memory.

2) The computation stage, which consists of the repeated iterations of the FDTD stencil over the calculation lattice.

3) The output functions, which save the resultant FDTD computations to disc for subsequent visual post processing.

Although a visual inspection of the existing FDTD programs indicated that the bulk of the processing effort would be taken up by the processing of the finite difference equations (equations 3.3a-f), this was confirmed by using a code profiling tool and also by manually profiling the FDTD code using timing markers. The results of profiling the 3D FDTD code with a profiling tool are shown in table 3.1 below. The function “Compute”, shown in table 3.1, calculates the FDTD finite difference equations and the FDTD PML boundary conditions, and the profiler confirms the initial expectation that these take up the dominant part of the processing time. The processing times of the other components of the FDTD program, although small, are still relevant when considered in the context of Amdahl's law, described in section 5.2. The consideration that needs to be made with regard to Amdahl's law is that even if all of the “Compute” functionality in table 3.1 below could be so highly parallelised that it took up zero seconds, further speedup of the FDTD would be limited by the timings shown for the serial, non-parallelised functions.

Name of 3D FDTD program function     Time – micro S     % Total Time
Compute – FDTD computations          126894.384         99.80%
Dipole – create source model         0.021              0.00%
setup – setup initial values         0.673              0.00%
Initialise – reserve memory          192.594            0.20%
main                                 0.006              0.00%

Table 3.1: Results of time profiling the constituent functions of the 3D FDTD with a profiling tool.
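As a rough worked illustration of this Amdahl's law argument, using the parallelisable fraction p ≈ 0.998 suggested by table 3.1, the speedup on N processors and its limit are:

\[
S(N) = \frac{1}{(1-p) + \dfrac{p}{N}}, \qquad
S_{\max} = \lim_{N \to \infty} S(N) = \frac{1}{1-p} \approx \frac{1}{0.002} = 500
\]

That is, even with an unlimited number of processors, the measured serial fraction of roughly 0.2% would cap the overall speedup of this particular profile at about 500 times.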

The results of manually profiling the 2D FDTD code correspond very much with the 3D FDTD, results for which are shown in table 3.2 below.


Total run time of FDTD program                 1520 s     100%
Time consumed by FDTD difference equations     1499 s     98.62%
Time consumed by calculating FDTD boundaries   20 s       1.32%
Start-up time before FDTD                      1 s        0.06%

Table 3.2: The profiling of the 2D FDTD using manual timing marks.
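The manual timing marks behind table 3.2 can be reproduced with a few wall-clock timestamps around each stage, as sketched below; the commented-out stage functions are placeholders for the actual FDTD routines rather than names from the thesis code.

/* Manual timing markers: wall-clock timestamps taken around each stage. */
#include <stdio.h>
#include <time.h>

static double wall_seconds(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int main(void)
{
    double t0 = wall_seconds();
    /* setup_model(); */                       /* start-up stage */
    double t1 = wall_seconds();
    /* run_fdtd_iterations(); */               /* difference equations and PML */
    double t2 = wall_seconds();

    printf("start-up: %.3f s, compute: %.3f s\n", t1 - t0, t2 - t1);
    return 0;
}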

3.10 Estimate of the parallel FDTD coding effort

The effort required to convert the serial FDTD program into a parallel FDTD program on a specific platform is subject to factors such as the experience base of the personnel doing the deployment and the technical information available for that specific platform. The deployment factors listed below are specific to the platforms evaluated for this thesis and are summarised in table 3.3. Each deployment factor has been ranked in the range from 1 to 10, where a lower score indicates less coding effort.

The coding effort is subject to the following:

1) The size of the user community for the target platform

A larger community is seen as a positive. The size of a user community for a specific coding platform creates its own source of self-help and discussion. Good examples of these are the user forums for Intel, AMD and NVidia processors. User Rank: 1 Large community, 10 Small community.

2) Maturity of the technology

Later versions of one type of technology have a development precedent and require less work to optimise. An example of this is the development of the SSE standard on the Intel and AMD chipset which was previously only accessible using inline assembly language. The subsequent release of SSE compiler intrinsics makes the task of building FDTD applications with SSE relatively simple. The number of releases a platform has evolved through is also an indication of maturity. Maturity Rank: 1 Mature, 10 Immature.

3) Availability of Computing Platform

One of the aspects that makes the discrete GPU so appealing as an HPC platform is not only its price-performance aspect but also that it can be accessed at will from the convenience of one’s PC. Sharing a resource such as a GPU cluster can be a difficult task as usage by others may make the development of programs a long process. Avail Rank: 1 High availability, 10 Low availability.

4) Coding precedent in the form of coding templates from other disciplines


Coding examples from one computing discipline can be reused in another. An example of this is the use of the SSE to accelerate the FDTD process, as described in chapter 5. The SSE is quite commonly used in video compression algorithms and closer examination of how it is achieved yields some good examples of SSE usage that can be implemented for the FDTD. Precedent Rank: 1 Good precedent, 10 No coding precedent.

5) Documentation

With regard to the information available for the deployment two examples are given that illustrate the effect this has when optimising the FDTD formulation. An example of good documentation is given by Nvidia’s arsenal of clear examples and easy-to-read manuals, all geared towards the programmer being able to achieve his objectives on the GPGPU platform. The other end of the spectrum is the documentation for the Bluegene/P for which the availability of information may best be expressed by an example. This example is the Wikipedia definition of the IBM PowerPC 450 used in the Bluegene/P. The quote reads: “The processing core of the Bluegene/P supercomputer designed and manufactured by IBM. It is very similar to the PowerPC 440 but few details are disclosed.” Doc Rank: 1 Good documentation, 10 Lack of documentation.

6) Orthodoxy

The parallel FDTD code threads implemented with MPI and openMP threading libraries are very similar in coding structure to the serial form of the FDTD. The code implemented for the GPU is a vector based data parallel code that can be difficult to follow when compared to the serial form of the FDTD, i.e. it is not structured in a conventional or orthodox coding style. Ortho Rank: 1 Orthodox, 10 Not Orthodox.

Platform     User   Maturity   Available   Precedent   Doc   Ortho   Total
Bluegene/P   5      2          2           3           7     3       22
M9000        5      3          2           3           6     3       22
Cluster      1      2          4           1           5     3       16
GPGPU        3      6          1           4           5     8       27
Multicore    2      1          2           1           3     2       11

Table 3.3: Factors influencing the coding effort of the FDTD

Based on the deployment effort summarised in table 3.3, an FDTD-like algorithm is most readily deployed on a multicore chip and its derivations, such as the cluster computer. A coding deployment of the FDTD on a GPU is much more involved than on a multicore architecture, and this is reflected in the GPU being ranked fifth, the most onerous deployment.


3.11 Summary

The FDTD method is a finite differences approximation of Maxwell's equations, and its highly repetitive algorithmic structure makes it ideal for deployment on a High Performance Computing platform. Careful management of the stability and boundary conditions of the modelling domain allows the FDTD method to be used in one, two or three dimensional form.

The present FDTD implementation of the electromagnetic far field emanating from a dipole antenna corresponds well with an equivalent formulation using FEKO, a commercial product using a Method of Moments modelling approach.

The profiling of the serial FDTD code shows that the greatest computational load in the FDTD algorithm is being incurred by the update of the electric and magnetic field equations. This identifies the electric and magnetic field update equations as being good targets for parallelisation. The effort used to code the parallel FDTD has been estimated using factors such as maturity of technology, documentation available, and the conventional nature of the FDTD algorithm. From this it is determined that the multicore platform requires the least effort to code the parallel FDTD.


Chapter 4

4 Architecture of HPC systems

4.1 Overview
4.2 Hardware architectural considerations
4.2.1 The processing engine
4.2.2 Memory systems
4.2.3 Communication channels
4.2.4 Hyper-threading (HT)/Symmetric Multithreading (SMT)
4.3 Examples of HPC systems
4.3.1 Discrete GPUs
4.3.2 Accelerated programming units APUs
4.3.3 SUN M9000 shared memory processor
4.3.4 Intel multicore at the Manycore Test Lab
4.3.5 Nehalem and Westmere cluster computer
4.3.6 IBM's Bluegene/P
4.3.7 Intel's Many In Core (MIC) processor
4.4 Other considerations
4.4.1 Power consumption
4.5 Summary

4.1 Overview

This chapter describes the hardware characteristics of the High Performance Computing (HPC) systems used to deploy the parallel FDTD for this thesis. The architecture of each system is analysed from the perspective of the FDTD implementation.

Most numerical algorithms are very closely related to the hardware that they are deployed on. It is a maxim that hardware and software evolve together in iterative steps, as is reflected by the increase in the computing capacity of the personal computer and its software since the early 1980s. This applies specifically to the parallel FDTD, which requires extremely large hardware capacity and computational power to make the operation of the FDTD viable [69].

An examination of the components of a high performance computer system is necessary to identify the elements required by the parallel FDTD and this is the objective of this chapter. To understand how these individual serial components are transformed from a traditional processing mechanism to a high performance processing device requires a decomposition of the computing devices. The reasons for this are:

1) To identify the relationship between HPC components and understand how these in turn relate to the implementation and performance of the FDTD. A typical example of this is the manner in which the FDTD performance is affected by correct memory coalescence of Graphical Processing Unit (GPU) threads [87].


2) To analyse the evolution of architectures within the computing devices, as this may have a profound effect on how the FDTD is programmed. An example is the adaptation of the FDTD to the Streaming SIMD Extensions (SSE) or Advanced Vector eXtensions (AVX) in a processor, and the description of how changes in the AVX affect the processing performance of the FDTD. This will be discussed in some detail in chapter 5.

4.2 Hardware architectural considerations

In the context of the parallel FDTD and High Performance computing, the computer hardware is made up of three main functional components:

1) The processing engine/compute engine.

2) The memory associated with the processing engine.

3) The communication channels connecting memory and processor elements.

An analysis of the computing hardware used for the parallel FDTD would not be complete without some description of the basic building blocks that make up the platform. The FDTD literature is strongly influenced by marketing forces, and the description of features such as GPU processing power can be misleading where the implementation of the FDTD is concerned. The provision of many cores for the processing of graphical algorithms may be very appealing as a first impression, but for the FDTD one needs to consider:

1) These GPU processing cores have a reduced instruction set when compared to a conventional CISC instruction set, and hence a reduced capability.

2) Many cores require a sufficient supply of data to process at full capacity, and this requires that adequate memory bandwidth and latency be considered.

This is not meant to put the GPU in a disparaging or contentious light but serves as an example that a simple comparison of numbers describing the hardware characteristics tells only part of the processing performance story. It is also not enough to generalise about the characteristics of the HPC platform at large, such as GFLOP or other speed ratings, as such a performance rating will change from one type of algorithm to another [83]. As an example, consider two different algorithms being processed on the GPU: a Monte-Carlo simulation requires neither communication between the processing cores nor a constant supply of data from memory to the cores, whereas the FDTD requires both [97].

It is also relevant to the FDTD that HPC hardware is constantly in a state of improvement. The inclusion of Fused Multiply Add (FMA) [142] and Streaming SIMD Extensions (SSE) facilities in conventional Complex Instruction Set Computing (CISC) processors shows the ability of processor manufacturers to adapt the hardware design to HPC requirements. The development of computing hardware is now in a phase where one can no longer identify simplistic architectures, such as the single serial compute cores used by older Beowulf cluster nodes, which have been replaced by multicore processors in contemporary versions. There is a similarity in the architectural structure of the larger HPC systems such as the Bluegene/P, SUN M9000 and the cluster computers in that the compute nodes are made up of multiple closely coupled multicore processors, as shown in table 4.0.


System             Compute Node description   Processors per Compute Node   Physical cores per Compute Node   Inter-node connection method
M9000 (SUN)        CMU                        4                             16                                Interconnect
Bluegene/P (IBM)   Node Card                  32                            128                               Interconnect
SUN Nehalem        Blade 6275                 4                             16                                Infiniband

Table 4.0: The use of Compute Nodes in large HPC computers

The implication of this similarity is that in order for the FDTD to achieve good performance on an HPC system, good data bandwidth and low latency data communication are required between the multicore processing cores and the associated memory. The memory latency is the time required by the processor to access a section of data for reading or writing; it is a measure of how quickly the communication and memory subsystems complete a data access request. Communication latency can be measured in a variety of ways, such as with a ping-pong program [98], or according to the method described in RFC 2544 [98]. The measurement of latency depends on several sub-systems in the Open System Interconnect (OSI) communication stack [123]; figure 4.0 shows which components of the stack are involved in calculating the latency for an FDTD data exchange.

Figure 4.0 Communication latency experienced by two FDTD chunks as the data passes through the OSI stack using an ethernet connection.
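A minimal MPI ping-pong sketch of the latency measurement referred to above is shown below; the message size and repeat count are arbitrary illustrative choices, and half the averaged round-trip time is reported as the one-way latency.

/* MPI ping-pong latency sketch: run with at least two ranks. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    char buf[8] = {0};
    const int reps = 1000;
    double t0 = MPI_Wtime();
    for (int i = 0; i < reps; i++) {
        if (rank == 0) {
            MPI_Send(buf, 8, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, 8, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, 8, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, 8, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    if (rank == 0)
        printf("one-way latency ~ %.3f us\n",
               0.5 * (MPI_Wtime() - t0) / reps * 1e6);
    MPI_Finalize();
    return 0;
}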

Low communication and memory latency between processors needs to be one of the central requirements when considering the hardware to process the parallel FDTD. A consideration to be made when analysing the hardware is not just to examine components individually but also how they communicate. One of the objectives of this thesis is to distinguish between the different platforms that support the FDTD, and the examination of processor interconnectivity is one of these aspects. The comparison is not simple, as the communication functionality provided by the close collaboration of the processing components in devices such as an Accelerated Processing Unit (APU) cannot be expected to be reproduced for larger systems using discrete processor and memory interconnects.

An examination of the memory to processor latency and the processor to processor communication latency shows how latency can restrict the processing performance achievable by the parallel FDTD algorithm, as shown schematically in figure 4.1. Although the lengths of the arrows showing the elapsed time of the computation and latency sections are not proportional to actual timings, they do serve to illustrate that for the FDTD the latency penalty is incurred during a synchronised event between the computational stages on different processing threads. Therefore, if the latency can be reduced, then the efficiency of the process can be improved. There are two main categories of latency that have an effect on FDTD performance. One is the latency inherent in inter-process communication, as is shown in the ping-pong test schematic of figure 4.0 above. The other is the time required to communicate with physical devices, such as memory components.

Figure 4.1: Latency in the parallel FDTD process.

4.2.1 The Processing engine/ the Compute engine

The processing engines used for the parallelisation of the FDTD use processor architectures varying from the Reduced Instruction Set Computing (RISC) based GPU core to the CISC based Nehalem Xeon 5570 multicore CPU.

There is a trend in contemporary processors to incorporate more high performance computing elements on the processor die. An example of this is the incorporation and extension of the vector processing capability provided by SSE and AVX [99], and of specific arithmetic logic such as the Fused Multiply Add (FMA) in Intel's Haswell series of processors. Multicore CPUs form the basic building block of most supercomputers, as is shown in table 4.1.


Configuration     Type of Multicore IC             Number of cores   Release Year
Beowulf Cluster   Intel E2140                      2                 2007
Bluegene/P        PowerPC 450                      4                 2007
SUN M9000         Sparc 64 VII                     4                 2009
Nehalem Cluster   Intel Xeon 6550                  8                 2010
Cray XK6          AMD Interlagos                   16                2012
—                 Intel Knights Corner/Intel Phi   50                2013

Table 4.1: The increase in the number of physical cores on a processor with time.

The table shows that the number of cores on a processor die is increasing with time. If one follows the trend shown in table 4.1, then Intel's Knights Corner would potentially be a good candidate as a building block for a supercomputer, as it has 50 processing cores. Supercomputers such as the Cray XK6 and the Chinese Tianhe-1A are using processing engines made up of a hybrid configuration of multicore and GPU processing engines. The GPU processing engines are configured as accelerators to the multicore processing engines.

4.2.2 Memory systems

The memory available for FDTD processing can be classified according to its speed of access and capacity. Memory performance is generally defined in the following terms:

1) Latency: the delay in reading and writing the memory by the processor.

2) Bandwidth: the rate at which data can be written to or read from a memory IC by the processor. The maximum theoretical bandwidth is calculated by multiplying the memory bus width by the frequency at which it can transfer data. This maximum theoretical throughput cannot usually be sustained and is only an indication of peak bandwidth (a worked example follows the list below).

3) Memory channels: in order to relieve the pressure of accessing memory via a single memory channel, several memory channels may be used. This has been common in server technology for some time and since 2007 has been available on off-the-shelf motherboards. Although the memory is accessed via one memory controller, several channels of memory can be used. The Blade servers used in the CHPC's cluster computers provide three physical channels of memory to each CPU.
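As a worked example of this peak-bandwidth calculation (the memory type and transfer rate are chosen purely for illustration), a single 64-bit wide DDR3-1333 channel gives:

\[
B_{\mathrm{peak}} = \frac{64\ \text{bits}}{8\ \text{bits/byte}} \times 1333 \times 10^{6}\ \text{transfers/s} \approx 10.7\ \text{GByte/s per channel}
\]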

In this thesis we make use of the following types of memory:

4.2.2.1 RAM/DRAM/SDRAM/DDR/GDDR/SRAM

The term Random Access Memory (RAM) is used quite generically in that it describes most types of memory where the data may be accessed for reading and writing in any order. Dynamic Random Access Memory (DRAM) supplies the data to the system bus in an asynchronous sense. In physical terms the DRAM is RAM memory that stores each individual bit in a capacitor, which requires constant refreshing as the charge dissipates (leaks).


Synchronous Dynamic Random Access memory (SDRAM) is DRAM that is synchronised with the system bus and affords lower memory latency than DRAM. The clock synchronisation allows the more efficient use of pipelining instructions and data and hence exhibits lower memory access latency than DRAM.

The term DDR (Double Data Rate) is used in conjunction with SDRAM to identify a memory that operates at twice the data rate of the conventional SDRAM, and has twice the bandwidth. DDR3 and DDR4 are higher speed versions of the DDR memory.

The Graphic DDR (GDDR) is a type of DDR memory adapted for graphics processing in that it provides better memory bandwidth to the processing unit. The GDDR5 uses less power than DDR3 and will allow sections of the memory to be assigned to different physical memory channels, and hence specific processing cores within a multicore, Many In Core (MIC), or graphics processor.

4.2.2.2 Cache memory

This is used as an intermediate memory staging area in moving data to and from the processing cores. It reduces the average time to access the DRAM by storing frequently used copies of data in closer functional proximity to the core. Cache memory has much lower access latency than DRAM, and a rule of thumb among game programmers holds that the access latency of cache memory is nearly one tenth that of DRAM [33]. There is a sub-classification of cache memory, as data is cached in several levels, i.e. Level 1, Level 2, etc. Access to the cache memory is managed transparently by the processor hardware rather than explicitly by the programmer.

The programming overhead and skill required to exploit the cache memory levels explicitly is much greater than that required for programming data access to DRAM. Cache memory can be very useful in alleviating the data bandwidth and latency bottleneck when transferring FDTD grid data to and from the processing core [100]. The multilevel cache mechanism can be used to hide the longer latency incurred by accessing the DRAM.

The cache is Static RAM (SRAM), a RAM based on the switching of Integrated Circuit logic. SRAM is different from DRAM in that it does not need to be refreshed at every clock cycle. The use of SRAM is commonly associated with the use of Field Programmable Gate Arrays (FPGA) as this type of memory is used to store the FPGA’s configuration data. SRAM has lower access latency than DRAM, it consumes less power, but is more expensive. A typical processor’s cache memory is usually made up of SRAM. In GPUs, cache memory is also referred to as shared memory.

The cache system used by a multicore processor needs to be cache coherent in order to achieve good performance. That is, some data may be available as several copies in different caches on a multicore processor; changes to one copy in one cache need to be propagated to the other caches, and this is referred to as cache coherence. Maintaining cache coherence should not disrupt the process flow.


4.2.3 Communication channels

The technologies that provide the communication channels between processors can operate at a variety of speeds as listed in table 4.2. The entry of the DRAM at the bottom of the table is a reference value to compare the latency and bandwidth achieved by the communication of the processor directly with the memory.

Network                   Type                 Latency      Bandwidth
Ethernet [76, 77]         1 Gbit               30-125 µs    1 Gbit/sec
Ethernet                  10 Gbit              5-50 µs      10 Gbit/sec
Infiniband [75, 77]       QDR                  3-15 µs      2.5 GByte/sec
Infiniband [Mellanox]     FDR                  1 µs         2.5 GByte/sec
Interconnect [74]         Gemini (Cray)        105-700 ns   8.3 GByte/sec
Interconnect              Jupiter (SUN)        258-498 ns   737 GByte/sec
Interconnect              Torus (IBM)          3-7 µs       380 MByte/sec
Integrated Interconnect   QPI (Intel)          40-60 ns*    32 GByte/sec
Memory [5]                DRAM                 3-10 ns      3-10 GByte/sec
*processor dependent

Table 4.2: A comparison of the latency values for some memory and communication systems.

The network topology is a description of how the processors in a multiprocessor system are related [75]. The topology can be described in a physical and logical sense. The manner in which processors are spatially related is the physical topology. The logical topology is the relationship of processors as defined by the parallel FDTD program. As an example, two parallel FDTD program threads communicating data between each other are considered to be neighbours according to the logical topology, but in the physical sense may be resident on cores in different nodes of a cluster computer, i.e. they are not neighbours in the physical topology.
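The logical topology described above can be declared explicitly to MPI, independently of the physical placement chosen by the scheduler; a minimal sketch (with an arbitrary 4x4 process grid and illustrative names, assuming MPI_Init has already been called) is shown below.

/* Declare a 2D logical process grid and find each rank's logical neighbours. */
#include <mpi.h>

void build_logical_topology(MPI_Comm *cart, int *left, int *right,
                            int *down, int *up)
{
    int dims[2] = {4, 4};
    int periods[2] = {0, 0};                   /* no wrap-around at the edges */
    MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 1, cart);
    MPI_Cart_shift(*cart, 0, 1, left, right);  /* logical neighbours along x */
    MPI_Cart_shift(*cart, 1, 1, down, up);     /* logical neighbours along y */
}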

At the Centre for High Performance Computing, which kindly provided the use of its supercomputers for this project, all programs are submitted for execution as a MOAB [148] job. MOAB is a scheduling system used to optimise the allocation of the programs run on its supercomputers. Apart from some limited configurations for the Bluegene/P, the job scheduler does not allow the assignment of a physical topology. Although assigning a physical network topology that mirrors the logical topology does improve the performance [54] of the parallel FDTD when using MPI on a cluster computer, this could not be implemented for this thesis owing to the abstraction provided by the job scheduler.

The parallel FDTD is deployed on two types of topology in this thesis:

1) The parallelisation of the FDTD on a peer system: the parallel FDTD program is implemented on a collection of processors or cores which use one or more of the communication channels described in table 4.2. Examples are the Bluegene/P and the cluster computer. Each processor in the computing system behaves as a peer in that system.

2) The parallelisation of the FDTD using accelerators: communication between the host processor and the accelerator takes place over a communication bus, such as the Peripheral Component Interconnect Express (PCIe) bus. This is also known as asymmetric processing, which entails that the dominant component of the processing is performed on the accelerator itself. The first mention of the FDTD being deployed in parallel using an accelerator board is from an implementation on a network of Transputers [22, 23]. The 50 core multiprocessor known as the Intel Phi, to be released in 2013, is also to be configured for use as an accelerator card.

The FDTD is implemented using one of the following types of communication mechanisms:

4.2.3.1 Infiniband

The Infiniband system is a switched communication system that can operate at a network latency as low as 100 nanoseconds and achieves a bandwidth of around 2.5 GByte/s. The network topology is a switched fabric where compute nodes are connected via a system of high speed switches.

Infiniband operates on a proprietary protocol communication stack that offers good scalability with respect to the number of nodes connected to the switched fabric, i.e. a system of network switches as is shown in figure 4.2.

Figure 4.2: Switched Interconnect Fabric

Infiniband only communicates data packets from one compute node to another and does not function in the same manner as an interconnect does on the M9000, Bluegene/P, or Cray XK.

Compared to Gigabit Ethernet, Infiniband offers better communication performance between nodes as it has much lower communication latency [53]. Research by Yu [54] established that the network topology can have an influence on the performance of up to 45% for large clusters running the FDTD with MPI. The network topology used for the work in this thesis was fixed to be hierarchical with a single switch level. The configuration of the nodes to a different topology was not possible owing to the assignment of the processing cores by the MOAB system scheduler.


4.2.3.2 Interconnects

Interconnects are communication systems that provide inter-processor communication in High Performance Computers. Interconnects provide physical communication channels managed by the operating system of the supercomputer, over and above that described by a switching system as in section 4.2.3.1.

1) Interconnect used by IBM Bluegene/P

The Bluegene/P uses a 3D Torus interconnect, a network topology that allows each compute node to communicate efficiently with its nearest neighbouring compute nodes in a three dimensional sense, as shown in figure 4.3.

Figure 4.3: Torus interconnect [142]

2) Jupiter Interconnect used by SUN on the M9000

The SUN M9000 Shared Memory Processor uses the Jupiter Interconnect which allows node to node communication by using many data communication paths concurrently. A schematic of the physical hardware of the interconnect is shown below. Multiple system boards, or Compute Memory Units (CMU), connect into a crossbar switch. The crossbar switches are again connected via a PCIe communication bridge. The architecture of the M9000 is based on this interconnect mechanism and affords bandwidths in the order of 737 GigaBytes per second.


Figure 4.4: Jupiter interconnect (from SUN M9000 architecture manual) [36].

4.2.4 Hyper-threading (HT)/Symmetric Multithreading (SMT)

Hyper-threading, also known as Symmetric or Simultaneous Multithreading Technology (SMT), allows one physical processing core to be viewed as two logical ones by sharing some system resources. It allows two threads to operate in parallel and reduces the impact of memory access latency by overlapping the memory latency of one thread while the other thread is executing. The logical processor is afforded its own set of registers and program counters yet makes use of shared resources on the processor [88].

Tests performed on an Intel Pentium processor showed that processing the FDTD with two hyper threads on a single core takes slightly longer than the processing of a single thread of the FDTD on a single core. It is speculated that the contention for processing resources by the two logical threads using the single processing pipeline is the cause of this reduction in processing speed. There was no performance benefit to using the hyper threading to process the FDTD on the Pentium 4 processor.


4.3 Examples of HPC systems

4.3.1 Discrete GPUs

The discrete GPU is typically configured to a host computer via a Peripheral Component Interconnect express (PCIe) bus as is shown in figure 4.5 below:

Figure 4.5: Discrete GPU attached to Host computer

The work for this thesis uses GPUs from Nvidia, which, from the aspect of processor architecture, are a collection of SIMD devices called Streaming Multiprocessors (SM) that are connected to a global memory. The SM in turn are composed of a collection of cores, or Stream Processors (SP), also referred to as shaders in some of the literature. Processes are deployed on the GPU using a Single Instruction Multiple Thread (SIMT) approach which requires the same thread to be run on every Stream Processor belonging on one Streaming Multiprocessor. In figure 4.5 the Streaming Multiprocessor has been shown as an SIMT Processor. The architectural relationship of the Streaming Multiprocessor and the Stream Processor are shown schematically in figure 4.6.

The ATI (GPU manufacturing division of AMD) equivalent concept of the SIMT is termed SPMD (Single Program Multiple Data). The difference between SIMT and SIMD is that with SIMT one can use conditional (IF) logic within the SIMT program flow. Intel also uses a graphical processing device called an EU (Execution Unit). Intel will not reveal what the architecture of a graphic EU is but it is speculated to be some form of SIMT device.

The basic structure of these GPUs is described with a comparison to a typical CPU in the simplified schematic in figure 4.6. The GPU's RISC core has available a logic controller and an Arithmetic Logic Unit (ALU), whereas a Complex Instruction Set Computing (CISC) core, such as in Intel's Xeon architecture, allows for a controller, an Arithmetic Logic Unit (ALU), and a Vector Arithmetic Logic Unit (VALU, such as the SSE) [100]. A more detailed description of the architecture can be found in Nvidia's programming manual [82].


Figure 4.6: Comparison of CPU and GPU architectures [37, 82]

There is a strong analogy between the GPU-CPU relationship and the original Intel 8086 CPU operation with the 8087 maths coprocessor. The 8086 and the 8087 initially operated with the 8086 as a host processor and used the 8087 as a slave processor to perform the floating point operations. As chip die technology improved, the 8087 floating point technology was finally moved onto the host die as the Floating Point Unit as this reduced the communication latency between the two discrete chips.

The schematic of the Intel Accelerated Processing Unit (APU) in figure 4.7 shows that the GPU is now being moved on to the die of the host CPU, thereby allowing more convenient access to the main memory by both the GPU and the processor’s cores.

4.3.2 Accelerated programming units

Multicore CPUs are powerful processing platforms, with the cores all having access to the same physical memory. In an attempt to reduce the power consumption and the cost of providing both a CPU and a GPU on a motherboard, manufacturers have moved the GPU functionality onto the same die as the multicore processor, as is shown schematically in figure 4.7. Processor manufacturers are also moving the memory handler and larger amounts of cache memory onto the chip die itself, as is also illustrated by the schematic of the architecture of the Intel i5 four core CPU in figure 4.7. Advanced Micro Devices (AMD) refers to this technology as Heterogeneous System Architecture Technology, and has trademarked this name.

The memory controller is now on the same die as the four cores and the GPU. The CPU cores and the GPU in effect now use the same physical memory, and this has significant implications for the processing of the parallel FDTD, specifically because the GPU and the processor cores now have access to the same physical memory. This close association of memory suggests that the parallel processing power of the integrated GPU and the multicore can be combined.

Figure 4.7: APU architecture

The SSE and Advanced Vector eXtensions (AVX), compared in figure 4.8, are not independent hardware components but collections of registers resident on the die of some microprocessor cores. These registers are highlighted here because of the similarity in programming method when compared to a multicore GPU. The SSE, AVX and GPU all allow the parallelisation of the FDTD in a data parallel manner, i.e. the FDTD is processed in an SIMD or SIMT manner on both the AVX and the GPU. The SSE registers are 128 bits long and the AVX registers are 256 bits long, although the AVX specification allows them to be extended to 1024 bits [99]. These registers have also been described as belonging to the VALU section of the processor die [99, 100].

Figure 4.8: A comparison of AVX and SSE on a multicore processor.


4.3.3 SUN M9000 Shared Memory Processor (SMP)

The M9000 is a shared memory processor produced by SUN Microsystems. The processing power is provided by 64 SPARC VII 64 bit processors, which all communicate via a highly efficient memory interconnect, as is shown in the schematic in figure 4.9. The Scalable Processor ARChitecture (SPARC) is a RISC processor technology developed by SUN Microsystems. As each SPARC VII processor has four cores, 64 x 4 = 256 processing cores are available. Simultaneous Multithreading allows each core to support two processing threads simultaneously.

Although each Computer Memory Unit (CMU) board, an ensemble of two SPARC CPUs, has 128 GB of memory available, the interconnection of all the CMU boards by the Jupiter interconnect allows all of the processors to share the memory space. This is in essence the foundation of the term Shared Memory Processor.

The Jupiter memory interconnect is the defining component, as is illustrated in figure 4.4. The memory communication latency between Computer Memory Units (CMU) remains low when compared to the communication latency between the cores on one CMU; that is, threads running on one CMU incur much the same communication penalty as threads located on two different CMUs. This is due to the several levels of caching provided for communication between the CMU and the Jupiter interconnect.

Figure 4.9: A schematic showing the components of the M9000 shared memory processor.

Another advantage provided by the Jupiter interconnect is the ability of the processors to access memory in parallel. As this is one of the key points explaining the good communication between the CMUs, and therefore enables the configuration of memory boards to be viewed as one transparent memory, it is shown schematically below in figure 4.10.


Figure 4.10: The Jupiter interconnect allows Memory Level Parallelism.

4.3.4 Intel multicore at the Manycore Test Lab (MTL)

The Intel MTL (Manycore Test Lab) is an online facility made available to academics by Intel. It consists of several Intel Xeon X7560 processors connected using Quick Path Interconnect (QPI) technology. The QPI is a switching and communication fabric integrated into the Intel processors that allows the system of processors to work in a uniform memory space. The Xeon X7560 is a multicore chip with eight cores that supports hyper-threading. The processors are configured such that the memory is viewed as a unified memory space by the operating system, as is schematically shown in figure 4.11. By using QPI the system also allows simultaneous access to the memory, as described in figure 4.10 above, i.e. it has Memory Level Parallelism (MLP).

Figure 4.11: Distributed Shared Memory Processor

4.3.5 Nehalem cluster computer and Westmere cluster computer

The cluster computer has the simplest architecture of all the HPC platforms used for this work. The cluster is a collection of multicore processors in a blade configuration, shown in figure 4.12, that communicate via a bus system provided by a Local Area Network. At the Centre for High Performance Computing (CHPC), this network structure is made up of a system of Infiniband switches, as is schematically shown in figure 4.13. The Blade configurations are composed of multicore chips, which are in themselves tightly coupled using the QPI fabric, as is shown in figure 4.10. This arrangement of CPUs, memory and the QPI is shown in figure 4.13 below. It is worth noting that the 6275 Blade configuration allows the memory to be accessed via three physical channels, a feature which greatly improves the access to memory and reduces the data bottleneck (discussed in subsequent chapters) experienced by the parallel FDTD on a multicore processor. The QPI fabric provides cache coherency between the processing cores on all of the CPUs [32] connected by a particular QPI system.

Figure 4.12: Architecture of the Blade 6275 used in the CHPC cluster computers.

Figure 4.13: Architecture of cluster at CHPC

Two different clusters were available for this thesis:

1) Nehalem cluster

This consists of Intel Xeon 5570 processors operating at 2.9 GHz in a Blade configuration. The Xeon 5570 processors have four cores and are hyperthreaded, thereby making available two threads per core, for a total of eight threads per processor. The connection network is made up of switch based Quad Data Rate (QDR) Infiniband.

2) Westmere cluster

This consists of Intel Xeon 5670 processors operating at 2.9 GHz in the 6275 Blade configuration. The Xeon 5670 processors have six cores and are hyperthreaded, thereby making available two threads per core, for a total of 12 threads per processor. The connection network is made up of switch based Quad Data Rate (QDR) Infiniband.

All of the Blades on the network contend for network communication bandwidth as required by the processes running on the system. The Infiniband connection does not possess any switching logic to provide features such as parallel memory access to a single thread, as is found in the memory interconnects described earlier.

For this thesis the FDTD is implemented on the clusters using the Message Passing Interface (MPI) threading, discussed in more detail in the subsequent chapters. The Unified Memory Architecture (UMA) of the Nehalem and Westmere processors has also been exploited to produce hybrid implementations of the FDTD using both the openMP and MPI threading methods [41].

4.3.6 Bluegene/P (IBM’s massively parallel processing system)

The Bluegene/P is a collective of compute nodes connected by a sophisticated and configurable interconnect system. The communication interconnect between the compute nodes can be configured to reflect different processing strategies and topologies. The 1024 PowerPC 450 processors available on this machine each have access to two GB of memory, making a total of 1024 x 2 GB = 2 TeraBytes of memory space available to the system. The machine consists of 32 compute nodes, each of which in turn carries 32 processors. Each processor is an IBM PowerPC 450 with four cores, so the machine has 32 x 32 x 4 = 4096 cores available for processing. A schematic of this assembly, taken from the IBM programming manual [104], is shown in figure 4.14.


Figure 4.14: Bluegene/P components [104]

The Bluegene/P is a collection of low power IBM PowerPC 450 processors connected by a highly configurable interconnect. Each processor node has two GB of memory and four cores, the arrangement of which is shown in figure 4.15. The Bluegene/P allows the interconnect to be configured to map the topology of the hardware to the requirements of the FDTD. The advantage of mapping the compute node topology in this way is that it favours low latency communication between neighbouring nodes over the slightly higher latency of communication with remote nodes.

The torus interconnect is particularly suitable for processing the 3D FDTD, as its physical topology reflects the structure of the FDTD’s 3D cubic mesh [17].

Figure 4.15: Bluegene/P architecture as seen by the FDTD application.


4.3.7 Intel’s Many In Core (MIC) processor

Intel’s Many In Core architecture is an accelerator card that can be used for numerical processing. Although it was not available for experimentation in this thesis, it offers several features relevant to the performance of the FDTD, and these are discussed in chapter 5. Intel’s MIC processor is attached to the host processor as an accelerator in a PCIe form factor, in the same manner as the discrete GPU in section 4.3.1, and as is shown below in figure 4.16. The MIC has up to 61 processing cores on the die, as opposed to the typical multicore processor, which has four to ten cores. The cores are supplied with data via a ring interconnect, which has access to GDDR5 memory via 16 memory channels, as is shown in the architectural schematic of the MIC in figure 4.17.

Figure 4.16: The Intel MIC configured to the host processor via a PCIe connection

The length of the vector processing registers has been extended to 512 bits, which is double the length of the AVX registers in the Intel i5 series of processors described in section 4.3.2. This gives each core on the MIC the potential to process sixteen single precision floating point values simultaneously. For a MIC with 50 cores this would give the potential to process 50 x 16 = 800 single precision arithmetic operations concurrently. The tag directories are a feature of the distributed infrastructure used to manage the cache coherence of the many cores. Considering the large number of cores accessing one unified memory space, the tag directories are a vital construct required to achieve good processing performance for applications that require inter-process communication, such as the parallel FDTD.


Figure 4.17: A schematic of the architecture of the Intel MIC processor.

The Intel datasheet states that the power consumption of this device at peak performance will be 300 W.

4.4 Other considerations

4.4.1 Power consumption of high performance computer systems

Owing to the large number of processors, usually operating at very high clock rates, HPC platforms are generally assumed to consume a lot of power. Table 4.3 below lists the power consumption of typical HPC configurations. Direct comparison of power consumption is difficult, as the HPC architectures and system sizes differ greatly and different algorithms exercise different components of the system.

As an example, a comparison in table 4.3 of the Bluegene/P, with its 4096 compute cores, and the SUN M9000, with 256 cores, shows that despite the M9000 having far fewer compute cores, the power consumption is virtually the same. This can be partly explained by the Bluegene/P strategy of using many low power CPUs (approximately 600 MHz), whereas the SUN M9000 CPUs operate at a much higher clock frequency (approximately 2.88 GHz). As will be shown in chapter 6, despite consuming the same amount of power, the Bluegene/P is much more efficient at processing the parallel FDTD than the SUN M9000.


HPC system              Cores     Power consumption     Power per core
IBM Bluegene/P*         4096      40 kW per rack x 1    9.7 W
Sun M9000-64*           256       38.8 kW               151 W
Nehalem Cluster*        8000      80-100 kW per rack    10 W
Nvidia C2070 GPU*       448       238 W                 0.5 W
Intel multicore MTL*    32        170 W                 5.3 W
Intel MIC               50+       300 W                 6 W
Opteron Cluster [82]    2576      300 kW                116 W
Fujitsu K               705,024   12.6 MW               17.9 W
S870 GPU cluster        8192      16 kW                 2.0 W

Table 4.3: Power consumption for a variety of HPC systems (* systems used in this thesis)

Table 4.4 shows the results of normalising the power consumption on an “equivalent number of cores” basis for different types of processor architecture [124], i.e. a test program was processed to completion in the same period of time using the relative number of cores shown in column 1 of table 4.4. Compared to table 4.3, table 4.4 gives a different perspective on the power consumption of the Bluegene/P. Although the Bluegene/P processors consume less power per processor, more processors are required to solve a problem in the same time as fewer cores on a more powerful architecture.

Equivalent # cores    HPC system        Power consumption (kW)
10                    IBM Power-7       0.61
20                    Intel Nehalem     0.72
20                    Intel Westmere    0.59
30                    IBM Power-6       4.79
64                    Barcelona         3.52
160                   Bluegene/P        1.23

Table 4.4: Power consumed to process to an equal time to solution [124].

As illustrated by the discussion above, the power consumption of an HPC platform is difficult to characterise with a single figure. Koomey’s law [94, 95], discussed in more detail in section 5.2, shows that the amount of power consumed by a typical CPU for a given computing load is halved approximately every year and a half [95], or to paraphrase: “at a fixed computing load, the amount of battery you need will fall by a factor of two every year and a half.”

There is a general trend to reduce the power consumed by devices ranging from CPUs to disk drives and communication switches. The power consumption of HPC systems can be roughly classified on the following basis:

1) Newer HPC platforms are more energy efficient than their predecessors. This assertion is supported by Koomey’s law [95].
2) Larger HPC systems consume more energy than smaller ones.
3) HPC systems that have been designed for low power consumption are probably more energy efficient than systems of the same generation that have been built on an ad-hoc basis.

Another attempt to classify the power consumption of HPC systems is the Green Index [143], a method of combining various high performance computing benchmarks into a single number. The Green Index recognises the fact that benchmarks such as the LINPACK rating [84] load predominantly the CPU rather than the other components of the high performance computing platform, such as the communication system. An investigation by Hackenberg et al. [83] shows that the FDTD method implemented with MPI on a cluster computer consumes only 70% of the power consumed when running the LINPACK [84] rating. This reduced power consumption is very probably a feature of the explicit nature of the FDTD method: the LINPACK rating is a very compute intensive process and requires very little inter-processor communication, whereas the FDTD is less compute intensive owing to its higher inter-processor communication requirement.

4.5 Summary

The analysis in this chapter has described a variety of different high performance architectures, an important feature for the deployment of the parallel FDTD being the physical relationship between the processors and their associated memory. In order for the parallel FDTD to achieve good efficiency, it is critical that the memory supplies sufficient data to the processing cores, i.e. the system memory must have good bandwidth and low latency.


Chapter 5

5 Implementation of parallelisation methods for the FDTD

5.1 Overview
5.2 Amdahl’s law, Koomey’s law
5.3 FDTD and parallel processing dependencies
5.3.1 Verification and precision of the parallel FDTD results
5.3.2 Load Balancing
5.3.3 Thread affinity
5.3.4 The use of single and double precision variables
5.4 Parallel FDTD with openMP
5.4.1 openMP on the shared memory architecture
5.4.2 openMP on the Distributed Shared Memory Processor (DSMP)
5.5 Parallel FDTD with MPI
5.6 Acceleration of the FDTD with Vector ALU/Streaming SIMD Extensions and cache prefetching
5.7 Parallel FDTD for the GPU using CUDA
5.8 Parallel FDTD with low level threading
5.9 Parallel FDTD on multiple GPUs
5.10 Parallel FDTD on the GPU using the DirectCompute interface
5.11 Accelerated Processing Units and heterogeneous system architecture technology
5.12 Parallel FDTD on the Intel Many In Core (MIC) processor using openMP
5.13 Summary

5.1 Overview

The objective of the parallel FDTD method is to reduce the processing time of the FDTD computation. This chapter describes the different ways in which the FDTD can be parallelised on a variety of hardware platforms. The intention is to decompose the parallel implementations so as to identify what they have in common and what makes them unique.

The parallelisation methods involve a deployment of the parallel FDTD on a variety of hardware architectures and are influenced by the following factors:

1) The physical structure of the underlying hardware architecture.
2) Communication methods between the parallel FDTD components.
3) Efficiency of access to the memory and other architectural primitives of the devices supporting the parallel method.

A timing analysis of the serial FDTD in section 3.9 has shown that the bulk of the computational load is taken up by the repeated FDTD calculation of the electric and magnetic fields, as shown in the processing flow of the serial FDTD in figure 5.1 below. This chapter is directed at the parallelisation strategies for the calculation of the electric and magnetic fields only, as the processing load resulting from output of the resultant data and other effects has been purposefully minimised.


Although several commercial brand names and products, such as those of Nvidia and Intel, are used in this chapter, this is not meant to express any form of preference for these manufacturers or their competitors.

Figure 5.1: The serial form of the FDTD

The parallelisation strategies discussed below deal with how a particular parallel FDTD processing concept is adapted for a specific hardware platform, and identifies why the two are related.

5.2 Amdahl’s law, Koomey’s law

Amdahl’s law is an approximation of the maximum theoretical speedup a parallelised program can achieve on a multiprocessor system. The speedup is still measured in the conventional sense, as the ratio of the serial execution time to the parallel execution time:

$S = T_{serial} / T_{parallel}$

An algorithm is assumed to consist of two components, one that must execute in serial form and one that can be parallelised. Even if one were to speed up the parallel component of the algorithm such that it took close to no time at all, the program execution would still need to execute the finite serial component. In his original paper, Amdahl [39] did not formulate the equation as shown below in equation 5.1, but it serves to describe the speedup behaviour of the basic framework of most parallelised systems.

$$S(n) = \frac{1}{(1 - p) + \dfrac{p}{n}} \qquad (5.1)$$

Where p is the fraction of the code that can be parallelised and n is the number of processing cores.

Several assumptions are made in this formulation. These are:

1) The serial and parallel code will run at the same speed.
2) The serial and parallel code sections are not pipelined, i.e. they do not partly overlap.
3) There are no time penalties from inter-process events and thread or process management.

As the number of processing cores used in the parallelisation becomes much larger than one, the Amdahl formulation above indicates that the parallel process achieves little additional speedup. This is reflected by the near horizontal component of the theoretical speedup curve shown below:

Figure 5.2: Amdahl residual for a process that is 5% serial and 95% parallelised
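As a worked illustration of the curve in figure 5.2, substituting the 5% serial / 95% parallel split into equation 5.1 gives:

\[
S(200) = \frac{1}{0.05 + \frac{0.95}{200}} = \frac{1}{0.05475} \approx 18.3,
\qquad
\lim_{n \to \infty} S(n) = \frac{1}{0.05} = 20
\]

That is, even with 200 cores the speedup is already within 10% of its theoretical ceiling of 20, which is why the curve appears nearly horizontal.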

In practical terms, Amdahl’s law is also limited by the effects of the computing hardware used to parallelise the FDTD [8]. One has to consider that the operational burden on the operating system increases as the physical size of the multicore and memory hardware increases, and that this reduces the effect of the speedup. As the size of the memory configuration increases to cope with ever larger problems, so does the difficulty of supplying the processing cores with sufficient data [51, 89]. This exacerbates the memory access problem to the extent that the parallel process experiences a gradual “slow down” as the number of processing cores increases. This is demonstrated in figure 5.3, i.e. beyond a certain number of cores the parallel code runs increasingly more slowly.

Figure 5.3: Processing the FDTD using openMP on the M9000 showing the parallel slowdown

One aspect that is readily overlooked when discussing issues related to Amdahl’s law is that there is always scope for speeding up the serial component in order to accelerate the overall process. This points to the value of a hybrid architecture: the parallel component is accelerated in the conventional sense, while the serial component of the algorithm is assigned to the quickest serial processor available.

There are several contributors to the Amdahl residual, such as the general process start-up, the thread management overhead incurred by the operating system, and the increased memory management penalties inherent in larger systems. Figure 5.2 shows the purely theoretical Amdahl residual for a computer with up to 200 cores, for a serial component of 5% and a parallel component of 95%.

Koomey’s law [94, 95] states that processor energy efficiency doubles approximately every one and a half years. The results of an analysis [95] are shown in figure 5.4 below. Apart from giving an indication of processor performance similar to Moore’s law [103], i.e. that processor performance doubles roughly every one and a half years, it also suggests that the more modern the processor, the more energy efficient the High Performance Computing (HPC) system. The implication for the parallel FDTD is that later generation HPC platforms will provide better performance for less power.


Figure 5.4: Koomey’s assertion suggests that the computational performance of a processor doubles every one and a half years. (source of graph from [94])

5.3 Parallel FDTD processing dependencies

The conversion of the FDTD process from its serial form into a parallel form is achieved by dividing the computational load into discrete sections (or chunks) and processing each section as an individual process or thread [33]. Owing to the Maxwellian dependency between adjacent electric and magnetic field values, each cell is calculated according to a calculation template, shown as a cross in figure 5.8. This template, or a variation of it, is also known as the FDTD computation stencil. The FDTD processing threads require synchronisation after every iteration. For some processing models this data synchronisation is quite simple, as all of the threads operate within the same address space, such as with openMP on the SUN M9000. On a hybrid processing platform, such as one made up of dual GPU boards and multicore CPUs, load balancing and synchronisation of the FDTD threads become somewhat more complex.


5.3.1 Verification and precision of the parallel FDTD results

The verification of the parallel FDTD is a comparison of the output of the serial FDTD with that of the parallel FDTD. Two different processes were used to verify the results of the parallel FDTD codes during the software implementation cycle. One method entailed the comparison of explicit electric or magnetic field values at predetermined locations in the serial and parallel FDTD grids; the other consisted of comparing two dimensional image slices of the parallel FDTD computation against those from the serial FDTD computation.

The resultant electric and magnetic fields of the parallel FDTD can be written to disk in a simple image format for every iteration cycle. The resultant images from the parallel processing are computationally subtracted from the resultant images produced by the serial form of the FDTD to determine any differences between the parallel and serial calculation of the FDTD.
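Both verification methods reduce to the same elementary operation, sketched below for illustration only (array names and the calling convention are assumptions, not the thesis code): the parallel field values are subtracted from the serial ones and the largest residual is examined.

#include <math.h>

/* Return the largest absolute difference between the serial and parallel
 * electric field arrays; a value close to single-precision round-off
 * indicates that the parallel FDTD agrees with the serial benchmark. */
float max_abs_difference(const float *serial_ez, const float *parallel_ez, int ncells)
{
    float max_diff = 0.0f;
    for (int i = 0; i < ncells; i++) {
        float diff = fabsf(serial_ez[i] - parallel_ez[i]);
        if (diff > max_diff)
            max_diff = diff;
    }
    return max_diff;
}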

In some cases, such as when MPI is used on a cluster, the resultant data is produced as an image strip for the section being processed. These image strips are then combined into one continuous image showing the complete electric or magnetic field components.

Figure 5.5 below shows a comparison of the explicit numerical results from the three dimensional FDTD on three different platforms. The serial and parallel results are shown to agree well. There are minor variations in the precision of the results, very probably owing to differences in the implementation of the floating point definitions on these different computing platforms and because of the error inherent in floating point calculations [159, 160].

Figure 5.5: A comparison of the precision of the results from the serial and parallel FDTD on different computing platforms.


5.3.2 Load Balancing

It is important to synchronise all threads at the end of each iteration cycle in order to maintain the integrity of the computation. Calculating electric or magnetic values from adjacent cells that belong to different iteration cycles would corrupt the FDTD calculation; this can happen if the thread loads are unbalanced or the threads are unsynchronised. The implication for the efficiency of the process is that the parallel FDTD can only perform as quickly as its slowest thread.

It would be ideal to process the FDTD with parallel sections that all finish their respective computational loads at the same time, as thread synchronisation would then not be necessary. This is unfortunately not the case, as threads with equivalent processing loads take different times to complete owing to factors such as latencies in memory interconnects, inter-thread communication latencies, and other effects [19]. Unless synchronisation controls are built into the threads involved in the parallel computation, reducing the processing inefficiency resulting from this unbalanced processing is a manual and iterative task.

5.3.3 Thread affinity

The HPC systems used for the parallelisation of the FDTD in this work all use multicore processors, also referred to as Chip Multi Processor (CMP) [1], as the basic building block of the systems whether these are clusters, SMP, the Bluegene/P or GPUs.

These systems all have a multi-layer communication structure between the processors and the memory available to them: some processors communicate with memory on the same chip, while others must communicate with memory that is not on the same node, whether that node is a multi-processor, a Computer Memory Unit (CMU) or a Blade. The performance of communication between processors and memory on the same IC is usually much better than communication with processors or memory off the chip [51]. With threading mechanisms such as openMP, threads are continuously created and terminated, and the scenario may occur where a thread is initiated on one core but its associated data is resident in the memory of some other processor or node. The probability of this happening increases with the number of cores in a Shared Memory Processor (SMP) system, i.e. openMP performance issues are more likely on SMP systems with a large number of nodes. The Message Passing Interface (MPI) threading for the FDTD algorithm is generally coded such that a program thread remains closely tied to a processor or set of processors, so that the probability of dealing with un-coalesced data is reduced. This is illustrated in the performance of the FDTD on the M9000 SMP shown in figure 5.9 below.

It is therefore preferable to map FDTD threads to corresponding physical cores so that the affinity of process to core is high. Although it is known that processor affinity can affect the performance of the parallel process as a whole, the impact of processor affinity is not easily quantifiable [52]. A comparison of the thread duration indicators in figures 5.7 and 5.10 shows that the thread duration, and hence the thread affinity, of the MPI version of the parallel FDTD is greater than that of the openMP based FDTD. The openMP based FDTD spawns and closes new threads for every iteration, whereas the MPI based FDTD uses a fixed number of threads throughout the processing. The FDTD implemented using MPI is therefore expected to perform better than the FDTD implemented using openMP, owing to the smaller thread management overhead of the MPI approach. It is very possible that this is the reason why the FDTD using MPI shows better performance than the FDTD using openMP on the M9000 SMP, as shown in figure 5.9.
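Where explicit control over affinity is wanted, many OpenMP runtimes honour environment variables such as OMP_PROC_BIND, and on Linux a thread can also be pinned programmatically. The sketch below is a Linux-specific illustration (an assumption for illustration, not the thesis code) that binds each openMP thread to the core with the same number:

#define _GNU_SOURCE
#include <sched.h>
#include <omp.h>
#include <stdio.h>

int main(void)
{
    #pragma omp parallel
    {
        cpu_set_t mask;
        CPU_ZERO(&mask);
        CPU_SET(omp_get_thread_num(), &mask);       /* pin thread t to core t             */
        if (sched_setaffinity(0, sizeof(mask), &mask) != 0)
            perror("sched_setaffinity");            /* first argument 0 = calling thread  */
        /* ... the FDTD work for this thread would follow here ... */
    }
    return 0;
}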

5.3.4 The use of single and double precision variables

The programming work undertaken for this thesis uses single precision floating point values throughout. Contemporary Complex Instruction Set Computing (CISC) processors incorporate dedicated Floating Point Units (FPU) [164] that allow fast computations on floating point scalars of four, eight or ten bytes in length. The use of double precision floating point numbers on processors such as the Intel i5 range, or those using the Nehalem architecture, was not found to cause any performance degradation. This was confirmed by converting an FDTD program using single precision scalars to one that uses double precision values, and comparing the execution times of the two programs. The execution times of both programs were indeed the same.

The GPUs used for this thesis do not have the same hardware capability to process the double precision floating point operations as the contemporary CISC processors, and as such, these are performed by a software emulation process. The double precision emulation inevitably consumes more processing power than the single precision equivalent, and this will lead to longer program execution times. The newer Kepler GPU architecture by Nvidia provides hardware support for the processing of double precision values and will very probably not incur the same performance penalty when using double precision values. A GPU with Kepler architecture was not available for this thesis and hence the assumption made above could not be confirmed.

5.4 Parallel FDTD with openMP

5.4.1 openMP on the shared memory architecture

openMP consists of a set of simple compiler directives and an Application Programming Interface (API) that greatly simplifies the writing of parallel applications for shared memory architecture machines, such as SUN’s M9000 and Intel’s Nehalem architecture. It is a computing industry standard that evolved from the programming of vector array processors such as the Cray supercomputers of the 1990s. openMP also operates on the Distributed Shared Memory Processor (DSMP) architecture described in section 4.3.4.

The execution principle of openMP is based on the fork and join model as shown in figure 5.6, whereby several threads are simply created, executed in the same address space, and then terminated. In this manner, a process can be very easily decomposed into serial and parallel sections. As all of the parallel threads operate within the same memory address space, the speed of data communication between threads is very efficient.


Figure 5.6: openMP principle the creation and termination of threads

The loop process involving the calculation of the electric and magnetic fields is subjected to openMP decomposition, and this results in the process flow shown in figure 5.7. openMP is particularly suited to the parallelisation of loops, as shown in the code snippet below.

The following pseudo-code shows the calculation of one of the electric field components for the 2D FDTD:

Code segment 5.4:

1) int numt=3;
2) omp_set_num_threads(numt);

3) #pragma omp parallel for private(i,j)
4) for (i = 0; i < ie; i++) {
5)   for (j = 1; j < je; j++) {
6)     ex[i][j] = caex[i][j]*ex[i][j] + cbex[i][j]*(hz[i][j]-hz[i][j-1]);
7)   }
8) }

The number of threads to be used in parallel is set in line 2. One would ideally assign one thread per logical processing core; some physical cores are capable of Hyperthreading or Symmetrical Multi Threading (SMT), and this may allow two threads (two logical cores) to be deployed efficiently on one physical processing core. Using more than one thread per physical core will not necessarily improve processing performance [89], as the processing pipeline is bound to be restricted by the limited processor resources common to the multiple logical threads.

Line 3 is a compiler directive that causes the processing loop to fork into three threads, which still share the same memory address space.
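For completeness, a self-contained compilable version of code segment 5.4 is sketched below (the grid size and coefficient values are illustrative assumptions, not those used for the thesis results); it can be built with, for example, gcc -fopenmp:

#include <omp.h>
#include <stdio.h>

#define IE 512
#define JE 512

static float ex[IE][JE], hz[IE][JE], caex[IE][JE], cbex[IE][JE];

int main(void)
{
    int i, j;

    omp_set_num_threads(3);                        /* as in line 2 of code segment 5.4        */

    for (i = 0; i < IE; i++)                       /* illustrative coefficient values only    */
        for (j = 0; j < JE; j++) { caex[i][j] = 1.0f; cbex[i][j] = 0.5f; }

    #pragma omp parallel for private(i, j)         /* fork: the i loop is split over threads  */
    for (i = 0; i < IE; i++)
        for (j = 1; j < JE; j++)
            ex[i][j] = caex[i][j]*ex[i][j] + cbex[i][j]*(hz[i][j] - hz[i][j-1]);
                                                   /* join: the worker threads terminate here */
    printf("ex[1][1] = %f\n", ex[1][1]);
    return 0;
}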


Figure 5.7: The parallel FDTD using the openMP method

Figure 5.7 above shows the allocation of threads for each of the calculations of the electric and magnetic fields for a two dimensional FDTD. The three dimensional FDTD is processed in a similar manner, in that the loops that carry the largest processing load are decomposed into individual threads.

Although the FDTD computation is now spread across several threads, the electric and magnetic cell values are all calculated within the same address space, as is shown in figure 5.8. Provided that the FDTD points are all in the same iteration, field points in one thread may use adjacent field values from another thread, using the same computational stencil as is shown in figure 5.8. Despite this convenience, the calculation of electric and magnetic values has to be carefully synchronised within each iteration cycle so that the data values in adjacent thread segments are not calculated out of step, i.e. each iteration performs at least one electric and then one magnetic calculation as required by the FDTD stencil. If the threads performing the iteration were not synchronised, then one thread could proceed to the calculation of the magnetic field values using electric field values from the previous (or next) iteration cycle.


Figure 5.8: openMP threads sharing the same memory address space

The openMP threading library implementation is very simple. The conversion of the serial 3D FDTD to the parallel version using openMP took less than one hour of programming work on the M9000, and provided correct answers.

5.4.2 openMP on the Distributed Shared Memory Processor (DSMP)

The Distributed Shared Memory concept is an abstraction that supports the shared memory architecture even though the hardware is physically distributed, i.e. the operating system provides the application with a single uniform memory space, thereby hiding the physical distribution of memory. Access to memory in a shared memory architecture is more tightly coupled than access in a distributed memory architecture. Although the memory is physically resident on remote and local processors, if the inter-processor communication and memory latency is low, the operating system can present the physically distributed memory as a uniform shared memory. The FDTD implementation as discussed for the shared memory architecture in 4.2.1 also applies to the DSMP. The identical code deployed for the implementation of the parallel 2D and 3D FDTD was run on Intel’s Manycore Test Lab DSMP test system, and the results are compared to those achieved on the M9000 in figure 5.9.

5.5 Parallel FDTD with MPI

The Message Passing Interface (MPI) was designed to enhance the portability of parallel applications, primarily for distributed machine architectures such as clusters of PCs, but it also supports shared memory architectures. Typical hardware platforms that support MPI are high performance cluster computers, the MIC, IBM’s Bluegene/P and also the SUN M9000 Shared Memory Processor. All of these hardware platforms can use exactly the same parallel FDTD code created with the MPI method. This standardisation of the MPI definition across different computing platforms is a great advantage from the aspect of deployment effort.


As with openMP, the MPI standard was first supported by the computing industry in the 1990s. MPI has its foundation in the Parallel Virtual Machine (PVM) computing standard, which is now rarely used.

The implementation strategy of MPI is to make it programmatically simple to spawn program threads from a master program and to provide an intuitive communication protocol for data and events between these threads. Each thread has its own memory address space, and data values that are shared between threads must be explicitly communicated.

The fundamental difference between the MPI and openMP implementations of the parallel FDTD is that the MPI threads are spawned in their own local address spaces, whereas the openMP threads all operate within the same address space. MPI implementations also perform well on shared memory architectures, in an operational state known as Nemesis mode [17], as the data communicated between the threads is passed as memory address references and not as actual values [17]. Typically the parallel FDTD performs as well with openMP as with MPI on shared memory architectures with a small number of cores (fewer than eight), but performs better with MPI than with openMP on a system with a large number of cores, as is indicated by the graph in figure 5.9. The MPI and openMP versions of the FDTD are not always equally efficient owing to the differences in thread management and processor (or thread) affinity [51] exhibited by these two programming strategies, as described in section 5.3.3. The MPI strategy uses one thread per FDTD processing chunk, which associates a specific FDTD processing thread with a single core. The openMP strategy, by contrast, spawns a new set of threads for every large loop event and closes these threads after the loop event, i.e. an FDTD openMP thread is subject to much more context switching, leading to a greater performance penalty than that incurred by the more persistent MPI threads. The result of this performance penalty is manifested in the comparison of the openMP and MPI based FDTD in figure 5.9.

Figure 5.9: A comparison of the throughput of the parallel FDTD with openMP or MPI on the 256 core M9000 shared memory processor.


The MPI threads that allow the FDTD to be implemented in parallel are shown in figure 5.10 below. They consist of an MPI thread initialisation and a directive from the command line specifying how many threads are to be run. To maximise performance, the most efficient mapping of the MPI threads onto the hardware is to allocate one processing thread per physical core.
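A minimal sketch of this initialisation is given below (illustrative only, not the thesis code); the number of processes is requested on the command line, for example with mpirun -np 8:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, nranks;

    MPI_Init(&argc, &argv);                    /* start the MPI runtime                      */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);      /* identifier of this process (thread)        */
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);    /* number of processes requested at launch    */

    printf("FDTD chunk %d of %d\n", rank, nranks);
    /* ... each rank updates its own chunk of the FDTD grid here ... */

    MPI_Finalize();
    return 0;
}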

The FDTD modelling space has been divided up into proportional FDTD cell chunks for each thread and the electric and magnetic field values are calculated for that chunk.

Figure 5.10: Parallel FDTD with MPI

Figure 5.11 shows how a two dimensional modelling space is divided between three threads, each thread having its own memory address space. Apart from synchronising the threads at the end of each iteration, some exchange of adjacent data is required by the FDTD, as described further below. This implementation of the algorithm uses very intuitive MPI interface primitives to exchange data between the threads, as is described in the pseudo-code below:

Code segment 5.5:

Loop for number of columns {
  Loop for number of rows {
    Calculate magnetic field values according to electric field values in the FDTD stencil
  }
}

Now exchange magnetic field fringes between threads:

MPI_Recv(Receive column of magnetic fringe values from thread B to replace in thread A)

MPI_Send(Send column of magnetic fringe values from thread A to replace in thread B)

The schematic in figure 5.11 below shows the processing of an FDTD grid using three threads. The electromagnetic field values are calculated from column 1 to column 5. Electric and magnetic field values from column 6, the fringe cells of the data component in thread A, are required to compute the values in column 5. The magnetic and electric values in column 6 cannot be updated by thread A because not enough electric and magnetic cell values are available to satisfy the requirements of the FDTD stencil, i.e. the column 7 values are resident in thread B. Field values for column 6 are therefore calculated in thread B and are used to update the electric and magnetic values of column 6 in thread A after the iteration cycle has completed. The pseudo-code snippet above shows the calculation of the magnetic field FDTD values for a thread and the exchange of fringe data with the MPI_Recv and MPI_Send instructions. A similar exchange of fringe data occurs after the electric fields are updated.
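The following sketch shows one way the magnetic fringe exchange described above could be written with the MPI primitives. The decomposition details (a flat hz array of nrows x (local_cols+1) values per rank, with the fringe stored in the last column) are assumptions made for illustration and not the thesis implementation:

#include <mpi.h>

/* After the magnetic field update, each rank (thread A) receives the first owned
 * column of its right-hand neighbour (thread B) into its fringe column, and sends
 * its own first owned column to its left-hand neighbour. */
void exchange_hz_fringe(float *hz, int nrows, int local_cols,
                        int rank, int nranks, MPI_Comm comm)
{
    MPI_Status status;
    int right = rank + 1;
    int left  = rank - 1;

    if (right < nranks)    /* fringe column stored after the local_cols owned columns */
        MPI_Recv(&hz[local_cols * nrows], nrows, MPI_FLOAT, right, 0, comm, &status);
    if (left >= 0)
        MPI_Send(&hz[0], nrows, MPI_FLOAT, left, 0, comm);
}

In practice a combined MPI_Sendrecv (or non-blocking MPI_Isend/MPI_Irecv) call avoids the serialisation that the blocking receive-then-send ordering above introduces, and a similar exchange in the opposite direction follows the electric field update.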

Figure 5.11: Parallel FDTD segments communication fringe data values

The performance of the parallel FDTD implementation depends on the speed of the data exchange described above, which in turn depends on the data communication latency of the exchange and on the degree to which the fringe data exchange has been synchronised. Where the parallel FDTD with MPI is implemented on a shared memory platform, the fringe data transfer is an exchange of memory addresses only, as opposed to physically transmitting the column values from one thread to another. This results in a negligible data communication latency and an improvement in the performance of the parallel FDTD algorithm. In this situation MPI is said to be operating in Nemesis mode [17].


5.6 Acceleration of the FDTD with vector arithmetic logic units/ Streaming SIMD extensions and cache prefetching

The current range of commercial mainstream CPUs has a special set of very long registers and associated instructions incorporated within the CPU that can be used for Single Instruction Multiple Data (SIMD) processing of numerical data. Although this set of registers and instructions was initially envisaged for the on-chip processing of image and video data, it has since been conveniently applied to accelerate the processing of numerical arrays. One implementation of this on-chip feature is the Streaming SIMD Extension (SSE), which is used extensively for the decoding and encoding of video data.

The SSE is a Single Instruction Multiple Data (SIMD) capability incorporated in the current range of Intel and AMD processors. SSE provides eight 128 bit registers and several dedicated operators for the manipulation of data within these registers. Each SSE register can accommodate four 32 bit (single precision) floating point numbers. The term Vector Arithmetic Logic Unit (VALU) is also used to describe the SSE functionality [100]. The SSE specification has evolved to provide vector arithmetic functionality in a standard known as Advanced Vector Extensions (AVX), which uses registers 256 bits in length.

Figure 5.12 below illustrates the distinction between the computation of a single 32 bit scalar value and an SSE computation, in which a single operation processes a packed (4 x 32 = 128) bit value. Although these operations can be programmed at assembly language level, it is much more convenient and efficient to use the intrinsic compiler functions provided for the specific CPU and compiler. It must also be noted that the adjacent location of electromagnetic cell points requires “out of order” (unaligned) reads, and that this slows the processing of the FDTD using SSE.

The FDTD code has been programmed with single precision floating point arithmetic (32 bits) in order to fit the greatest number of floating point numbers into an SSE register (128 bits). Considering that the SSE allows SIMD calculations on four single precision floating point numbers simultaneously, we would expect a speedup of approximately four times over an operation using only a single conventional scalar value. Figure 5.12 shows a comparison of how single value instructions and vector value instructions access memory and operate. The theoretical speedup produced by making use of the SSE vector registers should therefore be in the order of four times per core. In practical terms, this speedup is very dependent on the correct alignment of data with the length of the vector processing registers (128 bits) and the speed with which data is moved in and out of the 128 bit SSE registers.


Figure 5.12: Schematic showing the two fold advantage that AVX has over SSE (adapted from Yu et al. [100])

The practical aspects of this are shown in the snippet of pseudo-code for the calculation of the electric field by the FDTD using the vector ALU mechanism:

Code segment 5.6.1:

For i = 0 to end in steps of 4
    Load 4 scalar values of electric, magnetic and coefficients into the VALU     } intrinsic SSE
    Calculate 4 values of electric simultaneously with single SSE instruction     } intrinsic SSE
Next 4 values in array

Instead of calculating the scalar values of an array with a simple unit-incremented loop, a stepped loop is used to perform a single instruction on four values simultaneously. Figure 5.13 below shows a schematic of three threads used to implement the parallel FDTD with the openMP library. Each thread is in turn accelerated using the SSE functionality, the SIMD operation on four cells being shown by the rectangular bars.
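A compilable sketch of the stepped loop is given below using the SSE compiler intrinsics; the flattened array arguments and their alignment are assumptions made for illustration rather than the thesis code itself:

#include <xmmintrin.h>   /* SSE intrinsics */

/* Update n ex cells, four at a time: ex = caex*ex + cbex*(hz - hz_jm1), where hz_jm1
 * points to the hz values shifted by one cell (hz[i][j-1]).  n is assumed to be a
 * multiple of 4 and ex, caex, cbex, hz to be 16 byte aligned. */
void update_ex_sse(float *ex, const float *caex, const float *cbex,
                   const float *hz, const float *hz_jm1, int n)
{
    for (int i = 0; i < n; i += 4) {
        __m128 vex  = _mm_load_ps(&ex[i]);
        __m128 vca  = _mm_load_ps(&caex[i]);
        __m128 vcb  = _mm_load_ps(&cbex[i]);
        __m128 vhz  = _mm_load_ps(&hz[i]);
        __m128 vhzm = _mm_loadu_ps(&hz_jm1[i]);   /* shifted data: an unaligned ("out of order") load */
        __m128 vres = _mm_add_ps(_mm_mul_ps(vca, vex),
                                 _mm_mul_ps(vcb, _mm_sub_ps(vhz, vhzm)));
        _mm_store_ps(&ex[i], vres);
    }
}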


Figure 5.13: The parallel FDTD implemented with openMP, showing a snapshot of the cells subject to SIMD processing by the SSE registers within each thread.

In practice this potential fourfold speedup is not easily achieved, as the movement of data in and out of the registers needs to be highly optimised to reduce the memory access latency and memory channel bottlenecking. In addition, the “out of order” reading of cells is also a reason why the SSE does not achieve its full potential speedup. Although the implementation of cache prefetching has been shown to reduce the throttling effect of bottlenecking and memory latency [100], the programming for this was found to be complex, and in this work prefetching did not produce any detectable improvement in the sustained data throughput.

The parallel conversion of the FDTD using the SSE vector registers performed equally well with the openMP and the MPI libraries on an eight core Nehalem processor, as shown below in figure 5.14. With the advent of multicore technology, every core on a multicore processor has the vector processing functionality provided by the SSE, and this aggregation of performance can be seen in figure 5.14 below. A conventional parallel FDTD algorithm was converted so that the main processing loops make use of the SSE registers in each of the cores.


Figure 5.14: A comparison of the performance of the FDTD acceleration on an Intel Nehalem processor with eight cores using either the openMP or the MPI threading libraries.

The AVX can be implemented in a similar manner to the SSE although consideration for memory alignment has to be made for the longer register lengths. Code segment 5.6.2 shows the sections of the FDTD code used for the AVX implementation. The correct alignment of data in memory and data contiguity are key to the efficient performance of the AVX process.

Code segment 5.6.2: Lines 1 to 9 show the definition of the 256 bit AVX variables, and line 10 shows how the AVX data is aligned on a 32 byte memory boundary.

1) __m256 *aoh0;
2) __m256 *aoh1;
3) __m256 *aoh2;
4) __m256 *aoh3;
5) __m256 *aoh4;
6) __m256 *aoh5;
7) __m256 *aoh6;
8) __m256 *aoh7;
9) __m256 *aoh8;

10) pointer = (float **)_aligned_malloc(imax * sizeof(float *), 32);

Code segment 5.6.3: Lines 11 to 35 show the FDTD processing loop used for the calculation of the electric field values. The instructions prefixed with _mm256 are the AVX intrinsics used to add, multiply, subtract or load the FDTD coefficient and grid data; ie and je are the sizes of the FDTD grid.

11) for (i = 1; i < ie; i++) {
12)   for (j = 0; j < je; j += 8) {

13)     aoh0 = (__m256 *)&ex1[i][j];
14)     aoh1 = (__m256 *)&caex[i][j];
15)     aoh2 = (__m256 *)&cbex[i][j];
16)     aoh3 = (__m256 *)&hz1[i][j];
17)     aoh5 = (__m256 *)&ey1[i][j];
18)     aoh6 = (__m256 *)&caey[i][j];
19)     aoh7 = (__m256 *)&cbey[i][j];
20)     aoh8 = (__m256 *)&hz1[i-1][j];

22)     hz_shift = &hz1[i][j-1];
23)     amDst = (__m256 *)hz_shift;
24)     amTar = _mm256_loadu_ps((float *)amDst);    /* unaligned load of the shifted hz values      */

26)     am1 = _mm256_mul_ps(*aoh0, *aoh1);          /* caex * ex                                    */
27)     am2 = _mm256_sub_ps(*aoh3, amTar);          /* hz[i][j] - hz[i][j-1]                        */
28)     am3 = _mm256_mul_ps(*aoh2, am2);            /* cbex * (hz[i][j] - hz[i][j-1])               */
29)     am4 = _mm256_add_ps(am1, am3);              /* updated ex                                   */

30)     am5 = _mm256_mul_ps(*aoh5, *aoh6);          /* caey * ey                                    */
31)     am6 = _mm256_sub_ps(*aoh8, *aoh3);          /* hz[i-1][j] - hz[i][j]                        */
32)     am7 = _mm256_mul_ps(*aoh7, am6);            /* cbey coefficient; the source listed aoh6
                                                       (caey) here, aoh7 (cbey) is assumed correct */
33)     am8 = _mm256_add_ps(am5, am7);              /* updated ey                                   */

34)     _mm256_store_ps(&ey1[i][j], am8);
35)     _mm256_store_ps(ex1[i]+j, am4);

A comparison of speedup of the parallel FDTD using the SSE or AVX functionality on the four cores of the Intel i5-2500k is shown below in figure 5.15. The reduction in performance of the FDTD with AVX on the four cores of the multicore processor is very probably owing to memory bottlenecking. The memory bottlenecking can be alleviated by making use of the multiple physical memory channels providing data to the processor [18], as is shown in figure 5.17.

Figure 5.15: A comparison of the parallel FDTD using the MPI libraries for acceleration with both the SSE and AVX using a single physical memory channel. The FDTD model size is 20 million cells.

Although some compilers provide automated vectorisation, the optimisation provided is not yet as efficient as manually coded VALU functionality, as is shown in figure 5.16.


Figure 5.16: The speedup of the FDTD using AVX (256 bit registers) functionality on a four core processor for a grid size of 20 million cells. Task parallelism was achieved with the openMP libraries.

Figure 5.17: A comparison of the parallel FDTD implemented with MPI and AVX showing the improvement in performance when using dual physical memory channels. The FDTD model size is 20 million cells.

5.7 Parallel FDTD for the GPU using CUDA

CUDA is a proprietary programming interface for Nvidia GPUs. It consists of functionality that allows the manipulation of data and processes between the CPU and GPU, and on the GPU itself. Important to the mapping of the parallel FDTD onto the GPU is the hardware architecture of the GPU, shown in outline in figure 5.18 below. Figure 5.18 shows a comparison of the basic CPU and GPU structure. A major difference is that the CPU has only a few processing cores (four in this example) whereas the GPU has many. Although the CPU core has an instruction set that is more diverse in logic than that of the GPU core, the CPU and GPU are both able to process a similar set of scalar arithmetic instructions. GPU threads are excellent for processing arithmetic calculations such as addition, multiplication, division and subtraction, but are not as efficient as CPUs at processing conditional instructions such as “IF” statements. This is one of the principal differences between the processing flow on the CPU and on the GPU. Although there is an aspect of SIMD processing in the implementation of the FDTD on the GPU with CUDA, Nvidia’s description of the process as Single Instruction Multiple Threads (SIMT) is more appropriate, as all threads in a Streaming Multiprocessor (SM) process the same instruction. A key concept of CUDA programming is the “warp”. The warp is a group of 32 threads that process instructions in lockstep and is the smallest unit of work issued by a GPU.

As the parallel FDTD implementation strategy for the GPU described here is very dependent on the GPU architecture, a brief description of this is given below.

5.7.1 CUDA GPU architecture

In CUDA GPU terminology, the GPU is a hierarchical architecture made up of several vector processors called Streaming Multiprocessors (SM). Each SM contains an array of processing cores termed Stream Processors (SP) which are managed under the collective of the streaming multiprocessor. As a comparison to contemporary processors such as Intel’s eight core Nehalem CPU, Nvidia’s GTX 480 has 15 streaming multiprocessors, each containing 32 stream processors. That is, the GTX 480 has available 480 processing cores (stream processors). All of these processing cores have access to the global GPU memory via the GPU’s memory interconnect. All of the cores within a streaming multiprocessor collective also have access to a type of memory with very low access latency, known as shared memory.

All of the threads in one streaming multiprocessor will perform the same instructions. Nvidia has coined the term SIMT (Single Instruction Multiple Thread) for this. The GPU supports the processing of many more logical threads than there are SPs (stream processors or cores) available and the distribution of these logical threads on to physical cores is determined by the GPU’s thread scheduling mechanism. Allowing the thread scheduler to manage many more threads than there are SPs (cores) is an attempt to reduce the effect of memory access latency, an important consideration when allowing many cores to access the global memory via the SMs.


Figure 5.18: The cores and memory associated with a GPU (derived from Nvidia programming manual [82])

5.7.2 Parallel FDTD model on the GPU

Figure 5.19 shows the processing flow of the parallel FDTD on the CPU and the GPU. All of the data elements involved in the FDTD modelling have to be stored in the GPU’s global memory. To take advantage of the SIMT arrangement afforded by the GPU cores, it is crucial to maintain the continuity of the array of data stored in the GPU’s memory as it is presented to the threads. The relevance of this point becomes more obvious if one views the many cores of the GPU as forming a vector processing array. The communication overhead of the movement of data to the GPU’s many processing threads has to be optimised in a way that is very similar to the arrangement and alignment of data consumed by the SSE and AVX vector registers, as described in sections 4.3.2 and 5.6.

The principle of achieving higher efficiency by processing contiguous data elements and aligning data elements correctly in memory is not a new one, as is revealed by the inefficiency of traversing row- or column-major data in the wrong order on a CPU. Good memory alignment according to register length is necessary for all FDTD parallelisation methods using a wide vector processing scheme, not only for the GPU.

The GPU is a case in point, as its large number of processing cores (stream processors) do not have an exclusive channel to memory but must be supplied with data via the GPU's Streaming Multiprocessors (SM). The term used to describe the spatial ordering of data supplied to the Stream Processors (SP) in an optimised manner is coalescence [81]. If the data supplied to the SPs is out of order or not aligned correctly with the SM, the data is not coalesced, which leads to long data access (latency) periods and hence low efficiency. When data is ordered in a manner that is optimally matched to the SPs, the data is described as being coalesced, which leads to low memory latency, good memory bandwidth and, in turn, efficient algorithmic performance on the GPU.


Figure 5.19: The FDTD process flow on the GPU

The CUDA code segment 5.7.1 serves as a framework to describe how the two dimensional FDTD is parallelised on the GPU. The code fragment is taken from a GPU kernel, which is the part of the FDTD processing flow for the GPU shown in figure 5.19. The GPU kernel is a segment of code built exclusively for execution on the GPU device itself. The same snippet of code is executed on every core in the GPU and calculates the electric and magnetic field values required for one sweep of the FDTD grid, as is shown in figure 5.20. As an example, a rectangular grid of cells is processed as a one dimensional array using all of the GPU cores as a quasi vector array processor. Ideally, the FDTD calculation for each cell in a column is performed simultaneously, one cell per GPU thread or core. In practice, the number of FDTD cells to be computed in one column will not be equal to the number of GPU cores (SP) available, and the number of threads used to compute the FDTD cells needs to be programmatically regulated so that most of the GPU cores (SP) are occupied by threads during processing. The dark rectangular area shown sweeping over the grid in figure 5.21 represents the number of threads being processed simultaneously.


Figure 5.20: The processing sweep of threads for the 2D FDTD

The CUDA code segment 5.7.1 below represents the processing activity on the GPU kernel for one FDTD iteration cycle. The sweep for this code is one column in width, with reference to figure 5.21. The FDTD variables and coefficients used in this code are all conventional two dimensional coordinate variables as defined in [3]. idx is a thread identifier and is a product of the block size (which is a thread virtualisation of the SM), the grid size specifying the number of blocks, and the thread identification per block. The values of the block and grid sizes have an effect on the performance of this kernel in that they determine the number of effective threads in an SM, also referred to as the occupancy. The conditional statements containing blockparx and blockpar1 serve to define the computational limits. The variable kdx manages the assignment of thread and data, and is key to the performance of this code, as will be shown in the description of coalescence further below. The value jb is the size of a row or column, depending on the order of processing.

Code segment 5.7.1:

1) int idx = blockIdx.x*blockDim.x + threadIdx.x;
2) for(i=0;i<ncols;i++){            /* loop bound assumed: sweep across the grid columns */
3)   if (idx < (nrows-1)){
4)     kdx=i*jb+idx;
5)     if ((kdx < kx) && ((kdx%blockparx)!=0)){
6)       ex[kdx]=caex[kdx]*ex[kdx]+cbex[kdx]*(hz[kdx]-hz[kdx-1]);
7)     }
8)   }
9)   if (idx < (nrows-1)){
10)    kdx=i*jb+idx;
11)    if ((kdx < kx) && (kdx > blockpar1)){
12)      ey[kdx]=caey[kdx]*ey[kdx]+cbey[kdx]*(hz[kdx-blockparx]-hz[kdx]);
13)    }
14)  }
15) }

16) __syncthreads();

17) for(i=1;i<ncols;i++){           /* loop bound assumed: sweep across the grid columns */
18)   if (idx < (nrows-1)){         /* guard and index assumed, by analogy with lines 3 and 4 */
19)     kdx=i*jb+idx;
20)     hz[kdx] = dahz[kdx] * hz[kdx] + dbhz[kdx] * ( ex[kdx+1] - ex[kdx] + ey[kdx] - ey[kdx+jb] );
21)     hz[sourcepoint]=source[kloop];
22)
23)   }
24) }
25) }
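
On the host side, a kernel of this form might be launched as sketched below; the kernel name fdtd_step, its argument list and the block size of 256 threads are hypothetical and only illustrate how the block and grid sizes determine the number of threads, and hence the occupancy, referred to above.

/* Hypothetical host-side launch of a kernel such as the one above; the kernel
   name, argument list and block size are illustrative only. */
__global__ void fdtd_step(float *ex, float *ey, float *hz, int nrows, int ncols, int jb);

void run_iteration(float *ex, float *ey, float *hz, int nrows, int ncols, int jb)
{
    int threadsPerBlock = 256;                                     /* a multiple of the 32-thread warp */
    int blocks = (nrows + threadsPerBlock - 1) / threadsPerBlock;  /* enough threads to cover one column of rows */
    fdtd_step<<<blocks, threadsPerBlock>>>(ex, ey, hz, nrows, ncols, jb);
    cudaDeviceSynchronize();                                       /* wait for this FDTD iteration to complete */
}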

Three requirements for optimising the performance of this GPU code arrangement are:

1) Optimise the number of logical threads being processed in comparison to the physical cores so that the ratio of occupied cores is large.

2) Ensure that the memory being accessed is "coalesced", i.e. the memory accessed by threads must be arranged in a contiguous manner and aligned with the streaming multiprocessor processing it. This very issue has the greatest effect on the processing efficiency of the FDTD on the GPU, as is shown in figure 5.24.

3) Keep the number of conditionals to a minimum. The use of "if" statements leads to logical divergence in the warp and this can impact efficiency. The code kernel shown above was tested by commenting out the "if" statements and this was found to have little effect on the performance.

A simplistic schematic representation of coalescence is shown in the example below.

Consider a two dimensional matrix of data values:

0 1 2 3

4 5 6 7

8 9 a b

where 0 1 2 3 4 5 6 7 8 9 a b is stored in consecutive memory locations. If the GPU thread requires 3 data access cycles, then a continuous memory allocation of data to the cores, as is shown in figure 5.23a, would be:

         Access 1   Access 2   Access 3
Core 1      0          4          8
Core 2      1          5          9
Core 3      2          6          a
Core 4      3          7          b

where Access 1, Access 2, and Access 3 are successive read cycles by the respective core.

The GPU threads are allocated a portion of data that is aligned with the size of the physical core structure in a streaming multiprocessor. The data can be supplied to the cores within a minimum number of read steps. In a simplistic sense, the data would be coalesced. Algorithmically, the GPU will try to coalesce data by using the following protocol [155]:


1) For each memory transaction, find the memory segment that contains the address requested by the lowest numbered thread.
2) Find all threads whose requested address is in this segment.
3) Reduce the segment size if possible.
4) Repeat the above process until all threads in a half-warp are served.

Consider that the FDTD data storage is aligned in a vertical sense on a grid pattern, so that a series of vectored processors would process as one continuous sequence of data points, as is shown in figure 5.21.

Figure 5.21: The processing sweep for the FDTD which is one column in width.

According to the extract of code shown below, consecutive GPU threads would read adjacent memory locations, assigned by the array identifier kdx=i*jb+idx, where jb is the size of the column in this instance.

Code segment 5.7.2:

1) for(i=0;i<ncols;i++){        /* loop bound assumed: sweep across the grid columns */
2)   if (idx < (nrows-1)){
3)     kdx=i*jb+idx;
4)     ex[kdx]=caex[kdx]*ex[kdx]+cbex[kdx]*(hz[kdx]-hz[kdx-1]);
5)   }
6) }

If however the data is allocated to the cores in the following read order:

         Access 1   Access 2   Access 3
Core 1      0          1          2
Core 2      3          4          5
Core 3      6          7          8
Core 4      9          a          b


then several out-of-order read operations would be required from the memory, as is shown in figure 5.23b. The memory access would not be coalesced and would be very inefficient for the GPU or other SIMD based processors. This example is an over-simplification, as the addressing requirements are aligned to various memory boundaries in order to achieve proper efficiency [24, 81, 86]. A corresponding GPU kernel code snippet that illustrates the concept of uncoalesced data access is shown as code segment 5.7.3:

Code segment 5.7.3:

1) for(i=0;i<nrows;i++){        /* loop bound assumed: sweep across the grid rows */
2)   if (idx < (ncols-1)){
3)     kdx=idx*(jb + i);
4)     ex[kdx]=caex[kdx]*ex[kdx]+cbex[kdx]*(hz[kdx]-hz[kdx-1]);
5)   }
6) }

Although the data is stored in the same column format as indicated in the previous example, the FDTD grid space is now processed with a row-by-row sweep, as is shown in figure 5.22. The thread allocated to a particular data element is assigned by the array identifier kdx=idx*(jb + i). Using this identifier, consecutive threads are allocated data elements separated in memory by an address space of jb*(size of a floating point value). This pattern of memory access by consecutive cores is described by figure 5.23b.

Figure 5.22: The processing sweep for the FDTD, which is one row in width.

The result of deploying the FDTD on the GPU in an optimised state is shown in figure 5.24.


Figure 5.23: Simplified schematic of coalesced and un-coalesced memory access. Adapted from [81, 82].

A further improvement in GPU processing efficiency can be attained by reducing the memory access latency through the use of the shared memory available in every SM [27, 56]. The cache memory on a CPU has the same relationship to the CPU's cores as the shared memory has to the Stream Processors (SP) belonging to one Streaming Multiprocessor (SM). The shared memory has very low latency but is limited in size when compared to the global memory. Another algorithmic consideration is that only cores within the same SM can access the shared memory belonging to that SM. The memory-to-thread alignment and memory continuity criteria required to achieve coalescence for the GPU using the global memory now become even more stringent, as the number of threads involved in one sweep is limited by the amount of shared memory available.
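
A minimal sketch of this use of shared memory is given below, assuming the one dimensional storage convention described earlier; the tile size, kernel name and field names are illustrative only and do not reproduce the thesis code.

#define TILE 256   /* threads per block; also the number of hz values staged per tile */

/* Hypothetical kernel staging the hz values needed by the ex update in the
   low-latency shared memory of one SM (launch with TILE threads per block). */
__global__ void update_ex_shared(float *ex, const float *hz,
                                 const float *caex, const float *cbex, int n)
{
    __shared__ float s_hz[TILE + 1];                  /* one extra element holds the left halo value */
    int kdx = blockIdx.x * blockDim.x + threadIdx.x;

    if (kdx < n)
        s_hz[threadIdx.x + 1] = hz[kdx];              /* coalesced load from global into shared memory */
    if (threadIdx.x == 0)
        s_hz[0] = (kdx > 0 && kdx <= n) ? hz[kdx - 1] : 0.0f;   /* left halo value of this tile */
    __syncthreads();                                  /* make all shared memory loads visible to the block */

    if (kdx > 0 && kdx < n)
        ex[kdx] = caex[kdx] * ex[kdx]
                + cbex[kdx] * (s_hz[threadIdx.x + 1] - s_hz[threadIdx.x]);
}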

Figure 5.24 compares the FDTD throughput of Nvidia's GTX 480 gaming GPU against Nvidia's C2070 purpose-built general purpose GPU numerical processor. The GTX 480 shows results for an FDTD grid of 20 million cells as it only has access to 1.5 GB of memory, as opposed to the C2070, which has access to six GB of dedicated memory.

The computation of the split field boundary conditions is shown in CUDA kernel code segment 5.7.4, which represents the calculation of the left side PML of a two dimensional FDTD grid. The program variables in code segment 5.7.4 are:

idx, kdx, pml_idx: GPU thread and array identifiers
iebc: width of the PML, in this case eight cells wide
eybcl, hzybcl, hzxbcl: split field electric and magnetic PML components
ey, hz: components of the electric and magnetic fields
caeybcl, cbeybcl, caey, cbey: coefficient values.


The most complex programming task in converting this split field PML to the GPU is line 7 of the code segment 5.7.4, because it interleaves the main electromagnetic FDTD grid values ey and hz with the split field PML values hzxbcl and hzybcl. As all the data values have been stored as one dimensional arrays to facilitate optimised processing on the GPU, as described in the text above, the programmatic alignment of these arrays is quite delicate.

In hindsight, it would have been simpler to code the FDTD for the GPU with a technique such as the CPML [3, 27, 163], which does not require the computation of discrete boundary areas. According to [163] the CPML technique also provides better absorption characteristics and simulation results.

Code segment 5.7.4:

// left PML
1) if((idx < iebc) && (idx > 1)){
2) for(j=0;j

Figure 5.24: The effect of coalesced memory with the FDTD algorithm on two different GPUs


5.8 Parallel FDTD with Low Level threading

The basis of the threaded operation for both openMP and MPI is the generic threading functionality provided by the POSIX [146] or Windows threading libraries. An early implementation of the FDTD on a cluster with MPI [92] discusses the use of POSIX threads to exploit the dual core processors used in the cluster. In a contemporary scenario this hybrid FDTD deployment would very probably be implemented using a combination of the MPI and openMP threading libraries.

A parallel form of the FDTD, very similar in operation to either the openMP or MPI versions described above, can be created using the basic threading libraries by spawning threads to process segments of the electric and magnetic fields on a shared memory or NUMA architecture, as is sketched below. The communication of data between threads and the synchronisation of thread events would need to be devised specifically for that program. The low level programming overhead for this exercise would be substantial and complex, and as this functionality is already provided by application programming interfaces such as openMP and MPI, it will not be re-examined here.
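
The sketch below illustrates this approach with POSIX threads; the thread count, field names and the barrier-based synchronisation are illustrative only.

#include <pthread.h>

/* Hypothetical POSIX-thread sketch: each thread updates one contiguous segment of
   the ex field and a barrier provides the synchronisation between update stages.
   The thread count and field names are illustrative only. */
#define NTHREADS 4
extern float ex[], caex[], cbex[], hz[];
extern int ncells;
static pthread_barrier_t step_barrier;

static void *worker(void *arg)
{
    int t = (int)(long)arg;
    int chunk = ncells / NTHREADS;
    int lo = t * chunk, hi = (t == NTHREADS - 1) ? ncells : lo + chunk;
    for (int k = (lo > 0 ? lo : 1); k < hi; k++)       /* update this thread's segment of ex */
        ex[k] = caex[k] * ex[k] + cbex[k] * (hz[k] - hz[k - 1]);
    pthread_barrier_wait(&step_barrier);               /* synchronise all segments before the next stage */
    return NULL;
}

void parallel_ex_update(void)
{
    pthread_t tid[NTHREADS];
    pthread_barrier_init(&step_barrier, NULL, NTHREADS);
    for (long t = 0; t < NTHREADS; t++)
        pthread_create(&tid[t], NULL, worker, (void *)t);
    for (int t = 0; t < NTHREADS; t++)
        pthread_join(tid[t], NULL);
    pthread_barrier_destroy(&step_barrier);
}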

Although the higher level threading methods with openMP and MPI are very popular, the existence and availability of basic threading needs to be mentioned, as it allows for the parallelisation of the FDTD on hardware platforms, such as Android phones, where MPI or openMP implementations are not available.

5.9 Parallel CUDA on a cluster of multiple GPUs

Figure 5.25 is a system architecture [153] that illustrates the configuration of multiple GPUs attached to a worker node or host PC. The multiple GPUs are attached to the worker node via a dedicated Nvidia switch which uses a single PCIe form factor. The CHPC achieves a similar architectural configuration although individual GPUs are attached to the worker node by dedicated PCIe form factors, i.e. the worker node uses multiple PCIe channels to accommodate several GPUs directly.


Figure 5.25: Hardware configuration using multiple GPUs, adapted from [153]

The method of distributing the FDTD over several GPUs to create a parallel FDTD system is based on a strategy that is very similar to processing the FDTD with MPI, as described in section 5.5. The FDTD modelling space is broken into several chunks and each chunk is processed on a GPU. The fringe values of the FDTD chunks are interchanged in the manner shown in figure 5.26. On Nvidia GPUs, the GPUdirect technology allows GPUs attached to the same node to communicate data between their respective global memories directly. Nvidia's GPUdirect technology also allows the communication of fringe data between the dedicated memories of GPUs that are not on the same node.
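
The sketch below illustrates one way such a fringe exchange could be coded for a single time step, with each MPI rank driving one GPU and the fringe columns staged through host buffers (GPUdirect would allow these staging copies to be removed); all names and the column-contiguous storage assumption are illustrative.

#include <mpi.h>
#include <cuda_runtime.h>

/* Hypothetical fringe exchange for one time step: each MPI rank owns one GPU chunk
   of the grid (columns stored contiguously, nrows values per column) and swaps its
   right-most hz column with its neighbours, staged through host buffers.  left and
   right are neighbour ranks (MPI_PROC_NULL where no neighbour exists). */
void exchange_fringe(float *d_hz, int nrows, int ncols, int left, int right,
                     float *h_send, float *h_recv, MPI_Comm comm)
{
    size_t bytes = nrows * sizeof(float);

    /* copy this chunk's right-most column of hz from the GPU to the host */
    cudaMemcpy(h_send, d_hz + (size_t)(ncols - 1) * nrows, bytes, cudaMemcpyDeviceToHost);

    /* send the fringe to the right neighbour and receive the left neighbour's fringe */
    MPI_Sendrecv(h_send, nrows, MPI_FLOAT, right, 0,
                 h_recv, nrows, MPI_FLOAT, left, 0,
                 comm, MPI_STATUS_IGNORE);

    /* place the received column into this chunk's left ghost column */
    cudaMemcpy(d_hz, h_recv, bytes, cudaMemcpyHostToDevice);
}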

Figure 5.26: The interchange of fringe data from FDTD chunks on multiple GPUs


The system of multiple GPUs in [153] achieved a cell throughput of 7910 MCPS using eight worker nodes with a total of 32 GPUs. The system exhibits good performance scalability and is shown to process FDTD grid sizes of up to 3000 million FDTD cells using 64 GPUs. Interconnecting the worker nodes with an InfiniBand switch, as shown in figure 5.25, allows the easy addition of more worker nodes and GPUs and therefore makes the system highly scalable.

The process flow of the parallel FDTD on multiple GPUs is illustrated schematically in figure 5.27. It shows that the basic structure of the algorithm is similar to an MPI implementation: the bulk of the FDTD computations are performed on the GPUs and MPI is used to control the process flow between worker nodes.

Figure 5.27: The parallel FDTD on multiple GPUs

Using the openCL interface [158] and MPI, a similar FDTD implementation on a GPU cluster [92] was found to achieve a throughput of up to 3000 MCPS using 16 GPUs. Stefanski et al. [92] show that the communication overhead between GPUs is proportional to the area of the chunk fringes over which data is exchanged. The hardware configuration used for this implementation has a large GPU-to-CPU ratio, and the resulting increased ratio of computation to communication latency allows more efficient processing of the FDTD. The openCL technique also allows several GPUs to be used in parallel while still attached to a single host or worker node [129].


5.10 Parallel FDTD on the GPU using Direct Compute

The Direct Compute functionality is an interface for General Purpose GPU (GPGPU) computations that is incorporated as part of the suite of Microsoft's DirectX11 graphics APIs. GPU program threads are executed on physical cores referred to as compute shaders. Although the compute shaders have access to graphics resources, the execution of the compute shader is not attached to any stage of the graphics pipeline. The overall programming paradigm is very similar to that of CUDA in that the GPU is treated as a discrete processor and the FDTD grid data is transferred to the dedicated memory of the GPU for processing. An advantage of using Direct Compute is that applications derived from this API will function on GPUs created by manufacturers other than Nvidia, particularly Intel, whose GPU products are currently the most prolific in the global PC market. A combination of conventional C language and a special "C"-like scripting language is used to program the FDTD on the GPU hardware. For Direct Compute the scripting language is the High Level Shader Language (HLSL) [96]. The FDTD algorithm is functionally decomposed in the same way as for CUDA in that the multidimensional grid data is presented to the GPU for SIMD processing. In this manner many FDTD threads are processed concurrently by the Stream Processors (SP) or cores in the GPU. As with CUDA, the continuity of data is crucial for optimised performance of the FDTD on the GPU. A code snippet of a single two dimensional FDTD processing iteration is shown in code segment 5.10.

Code segment 5.10:

C control program code (repeat for each FDTD iteration):

1) RunComputeShader( Parameter list including number of rows and columns of FDTD space );

HLSL code:

2) int kdx=idx*ROWS+jdx;
3) int kdx1=idx*ROWS;
4) static int kloop=0;
5) kloop=kloop+1;

6) if(kdx != kdx1){ A[kdx].ex = B[kdx].a*A[kdx].ex + B[kdx].b*(A[kdx].hz - A[kdx-1].hz);}

7) if(kdx>ROWS){ A[kdx].ey = B1[kdx].a*A[kdx].ey + B1[kdx].b*(A[kdx-ROWS].hz - A[kdx].hz); }

8) A[kdx].hz = B1[kdx].c*A[kdx].hz + B1[kdx].c*(A[kdx+1].ex - A[kdx].ex + A[kdx].ey - A[kdx+ROWS].ey );

9) A[source].hz = B2[kloop].a;

where:
kdx is the thread identifier
.ex, .ey and .hz are the FDTD electric and magnetic cell values
.a, .b and .c are the FDTD electric and magnetic coefficients described in [3]


The code titled "C control program code" is the statement that launches the graphics kernel described by the HLSL code. A thread is launched to process one FDTD cell. All FDTD grid data and related coefficients have been loaded into the GPU's dedicated memory before this process is initiated. It should be noted that one difference from the CUDA implementation of the FDTD in this work is that the Direct Compute version is customised specifically for the HD3000 GPU found on Intel's i5-2500k processor. Once completed, the Direct Compute version of the FDTD also functioned adequately on an Nvidia GPU with 96 CUDA cores. A comparison of the performance of the Direct Compute FDTD algorithm on Intel's integrated HD3000 GPU and on Nvidia's 96 core GT 240 GPU is shown in figure 5.28.

Figure 5.28: A comparison of the performance of the FDTD algorithm deployed using the Direct Compute on GPUs from two different manufacturers

The Direct Compute based FDTD performs well on the integrated HD3000 GPU, although the internal structure of the 12 Graphical Execution Units (EU) is enigmatic as Intel does not disclose what the internal hardware architecture of the EU is. It is known that the graphical EU is composed of a system of compute shaders (or Stream Processors (SP) in CUDA terminology) but it is not known precisely how many shaders there are and how these are physically configured.

Figure 5.28 shows the FDTD throughput using the Direct Compute API on GPUs from two different manufacturers. The HD 3000 GPU is integrated into the Intel i5 processor and shares the same physical memory as the four processing cores, as is described in section 4.3.2. The maximum memory available to the HD 3000 GPU as used for these tests is 820 MB. The GT 240 is a discrete Nvidia GPU with 96 stream processors and has one GB of dedicated memory.

The HD3000 shows a degradation in performance with an increase in the size of the FDTD grid being processed, possibly owing to the influence of the Windows operating system. Although the general FDTD cell throughput on the discrete GT 240 GPU is not as good as on the integrated HD 3000 GPU, the GT 240 does not experience the same performance degradation for larger FDTD grid sizes. This behaviour is possibly because more physical memory is available to the GT 240 than to the HD3000.


The objective of the FDTD implementation on the APU was to allow concurrent FDTD processing on both the internal GPU and the conventional processing cores, while accessing the FDTD grid data from a single addressable memory space. Processing the FDTD resident in one addressable memory would allow the concurrent FDTD threads on the internal GPU and the cores to avoid the communication latency penalty experienced when a single discrete GPU and a host computer, connected via a PCIe link, interchange FDTD data. At the time of this experiment, the hardware available was a second generation four core Intel i5 processor. The only interface to GPGPU functionality on this series of Intel i5 processor was the DirectCompute API.

It eventually proved impossible for the conventional processing cores to directly access the memory space reserved for processing on the internal GPU. It was also too programmatically complex to exchange specific sections of the FDTD grid data between the GPU's memory address space and the processing cores. For these reasons the experiment was abandoned on the second generation Intel i5 processor. Intel is now producing the third generation of its i5 processor, which can be programmed with the openCL interface. Examining whether the above objectives can now be achieved by using openCL would be the basis for future work.

5.11 Accelerated Processing Units and Heterogeneous System Architecture technology

The Accelerated Processing Unit (APU) is a hybrid processor consisting of conventional Complex Instruction Set Computing (CISC) processing cores and a GPU on the same processor die. The FDTD can be deployed on the CISC cores with the openMP threading library as described in section 5.4. The FDTD can also be deployed on the integrated GPU as described in 5.10 above.

This hardware arrangement has the potential to allow the concurrent processing of the FDTD on the GPU and the multiple CISC cores of the APU. By using the same physical memory to communicate FDTD grid data between the GPU and the processing cores, as is shown in figure 5.29 below, there is a reduction in the communication latency, and hence a corresponding reduction in the processing time of the FDTD when processing one FDTD grid shared between the GPU and the processing cores.

Figure 5.29: A schematic of the APU die showing both the GPU and the cores on the same die


The implementation of the FDTD on the different devices available on the APU, i.e. the integrated GPU and the four CISC cores, has been described above; a summary of the performance of the FDTD on these devices is shown in figure 5.29 below. An extension of this work would entail implementing the FDTD on the integrated GPU and the four conventional cores concurrently, using the same physical memory.

Figure 5.29: The performance of the parallel FDTD as implemented on the different devices of the APU

5.12 The parallel FDTD on the Intel Many Integrated Core (MIC) processor

Although no MIC was available for the FDTD experiments in this thesis, a short section on the MIC is included here owing to the state-of-the-art features listed in its hardware description. These are, in summary:

1) More than 50 cores on one processing die.
2) Extended vector processing registers to accommodate sixteen concurrent single precision floating point operations per core.
3) The novel ring-like memory interconnect which uses up to 16 memory channels to supply data to the processing cores.

The MIC is a shared memory device and the FDTD can be parallelised on it using a variety of threading APIs such as MPI, openMP, and openCL. The MIC has the potential to process 50 cores x 16 vector floating point operations per core = 800 simultaneous floating point operations. A parallel implementation of the FDTD on the MIC [162] using openMP threading achieves a speedup of nearly 60 times over that achieved on a single core. The FDTD cell throughput in MCPS on the Intel Xeon Phi has unfortunately not been published by Intel, although an extrapolation from the FDTD cell throughput on a single Nehalem Xeon processor would place this at 60 x 28 MCPS = 1680 MCPS.

The code description in this article [162] also shows that use is made of the vector registers on each core through the auto-vectorisation capability of the Intel compilers, as described in section 5.6 of this thesis.
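
A minimal sketch of this combination, assuming a row-major grid and hypothetical field names rather than the code of [162], is given below: openMP distributes the outer loop over the cores while the unit-stride inner loop is left in a form that the compiler can auto-vectorise onto the wide registers.

#include <omp.h>

/* Hypothetical sketch (not the code of [162]): openMP distributes rows of the ex
   update across the cores while the unit-stride inner loop is left in a form that
   the Intel compiler can auto-vectorise onto the wide vector registers. */
void update_ex(float *ex, const float *hz, const float *caex, const float *cbex,
               int nrows, int ncols)
{
    #pragma omp parallel for
    for (int i = 1; i < nrows; i++) {
        for (int j = 1; j < ncols; j++) {     /* contiguous, unit-stride: auto-vectorisation candidate */
            int k = i * ncols + j;
            ex[k] = caex[k] * ex[k] + cbex[k] * (hz[k] - hz[k - 1]);
        }
    }
}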


5.13 Summary

Several techniques that influence the implementation of the parallel FDTD method on HPC platforms have been examined. The fundamentals of how the FDTD can be deployed with the openMP and MPI threading libraries, and how these techniques apply to various architectures, have been described. It has also been shown how to implement the FDTD method as a data parallel algorithm on processors such as the General Purpose Graphical Processing Unit. The creation of more powerful hybrid HPC architectures, such as the cluster of GPUs, or the combination of AVX, GPU and multicore processing on the Accelerated Processing Unit (APU), requires the use of several different parallelisation techniques to deploy the FDTD successfully.

Very much in the way that a modern HPC system is made up of several different devices that can be used to accelerate a numerical algorithm, the associated parallel FDTD is a combination of a variety of parallelisation techniques. In order to understand the deployment effort of the parallel FDTD on a contemporary HPC platform, one needs to compare the combination of parallelisation techniques on these systems. This comparison and analysis is undertaken in the following chapter.


Chapter 6

6 A comparison of the FDTD method on different high performance computing platforms

6.1 Overview
6.2 Comparison of FDTD implementations on different HPC platforms
6.2.1 Discrete GPGPU and SMP
6.2.2 Discrete GPGPU and multicore SIMD
6.2.3 SMP and Bluegene/P
6.2.4 Cluster and SMP
6.2.5 Integrated GPGPU and discrete GPGPU
6.2.6 Other systems
6.3 General classification table
6.3.1 Scalability of process
6.3.2 Scalability of hardware
6.3.3 Memory access efficiency
6.3.4 Openness
6.3.5 Algorithm encoding
6.3.6 Performance in speed
6.3.7 Performance in terms of model size
6.3.8 Cost of purchase
6.3.9 Power consumption
6.3.10 Implementation time
6.3.11 Future proofing from aspect of coding
6.4 Summary

6.1 Overview

The main focus of this chapter is the comparison of factors contributing to the deployment of the FDTD on different High Performance Computing platforms. An evaluation is made of the implementation of specific FDTD deployments in order to identify deployment issues that were not apparent in chapters 4 and 5. The objective is to convert the subjective effects of some of the contributing factors into a quantifiable form.

A comparison of the parallel FDTD on different computing platforms needs to consider several attributes of varying scale, such as the size of the FDTD data sets being processed. A good example of this is shown in the graphic in table 6.1 below which illustrates the variation in the size of the data sets that can be processed by a single off-the-shelf GPU and the M9000 Shared Memory Processor (SMP). As this thesis is focused on the deployment of the parallel FDTD on these platforms, it is necessary to compare the effort involved in implementing the parallel algorithm on the different hardware platforms. These comparisons serve to describe the FDTD from an implementation aspect and are not meant to be only a performance comparison, although FDTD cell processing throughput is one of the categories.


Table 6.1: A comparison of FDTD grid sizes on different HPC platforms deployed with the parallel FDTD in this thesis.

The comparison of the FDTD ranges from the traditional approach of the clusters, where parallelism is achieved by a massive task parallel approach, to the use of the refined microarchitecture of the GPU, where processing performance is achieved in a data parallel sense.

The general classification table at the end of this chapter is a summary of all the features contributing to the deployment of the FDTD on specific HPC platforms. The notes accompanying this classification are a brief description of the entries in the table and only serve as a summary of the argument presented in this thesis.

6.2 Comparison of FDTD implementations on different HPC platforms

6.2.1 GPGPU and SMP

The comparison of the parallel FDTD on the Graphical Processing Unit (GPU) and the parallel FDTD on the Shared Memory Processor (SMP) is a comparison of two very different hardware architectures, coding implementation methods and cost scales. The most obvious difference is that the FDTD for the GPU has been implemented in a data parallel sense whereas for the M9000 SMP this is done in a task parallel sense.

The discrete GPU is configured as an accelerator to the host computer and has its own dedicated memory, as is described in chapter 4. This allows the movement of FDTD grid data to the GPU accelerator for processing. The typical contemporary off-the-shelf GPU has a maximum memory of one to two GB which is a restriction to the size of the FDTD grid that can be processed on the GPU. GPUs specifically produced for the purpose of numerical processing, such as Nvidia’s C 1060 and the C 2070, are fitted with a dedicated global memory of up to six GB. This additional memory allows the processing of much larger data sets on these devices.


The long communication latency experienced when moving data from the host computer to the GPU, or from the GPU to the host computer, prevents the efficient renewal or update of FDTD grid data on the GPU while processing on both the GPU and the host computer. The consequence of this is that all of the grid data needs to be resident in the GPU's memory if one is to optimise the FDTD algorithm for the discrete GPU. The largest model that was processed on a discrete GPU for this thesis is 121 million grid points, which is small when compared to the model sizes of over two billion FDTD grid points processed on the M9000 SMP. It must be noted that the M9000 SMP was able to compute model sizes of up to 20 billion FDTD grid points. When multiple GPUs are used in a cluster architecture it is possible to process data grids of the same magnitude as can be processed on the M9000, as is described in section 5.9.

The M9000 SMP architecture is a legacy architecture, where many processing cores have access to the same memory address space, and the ease of deployment of the parallel FDTD speaks of the continuous refinement this development environment has undergone over the previous two decades. The conversion of the serial three dimensional FDTD onto the SUN M9000 shared memory processing system took less than two hours and this fact in itself is a powerful differentiator when compared with the other development platforms.

This very short deployment time provides a good basis for comparison with the conversion of the serial FDTD to the GPU, which took more than a week of intense development work to produce basic performance results. Much of the development effort was taken up by the requirement to reorganise the multidimensional FDTD grids into a one dimensional grid as a prerequisite for processing on the GPU. From a coding perspective, the Perfectly Matched Layer (PML) boundary conditions were difficult to deploy on the GPU, as many hours of programming were required to align the boundary areas correctly with the main FDTD grid. The choice of boundary conditions in the coding process is quite crucial for the GPU implementation, as CPML boundaries are much simpler to integrate with the FDTD grid [27]. The GPU programming effort for the FDTD can be broken into three distinct stages:

1) Basic implementation of the FDTD on the GPU to achieve data parallelism.
2) Integration of boundary conditions with the main FDTD grid.
3) Optimisation of the FDTD on the GPU, such as memory coalescence and use of shared memory variables.

The optimisation of the FDTD on the GPU is a crucial exercise that is needed to produce good performance results. Figure 6.1 below illustrates the benefit of optimising the FDTD on the GPU by coalescing [55] the access of data in the global GPU memory. This coalescence of the data has large implications for the performance achievable on the GPU and is somewhat analogous to reading data in the correct row or column order in matrix handling methods on conventional CPUs. The use of shared memory, a limited set of variables with extremely low access latency, can enhance the performance of the FDTD even further. The amount of shared memory available to a collection of threads on a GPU is, however, very limited, and its use increases the complexity of the programming.

Figure 6.1 illustrates a cost implication of optimising the FDTD for the GPU by comparing the optimised performance of the parallel FDTD on a 120 core gaming GPU, the GTS 240, with the un-optimised performance of the 480 core GTX 480 GPU. The Nvidia GTS 240 and GTX 480 GPUs are


commonly purchased for enhancing the graphics of computer games on personal computers and are therefore commonly referred to as "gaming" GPUs. Figure 6.1 shows that the optimised form of the low cost 120 core GTS 240 GPU clearly outperforms the un-optimised and more expensive 480 core GTX 480 GPU. A detailed description of coalescence is given in section 5.7. The Nvidia C 2070 and C 1060 GPUs can process larger FDTD data grids than the gaming GPUs as they have a larger dedicated memory. It is speculated that the artefacts in the throughput plots for the GTS 240 and GTX 480 gaming GPUs are a result of interference by the Windows operating system, under which these FDTD implementations were developed. The FDTD implementations on the C 1060 and the C 2070 GPUs show no such artefacts and were run in a Linux environment.

Figure 6.1: Performance of the FDTD on various GPUs for various stages of optimisation

The performance comparison of the parallel FDTD on the M9000 SMP and the GPU, as shown in figure 6.2, was made for a range of FDTD grid sizes. The performance of the parallel FDTD using MPI on 16 cores of the M9000 is closely matched to that of Nvidia's C 1060 GPU. Similarly, the performance of the parallel FDTD with MPI on 32 cores of the M9000 is very close to that of the 448 core C 2070 GPU. The peak throughput of FDTD data by the M9000 in figure 6.2 is approximately 780 million cells per second, and this peak is only reached when processing an FDTD grid of two billion cells. It is possible that the cause of the parallel slowdown on the M9000, described in figure 5.3, is also the reason for this ceiling in processing throughput.


Figure 6.2: A comparison of the parallel FDTD on the M9000 SMP and the GTX480 GPU.

Figure 5.3 shows the reduction in processing time on the M9000 SMP as more cores are used for the computational effort. Processing this small FDTD data set with more than 32 cores results in "parallel slowdown", a phenomenon caused by:

1) The increased algorithmic load of communication between the processing threads and memory.
2) Memory contention of too many threads attempting to access the same memory.
3) Processor affinity; the short period threads caused by loop parallelism force constant context changes and the reloading of cache memory in processors associated with other threads [51, 101].

The efficiency of the parallel FDTD algorithm implemented for the M9000 SMP is affected by the ratio of FDTD cell degrees of freedom to the number of processing cores, also referred to as the granularity, the effect of which is described in figure 6.3. The efficiency is defined as:

Efficiency = Speedup / (Number of cores) = T(1) / (N x T(N)),

where T(1) is the FDTD processing time on a single core and T(N) is the processing time on N cores.
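
For example, if the parallel FDTD on 32 cores completes a model 24 times faster than the serial FDTD on a single core, the parallel efficiency is 24/32 = 75%.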

The efficiency of the parallel FDTD on the M9000 increases as the granularity increases, i.e. the ratio of FDTD cells computed per core increases in relation to the amount of data communicated between FDTD chunks (or threads).


Figure 6.3: The efficiency of the parallel FDTD using the MPI threading libraries on the M9000 SMP.

The performance variation between the openMP and the MPI versions of the parallel FDTD on the M9000 platform is shown in figure 6.4. The variation in performance can be attributed to processor affinity penalties incurred by these two different threading techniques on the M9000 [51, 101]: the MPI FDTD thread has continuous access to the same physical core and memory during processing of the FDTD, whereas the openMP algorithm is based on loop parallelism, which continuously creates new threads, with a related core affinity problem. This difference in parallel loop duration is schematically illustrated in figures 5.7 and 5.10, which show the thread duration for the parallel FDTD implemented with either openMP or MPI.

Figure 6.4: Processing time of the parallel FDTD with openMP or MPI on the M9000 SMP for varying FDTD model sizes


6.2.2 GPGPU and multicore vector registers

The comparison of the implementation of the FDTD on the GPU and on the multicore vector registers is based on the algorithmic similarity of processing the FDTD on these platforms. The data parallel nature of both of these implementations suggests that there is also some similarity between these architectures. This similarity is a result of the Single Instruction Multiple Thread (SIMT) design of the GPU Stream Processor and the use of a Single Instruction Multiple Data (SIMD) algorithm for the SSE or AVX registers. Both of these devices perform SIMD-like instructions to parallelise the FDTD.

Discrete GPUs in general consume more power than the multicore processors using vector extensions, such as the AVX. The Nvidia GTX 480 consumes up to 350 watt when processing intense numerical algorithms such as the FDTD. The four core Intel i5-2500k processor, including the contribution from the AVX registers and integrated GPU on the same die, consumes a maximum of 105 watt under load.

Figure 6.5 below is a comparison of the speedup achieved by implementations of the FDTD on a 240 core Nvidia C 1060 GPU and the FDTD running on an eight core Intel Xeon processor using the Streaming SIMD Extensions (SSE) registers. While the implementation of the optimised FDTD on the GPU is a relatively complex coding project, the optimisation of the parallel FDTD using the SSE capability of the multicore CPU is relatively simple, as described in section 5.6. The FDTD implemented using the SSE shows the same amount of speedup when using either the openMP or MPI threading techniques.

Figure 6.5: Speedup of the FDTD using SSE on an eight core CPU compared to FDTD on the C 1060 GPU.

The multicore SIMD architecture is an evolving platform in which the number of cores and the specialised circuitry such as the length of the vector registers is growing rapidly. At the time of writing of this thesis, at least one chip manufacturer is planning a 50 core chip with each core


incorporating the AVX functionality (Knight's Corner, from Intel). When one considers that the AVX specification allows an extension to 512 bits, and that this will allow the simultaneous processing of sixteen single precision floating point numbers per core, there is a potential to process 50 x 16 = 800 floating point numbers concurrently. Table 6.2 lists contemporary processors and their SIMD/SIMT capability.

Device                     Manufacturer's identification      Cores   Concurrent single floating point calculations
GPU                        C 1060                             240     240
4 cores with SSE - 128     Intel i5-2500k                     4       16
8 cores with SSE - 128     Intel Xeon 7560                    8       32
4 cores with AVX - 256     Intel i5-2500k                     4       32
4 cores with AVX - 1024    As per AVX specification only      4       128
50 cores with AVX - 512    Knights Corner (release in 2013)   50+     800

Table 6.2: A comparison of the SIMD capability of different devices.

When processing the FDTD in a data parallel sense with a vector register that is twice as long, i.e. the 256 bit AVX register as compared to the 128 bit SSE, it does not automatically follow that there will be a twofold improvement in performance. Figure 6.6 illustrates the speedup achieved on a four core processor with the FDTD using either the SSE or the AVX registers. The FDTD using AVX does show a marked improvement in performance for a low core count, but for a higher core count the performance of the FDTD using AVX is not much better than the FDTD using SSE.

Figure 6.6: A comparison of the FDTD performance with SSE and AVX computing a grid of 20 million points on the four core Intel i5-2500k processor

This throttling effect for a higher core count could possibly be attributed to the limitation in data bandwidth for the data being supplied to the processing cores by the memory. This assumption is borne out by the improvement in the performance of the FDTD with AVX when using two channels supplying data from the memory to the cores, as is shown in figure 6.7.


Figure 6.7: Improvement in the FDTD using dual memory channels. The regular MPI dual curve indicates processing with no vector extensions but using two memory channels. The results are derived from processing the FDTD on a four core i5-2500k processor computing 20 million cells.

The performance-limiting memory bottlenecking issue, as described in section 5.6, is very probably the reason why the 256 bit AVX based FDTD implementation does not achieve the expected speed-up over the 128 bit SSE based FDTD implementation graphed in figure 6.6.

Programming the FDTD with vector registers such as the AVX requires two computational stages:

Stage 1: A data fetch stage that will load the FDTD grid values into memory in preparation for processing.

Stage 2: A processing stage that will compute the FDTD values in a data parallel sense.
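
A minimal sketch of these two stages, using the 256 bit AVX intrinsics and hypothetical field names, is given below; each pass of the loop fetches eight consecutive single precision values per operand (Stage 1) and then updates eight FDTD cells simultaneously (Stage 2).

#include <immintrin.h>

/* Hypothetical sketch of the two stages with 256 bit AVX intrinsics: Stage 1 loads
   eight consecutive single precision values per operand into the vector registers,
   Stage 2 updates eight FDTD cells simultaneously.  Field names are illustrative. */
void update_ex_avx(float *ex, const float *hz, const float *caex,
                   const float *cbex, int n)
{
    for (int k = 8; k + 8 <= n; k += 8) {
        /* Stage 1: data fetch into the 256 bit registers */
        __m256 vex  = _mm256_loadu_ps(&ex[k]);
        __m256 vhz  = _mm256_loadu_ps(&hz[k]);
        __m256 vhzm = _mm256_loadu_ps(&hz[k - 1]);
        __m256 vca  = _mm256_loadu_ps(&caex[k]);
        __m256 vcb  = _mm256_loadu_ps(&cbex[k]);

        /* Stage 2: eight cell updates computed simultaneously */
        __m256 vres = _mm256_add_ps(_mm256_mul_ps(vca, vex),
                                    _mm256_mul_ps(vcb, _mm256_sub_ps(vhz, vhzm)));
        _mm256_storeu_ps(&ex[k], vres);
    }
}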

The benefit of the SIMD based implementation of the FDTD using AVX lies in the ability to perform simultaneous computations as described in Stage 2; its weakness is that new data cannot be supplied to the vector registers quickly enough for relatively sparse computations such as the FDTD update equations. If one could therefore increase the number of computations in the second stage while the overhead of the memory fetch requirement remains the same, the performance of the FDTD SIMD operation would improve, as described in section 5.6.

This can be expressed in the following simple relationship, where n is taken to be the ratio of the time spent fetching FDTD data from memory (Stage 1) to the time spent computing in the vector registers (Stage 2):

If n > 1 then memory bottlenecking will occur on multicore chips using SIMD with AVX.

If n < 1 then memory bottlenecking will not restrict the computation of the FDTD, or will have less of an effect.

It is possible to increase the performance of this two stage cycle by improving the efficiency of the data fetch operations or by increasing the load of the vector register computation.


The data fetch operation can be made more efficient by using low latency local cache memory to fetch data [100], or to provide more physical channels to access the data. Attempts were made to program the data supply to the local cache but it was found that the implementations of the FDTD using these did not show any performance advantage over the conventional fetch from memory. Using more than one physical memory channel to access memory showed a marked improvement of the data fetching operation resulting in an overall improvement of the FDTD using AVX as is shown in figure 6.7.

Some compilers achieve a performance improvement by automatically vectorising the FDTD using the AVX registers on multicore processors. An examination of the resultant assembly code produced by the Intel C compiler reveals that the automatic vectorisation is achieved by unrolling the embedded loops that make up the FDTD computation and arranging these to be processed by the AVX registers on the processor, a technique which is very similar to the manually coded method. Figure 6.8 shows that the Intel compiler successfully automates the vectorisation of the FDTD for a single core, but is not very effective for more than one core in the multicore environment. Three curves that show the effect of using the Intel auto-vectorising compiler are compared in figure 6.8. These curves are:

1) The regular FDTD with no AVX speedup.
2) The FDTD using the Advanced Vector eXtensions (AVX) achieved by manually programming the registers.
3) The FDTD as accelerated automatically by the compiler with the AVX.

A consideration that also has to be made is that the building block of contemporary supercomputers currently being manufactured is a multicore processor. The implication of this is that any computational mechanism that will speed up the processing of the FDTD on the multicore processor, such as the AVX, will also speed up the processing of the FDTD on the supercomputer.

Figure 6.8 Auto-vectorisation of the FDTD with the Intel C compiler showing speedup of the four core Intel i5-2500k processor computing 20 million cells.


6.2.3 M9000 SMP and Bluegene/P

Architecturally the Bluegene/P and the M9000 SMP are similar in that they both use a multicore processor as their basic computing workhorse and have a highly sophisticated proprietary processor interconnect to provide communication between the processors performing the actual computation. The primary difference between these systems is the way their respective operating systems view their memories. The Bluegene/P operates as a Non-Uniform Memory Access (NUMA) architecture whereas the M9000 sees one unified memory address space (UMA). Although the physical memory on both machines is distributed and associated with specific processors, the M9000 SMP has a memory access latency that is on average ten times lower than that of the Bluegene/P (see table 4.2). Although not every processor is directly attached to the same physical memory, this highly effective memory access speed (low access latency) allows the memory to be used as one addressable memory.

Another memory specific difference between the Bluegene/P and the M9000 SMP is that the Bluegene/P only allows 32 bit memory addressing and makes available two GB of memory per processing node (each with four cores). The M9000 on the other hand has 64 bit memory addressing, which allows for the processing of a much larger FDTD grid size per node. Experimentation has shown that the Bluegene/P will run the parallel FDTD without encountering memory restrictions with up to 20 million grid points per node. As there are 1024 nodes on the Bluegene/P used for this thesis, this theoretically allows the processing of an FDTD grid of 20 billion grid points. The maximum FDTD grid size processed on the Bluegene/P for this thesis is two billion cells. The performance of the parallel FDTD with MPI on the Bluegene/P and of the FDTD with MPI on the M9000 SMP is shown in figure 6.9 below. Of note is the logarithmic scale of the throughput in millions of cells per second, which shows the extraordinary processing power of the Bluegene/P when compared to the performance of the M9000 SMP.

Figure 6.9: The performance of the parallel FDTD with MPI on the Bluegene/P.

The efficiency of the FDTD achieved on the Bluegene/P compares well to what is found by Yu et al. [53] although the results presented here do show more variability, very possibly arising from


different FDTD thread to physical core mappings between timing runs. The efficiency of the FDTD on the Bluegene/P and a comparison to the Yu-Mittra [53] results is shown in figure 6.10 below.

Despite the Bluegene/P's formidable reputation as a supercomputing engine, the programming of the parallel FDTD with MPI is a conventional programming process. The parallel FDTD code used on the Bluegene/P was developed on a personal computer using the C language and MPI.

The parallel FDTD built using the MPI library (section 5.5) requires absolutely no code change whatsoever between the Bluegene/P and the M9000 SMP platforms. Equal FDTD grid sizes were computed on each processor and there were no load-balancing issues on either the M9000 SMP or the Bluegene/P because of this. The shared memory architecture allows the development of the parallel FDTD with openMP on the M9000, which is not possible on the Bluegene/P. Although using the openMP threading library does not provide an advantage in processing speed and capacity, it does allow for much faster program development and prototyping. It is also possible to program the FDTD on four core IBM Power processors (such as used by the Bluegene/P) with openMP and using these as compute nodes in a larger MPI hybrid structure, as was deployed by Yu and Mittra [51]. This hybrid deployment could explain the more efficient processing of the FDTD on the Bluegene/P with a core count in the lower range as is shown in figure 6.10. The Yu and Mittra data has been extracted from work done by [53], as described further below.

Figure 6.10: Efficiency of the FDTD on the Bluegene/P with a FDTD grid of two billion cells.

6.2.4 Cluster and SMP

As the M9000 SMP and the Westmere (Xeon processor based) cluster both use multicore processors as their computational workhorses, it is not unexpected that they all achieve a similar order of speedup for the parallel FDTD on a small number of cores when implemented with the MPI API, as is shown in figure 6.11. Although the clusters are as efficient as the SMP when only using a few cores, the M9000 SMP is more effective at processing the FDTD as the number of cores used increases, also


graphed in figure 6.11. This mirrors in some way the basic architecture used for the clusters, as the multicore processors can also be considered to be individual SMPs. The cores on the multi core processor constituting the M9000 SMP will access local on-chip memory, which has a lower memory access latency than the memory accessed via a memory interconnect. This is because the cores on the multicore chip all access the same physical memory arrangement, even though it is not the same computational address space.

Figure 6.11: The FDTD performance comparison on the M9000 SMP and Westmere Cluster when processing a FDTD grid of 2000 million cells.

Although the cluster performance is limited in its ability to achieve a good FDTD cell throughput, this does not affect its ability to process large FDTD models, as is shown in figure 6.12 below.

The reader will notice that although the cluster does achieve a speed-up for a low core count, it is less efficient when using a larger number of cores as is shown in figure 6.14. An examination of the efficiency achieved by similar FDTD deployments on clusters [125, 126] shows some agreement with the FDTD efficiency achieved with a maximum processing core count of 16. An interesting effect of the type of communication channel on the efficiency of the FDTD on a cluster is shown by [144] in figure 6.13 below. The acceleration factor is the speedup achieved by a given number of nodes.


Figure 6.12: A comparison of the effect of size of the FDTD modelling space on a cluster and the M9000 SMP.

The efficiency of the FDTD is shown to be subject to a combination of the latency and bandwidth of the communication channel. In essence, figure 6.13 supports the reasoning of how the long latency, discussed in chapter 4, negatively affects the performance of the parallel FDTD.

Figure 6.13: Graph taken from [144] showing the effect of inter-node communication efficiency on the acceleration of the parallel FDTD.


Other FDTD implementations [53] on clusters are able to maintain an efficiency of above 80% for a large number of cores [80 cores]. It is speculated that this may be as a result of implementing the parallel FDTD with a processor topology [54] that more closely matches that of the FDTD model dimensions. The matching of topology is something that cannot be easily achieved when processing a FDTD program on a system where the processors are abstracted from the topology by a scheduling process such as Moab.

Figure 6.14: A comparison of the efficiency of the parallel FDTD with MPI on several clusters at the CHPC. This is compared to the efficiency of the parallel FDTD with openMP on the M9000, also a CHPC system. Also shown, although for varying model sizes and number of cores, are FDTD cluster efficiencies from Yu, Lu, and Oguni [100, 125, 126].

The comparison of the parallel FDTD on the Distributed Shared Memory Processor (DSMP) architecture, in this instance two multicore Nehalem processors connected via the Quick Path Interconnect (QPI) technology, is in effect a comparison with a system that is part SMP and part cluster. The reason it is viewed as an SMP is that the operating system on this platform is configured to view the addressable memory as a Unified Memory Address (UMA) system. The DSMP configuration is similar to a cluster configuration in that it is a collection of multicore processors, but these are more tightly coupled, which permits them to achieve a lower inter-processor communication latency than a conventional cluster configuration. A comparison of the performance of the parallel FDTD implemented with openMP on a conventional SMP (M9000) and on the Intel DSMP is shown below in figure 6.15. The performance on both platforms is very similar despite the difference in the specification of the Chip Multi Processor (CMP) and interconnects on these platforms. The degradation in performance of the DSMP from 16 cores onwards is assumed to be caused by the processing inefficiencies of hyper-threading, as described in section 4.14.


Figure 6.15: The performance of the parallel FDTD using openMP on the SMP and Distributed SMP platforms (large multicore processors).

6.2.5 Integrated GPGPU and discrete GPGPU

The merits of processing the FDTD with a SIMD device have been discussed in section 6.2.2 above. It is apparent that one of the major limitations in processing large FDTD grids simultaneously on both a host computer and a discrete GPU is the data transfer latency between the host's memory and the GPU memory, as per the description in section 4.3.1. The Accelerated Processing Unit (APU) provides the GPU on the same die as the CPU cores, as is described in section 4.3.2, thereby providing access to the same physical memory and the same memory caching system. This arrangement of CPU and GPU has the potential to reduce the data communication latency between the integrated CPU and GPU, and hence to allow the concurrent computation of the FDTD on both the CPU and the GPU. Figure 6.16 below is a schematic showing a discrete GPU configured with an APU to process the FDTD. The FDTD grid data has to be moved from the APU's memory to the external GPU's memory for processing. If one is considering the concurrent processing of the FDTD grid with both a GPU and the multiple cores of the APU, it makes sense to minimise the effect of large communication latencies and to process the FDTD with both the APU's cores and the APU's integrated GPU.

Figure 6.16: A description of the additional data communication overhead required for an APU to process the FDTD on an external GPU.

According to the independent performance values of the FDTD on the multiple cores and the integrated GPU of the APU, as shown in figure 6.17, the combined FDTD throughput offered by the multiple cores and the integrated GPU could be in excess of 200 million cells per second.

Figure 6.17: The computational throughput of the integrated GPU and a GPU configured via PCIe as a co-processor.

6.2.6 Other systems

The author is aware of other HPC parallel processing systems, such as quantum computers or FPGA based machines, but has restricted the research performed here to the more commercially available platforms owing to time constraints.

6.3 General classification table

The deployment effort and performance of the FDTD algorithm on different high performance platforms is dependent on several factors, as listed in table 6.3 below. Each factor has been ranked on a scale of 1 to 10 for each platform, a lower rank denoting a positive or beneficial aspect and a higher rank indicating a negative or detrimental aspect. Each factor is summarised below and related back to the discussion in this thesis. The classification serves as a comparison of the deployment effort of the FDTD on these platforms.

Category                                      Bluegene/P   GPU   Cluster   M9000   Multicore
Scalability of Process                             1          2       1        3        4
Scalability of Hardware                            1          2       1        3        5
Memory Access Efficiency                           2          1       3        2        5
Openness                                           6          1       2        3        1
Algorithm coding effort                            4          7       3        4        1
Performance (FDTD cell throughput)                 1          2       4        3        6
Performance in Size, Grid Range                    2          1       2        2        2
Cost of purchase                                   5          1       6        7        4
Power consumption                                  4          1       5        7        3
Implementation time                                4          6       3        1        2
Future proofing from aspect of coding              3          6       3        3        3
Total Rating (1 is ideal, 110 is difficult)       33         30      33       38       36

Table 6.3: Classification table of the FDTD deployment effort on several High Performance platforms. Lower values are better.

Notes on the General classifications: Overall Ranking – 1 is beneficial, 10 is detrimental.

6.3.1 Scalability of process

The scalability of process has been ranked from one (easy) to ten (hard). This is a measure of the effort involved in scaling the FDTD on a particular computing platform.

Rank one – The FDTD as a distributed MPI process on the Bluegene/P or multicore cluster is highly scalable (section 5.5), although consideration has to be given to the processor topology to achieve good performance (a minimal topology sketch is given below). In practice, geophysical companies using Controlled Source Electromagnetics [153, 154] model commercial scenarios on clusters using thousands of FDTD MPI processes.
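
The topology consideration can be expressed directly in MPI. The following minimal sketch, which is illustrative and not the implementation used for this thesis, asks MPI to factorise the available ranks into a 3-D Cartesian process grid and to return the neighbours needed for halo exchange; matching these dimensions to the FDTD model dimensions is the kind of arrangement suggested by [53, 54].

    /* Minimal sketch: arranging MPI ranks in a Cartesian topology for a 3-D
       FDTD domain decomposition. Illustrative only. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int nprocs, rank;
        int dims[3] = {0, 0, 0}, periods[3] = {0, 0, 0}, coords[3];
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        MPI_Dims_create(nprocs, 3, dims);          /* factorise ranks into a 3-D process grid */

        MPI_Comm cart;
        MPI_Cart_create(MPI_COMM_WORLD, 3, dims, periods, 1, &cart);

        MPI_Comm_rank(cart, &rank);
        MPI_Cart_coords(cart, rank, 3, coords);

        int xlo, xhi;                              /* neighbours for halo exchange along x    */
        MPI_Cart_shift(cart, 0, 1, &xlo, &xhi);

        if (rank == 0)
            printf("process grid: %d x %d x %d\n", dims[0], dims[1], dims[2]);
        printf("rank %d at (%d,%d,%d), x-neighbours %d/%d\n",
               rank, coords[0], coords[1], coords[2], xlo, xhi);

        MPI_Finalize();
        return 0;
    }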

Rank two – The FDTD can be readily adapted to a GPU cluster [153], as described in section 5.9. The FDTD experiences a good increase in capacity and capability with the corresponding addition of more GPU processing units. Communication between GPUs can take place using a variety of techniques, including MPI or GPUDirect, Nvidia’s mechanism for exchanging data directly between the dedicated memories of Nvidia GPUs.

Rank three – On the SUN M9000 SMP the FDTD process is easily scaled, with no code change to the parallel FDTD programs created with MPI or openMP as described in sections 5.4 and 5.5. There is an upper limit to the physical size and addressable memory that can be attached to any single SMP platform, and this limits the scalability. In order to make the FDTD process scale on a clustered SMP platform, it may be necessary to change the openMP architecture of the parallel FDTD to an openMP/MPI hybrid system to accommodate the physical cluster structure of several M9000s strung together (a minimal sketch of such a hybrid structure follows).
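
The openMP/MPI hybrid arrangement is sketched below under the assumption of one MPI process per SMP node, with openMP threads filling the cores within that node; this is a structural illustration rather than the code developed for this thesis.

    /* Minimal sketch of a hybrid openMP/MPI structure: MPI between SMP nodes,
       openMP threads within a node. */
    #include <mpi.h>
    #include <omp.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int provided, rank;
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        #pragma omp parallel
        {
            /* field updates on this node's sub-grid would be threaded here */
            #pragma omp single
            printf("rank %d is running %d openMP threads\n",
                   rank, omp_get_num_threads());
        }

        /* halo exchange with neighbouring nodes (e.g. MPI_Sendrecv) would follow
           each update step */
        MPI_Finalize();
        return 0;
    }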

Rank four – The FDTD with MPI or openMP on the Intel multicore processor scales well owing to the versatile interconnectivity offered by the Quick Path Interconnect (QPI) fabric, yet is limited by the physical size of the configuration, i.e. a motherboard populated with several multicore processors effectively becomes a blade server. Any further growth would define this architecture as a cluster.

6.3.2 Scalability of hardware

The scalability of hardware has been ranked from one to ten, one being highly scalable and ten being difficult to scale.

The Bluegene/P and the multicore cluster platforms have rank one, as all that is required to scale these systems is the addition of more computing hardware of similar architecture (sections 4.3.5 and 4.3.6). Good examples of this are the cluster farm at Google Inc. [128] and the Bluegene/P installation at the Jülich Supercomputing Centre [127] in Germany.

A similar consideration applies to the GPU, in that a GPU cluster can be created by attaching several GPUs to a single host controller (or worker node) [153], as described in section 5.9. These GPU clusters can then be expanded quite readily by adding similar host controllers to the system. As the GPU cluster requires only a fraction of the power consumed by larger systems, such as the M9000 and Bluegene/P, no special consideration has to be made for additional power supplies and cooling. This advantage places the GPU cluster into a good scalability category and it receives a rank of two.

Rank three, as assigned to the SUN M9000 SMP, has the same attributes as rank one, but consideration has to be given to the fact that at some stage the processing capacity of the SMP will be reached (section 4.3.3). The SMP nodes can be linked together in a cluster configuration, but this changes the nature of the platform’s architecture to that of a cluster.

Rank five is assigned to the multicore processor, as only a finite number of these will fit onto one motherboard (section 4.3.4). A motherboard populated with multicore processors forms the basis of a blade server. These blade servers are the building blocks of the Nehalem and Westmere clusters used for this thesis.

6.3.3 Memory access efficiency

Rank is from one to ten, one for being efficient and ten being inefficient.

The memory access efficiency is one of the key components contributing to the performance of the FDTD method on high performance computing platforms. The FDTD is computationally sparse, so its performance depends on whether sufficient data can be supplied to the processing cores quickly enough. The data required by the FDTD process is supplied either directly from memory or from the communication of data with other FDTD threads. Two examples of how the memory access efficiency can influence FDTD performance are the optimisation of the FDTD for the GPU in section 5.7, and the limitation in processing speed caused by data supply bottlenecking for vector processing on the multicore processor in section 5.11. The effect of system latency and bottlenecking on the FDTD on different high performance platforms has been described in various sections of chapters four and five.
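
The phrase "computationally sparse" can be made concrete with a rough arithmetic-intensity estimate. The operation and byte counts below are assumptions chosen only to illustrate the order of magnitude; the exact figures depend on the update formulation and on cache behaviour.

    /* Rough arithmetic-intensity estimate for one Yee cell update.
       The counts are illustrative assumptions. */
    #include <stdio.h>

    int main(void)
    {
        const double flops_per_cell = 6.0 * 6.0;       /* ~6 multiply/adds per field component          */
        const double bytes_per_cell = 6.0 * 4.0 * 3.0; /* 6 fields x 4 bytes x (read, neighbour, write) */
        const double intensity = flops_per_cell / bytes_per_cell;

        printf("arithmetic intensity ~ %.2f FLOP/byte\n", intensity);
        /* e.g. a node with 100 GB/s of memory bandwidth could then sustain only about
           100e9 * intensity = 50 GFLOP/s on this kernel, well below the floating point
           peak of a contemporary multicore processor. */
        return 0;
    }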

The GPU as used in asymmetrical processing requires the data to be moved from the host memory to the GPU for processing. Once on the GPU and optimised for the GPU memory (section 4.3.1), the FDTD process achieves good performance (section 5.7). The memory of a GPU is commonly limited to a few gigabytes (GB) and this limits the size of the array that can be processed on one GPU alone. The movement of data between a single GPU and its host is hampered by the long latency experienced when communicating over the PCIe bus. This prevents the simple sharing of FDTD data grids between the GPU and host (sections 5.7, 5.9).

The processing of the FDTD using multiple GPUs, i.e. a GPU cluster (section 5.9), overcomes the memory limitation in processing large FDTD grids. In addition, the use of Nvidia’s GPUDirect technology allows low latency communication between the memories of Nvidia GPUs in a GPU cluster. The ability of the GPU cluster to process extremely large FDTD arrays [153] at a throughput (7900 MCPS) comparable to that of the Bluegene/P (8000 MCPS) warrants a memory access efficiency rank of one for the GPU.

The Bluegene/P has been given a ranking of two. Despite the formidable FDTD performance achieved on the Bluegene/P in this thesis (section 6.2.3), it is limited in memory size, as each node addresses only a 32 bit address space (section 4.3.6). Good FDTD performance was achieved on the Bluegene/P because of the efficiency of its sophisticated data interconnect; the memory access optimisation effort required on the GPU was not required for the Bluegene/P.

The Westmere cluster computer has received a rank of three for memory access because the communication between cluster blades is carried by an Infiniband or high speed Ethernet switch, and not by a sophisticated interconnect dedicated to inter-process and memory communication, as on the Bluegene/P (sections 4.3.5 and 6.2.4).

A rank value of five has been assigned to the multicore processor at the Intel Manycore Test Laboratory (MTL), as this platform uses the Quick Path Interconnect (QPI) technology to create a Distributed Shared Memory Processor (DSMP). The operating system distributes the memory usage across the available memory infrastructure and presents the user with a unified memory, as described in sections 4.3.4 and 6.2.4. The memory access is good when the FDTD is processed on a small number of cores, but appears to limit the performance of the FDTD as the core count grows, as shown in section 6.2.4.

A rank of five has also been assigned to the SUN M9000 SMP machine, as the inter-process communication between FDTD threads is performed by the exchange of data in memory. Memory access performance for a low number of cores is very efficient, as shown by the good FDTD throughput achieved. FDTD performance deteriorates rapidly when processed on a larger number of cores on the M9000, presumably owing to excessive data management between the processing cores and the memory.

6.3.4 Openness

Rank 1 indicates good openness; rank 10 indicates that information is difficult to access.

In order to use any HPC system, a certain learning curve is involved in getting to understand the nuances of the system and its languages, compilers, etc. Although openness is a difficult concept to quantify, some have tried to formalise it [31].

Openness also refers to the amount and quality of self-help information available on the internet in the form of documentation, examples and manufacturer supported user forums. Information required for the deployment of a numeric algorithm such as the FDTD is generally readily available. Of the work performed for this thesis, one particular example stands out. The Bluegene/P is a highly sophisticated supercomputer which is the culmination of extensive operational research [86]. The parallel FDTD implementation created at the Centre for High Performance Computing (CHPC) was done by extrapolating the methods from other platforms’ paradigms, such as the MPI threading on the Nehalem cluster, onto the Bluegene/P. Although the results provided by the implementation are adequate, better access to good documentation would have allowed the use of the dual simultaneous floating point calculations per compute core. For these reasons the Bluegene/P was afforded a high ranking for openness, i.e. some information could not be easily located in the Bluegene/P documentation.

A system will also score a high ranking for openness if the information required to deploy an algorithm depends on IT support: voluminous information may be available on the web but may not apply to the development situation at hand, and direct intervention by the HPC centre’s support personnel may be required to allow the developer to compile, link, or run a program.

The good quality of the documentation provided for Nvidia’s CUDA development environment has made possible the evolution of the GPU into a commonplace development platform for high performance computing. Good examples that are easy to understand, together with well laid out documentation that is readily available, provide an easy entry point to HPC.

A low ranking is indicative of a system that has good openness; a high ranking refers to a system that is not very open. Although the openness value is not easy to quantify, a good openness rating is critical if one is considering efficient software development on an HPC platform.

6.3.5 Algorithm coding effort

The coding effort required to produce the parallel FDTD on a specific platform is dependent on several factors, as described in section 3.10. In summary, these are:

• User community size
• Maturity of technology
• Hardware or platform availability
• FDTD coding precedent
• Documentation
• Orthodoxy

The multicore platform, essentially multicore processors coupled together with the Quick Path Interconnect (QPI) fabric, has the best and lowest rank (one). The FDTD is rapidly converted to parallel form with a mature threading technology (either openMP or MPI) and a platform is always available. Documentation is excellent and the user groups are large and well established, able to furnish advice if assistance is needed. The process flow of the parallel FDTD on this platform is not unlike that of the FDTD in serial form.

The GPGPU implementation sits at the other end of the ranking scale (seven) owing to the unorthodox coding requirements and the lack of a specific coding template for the FDTD on the GPU. The CUDA documentation is nevertheless good and there are several large and knowledgeable user groups that can assist with coding issues.

6.3.6 Performance in speed (capacity in FDTD cell throughput)

The peak performance of the FDTD was measured in Millions of Cells Processed per Second (MCPS). The peak measurement taken for each platform is entered in table 6.4. The performance figures shown in table 6.4 are from the FDTD implementations for this thesis and are described in some detail in chapters five and six. Apart from the GPGPU, all of these implementations took approximately the same coding effort and were measured with the same FDTD algorithm. The Bluegene/P dominates these results owing to the formidable performance of its interconnect (section 4.3.6) and also because this machine was available as a dedicated resource for this thesis. Although the FDTD performance achieved on the cluster for this thesis is low in terms of millions of cells per second, some larger cluster implementations by Yu and Mittra [100] have been shown to achieve performance comparable to that of the Bluegene/P.

Platform                        Peak MCPS   Rank
Bluegene/P                         8900       1
GPGPU Nvidia Cluster* [153]        7910       2
M9000 SMP                          3800       4
Intel Westmere Cluster             1536       5
MTL Nehalem multicore               148       6

Table 6.4: The FDTD processing throughput in Millions of Cells Per Second (MCPS) achieved by the parallel FDTD implementations for this thesis. * other sources.
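
For reference, the MCPS figure used in tables 6.4 to 6.6 is simply the total number of cell updates divided by the wall-clock time. The helper below shows the calculation; the grid size, time-step count and elapsed time are hypothetical example values, not measurements from this thesis.

    /* Millions of Cells processed Per Second (MCPS) from a timed FDTD run.
       The example values are hypothetical. */
    #include <stdio.h>

    static double mcps(long nx, long ny, long nz, long steps, double seconds)
    {
        return (double)nx * ny * nz * steps / seconds / 1e6;
    }

    int main(void)
    {
        /* e.g. a 1000 x 1000 x 1000 cell model run for 100 time steps in 12.5 s */
        printf("throughput = %.0f MCPS\n", mcps(1000, 1000, 1000, 100, 12.5));
        return 0;
    }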

6.3.7 Performance in terms of model size (capability)

Rank one denotes a large FDTD model size, through to ten for a small model size.

This evaluation considers which platforms can deploy the biggest FDTD grids. The Bluegene/P at the Centre for High Performance Computing (CHPC) has the largest number of cores and required the least amount of effort to implement massive grids. The largest FDTD grid implemented on the Bluegene/P for this thesis, as shown in figure 6.9, is two billion FDTD cells. Multicore cluster platforms are also capable of processing FDTD models of this size and larger [131, 132].

The cluster and SUN M9000 SMP platforms required the same effort for large system deployment, but were not always available owing to the very high user demand on these shared resources. FDTD grids of two billion cells were processed on the SUN M9000 SMP and on the cluster computers.

The discrete GPU could only hold relatively small grids in its memory and this proved to be a serious liability for the deployment of large FDTD grids. The largest GPU based FDTD model used for this thesis is 121 million grid points, achieved on the C2070 GPU at the CHPC. Deployments of the FDTD on a cluster of GPUs [153] have been shown to process up to three billion FDTD cells.

6.3.8 Cost of purchase

Rank is from one for the cheapest to ten for the most expensive.

Some typical purchase prices of HPC systems are shown in table 6.5. These prices were obtained from the Centre for High Performance Computing in Cape Town and are approximate figures for the years 2010 to 2011. Prices have been normalised to the peak FDTD throughput in MCPS shown in table 6.4.

The normalised prices show that the best value for FDTD processing is achieved by the GPU, which is in a category of its own in this regard and has been awarded a rank of one in the general classification in table 6.3. Although the Intel multicore system also achieves a good normalised value, it is an order of magnitude less than the GPU and is therefore assigned a rank of four. The normalised pricing of the cluster system has been adjusted for the proportion of cores available to run the FDTD method, and is very similar to that of the Bluegene/P. According to the normalised pricing, the M9000 has been awarded the worst ranking (seven).

Two pricing considerations that have not been factored into the study are the maintenance costs of the HPC systems and the system life cycle before it is overtaken by newer technology. According to Koomey’s law, discussed in sections 4.4 and 5.2, the life cycle of an HPC system may currently be only one to two years before it is overtaken by newer technology. This factor alone makes the processing of the FDTD by GPU, or GPU cluster, an unassailable cost proposition.

Platform                     Cost in Rand 2010/2011   Price per core in Rand   Peak MCPS per million Rand   Rank
GPU GTX 480                               6 000                    67                   57692                 1
GPU Cluster (C2070 x 4)*                100 000                    55                       -                 -
Intel multicore MTL                      30 000                  1875                    4933                 4
Bluegene/P                            7 000 000                  1708                    1271                 5
Intel cluster                         9 900 000                  1237                    1090                 6
SUN M9000                             5 100 000                 19921                     760                 7

Table 6.5: Purchase price of HPC platforms. * costing for price of purchase comparison only

Large computing systems require sophisticated servicing and sub-systems, such as cooling racks and clean rooms. There is a cost attached to a maintenance plan for an HPC system, both in terms of capital expenditure and in terms of dependence on a specific vendor. These maintenance costs depend on the relationship between the HPC system vendor and the computing centre hosting the HPC system, and should be formalised in a Service Level Agreement (SLA). The absence of an SLA could result in the total failure of the system, i.e. it would be awarded a rank of ten+.

6.3.9 Power consumption

Rank one is desirable, ten is not.

The power consumption of the different hardware platforms was examined in sections 4.4.1 and 5.2. It was concluded there that power consumption is difficult to normalise, and hence to compare, although some definite trends can be established.

Power consumption is a limiting factor for the size of a computing system, both in terms of being able to dissipate enough heat without physically damaging the computing hardware and in terms of the financial cost of the power consumed.

Power per peak MCPS (W/MCPS) = system power / peak MCPS

HPC system              Power per core (W)   Power per peak MCPS (W/MCPS)   Rank
Nvidia C 2070 GPU*              0.5                      0.3                  1
S 870 GPU cluster               2.0                      0.8                  -
Intel multicore MTL*            5.3                      5.3                  3
Intel Phi MIC                   6                        -                    -
IBM Bluegene/P*                 9.7                      4.5                  4
Nehalem cluster*               10                        5.3                  5
Sun M9000-64*                 151                       10.2                  7

Table 6.6: Power features of different HPC systems (- not included in main classification)

6.3.10 Implementation time

The implementation time is taken as the time needed to verify the correct operation of the parallelised FDTD on a specific platform, i.e. the time taken to reformulate the serial FDTD into the parallel FDTD as described in chapter five. The coding environment was a mixture of Microsoft Windows and a variety of Linux operating systems. The coding language was C or C++. Technologies were restricted to MPI, openMP, and data parallelism for the GPUs.

The deployment time listed in table 6.7 below is an estimate of the programming time it took for one person (the author of this thesis) to deploy the basic parallel FDTD on a particular platform:

Platform               Time to deploy (days)   Rank
SUN M9000              Less than 1               1
Intel multicore MTL    1                         2
Cluster                3                         3
Bluegene/P             5                         4
GPU                    30+                       6

Table 6.7: Time to implement the parallel FDTD

According to this table, the SUN M9000 received the best rank (one) and the GPU the worst (six). The implementation time shown for the GPU is conservative: although the learning curve for CUDA programming did take considerable time, most of the GPU coding effort was taken up by coding the PML as split field components. The objective of coding the same basic serial FDTD algorithm into parallel form for every high performance platform was to normalise the effort taken to deploy the FDTD.

6.3.11 Future proofing from aspect of coding

Future proofing of code is a serious commercial consideration when building any high performance system, owing to the constantly evolving price/performance of new hardware and the new uses that are being found for the FDTD. A hardware platform receives a low ranking for this category if legacy FDTD code can be easily ported from one generation of the computing platform to the next, as the cost of re-deploying the code will then be small. The cluster and the IBM Bluegene are good examples of this, as the same parallel FDTD code base written with MPI will run on either of these platforms, and both have therefore received a ranking of three.

The rapidly changing architecture of the contemporary GPU marks this hardware as immature, and the FDTD written for this platform requires constant optimisation from one generation of hardware to the next. For this reason the FDTD for the GPU is considered to be less “future proof” and receives a higher ranking.

Customised hardware platforms such as the FPGA require coding specific to that platform. Upgrading from one device to another, even from the same manufacturer, will probably require some recoding, even if only for optimisation.

6.4 Summary

The parallel FDTD implementations on several different HPC systems have been compared in order to expose the characteristics that contribute to their deployment effort and performance. These parallel FDTD systems are deemed to be a good representation of most of the parallel systems currently in commercial or academic use. The deployment effort has been summarised in tabular form and the contributing characteristics have been listed on a ranking scale of one to ten, one denoting a rank that is conducive to a good or easy deployment effort and ten a detriment to the deployment effort. Each ranking has been described in some detail and related back to the sections of this thesis that address these characteristics.

A total of all the rankings shows that the GPU systems are the easiest to deploy overall. Although initially considered to be too small to manage large FDTD modelling scenarios, the GPU cluster is an entry point to very large scale FDTD processing at a low cost. The results of the comparison do not imply that these systems should all be viewed as competing HPC systems; instead, they are all components of the hybrid computing solutions that will be used to process the parallel FDTD in the future.

Chapter 7

7 Conclusions and Future Work

7.1 Contributions
7.2 Publications
7.3 Trends in FDTD and HPC
7.4 Possible directions for future work
7.5 Conclusions

7.1 Contributions

This thesis has made the following contributions:

1) Provided researchers and software application developers with a comprehensive comparison, quantified where possible, of the deployment effort required to implement algorithms using the parallel FDTD method on a cross section of contemporary high performance computing platforms.
2) A quantified determination that shows the GPU cluster to be the most suitable current high performance platform for the deployment of the parallel FDTD.
3) Identified programming techniques required for the deployment of the FDTD on different high performance computing platforms. The programmer can choose techniques from a variety of implementation methods and does not need to spend time creating a parallel FDTD solution from first principles. The use of these techniques radically reduces the programming time needed to create a working parallel FDTD program. Modern high performance computing platforms combine a variety of architectures that require more than one programming technique to achieve good FDTD performance.
4) Provided a comparison of the performance of the currently popular high performance computing architectures from the aspect of a parallel FDTD implementation. Although the FDTD method is not difficult to parallelise, it does make full use of all the facets of a high performance computing system, such as the inter-processor communication channels, which algorithms such as those used by the LINPACK benchmark [84] do not.
5) Plotted the historical relationship between the parallel FDTD and the emergence of high performance computing. The operation of the parallel FDTD is not practical without high performance computing and the advanced techniques used to implement it, owing to the large demands on computing capacity and capability made by this method.
6) Identified computational mechanisms that will be relevant to the deployment of the parallel FDTD on new multicore architectures in the near future. An example of this is the acceleration of the FDTD using vector register technology, such as SSE or AVX, on up-scaled processors such as the Intel Phi (a 50 core processor slated for release in 2013). The implementation analysis performed in this thesis indicated that although programming the parallel FDTD for this type of device may be quite conventional, the FDTD implementation will probably be subject to data bottlenecking on Many Integrated Core (MIC) processors. The data bottlenecking issue is in itself an important characteristic, as although some processes will perform well on the 50 core processor, the FDTD may not, owing to the limited data supply from the memory to the processing cores.
7) The creation of a leading edge deployment of the FDTD on multicore processors with AVX registers (a minimal sketch of such an update is given after this list). At the time of recording the research with these vector registers, there was no publication available that had achieved this. At the time of completion of this thesis there are now several publications showing the result of accelerating the FDTD with AVX registers, but these all discuss one specific method of implementation. The work in this thesis has outlined several ways in which to implement the parallel FDTD using acceleration from the AVX functionality.
8) The creation of a common parallel FDTD code with MPI that can be run on many different supercomputers with different architectures. The parallel FDTD code is operational on personal computers, large computer clusters, IBM’s Bluegene/P and the SUN M9000, and is based on the MPI threading technique. Using the same FDTD coding method is very useful when benchmarking the FDTD deployment on different computing platforms.
9) The creation of a parallel FDTD code for GPUs using Microsoft’s Direct Compute interface. This FDTD implementation for GPUs can be used on GPUs from different manufacturers. For this thesis the Direct Compute based FDTD implementation was tested on Nvidia and Intel GPUs.
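
As a concrete illustration of contribution 7, the sketch below shows one possible way of vectorising the innermost Ex update with AVX compiler intrinsics. It is a hypothetical fragment rather than the code developed for this thesis: the shifted input arrays hz_jm1 and hy_km1 (the H fields offset by one cell in j and k), 32-byte alignment and a line length that is a multiple of eight are all assumptions made for brevity.

    /* Hypothetical AVX sketch of the innermost Ex update (8 floats per iteration).
       Assumes 32-byte aligned arrays and n a multiple of 8. */
    #include <immintrin.h>

    void update_ex_avx(float *ex, const float *hz, const float *hz_jm1,
                       const float *hy, const float *hy_km1,
                       const float *ca, const float *cb,
                       float inv_dy, float inv_dz, int n)
    {
        const __m256 vdy = _mm256_set1_ps(inv_dy);
        const __m256 vdz = _mm256_set1_ps(inv_dz);

        for (int k = 0; k < n; k += 8) {
            /* curl term (dHz/dy - dHy/dz) for eight adjacent cells */
            __m256 curl = _mm256_sub_ps(
                _mm256_mul_ps(_mm256_sub_ps(_mm256_load_ps(hz + k),
                                            _mm256_load_ps(hz_jm1 + k)), vdy),
                _mm256_mul_ps(_mm256_sub_ps(_mm256_load_ps(hy + k),
                                            _mm256_load_ps(hy_km1 + k)), vdz));

            /* ex = ca * ex + cb * curl */
            __m256 e = _mm256_add_ps(
                _mm256_mul_ps(_mm256_load_ps(ca + k), _mm256_load_ps(ex + k)),
                _mm256_mul_ps(_mm256_load_ps(cb + k), curl));

            _mm256_store_ps(ex + k, e);
        }
    }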

7.2 Publications

The following contributions have resulted from the research work into the deployment of the parallel FDTD:

Conference Proceedings and Publications:

1) R G Ilgner, D B Davidson, “A comparison of the FDTD algorithm implemented on an integrated GPU versus a GPU configured as a co-processor”, IEEE International Conference on Electromagnetics in Advanced Applications (ICEAA), pp. 1046-1049, 2012.
2) R G Ilgner, D B Davidson, “A comparison of the parallel implementations of the FDTD method on a Shared Memory Processor and a GPU”, AfriCOMP11 - Proceedings of the 2nd African Conference on Computational Mechanics, pp. 1-6, 2011.
3) R G Ilgner, D B Davidson, “The acceleration of the FDTD method using extended length registers in contemporary CPUs”, South African Centre for HPC National Meeting, pp. 1-4, 2011.
4) R G Ilgner, D B Davidson, “A comparison of the performance of the parallel FDTD method using openMP versus the parallel FDTD method using MPI on the Sun Sparc M9000”, South African Centre for HPC National Meeting, pp. 1-4, 2010.

7.3 Trends in FDTD and HPC

The comparison of the FDTD deployment on various high performance systems undertaken for this thesis shows a trend towards greater processor performance with a corresponding reduction in power consumption. In addition to achieving good performance for the processing of the FDTD, GPU clusters also provide a platform with very low power consumption.

A trend can also be identified with regard to the architecture of the high performance computers on which the FDTD is deployed. We are seeing an ever-increasing number of cores being placed on processor dies, and an increase in the vector processing capacity of these cores. The descriptions of the parallel FDTD techniques deployed in this thesis cover the positive and negative aspects of these implementations in some detail, as well as the considerations that need to be made for the optimisation of the FDTD on these platforms.

Another trend is the move towards hybrid architectures, such as the Accelerated Processing Unit (APU) and the Tianhe supercomputer [131], which uses a hybrid combination of blade computers and GPUs to achieve its processing performance. This thesis allows a comparison of the different parallelisation and performance optimisation techniques that can be used to implement the FDTD on these hybrid platforms.

7.4 Possible directions for future work

To conclude this thesis, here follows a list of FDTD related items that fall outside the scope of this thesis but may be interesting to investigate in the future:

• An FDTD algorithm that can efficiently compute a data set that is larger than the memory of the system can contain. Data would be continuously drawn into the processing engine and saved back to disk. The key behind this is to produce an asynchronous process that can read and write the FDTD cell structure fast enough so that the processors are not underutilised. This technology would overcome the limitation of only processing FDTD models that fit into a single GPU’s memory.
• A very large scale processing system for the FDTD that will use a synchronised clock to exchange data between the FDTD threads. It is envisaged that the synchronisation will be achieved using a GPS clock and have processing and storage nodes somewhere in the internet cloud. The communication latency would be hidden by the sheer scale of performing the inter-process communication for a large number of devices synchronously. The benefit of such a system would be extremely large processing capacity for the FDTD systems.
• Adjusting the sequence of the FDTD iteration to include both the electric and magnetic field computation in one processing step, as opposed to the existing two. This would increase the computational density of the FDTD iteration and possibly improve its efficiency.
• Using the computational power of large processor systems to adjust the cell sizing of the FDTD dynamically so as to allow more efficient computation of mixed resolution meshes. Stability of the FDTD would need to be controlled on a per cell calculation basis so as to limit the effects of dispersion and inappropriate Courant conditions.
• Combining several HPC systems of different architecture at the CHPC in order to process extra large FDTD fields. An example of this would be the combination of either two independent clusters, or the Bluegene/P and the SUN M9000, to process the FDTD on one combined super system.

• The creation of a parallel FDTD algorithm based on openMP to overcome the processing performance issues related to data contention and processor affinity.

7.5 Conclusions

The classification of the factors contributing to the FDTD deployment effort and performance in the previous chapters is the result of the analysis and comparison of parallel FDTD implementations on various computing platforms. These classification factors are drawn directly from the experience of having deployed the parallel FDTD systems and having analysed the hardware and software mechanisms that contribute to the performance of the parallel FDTD process. Although the deployment effort of the parallel FDTD has been rated using different aspects of the implementation, it must be emphasised that the ranking indicates suitability for one type of implementation scenario or another, and not that one HPC platform is better than another.

The deployment effort was evaluated for the following platforms:

1) IBM’s Bluegene/P
2) Multicore clusters
3) SUN M9000 shared memory processor
4) General purpose graphics processing units
5) Multicore processors
6) Accelerated processing units

The following parallelisation techniques were used for the FDTD:

1) Message Passing Interface threading
2) openMP threading interface
3) CUDA and Direct Compute processing libraries for the GPU
4) Compiler intrinsics for vector registers such as AVX and SSE

A comparison of the deployment effort of the parallel FDTD on different high performance computing systems is a challenging task because there is no single quantifiable factor that makes one deployment superior to another. In addition, many of the features that define the deployment effort are subjective and can only be quantified by comparing different FDTD implementations.

Analysing the characteristics of the FDTD deployment effort is analogous to comparing the characteristics of different vehicles. Most vehicles have a common set of characteristics, such as load carrying capacity, number of seats, fuel consumption, etc., although the type of vehicle could range from a small passenger vehicle to a large truck; the designated use of the vehicle determines the relevance of the characteristics being used for the comparison. In the context of the FDTD, as used for electromagnetic modelling, an HPC platform would likewise be chosen according to a given purpose. A large commercial processing venture wishing to deploy the FDTD method to process large volumes of data would choose a large system such as a conventional or GPU cluster system, whereas a smaller contractor wishing to process smaller FDTD data sets may choose to deploy the FDTD on a shared memory system such as a multicore or a discrete GPU based system. An examination of the characteristics of table 6.3 indicates that it will be easier to find the computing skills to deploy the FDTD on a conventional cluster or a Bluegene/P. Conversely, a small organisation that does not have the correct GPGPU software implementation skills should avoid trying to develop the FDTD method on the GPU.

The deployment effort characteristics shown in table 6.3 compare several features that these different FDTD implementations have in common. In summary, this shows that although the contributing factors vary quite widely, the deployments on computer clusters and on supercomputers such as the Bluegene/P all require a similar effort, and that the FDTD is more difficult to deploy on the GPU.

Parallel FDTD implementations can defy the tenets of high performance processing, such as “bigger is better”. An interesting FDTD deployment that illustrates this is the phenomenon of “parallel slowdown” experienced on the SUN M9000 SMP: some processing architectures reach their maximum processing potential with an optimum number of cores, and the processing performance degrades if more processing cores are used. It is desirable to deploy the parallel FDTD with the least amount of effort, and a singular assessment based on the number of processing cores alone does not always make it apparent which systems are better deployment targets.

Despite the Bluegene/P being a sizeable investment, it offers good FDTD performance for very little deployment effort. Although some negative comments are made in this thesis about the availability of development information for this system, the contemporary version of the Bluegene series is a production-ready supercomputer and should be considered on a par with any large cluster when considering an investment for a large FDTD processing project. An old computing industry adage claims that “no one ever got fired for buying IBM”, and the Bluegene series is one of the reasons that gives substance to this.

The GPU and GPU cluster provide extraordinary performance in terms of capacity and capability. Given the low purchase cost of the GPU cluster compared with other, larger HPC systems, the GPU cluster has to be considered the most obvious choice for the processing of numerical formulations such as the FDTD. In addition to providing the processing power for numerical formulations, the GPU also has the ability to process the graphical results from the formulations, which is an important stage of the electromagnetic modelling process. Achieving good processing performance from the GPU, however, requires an intimate knowledge of how to optimise numerical algorithms for these devices.

Chapter 8

8 Bibliography

[1] K S Yee, “Numerical Solution of Initial Boundary Value Problems Involving Maxwell’s Equations in Isotropic Media,” IEEE Trans. Antennas Propagation, vol. 14, No 8, pp. 302-307, 1966.

[2] D B Davidson, Computational Electromagnetics for RF and Microwave Engineering, 2nd edn, Cambridge University Press, Cambridge, England, 2011.

[3] A Taflove, S C Hagness, Computational Electrodynamics, The Finite-Difference Time-Domain Method. Third Edition, Artech House, Chapters 3 to 7.

[4] P Micikevicius, 3D Finite Difference Computation on GPUs using CUDA, NVIDIA, 2701 San Tomas Expressway, Santa Clara, California, 95050. Available at: http://www.nvidia.com

[5] J de S Araujo, R Santos, C da Sobrinho, J Rocha, L Guedes, R Kawasaki, “Analysis of Monopole Antenna by FDTD Method Using Beowulf Cluster,” 17th International Conference on Applied Electromagnetics and Communications, Dubrovnik, Croatia, October 2003.

[6] D Sullivan, “Electromagnetic Simulation Using the FDTD Method” IEEE Series on RF and Microwave Technology, 2000.

[7] J Maxwell, A Treatise on Electricity and Magnetism, Clarendon Press Series, Oxford.

[8] M Hill, M Marty, ”Amdahl’s Law in the Multicore Era”, IEEE Computer Society Publication, July 2008.

[9] C Guiffaut, K Mahdjoubi, “A Parallel FDTD Algorithm using the MPI Library”, IEEE Antennas and Propagation Magazine, vol. 43, No 2, pp. 94-103, April 2001.

[10] D Buntinas, G Mercier, W Gropp, “Implementation and Shared-Memory Evaluation of MPICH2 over the Nemesis Communication Subsystem”, Mathematics and Computer Science Division, Argonne National Laboratory. [email protected]. Available at: http://www.anl.gov

[11] P M Dickens, P Heidelberger, D M Nicol, “Parallelized Direct Execution Simulation of Message Passing Parallel Programs”, IEEE transactions on Parallel and Distributed systems, pp.1090- 1105, 1996.

[12] K Donovan, “Performance of Shared Memory in a Parallel Computer”, IEEE Transactions on Parallel and Distributed Systems, vol. 2, no 2, pp. 253-256, 1991.

[13] A Taflove, S C Hagness, Computational Electrodynamics, The Finite-Difference Time-Domain Method. Third Edition, Artech House, Chapters 8, Near Field to Far Field.

[14] L Chai, High Performance and Scalable MPI Intra-Node Communication Middleware for Multi- Core Clusters, PhD Thesis, Ohio State University, 2009.

[15] W Yu, X Yang, Y Liu, R Mittra, D Chang, C Liao, M Akira, W Li, L Zhao, New Development of Parallel Conformal FDTD Method in Computational ElectroMagnetic Engineering, Penn State University, 2010.

[16] J Kim, E Park, X Cui, H Kim, “A Fast Feature Extraction in Object Recognition using Parallel Processing on CPU and GPU”, IEEE Int. Conf. on Systems, Man and Cybernetics, San Antonio, Texas, 2009.

[17] W Yu, M Hashemi, R Mittra, D de Araujo, M Cases, N Pham, E Matoglu, P Patel, B Herrman, “Massively Parallel Conformal FDTD on a BlueGene Supercomputer”, IEEE Int. Conf. on Systems, Man and Cybernetics, San Antonio, Texas, 2009.

[18] Intel White paper, Intel DDR Dual Memory Dual Channel White Paper. Available at: http://www.intel.com.

[19] W Yu, X Yang, Y Liu, L Ma, T Su, N Huang, R Mittra, R Maaskant, Y Lu, Q Che, R Lu, Z Su, “A New Direction in Computational Electromagnetics: Solving Large Problems Using the Parallel FDTD on the Bluegene/L Supercomputer Providing Teraflop-Level Performance”, IEEE Antennas and Propagation Magazine, vol. 50, No 2, pp. 26-41, April 2008.

[20] C Brench, O Ramahi, “Source Selection Criteria for FDTD Models”, IEEE Int Symposium on Electromagnetic Compatibility, Denver, Colorado, 1998.

[21] R Makinen, J Juntunen, M Kivikoski, “An Accurate 2-D Hard-Source Model for FDTD”, IEEE Microwave and Wireless Components Letters, vol. 11, no 2, 2001.

[22] D Rodohan, S Saunders, “Parallel Implementations of the Finite Difference Time Domain (FDTD) method”, Second International Conference on Computation in Electromagnetics, Nottingham, England, April 1994.

[23] W Buchanan, N Gupta, J Arnold, “Simulation of Radiation from a Microstrip Antenna using Three Dimensional Finite-Difference Time-Domain (FDTD) Method”, IEEE Eighth International Conference on Antennas and Propagation, Heriot-Watt University, Scotland March 1993.

[24] N Takada, T Shimobab, N Masuda, T Ito, “High-Speed FDTD Simulation Algorithm for GPU with Compute Unified Device Architecture”, IEEE Antennas and Propagation Society International Symposium, North Charleston, South Carolina, USA, June 2009.

[25] S Adams, J Payne, R Boppana, “Finite Difference Time Domain (FDTD) Simulations Using Graphics Processors”, IEEE HPCMP Users Group Conference, Pittsburgh, PA, USA, June 2007.

[26] D Liuge, L Kang, K Fanmin, “Parallel 3D Finite Difference Time Domain Simulations on Graphics Processors with CUDA”, International Conference on Computational Intelligence and Software Engineering, Wuhan, China, December 2009.

[27] M Inman, A Elsherbeni, J Maloney, B Baker, “GPU Based FDTD Solver with CPML Boundaries”, IEEE Antennas and Propagation Society International Symposium, Hawaii, USA, June 2007.

[28] R Abdelkhalek, H Calandra, O Coulaud, G Latu, J Roman, “FDTD Based Seismic Modeling and Reverse Time Migration on a GPU Cluster”, 9th International Conference on Mathematical and Numerical Aspects of Waves Propagation – Waves, Pau, France, June 2009

[29] T Anderson, S Ketcham, M Moran, R Greenfield, “FDTD Seismic Simulation of Moving Battlefield Targets”, Proceedings of Unattended Ground Sensor Technologies and Applications III, SPIE vol. 4393, 2001.

[30] T Ciamulski, M Hjelm, and M Sypniewski, “Parallel FDTD Calculation Efficiency on Computers with Shared Memory Architecture”, Workshop on Computational Electromagnetics in Time- Domain, Perugia, Italy, Oct 2007.

[31] K Lakhani, L Jeppesen, P Lohse, J Panetta, “The Value of Openness in Scientific Problem Solving”, President and Fellows of Harvard College, 2007. Available at: http://www.techrepublic.com/whitepapers/the-value-of-openness-in-scientific- problem-solving/314795

[32] R Maddox, G Singh, R Safranek, Weaving High Performance Multiprocessor Fabric, Intel press, Chapters 1-2, 2009.

[33] U Drepper, What every programmer should know about memory. Red Hat Incorporated, 2007. Available at: http://lwn.net/Articles/250967

[34] Intel, White paper, Intel® Advanced Vector Extensions Programming Reference, June 2011. Available at: http://www.intel.com

[35] T Willhalm, Y Boshmaf, H Plattner, A Zeier, J Schaffner, “SIMD-Scan: Ultra Fast in-Memory Table Scan using on-Chip Vector Processing Units”, VLDB conf, Lyon, France, Aug 2009.

[36] Fujitsu, White paper, Sparc Enterprise Architecture M4000/M5000/M6000/M7000/M8000/M9000, October 2009. Available at: http://www.fujitsu.com/global/services/computing/server/sparcenterprise/products/m9000/spec

[37] D Luebke, White Paper, Nvidia® GPU Architecture and Implications, 2007. Available at: http://www.nvidia.com

[38] K Leung, R Chen, D Wang, “FDTD Analysis of a Plasma Whip Antenna”, IEEE Antennas and Propagation Society International Symposium, Washington, DC, USA, July 2005.

[39] G Amdahl, “Validity of the Single Processor Approach to Achieving Large Scale Computing Capabilities”, AFIPS Conference proceedings, vol. 30, Atlantic City, New Jersey, USA, April 1967.

[40] C Niramarnsakul, P Chongstitvatana, M Curtis, “Parallelization of European Monte-Carlo Options Pricing on Graphics Processing Units”. Eight International Joint Conference on Computer Science and Software Engineering, Nakhon Pathom, Thailand, May 2011.

[41] M Su, I El-Kady, D Bader, Y Lin, “A novel FDTD Application Featuring openMP-MPI hybrid parallelization”, IEEE International Conference on Parallel Processing, Montreal, QC, Canada, 2004. (same as 63)

[42] D Rodohan, S Saunders, “A CAD Tool for Electromagnetic Simulation on the Associative String Processor”, International Conference on Application-Specific Array Processors, Venice, Italy, October 1993.

[43] T Nagaoka, S Watanabe, “Multi-GPU accelerated three-dimensional FDTD method for electromagnetic simulation”, IEEE Annual International Conference of the Engineering in Medicine and Biology Society, Boston, MA, USA, Aug/Sept 2011.

[44] D Michéa, D Komatitsch, “Accelerating a Three-Dimensional Finite-Difference Wave Propagation Code using GPU Graphics Cards”, International Journal of Geophysics, No 182, pp. 389-402, 2010.

[45] D Sheen, S Ali, M Abouzahra, J Kong, “Application of the Three-Dimensional Finite-Difference Time-Domain Method to the Analysis of Planar Microstrip Circuits”, IEEE Transactions on Microwave Theory and Techniques, vol. 38, pp. 849-857, 1990.

[48] Parallel Virtual Machine (PVM) Website. Available at: http://www.csm.ornl.gov/pvm/

[49] L Tierney, SNOW Reference Manual, Department of Statistics & Actuarial Science University of Iowa, Sept 2007.

[50] Z Li, L Luo, J Wang, S Watanabe, “A Study of Parallel FDTD Method on High-Performance Computer Clusters”, International Joint Conference on Computational Sciences and Optimization, vol. 2, Sanya, Hainan, China, April 2009.

[51] C Zhang, X Yuan, A Srinivasan, “Processor Affinity and MPI Performance on SMP-CMP Clusters”, International Conference on Scalable Computing and Communications; Eighth International Conference on Embedded Computing, Dalian, China, September 2009.

[52] J Lorenzo, J Pichel, D LaFrance-Linden, F Rivera, D Singh. “Lessons Learnt Porting Parallelisation Techniques for Irregular Codes to NUMA Systems”, 18th Euromicro International Conference on Parallel, Distributed and Network-Based Processing, Pisa, Italy, Feb 2010.

[53] W Yu, R Mittra, X Yang, Y Liu “Performance Analysis of Parallel FDTD Algorithm on Different Hardware Platforms”, IEEE Antennas and Propagation Society International Symposium, Charleston, SC, USA, June 2009.

[54] Z Yu, R D Wei, L Changhog, “Analysis of Parallel Performance of MPI based Parallel FDTD on PC Clusters”, IEEE Microwave Conference Proceedings, Suzhou, China, December 2005.

[55] P Konczak, M Sypniewski, “The method of improving performance of the GPU-accelerated 2D FDTD simulator”, 18th International Conference on Microwave Radar and Wireless Communications, Vilnius, Lithuania, June 2010.

[56] V Demir, A Elsherbeni, “Programming Finite-Difference Time-Domain for Graphics Processor Units Using Compute Unified Device Architecture”, IEEE Antennas and Propagation Society International Symposium, Toronto, Ontario, Canada, July 2010

[57] K Sano, Y Hatsuda, L Wang, S Yamamoto, “Performance Evaluation of Finite-Difference Time- Domain (FDTD) Computation Accelerated by FPGA-based Custom Computing Machine”, Graduate School of Information Sciences, Tohoku University, 2010

[58] H Suzuki, Y Takagi, R Yamaguchi, S Uebayashi, “FPGA Implementation of FDTD Algorithm”, IEEE Microwave Conference Proceedings, Suzhou, China, December 2005.

[59] C He, W Zhao, M Lu, “Time Domain Numerical Simulation for Transient Waves on Reconfigurable Co-processor Platform”, Proceedings of the 13th Annual IEEE Symposium on Field-Programmable Custom Computing Machines, Napa, CA, USA, April 2005 .

[60] R Mittra, W Yu, Advanced FDTD Method: Parallelization, Acceleration, and Engineering application, Artech House, 2011.

[61] S Hagness, A Taflove, J Bridges, “Two-Dimensional FDTD Analysis of a Pulsed Microwave Confocal System for Breast Cancer Detection: Fixed-Focus and Antenna-Array Sensors”, IEEE Transactions on Biomedical Engineering, vol. 45, pp.1470-1479, 1998.

[62] H Lee, F Teixeira, “Cylindrical FDTD Analysis of LWD Tools Through Anisotropic Dipping- Layered Earth Media”, IEEE Transactions on Geoscience and Remote Sensing, vol. 45, pp. 383- 388, 2007.

[63] I El-Kady, M Su, D Bader, “A novel FDTD Application Featuring openMP-MPI Hybrid Parallelization”, International Conference on Parallel Processing, Montreal, Quebec, Canada, August 2004.

[64] W Walendziuk, J Forenc, “Verification of Computations of a Parallel FDTD Algorithm”, Proceedings of the International Conference on in Electrical Engineering, Warsaw, Poland, September 2002.

[65] IEEE Standards, IEEE Recommended Practice for Validation of Computational Electromagnetics Computer Modeling and Simulations, IEEE Electromagnetic Compatibility Society, 2011.

[66] A Elsherbeni, C Riley, C E Smith, “A Graphical User Interface for a Finite Difference Time Domain Simulation Tool”, Proceedings of the IEEE SoutheastCon 2000, pp. 293-296, Nashville, TN, USA, April 2000.

[67] A Taflove, M J Piket, R M Joseph, “Parallel Finite Difference-Time Domain Calculations”, International Conference on Computation in Electromagnetics, London, England, Nov 1991.

[68] R Lytle, “The Numeric Python EM project”, IEEE Antennas and Propagation Magazine, vol. 44, 2002.

[69] V Andrew, C Balanis, C Birtcher, P Tirkaa, “Finite-Difference Time-Domain of HF Antennas”, Antennas and Propagation Society International Symposium, Seattle, WA, USA, June 1994.

[70] T Namiki, “Numerical Simulation of Antennas using Three-Dimensional Finite-Difference Time- Domain Method”, High Performance Computing on the Information Superhighway, HPC Asia ‘97, pp. 437-443, Seoul, Korea, May 1997.

[71] N Homsup, T Jariyanorawiss, “An Improved FDTD Model for the Feeding Gap of a Dipole Antenna”, Proceedings of IEEE SoutheastCon, Charlotte-Concord, NC, USA, March 2010.

[72] S Watanabe, Member, M Taki, “An Improved FDTD Model for the Feeding Gap of a Thin-Wire Antenna”, IEEE Microwave and Guided Wave Letters, April 1998.

[73] T P Stefanski, S Benkler, N Chavannes, N Kuster, “Parallel Implementation of the Finite- Difference Time-Domain Method in Open Computing Language”, International Conference on Electromagnetics in Advanced Applications (ICEAA), Sydney, Australia, Sept 2010.

[74] R Alverson, D Roweth, L Kaplan, “The Gemini System Interconnect”, IEEE 18th Annual High Performance Interconnects (HOTI), Santa Clara, USA, pp. 83-87, 2010.

[75] H Subramoni, K Kandalla, J Vienne, S Sur, B Barth, K Tomko, R McLay, K Schulz, D K Panda, “Design and Evaluation of Network Topology-/Speed-Aware Broadcast Algorithms for InfiniBand Clusters”, IEEE International Conference on Cluster Computing (CLUSTER), Austin, USA, pp. 317-325, 2011.

[76] C Rojas, P Morell, “Guidelines for Industrial Ethernet Infrastructure Implementation: A Control Engineer's Guide”, IEEE-IAS/PCA 52nd Cement Industry Technical Conference, Colorado Springs, USA, pp. 1-18, 2010.

[77] White paper, Q-logic High performance network, Introduction to Ethernet Latency. Available at: http://www.qlogic.com/Resources/Documents/TechnologyBriefs/Adapters/ Tech_Brief_Introduction_to_Ethernet_Latency.pdf

[78] S Alam, R Barrett, M Bast, M R Fahey, J Kuehn, C McCurdy, J Rogers, P Roth, R Sankaran, J S Vetter, P Worley, W Yu, ”Early Evaluation of IBM Bluegene/P”, International Conference High Performance Computing, Networking, Storage, and Analysis. Austin: ACM/IEEE, 2008.

[79] Y Yang, P Xiang, M Mantor, H Zhou, “CPU-Assisted GPGPU on Fused CPU-GPU Architectures”, IEEE 18th International Symposium on High Performance Computer Architecture (HPCA), pp. 1- 12, New Orleans, USA, 2012.

[80] M Daga, A M Aji, W Feng, “On the Efficacy of a Fused CPU+GPU Processor (or APU) for Parallel Computing”, Symposium on Application Accelerators in High-Performance Computing (SAAHPC), pp. 141-149, Knoxville, USA, 2011.

[81] CUDA C Best Practices Guide, NVIDIA manual. Available at: http://www.nvidia.com

[82] CUDA Programming manual. Available at: http://www.nvidia.com

[83] D Hackenberg, R Schöne, D Molka, M S Müller, A Knüpfer, “Quantifying power consumption variations of HPC systems using SPEC MPI benchmarks”, Journal of Computer Science - Research and Development, vol. 25, no. 3, pp. 155-163, 2010.

[84] A Petitet, R C Whaley, J Dongarra, A Cleary, “HPL – A Portable Implementation of the High- Performance Linpack Benchmark for Distributed-Memory Computers”, Innovative Computing Laboratory, University of Tennessee.

[85] V Morozov, “Bluegene/P Architecture, Application Performance and Data Analytics” Argonne Leadership Computing Facility, Getting Started Workshop, Argonne National Laboratory, 2011.

[86] X Meilian, P Thulasiraman, “Parallel Algorithm Design and Performance Evaluation of FDTD on 3 Different Architectures: Cluster, Homogeneous Multicore and Cell/B.E.”, The 10th IEEE International Conference on High Performance Computing and Communications, pp. 174-181, Dalian, China, Sept 2008.

[87] P H Ha, P Tsigas, O J Anshus, “The Synchronization Power of Coalesced Memory Accesses”, IEEE Transactions on Parallel and Distributed Systems, vol. 21, pp. 939-953, 2010.

[88] Y Tian, C Lin, K Hu, “The Performance Model of Hyper-Threading Technology in Intel Nehalem Microarchitecture”, 3rd International Conference on Advanced Computer Theory and Engineering (ICACTE), vol. 3, pp. 379-383, Chengdu, China, Aug 2010.

[89] H Inoue, T Nakatani, Performance of Multi-Process and Multi-Thread Processing on Multi-core SMT Processors, IBM Corp.

[90] B K Cabralg, W Lagunar, R Mcleod, S L Ray, S T Pennockr, L Bergerm, F Bland, “Interactive Pre and Post-Processing Tools”, Antennas and Propagation Society International Symposium, San Jose, CA, USA, June 1989.

[91] S Krakiwsky, L Tumer, M Okoniewski, “Acceleration of Finite-Difference Time-Domain (FDTD) Using Graphics Processor Units (GPU)”, IEEE MTT-S International Microwave Symposium Digest, vol. 2, pp. 1033-1036, June 2004.

[92] T Stefanski, N Chavannes, N Kuster, “Hybrid OpenCL-MPI Parallelization of the FDTD Method”, International Conference on Electromagnetics in Advanced Applications (ICEAA), pp. 1201-1204, Torino, Italy, Sept 2011.

[93] A G Schiavone, I Codreanu, R Palaniappan, P Wahid, “FDTD speedups obtained in distributed computing on a Linux workstation cluster”, IEEE International Symposium Antennas and Propagation Society, vol. 3, pp. 1336-1339, Salt Lake City, UT, USA, July 2000.

[94] J R Yost, “IEEE Annals of Computing”, IEEE Journals and Magazines, vol. 3, pp. 2-4, 2011.

[95] J G Koomey, S Berard, M Sanchez, H Wong, “Implications of Historical Trends in the Electrical Efficiency of Computing”, IEEE Annals of the History of Computing, vol. 33, pp. 46-54, 2011.

[96] Microsoft HLSL Programming manual. Available at: http://msdn.microsoft.com/en-us/library/windows/desktop/bb509635(v=vs.85).aspx

[97] White paper: A Osseyran, P Michielse, Mapping applications to architectures – the strategy at SARA, May 2012.

[98] R Bolla, R Bruschi, “RFC 2544 performance evaluation and internal measurements for a Linux based open router”, Workshop on High Performance Switching and Routing, Poznan, Poland, June 2006.

[99] L Zhang, X Yang, W Yu, “Acceleration study for the FDTD method using SSE and AVX Instructions”, 2nd International Conference on Consumer Electronics, Communications and Networks (CECNet), pp. 2342-2344, Hubei, China, April 2012.

[100] W Yu, X Yang, Y Liu, R Mittra, J Wang, W Yin, “Advanced Features to Enhance the FDTD Method in GEMS Simulation Software Package”, IEEE International Symposium on Antennas and Propagation (APSURSI), pp. 2728-2731, Washington, USA, July 2011.

[101] A Foong, J Fung, D Newel, “An In-Depth Analysis of the Impact of Processor Affinity on Network Performance”, 12th IEEE International Conference on Networks, (ICON 2004), Proceedings, pp. 244-250, Singapore, Nov 2004.

[102] M Lee, Y Ryu, S Hong, C Lee, “Performance Impact of Resource Conflicts on Chip Multi-processor Servers”, PARA'06 Proceedings of the 8th International Conference on Applied Parallel Computing: State of the Art in Scientific Computing, pp. 1168-1177, Umea, Sweden, June 2006.

[103] Moore’s law website. Available at: http://www.mooreslaw.org

[104] IBM Bluegene/P Application Programming Manual. Available at: http://www.redbooks.ibm.com/redbooks/pdfs/sg247287.pdf

[105] D M Sullivan, “Exceeding the Courant condition with the FDTD method”, IEEE Microwave and Guided Wave Letters, vol. 6, pp. 289-291, 1996.

[106] G Mur, “Absorbing Boundary Conditions for the Finite-Difference Approximation of the Time-Domain Electromagnetic-Field Equations”, IEEE Transactions on Electromagnetic Compatibility, vol. EMC-23, pp. 377-382, 1981.

[107] J P Berenger, “Perfectly Matched Layer for the FDTD Solution of Wave-Structure Interaction Problems”, IEEE Transactions on Antennas and Propagation, vol. 44, pp. 110-117, 1996.

[108] D B Davidson, “A review of important recent developments in full-wave CEM for RF and Microwave Engineering [Computational Electromagnetics]”, 3rd International Conference on Computational Electromagnetics and Its Applications, Proceedings ICCEA 2004, pp. 1-4, Beijing, China, Nov 2004.

[109] C A Balanis, Advanced Engineering Electromagnetics, New York: Wiley, Chapter 6, 1989.

[110] R J Luebbers, K S Kunz, M Schneider, F Hunsberger, “A Finite-Difference Time-Domain Near Zone to Far Zone Transformation”, IEEE Transactions on Antennas and Propagation, vol. 39, 1991.

[111] P Konczak, M Sypniewski, “GPU accelerated multiplatform FDTD simulator”, IET 8th International Conference on Computation in Electromagnetics (CEM 2011), pp. 1-2, Wroclaw, Poland, April 2011.

[112] T Stefanski, N Chavannes, N Kuster, “Performance Evaluation of the Multi-Device OpenCL FDTD solver”, Proceedings of the 5th European Conference on Antennas and Propagation (EUCAP), pp. 3995 – 3998, Rome, Italy, April 2011.

[113] P Goorts, S Rogmans, S Vanden Eynde, P Bekaert, “Practical Examples of GPU Computing Optimization Principles”, Proceedings of the 2010 International Conference on Signal Processing and Multimedia Applications (SIGMAP), pp. 46-49, Athens, Greece, July 2010.

[114] R Schneider, M M Okoniewski, L Turner, “Custom Hardware Implementation of the Finite-Difference Time-Domain (FDTD) Method”, IEEE MTT-S International Microwave Symposium Digest, pp. 875-878, Seattle, WA, USA, June 2002.

[115] M Douglas, M Okoniewski, M A Stuchly, “Accurate Modeling of Thin Wires in the FDTD Method”, IEEE Antennas and Propagation Society International Symposium, pp. 1246-1249, Atlanta, GA, USA, June 1998.

[116] D Gorodetsky, A Wilsey, “A Signal Processing Approach to Finite-Difference Time-Domain Computations”, 48th Midwest Symposium on Circuits and Systems, pp. 750-753, Cincinnati, Ohio, USA, August 2005.

[117] S E Krakiwsky, L E Turner, M Okoniewski, “Graphics Processor Unit (GPU) Acceleration of Finite-Difference Time-Domain (FDTD) Algorithm”, Proceedings of the 2004 International Symposium on Circuits and Systems, V-265 - V-268, Vancouver, Canada, May 2004.

[118] O Cen Yen, M Weldon, S Quiring, L Maxwell, M Hughes, C Whelan, M Okoniewski, “Speed It Up”, IEEE Microwave Magazine, pp. 70-78, 2010.

[119] V Varadarajan, R Mittra, “Finite-Difference Time-Domain (FDTD) Analysis Using Distributed Computing”, IEEE Microwave and Guided Wave Letters, pp. 144-145, 1994.

[120] D Cabrera, X Martorell, G Gaydadjiev, E Ayguade, D Jimenez-Gonzalez, “OpenMP Extensions for FPGA Accelerators”, International Symposium on Systems, Architectures, Modelling, and Simulation, pp. 17-24, Samos, Greece, July 2009.

[121] A Sheppard, Programming GPUs, O'Reilly Media, 2004.

[122] J Durbano, F E Ortiz, J R Humphrey, P F Curt, D W Prather, “FPGA-Based Acceleration of the 3D Finite-Difference Time-Domain Method”, Proceedings of the 12th Annual IEEE Symposium on Field-Programmable Custom Computing Machines, Napa, CA, USA, April 2004.

[123] J Serre, P Lewis, K Rosenfeld, “Implementing OSI-based Interfaces for Network Management”, IEEE Communications Magazine, pp.76-81, 1993.

[124] J A Herdman, W P Gaudin, D Turland, S D Hammond, “Benchmarking and Modelling of POWER7, Westmere, BG/P, and GPUs: An Industry Case Study”, SIGMETRICS Performance Evaluation Review 38(4), pp. 16-22, 2011.

[125] Y Lu, C Qiu, R Lu, Z Su, “Parallelization Efficiency of an FDTD Code on a High Performance Cluster under Different Arrangements of Memory Allocation for Material Distribution”, IEEE Antennas and Propagation Society International Symposium, pp. 4901-4904, Honolulu, Hawaii, USA, June 2007.

[126] N Oguni, H Aasai, “Estimation of Parallel FDTD-based Electromagnetic Field Solver on PC Cluster with Multi-Core CPUs”, Electrical Design of Advanced Packaging and Systems Symposium, pp. 159-162, Seoul, Korea, Dec 2008.

[127] Jülich Supercomputing Centre. Available at: http://www2.fz-juelich.de/jsc/general

[128] Google Platform. Available at: http://en.wikipedia.org/wiki/Google_platform

[129] K Kim, Q Park, “Overlapping Computation and Communication of Three-Dimensional FDTD on a GPU Cluster”, Computer Physics Communications, vol. 183, Issue 11, pp. 2364-2369, 2012.

[130] W Pala, A Taflove, M Piket, R Joseph, “Parallel Finite Difference-Time Domain Calculations”, International Conference on Computation in Electromagnetics, pp. 83-85, London, England, Nov 1991.

[131] L Liguo, M Jinjun, Z Guangfu, Y Naichang, “A Parallel FDTD Simulator based on the Tianhe-1A supercomputer”, IEEE International Conference on Microwave Technology & Computational Electromagnetics (ICMTCE), pp. 406-409, Beijing, China, May 2011.

[132] W Yu, “A Novel Technique for High Performance Parallel FDTD Method”, IEEE International Conference on Microwave Technology & Computational Electromagnetics (ICMTCE), pp. 441-444, Beijing, China, May 2011.

[133] G Schiavone, I Codreanu, P Palaniappan, P Wahid, “FDTD Speedups Obtained in Distributed Computing on a Linux Workstation Cluster”, IEEE Antennas and Propagation Society International Symposium, pp. 1336-1339, Salt Lake City, UT, USA, July 2000.

[134] T Martin, “An Improved Near to Far Zone Transformation for the Finite Difference Time Domain Method”, IEEE Transactions on Antennas and Propagation, pp. 1263-1271,1998.

[135] W Yu, R Mittra, “Development of a Graphical User interface for the Conformal FDTD Software Package”, IEEE Antennas and Propagation Magazine, vol. 45, NO. 1, 2003.

[136] Y Jiang, “The Use of CAD Data to Create a System of Two-Dimensional Finite Element Analysis of Data”, International Conference on E-Product E-Service and E-Entertainment (ICEEE), pp. 1-3, Henan, China, Nov 2010.

[137] FEKO. Available at: http://www.feko.info

[138] Hadoop High Performance data storage system. Available at: http://hadoop.apache.org/

[139] The Green 500 List. Available at: http://www.green500.org/

[140] openMP Manuals. Available at: http://www.openMP.org/

[141] E Quinnell, E Swartzlander, C Lemons, “Floating-Point Fused Multiply-Add Architectures”, Conference Record of the Forty-First Asilomar Conference on Signals, Systems and Computers, pp. 331-337, Pacific Grove, California, USA, Nov 2007.

[142] Wikipedia Torus Interconnects. Available at: http://en.wikipedia.org/wiki/Torus_interconnect

[143] B Subramaniam, W Feng, “The Green Index: A Metric for Evaluating System-Wide Energy Efficiency in HPC Systems”, IEEE 26th International Parallel and Distributed Processing Symposium Workshops & PhD Forum (IPDPSW), pp. 1007-1013, Shanghai, China, May 2012.

[144] W Yu, X Yang, Y Liu, R Mittra, A Muto, Advanced FDTD Method. Artech House, Boston, pp. 248, 2011.

[145] C A Balanis, Advanced Engineering Electromagnetics, New York: Wiley, Chapter 1, 1989.

[146] Information Technology - Portable Operating System Interface (POSIX), ISO/IEC/IEEE 9945 (First edition 2009-09-15), 2009. Available at: http://en.wikipedia.org/wiki/POSIX

[147] D B Davidson, “High-Fidelity FDTD Modelling of Far-Field TE Scattering from a PEC Cylinder”, IEEE Antennas and Propagation Society International Symposium, Chicago, Illinois, USA, July 2012.

[148] MOAB Job scheduler. Available at: http://www.adaptivecomputing.com/products/hpc-products/moab-hpc-basic-edition

[149] R G Ilgner, D B Davidson, “A comparison of the FDTD algorithm implemented on an integrated GPU versus a GPU configured as a co-processor”, IEEE International Conference on Electromagnetics in Advanced Applications (ICEAA), pp. 1046-1049, Cape Town, South Africa, Sept 2012.

[150] R G Ilgner, D B Davidson, “A comparison of the parallel implementations of the FDTD method on a Shared Memory Processor and a GPU”, AfriCOMP11 - Proceedings of AfriCOMP 2nd African Conference on Computational Mechanics, pp. 1-6, Cape Town, South Africa, Jan 2011.

[151] R G Ilgner, D B Davidson, “The acceleration of the FDTD method using extended length registers in contemporary CPUs”, South African Centre for HPC National Meeting, pp. 1-4, Pretoria, South Africa, Dec 2011.

[152] R G Ilgner, D B Davidson, “A comparison of the performance of the parallel FDTD method using openMP versus the parallel FDTD method using MPI on the Sun Sparc M9000”, South African Centre for HPC National Meeting, pp. 1-4, Cape Town, South Africa, Dec 2010.

[153] D Pasalic, M Okoniewski, “Accelerated FDTD Technique for Marine Controlled Source Electromagnetic Imaging”, International Conference on Electromagnetics in Advanced Applications (ICEAA), pp. 1198 – 1200, Cape Town, WP, South Africa, Sept 2012.

[154] Controlled source electromagnetics. Available at: http://en.wikipedia.org/wiki/Controlled_source_electro-magnetic

[155] Y Zhang, J Owens, “A Quantitative Performance Analysis Model for GPU Architectures”, IEEE 17th International Symposium on High Performance Computer Architecture (HPCA), pp. 382-393, San Antonio, TX, USA, Feb 2011.

[156] R Luebbers, “Lossy dielectrics in FDTD”, IEEE Transactions on Antennas and Propagation, pp. 1586 – 1588, 1993.

[157] D Thiel, R Mittra, “Surface Impedance Modeling Using the Finite-Difference Time-Domain Method”, IEEE Transactions on Geoscience and Remote Sensing, pp. 1350 – 1357, 1997.

[158] OpenCL - The open standard for parallel programming of heterogeneous systems. Available at: http://www.khronos.org/opencl.

[159] D. Goldberg, What every computer scientist should know about floating-point arithmetic, ACM Computing Surveys, vol 23, pp. 5-48, 1991.

[160] Intel White paper, T Mattson, K Strandberg, Parallelization and Floating Point Numbers. Available at: http://software.intel.com/en-us/articles/parallelization-and-floating-point-numbers.

[161] D. Orozco, G. Gao, “Mapping the FDTD Application to Many-Core Chip Architectures”, IEEE International Conference on Parallel Processing (ICPP), pp. 309-316, Vienna, Austria, Sept 2009.

[162] Intel White paper, S Zhou, G Tan, FDTD Algorithm Optimization on Intel® Xeon Phi™ Coprocessor. Available at: http://software.intel.com/en-us/articles/fdtd-algorithm-optimization-on-intel-xeon-phi-coprocessor.

[163] J Toivanen, T Stefanski, N Kuster, N Chavannes, “Comparison of CPML Implementations for the GPU-Accelerated FDTD Solver”, Progress In Electromagnetics Research M, vol. 19, pp. 61-75, 2011.

[164] Floating Point Units. Available at: http://en.wikipedia.org/wiki/Floating-point_unit

Chapter 9

9 Appendix - Verification of the FDTD implementations

9.1 Objectives of the verification
9.2 Far Field modelling with the FDTD
9.3 The FDTD model
9.4 Modelling with FEKO
9.5 A comparison of the FDTD and FEKO derived MOM results

9.1 Objectives of the verification

To verify that the FDTD algorithm used for the work described in this document produces correct computational results, the output of a classic computational modelling scenario computed with the FDTD was compared to an equivalent scenario modelled using FEKO, a commercial EM modelling package that uses the Method of Moments (MOM) as its solver.

9.2 Far Field modelling with the FDTD

The far field calculation is based on a scattered field formulation, which avoids the complication of separating the total and incident fields in the far field formulation. The following formulations and discussions are therefore based on the scattered field. The frequency domain near to far field formulation is well known and is based on the theoretical framework described by Luebbers [110], Martin [134], and Taflove [13]. As the determination of scattering from an arbitrarily shaped body can be quite complex, the principle of equivalence [13] is used and is drawn for the two dimensional case in Figure 9.1. The basic premise is that the electromagnetic radiation from an arbitrarily shaped body can be calculated from a fictitious surface known as the equivalent surface. The equivalence theorem is described in precise detail in texts such as Balanis [109] and Taflove [13]; Figure 9.1 serves to introduce the surface electric and magnetic currents J and M, which are central to the calculations that follow. For the surface enclosing the scattering object, as shown in figure 9.1, the scattered electric and magnetic fields give rise to the surface currents $\mathbf{J} = \hat{n} \times \mathbf{H}$ and $\mathbf{M} = -\hat{n} \times \mathbf{E}$, where $\hat{n}$ is the normal to the surface and J, E, M and H are all vector quantities.

Figure 9.1: The far field calculated from the equivalent surface.

The far field is defined as being in the region where $r > 2D^2/\lambda$, where r is the distance from the scattering object to the point of observation, D is the largest dimension of the scattering object, and $\lambda$ is the wavelength of the source signal. As the antenna is a vertical structure in figure 9.1, the length of the dipole antenna, D, is not labelled.

The far field values E and H are derived from the integrated vector potential contributions from an equivalent surface, as shown in figure 9.1. The vector potentials from the equivalent surface are calculated from the electric and magnetic currents as shown in equations 9.1 and 9.2. The electric and magnetic surface currents are calculated from their respective contributing equivalent surface components, as described by Martin [134]. In the computational sense, the equivalent surface is made up of electric and magnetic field points that are spatially displaced [110, 134]. The electric surface currents are calculated from the magnetic field points making up the equivalent surface, and the equivalent magnetic surface currents are derived from the electric field points making up the equivalent surface.

$\mathbf{A} = \dfrac{\mu_0 e^{-jkR}}{4\pi R} \iint_S \mathbf{J}\, e^{jkr'\cos\psi}\, ds'$    9.1

$\mathbf{F} = \dfrac{\varepsilon_0 e^{-jkR}}{4\pi R} \iint_S \mathbf{M}\, e^{jkr'\cos\psi}\, ds'$    9.2

where the radiation vectors N and L are defined in equations 9.3 and 9.4. R is the distance to the point of observation, r is the distance from the source to the point of observation, r' is the distance from the source to a point on the equivalent surface, $\psi$ is the angle between r' and the direction of observation, and k is the wave number, defined as $k = 2\pi/\lambda$.

Python formulations of the equations 9.1 and 9.2 are shown in code segment 9.1 below. The code segment shows the calculation of J and M for one face of the equivalent surface, bearing in mind that a 3D structure has six such faces. The procedural context for this code segment is given in the FDTD far field process flowchart shown as figure 9.2. Functionality from the numerical library NumPy was used to perform some of the numerical operations.

Python code segment 9.1:

# Discrete Fourier kernel for the current FDTD iteration
Fourier = dt * exp(-1j * omega * iteration * dt / MAX_ITER)

# Summation of J and M, the Fourier transformed magnetic and electric surface
# currents, for every E and H FDTD iteration
Jz[sx-1, 0:ny, 0:nz] += Hy[sx, 0:ny, 0:nz] * Fourier   # J = surface normal x H
Mz[s, 0:ny, 0:nz] += Ey[s+1, 0:ny, 0:nz] * Fourier     # M from the spatially displaced tangential E field
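To make the loop context of code segment 9.1 explicit, the sketch below embeds the two accumulation statements in a complete, runnable program. The grid dimensions, time step, face indices and the dummy field arrays are assumptions introduced purely for illustration; in the thesis code the Ey and Hy arrays are produced by the FDTD update itself.

Python code sketch (illustrative only):

import numpy as np

# Assumed parameters for the sketch; not the thesis values
nx, ny, nz = 20, 20, 20            # cells in the FDTD volume
dt, freq = 1.0e-11, 75.0e6         # time step and source frequency
omega = 2.0 * np.pi * freq
MAX_ITER = 100
sx = nx - 5                        # assumed x index of one face of the equivalent surface
s = sx - 1

# Fourier transformed electric and magnetic surface currents for that face
Jz = np.zeros((nx, ny, nz), dtype=complex)
Mz = np.zeros((nx, ny, nz), dtype=complex)

for iteration in range(MAX_ITER):
    # In the real solver the E and H field updates run here; dummy fields are used instead
    Hy = np.random.rand(nx, ny, nz)
    Ey = np.random.rand(nx, ny, nz)

    # Discrete Fourier kernel for this time step, as in code segment 9.1
    Fourier = dt * np.exp(-1j * omega * iteration * dt / MAX_ITER)

    # Accumulate the surface currents on one face of the equivalent surface
    Jz[sx-1, 0:ny, 0:nz] += Hy[sx, 0:ny, 0:nz] * Fourier
    Mz[s, 0:ny, 0:nz] += Ey[s+1, 0:ny, 0:nz] * Fourier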

When the FDTD has completed all its iterations, it will have summed (integrated) all of the contributions of J and M from all of the faces of the equivalent surface. The far field observation points are then expressed in spherical coordinates outside the equivalent surface, and a summation and reverse transform of the J and M potential contributions is performed.

$\mathbf{N} = \iint_S \mathbf{J}\, e^{jkr'\cos\psi}\, ds'$    9.3

$\mathbf{L} = \iint_S \mathbf{M}\, e^{jkr'\cos\psi}\, ds'$    9.4

The Python code fragments in segment 9.2 express the functionality of equations 9.3 and 9.4 in programmatic terms. Although some of the language elements of the code can be related to 9.3 and 9.4 intuitively, it is the combination of NumPy's add.reduce and ravel functions that performs the summation (integration).

Python code segment 9.2:

# Radiation vector components for one face of the equivalent surface: the surface
# current is weighted by the far field phase factor and summed, with dy * dz being
# the area of a cell face
Ny1 = dy * dz * add.reduce( ravel( Jy[sx-1, 0:ny, 0:nz] * exp(1j * beta * rcosphi[sx-1, 0:ny, 0:nz]) ) )
Nz1 = dy * dz * add.reduce( ravel( Jz[sx-1, 0:ny, 0:nz] * exp(1j * beta * rcosphi[sx-1, 0:ny, 0:nz]) ) )
Ly1 = dy * dz * add.reduce( ravel( My[sx, 0:ny, 0:nz] * exp(1j * beta * rcosphi[sx, 0:ny, 0:nz]) ) )
Lz1 = dy * dz * add.reduce( ravel( Mz[sx, 0:ny, 0:nz] * exp(1j * beta * rcosphi[sx, 0:ny, 0:nz]) ) )
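As an aside, the add.reduce and ravel combination used above is simply an element-wise sum scaled by the area of a cell face, which is the discrete counterpart of the surface integrals in equations 9.3 and 9.4. The short check below illustrates this equivalence; the complex-valued face and its size are assumptions for the demonstration only.

Python code sketch (illustrative only):

import numpy as np

# An arbitrary complex-valued face standing in for J (or M) weighted by the phase factor
face = np.random.rand(8, 8) * np.exp(1j * np.random.rand(8, 8))
dy = dz = 0.01                      # 1 cm cell face, matching the model in section 9.3

integral_reduce = dy * dz * np.add.reduce(np.ravel(face))   # as used in code segment 9.2
integral_sum = dy * dz * np.sum(face)                       # equivalent plain summation

assert np.allclose(integral_reduce, integral_sum)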

As shown by Balanis [109], after transforming to the spherical coordinates theta and phi, the amplitudes of the electric and magnetic field components can be expressed as in equations 9.5 to 9.8.

$E_\theta = -\dfrac{jk e^{-jkR}}{4\pi R}\left(L_\phi + \eta_0 N_\theta\right)$    9.5

$E_\phi = \dfrac{jk e^{-jkR}}{4\pi R}\left(L_\theta - \eta_0 N_\phi\right)$    9.6

$H_\theta = \dfrac{jk e^{-jkR}}{4\pi R}\left(N_\phi - \dfrac{L_\theta}{\eta_0}\right)$    9.7

$H_\phi = -\dfrac{jk e^{-jkR}}{4\pi R}\left(N_\theta + \dfrac{L_\phi}{\eta_0}\right)$    9.8

where $\eta_0 = \sqrt{\mu_0/\varepsilon_0}$ is the impedance of free space.    9.9

The electric and magnetic fields as functions of theta and phi can be expressed in Python code as in the code segment 9.3 below:

Python code segment 9.3:

# beta is the wave number k, No is the free space impedance eta_0 and
# R is the distance to the point of observation
Etheta = -1j * beta * exp(-1j * beta * R) / 4.0 / Pi / R * (Lphi + No * Ntheta)
Ephi = 1j * beta * exp(-1j * beta * R) / 4.0 / Pi / R * (Ltheta - No * Nphi)
Htheta = 1j * beta * exp(-1j * beta * R) / 4.0 / Pi / R * (Nphi - Ltheta / No)
Hphi = -1j * beta * exp(-1j * beta * R) / 4.0 / Pi / R * (Ntheta + Lphi / No)
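As a quick cross-check of equations 9.5 to 9.8 and of the expressions in code segment 9.3, the far field components should behave like a plane wave, with $E_\theta/H_\phi = \eta_0$ and $E_\phi/H_\theta = -\eta_0$. The short script below verifies this numerically; the N and L component values are arbitrary complex numbers chosen only to exercise the formulas, not values taken from the thesis model.

Python code sketch (illustrative only):

import numpy as np

Pi = np.pi
No = 376.73                        # free space impedance eta_0 in ohms
beta = 2.0 * Pi / 4.0              # wave number for a 4 m wavelength (75 MHz)
R = 100.0                          # observation distance in metres

# Arbitrary radiation vector components (assumptions for the check)
Ntheta, Nphi = 0.3 + 0.1j, -0.2 + 0.4j
Ltheta, Lphi = 0.5 - 0.2j, 0.1 + 0.6j

Etheta = -1j * beta * np.exp(-1j * beta * R) / 4.0 / Pi / R * (Lphi + No * Ntheta)
Ephi = 1j * beta * np.exp(-1j * beta * R) / 4.0 / Pi / R * (Ltheta - No * Nphi)
Htheta = 1j * beta * np.exp(-1j * beta * R) / 4.0 / Pi / R * (Nphi - Ltheta / No)
Hphi = -1j * beta * np.exp(-1j * beta * R) / 4.0 / Pi / R * (Ntheta + Lphi / No)

# Far field plane wave relationships
assert abs(Etheta / Hphi - No) < 1e-6
assert abs(Ephi / Htheta + No) < 1e-6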

The process flowchart in figure 9.2 is a functional summary of the far field calculation that incorporates the electromagnetic and programmatic aspects described above.

Figure 9.2: The FDTD far field calculation program flow.

9.3 The FDTD model

The model used for the FDTD far field calculation is shown below in a planar view of the z, x plane and is mirrored in the z, y plane. The FDTD cell size was set to one cm in the x, y and z dimensions. The feed gap was excited at a frequency of 75 MHz and was set to the size of one cell, i.e. one cm. The far field was calculated at 100 m from the dipole antenna.

Figure 9.6: The FDTD vertical dipole model in the z, x plane
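A quick numeric check of these model parameters, sketched below, confirms that the discretisation and the observation distance are sensible: the one cm cell gives roughly 400 cells per 4 m wavelength, the far field boundary 2D²/λ for the one m dipole of section 9.4 is about 0.5 m (so 100 m is comfortably in the far field), and the 3D Courant limit on the time step is about 19 ps. The constants and the Courant expression are standard textbook values, not parameters taken from the thesis code.

Python code sketch (illustrative only):

import math

c0 = 2.998e8            # speed of light in m/s
freq = 75.0e6           # excitation frequency in Hz
dx = 0.01               # FDTD cell size in m (1 cm)
D = 1.0                 # dipole length in m (from section 9.4)
r_obs = 100.0           # far field observation distance in m

wavelength = c0 / freq                        # approximately 4 m
cells_per_wavelength = wavelength / dx        # approximately 400
far_field_boundary = 2.0 * D**2 / wavelength  # approximately 0.5 m
dt_max = dx / (c0 * math.sqrt(3.0))           # 3D Courant limit, approximately 19.3 ps

print(wavelength, cells_per_wavelength, far_field_boundary, dt_max, r_obs > far_field_boundary)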

9.4 Modelling with FEKO

FEKO is a commercial software product [137] used for the simulation of electromagnetic fields. The software is based on the Method of Moments (MOM) integral equation formulation. The FEKO model consists of a vertical wire of the same length as the FDTD model, i.e. the dipole antenna was one m in length; unlike the FDTD model, the wire was effectively infinitely thin. The feed gap was excited with a continuous sine wave at 75 MHz.

9.5 A comparison of the FDTD and FEKO derived MOM results

A comparison of the electric far field as calculated by the FDTD and the FEKO MOM is shown in the polar plot of figure 9.3 below. The electric far field computed with the two techniques is very similar, although there are some minor variations which could possibly be attributed to the difference in thickness of the source model.
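Beyond the visual comparison in figure 9.3, the agreement between the two solvers can be summarised by a single figure of merit such as the maximum deviation between the normalised patterns. The sketch below illustrates one way of doing this; the placeholder arrays stand in for the sampled FDTD and FEKO amplitudes over a common set of observation angles and are assumptions for illustration only.

Python code sketch (illustrative only):

import numpy as np

theta = np.linspace(0.0, 2.0 * np.pi, 361)                            # common observation angles
e_fdtd = np.abs(np.sin(theta)) + 0.01 * np.random.rand(theta.size)    # placeholder FDTD samples
e_feko = np.abs(np.sin(theta))                                        # placeholder FEKO samples

# Normalise each pattern to its own peak before comparing, as in figure 9.3
e_fdtd_norm = e_fdtd / e_fdtd.max()
e_feko_norm = e_feko / e_feko.max()

max_deviation = np.max(np.abs(e_fdtd_norm - e_feko_norm))
print("maximum deviation between normalised patterns:", max_deviation)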

Figure 9.3: A comparison of the normalised electric field amplitudes using the FDTD and FEKO’s Method of Moments.
