Parallelising a Simulator for the Analysis of Electromagnetic Radiation Using MUMPS Library∗

R. Rico López, V. Escuder Cabañas, R. Durán Díaz
Dep. de Automática, Univ. de Alcalá, Alcalá de Henares, Spain
[email protected], [email protected], [email protected]

L.E. García-Castillo
Dep. de Teoría de la Señal y Comunicaciones, Univ. Carlos III, Madrid, Spain
[email protected]

I. Gómez-Revuelto
Dep. de Ingeniería Audiovisual y Comunicaciones, Univ. Politécnica de Madrid, Madrid, Spain
[email protected]

J.A. Acebrón
Center for Mathematics and its Applications, Lisbon, Portugal
[email protected]

∗ Jointly funded by Comunidad Autónoma de Madrid and Universidad de Alcalá under grant number CAM-UAH2005/042. Also supported by the Ministerio de Educación y Ciencia, Spain, under project TEC2007-65214/TCM.
† Corresponding author.

ABSTRACT

The practical experience of parallelising a simulator of general scattering and radiation electromagnetic problems is presented. The simulator stems from an existing sequential simulator in the frequency domain and can be fruitfully used in applications such as testing the coverage of a wireless network, the analysis of complex structures, and so on. After the analysis of a test case, two steps were carried out: firstly, a "hand-crafted" code parallelisation was developed within the kernel of the simulator; secondly, the sequential library used in the existing simulator was replaced by the parallel MUMPS library in order to solve the associated linear algebra problem in parallel. For factorising the matrix with MUMPS, two different ordering methods have been considered.

Keywords

Parallel Computing, MPI, Sparse Direct Solvers, MUMPS library

1. INTRODUCTION

This paper presents the practical experience and the results obtained in the parallelisation of the code of a simulator that implements a novel hybrid Finite Element-Boundary Integral method, known as Finite Element - Iterative Integral Equation Evaluation (FE-IIEE) ([11], [12], [10]), based on the well-known Finite Element Method (FEM). This method permits an efficient analysis and solution of general problems of radiation and scattering of electromagnetic waves.

The analysis of the radiation and scattering of electromagnetic waves is an important issue that finds applications in many electromagnetic engineering areas. Currently, companies face the challenge of cutting costs in many areas and, in this context, it is very attractive to use simulators before the actual deployment of several kinds of resources. One such resource is, for example, the growing demand for wireless networks in urban areas. In this and other settings, simulators can be advantageously used to reduce the time to availability and the design cycles of products.

Modern radiating structures, which may exhibit complex configurations with the presence of different materials, call for the use of FEM (see, for example, [17]), which is very flexible and able to handle within the same code complex geometries, non-canonical surfaces, exotic permeable materials, anisotropy, and so on. However, the FEM formulation does not incorporate the radiation condition. For this reason, FEM is hybridised with a Boundary Integral (BI) representation of the exterior field, thus endowing the FEM analysis with a numerically exact radiation boundary condition at the mesh truncation boundary. Several hybrid FEM-BI schemes have been proposed (see, for example, [16, 21, 13]). In contrast with standard FEM-BI approaches, FE-IIEE preserves the original sparse and banded structure of the FEM matrices, allowing the use of efficient FEM solvers.
In this work we show the experience and results obtained along the process of parallelising an already existing sequential FE-IIEE simulator. To achieve this goal we identified bottlenecks from the point of view of both memory usage and computational load, targeting a modified code which is scalable in the range of a modest number of processors. This result overcomes the limitations of the original simulator, especially in terms of memory availability, thus allowing the analysis of larger problems.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
VALUETOOLS 2009, October 20-22, 2009 - Pisa, Italy.
Copyright 2009 ICST 978-963-9799-70-7/00/0004 $5.00.
Digital Object Identifier: 10.4108/ICST.VALUETOOLS2009.7456

2. COMPUTATIONAL ANALYSIS OF THE SIMULATION METHOD

In this section we analyse the consumption of computational resources by the different building blocks inside the existing code in order to prioritise their parallelisation. By computational resources we refer to two clearly different aspects:

• time in computational cycles; and
• memory consumption.
The first aspect refers to the total time needed for the application to compute the results, whereas the second aspect has an impact on the size limit of the problems that can be solved.

2.1 Methodology

The computational analysis of the existing code of the simulator was carried out by running a test problem. The test problem was selected keeping in mind that the available computational resources were rather limited. It consists in the scattering of a plane wave incident on a dielectric cube with losses. The number of mesh elements is 2,684, and the number of unknowns is 18,288.

The available system was a cluster of eight nodes (known as blades by the manufacturer, Sun Microsystems) model SunFire B1600. The technical details of each node can be found in Table 1.

Table 1: Configuration of every cluster node
Processor: AMD Athlon XP-M 1800+
Clock: 1.5 GHz
Cache: 256 KB
Main memory: 1 GB
Hard disk: 30 GB
Network interface: 1 Gbps

[Figure 1: Program flow and monitorisation points. Blocks: data input; factorisation; simulator kernel (IE evaluation, upgrade of boundary conditions, A·X = B resolution; accumulated time over 6 iterations); error check; post-process; new K0; final clean-up.]

2.2 Monitorisation

The sequential source code was conveniently probed by inserting time and memory-size controlling functions at critical points.
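Such probes can be as simple as a wrapper that accumulates the wall-clock time spent in a code region. The original instrumentation was inserted into the Fortran source; the following Python sketch is purely illustrative of the idea (all names and the placeholder workload are assumptions, not the paper's code):

```python
import time
from contextlib import contextmanager

@contextmanager
def probe(label, log):
    """Accumulate wall-clock time spent in a code region under `label`.

    A stand-in for the time-controlling functions inserted at critical
    points of the simulator; memory-size probing would be analogous.
    """
    start = time.perf_counter()
    try:
        yield
    finally:
        log[label] = log.get(label, 0.0) + time.perf_counter() - start

# Usage: accumulate the time of the kernel over six iterations.
timings = {}
for _ in range(6):
    with probe("simulator kernel", timings):
        sum(i * i for i in range(10_000))  # placeholder workload
```

Because the probe accumulates rather than overwrites, a single entry reports the total time spent in a block across all iterations, which is exactly the "accumulated time in 6 iterations" quantity monitored in Fig. 1.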
The probed code was sequentially executed for the test problem using only one cluster node of the SunFire B1600. As a result, it became clear that the program exhibits two separate computational phases:

• an initial phase, where the matrix is factorised;
• an iterative phase, where the code performs a number of iterations inside the simulator kernel before achieving the required accuracy. For the test problem considered, obtaining a threshold relative error δ between subsequent iterations below 10⁻⁴ required a total of six iterations.

The previous scheme can be repeated for different frequencies. In our test case, only one frequency, denoted K0 in Fig. 1, was used.

The results obtained with the previous execution are depicted in Fig. 1, showing the points where the monitorisation was inserted. Only those blocks that have a relevant computational load are shown. In this figure it is apparent that the simulator kernel consists of the evaluation of a double integral, the upgrade of the boundary conditions, and the solution of a linear sparse system, A · X = B. Both the initial phase and the iterative phase are handled using a sparse solver, since A is a sparse matrix.

The original sequential implementation used the HSL linear algebra library package (formerly known as the Harwell Subroutine Library, see [2]) as the building block for simulation. HSL is a collection of Fortran packages for large-scale scientific computation that makes extensive use of the BLAS package (Basic Linear Algebra Subprograms, see [1]). The sequential code uses a Fortran solver: either the ME62 or the ME42 routines, depending on whether the matrix A is symmetric/hermitian or not. Optionally, those routines may work with direct-access files for the matrix factors so that large problems can be solved using a relatively small in-core memory. However, disk storage offers only limited speed for accessing the factors, and the size of the files can render the situation unmanageable as the number of unknowns grows.

Thus, the parallelisation is motivated by the need to overcome the limitations of the sequential code, so that larger problems can be run in a reasonable time frame and with limited memory resources. Specifically, the results for the present implementation shown in this paper have been obtained using the ME62 Fortran solver (complex symmetric case) with in-core storage.
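The two phases map onto the classic usage pattern of a direct solver: factorise A exactly once, then reuse the stored factors for one cheap solve per kernel iteration, stopping when the relative change between iterates drops below δ = 10⁻⁴. The sketch below illustrates this structure on a tiny dense system, with a naive LU factorisation standing in for the sparse frontal solver; the matrix, the right-hand-side update, and all names are illustrative assumptions, not the paper's code:

```python
def lu_factor(A):
    # Doolittle LU factorisation without pivoting: a tiny dense
    # stand-in for the sparse factorisation done by ME62 or MUMPS.
    n = len(A)
    LU = [row[:] for row in A]
    for k in range(n):
        for i in range(k + 1, n):
            LU[i][k] /= LU[k][k]
            for j in range(k + 1, n):
                LU[i][j] -= LU[i][k] * LU[k][j]
    return LU

def lu_solve(LU, b):
    # Forward substitution (unit lower triangle), then back substitution.
    n = len(LU)
    x = list(b)
    for i in range(n):
        for j in range(i):
            x[i] -= LU[i][j] * x[j]
    for i in reversed(range(n)):
        for j in range(i + 1, n):
            x[i] -= LU[i][j] * x[j]
        x[i] /= LU[i][i]
    return x

def fe_iiee_iterate(A, b, update_rhs, delta=1e-4, max_iter=50):
    # Initial phase: the matrix is factorised exactly once.
    LU = lu_factor(A)
    x = lu_solve(LU, b)
    # Iterative phase: the IE evaluation / boundary-condition upgrade
    # yields a new right-hand side, solved with the stored factors.
    for it in range(2, max_iter + 1):
        x_new = lu_solve(LU, update_rhs(x))
        err = max(abs(u - v) for u, v in zip(x_new, x)) / max(abs(u) for u in x_new)
        x = x_new
        if err < delta:
            return x, it
    return x, max_iter

# Toy usage: a contractive RHS update, so the iteration converges quickly.
A = [[2.0, 0.0], [0.0, 4.0]]
x, iters = fe_iiee_iterate(A, [1.0, 2.0],
                           lambda x: [1.0 + 0.1 * x[0], 2.0 + 0.1 * x[1]])
```

The key design point, shared by the frontal solver and by MUMPS, is that the factorisation cost is paid once in the initial phase, while each of the six kernel iterations only pays for a forward/backward substitution.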
The algorithm used for element ordering (crucial for the performance of the frontal solver) with HSL is a variant of Sloan's algorithm [20], implemented in the MC63 routine of HSL. Indirect element ordering with default MC63 parameters is used prior to calling the frontal solver factorisation routines.

3. HAND-CRAFTED PARALLELISATION

The parallelisation was carried out using the message passing interface (MPI, [5]), with the support of the MPICH2 library (specifically, version 1.0.8). The hand-crafted code parallelisation targeted essentially the heavy computation load given by the convolution-type
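A natural way to distribute such a heavy kernel computation is to split the observation points of the double integral into contiguous blocks, one per MPI rank, and gather the partial results afterwards. The helper below sketches one plausible index decomposition in Python; the actual code is Fortran over MPICH2, so the decomposition, names, and the commented MPI usage are assumptions for illustration only:

```python
def block_partition(n_points, size, rank):
    """Contiguous, balanced block of indices owned by `rank` among
    `size` MPI processes; the first `extra` ranks get one extra point."""
    base, extra = divmod(n_points, size)
    lo = rank * base + min(rank, extra)
    hi = lo + base + (1 if rank < extra else 0)
    return range(lo, hi)

# Hypothetical per-rank kernel step (MPI calls shown only as comments
# so the sketch stays self-contained):
#   comm = MPI.COMM_WORLD                              # after MPI_Init
#   mine = block_partition(n_obs, comm.size, comm.rank)
#   partial = [evaluate_integral(i) for i in mine]     # heavy double integral
#   field = comm.allgather(partial)                    # collect all blocks
```

Because the per-point integral evaluations are independent, this decomposition needs communication only at the gather step, which is what makes the kernel a good target for hand-crafted parallelisation on a modest number of processors.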
