FOM-INSTITUUT VOOR PLASMAFYSICA 'RIJNHUIZEN'
ASSOCIATIE EURATOM-FOM
AUGUST 1994
MAGNETOHYDRODYNAMICS: parallel computation of the dynamics of thermonuclear and astrophysical plasmas

RIJNHUIZEN REPORT 94-223

FIRST ANNUAL REPORT OF MASSIVELY PARALLEL COMPUTING PILOT PROJECT 93MPR05

This work was performed as part of the research programme of the 'Stichting voor Fundamenteel Onderzoek der Materie' (FOM) with financial support from the 'Nederlandse Organisatie voor Wetenschappelijk Onderzoek' (NWO). This work was sponsored by the Stichting Nationale Computerfaciliteiten (National Computing Facilities Foundation, NCF) for the use of supercomputer facilities, with financial support from the Nederlandse Organisatie voor Wetenschappelijk Onderzoek (Netherlands Organization for Scientific Research, NWO).

Summary

This is the first annual report of the MPP pilot project 93MPR05. In this pilot project, four research groups with different, complementary backgrounds collaborate with the aim of developing new algorithms and codes to simulate the magnetohydrodynamics of thermonuclear and astrophysical plasmas on massively parallel machines. The expected speed-up is required to simulate the dynamics of the hot plasmas of interest, which are characterised by very large magnetic Reynolds numbers and, hence, require high spatial and temporal resolutions (for details see section 1). The four research groups that collaborated to produce the results reported here are: the MHD group of Prof. Dr. J.P. Goedbloed at the FOM-Institute for Plasma Physics 'Rijnhuizen' in Nieuwegein, the group of Prof. Dr. H. van der Vorst at the Mathematics Institute of Utrecht University, the group of Prof. Dr. A.G. Hearn at the Astronomical Institute of Utrecht University, and the group of Dr. Ir. H.J.J. te Riele at the CWI in Amsterdam. The full project team met frequently during this first project year to discuss progress reports, current problems, etc. (see section 2). The main results of the first project year are:

• proof of the scalability of typical linear and nonlinear MHD codes

• development and testing of a parallel version of the Arnoldi algorithm

• development and testing of alternative methods for solving large non-Hermitian eigenvalue problems

• porting of the 3D nonlinear semi-implicit time evolution code HERA to an MPP system

The steps that were scheduled to reach these intended results are given in section 3. The collaboration turned out to be very successful and stimulating, and the project progresses on schedule. All the intended results for this first project year were actually obtained. The scalability of both linear and nonlinear MHD computations was demonstrated in tests reported in sections 4.2 and 4.6. Test runs with the newly developed parallel Arnoldi algorithm showed that the applied Arnoldi variant is not very suitable for solving large generalised MHD eigenvalue problems, due to the very slow convergence obtained (see section 4.4). Much better convergence rates have been obtained by a variant of the Jacobi-Davidson algorithm with block Jacobi preconditioning. For details see section 4.5. Also, the 3D nonlinear MHD code HERA was ported to the CM-5, i.e. the ± 6000 FORTRAN statements of HERA were re-programmed in CM-Fortran and fine-tuned to the CM-5 architecture (see section 4.7). New parallel GMRES-like methods have been developed and implemented (see section 4.8). The scientific publications produced during this first project year are listed in section 4.9. In conclusion, this first project year has been very fruitful indeed and the results obtained meet our high expectations.

1 Description of the project

1.1 Introduction

It is well known that matter in the universe mainly consists of plasma and that the dynamics of plasmas is essentially controlled by magnetic fields. Surprisingly, this fundamental fact only recently started to play an important role in the description of nature. In particular, research on energy production from controlled thermonuclear fusion has substantially increased our knowledge of the interaction of magnetic fields with hot ionised gases (plasmas). In turn, this has stimulated the research of astrophysical plasmas, in which magnetic fields turn out to play an ever more important role. A good example is the magnetohydrodynamics of the sun, in particular the solar corona, which consists of very hot plasma, where virtually all phenomena which are important in a tokamak (presently the most promising fusion experiment) return in a modified form. The appropriate instrument for the description of the macroscopic behaviour of plasmas in magnetic fields is magnetohydrodynamics (MHD). Consequently, research of MHD from the two points of view of fusion research and astrophysics has turned out to be an extremely fruitful area.

The equations of MHD are nonlinear partial differential equations, which contain those of gas dynamics and fluid mechanics as a special case (vanishing magnetic field). This implies that meaningful theoretical analysis in this area necessarily will centre on large-scale computations and that a vast range of applications is guaranteed. As compared to ordinary fluid dynamics, MHD concerns the interaction with a global structure: the magnetic field. Consequently, the essential problems in fusion and astrophysical plasma dynamics basically involve 3D time-dependent MHD structures characterised by the simultaneous presence of both very fine-scale and global spatial variations in a complex magnetic field geometry evolving on a number of vastly different time scales. So far, the required spatial and temporal resolutions for realistic problems have surpassed the limits of supercomputer power, so that the advent of massively parallel processor systems may lead to a breakthrough in this field.

The magnetic fields in tokamaks and in the corona of the sun (and of many other x-ray emitting stars) are complicated spatial structures which virtually always exhibit singular behaviour in the neighbourhood of magnetic surfaces which are resonant with Alfvén waves. For the linearised problem of wave propagation in these structures, discretisation by means of finite elements leads to a generalised eigenvalue problem with non-symmetric, but sparse, matrices of very large dimensions. The MHD group at Rijnhuizen (Goedbloed and collaborators) [1]-[20] has built up substantial experience, through national and international collaborations over the past years, both with the solution of this problem by means of a number of numerical methods (QR, inverse vector iteration, Lanczos, etc.) and with the peculiarities of the resulting spectra. In the nonlinear evolution of MHD wave phenomena, the simplification associated with harmonic temporal behaviour disappears and, moreover, the presence of all kinds of plasma waves severely limits the time step in numerical codes based on explicit algorithms. Hence, in a joint project with the Astronomical Institute of the University of Utrecht (Hearn and collaborators) [21]-[40], implicit and semi-implicit codes are developed to simulate the nonlinear dissipation of Alfvén waves in stellar coronae and the saturation of magnetic activity of MHD waves in tokamaks. Extension to parallel methods is urgent because the huge CPU-time and memory requirements block the road to new physical insight, which demands surveys of the relevant parameter domains. The active collaboration of the Numerical Analysis group of the University of Utrecht (van der Vorst and collaborators) [41]-[60] could signal the turning point in this area of research, leading to new constructive approaches with a number of applications vastly exceeding what we could possibly work out ourselves, even in the combined effort of the three groups. Shortly after the start of the project, the Large-Scale Computing group of CWI (te Riele and collaborators) [72, 73, 74] joined the effort with support in the area of large-scale linear systems.

In the present project, the development of parallel methods is aimed at the solution of large-scale eigenvalue problems and the calculation of the nonlinear evolution in magnetohydrodynamics (fluid dynamics of magnetised plasmas), with applications to thermonuclear fusion research and plasma astrophysics. This involves a substantial transformation of the numerical tools that are presently being used by the MHD group at Rijnhuizen and the Astronomical Institute of the University of Utrecht. Over the past years, these tools have been in constant use and development by researchers both of the mentioned groups and abroad (KU Leuven, JET Culham, IPP Garching), so that a structure of modular and highly portable codes has grown which is open to the introduction of parallelism in the algorithms used, while maintaining and extending their physical scope. In particular, the big physics questions of the two fields, viz. an explanation of the high temperature of the solar corona and the interpretation of the level of MHD activity in tokamaks, require spatial and temporal resolutions of the codes which by far exceed the power of present supercomputers. The aim of the present project is to overcome these severe limitations by the use of state-of-the-art parallel methods as developed and, where necessary, to be developed by the numerical analysis group at the University of Utrecht.

1.2 Magnetohydrodynamics

With respect to physics, the dynamics of controlled fusion plasmas in tokamaks and of astrophysical plasmas in the solar corona are quite different. However, with respect to the computational aspects, of interest here, the two fields permit a single vantage point, viz. that of computational MHD. Hence, in this section plasma dynamics will be considered as one field. The MHD equations are non-linear partial differential equations, which contain those of gas dynamics and fluid mechanics as a special case (vanishing magnetic field). In dimensionless form, these equations can be written as:

\frac{\partial \rho}{\partial t} = -\nabla \cdot (\rho \mathbf{V}),   (1)

\rho \frac{\partial \mathbf{V}}{\partial t} = -\rho \mathbf{V} \cdot \nabla \mathbf{V} - \nabla p + (\nabla \times \mathbf{B}) \times \mathbf{B},   (2)

\frac{\partial p}{\partial t} = -\nabla \cdot (p \mathbf{V}) - (\gamma - 1)\, p\, \nabla \cdot \mathbf{V} + (\gamma - 1)\, \eta\, (\nabla \times \mathbf{B})^2,   (3)

\frac{\partial \mathbf{B}}{\partial t} = \nabla \times (\mathbf{V} \times \mathbf{B}) - \nabla \times (\eta\, \nabla \times \mathbf{B}),   (4)

and

\nabla \cdot \mathbf{B} = 0.   (5)

Here \rho, \mathbf{V}, p, and \mathbf{B} denote the plasma density, the velocity field, the thermal pressure, and the magnetic field, respectively. The ratio of specific heats, \gamma, is assumed to be 5/3 and \eta is the plasma resistivity. The equation \nabla \cdot \mathbf{B} = 0 serves as an initial condition on \mathbf{B}. The magnetic Reynolds number R_m is defined as R_m = l_0 V_0 / \eta, with l_0 a typical length scale and V_0 a typical speed.

The remainder of this section consists of two subsections corresponding to the two physical pillars supporting the project. Section 1.2.1 describes MHD spectroscopy, which includes the linear dissipative MHD studies. The second pillar, the time-dependent nonlinear magnetohydrodynamics, is described in section 1.2.2. In both cases, the present 'state-of-the-art' is given first, followed by the new 'scalable' physics aimed at by the present project.

1.2.1 MHD spectroscopy

Present status: In the framework of linear dissipative MHD, the study of the interaction between plasma and magnetic field leads to large-scale complex eigenvalue problems for free oscillations and to temporal evolution problems for forced (externally driven) oscillations. These problems have been cast in the form of a set of two modular computer programmes, called CASTOR for tokamaks and POLLUX for coronal flux tubes, exploiting identical solvers as the core of the numerical procedure [16]. CASTOR is extensively used at JET (the Joint European Torus experiment at Culham) for the interpretation of measured MHD spectra [8, 11, 14, 15, 17, 18], whereas POLLUX is presently used for the study of Alfvén wave heating of the solar corona [13, If-, 20] and will be used in the near future for the interpretation of observed x-ray emissions from the SOHO satellite, which will be launched by ESA and NASA in 1995. Because of the deep analogy with quantum mechanical methods for the elucidation of atomic spectra, we have termed this new field of study 'MHD spectroscopy' [16].

The linear ideal MHD equations contain non-square-integrable singularities in the direction normal to the magnetic flux surfaces. These singularities are resolved in dissipative MHD, but the variations are still strongly localised for realistic values of the plasma resistivity in tokamaks and in the solar corona. This requires special discretisation techniques, for which we have chosen a Galerkin method exploiting finite elements for the normal direction and a spectral discretisation (Fourier modes) in the other two spatial directions. For the temporal evolution a fully implicit semi-discretisation method is applied. After the spatial discretisation, the above-mentioned problems take the following general forms: A x = \lambda B x (eigenvalue problem) and A x - B\, dx/dt = f (temporal evolution problem). Here, \lambda denotes the (complex) eigenvalue, x denotes a state vector, A is a non-Hermitian matrix, and B is a block tri-diagonal matrix. For the driven problems, f denotes the forcing term. The matrix A is a block tri-diagonal matrix due to the chosen discretisation and has considerable dimensions because of the large number of finite elements and Fourier modes involved. For realistic resolutions this leads to problems regarding both memory and CPU time consumption.

New 'scalable' physics: From a physical point of view, MHD spectroscopy is 'scalable' in two ways: on the one hand, higher magnetic Reynolds numbers induce smaller length scales and longer diffusion time scales; on the other hand, the complex geometrical structure and inhomogeneity of the laboratory and astrophysical plasmas considered induce coupling of the Fourier modes in at least one spatial direction. The magnetic Reynolds number R_m is of the order of 10^6-10^8 for fusion plasmas and 10^10-10^12 for solar coronal plasmas. The cross-sectional shape of fusion plasmas in a tokamak and the density stratifications in solar coronal loops typically require the coupling of 20-100 Fourier modes. Present supercomputers only allow the computation of a full MHD spectrum or the linear dynamic behaviour of R_m = 10^5 plasmas when less than about 10 modes are coupled. Realistic simulations, involving higher magnetic Reynolds numbers and a more complex magnetic structure, evidently require more computer power (typically involving 8 physical variables represented by 2 different finite elements on 1000 radial grid points and 128 x 128 Fourier harmonics).
These simulations require parallel computing.
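To make the origin of this block structure explicit, the following schematic matrix element may be helpful; the notation (local finite elements h_j(s) in the normal coordinate s, Fourier labels m and m', and a generic linearised MHD operator \mathcal{L}) is introduced here for illustration only and is not taken from the CASTOR/POLLUX documentation:

```latex
% Schematic Galerkin matrix element (illustrative notation only):
% h_j are local finite elements in the normal coordinate s,
% m and m' label the coupled Fourier modes.
A^{mm'}_{jk} \;=\; \int_0^1 h_j(s)\,\mathcal{L}^{mm'}\!\bigl[h_k(s)\bigr]\,\mathrm{d}s
\;=\; 0 \qquad \text{whenever } |j-k| > 1 .
```

Because each finite element overlaps only its immediate neighbours, only the diagonal block and the two adjacent blocks per radial grid point are nonzero; with 8 physical variables, 2 element types per variable, and N_F coupled Fourier modes, this gives the blocks of size 16 N_F x 16 N_F exploited in sections 4.2 and 4.3.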

1.2.2 Nonlinear dynamics

Present status: For the interpretation of actual plasma dynamics in tokamaks and solar coronal loops it is crucial to understand the nonlinear phase of the evolution of MHD phenomena in these configurations. Nonlinear MHD adds two more complications. First, in addition to the linear mode coupling described above, nonlinear mode coupling occurs, involving hundreds of small-scale modes. Even though these small-scale modes contain little energy, they can affect the dynamics of the system considerably by modifying the effective resistivity and viscosity felt by the large-scale modes. Second, the discrepancy between the dynamical Alfvén time scale and the diffusion time scale becomes enormous, both in the highly conducting, elongated tokamak plasmas and in the solar coronal magnetic loops. This is no problem in linear MHD because one can use fully implicit algorithms there. However, nonlinear mode coupling makes implicit time stepping methods extremely complicated and CPU time consuming.

We have developed numerical codes that simulate the nonlinear temporal evolution of cylindrical plasma columns (for tokamaks and closed coronal loops) and semi-infinite plasmas (for open coronal holes giving rise to the solar wind) under external excitation. The first type of code has been modelled after the linear codes, with a finite difference discretisation in the radial direction and a spectral discretisation in the two other directions. A relatively simple and flexible semi-implicit predictor-corrector scheme has been used for the time stepping. This allows time steps up to a factor 1000 larger than the largest time steps allowed for explicit methods (limited by the Courant-Friedrichs-Lewy condition). Nevertheless, this code remains CPU time consuming. For the second type of code, concerned with the effect of Alfvén waves on the solar wind, the restriction on the time step becomes really prohibitive. Here, the use of implicit methods is considered, fashioned after the hydrodynamic calculation of stellar relaxation oscillations [26]-[31].

New 'scalable' physics: For the nonlinear temporal evolution the restriction on magnetic Reynolds numbers (i.e. required radial resolution) and number of Fourier harmonics is similar to that of the linear code, except that there is no option now to consider symmetric magnetic configurations, since they are destroyed by the nonlinear evolution. The main problem here is the limited number of Fourier modes that can be taken into account as a result of the CPU time consuming convolutions. Consequently, only quasi-linear and weakly nonlinear cases can be simulated. In order to take the linear and nonlinear mode coupling described above fully into account, the convolution sums of the nonlinear terms will have to be computed through a pseudo-spectral or collocation method with the use of fast Fourier transforms. Since there are many more modes involved, due to the nonlinear effects in combination with linear mode coupling, memory requirements and CPU time requirements soon become prohibitive. Hence, even more so than for linear MHD spectroscopy, the new physics in the nonlinear plasma dynamics regime requires parallel computing.

1.3 Numerical analysis, parallel algorithms

As discussed in Sec. 1.2, resistive magnetohydrodynamic (MHD) problems, arising in laboratory plasmas (tokamaks) and astrophysical plasmas, lead to large linearised eigenproblems and to large nonlinear systems that have to be solved numerically.
Both types of problems have been solved with standard numerical techniques, but for relatively small model problems only. From these experiments it is well understood that physically realistic computational models soon lead to unacceptable CPU-time and computer memory requirements if the standard numerical techniques were still to be exploited.

The state-of-the-art numerical techniques for the solution of very large systems (eigenproblems as well as nonlinear evolutionary problems) rely on the availability of suitable iterative solvers (for linearised subproblems). Not only can eigenvalue problems be solved effectively by inverse iteration techniques (for which solvers need to be available); Newton's method for nonlinear systems, and stable implicit methods for evolutionary problems, also lead to large sparse linear systems that have to be solved, and can only be solved, by iterative techniques (including multigrid), and in order to obtain solutions in reasonable time this has to be done in an MPP environment. The currently used predictor-corrector scheme will be reconsidered in view of stability and parallelism requirements, and be replaced if desirable.

A number of such methods are available (some of these were actually designed by our group), such as nested GMRES [56] and BiCGSTAB [57] for linearised problems, and Arnoldi subspace iteration with Sorensen's implicit shift technique [61], [P2] for eigenproblems. Since these methods are so-called hybrid methods, there is enough freedom to adapt them to the physical problem, either through selecting the appropriate hybrid form, or by constructing suitable approximate solvers (preconditioning techniques), or a combination of both. It is a challenge for the numerical analysts to try to adapt known techniques to classes of physical problems in order to make more realistic modelling feasible. This requires a strong collaboration between all groups involved, since we need all the available experience obtained so far, and we need to understand the special properties of the given systems as dictated by the underlying physical models.

Then the selected iterative schemes have to be adapted and implemented in order to meet MPP requirements. For this we have built up experience with standard iterative techniques such as basic GMRES and CG [44, 49, 51, 55, 59, 62]. In our approach we have succeeded in overlapping communication time with useful computation, so that large speed-ups are achievable. These approaches have to be carried over to the hybrid techniques for the MHD problems. Quite recently we have designed new successful iterative solvers for relevant complex linear systems (as arise in MHD problems) [63].

For the study of Alfvén waves it is, in some models, necessary to determine internal eigenvalues of a given matrix with complex spectrum (in fact this will be a generalised eigenproblem in many cases). For the usual subspace approach this is known to be a very tough problem. By a combination of a new variant of Davidson's method [68] and GMRES it is in principle possible to force convergence of the subspace iteration algorithm towards specified internal eigenvalues. The internal eigenvalues are approximated by so-called harmonic Ritz values, which have been introduced (for symmetric problems) in [60]. These techniques are still in development; we have carried out some preliminary tests for rather small but representative problems and the results have been very promising [P6, P7].
For very large problems these approaches offer several possibilities for parallelism. In the present project, parallelism has been obtained by exploiting the structure of the given problems as well as by reformulation of the chosen iterative schemes [P6]. From an algorithmic point of view there is similarity between the methods for eigenproblems and those for linear systems, and this similarity has been exploited in the design of parallel algorithms for either of these [59]. In the present project, we emphasise the scalability of our parallel algorithms, i.e., we take care that for increasing numbers of processors sufficient parallelism can be extracted and that communication does not dominate computation.

2 The project team

The preliminary results presented in this first annual report have been obtained through an intense cooperation between the FOM-Institute for Plasma Physics, the Astronomical Institute of Utrecht University, the Mathematical Institute of Utrecht University, and the Centre for Mathematics and Computer Science (CWI). The original applicants of the present project were Prof. Dr. J.P. Goedbloed (FOM-Institute for Plasma Physics 'Rijnhuizen'), Prof. Dr. A.G. Hearn (Astronomical Institute, Utrecht University), and Prof. Dr. H. van der Vorst (Mathematics Institute, Utrecht University). The names of the people involved in the project during the first year are listed below in section 2.1. The form of collaboration is described in section 2.2.

2.1 Names and institutes

1. FOM-Institute for Plasma Physics 'Rijnhuizen', P.O. Box 1207, 3430 BE Nieuwegein

• Prof. Dr. Ir. J.P. Goedbloed 25%
• Dr. Ir. P.M. Meijer 100%
• Dr. S. Poedts 25%

2. Mathematical Institute, Utrecht University, P.O. Box 80010, 3508 TA Utrecht

• Prof. Dr. H.A. van der Vorst 25%
• Dr. M.B. van Gijzen 50%
• I. Mugge 10%

3. Astronomical Institute, Utrecht University, P.O. Box 80000, 3508 TA Utrecht

• Prof. Dr. A.G. Hearn 20%
• Dr. G.C. Boynton 10%
• E. van de Zalm 10%

4. Centre for Mathematics and Computer Science, P.O. Box 4079, 1009 AB Amsterdam

• Dr. Ir. H.J.J. te Riele 20%
• Dr. J.G.L. Booten 100%
• Drs. M. Louter-Nool 10%

One post-doc, viz. Dr. Ir. P.M. Meijer, has been hired during the first project year (since 1/9/93). The second post-doc, Dr. G. Toth, joins the project team on September 1, 1994. Two other post-docs, viz. Dr. J.G.L. Booten (since 1/9/93) and Dr. M.B. van Gijzen (since 1/1/94), were funded from other sources in order to secure the mathematical backbone of the collaboration.

2.2 Form of collaboration

The results reported here have been obtained through an intense and active collaboration between the people mentioned above. The full project team met regularly to discuss progress reports, current problems, and the further course of the project. These meetings proved to be extremely interesting and instructive due to the different backgrounds (plasma physics, astrophysics, mathematics) of the project team members, which led to very stimulating discussions in which different aspects of the same problem were considered. These team meetings were held in turn at each of the collaborating institutes. The project team met on the following dates:

1) 13/10/93 (at FOM)
2) 03/11/93 (at FOM)
3) 24/11/93 (at CWI)
4) 22/12/93 (at CWI)
5) 26/01/94 (at FOM)
6) 02/03/94 (at MI)
7) 26/04/94 (at AI)
8) 14/06/94 (at FOM)
9) 16/08/94 (at FOM)

Apart from these meetings, which were attended by the full project team, team members collaborated intensively and regularly to work out suggestions and plans discussed at the progress meetings.

3 Aims and planning of the first year

3.1 Starting date and length

The present pilot project was initiated for two years. It started on September 1, 1993. The present report covers the first year of the project. The intended results for this first year of the project are listed below in section 3.2. The steps that were scheduled to reach these intended results are given in section 3.3. The schedule itself is given in section 3.4.

3.2 Intended results for the first year

Two main classes of problems were considered for study, viz. linear eigenvalue systems and nonlinear evolution problems. The intended results for the first year of the project are listed below.

For parallel eigenvalue problem and linear system solvers:

• proof of scalability by means of tests on the 'building blocks' of typical linear MHD codes

• development and test of a fast parallel solver for linear systems with tridiagonal block coefficient matrices as obtained in typical MHD problems

• development and test of a parallel version of the Arnoldi algorithm applied to a typical MHD eigenvalue problem

• development and test of alternative methods for solving large non-Hermitian eigenvalue problems such as those dealt with in MHD

• formulation of target problems for the second year of the project

For parallel explicit and (semi-)implicit schemes:

• proof of scalability by means of tests on the 'building blocks' of typical nonlinear MHD codes (e.g. HERA)

• when tests yield the expected positive results: porting of the nonlinear semi-implicit time evolution code HERA to an MPP system

• formulation of target problems for the second year of the project

3.3 Steps

In order to reach the above-mentioned aims the following steps were planned. The progress made on each step is reported in the corresponding part of Section 4.

1) Selection of MPP systems:

• check of possibilities (available MPP systems in The Netherlands)
• investigation of desirable system architecture (distributed memory or (virtual) shared memory machine)
• selection of MPP system(s)
• application for computer budgets on selected systems
• getting started (self-study + courses)

2) Test of scalability of linear MHD computations:

• target problems: "building blocks" of linear MHD codes

3) Parallel solver for linear tridiagonal block matrix systems

4) Parallel version of the Arnoldi method:

• development of parallel version of the Arnoldi method

• testing of parallel version of the Arnoldi method

5) Parallel version of the Davidson-Jacobi method:

• development of parallel version of the Davidson-Jacobi method

• testing of parallel versions of the Davidson-Jacobi method

6) Test of scalability of nonlinear MHD computations:

• target problems: 'building blocks' of nonlinear MHD codes

7) Porting of the nonlinear MHD code HERA to the CM-5:

• application for substantial computer budget

• porting of HERA to CM-Fortran

8) GMRES-like methods on parallel computers

9) Publications

10) Future directions

3.4 Schedule

The steps mentioned in the previous subsection were scheduled as follows:

[Schedule chart: steps 1-10 plotted against months 1-12 of the first project year; not reproduced here.]

4 Results of the first year

4.1 Selection of MPP systems

In recent years, a wide variety of MPP systems has appeared on the market. Available in The Netherlands, and hence to us, are e.g. Parsytec, IBM SP1, CM-5, and Meiko systems. Each of these MPP system vendors has its own strategy for parallel processing, and this strategy is reflected in both the hardware architecture and the compilers available on the machines. For the users this has the unfortunate consequence that each machine is different and needs to be programmed in a different manner. Attempts to set a standard, such as PVM, are presently unsatisfactory because of the extremely low performance obtained as a result of not tuning the programme to the machine. Optimal performance requires the use of the machine-specific software (e.g. Fortran versions) supplied by the vendors.

However, the wide variety of system architectures can be divided into two main classes, viz. distributed memory machines and (virtual) shared memory machines. In order to find out the consequences of both strategies for MHD calculations, it was decided to select two machines, one in each class, and to focus the efforts during the first year on these two machines. Depending on the experiences and the results obtained on these two systems, it was agreed to consider the use of other MPP systems in the second year of the pilot project. On the basis of the following selection criteria:

• ease of access
• available experience / external support
• possible access to larger machines (with more nodes)

the 512-node Parsytec GCel (a MIMD distributed memory system) and the 16-node CM-5 (a virtual shared memory system) were chosen. The necessary pilot project computer budgets were acquired. These relatively small budgets were used to learn how to work with the operating system on both machines (PARIX on the Parsytec and CMost on the CM-5) and to perform the tests discussed below in sections 4.2, 4.3, 4.4, and 4.6. Two people, viz. P.M. Meijer and S. Poedts, followed the course "Introduction to the CM-5" at the computer centre in Groningen (RUG).

4.2 Test of scalability of linear MHD computations

The study of the interaction of magnetic fields and a fully ionised plasma is accurately described by the equations of magnetohydrodynamics. The linearisation of these equations leads to spectral MHD theory, which requires the solution of large-scale eigenvalue problems. The combination of a finite element discretisation in the direction normal to the magnetic flux surfaces and a Fourier decomposition in the other two spatial directions, which span the magnetic flux surfaces, yields a generalised eigenvalue problem of the form:

A x = \lambda B x,   (6)

with \lambda the complex eigenvalue, A a non-Hermitian matrix, and B a Hermitian, positive-definite matrix. The finite element discretisation results in a block tri-diagonal structure of the matrices A and B. The block size depends linearly on the number of Fourier modes N_F and is given by 16 N_F x 16 N_F. The total number of blocks scales linearly with the number of elements (grid points) in the radial direction.

The eigenvalues of Eq. (6) correspond to the three types of modes that are present in linear MHD, viz. the fast and slow magnetosonic waves and the Alfvén waves. The latter are physically the most interesting ones, since they determine the stability of a magnetically confined plasma. The Alfvén eigenfrequencies \lambda are not the extreme eigenvalues of Eq. (6), but are embedded between the slow and fast magnetosonic eigenfrequencies. These interior Alfvén eigenvalues can be determined by using a shift-and-invert strategy. By using a shift \sigma and an invert strategy in Eq. (6), a new eigenvalue problem

\hat{A}\, x = \mu\, x,   (7)

is obtained, with \hat{A} = (A - \sigma B)^{-1} B and \mu = 1/(\lambda - \sigma). Solving Eq. (7) with standard methods gives the extreme eigenvalue of Eq. (7), which corresponds to the (interior) eigenvalue of Eq. (6) closest to the shift. These standard methods are often based on a Krylov subspace iteration, in which the original matrix \hat{A} is projected onto a small orthonormal subspace. The latter is created by multiplying a sequence of vectors with the matrix \hat{A}. The eigenvalues of the small projected matrix are an approximation of the eigenvalues of \hat{A}.

The standard shift-and-invert methods usually require the LU-decomposition of the matrix (A - \sigma B). This LU-decomposition cannot be performed efficiently on a parallel system and is one of the bottlenecks for an efficient parallel implementation of an eigenvalue solver for interior eigenvalues. As we will demonstrate in sections 4.4 and 4.5, the LU-decomposition of (A - \sigma B) can be avoided and, therefore, we will here concentrate on the two other kernels of the eigenvalue solvers, viz. the inner products and matrix-vector multiplications, which are both needed to create the Krylov subspace. The inner products are a severe bottleneck on a parallel system, because they need a high amount of communication: one processor accumulates the results of the local inner products, combines them into the global inner product, and broadcasts this result to the other processors. In the meantime, all the other processors are idle and have to wait until they receive the global inner product. Sometimes it is possible to reduce the idle time by overlapping computation and communication. This is done e.g. on the CM-5, where the control network is used to compute the inner products, while the data network and computational power are still at one's disposal. We tested the kernels of the eigenvalue solvers on both the CM-5 and the Parsytec.
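As an illustration of the matrix-vector kernel, the following serial sketch performs the block tri-diagonal product with one set of lower/diagonal/upper blocks per radial grid point. The block size and number of grid points are placeholders, and the actual tests used the CMSSL routines (CM-5) and hand-coded message passing (Parsytec) rather than this sketch; in a distributed implementation only the references to x(:,j-1) and x(:,j+1) would require neighbour-to-neighbour communication.

```fortran
! Serial sketch of the block tri-diagonal matrix-vector product y = A x.
! Sizes are illustrative; L, D, U hold the lower, diagonal and upper
! blocks belonging to each radial grid point.
program block_tridiag_matvec
  implicit none
  integer, parameter :: nf = 5           ! number of Fourier modes (example)
  integer, parameter :: nb = 16*nf       ! block size 16*N_F
  integer, parameter :: ng = 32          ! number of radial grid points (example)
  complex(kind(1.0d0)) :: L(nb,nb,ng), D(nb,nb,ng), U(nb,nb,ng)
  complex(kind(1.0d0)) :: x(nb,ng), y(nb,ng)
  real(kind(1.0d0))    :: rblk(nb,nb,ng), rvec(nb,ng)
  integer :: j

  call random_number(rblk); L = rblk     ! random test data
  call random_number(rblk); D = rblk
  call random_number(rblk); U = rblk
  call random_number(rvec); x = rvec

  do j = 1, ng                           ! one block row per grid point
     y(:,j) = matmul(D(:,:,j), x(:,j))
     if (j > 1)  y(:,j) = y(:,j) + matmul(L(:,:,j), x(:,j-1))   ! needs x of left neighbour
     if (j < ng) y(:,j) = y(:,j) + matmul(U(:,:,j), x(:,j+1))   ! needs x of right neighbour
  end do
  print *, 'norm of y =', sqrt(sum(abs(y)**2))
end program block_tridiag_matvec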

4.2.1 Test results on the CM-5

The timings of the inner products on the CM-5 are tabulated in table 1. The results were obtained by using the subroutine 'glb_inner_product_cl_noadd' of the Connection Machine Scientific Subroutine Library (CMSSL). The first column gives the number of processors, the second column the vector length, the third column the total time for one inner product and, finally, the fourth column the performance. The tests were performed on different CM-5 machines at different locations, running different versions of the operating system and software. This explains the small variations in the timings. We tested the scaled speed-up behaviour, i.e. the vector length, being 2560 x Nprocs, increases linearly with the number of processors, Nprocs. The scaled speed-up behaviour demonstrates the influence of the communication: the amount of computational work per processor is kept constant, while the communication time increases with the number of processors. Table 1 and figure 1 clearly illustrate the linear scaling up to Nprocs < 64 and the degradation caused by the communication for Nprocs > 128. Note, however, the high overall performance of 36 Mflops per node (i.e. 31% of peak performance) and the low CPU time of only 0.57 ms per inner product.

Table 1: Scaling of inner products and matrix-vector multiplications

                inner product          matrix-vector multiplication
                                       block-sparse         grid-sparse
 Nprocs   vector    time   Mflop/s   Nblocks   time   Mflop/s    time   Mflop/s
          length    (ms)                        (ms)              (ms)
    1       2560     .57       36       32      88.9       54    105.8       49
    4      10240     .44      186      128      85.5      229    104.3      188
   16      40960     .63      524      512      89.1      882    106.3      739
   32      81920     .55     1192     1024      89.0     1767    106.0     1484
   64     163840     .56     2354     2048      85.7     3672    102.9     3057
  128     327680     .59     4420     4096      85.7     7343    102.8     6120
  256     655360     .68     7686     8192      85.7    14688    101.7    12371

We remark that the tests were done on different machines, running different versions of the operating system and different versions of the software, some of which have not been released yet. This explains part of the 'noise' in the timings.

The second test concerned the matrix-vector multiplication for the block tri-diagonal MHD matrices. Two different CMSSL subroutines were tested on the CM-5, viz. the 'block_sparse_matrix_vector_mult' routine and the 'grid_sparse_matrix_vector_mult' routine, for matrices with a block size of 16 N_F x 16 N_F (N_F = 5). The total number of blocks (32 x Nprocs, actually being the number of grid points) on the diagonal of the tri-diagonal system scaled again linearly with the number of processors. The results are given in columns 5-9 of table 1 and in figure 1. The MHD matrix-vector multiplication involves a large number of arithmetic operations and only needs the fast neighbour-to-neighbour communication pattern. As the ratio between communication and computation is small, an almost linear scaling with the number of processors and a very high computation speed (54 Mflops per node, i.e. 42% of peak performance) are obtained. The general routines were, in contrast with our expectation, a little faster than the grid routines. It is expected that the performance of the grid routines will exceed the performance of the general routines in the next software release. We remark that the set-up time for the general routine was rather long, viz. 400 ms versus 0.017 ms for the grid routine. However, since we have to perform a large number of matrix-vector multiplications and since the set-up has to be done only once, the general routine is, at this moment, preferred over the grid routine.

4.2.2 Test results on the Parsytec

The above two tests were also performed on the Parsytec GCel located at the Academic Computer Centre Amsterdam (SARA). The Parsytec is a distributed memory machine with 512 nodes and is programmed by using message passing. We used a two-dimensional grid topology for the calculation of the inner products. Using this topology, the inner products take at most Nprocs + 1 communication steps for a square (Nprocs x Nprocs) processor grid. The MHD matrix-vector multiplication only requires neighbour-to-neighbour communication and can be efficiently implemented by using a virtual pipeline topology that is embedded in the two-dimensional grid topology. Using this virtual pipeline, only four local communication steps are needed to compute the matrix-vector multiplication. This number is independent of the number of processors.

Figure 1: Scaling of the inner products (circles) and matrix-vector multiplications with the block_sparse routine (crosses) and with the grid_sparse routine (filled circles) on the CM-5; for reference, the performance of a single Cray Y-MP node is indicated.

Table 2 and Figs. 2 and 3 show the results for the inner products and matrix-vector multiplication as a function of the number of Fourier modes N_F and the number of grid points N_G. We obtain an almost linear speed-up with the number of processors for the matrix-vector multiplication for a fixed problem size, see figure 2. This holds also for the scaled speed-up, where we have to keep in mind that the dimensions of the matrices scale linearly with the number of grid points N_G and quadratically with the number of Fourier modes N_F. For the largest problem size we obtain approximately 38% of the peak performance of the Parsytec. The scaling of the inner products is worse, because of the rather high amount of communication involved. The inner-product timings in table 2 and figure 3 clearly show that the execution times eventually increase, instead of decrease, with the number of processors. This would certainly hamper the computation if the computational effort were dominated by the inner products. For the MHD problem we have to compute typically 10 inner products for one matrix-vector multiplication. From table 2 it is obvious that the inner products will take only a small fraction of the total computation time and that the final scaling is hardly hampered by the inner products. If for certain purposes the inner products become dominant, then we may try to combine the calculation of several inner products. This will certainly improve the scaling properties, but is numerically somewhat less stable. Note the difference in execution times between the Parsytec and the CM-5, caused by the different peak performances of the CM-5 (128 Mflops per node) and the Parsytec (0.8 Mflops per node).

We conclude that the linearised MHD problem reduces to an eigenvalue problem that has to be tackled by using parallel computers.

Figure 2: Scaling of the matrix-vector products on the Parsytec for N_F = 5 and N_G = 512, i.e. for a fixed problem size. The numbers are given in table 2.

Figure 3: Scaling of the inner products on the Parsytec for N_F = 5 and N_G = 512, i.e. for a fixed problem size. The numbers are given in table 2.

Table 2: Timing results for the matrix-vector product (MV) and the inner product (IP) for different MHD problems on a Nprocs_1 x Nprocs_2 processor grid.

                                        MV                    IP
 N_F   N_G   Nprocs_1 x Nprocs_2   time[ms]  Mflop/s    time[ms]  Mflop/s
  5    256        8 x  4             4111       9.5       7.46      22.0
                  8 x  8             2061      19.0       5.39      30.4
                 16 x  8             1036      37.9       5.42      30.2
                 16 x 16              522      75.1       6.15      26.6
  5    512        8 x  8             4111      19.1       8.16      40.1
                 16 x  8             2061      38.1       6.81      48.1
                 16 x 16             1036      75.8       6.84      47.9
                 32 x 16              519     151.3       9.00      36.4
 10    512       16 x 16             4077      77.1       8.23      79.6
                 32 x 16             2048     153.4       9.69      67.6

We tested the parallel performance of the kernels (inner products and matrix-vector multiplications) of the eigenvalue solvers on two different machines (CM-5 and Parsytec), using the matrices arising from the linearised MHD equations. Our tests indicate that both machines are well suited for an efficient implementation of the building blocks (inner products and matrix-vector multiplications) of the eigenvalue solvers.
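For completeness, the sketch below shows the global reduction that makes the inner product the communication-sensitive kernel. The actual implementations used CMSSL on the CM-5 and PARIX message passing on the Parsytec; MPI is used here purely as a generic stand-in, and the local vector length is the one from Table 1.

```fortran
! Sketch of a distributed inner product: the local partial sums require no
! communication, but the final result needs a global reduction on which
! every processor must wait.  (MPI is a generic stand-in here; the runs
! reported above used CMSSL and PARIX instead.)
program distributed_inner_product
  use mpi
  implicit none
  integer, parameter :: nloc = 2560              ! local vector length, as in Table 1
  complex(kind(1.0d0)) :: x(nloc), y(nloc), s_loc, s_glob
  real(kind(1.0d0))    :: r(nloc)
  integer :: ierr, rank, nprocs

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)

  call random_number(r); x = r                   ! local parts of the two vectors
  call random_number(r); y = r

  s_loc = dot_product(x, y)                      ! purely local computation
  call MPI_Allreduce(s_loc, s_glob, 1, MPI_DOUBLE_COMPLEX, MPI_SUM, &
                     MPI_COMM_WORLD, ierr)       ! global communication step

  if (rank == 0) print *, 'global inner product on', nprocs, 'processors:', s_glob
  call MPI_Finalize(ierr)
end program distributed_inner_product
```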

4.3 Parallel solver for linear tridiagonal block matrix systems

For MHD spectroscopy in tokamaks one needs to solve the linearised MHD equations for a plasma in which the perturbations are excited by an external driver (antenna). For this situation, the discretisation of the linearised MHD equations yields the following system:

(A - i \omega_d B)\, x = f,   (8)

where \omega_d is the frequency of the external driver, f is the force term, and A, B are block tri-diagonal matrices with a block size of 16 N_F x 16 N_F and N_G diagonal blocks, where N_F denotes the number of Fourier harmonics and N_G is the number of grid points. System (8) can be solved on a parallel machine. Parallel performance is obtained by exploiting the block structure. The data layout is chosen such that for each grid point the lower-diagonal, the diagonal, and the upper-diagonal block are kept local to a processor, while the 'grid points' are scattered across the machine. The CMSSL library on the CM-5 provides routines that solve a block tri-diagonal system efficiently. We tested these routines for a modest model problem and obtained, on the 16-node CM-5 in Groningen, performances similar to the Cray Y-MP. The parallelisation on the CM-5 is still in the testing phase. Because of memory allocation problems we could not yet investigate the scaling properties of the CMSSL routines. These memory allocation problems will be solved in the next software release, which offers more flexibility to control the data allocation.
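To indicate what such a solve involves, the sketch below applies the classical block Thomas (block LU) recursion to a system with the structure of Eq. (8). It is serial, uses LAPACK's ZGESV for the dense block solves, and is not the algorithm implemented in the CMSSL routines mentioned above; all sizes and the random test data are illustrative.

```fortran
! Serial sketch of the block Thomas (block LU) recursion for a block
! tri-diagonal system with the structure of Eq. (8).  Illustration only;
! not the CMSSL algorithm used in the actual tests.
program block_thomas_sketch
  implicit none
  integer, parameter :: nf = 2, nb = 16*nf, ng = 8
  complex(kind(1.0d0)) :: L(nb,nb,ng), D(nb,nb,ng), U(nb,nb,ng), Dt(nb,nb,ng)
  complex(kind(1.0d0)) :: f(nb,ng), g(nb,ng), x(nb,ng)
  complex(kind(1.0d0)) :: work(nb,nb), rhs(nb,nb+1)
  real(kind(1.0d0))    :: r3(nb,nb,ng), r2(nb,ng)
  integer :: j, k, info, ipiv(nb)

  call random_number(r3); L = r3                     ! random test blocks
  call random_number(r3); U = r3
  call random_number(r3); D = r3
  do j = 1, ng
     do k = 1, nb
        D(k,k,j) = D(k,k,j) + 4*nb                   ! keep the blocks well conditioned
     end do
  end do
  call random_number(r2); f = r2                     ! right-hand side

  ! forward sweep:  Dt(j) = D(j) - L(j) Dt(j-1)^{-1} U(j-1)
  !                 g(j)  = f(j) - L(j) Dt(j-1)^{-1} g(j-1)
  Dt(:,:,1) = D(:,:,1);  g(:,1) = f(:,1)
  do j = 2, ng
     work = Dt(:,:,j-1)
     rhs(:,1:nb) = U(:,:,j-1);  rhs(:,nb+1) = g(:,j-1)
     call zgesv(nb, nb+1, work, nb, ipiv, rhs, nb, info)
     Dt(:,:,j) = D(:,:,j) - matmul(L(:,:,j), rhs(:,1:nb))
     g(:,j)    = f(:,j)   - matmul(L(:,:,j), rhs(:,nb+1))
  end do

  ! back substitution:  x(j) = Dt(j)^{-1} ( g(j) - U(j) x(j+1) )
  do j = ng, 1, -1
     work = Dt(:,:,j)
     rhs(:,1) = g(:,j)
     if (j < ng) rhs(:,1) = rhs(:,1) - matmul(U(:,:,j), x(:,j+1))
     call zgesv(nb, 1, work, nb, ipiv, rhs(:,1), nb, info)
     x(:,j) = rhs(:,1)
  end do
  print *, 'solved a system with', ng, 'diagonal blocks of size', nb
end program block_thomas_sketch
```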

Figure 4: Alfvén spectrum of the MHD test problem (n = 416); the horizontal axis is Re \lambda, the vertical axis Im \lambda.

4.4 Parallel version of the Arnoldi method

Since we plan to solve very large generalised MHD eigenvalue problems, we searched for algorithms that avoid the factorisation (inversion) of matrices. Furthermore, as stated before in section 4.2, a factorisation of a matrix and the replacement of a matrix-vector product by the solution of triangular systems make algorithms inappropriate for efficient parallel implementations. Therefore, methods that involve shift-and-invert strategies become unsuitable for our purposes. The first procedure we tested for a small MHD problem was an Arnoldi method for internal eigenvalues, described in detail in ref. [P6]. The basic idea of this method is to project the matrix A^{-1} onto the subspace AV instead of V, as is done in the usual Arnoldi algorithm for computing the largest eigenvalues of a non-Hermitian matrix. It is easy to show that no inversion of A is required in order to compute the projected matrix. This method would be very suitable for parallel machines, since the kernels of the algorithm consist of matrix-vector products, inner products, and vector updates, which, for MHD eigenvalue problems, all show good scaling properties on parallel architectures, as outlined in section 4.2.

In the small test problem we considered, the matrices were of order 416 (the order of the blocks was 16). The Alfvén spectrum is shown in figure 4 (the square in the figure indicates the shift we used in the algorithm). However, convergence turned out to be very slow for the Arnoldi method: 460 outer iterations were necessary to achieve reasonable approximations of eigenvalues close to the shift. Moreover, the dimension of the Krylov subspace, built up each iteration, had to be set at 125, i.e., 57500 inner iteration steps (matrix-vector multiplications) were required. The convergence history is displayed in figure 5 for the norm of the residual vector r of a certain eigenvalue and corresponding eigenvector. Due to the very slow convergence we must conclude that this Arnoldi variant is not very suitable for solving large generalised MHD eigenvalue problems.

Figure 5: Convergence history of the MHD test problem as obtained with the Arnoldi method.

Figure 6: Convergence history of the MHD test problem as obtained with the Jacobi-Davidson method.
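For reference, the kernels just mentioned appear as follows in a bare-bones, standard, serial Arnoldi iteration. This is not the internal-eigenvalue variant of [P6] and not the code that was actually tested; the random test matrix, subspace size, and start vector are placeholders.

```fortran
! Bare-bones standard Arnoldi iteration (illustration only; the tested
! method of [P6] projects A^{-1} onto the subspace AV instead of V).
! The kernels are exactly the ones discussed in section 4.2:
! matrix-vector products, inner products and vector updates.
program arnoldi_sketch
  implicit none
  integer, parameter :: n = 416, m = 30        ! matrix order and subspace size (example)
  complex(kind(1.0d0)) :: A(n,n), V(n,m+1), H(m+1,m), w(n)
  real(kind(1.0d0))    :: rnd(n,n)
  integer :: j, k, mdone

  call random_number(rnd)                      ! arbitrary dense test matrix
  A = rnd
  V = (0.0d0, 0.0d0);  H = (0.0d0, 0.0d0)
  V(:,1) = (1.0d0, 0.0d0) / sqrt(dble(n))      ! normalised start vector

  mdone = m
  do j = 1, m
     w = matmul(A, V(:,j))                     ! matrix-vector product
     do k = 1, j                               ! modified Gram-Schmidt
        H(k,j) = dot_product(V(:,k), w)        ! inner product
        w = w - H(k,j)*V(:,k)                  ! vector update
     end do
     H(j+1,j) = sqrt(dble(dot_product(w, w)))
     if (abs(H(j+1,j)) < 1.0d-13) then         ! invariant subspace found
        mdone = j
        exit
     end if
     V(:,j+1) = w / H(j+1,j)
  end do
  ! The eigenvalues of the leading mdone x mdone part of H (the Ritz values)
  ! approximate eigenvalues of A; they could be computed with LAPACK's ZGEEV.
  print *, 'built a Krylov basis of dimension', mdone
end program arnoldi_sketch
```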

4.5 Parallel version of the Davidson-Jacobi method

Next we investigated a variant of the recently proposed Jacobi-Davidson method [68]. The basic idea for solving the generalised problem A x = \lambda B x is to project the equivalent standard problem for B^{-1} A onto a subspace B V. This new Jacobi-Davidson algorithm [P7] for computing eigensolutions of matrix pencils has several advantages over other methods. The algorithm is applicable to general matrix pairs and the inversion of a matrix is avoided. Especially the latter property makes this method suitable for solving very large problems. Instead of inverting matrices, we have to solve two linear systems approximately during each outer iteration, for instance by performing a few iteration steps with a suitable linear system solver. Similarly to the Arnoldi method mentioned in the previous section, the time-consuming kernels of the method are matrix-vector products, inner products, and vector updates. Therefore the method is suitable for an efficient parallel implementation. However, the two linear systems have to be solved with sufficient accuracy to obtain convergence of the eigenvalues, and this will require good preconditioners in many applications. In order to preserve the suitability of the method for efficient parallel implementations, we have to avoid preconditioners of the incomplete decomposition type.

We applied the Jacobi-Davidson algorithm to the same MHD test problem as in the previous section. The two linear systems were solved to some accuracy with the well-known GMRES algorithm. It turned out that without preconditioning the systems were not solved sufficiently well to reach convergence to the eigenvalue closest to the shift (the square in fig. 4).

Next we tried block Jacobi preconditioning, which is very suitable for parallel distributed memory machines since no communication is involved. It turned out that convergence to the desired eigenvalue was achieved in 26 outer iterations, provided we solved the linear systems with 70 and 125 GMRES iteration steps, respectively, which means that 1750 inner iteration steps (MV-products) were required for linear system 1 and 3125 inner iteration steps (MV-products) for linear system 2. Note the large improvement over the Arnoldi method. The convergence history is plotted in figure 6 (the vector d = B r in this case). It is obvious that we need more sophisticated preconditioners than the one tried here in order to obtain faster convergence for MHD eigenvalue problems with the Jacobi-Davidson method. This will be the subject of future research; in particular, we will look for preconditioning techniques which preserve the suitability of the method for efficient parallel implementation. One might think of multigrid-like ideas of using small target problems as preconditioners for larger problems.
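The attraction of block Jacobi preconditioning in this setting is easy to see in code: applying the preconditioner means solving one small dense system per diagonal block, and since every block is local to a processor no communication is needed. The sketch below is serial and illustrative only (LAPACK's ZGESV per block; in practice one would factor each block once with ZGETRF and reuse the factors with ZGETRS); block size and count are placeholders chosen to mimic the n = 416 test problem.

```fortran
! Serial sketch of applying a block Jacobi preconditioner z = M^{-1} r,
! where M is the block-diagonal part of the matrix.  Each block solve is
! independent (and would be local to one processor), so no communication
! is required.  Illustration only; sizes and data are placeholders.
program block_jacobi_precond
  implicit none
  integer, parameter :: nb = 16, nblocks = 26    ! 26 blocks of size 16 -> n = 416
  complex(kind(1.0d0)) :: Dblk(nb,nb,nblocks)    ! diagonal blocks of the matrix
  complex(kind(1.0d0)) :: z(nb,nblocks), work(nb,nb)
  real(kind(1.0d0))    :: rnd3(nb,nb,nblocks), rnd2(nb,nblocks)
  integer :: j, k, info, ipiv(nb)

  call random_number(rnd3); Dblk = rnd3
  do j = 1, nblocks                              ! make the blocks safely invertible
     do k = 1, nb
        Dblk(k,k,j) = Dblk(k,k,j) + 2*nb
     end do
  end do
  call random_number(rnd2); z = rnd2             ! residual r, overwritten by z = M^{-1} r

  do j = 1, nblocks                              ! one independent dense solve per block
     work = Dblk(:,:,j)
     call zgesv(nb, 1, work, nb, ipiv, z(:,j), nb, info)
  end do
  print *, 'norm of preconditioned residual:', sqrt(sum(abs(z)**2))
end program block_jacobi_precond
```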

4.6 Test of scalability of nonlinear MHD computations

For the interpretation of actual plasma dynamics in tokamaks and solar coronal loops it is crucial to understand the nonlinear phase of the evolution of MHD phenomena in these configurations. Nonlinear MHD adds two more complications: nonlinear mode coupling involving hundreds of small-scale modes, and the discrepancy between the dynamical Alfvén time scale and the diffusion time scale, which becomes enormous in the highly conducting plasmas of interest. The nonlinear mode coupling makes implicit time stepping methods extremely complicated and CPU time consuming.

We have developed a numerical code for simulating the 3D temporal evolution of an externally driven cylindrical plasma in nonlinear MHD (Section 4.7). The radial direction is discretised with finite differences on two staggered meshes and a pseudo-spectral discretisation is used for the other two spatial directions. A semi-implicit predictor-corrector scheme is applied for the time stepping. In this scheme the fast magnetosonic modes, and only these, are treated implicitly in the linear start-up phase. This allows time steps up to a factor of 1000 larger than those of explicit methods, where the time step is limited by the CFL condition. Nevertheless, 3D nonlinear time evolution simulations remain extremely computation intensive, and computer memory and CPU time requirements soon become prohibitive when large magnetic Reynolds numbers are considered. Therefore, the possibility of porting this 3D nonlinear MHD code to a CM-5 was investigated.
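The gain from a semi-implicit treatment can be illustrated on a scalar model equation du/dt = i omega_f u + N(u), in which the stiff linear ("fast") term is handled implicitly and the nonlinear term explicitly. The scheme, frequency, and nonlinearity below are generic placeholders and do not reproduce HERA's actual predictor-corrector discretisation.

```fortran
! Cartoon of a semi-implicit predictor-corrector step on a scalar model
! problem du/dt = i*omega_f*u + N(u): the stiff linear term is implicit,
! the nonlinear term explicit, so dt is not limited by ~1/omega_f as it
! would be for a fully explicit scheme.  Generic illustration only.
program semi_implicit_model
  implicit none
  complex(kind(1.0d0)), parameter :: ii = (0.0d0, 1.0d0)
  real(kind(1.0d0)), parameter :: omega_f = 1.0d3   ! stiff "fast" frequency
  real(kind(1.0d0)), parameter :: dt = 1.0d-2       ! >> explicit limit ~ 1/omega_f
  complex(kind(1.0d0)) :: Lop, u, up
  integer :: n

  Lop = ii*omega_f
  u = (1.0d0, 0.0d0)
  do n = 1, 1000
     ! predictor: backward Euler in the linear term, forward Euler in N
     up = (u + dt*Nterm(u)) / (1.0d0 - dt*Lop)
     ! corrector: Crank-Nicolson in the linear term, Heun in N
     u = (u + 0.5d0*dt*(Nterm(u) + Nterm(up)) + 0.5d0*dt*Lop*u) / (1.0d0 - 0.5d0*dt*Lop)
  end do
  print *, '|u| after 1000 semi-implicit steps:', abs(u)

contains
  complex(kind(1.0d0)) function Nterm(v)            ! a weak, slow nonlinearity
    complex(kind(1.0d0)), intent(in) :: v
    Nterm = -0.1d0*abs(v)**2*v
  end function Nterm
end program semi_implicit_model
```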

19 Table 3: Scaling of tri-diagonal system solver and double complex FFTs

                tri-diagonal system              double complex FFTs
 Nprocs    Nsystems   time (ms)   Mflop/s    Nfft    time (s)   Mflop/s
    1          32        61.3        23        32      .267        59
    4         128        55.1        99       128      .261       241
   16         512        57.5       378       512      .258       977
   32        1024        54.8       797      1024      .261      1928
   64        2048        53.2      1640      2048      .252      3988
  128        4096        53.1      3288      4096      .253      7965
  256        8192        53.3      6557      8192      .253     15911

As a first step, the performance and scalability of two sub-problems arising in the 3D nonlinear MHD code were studied on the CM-5. The first sub-problem involves the solution of a tri-diagonal system. In the semi-implicit MHD code, such a system has to be solved for each Fourier mode. Again, the test problem was scaled with the machine size: 32 x Nprocs systems were solved. The vector length was fixed at 512. The CMSSL routines gen_banded_factor and gen_banded_solve were used, but the timing was done for the solution phase only. The tri-diagonal systems are solved locally (32 systems on each node) and, hence, a linear scaling is obtained, as shown in table 3 and figure 7. The performance is 25 Mflop/s per node. Note that, although the performance is much better for the FFTs than for the tri-diagonal system solver, the execution time for the latter is shorter.

The second sub-problem studied involves double FFTs. In the pseudo-spectral or collocation method used in the MHD code, such FFTs are used to ensure periodicity in two spatial directions (e.g. the short and the long way around the tokamak). In principle, one computes in Fourier space, and the nonlinear terms are computed via FFTs to real space, a multiplication in real space, and an FFT back to Fourier space. This procedure turns out to be faster than a direct convolution sum in Fourier space when more than 50 Fourier modes are involved. In the test, 64 x 64 complex FFTs were considered and the problem size was again scaled with the number of processors: each processor performed 32 forward and backward FFTs. The CMSSL routine 'fft_detailed' yielded a very high performance: 59 Mflop/s per node. And as the FFTs were done locally, the speed-up scales linearly with the number of processors, as shown in table 3 and figure 7.

4.7 Porting of the nonlinear MHD code HERA to the CM-5

The full MHD equations are nonlinear partial differential equations (PDEs). It is not always necessary to solve these nonlinear PDEs. For instance, the physics of small perturbations can be modelled by using the linearised MHD equations, which leads to large eigenvalue problems; see section 4.2. On the other hand, some applications do require the solution of the full nonlinear MHD equations. A good example is resonant absorption, in which the amplitude of some modes becomes so large that nonlinear effects are visible.

Figure 7: Performance of the FFTs (crosses) and the tridiagonal system solver (filled circles) on the CM-5.

At the FOM-Institute for Plasma Physics, a three-dimensional nonlinear MHD code (HERA) has been developed that simulates the evolution of an externally driven three-dimensional cylindrical plasma. The code solves the nonlinear MHD equations by using a semi-implicit predictor-corrector scheme for the time stepping. This scheme allows time stepping on the physically interesting Alfvén time scale, whereas a fully explicit code would be limited by the time scale of the fast magnetosonic waves, which is up to 3 orders of magnitude smaller than the Alfvén time scale.

In HERA, the spatial directions are discretised by using finite differences on two staggered meshes for the radial direction of the cylinder and a pseudo-spectral discretisation for the other two spatial directions. The latter ensures periodicity, while the finite difference scheme is chosen to resolve the large gradients near the resonant layer. The nonlinear terms can be calculated in Fourier space by means of convolution sums or in real space by using Fast Fourier Transforms (FFTs). In the remainder we will focus on the FFTs, as this scheme is superior to the convolutions when more than 50 Fourier modes are taken into account. The Fast Fourier Transforms are used to transform the quantities from Fourier space to real space, where the nonlinear terms are computed. The result is transformed back to Fourier space, in which the physical quantities are updated. During the update one still has to solve a tridiagonal system for each Fourier mode. These systems arise from the semi-implicit treatment of the time-stepping procedure.

The solution of the nonlinear MHD equations requires a tremendous amount of computing time. Because of the nonlinear mode coupling, hundreds of small-scale modes have to be taken into account and, depending on the magnetic Reynolds number, more than five hundred grid points are needed in order to resolve the resonant layer. It is obvious that these three-dimensional simulations are not only extremely computation intensive, but also require a lot of computer memory. For instance, the serial version of HERA takes 9.67 s per time step on the Cray C98 for a resolution of only 512 grid points and 16 x 16 Fourier modes in the two other spatial directions. If one takes into account that typically 20,000 time steps have to be performed, it is clear that increasing the resolution is completely out of the question. The only way to reduce the computation time is to perform parallel computations. Again, one has to find out whether the computation will be seriously hampered by the communication.

Timings on the Cray C98 indicate that the Fast Fourier Transforms and the solution of the tridiagonal systems consume more than 98% of the computation time. In order to test the parallel performance and scaling of these parts, scaling tests were performed on the CM-5 (see Section 4.6).

The parallel performance strongly depends on the data layout of the arrays. The CM-5 allows two different data layouts, viz. the 'news' layout, where the data are spread over different processors, and the 'serial' layout, where the data are kept local to a processor. At least one of the array dimensions should be allocated with the 'news' layout. Best performance is achieved if the two other directions have a 'serial' allocation, since operations on serial directions do not involve communication. The performance for 'serial' and 'news' operations can differ by more than a factor of 10 in favour of the serial operations. The three-dimensional MHD code HERA allows two different layouts, viz. i) scattering the grid points across the processors and keeping the Fourier modes belonging to the same grid point local to each processor, and ii) scattering the Fourier modes across the processors and storing the radial data belonging to a specific Fourier mode local to the processor. The first data layout results in local Fast Fourier Transforms and global tri-diagonal systems, while the second layout yields the opposite behaviour. The test results of section 4.6 indicate that the first data layout is to be preferred.

Based on the above-mentioned test results (Section 4.6) we decided to port the MHD production code HERA (6000 lines of FORTRAN code) to the CM-5. Approximately 3 months were needed to do this. During this period, fine-tuning was performed in order to optimise the code for the CM-5 architecture. First tests on the Groningen CM-5 (16 nodes) indicate that the code needs 3.2 seconds to perform one time step for 512 radial grid points and 16 x 16 Fourier modes. This is more than a factor of 3 faster than the Cray C98.

Support from Thinking Machines

Based on the promising test results, Thinking Machines Corporation decided to provide access to the 64-node CM-5 of GMD (Germany). This machine runs beta (i.e. not yet officially released) software on faster hardware. It can operate in partitions of 16, 32, and 64 nodes and hence makes it possible to test the scaling properties of HERA.

Table 4 lists the test results. Note the large difference with the Groningen machine, caused by the newer hardware and software releases. The execution time on a 16-node CM-5 is approximately a factor of 9 faster than on the Cray C98. The scaled speed-up (compare 256 grid points on 16 nodes with 512 grid points on 32 nodes) is almost linear. The speed-up for a fixed problem size is not as good. This is caused by the incomplete filling of the vector pipes when going from a 16-node to a 32- and 64-node configuration. To fill the pipes completely, the number of grid points must exceed 32 x Nprocs. Nevertheless, the obtained speed-ups are impressive and indicate that large MHD simulations can be performed efficiently on the CM-5.
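As an aside, the scaled and fixed-size speed-ups discussed here can be read off directly from Table 4 below; a small illustrative calculation (timings copied from the table):

    # Timings per time step in seconds, copied from Table 4 (CM-5, Think. Mac.).
    t16_512, t32_512, t64_512 = 1.36, 0.83, 0.64   # 512 radial grid points
    t16_256 = 0.78                                 # 256 radial grid points

    # Fixed problem size (512 points): ideal speed-ups would be 2 and 4.
    print(t16_512 / t32_512, t16_512 / t64_512)    # ~1.6 and ~2.1

    # Scaled speed-up: doubling both the grid and the number of nodes
    # (256 points on 16 nodes vs. 512 points on 32 nodes) keeps the time
    # per step nearly constant, i.e. the scaling is almost linear.
    print(t32_512 / t16_256)                       # ~1.06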

Table 4: Scaling results of HERA.

    machine               Nprocs   grid points   time [s]
    CM-5 (Groningen)        16         512         3.19
    CM-5 (Think. Mac.)      16         512         1.36
                            32         512         0.83
                            64         512         0.64
                            16         256         0.78
                            32         256         0.55
                            64         256         0.54
    CRAY C98                 1         512         9.67
                             1         256         4.83

4.8 GMRES-like methods on parallel computers

GMRES [69] is among the most popular and most powerful methods for solving large nonsymmetric linear systems. The computationally most intensive parts of the algorithm are the matrix-vector multiplication, the preconditioning operation, and the construction of an orthogonal basis for the Krylov subspace.

All these operations must be optimised for the given hardware if GMRES is to run efficiently. This is not a trivial task, especially on parallel machines.

We have implemented and built up experience with GMRES and two variants, GMRESR [56] and GCR [70], mainly on two machines: the CRAY Y-MP 4/464 of SARA in Amsterdam and the MEIKO CS1 of the ACCU in Utrecht. The first machine is a shared-memory machine and uses fine-grained (do-loop) parallelism. The MEIKO is a distributed-memory machine and employs coarse-grained parallelism. Hence the architectures are completely different. The test problems we have used stem from finite element discretisations of partial differential equations. This is relevant for the implementation of the matrix-vector product and for the choice of the preconditioner.

The matrix-vector product is performed element-by-element. This can be implemented efficiently on a CRAY by making a multi-colour ordering of the elements and by carefully choosing the data structure. Details are given in [P10]. For an efficient implementation on the MEIKO one should use a domain decomposition; in that approach communication only takes place between neighbouring domains.

For the choice of preconditioner we have considered polynomial preconditioning [P11] and element-by-element preconditioning [P12]. Polynomial preconditioning only requires matrix-vector multiplications, so parallelisation is trivial once a parallel matrix-vector product is available. On a CRAY, the element-by-element preconditioners can be parallelised in a way similar to the matrix-vector product [P10]. On a MEIKO one has to develop new variants of the element-by-element preconditioner, since the existing methods require far too much communication. A possible candidate is described in [P13].

The construction of an orthogonal basis for the Krylov subspace can be performed efficiently on a CRAY. To orthogonalise a basis one needs to perform inner products and vector updates. The latter operation is no problem on the MEIKO either. However, the inner product requires global communication, i.e. not only with neighbouring domains. The approach we have taken so far is to reduce the number of inner products by applying an (expensive) polynomial preconditioner. This reduces the number of inner products at the expense of an increase in the number of matrix-vector multiplications. Another approach is to overlap the communication with computations; this is described in detail in [71].
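To make the three kernels explicit, the sketch below implements a plain left-preconditioned GMRES (no restarts, no Givens rotations) together with a simple truncated Neumann-series polynomial preconditioner, in illustrative Python/NumPy. It is a minimal sketch under our own assumptions, not the implementation of [P10]-[P13] or of [69]; all names, the test matrix, and the parameter values are chosen for illustration only.

    import numpy as np

    def gmres(A_mv, b, x0, m=60, tol=1e-8, M_inv=None):
        """Plain left-preconditioned GMRES (no restarts, no Givens rotations).
        The three dominant kernels discussed in the text are marked below."""
        precond = M_inv if M_inv is not None else (lambda v: v)
        x = x0.copy()
        r = precond(b - A_mv(x))                    # matvec + preconditioning
        beta = np.linalg.norm(r)
        if beta < tol:
            return x
        n = b.size
        V = np.zeros((n, m + 1))                    # orthogonal Krylov basis
        H = np.zeros((m + 1, m))                    # Hessenberg matrix
        V[:, 0] = r / beta
        for k in range(m):
            w = precond(A_mv(V[:, k]))              # kernels 1+2: matvec, preconditioner
            for i in range(k + 1):                  # kernel 3: modified Gram-Schmidt;
                H[i, k] = np.dot(V[:, i], w)        #   each inner product needs global
                w -= H[i, k] * V[:, i]              #   communication on distributed memory
            H[k + 1, k] = np.linalg.norm(w)
            # small least-squares problem: min_y || beta*e1 - H_bar y ||
            e1 = np.zeros(k + 2); e1[0] = beta
            y, *_ = np.linalg.lstsq(H[:k + 2, :k + 1], e1, rcond=None)
            if np.linalg.norm(e1 - H[:k + 2, :k + 1] @ y) < tol * beta or H[k + 1, k] < 1e-14:
                break
            V[:, k + 1] = w / H[k + 1, k]
        return x + V[:, :k + 1] @ y

    def neumann_preconditioner(A_mv, degree=4, omega=0.25):
        """Polynomial (truncated Neumann series) preconditioner:
        M^{-1} v = omega * sum_{j<degree} (I - omega*A)^j v.
        Only matrix-vector products are needed, so it parallelises exactly like A,
        at the price of extra matvecs (but fewer inner products) per iteration."""
        def apply(v):
            p = v.copy()
            z = omega * p
            for _ in range(degree - 1):
                p = p - omega * A_mv(p)
                z = z + omega * p
            return z
        return apply

    # Tiny illustrative test problem: a diagonally dominant random matrix.
    rng = np.random.default_rng(0)
    n = 200
    A = 4.0 * np.eye(n) + rng.standard_normal((n, n)) / np.sqrt(n)
    b = rng.standard_normal(n)
    A_mv = lambda v: A @ v

    x = gmres(A_mv, b, np.zeros(n), m=60, M_inv=neumann_preconditioner(A_mv))
    print(np.linalg.norm(A @ x - b) / np.linalg.norm(b))   # small relative residual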

4.9 Publications

[P1] J.P. Goedbloed, S. Poedts, G.T.A. Huysmans, G. Halberstadt, H. Holties, A.J.C. Beliën, "Magnetohydrodynamics spectroscopy: Large scale computation of the spectrum of ", Future Generation Computer Systems 10, 339-343 (1994).

[P2] M.N. Kooper, H.A. van der Vorst, S. Poedts, and J.P. Goedbloed, "Application of the Implicitly Updated Arnoldi Method with a Complex Shift and Invert Strategy in MHD", PP S3/061, FOM, Nieuwegein, 1993.

[P3] S. Poedts, P.M. Meijer, J.P. Goedbloed, H.A. van der Vorst, A. Jakoby: "Parallel magnetohydrodynamics on the CM-5", in Lecture Notes in Computer Science (eds. W. Gentzsch and U. Harms), Proc. HPCN Europe '94, München, April 18-20, 1994, 796, 365-370 (1994).

[P4] J.G.L. Booten, P.M. Meijer, H.J.J. te Riele, H.A. van der Vorst: "Parallel Arnoldi method for the construction of a Krylov subspace basis: an application in magnetohydrodynamics", in Lecture Notes in Computer Science (eds. W. Gentzsch and U. Harms), Proc. HPCN Europe '94, München, April 18-20, 1994, 797, 196-201 (1994).

[P5] P.M. Meijer, S. Poedts, J.P. Goedbloed, and A. Jakoby: "Nonlinear magnetohydrodynamics on parallel computers", in Proc. of the 6th Joint EPS-APS International Conference on Physics Computing, R. Gruber and M. Tomassini (eds.), August 22-26, 1994, Lugano, Switzerland, 621-624 (1994).

[P6] J.G.L. Booten, P.M. Meijer, H.J.J. te Riele, and H.A. van der Vorst: "Parallel Arnoldi method for the construction of a Krylov subspace basis: an application in magnetohydrodynamics", Report NM-R9406, Department of Numerical Mathematics, CWI, Amsterdam, 1994.

[P7] J.G.L. Booten, H.A. van der Vorst, P.M. Meijer, and H.J.J. te Riele, "A preconditioned Jacobi-Davidson method for solving large generalized eigenvalue problems", Report NM-R9414, Department of Numerical Mathematics, CWI, Amsterdam, 1994.

[P8] Henk A. van der Vorst: "Recent developments in hybrid CG methods", in: W. Gentzsch and U. Harms (Eds.), High Performance Computing and Networking, Proceedings International Conference, Munich, April 1994, Volume 2: Networking and Tools, Lecture Notes in Computer Science, 797, Springer Verlag, Berlin etc., 1994, pp. 174-183.

[P9] E. de Sturler and H.A. van der Vorst: "Communication Cost Reduction for Krylov Methods on Parallel Computers", in: W. Gentzsch and U. Harms (Eds.), High Performance Computing and Networking, Proceedings International Conference, Munich, April 1994, Volume 2: Networking and Tools, Lecture Notes in Computer Science, 797, Springer Verlag, Berlin etc., 1994, pp. 190-195.

[P10] M.B. van Gijzen: "Large scale finite element computations with GMRES-like methods on a Cray Y-MP", Preprint nr. 872, Department of Mathematics, University of Utrecht, 1994.

[P11] M.B. van Gijzen: "A Polynomial Preconditioner for the GMRES algorithm", TNO-Report 93-NM-R0536, to appear in Journal of Computational and Applied Mathematics.

[P12] M.B. van Gijzen: "An analysis of element-by-element preconditioners for nonsymmetric problems", Computer Methods in Applied Mechanics and Engineering 105, pp. 23-40 (1993).

[P13] M.B. van Gijzen: "Clustering of Elements in EBE-Preconditioners", Preprint nr. 850, Department of Mathematics, University of Utrecht, 1994.

4.10 Future directions

The experience obtained in this first project year has given us a clear picture of the possibilities of MPP for MHD applications and has enabled us to give concrete form to our goals for the second year of this pilot project and for its prolongation. This has led to the formulation of the following 'target problems':

• The first target problem involves fully implicit 2D nonlinear MHD simulations in parallel. The difficulty is that implicit methods are so computationally intensive that the extra work per time step, compared to explicit methods, almost cancels the gain from the much larger time step, so that the net gain in CPU time is only modest. The challenge is to adapt recent numerical algorithms to MPP systems and to develop a fully implicit 2D nonlinear MHD code that consumes far less CPU time than explicit methods.

• The second target problem involves the determination of the resistive MHD spectrum of tokamak plasmas. Present-day algorithms allow only limited spatial resolution, i.e. limited matrix sizes, and the aim of this target problem is to demonstrate that new (parallel) algorithms on MPP systems allow the determination of the full resistive MHD spectrum of fusion-relevant tokamak plasmas.

• The third target problem involves the study of 3D Kelvin-Helmholtz instabilities of resonant layers in coronal loops and holes. Available codes and supercomputers only allow 2D studies of this phenomenon, so that the effects of line-tying cannot be taken into account in full MHD simulations. The challenge for parallel versions of the algorithms used is to include this important effect of the anchoring of the magnetic field lines in the dense photosphere.

5 Conclusions after one year

This pilot project aims at studying the application of massively parallel computers to the computation of linear and nonlinear magnetohydrodynamics. These computations involve the solution of systems with large sparse matrices. Nonlinear solutions of the time-dependent magnetohydrodynamics equations are being calculated for applications in the study of nuclear fusion devices of the tokamak type in plasma physics and in the study of the heating and acceleration of stellar coronae in astrophysics. The relevant physics means that three-dimensional calculations are necessary, with characteristic distance scales that are small compared with the dimensions of the system. This means that the full calculations demand a computing power greatly in excess of the present power of supercomputers. Only with future advances in parallel computing will it be possible to do sufficiently detailed calculations for comparison with observations.

In the present pilot project, four groups (institutes) with different backgrounds work closely together on two ambitious target problems in the context of (linear) MHD spectroscopy and nonlinear time-dependent plasma dynamics, respectively. These target problems translate mathematically into large-scale complex non-Hermitian eigenvalue problems and a set of time-dependent nonlinear partial differential equations. The regular progress meetings, at which each group brought in its own experience and vision on the problems, were found to be extremely stimulating. In fact, although the pilot project is a very ambitious one, after the first project year it is fair to state that the project progresses on schedule. The first project year has been very fruitful indeed and the results obtained meet our high expectations. Highlights of the first year are:

• very promising test results have been obtained on the scalability of the MHD calculations (this is a very important aspect and crucial for the possibility of tackling new physics on larger MPP systems in the near future),

• a parallel version of the Arnoldi algorithm already in use has been developed,

• a new (better, faster) iteration scheme for the large eigenvalue problems dealt with in MHD spectroscopy is being developed and the tests look very promising,

• a parallel version of one of our major nonlinear MHD production codes, HERA, has been ported to the CM-5 which means that we are ready to tackle new plasma dynamics (more complex 3D structures and/or higher magnetic Reynolds numbers) when larger machines become available.

Hence, we are looking back on this first year of the project with great satisfaction.

References

[1] J.P. Goedbloed, Phys. Fluids 18, 1258-1268 (1975). "Spectrum of ideal magnetohydrodynamics of axisymmetric toroidal systems".

[2] J.P. Goedbloed, Invited Review in Trends in Physics, ed. M.M. Woolfson, Adam Hilger Ltd., Bristol, 324-335 (1978). "Magnetohydrodynamic instabilities in toroidal plasmas".

[3] J.P. Goedbloed, Phys. Fluids 25, 852-868; 2062-2072; 2073-2088 (1982). "Free-boundary high-beta tokamaks. I; II; III".

[4] J.P. Goedbloed, Invited review in Fronts, Interfaces and Patterns, ed. A.R. Bishop, L.J. Campbell, P.J. Channel, Proc. 3rd Annual Intern. Conf. of the Center for Nonlinear Studies, Los Alamos, May 3-6, 1983; Physica 12D, 107-132 (1984). "Plasma-vacuum interface problems in magnetohydrodynamics".

[5] J.P. Goedbloed, Comp. Phys. Comm. 31, 123-135 (1984). "Some remarks on computing axisymmetric equilibria".

[6] J.P. Goedbloed, G.M.D. Hogewey, R. Kleibergen, J. Rem, R.M.O. Galvao, and P.H. Sakanaka, Proc. Tenth International IAEA Conference on Plasma Physics and Controlled Fusion Research, London, 12-19 September 1984, IAEA, Vienna, 2, 165-172 (1985). "Investigation of high-beta tokamak stability with the program HBT".

[7] R. Kleibergen and J.P. Goedbloed, Plasma Physics and Controlled Fusion 30, 1961-1987 (1988). "Alfvén wave spectrum of an analytic high-beta tokamak equilibrium".

[8] G.T.A. Huysmans, R.M.O. Galvao, J.P. Goedbloed, E. Lazzaro, and P. Smeulders, Plasma Physics and Controlled Fusion 31, 2101-2110 (1989). "Ballooning stability of JET discharges".

[9] J.P. Goedbloed and D.A. D'Ippolito, Physics of Fluids B2, 2366-2372 (1990). "RF stabilization of external kink modes in the presence of a resistive wall".

[10] J.P. Goedbloed, Invited review in Trends in Physics 1991, Proc. 8th General Conf. of the European Physical Society, Amsterdam, Sept. 4-8, 1990, III 827-844 (EPS, Prague, 1991). "MHD waves in thermonuclear and solar plasmas".

[11] G.T.A. Huysmans, J.P. Goedbloed, W. Kerner, Proc. Europhysics 2nd Intern. Conf. on Computational Physics, Amsterdam, Sept. 10-14, 1990, ed. A. Tenner (World Scientific, 1991) 371. "Isoparametric Bicubic Hermite Elements for Solution of the Grad-Shafranov Equation".

[12] J.P. Goedbloed, Comput. Phys. Commun. 59, 39-53 (1990). "Stability of solar coronal loops".

[13] S. Poedts, M. Goossens, W. Kerner, Astrophys. J. 360, 279-287 (1990). "On the efficiency of coronal loop heating by resonant absorption".

[14] G.T.A. Huysmans, T.C. Hender, O.J. Kwon, J.P. Goedbloed, E. Lazzaro, P. Smeulders, Plasma Phys. Contr. Fusion 34, 487-499 (1992). "MHD stability analysis of high-β JET discharges".

[15] S. Poedts, W. Kerner, J.P. Goedbloed, B. Keegan, G.T.A. Huysmans, E. Schwarz, Plasma Phys. Contr. Fusion 34, 1397-1422 (1992). "Damping of global Alfvén waves in tokamaks due to resonant absorption".

[16] J.P. Goedbloed, G.T.A. Huysmans, S. Poedts, G. Halberstadt, W. Kerner, E. Schwarz, Invited Review in Advances in Simulation and Modeling of Thermonuclear Plasmas, Proc. IAEA Technical Committee Meeting, Montreal, June 15-17 (1992). "Computation of resistive MHD modes in Thermonuclear and Astrophysical plasmas".

[17] G.T.A. Huysmans, J.P. Goedbloed, W. Kerner, Physics of Fluids B3, 1545-1558 (1993). "Free boundary resistive modes in tokamaks".

[18] J.P. Goedbloed, G.T.A. Huysmans, H. Holties, W. Kerner, S. Poedts, Plasma Phys. Contr. Fusion 35, B277-292 (1993). "MHD spectroscopy: Free boundary modes (ELM's) and external excitation of TAE modes".

[19] J.P. Goedbloed, G. Halberstadt, Astron. Astrophys. 286, 275-301 (1994). "Magnetohydrodynamic waves in coronal flux tubes".

[20] S. Poedts, J.P. Goedbloed, Space Science Reviews 68, 103-108 (1994). "3D nonlinear wave heating of coronal loops".

[21] A.G. Hearn, Invited review, Proc. Symposium on Circumstellar matter, eds. I. Appenzeller and C. Jordan, Reidel, Dordrecht, 395-408 (1987). "Theory of winds from hot stars".

[22] A.G. Hearn, Astron. Astrophys. 185, 247-252 (1987). "Models for stellar coronae: thin coronae with radiative forces".

[23] A.G. Hearn, Invited review, Proc. Workshop on Mass outflows from stars and galactic nuclei, eds. L. Bianchi and R. Gilmozzi, Kluwer, Dordrecht, 79-93 (1988). "The theory of stellar winds from hot and cool stars".

[24] A.G. Hearn, Invited review, Proc. Workshop on Pulsation and mass loss in stars, eds. R. Stalio and L. A. Willson, Kluwer, Dordrecht, 211-224 (1988). "Mass loss and non-radial pulsations in Be stars".

[25] A.G. Hearn, Invited review, Proc. Coll. on From Miras to planetary nebulae: Which path for evolution?, eds. M. O. Mennessier and A. Omont, Editions Frontières, Gif sur Yvette, 121-130 (1990). "The physics of mass loss from AGB stars".

[26] P. Korevaar, Astron. Astrophys. 226, 209-214 (1989). "Time dependent corona models: coronae with accretion".

[27] P. Korevaar, A.G. Hearn, Astron. Astrophys. 220, 177-184 (1989). "Time dependent corona models: dynamical response to perturbations".

[28] P. Korevaar, "Time dependent models of stellar coronae", Thesis, University of Utrecht, Ch. 1 (1989).

[29] P. Korevaar, P.C.H. Martens, Astron. Astrophys. 226, 203-214 (1989). "Time dependent corona models: scaling laws".

[30] F.P. Pijpers, A.G. Hearn, Astron. Astrophys. 209, 198-210 (1989). "A model for a stellar wind driven by linear acoustic waves".

[31] F.P. Pijpers, H.J. Habing, Astron. Astrophys. 215, 334-346 (1989). "Driving the stellar wind of AGB stars by acoustic waves: exploration of a simple model".

[32] G. T. Geertsema, H. Burm, C90 Europhysics conference on computational physics, (1991). "Porting a 3D-turbulence simulation to a multiprocessor system". [33] J. P. M. Koninx, A. G. Hearn, Astron. Astrophys. 263, 208-218 (1992). "The role of sound waves in Be star winds".

[34] J. P. M. Koninx, Astron. Astrophys, submitted (1993). "Fast rotating Alfvén wave driven winds - an application to Be stars".

[35] J. P. M. Koninx, S. P. Owocki, Astron. Astrophys, submitted (1993). "Rotation in radiatively driven winds".

[36] J. M. E. Kuijpers, Invited review, Cargèse workshop on Plasma phenomena in the solar atmosphere, eds. M. A. Dubois, F. Bély-Dubau, D. Grésillon, 227-242 (1990).

[37] J. M. Kuijpers, Invited review in Basic plasma physics processes on the Sun, I.A.U. Symp. 142, eds. E. R. Priest, V. Krishan, 365-373 (1990). "Coherent from electrostatic double layers".

[38] M. de Vries, J. M. E. Kuijpers, Astron. Astrophys, submitted (1993). "Magnetic relaxation of a cylinder; a VLBI model".

[39] M. de Vries, Astron. Astrophys, submitted (1993). "Magnetic relaxation of anisotro- pically expanding plasmoids".

[40] M. de Vries, J. M. E. Kuijpers Astron. Astrophys 266, 77-84 (1992). "A magnetic flare model for X-ray variability in Active Galactic Nuclei"

[41] J.A. Meijerink and H.A. van der Vorst, Math. of Comp. 31(137), 148-162 (1977). "An Iterative Solution Method for Linear Systems of Which the Coefficient Matrix is a Symmetric M-Matrix".

[42] H.A. van der Vorst, Math. of Comp. 39(160), 559-561 (1982). "A Generalized Lanczos Scheme".

[43] A. van der Sluis and H.A. van der Vorst, Linear Alg. and its Appl. 88/89, 651-694 (1987). "The Convergence Behaviour of Ritz Values in the Presence of Close Eigenvalues".

[44] H.A. van der Vorst, Parallel Computing 5, 45-54 (1987). "Large Tridiagonal and Block Tridiagonal Linear Systems on Vector and Parallel Computers".

[45] H.A. van der Vorst, Parallel Computing 5, 303-311 (1987). "Analysis of a Parallel Solution Method for Tridiagonal Linear Systems".

[46] P.H. Michielse and H.A. van der Vorst, Parallel Computing 7, 87-95 (1988). "Data Transport in Wang's partition Method".

[47] H.A. van der Vorst, IEEE Trans. on Magnetics 24(1), 286-290 (1988). "Parallel Solution of Discretised PDE's".

[48] H.A. van der Vorst, Computer Physics Communications 53, 223-235 (1989). "ICCG and related methods for 3D problems on vector computers".

[49] H.A. van der Vorst, Future Generation Computer Systems 4(4), 285-291 (1989). "Practical aspects of parallel scientific computing". [50] H.A. van der Vorst and P. van Dooren (eds.), Parallel Algorithms for Numerical Linear Algebra, Volume 1 in: Advances in Parallel Computing, Elsevier Science Pub., Amsterdam (1990), ISBN: 0444 88621 4, 330 pages.

[51] Henk Van der Vorst, Supercomputer 7(3), 28-35 (1990). "Experiences with parallel vector computers for sparse linear systems".

[52] H.A. van der Vorst and J.B.M. Melissen, IEEE Trans. on Magnetics 26 (2), 706-708 (1990). "A Petrov-Galerkin type method for solving Ax=b, where A is a symmetric complex matrix".

[53] H.A. van der Vorst, Advances in Water Resources 13 (3), 137-146 (1990). "Iterative Methods for the Solution of Large Systems of Equations on Supercomputers".

[54] J. Dongarra, I. Duff, D. Sorenson and H. van der Vorst, Solving linear systems on vector and shared memory computers, SIAM, Philadelphia, PA. (1991), ISBN 0- 89871-270-X, 266 pages.

[55] E. de Sturler, in Proc. of the fifth Int. Symp. on Numer. Methods in Eng., ed. R. Miller (1991). "A Parallel variant of GMRES(m)".

[56] Van der Vorst, H.A. and Vuik, C., GMRESR: A family of nested GMRES methods, Reports of the Faculty of Technical Mathematics and Informatics no. 91-80, Delft, The Netherlands (1991).

[57] H. A. van der Vorst, SIAM J. Sci. Stat. Comput. 13(2), 631-644 (1992). "Bi- CGSTAB: A fast and smoothly converging variant of Bi-CG for the solution of nonsymmetric linear systems".

[58] Jack J. Dongarra and Henk A. van der Vorst, Supercomputer, September 1992. "Performance of Various Computers Using Standard Sparse Linear Equations Solving Techniques".

[59] J. Demmel, M. Heath and H. van der Vorst, "Parallel Numerical Linear Algebra", Acta Numerica 1993, Cambridge University Press, Cambridge, 1993, pp.111-198

[60] C. C. Paige, B. N. Parlett and H. A. van der Vorst, 'Approximate solutions and eigenvalue bounds from Krylov subspaces', Report PAM-579, Center for pure and applied mathematics, University of California, Berkeley, February 23, 1993, to appear in Numer. Lin. Algebra with Appl.

[61] D. C. Sorenson, SIAM J. Mat. Anal. Appl. 13(1), 357-385 (1992), "Implicit application of polynomial filters in a k-step Arnoldi method".

[62] Lianne G. C. Crone and Henk A. van der Vorst, "Communication aspects of the Conjugate Gradient method on distributed memory machines", Supercomputer, 58, Vol X, number 6, pp. 4-9, 1994.

[63] G.L.G. Sleijpen and D.R. Fokkema, "BiCGSTAB(ℓ) for linear equations involving unsymmetric matrices with complex spectrum", ETNA, 1, 11-32 (1993).

[64] Gerard L. G. Sleijpen, H. A. van der Vorst and D. R. Fokkema, "BiCGSTAB(ℓ) and other hybrid Bi-CG methods", Numerical Algorithms, 7, pp. 75-109, 1994.

[65] G.L.G. Sleijpen and H.A. van der Vorst, "Maintaining convergence properties of BiCGstab methods in finite precision arithmetic", Preprint 861, Department of Mathematics, University of Utrecht, Utrecht, 1994.

[66] Richard Barrett, Michael Berry, Tony Chan, James Demmel, June Donato, Jack Dongarra, Victor Eijkhout, Roldan Pozo, Charles Romine, and Henk van der Vorst, "Templates for the solution of linear systems: Building Blocks for Iterative Methods", SIAM, Philadelphia, 1994

[67] Henk A. van der Vorst, "Parallel aspects of iterative methods", in: A.E. Fincham and B. Ford (Eds.), Proc. IMA Conf. on Parallel Computing, Oxford University Press, Oxford, UK, 1993, pp. 175-186.

[68] G.L.G. Sleijpen and H.A. van der Vorst, A generalized Jacobi-Davidson iteration method for linear eigenvalue problems, Preprint nr. 856, Department of Mathematics, University of Utrecht, 1994.

[69] Saad, Y. and Schultz, M.H., "GMRES: A generalized minimal residual algorithm for solving nonsymmetric linear systems", SIAM J. Sci. Stat. Comput. Vol 7 No 3, pp. 856-869 (1986).

[70] Eisenstat, S.C., Elman, H.C. and Schultz, M.H., "Variational iterative methods for nonsymmetric systems of linear equations", SIAM J. Numer. Anal. Vol. 20 No 2 (1983).

[71] R. Oostveen, "The implementation of GMRESR on a parallel distributed memory computer" (in Dutch: "De implementatie van GMRESR op een parallelle distributed memory computer"), Master's thesis, University of Utrecht, 1994.

[72] H.J.J. te Riele, Th.J. Dekker, and H.A. van der Vorst (eds.), Algorithms and appli- cations on vector and parallel computers, North-Holland, 1987.

[73] H.J.J. te Riele, Th.J. Dekker, and H.A. van der Vorst (eds.), CWI-IMACS Symposia on Parallel Scientific Computing, Three special issues of the journal "Applied Numerical Mathematics", Vol. 7, Number 5, June 1991; Vol. 8, Number 2, Sept 1991; Vol. 10, Number 1, June 1992.

[74] H.J.J. te Riele and H.A. van der Vorst (eds.), Proceedings of the CWI-RUU Symposia on Massively Parallel Computing and Applications, Special Issue of Applied Numerical Mathematics, to appear early 1995.
