SOFSEM ’94 – Milovy, The Czech Republic, 27. 11. – 9. 12. 1994

Current Trends in : From Theory to Practice

Z. Strakoˇs, M. T˚uma

Institute of Computer Science, Academy of Sciences of the Czech Republic Pod vod´arenskou vˇeˇz´ı2, CZ–182 07 Prague, The Czech Republic Email: [email protected] [email protected]

Abstract: Our goal is to show on several examples the great progress made in numerical analysis in the past decades together with the principal problems and relations to other disciplines. We restrict ourselves to numerical linear algebra, or, more specifically, to solving Ax = b where A is a real nonsingular n by n matrix and b a real n dimensional vector, and to computing eigenvalues of a A. We discuss recent− developments in both sparse direct and iterative solvers, as well as fundamental problems in computing eigenvalues. The effects of parallel architectures to the choice of the method and to the implementation of codes are stressed throughout the contribution.

Keywords: numerical analysis, numerical linear algebra, linear algebraic systems, sparse direct solvers, iterative methods, eigenvalue computations, parallel architectures and algorithms

1 Motivation modelling and extensive use of the numerical analysis, mentioning just one important area Our contribution deals with numerical analy- of application. sis. What is numerical analysis? And what Nick Trefethen proposes in his unpublished is its relation to computer science (comput- note [38] the following definition: ing sciences)? It is hard to give a definition. Sometimes “pure” mathematicians do not con- Numerical analysis is the study of sider numerical analysis as a part of mathe- algorithms for mathematical prob- matics and “pure” computer scientists do not lems involving continuous variables consider it as a part of computer science. Nev- ertheless, hardly anyone would argue against The keywords are algorithm and continuous, the importance of this field. Dramatic devel- the last typically means real or complex. Since opment in high technology in the last decades real or complex numbers cannot be represented would not be possible without mathematical exactly on computers, the part of the business 2. Solving large sparse linear algebraic systems of numerical analysis must be to study round- the numerical method to the efficient and reli- ing errors. To understand finite algorithms or able code. We demonstrate that especially on direct methods, e.g. or the current trends in developing sparse direct Cholesky factorization for solving systems of solvers. linear algebraic equation, one have to under- stand the computer architectures, operation 2 Solving large sparse linear alge- counts and the propagation of rounding errors. braic systems This example, however, does not tell you all Although the basic scheme of the symmetric the story. Most mathematical problems involv- Gaussian elimination is very simple and can ing continuous variables cannot be solved (or be casted in a few lines, the effective algo- effectively solved) by finite algorithms. A clas- rithms which can be used for the really large sical example - there are no finite algorithms problems usually take from many hundreds for matrix eigenvalue problems (the same con- to thousands of code statements. The differ- clusion can be extended to almost anything ence in the timings can then be many orders nonlinear). Therefore the deeper business of of magnitude. This reveals the real complex- numerical analysis is approximating unknown ity of the intricate codes which are necessary quantities that cannot be known exactly even in to cope with the large real-world problems. principle. A part of it is, of course, estimating Subsection 2.1 is devoted to the sparse direct the precision of the computed approximation. solvers stemming from this symmetric Gaus- Our goal is to show on several examples the sian elimination. Iterative solvers, which are great achievements of the numerical analysis, based on the approaching the solution step by together with the principal problems and rela- step from some initial chosen approximation, tions to other disciplines. We restrict ourselves are discussed in subsection 2.2. A comparison to numerical linear algebra, or more specifi- of both approaches is given in subsection 2.3. cally, to solving Ax = b, where A is a real nonsingular n by n matrix and b a real n- 2.1 Sparse direct solvers dimensional vector (for simplicity we restrict This section provides a brief description of the ourselves to real systems; many statements ap- basic ideas concerning sparsity in solving large ply, of course, also to the complex case), and linear systems by direct methods. 
Solving the to computing eigenvalues of a square matrix A. large sparse linear systems is the bottleneck in Much of scientific computing depends on these a wide range of engineering and scientific com- two highly developed subjects (or on closely putations. We restrict to the symmetric and related ones). This restriction allows us to go positive definite case where most of the impor- deep and show things in a sharp light (well, at tant ideas about algorithms, data structures least in principle; when judging this contribu- and computing facilities can be explained. In tion you must take into account our - certainly this case, the solution process is inherently sta- limited - expertise and exposition capability). ble and we can thus avoid numerical pivoting We emphasize two main ideas which can be which would complicate the description other- found behind our exposition and which are es- wise. sential for the recent trends and developments Efficient solution of large sparse linear sys- in numerical linear algebra. First, our strong tems needs a careful choice of the algorith- belief is that any “software solution” without mic strategy influenced by the characteristics → a deep understanding of mathematical (physi- of computer architectures (CPU speed, mem- cal, technical, ...) background of the problem ory hierarchy, bandwidths cache/main mem- is very dangerous and may lead to fatal er- ory and main memory/auxiliary storage). The rors. This is illustrated, e.g., on the prob- knowledge of the most important architectural lem of computing eigenvalues and on charac- features is necessary to make our computations terizing the convergence of the iterative meth- really efficient. ods. Second, there is always a long way (with First we provide a brief description of the many unexpectably complicated problems) from Cholesky factorization method for solving of a 2. Solving large sparse linear algebraic systems sparse linear system. This is the core of the the column j by the column k at the lines (3)– symmetric Gaussian elimination. Basic nota- (5) is denoted by cmod(j, k). Vector scaling at tion can be found, e.g., in [2], [18]. the lines (8)–(10) is denoted by cdiv(j). We use the square root formulation of the The right-looking approach can be de- factorization in the form scribed by the following pseudo-code: (1) for k = 1 to n A = LDLT , (2) dk = ak where L is lower triangular matrix and D is (3) for i = k + 1 to n if aik = 0 6 diagonal matrix. Having L, solution x can be (4) lik = aik/dk computed using two back substitutions and one (5) end i diagonal scaling: (6) for j = k + 1 to n if akj = 0 6 1 T (7) for i = k + 1 to n if lik = 0 Ly¯ = b ; y = D− y¯ ; L x = y. 6 (8) aij = aij likakj − Two primary approaches to factorization (9) end i are as follows (we do not mention the row- (10) end j Cholesky approach since its algorithmic prop- (11) end k erties usually do no fit well with the modern computer architectures). The left-looking approach can be described In the right-looking approach, once a col- by the following pseudo-code: umn k is completed, it immediately generates (1) for j = 1 to n all contributions to the subsequent columns, i.e., columns to the right of it in the matrix. (2) for k = 1 to j 1 if akj = 0 − 6 A number of approaches have been taken in (3) for i = k + 1 to n if lik = 0 6 order to solve the problem of matching nonze- (4) aij = aij likakj − (5) end i ros from columns j and k at the lines (6)-(10) (6) end k of the pseudo-code as will be mentioned later. 
Similarly as above, operation at the lines (3)– (7) dj = ajj (8) for i = k + 1 to n (5) we denote cdiv(k) and column modification at the lines (7)–(9) is denoted by cmod(j, k). (9) lij = aij/ajj (10) end i Discretized operators from most of the ap- (11) end j plications such as structural analysis, compu- tational fluid dynamics, device and process In this case, a column j of L is computed by simulation and electric power network prob- gathering all contributions from the previously lems contain only a fraction of nonzero ele- computed columns (i.e. the columns to the left ments: these matrices are very sparse. Explicit of the column j in the matrix) to the column consideration of sparsity leads to substantial j. Since the loop at the lines (3)–(5) in this savings in space and computational time. Usu- pseudo-code involves two columns, j and k, ally, the savings in time are more important, with potentially different nonzero structures, since the time complexity grows quicker as a the problem of matching corresponding nonze- function of the problem size than the space ros must be resolved. Vector modification of complexity (memory requirements).

∗ ∗ ∗∗∗∗∗     ∗ ∗ ¯ ∗ ∗ A =   A =    ∗ ∗   ∗ ∗       ∗ ∗   ∗ ∗       ∗∗∗∗∗   ∗ ∗  Figure 1.1: The arrow matrix and in the natural and reverse ordering. 2. Solving large sparse linear algebraic systems

For a dense problem we have CPU time the diagonal. Natural orderings of many appli- proportional to n3 while the necessary mem- cation problems lead to such concentrations of ory is proportional to n2. For the sparse prob- nonzeros. lems arising from the mesh discretization of 3D Define fi = min j aji = 0 for i nˆ. { | 6 } ∈ problems CPU time is proportional (typically) This locates the leftmost nonzero element in 2 4 to n and space proportional to n 3 . each row. Set δi = i fi. The profile is de- n − We will demonstrate the time differences fined by i=1 δi. The problem of concentrat- using the following simple example from [30]: ing elementsP around the diagonal can be thus reformulated as the problem of minimizing the Example 2.1 Elapsed CPU time for the ma- profile using symmetric reordering of matrix trix arising from a finite element approxima- elements. tion of a 3D problem (design of an automo- bile chassis). Dimension of a the matrix: n = ∗ ∗ ∗ 44609. Proportion of nonzeros: 0.1%. Time on   Cray X-MP (1 processor) when considered as ∗∗∗ ∗    ∗ ∗ ∗ ∗  a full (dense) matrix: 2 days. Time on Cray    ∗ ∗ ∗  X-MP when considered as a sparse matrix: 60   seconds.  ∗∗∗∗∗  Figure 1.2: An matrix illustrating the pro- Considering matrix sparsity we must care file Cholesky scheme. We have f1 = 1,f2 = about the positions of nonzeros in the matrix 1,f3 = 2,f4 = 3,f5 = 1. patterns of A and L. For a given vector v Rk ∈ define Sometimes we use a rougher measure of the quality of the ordering - we are minimizing Struct(v)= j kˆ vj = 0 . { ∈ | 6 } only band of the matrix - (often) defined as β = max δi, for i nˆ. Usually, nonzero elements are introduced ∈ into the new positions outside the pattern of A during the decomposition. These new ele- ∗ ∗ ∗ ments are known as fill-in elements. In order   ∗ ∗ ∗ ∗ to reduce time and storage requirements, it is    ∗∗∗∗∗  necessary to minimize the number of the fill-    ∗ ∗ ∗ ∗  in elements. This can be accomplished by a    ∗ ∗ ∗ ∗  combination of a good choice of data structures    ∗ ∗  used for matrix elements, matrix ordering and an efficient implementation of the pseudo-code. Figure 1.3: An example of a A typical example showing how the matrix with β = 2. ordering influences time and storage is the case of an arrow matrix in the Figure 1.1. While in More advanced variant of this principle re- the first matrix A we do not get any fill-in dur- lies on the dynamical reordering of the matrix ing the Cholesky factorization process, reverse to get the nonzeros as close to the diagonal ordering of variables provides matrix A¯, which as possible during the Cholesky factorization. completely fills after the first step of the de- Such an algorithm we call the frontal method. composition: In this case we use in any step only the el- ements of a certain window which is moving 2.1.1 A Special sparsity structure down the diagonal. – Profile, band and frontal For the algorithms to reorder matrices ac- schemes cording to this principles see [18]. Advantages A well-known way to make use of the sparsity of the methods considered in this subsection in the Cholesky factorization is to move the are in their simplicity. To store the nonzeros nonzero elements of A into the area “around” we need to store only that part of the matrix 2. Solving large sparse linear algebraic systems

A which is covered by the elements which de- mobile chassis). Dimension of a the matrix: termine the profile or the band of A or the n = 44609. Memory size used and number of dynamical window of the frontal method. floating-point operations for the factor L for Simple result (see, for instance, [18]) guar- the frontal solver: 52.2 MByte / 25 Billion. antees that all the nonzeros of L are inside Memory size used and number of floating-point the above-mentioned part determined from the operations for L when general sparse solver was pattern of A. This observation justifies the used: 5.2 MByte / 1.1 Billion. three algorithmic types mentioned in this sub- section. The second reason is architectural. Hard- To implement band elimination we need ware gather/scatter facilities used in modern only to store nonzero elements in a rectangular computers (see [26]) caused that even the sim- array of the size β n. Some problems gen- × plicity of data structures for band, profile erate matrices with a main diagonal and with and frontal solvers are not able to guarantee one or more nonzero subdiagonals. In this case competitive computation times. They behave we can store also these diagonals and diagonals worse than general sparse solvers even in the necessary for fill-in as a set of “long” vectors. case when the difference in number of fill-in el- To implement the profile method, we usually ements (size of the factor L) is not so dramatic. need even less nonzero positions and one addi- tional pointer vector to point to the first nonze- 2.1.2 General Sparse Solvers ros in the matrix rows. Frontal elimination To describe basic ideas used in today’s general needs vectors to perform row and column per- sparse solvers we need to introduce some ter- mutations of the system dynamically, through- minology. Undirected graphs are useful tools out the factorization steps. More complicated in the study of symmetric matrices. A given implementation is compensated by the advan- matrix A can be structurally represented by its tage of smaller working space. All these pos- associated graph G(A) = (X(A), E(A)), where sibilities can be considered in both the right- nodes in X(A)= 1,...,n correspond to rows { } and left-looking implementations but the dif- and columns of the matrix and edges in E(A) ferences are not, in general, very large since all correspond to nonzero entries. these models are based on similar principles. Filled matrix F = L + LT contains gener- Although all these three schemes are very ally more nonzeros than A. Structure of F is simple to implement and also the data struc- captured by the filled graph G(F ). The prob- tures are simple (nonzero parts of rows, lem how to get structure of nonzeros of F was columns or diagonals are stored in vectors or solved first in [33] using graph-theoretic tools rectangular arrays as dense pieces), they are to transform G(A) to G(F ). not used very often as in recent sparse sym- An important concept in sparse factoriza- metric solvers. tion is the elimination tree of L. It is defined First reason is algorithmic. General sparse by schemes may have much less fill-in elements than the previous schemes would implicitly parent(j)= min i lij = 0,i>j . suppose. We demonstrate this fact on the pre- { | 6 } viously mentioned example taken from [30]: In other words, column j is a child of column Example 2.2 Comparison of factor size and i if and only if the first subdiagonal nonzero number of floating-point operations for the ma- of column j in L is in row i. 
Figure 1.4 shows trix arising from the finite element approxi- structures of matrices A and L and the elimi- mation of a 3D problem (design of an auto- nation tree of L. 2. Solving large sparse linear algebraic systems

5 ✉

∗ ∗   ✉ ✉ 4 ∗ ∗ 2    ∗ ∗ ∗   f  ✉ ✉  ∗ ∗ ∗   f   ∗ ∗ ∗  1 3

Figure 1.4: Structures of matrices A, L and of the elimination tree of L. By stars we denote original nonzeros of A, additional fill-in in L is denoted by the letter f.

Following few lemmas recall some impor- Moreover, it can be shown that Tr[i] is a tant properties of elimination trees which will pruned subtree of T [i] and that its leaves can help us to understand basic principles of sparse be easily determined. solvers. For a deeper insight into this subject The important corollary from these consid- we refer to [25]. Note, that elimination tree erations is that structure of rows in L can be can be computed directly from the structure easily determined using the elimination tree. of nonzeros of A in complexity nearly linear in The second important problem concerning n (see [27]). the implementations - determination of the

Lemma 2.1 If lij = 0, then the node i is an structures of columns of L - can be also very 6 ancestor of j in the elimination tree. easily solved also using the elimination tree: This observation provides a necessary con- dition in terms of the ancestor-descendant re- Lemma 2.4 Struct(L j) i lij = i j ∗ ≡ { | 6 ∅∧ ≥ } lation in the elimination tree for an entry to be = Struct(L k) Struct(A j) k is son j ∗ ∗ nonzero in the filled matrix. S in T (A) S 1,...,j 1 Lemma 2.2 Let T [i] and T [j] be two disjoint −{ − } subtrees of the elimination tree (rooted at i and This is a simple corollary of the Cholesky j, respectively). Then for all s T [i] and decomposition step and dependency relations ∈ t T [j], lst = 0. captured by the elimination tree. This formula ∈ means that in order to get the structure of a One important problem concerning the column i in L we need only to merge struc- Cholesky factorization is how to determine row tures of the column i in A with structures of structures of L. For instance, in the left- sons of i in L. Consequently, the algorithm looking pseudo-code, nonzeros in a row k of to determine the column structure of L can be L correspond to columns from 1,...,k 1 { − } implemented in O(m) operations where m is a which contribute to column k. number of nonzeros in L. Lemma 2.3 lij = 0 if and only if the node j is 6 The elimination tree gathers the most im- an ancestor of some node k in the elimination portant structural dependencies in the Cholesky tree, where aik = 0. 6 factorization scheme. Numbering of its vertices This result can be used to characterize the determines the order in which the matrix en- row structure of the Cholesky factor. Define tries are processed by the solver. Moreover, we Tr[i], the structure of the i th row of the can renumber vertices and/or even modify the − Cholesky factor as follows elimination tree while preserving the amount of fill-in elements in the correspondingly per- Tr[i]= j lij = 0, j i . muted factor L. Motivations for such changes { | 6 ≤ } will be described in the following subsection. Then we have Basic structure of the general sparse left- looking solver can then be given in the follow- Tr[i] T [i]. ing four steps (not taking into account the or- ⊆ 2. Solving large sparse linear algebraic systems dering phase): Basic approach how to accomplish that is to use rather operations with blocks of data in- Form the elimination tree. stead of operations with matrix elements. This • important observation is now commonly used Find the structure of columns of L (sym- • in linear algebra operations (see [15]). It was bolic factorization). introduced after the spreading of vector su- percomputer architectures. The main problem Allocate the static data structures for L • with achieving the supercomputer performance based on the result of the previous step. on these architectures was to keep the vector functional units busy - to get enough data for Do numerical updates corresponding to • them. Blocking data to use matrix-vector and the left-looking pseudo-code. matrix-matrix operations was found to be very Implementation of the right-looking solver successful. can be based on the similar scheme. A comparison of number of memory refer- One of the popular type of the right-looking ences and number of flops for the three types algorithm is the multifrontal method (see [12], of basic operations in dense matrix computa- [3]). In that method, the update step of the tions is in the Table 1.1. 
We consider α to be pseudo-code (steps (6)–(10)) is implemented a scalar, x,y to be vectors and B,C,D to be using dense matrix operations. Updates are (square) matrices; n represents the size of the stored separately, usually in the separate stack vectors and matrices. working area, but some of them can make use operation # m. r. # flops #m.r. of the final space for L. We do not discuss #flops this method as a special case in our overview y αx + y 3n 2n 3 ← 2 of sparse techniques, because its implementa- y Dx + y n2 2n2 1 tions on various computers can mix the ideas ← 2 2 3 2 previously mentioned. B DC + B 4n 2n n Since we can effectively compute structures ← of columns and rows of L, the symbolic over- Table 1.1: Comparison of number of mem- head in the computations is rather small. But ory references (m.r.) and number of flops for this does not mean that our computations are the three types of block hierarchy in dense ma- effective. The problem is that the choice of trix computations. the algorithmic strategy has to match with the computer architecture. We will consider some A similar principle can be applied also issues of this kind in the following subsection. for the scalar computations in general sparse solvers. Note, that vector pipelining is not 2.1.3 Let the architecture reign decisive; much more important is keeping Suppose we have the usual memory hierarchy the CPU unit as busy as possible while with fast and small memory parts at its top minimizing the data transfers. Thus, the (registers, cache) and slower and bigger parts efficient implementations of both the left- at the bottom. Usually, it is not possible to looking and right-looking algorithms require embed the large problems completely into the that columns of L sharing the same sparsity cache and transfer between memory hierarchy structure are grouped together into supern- levels takes considerable time. On the other odes. More formally, the set of continguous side, computations are the most effective only columns j, j + 1,...,j + t constitutes a su- { } if the active data are as close to the top as pernode if Struct(l k) = Struct(l k+1) k ∗ ∗ ∪ { } possible. This implies that we must in some for j k j + t 1. Note that these columns ≤ ≤ − way maximize the proportion of the number of have a dense diagonal block and have identical floating-point operations (flops) to number of column structure below row j +t. A supernode memory references which enable us to keep the can be treated as a computational and storage data in the cache and registers. unit. Following example shows the difference 2. Solving large sparse linear algebraic systems in timings for the sparse Cholesky right-looking Example 2.4 Comparison of number of page decomposition using supernodes. faults for a matrix arising from the 9-point dis- Example 2.3 Elapsed time for the linear sys- cretization of a regular 2D 180 180 grid (di- × tem with the SPD matrix of the dimension n = mension n = 32400). Number of page faults 5172 arising from the 3D finite element dis- for an level-by-level ordering of the elimination cretization. Computer: 486/33 IBM PC com- tree from the bottom to the top: 1.670.000. patible. Virtual paging system uses the mem- Number of page faults using postordering of ory hierarchy: registers /cache /main mem- the elimination tree: 18.000. ory /hard disk. General sparse right-looking Another strategy to obtain the equivalent solver. Elapsed time without block implemen- reordering that can reduce the active storage tation: 40 min. 
Elapsed time with supernodal is to rearrange the sequences of children in the implementation: 4 min. elimination tree. The situation is depicted on the Figure 1.6. While on the left part of the Another way how to decrease the amount figure we have a tree with some initial pos- of communication in the general sparse solver tordering, on the right side we are numbering is to try to perform the update operations in “large” children first. We are doing it in any such a way that the data necessary in a step node and recursively. For instance, considering of the decomposition are as close together as vertex 18, we have numbered “largest” subtree possible. Elimination tree can serve as an ef- of the right side elimination tree first. ficient tool to describe this principle. Having More precisely, the new ordering is based an elimination tree, we can renumber some of on the structural simulation of the Cholesky its vertices in such a way that the . Active working space in each decomposition with the corresponding permu- step can be determined for various renumber- tation will provide the factor L of the same ings of children of any node. In this way, size. Consider the renumbering of the elimina- the recursive structural simulation can deter- tion tree from the Figure 1.4. This elimination mine a new renumbering permuting the sub- tree is renumbered by a so-called postordering trees corresponding to the tree nodes in order (a topological ordering numbering any subtree to minimize the active working space without by an interval of indices). This reordering is changing the elimination tree. Consider, for in- equivalent to the original one in the sense that stance, vertex 18 in the elimination trees. De- it provides the same factor L. cision, in which order we will process its sons 5 (and, of course, all its subtrees considering pos- ✉ tordered elimination tree) is based on the re- cursively computed value of the temporary ac- ✉ ✉ 4 1 tive working space. The natural idea is to start with processing of “large” subtrees. Both the ✉ ✉ children rearrangements are postorderings of 2 3 the original elimination tree. The actual size of the working space depends directly on whether we are minimizing working space for the mul- Figure 1.5: Postordering of the elimination tifrontal method or for some left-looking or tree of the Figure 1.4 right-looking out-of-core (out-of-cache) solver. Since the postordering numbers indices in If we are looking for the equivalent transfor- such a way that vertices in any subtree are mations of the elimination tree in order to min- numbered before giving numbers to any dis- imize the active working space, we can not only joint subtree, we can expect much less data shrink vertices into supernodes and renumber communications than in the previous case. elimination tree. We can also to change the The difference is described by the number of whole structure of the elimination tree. Theo- page faults in the virtual paging system in the retical basis of this transformation is described following example (see [31]). in [29]. 2. Solving large sparse linear algebraic systems

✉ 19 ✉ 19 17 18 18 ✉ ✉ ✉ 6 ✉ 16✉ ✉ 11 5✉ ✉ 17 5 10 11 ✛ ✲ 14✉✉ 15 ✉ ✉ 3✉✉ 4 ✉16 ✉

✉✉✉✉✉✉3 4 8 ✉✉✉✉✉✉9 14 9 1 2 10 15 12 13 ✉ ✉ ✉ ✉ ✉ ✉ ✉ ✉ 7 8 12 13 12 67

Figure 1.6: Rearranging of the children sequences in the elimination tree.

Instead of the balanced elimination tree as cal parameters we are not able even to de- provided by some matrix reorderings minimiz- cide which combination of the left-looking and ing fill-in elements we can use unbalanced elim- right-looking techniques is optimal. There are ination tree which can in practice decrease the many open problems in this area. Theoretical active working space about the size up to 20%. investigation leads often to the directly appli- Balanced and unbalanced elimination trees cable results. are schematically depicted on the Figure 1.7. 2.2 Recent development in iterative In practice, all these techniques deal with solvers: steps towards a black box the supernodes rather than with the individual iterative software? entries. A large amount of black box software in the Balanced and effective use of the described form of mathematical libraries as LAPACK techniques is a difficult task strongly depend- (LINPACK), NAG, EISPACK, IMSL etc. and ing on the computer architecture for which the general sparse solvers as MA27, MA42, MA48, solver is constructed. For computers with rel- UMFPACK, etc. have been developed and are atively quick (possibly vectorizable) floating- widely used in many applications. point operations and slow scalar arithmetics, Users can exploit this software with high one can effectively merge into the supernodes confidence for general problem classes. Con- more vertices despite the fact that the resulting cerning systems of linear algebraic equations, structure of L would have additional nonzeros codes are based almost entirely on direct meth- (see [5]). On the other hand, sometimes it is ods. Black box iterative solvers would be necessary to construct smaller supernodes by highly desirable and practical - they would breaking large blocks of vertices into pieces (see avoid most of the implementation problems re- [35]). This is the case of workstation and also lated to exploiting the sparsity in direct meth- of some PC implementations. Elimination tree ods. In recent years many authors devote a rearrangements provide an a posteriori infor- lot of energy into the field of iterative methods mation for the optimal partitioning of blocks and a tremendous progress have been achieved. of vertices. For a recent survey we refer e.g. to [14] and [8]. Computer architecture is the prominent Sometimes the feeling is expressed that this source of information for implementing any progress have already established a firm base general sparse solver. Without knowledge of for developing a black box iterative software. the basic architectural features and techni- This is, however, very far from our feeling. 2. Solving large sparse linear algebraic systems

✉ ✉

✉ ✉

✉ ✉ ✛ ✲ ✉ ✉

✉ ✉ ✉ ✉ ✉

✉ ♣✉ ✉ ✉ ✉ ✉ ✉ ✉ ✉

✉ ✉

✉ ✉

✉ ✉

✉ ✉

Figure 1.7: Balanced and unbalanced elimination trees.

Strictly speaking, we do not believe that about the one previous step and it is station- any good (fast, precise, reliable and robust) ary because neither B nor c depend upon the black box iterative software for solving systems iteration step k. Everyone knows examples as of linear algebraic equations will be available in the Richardson method, the Jacobi method, the near future. This section describes several the Gauss-Seidel method, the Successive Over- good reasons supporting our opinion. Many it- relaxation method (SOR) and the Symmetric erative methods have been developed; for the Successive Overrelaxation method (SSOR). We excellent surveys of the classical results we re- are not going to repeat the formulas or the the- fer to [39], [41], [7] and [22]. For historical rea- ory of these methods here, that can be found sons we recall briefly the basic iterative meth- elsewhere. Instead, we recall the very well ods and then turn to the state-of-the-art: the known fact, that these simple methods (espe- Krylov space methods. cially SOR and SSOR) may show an excellent The best known iterative methods - the lin- performance while carefully tuned to a specific ear stationary iterative methods of the first problem, but their performance is very prob- kind - are characterized by the simple formula lem - sensitive. This lack of robustness avoid their general use for a wide class of problems. x(k) = Bx(k 1) + c − Nonstationary iterative methods differ where x(k) is the current approximation to the from stationary methods in that the param- solution at the k-th step, (x(0) is given at the eters of the formula for computing the current beginning of computation), the n by n matrix approximation depend on the iteration step. B and vector c characterize the method. Any Consequently, these methods are more robust; method of this type is linear because x(k) is in many cases they are characterized by some given as a linear function of the previous ap- minimizing property. In the last years most of proximations; it is of the first kind because the the effort in this field is devoted into the Krylov iteration formula involves the information just space methods. 2. Solving large sparse linear algebraic systems

Krylov space methods for solving linear U ∗U = UU ∗ = I, I is the identity matrix, into systems start with an initial guess x(0) for the the polynomial formulation of the methods. solution and seek the k-th approximate solu- For nonnormal matrices, hovewer, no unitary tion in the linear variety eigendecomposition exist. Despite that, many authors extend intuitively the feeling that the (k) (0) (0) x x + Kk(A, r ) spectrum of the coefficient matrix plays deci- ∈ sive role in the characterization of convergence (0) (0) where r = b Ax is the initial residual and even in the nonnormal case. This is actually (0) − Kk(A, r ) is the k-th Krylov space generated a very popular belief which is (at least implic- (0) by A, r , itly) present in discussions of the experimental

(0) (0) (0) k 1 (0) results in many papers. This belief is, however, Kk(A, r ) = span r , Ar ,...,A − r . { } wrong! Then, the k-th error and the k-th residual are As an example we can take the GMRES written in the form method. GMRES approximations are chosen to minimize the Euclidean norm of the resid- (k) (0) ual vector r(k) = b Ax(k) among all the Krylov x x = pk(A)(x x ) − − (k) (0)− space methods, i.e., r = pk(A)r , r(k) = min b Au . where pk k, k denotes the set of polyno- (0) (0) ∈ P P k k u x +Kk(A,r ) k − k mials p(λ) of the degree at most k satisfying ∈ p(0) = 1. Based on that, Krylov space meth- Residual norms of successive GMRES approx- ods are also referred to as polynomial methods. imations are nonincreasing, since the residuals An enormous variety of the Krylov space meth- are being minimized over a set of expanding ods exists, including the famous conjugate gra- subspaces. Notice that due to the minimizing dient method (CG), the conjugate residual property, the size of GMRES residuals forms a method (CR), SYMMLQ, the minimal resid- lower bound for the size of the residual of any ual method (MINRES), the biconjugate gra- other Krylov space method. In other words, dient method (BiCG), the quasiminimal resid- GMRES shows how small residual can be found ual method (QMR), the generalized minimal in the variety residual method (GMRES), giving just a few (0) (0) names. r + AKk(A, r ). In the rest of this subsection we will concen- trate on the Krylov space methods and show If GMRES performs poorly, then any other several principial questions which must be sat- Krylov space method performs poorly as well isfactorily answered prior to constructing any (measured by the size of the residual). But the good black box iterative solver. size of the residual - that is the only easy-to- First question can be formed as: What compute convergence characteristic. characterizes convergence of the Krylov space A key result was proven in [21] (for the method? This is certainly a key question. other related results see also [20]). It can be Without giving an answer one can hardly build reformulated in a following way. Given a non- up theoretically well justified preconditioned increasing positive sequence f(0) f(1) ≥ ≥ methods. For the Hermitian and even the nor- ...f(n 1) 0, and a set of nonzero com- − ≥ mal systems (i.e. the systems characterized by plex numbers λ , . . . λn , there is a n by n { 1 } the Hermitian or normal matrix) the answer matrix A having eigenvalues λ , . . . λn , and { 1 } is: the rate at which a Krylov space method an initial residual r(0) with r(0) = f(0) k k converges is determined by the eigenvalue dis- such that the GMRES algorithm applied to tribution and the initial approximation (ini- the linear system Ax = b, with initial residual tial error, initial residual). It can be clearly r(0), generates approximations x(k) such that seen by substituting the unitary eigendecom- r(k) = f(k), k = 1, 2,...,n 1. In other k k − position of the the matrix A, A = UΛU ∗, words, any nonincreasing convergence curve 2. Solving large sparse linear algebraic systems can be obtained with GMRES applied to a ma- cannot say anything about the size of the ul- trix having any desired eigenvalues! The re- timate (or “final”) residual in practical com- sults of [20] and [21] demonstrate clearly that putations. 
Consequently - we cannot predict a eigenvalues alone are not the relevant quanti- priori the precision level on which the iteration ties in determining the behavior of GMRES for should be stopped! For more details we refer nonnormal matrices. It remains an open prob- to [11] and [36]. lem to determine the most appropriate set of One can form many other questions of sim- system parameters for describing the GMRES ilar importance. As stated earlier, a good it- behavior. erative black box software must be fast, pre- A second question is: What is the precision cise, reliable and robust. In all these at- level which can be achieved by iterative meth- tributes it must compete with highly effective ods and which stopping criteria can be effec- (sparse) modern direct codes. We have dis- tively used? The user will always ask: Where cussed here some troubles related to the first to stop the iteration? Stopping criteria should two attributes. Even from the short character- guarantee a small error. If error is considered ization of iterative methods given above it is as a distance to the true solution (measured clear that the third and fourth attributes cause e.g. in the Euclidean norm), then the question also a lot of problems which are unresolved yet is too hard - one usually has no tools to esti- (lacking in space we are not going into details mate directly this so called forward error. The here). Based on that, we do not believe in con- other possibility is to consider the error in the structing competitive black box iterative solvers backward sense, i.e., consider the approxima- in the near future. tion x(k) to the solution x as the exact solution of a perturbed system 2.3 Direct or iterative solvers? In this section we will first give some considera- (A +∆A)x(k) = b +∆b, tions concerning complexity of direct and itera- tive methods for the solution of linear systems and try to make the perturbations ∆A and ∆b arising in one special but important applica- as small as possible. It is well known, see [24], tion (see [40]). Then we will state objections that for a given x(k) such a minimal perturba- against the straightforward generalization of tions, measured in the Euclidean resp. spectral this simple case. norms, exist, and their size is given by The matrix for our simple comparison arises from the self-adjoint elliptic boundary- k min ν : (A + δA)x( ) = b +∆b, value problem on the unit cube in 2D or 3D. { ∆A / A ν, The domain is covered with a mesh, uniform k k k k≤ ∆b / b ν = and equal in all 2 or 3 dimensions with mesh- k k k k≤ } width h. Discretizing this problem we get the = b Ax(k) /( A x(k) + b ). k − k k kk k k k symmetric and positive definite matrix A of the dimension n and the right-hand side vector b. Consequently, to guarantee a small backward Consider the iterative method of conjugate error, it is sufficient to use the stopping criteria gradients applied to this system. Considering based on the value of exact arithmetics, the error reduction per it- √κ 1 (k) (k) eration is − , where κ is the condition b Ax /( A x + b ). ∼ √κ+1 k − k k kk k k k number of A defined as κ = A A 1 . || |||| − || The problem seems to be solved, because the Relation among the condition number κ residual is always available and the spectral and problem dimension as follows: for a 3D 2 2 norm of A can be approximated e.g. by the problem we have κ h n 3 , for a 2D prob- ∼ − ≈ easily computable Frobenius norm. A careful lem we have κ h 2 n. 
∼ − ≈ reader will, however, raise a question about Then, for the error reduction below the rounding errors. This is a real crucial point. level ǫ, we have that Without a careful rounding error analysis we 3. Computing eigenvalues - a principal problem!

j 1 1 √κ 2 j 2j torization (see [13]) when only one right-hand − (1 ) exp( − ) <ǫ = 1+ 1 √κ √κ  √κ  ≈ − ≈ ⇒ side is present. j log ǫ √κ. Summarizing, sparse direct solvers would ∼− 2 Assume the number of flops per iteration to win as a general purpose codes up to the very be fn (f is a small integer standing for the large size of the problems n. For specific ap- ∼ average number of nonzeros per row and the plications, or extremely large n, the iteration overhead introduced by the iterative scheme). with preconditioning might be a better or even Then the number of flops for the convergence the only alternative. This conclusion repre- 4 sents the state of knowledge in the early 90’s below the level ǫ is proportional to fnj n 3 3 ∼ and are, of course, subject to change depending for 3D problems and fnj n 2 for 2D prob- ∼ ∼ lems. on the progress in the field. It is known, that many preconditioners (see 3 Computing eigenvalues - a princi- [10]) are able to push the condition number of pal problem! 1 the system down to O(h− ). Then the number 7 To show how hopeless and dangerous might be of flops per reduction to ǫ is given by n 6 and 5 ∼ a naive “software solution” approach with- n 4 for 3D and 2D problem, respectively. → ∼ out understanding the “nature” of the problem Consider now a direct method. Using ef- we consider an “elementary” problem - com- fective ordering strategies, we can have for the puting eigenvalues λ , λ , . . . , λn of a n by n { 1 2 } matrix mentioned above number of operations matrix A. 2 4 n and the size of the fill-in n 3 in 3 di- ∼ ∼ We will not specify the algorithm; just sup- mensions. For 2D problem the corresponding 3 pose that it is backward stable, i.e., the com- numbers are n 2 for the number of opera- ∼ puted approximations µ1,µ2,...µn are ex- tions and n log n for the fill-in size. The { } ∼ act eigenvalues of the matrix A which is just a corresponding estimates and their relation to slight perturbation of the matrix A, practical problems can be found in [34] and 4 [1]. Back substitution can be done in n 3 A = A + E, E δ, ∼ k k≤ operations for the 3D problem and in n log n operations for the 2D problem. where δ is small (proportional to the ma- chine precision). That is the best we can If we have to solve one system at a time hope for. A question is, however, how close then for large ǫ (small final precision) or very are µ ,...,µn to λ , . . . , λn . We define large n, iterative methods may be preferable. { 1 } { 1 } the (optimal) matching distance between the Having more complicated mesh structure or eigenvalues of A and A as more right-hand sides, direct methods can be usually preferable up to very large matrix di- md(A, A) = min max λπ(i) µi mensions. Iterative methods are usually more π { i | − |} susceptible to instabilities (or slowing down where π is taken over all permutations of the convergence) in finite precision arithmetics. 1,...,n . Using a naive approach, one might { } Moreover, notice that the additional effort due say: eigenvalues are continuous functions of to the computation and use of the precondi- the matrix coefficients. Therefore we can ex- tioner are not reflected in the asymptotic for- pect that for A close to A, the corresponding mulas. For many large problems we need so- eigenvalues will be also close to each other and phisticated preconditioners which increase sub- md(A, A) will be small. stantially the computational effort described The last conclusion is, of course, wrong! for the model problem. 
For a general matrix A, there is no bound on The amount of memory needed for com- the md(A, A) linear in E , i.e., we can k k putations makes also an important defference. not guarantee anything reasonable about the This is usually much smaller for the iterative precision of the computed eigenvalue approx- methods. On the other side, we can use space imations based on the size of the backward of the size O(n) for the sparse Cholesky fac- error. Even worse, for any small δ and any 4. Sparse linear solvers: parallelism in attack or in defense? large ω, one can find matrices A, A such that counterparts. One could add that for the E = A A δ and md(A, A) ω. Any greater complexity and irregularity, sparse ma- k k k − k≤ ≥ small perturbation of the matrix (any small trix computations are more realistic represen- backward error) may in principle cause an ar- tatives of typical scientific computations, and bitrary large perturbation of eigenvalues! In therefore more useful as benchmark criteria, other words – even the best software gives you, than the dense matrix computations, that usu- in general, no guarantee on the precision of the ally played this role. computed results. This is certainly a striking Despite the difficulty with sparse matrix statement. computations on the advanced computer ar- Fortunately, there is an important class of chitectures, some noticeable success has been matrices for which the situation is more opti- achieved in attaining very high performance mistic. We recall the following theorem (for (see [9]) and the needs of sparse matrix com- details see, e.g., [37]): putations have had notable effect on computer Theorem 3.1 Let A be normal. Then design (indirect addressing with gather/scatter md (A, A) (2n 1) E . facilities). Nevertheless, it is ironic that sparse ≤ − k k For Hermitian matrices even stronger results matrix computations contain more inherent can be proven. We see that for normal ma- parallelism than the corresponding dense ma- trices the size of the backward error essentially trix computations, yet typically show signifi- determine the precision of the computed eigen- cantly lower efficiency on today’s parallel ar- value approximations. chitectures. To summarize, for normal matrices any Roughly speaking, the most widely avail- good (i.e. backward stable) method will give able and commercially successful parallel archi- us what we want - good approximation to the tectures fall into three classes : shared-memory true eigenvalues. For highly nonnormal ma- MIMD computers, distributed-memory MIMD trices, however, the computed approximation architectures and SIMD computers. Some ma- may be very far from the true eigenvalues even chines have an additional level of parallelism in if the best software is used. the form of vector units within each individual In this context please notice that many processor. We will concentrate on the general times authors just plot the computed eigen- and widely applicable principles which can be value approximations and declare it as the true used in wide variations of these computing en- eigenvalues without paying any attention to vironments. the normality (or other relevant properties) of In the sparse Cholesky decomposition we their matrices. can analyze the following levels of parallelism (see [28]): 4 Sparse linear solvers: parallelism Large-grain parallelism in which each • in attack or in defense? 
computational task is the completion of To show the difficulties related to parallel im- all columns in a subtree of the elimina- plementations of linear solvers, we concentrate tion tree. here on sparse direct solvers. Description of the Medium-grain parallelism in which each parallel implementation of the iterative meth- • ods is much more simple and can be found else- task correspond to one simple cycle of where. column update cmod or column scaling Dense matrix computations are of such ba- cdiv operation in the left- and right- sic importance in scientific computing that looking pseudo-codes. they are usually among the first algorithms Fine-grain parallelism in which each task implemented in any new computing environ- • is a single floating-point operation or a ment. Sparse matrix computations are equally multiply-add pair. as important, but both their performance and their influence on computer system design have Fine-grain parallelism can be exploited in tended to lag those of their dense matrix two distinctly different ways: 4. Sparse linear solvers: parallelism in attack or in defense?

1. Vectorization of the inner cycles on vec- processors, the algorithm of sparse Cholesky tor processors. decomposition is scalable. (by a scalable al- gorithm we call an algorithm that maintains 2. Parallelizing the rank-one update in the efficiency bounded away from zero as the num- right-looking pseudo-code. ber p of processors grows and the size of the Vectorization of the operations is one of data structures grows roughly linearly in p, the basic tools to improve effectiveness of see [19]). Up to date, however, even this the sparse solvers using array processors, vec- method has not achieved high efficiency on a tor supercomputers or RISC processors with highly parallel machine. With this note we left some pipelining. Efficient vectorization was a the realm of the highly-parallel and massively- very strong argument to promote band, pro- parallel SIMD machines aside and we will turn file and frontal solvers when first vector pro- to the parallelism exploited in the most pop- cessors appeared. As noted above, except for ular parallel implementations: large-grain and special types of discretized partial differential medium-grain left-looking algorithms and mul- equations, they are not very important now, tifrontal codes. and other concepts are used for the Cholesky Let us turn now to the problem of medium- decomposition of general matrices. This is grain algorithms. Of the possible formulations caused by enormous work which was done of the sparse Cholesky algorithms, left-looking in the research of direct sparse solvers, by algorithm is the simplest to implement. It gather/scatter facilities in today’s computers is shown in [16], that the algorithm can be for scientific computing and by the high-speed adapted in a straightforward manner to run ef- scalar arithmetics in workstations. ficiently in parallel on shared-memory MIMD Hardware gather/scatter facilities can usu- machines. ally reach no more than 50% of the perfor- Each column j corresponds to a task mance of the corresponding dense vector oper- ations. No wonder, that the use of dense vec- T col(j)= cmod(j, k) k Struct(l j) { | ∈ ∗ } tors and/or matrices in the inner cycles of the cdiv(j) . ∪{ } Cholesky decomposition is still preferable. The These tasks are given to a task queue in above-mentioned multifrontal implementation the order given by some possible rearrange- of the right-looking algorithm widely exploits ment of columns and rows of A, i.e., given this idea. The structure of the elimination tree by some renumbering of the elimination tree. enables to deal only with those elements which Processors obtain column tasks from this sim- correspond to nonzeros in the factor L. ple “pool of tasks” in this order. The basic To obtain better performance using vec- form of the algorithm has two significant draw- tor functional units, we usually strive to have backs. First, the number of synchronization dense vectors and matrices of the sufficiently operations is quite high. Second, since the al- large dimensions (we are not going into the gorithm does not exploit supernodes, it will not careful analysis of the situation which is usu- vectorize well on vector supernodes with mul- ally quite more complex). Thus, forming su- tiple processors. The remedy is to use the su- pernodes is usually highly desirable since it pernodes to decrease also the synchronization helps to increase the dimension of the subma- costs. trices processed in the inner cycle. 
Algorithms for distributed-memory ma- The problem of parallelizing rank-one up- chines are usually characterized by the a pri- date is a difficult one, and research on this ori distribution of the data to the processors. topic is still in its infancy (see [19]). Note, In order to keep the cost of the interproces- that the right-looking algorithm presents for sor communication at acceptable levels, it is SIMD machines much better alternative that essential to use local data locally as much as the column left-looking approach. When rows possible. The distributed fan-out (see [17]), and columns of the sparse matrix A are dis- fan-in (see [6]), fan-both (see [4]) and multi- tributed to the rows and columns of a grid of frontal algorithm (see overview [23]) are typi- 4. Sparse linear solvers: parallelism in attack or in defense? cal examples of the implementations. All these this algorithm are usually much lower than by algorithms are designed in the following frame- the historically older fan-out algorithm. work: The fan-both algorithm was described as an intermediate parametrized algorithm parti- They require assignment of the matrix tioning both the subtasks T col(j) and T sub(j). • columns to the processors. Processors are sending both the aggregate col- umn updates and completed columns. They use the column assignment to dis- • The distributed multifrontal algorithm par- tribute the medium-grained tasks in the titions among the processors the tasks upon outer loop of left- or right-looking sparse which the sequential multifrontal method is Cholesky factorization. based:

The differences among these algorithms 1. Partial dense right-looking Cholesky fac- stem from the various formulations of the torization for the vertices of independent sparse-Cholesky algorithm. subtrees of the elimination tree. The fan-out algorithm is based on the right-looking Cholesky algorithm. We will de- 2. Medium-grain or large-grain subtasks of note the k th task performed by the outer loop the partial Cholesky factorizations of the − of the algorithm (lines (3)–(10) of the right- dense matrices. looking pseudo-code) by T sub(k), which is de- The first source of the parallelism is prob- fined by ably the most natural. Its theoretical justifi- T sub(k)= cdiv(k) { } cation is given by the Lemma 1.2. Indepen- cmod(j, k) j Struct(l k) . dent branches of the elimination tree can be ∪{ | ∈ ∗ } That is, T sub(k) first forms l k by scal- eliminated independently. Towards the root, ∗ ing of the k th column, and then perform number of independent tasks corresponding to − all column modifications that use the newly these branches decreases. Then tasks cor- formed column. The fan-out algorithm par- responding to partial updates of the right- titions each task T sub(k) among the proces- looking pseudo-code near to the root can be sors. It is a data-driven algorithm, where the partitioned among the processors. Combining data sent from one processor to another repre- these two principles we obtain the core of the sent the completed factor columns. The outer distributed-memory (but also of the shared- loop of the algorithm for a given processor reg- memory) multifrontal method. All the parallel ularly checks the message queue for the incom- right-looking implementations of the Cholesky ing columns. Received columns are used to factorization are essentially based on the same modify every column j owned by the processor principles. for which cmod(j, k) is required. When some Large-grain parallelism present in the con- column j is completed, it is sent immediately temporary implementations is usually very to all the processors, by which it is eventually transparent. We do not care very much about used to modify subsequent columns of the ma- the mapping relation column–processor. The trix. actual architecture provides hints how to solve The fan-in algorithm is based on the left- this problem. Using for instance hypercube looking Cholesky pseudo-code. It partitions architectures, the subtree-subcube mapping is each task T col(j) among the processors in a very natural and we can speak about this type manner similar to the distribution of tasks of parallelism. T sub(k) in the fan-out algorithm. It is a Shared-memory vector multiprocessors demand-driven algorithm where the data re- with a limited number of processing units rep- quired from a processor pa to complete the resent a similar case. The natural way is to j th column on a given processor pb are gath- map rather large subtrees of the elimination − ered in the form of results cmod(j, k) and sent tree to the processor. The drawback in this sit- together. Communication costs incurred by uation can be the memory management, since 5. Concluding remarks we need more working space than if purely Naive hopes that with more processors we scalar computations are considered. could avoid in some extent the difficulties faced So far we have concentrated on the issues in scalar Cholesky decomposition came to an related to the numerical factorization. We left disappointment. 
The fan-in algorithm is based on the left-looking Cholesky pseudo-code. It partitions each task Tcol(j) among the processors in a manner similar to the distribution of the tasks Tsub(k) in the fan-out algorithm. It is a demand-driven algorithm, where the data required from a processor p_a to complete the j-th column on a given processor p_b are gathered in the form of the results of cmod(j, k) and sent together. Communication costs incurred by this algorithm are usually much lower than by the historically older fan-out algorithm.

The fan-both algorithm was described as an intermediate parametrized algorithm partitioning both the subtasks Tcol(j) and Tsub(j). Processors send both the aggregate column updates and the completed columns.

The distributed multifrontal algorithm partitions among the processors the tasks upon which the sequential multifrontal method is based:

1. Partial dense right-looking Cholesky factorizations for the vertices of independent subtrees of the elimination tree.

2. Medium-grain or large-grain subtasks of the partial Cholesky factorizations of the dense matrices.

The first source of parallelism is probably the most natural. Its theoretical justification is given by Lemma 1.2: independent branches of the elimination tree can be eliminated independently. Towards the root, the number of independent tasks corresponding to these branches decreases. The tasks corresponding to the partial updates of the right-looking pseudo-code near the root can then be partitioned among the processors. Combining these two principles we obtain the core of the distributed-memory (but also of the shared-memory) multifrontal method. All the parallel right-looking implementations of the Cholesky factorization are essentially based on the same principles.

Large-grain parallelism present in the contemporary implementations is usually very transparent. We do not care very much about the mapping relation column–processor; the actual architecture provides hints how to solve this problem. Using, for instance, hypercube architectures, the subtree–subcube mapping is very natural, and we can speak about this type of parallelism.

Shared-memory vector multiprocessors with a limited number of processing units represent a similar case. The natural way is to map rather large subtrees of the elimination tree to the processors. The drawback in this situation can be the memory management, since we need more working space than if purely scalar computations are considered.

So far we have concentrated on the issues related to the numerical factorization. We have left out the issues of performing the symbolic decomposition, the initial ordering and the other symbolic manipulations in parallel. Although in the case of scalar computations the timing of the numerical phase is usually dominant, in the parallel environment this is not always so. The following example shows the proportion of the time spent in the symbolic and numeric phases of the sparse Cholesky factorization. It is taken from [31].

Example 4.1 Comparison of the ordering and numeric factorization times for a matrix coming from structural mechanics. Dimension n = 217918, the number of nonzeros in A is 5,926,567, the number of nonzeros in L is 55,187,841. The ordering time was 38s, the right-looking factorization time was 200s.

Optimizing the parallel performance of the symbolic manipulations (including the matrix ordering, the computation of row and column structures, tree manipulations, ...) is an important challenge. For an overview of the classical techniques see [23].

The number of parallel steps needed to compute the Cholesky decomposition is determined by the height of the elimination tree. But, while in the one-processor case we preferred the unbalanced form of the elimination tree (see Figure 1.7), the situation is different now. An unbalanced tree induces more parallel steps. Therefore, for the parallel elimination, the balanced alternative seems to be better. On the other hand, the cumulative size of the working space is in this case higher.

Also the problem of the renumbering of the elimination tree is cast into another light in the parallel case. The level-by-level renumbering and balanced children sequences of the elimination tree are the objects of further research.
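The dependence of the number of parallel elimination steps on the height of the elimination tree can be illustrated by the following small sketch (our own illustration, in Python; the parent-array representation is our assumption, with parent[v] > v for every non-root vertex, as always holds for elimination trees, and parent[root] = -1).

def etree_height(parent):
    # Illustrative sketch: number of levels of the elimination tree,
    # i.e. the number of parallel steps when one level is eliminated per step.
    n = len(parent)
    depth = [0] * n          # depth[v] = longest leaf-to-v path (in edges)
    height = 0
    for v in range(n):       # children have smaller indices than parents,
        p = parent[v]        # so depth[v] is final when v is reached
        if p != -1:
            depth[p] = max(depth[p], depth[v] + 1)
        height = max(height, depth[v])
    return height + 1

# a chain (fully unbalanced tree) on 7 vertices needs 7 steps,
# a balanced binary tree on 7 vertices only 3
chain    = [1, 2, 3, 4, 5, 6, -1]
balanced = [2, 2, 6, 5, 5, 6, -1]
print(etree_height(chain), etree_height(balanced))   # 7 3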
Even in the "scalar" case, all the implementations are strongly connected with computer architectures. It is only natural that in the parallel environment, where some features of the computing facilities (e.g., communication) provide even more variations, this dependence is even more profound.

Naive hopes that with more processors we could avoid, to some extent, the difficulties faced in the scalar Cholesky decomposition have ended in disappointment. We are still trying to find better algorithmic alternatives which make both the scalar and the parallel computations more effective.

5 Concluding remarks

The world of numerical linear algebra is developing very fast. We have tried to show that many problems considered by the numerical analysts working in this area are very complicated and still unresolved, despite the fact that most of them were formulated a few decades or even one or two hundred years ago.

Among the current trends in this field we want to point out the strong movement towards the justification of the numerical algorithms. A method or algorithm programmed into a code should not only give some output, but it should guarantee an exactly defined relation of the computed approximation to the unknown true solution, or warn the user about the possible incorrectness of the result. Our intuition must be checked carefully by formal proofs, or at least by developing theories offering a deep insight into a problem. Attempts to solve problems without such insight may fail completely. Another trend can be characterized by the proclamation: there is no simple general solution to such difficult tasks as, e.g., solving large sparse linear systems. Considering parallel computers, things get even worse. A combination of different techniques is always necessary, most of which use very abstract tools (such as graph theory, etc.) to achieve very practical goals. This is where the way from theory to practice is very short.

References

[1] Agrawal, A. - Klein, P. - Ravi, R.: Cutting down on fill using nested dissection: provably good elimination orderings, in: George, A. - Gilbert, J.R. - Liu, J.W.H., eds.: Graph Theory and Sparse Matrix Computation, Springer, 1993, 31–55.

[2] Aho, A.V. - Hopcroft, J.E. - Ullman, J.D.: Data Structures and Algorithms, Addison-Wesley, Reading, MA, 1983.
[3] Ashcraft, C.: A vector implementation of the multifrontal method for large sparse, symmetric positive definite linear systems, Tech. Report ETA-TR-51, Engineering Technology Application Division, Boeing Computer Services, Seattle, Washington, 1987.
[4] Ashcraft, C.: The fan-both family of column-based distributed Cholesky factorization algorithms, in: George, A. - Gilbert, J.R. - Liu, J.W.H., eds.: Graph Theory and Sparse Matrix Computation, Springer, 1993, 159–190.
[5] Ashcraft, C. - Grimes, R.: The influence of relaxed supernode partitions on the multifrontal method, ACM Trans. Math. Software, 15 (1989), 291–309.
[6] Ashcraft, C. - Eisenstat, S. - Liu, J.W.H.: A fan-in algorithm for distributed sparse numerical factorization, SIAM J. Sci. Stat. Comput., 11 (1990), 593–599.
[7] Axelsson, O.: Iterative Solution Methods, Cambridge University Press, Cambridge, 1994.
[8] Barrett, R. et al.: Templates for the solution of linear systems: Building blocks for iterative methods, SIAM, Philadelphia, 1994.
[9] Browne, J. - Dongarra, J. - Karp, A. - Kennedy, K. - Kuck, D.: 1988 Gordon Bell Prize, IEEE Software, 6 (1989), 78–85.
[10] Chandra, R.: Conjugate gradient methods for partial differential equations, Ph.D. Thesis, Yale University, 1978.
[11] Drkošová, J. - Greenbaum, A. - Rozložník, M. - Strakoš, Z.: Numerical stability of the GMRES method. Submitted to BIT, 1994.
[12] Duff, I.S. - Reid, J.: The multifrontal solution of indefinite sparse symmetric linear equations, ACM Trans. Math. Software, 9 (1983), 302–325.
[13] Eisenstat, S.C. - Schultz, M.H. - Sherman, A.H.: Software for sparse Gaussian elimination with limited core memory, in: Duff, I.S. - Stewart, G.W., eds.: Sparse Matrix Proceedings, SIAM, Philadelphia, 1979, 135–153.
[14] Freund, R. - Golub, G. - Nachtigal, N.: Iterative solution of linear systems, Acta Numerica 1 (1992), 1–44.
[15] Gallivan, K.A. - Plemmons, R.J. - Sameh, A.H.: Parallel algorithms for dense linear algebra computations, SIAM Review, 32 (1990), 54–135.
[16] George, A. - Heath, M. - Liu, J.W.H.: Solution of sparse positive definite systems on a shared-memory multiprocessor, Internat. J. Parallel Programming, 15 (1986), 309–325.
[17] George, A. - Heath, M. - Liu, J.W.H.: Sparse Cholesky factorization on a local-memory multiprocessor, SIAM J. Sci. Stat. Comput., 9 (1988), 327–340.
[18] George, A. - Liu, J.W.H.: Computer Solution of Large Sparse Positive Definite Systems, Prentice-Hall, Englewood Cliffs, N.J., 1981.
[19] Gilbert, J. - Schreiber, R.: Highly parallel sparse Cholesky factorization, Tech. Report CSL-90-7, Xerox Palo Alto Research Center, 1990.
[20] Greenbaum, A. - Strakoš, Z.: Matrices that generate the same Krylov residual spaces, in: Recent Advances in Iterative Methods, G. Golub et al., eds., Springer-Verlag, New York, 1994.
[21] Greenbaum, A. - Pták, V. - Strakoš, Z.: Any nonincreasing convergence curve is possible for GMRES, in preparation.
[22] Hageman, L. - Young, D.: Applied iterative methods, Academic Press, New York, 1981.
[23] Heath, M.T. - Ng, E. - Peyton, B.W.: Parallel algorithms for sparse linear systems, in: Gallivan, K.A. - Heath, M.T. - Ng, E. - Ortega, J.M. - Peyton, B.W. - Plemmons, R.J. - Romine, C.H. - Sameh, A.H. - Voigt, R.G.: Parallel Algorithms for Matrix Computations, SIAM, Philadelphia, 1990, 83–124.
[24] Higham, N.J. - Knight, P.A.: Componentwise error analysis for stationary iterative methods, in preparation.
[25] Liu, J.W.H.: The role of elimination trees in sparse factorization, SIAM J. Matrix Anal. Appl., 11 (1990), 134–172.
[26] Lewis, J. - Simon, H.: The impact of hardware gather/scatter on sparse Gaussian elimination, SIAM J. Sci. Stat. Comput., 9 (1988), 304–311.
[27] Liu, J.W.H.: A compact row storage scheme for Cholesky factors using elimination trees, ACM Trans. Math. Software, 12 (1986), 127–148.
[28] Liu, J.W.H.: Computational models and task scheduling for parallel sparse Cholesky factorization, Parallel Computing, 3 (1986), 327–342.
[29] Liu, J.W.H.: Equivalent sparse matrix reordering by elimination tree rotations, SIAM J. Sci. Stat. Comput., 9 (1988), 424–444.
[30] Liu, J.W.H.: Advances in direct sparse methods, manuscript, 1991.
[31] Liu, J.W.H.: Some practical aspects of elimination trees in sparse matrix factorization, presented at the IBM Europe Institute, 1990.
[32] Ng, E.G. - Peyton, B.W.: Block sparse Cholesky algorithms on advanced uniprocessor computers, SIAM J. Sci. Comput., 14 (1993), 1034–1056.
[33] Parter, S.: The use of linear graphs in Gaussian elimination, SIAM Review, 3 (1961), 364–369.
[34] Pissanetzky, S.: Sparse Matrix Technology, Academic Press, 1984.
[35] Rothberg, E. - Gupta, A.: Efficient sparse matrix factorization on high-performance workstations - exploiting the memory hierarchy, ACM Trans. Math. Software, 17 (1991), 313–334.
[36] Rozložník, M. - Strakoš, Z.: Variants of the residual minimizing Krylov space methods. Submitted to BIT, 1994.
[37] Stewart, G.W. - Sun, J.: Matrix perturbation theory, Academic Press, Boston, 1990.
[38] Trefethen, N.: The definition of numerical analysis, Dept. of Computer Science, Cornell University, 1991.
[39] Varga, R.: Matrix iterative analysis, Prentice-Hall Inc., NJ, 1961.
[40] Van der Vorst, H.: Lecture notes on iterative methods, manuscript, 1992.
[41] Young, D.: Iterative solutions of large linear systems, Academic Press, New York, 1971.