arXiv:1212.6680v3 [cs.NA] 21 Jun 2013

Nonsymmetric multigrid preconditioning for conjugate gradient methods

Henricus Bouwmeester, University of Colorado Denver, P.O. Box 173364, Campus Box 170, Denver, CO 80217-3364, [email protected]
Andrew Dougherty, University of Colorado Denver, P.O. Box 173364, Campus Box 170, Denver, CO 80217-3364, [email protected]
Andrew V. Knyazev, Mitsubishi Electric Research Laboratories, 201 Broadway, Cambridge, MA 02145, [email protected]

ABSTRACT
We numerically analyze the possibility of turning off post-smoothing (relaxation) in geometric multigrid when used as a preconditioner in conjugate gradient linear and eigenvalue solvers for the 3D Laplacian. The geometric Semicoarsening Multigrid (SMG) method is provided by the hypre parallel software package. We solve linear systems using two variants (standard and flexible) of the preconditioned conjugate gradient (PCG) and preconditioned steepest descent (PSD) methods. The eigenvalue problems are solved using the locally optimal block preconditioned conjugate gradient (LOBPCG) method available in hypre through BLOPEX software. We observe that turning off the post-smoothing in SMG dramatically slows down the standard PCG-SMG. For the flexible PCG and LOBPCG, our numerical results show that post-smoothing can be avoided, resulting in overall acceleration, due to the high costs of smoothing and the relatively insignificant decrease in convergence speed. We numerically demonstrate for linear systems that PSD-SMG and flexible PCG-SMG converge similarly if SMG post-smoothing is off. We experimentally show that the effect of acceleration is independent of memory interconnection. A theoretical justification is provided.

Keywords: linear equations; eigenvalue; iterative; multigrid; smoothing; pre-smoothing; post-smoothing; preconditioning; conjugate gradient; steepest descent; parallel software; hypre; BLOPEX; LOBPCG; convergence.

A preprint is available at http://arxiv.org/abs/1212.6680.

1. INTRODUCTION
Smoothing (relaxation) and coarse-grid correction are the two cornerstones of the multigrid technique. In algebraic multigrid, where only the system matrix is (possibly implicitly) available, smoothing is even more fundamental, since it is often used to construct the coarse grid problem. In geometric multigrid, the coarse grid is generated by taking into account the geometry of the fine grid, in addition to the chosen smoothing procedure. If multigrid is used as a stand-alone solver, proper smoothing is absolutely necessary for convergence. If multigrid is used as a preconditioner in an iterative method, one is tempted to check what happens if smoothing is turned partially off.

For symmetric positive definite (SPD) linear systems, the preconditioner is typically required to also be a fixed linear SPD operator, to preserve the symmetry of the preconditioned system. In the multigrid context, the symmetry of the preconditioner is achieved by using balanced pre- and post-smoothing, and by properly choosing the restriction and prolongation pair. In order to get a fixed linear preconditioner, one avoids using nonlinear smoothing, prolongation, or restriction, and performs fixed coarse solves. The preconditioner positive definiteness may be obtained by performing an equal number of pre- and post-smoothing steps and by pre-smoothing enough (in practice, even one smoothing step may be enough); see, e.g., [5].

If smoothing is unbalanced, e.g., there is one step of pre-smoothing, but no post-smoothing, the multigrid preconditioner becomes nonsymmetric. Traditional assumptions of the standard convergence theory of iterative solvers are no longer valid, and convergence behavior may be unpredictable. The main goal of this paper is to describe our numerical experience experimenting with the influence of unbalanced smoothing in practical geometric multigrid preconditioning, specifically, the Semicoarsening Multigrid (SMG) method, see [13], provided by the parallel software package hypre [1].

We numerically analyze the possibility of turning off post-smoothing in geometric multigrid when used as a preconditioner in iterative linear and eigenvalue solvers for the 3D Laplacian in hypre. We solve linear systems using two variants (standard and flexible, e.g., [6]) of the preconditioned conjugate gradient (PCG) and preconditioned steepest descent (PSD) methods. The standard PCG is already coded in hypre. We have written the codes of flexible PCG and PSD by modifying the hypre standard PCG function. The eigenvalue problems are solved using the locally optimal block preconditioned conjugate gradient (LOBPCG) method, readily available in hypre through BLOPEX [2].
We observe that turning off the post-smoothing in SMG dramatically slows down the standard PCG-SMG. However, for the flexible PCG and LOBPCG, our numerical tests show that the post-smoothing can be turned off. In the latter case, avoiding the post-smoothing results in overall acceleration, due to the high costs of smoothing and the relatively insignificant decrease in convergence speed. Our observations are also generally applicable for algebraic multigrid preconditioning for graph Laplacians, appearing, e.g., in computational photography problems, as well as in 3D mesh processing tasks [11].

A different case of non-standard preconditioning, specifically, variable preconditioning, in PCG is considered in our earlier work [8]. There, we also find a dramatic difference in convergence speed between the standard and flexible versions of PCG. The better convergence behavior of the flexible PCG is explained in [8] by its local optimality, which guarantees its convergence with at least the speed of PSD. Our numerical tests there show that, in fact, the convergence of PSD and the flexible PCG is practically very close. We perform the same comparison here, and obtain a similar result. We demonstrate for linear systems that PSD-SMG converges almost as fast as the flexible PCG-SMG if SMG post-smoothing is off in both methods.

Our numerical experiments are executed in both a strictly shared memory environment and in a distributed memory environment. Different size problems and different shapes of the bricks are solved in our shared versus distributed tests, so the results are not directly comparable with each other. Our motivation to test both shared and distributed cases is to investigate how the effect of acceleration depends on the memory interconnection speed.

The rest of the paper is organized as follows. We formally describe the PSD and PCG methods used here for testing, and explain their differences. We briefly discuss the SMG preconditioning in hypre and present our numerical results for linear systems. One section is dedicated to eigenvalue problems. Our last section contains the relevant theory.

2. PSD AND PCG METHODS
For a general exposition of PSD and PCG, let SPD matrices A and T, and vectors b and x_0, be given, and denote r_k = b − A x_k. Algorithm 1 is described in [8].

Algorithm 1: PSD and PCG methods
1   for k = 0, 1, . . . do
2       s_k = T r_k
3       if k = 0 then
4           p_0 = s_0
5       else
6           p_k = s_k + β_k p_{k−1} (where β_k is either (1) or (2) for all iterations)
7       end
8       α_k = (s_k, r_k)/(p_k, A p_k)
9       x_{k+1} = x_k + α_k p_k
10      r_{k+1} = r_k − α_k A p_k
11  end

Various methods are obtained by using different formulas for the scalar β_k. We set β_k = 0 for PSD,

    β_k = (s_k, r_k) / (s_{k−1}, r_{k−1})    (1)

for the standard PCG, or

    β_k = (s_k, r_k − r_{k−1}) / (s_{k−1}, r_{k−1})    (2)

for the flexible PCG.

We note that in using (2), we are merely subtracting one term, (s_k, r_{k−1}), in the numerator of (1), which appears in the standard CG method. If T is a fixed SPD matrix, this term actually vanishes; see, e.g., [8]. By using (2) in a computer code, it is required that an extra vector be allocated to either calculate r_k − r_{k−1} or store α_k A p_k, compared to (1). The associated costs may be noticeable for large problems solved on parallel computers. Next, we numerically evaluate the extra costs by comparing the standard and flexible PCG with no preconditioning for a variety of problem sizes.
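To make the difference between the three choices of β_k concrete, the following minimal numpy sketch implements Algorithm 1 with a switch between β_k = 0 (PSD), formula (1), and formula (2). It is our own illustration, not the hypre codes modified by the authors; the name psd_pcg and the arguments are our choices, A is assumed to be an SPD matrix (anything supporting A @ x), and apply_T is a function applying the, possibly nonsymmetric, preconditioner T.

    # Minimal numpy sketch of Algorithm 1; not the hypre code used in the paper.
    # beta selects beta_k: "psd" -> 0, "standard" -> formula (1), "flexible" -> (2).
    import numpy as np

    def psd_pcg(A, b, apply_T, beta="flexible", x0=None, tol=1e-8, maxiter=200):
        x = np.zeros_like(b) if x0 is None else x0.copy()
        r = b - A @ x
        r_old = s_old = p = None
        for k in range(maxiter):
            s = apply_T(r)                              # line 2: s_k = T r_k
            if k == 0 or beta == "psd":
                p = s.copy()                            # lines 3-4, or beta_k = 0
            elif beta == "standard":
                p = s + ((s @ r) / (s_old @ r_old)) * p             # formula (1)
            else:
                p = s + ((s @ (r - r_old)) / (s_old @ r_old)) * p   # formula (2)
            alpha = (s @ r) / (p @ (A @ p))             # line 8
            x = x + alpha * p                           # line 9
            r_old, s_old = r, s
            r = r - alpha * (A @ p)                     # line 10
            if np.linalg.norm(r) <= tol * np.linalg.norm(b):
                break
        return x, k + 1

For a fixed SPD T, the "standard" and "flexible" options produce the same iterates up to round-off, since the subtracted term in (2) vanishes in exact arithmetic, as noted above.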
Our model problem used for all calculations in the present paper is the three-dimensional negative Laplacian in a brick with homogeneous Dirichlet boundary conditions, approximated by the standard finite difference scheme using the 7-point stencil with the grid size one in all three directions. The initial approximation is (pseudo)random. We simply call a code, called struct, which is provided in hypre to test SMG, with different command-line options.
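For readers who want to reproduce the model problem outside of hypre, the sketch below assembles the same 7-point negative Laplacian as a sparse matrix with scipy. The struct driver of hypre builds this operator internally; the function name laplacian_3d and the particular grid sizes are our own choices, so this is only an illustration of the discretization just described.

    # Illustration only: the 7-point negative Laplacian assembled in scipy.
    import scipy.sparse as sp

    def laplacian_3d(nx, ny, nz):
        """Negative 3D Laplacian, 7-point stencil, grid size one,
        homogeneous Dirichlet boundary conditions."""
        def lap1d(n):
            return sp.diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(n, n), format="csr")
        ix, iy, iz = (sp.identity(m, format="csr") for m in (nx, ny, nz))
        return (sp.kron(sp.kron(iz, iy), lap1d(nx))
                + sp.kron(sp.kron(iz, lap1d(ny)), ix)
                + sp.kron(sp.kron(lap1d(nz), iy), ix)).tocsr()

    A = laplacian_3d(80, 80, 80)   # 512,000 unknowns for this small example
    print(A.shape, A.nnz)

The largest grids reported below, up to 2,239,488,000 unknowns, are of course handled by hypre itself rather than by a serial script like this.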
To generate the data to compare iterative methods, we execute the following command,

    mpiexec -np 16 ./struct -n $n $n $n -solver 19

for the shared memory experiments and

    mpiexec -np 384 ./struct -n $n $n $n -solver 19

for the distributed memory experiments, where $n runs from 10 to 180 and determines the number of grid points in each of the three directions per processor. The size of the brick here is ($np*$n)-by-$n-by-$n, i.e., the brick gets longer in the first direction with the increase in the number of cores. For example, using the largest value $n=180, the maximum problem size we solve is 16x180-by-180-by-180=93,312,000 unknowns for $np=16 and 384x180-by-180-by-180=2,239,488,000 unknowns for $np=384. The option -solver 19 tells the driver struct to use no preconditioning. The MPI option -np 16 means that we run on 16 cores, and we restrict to using only one node for the shared memory tests, whereas for distributed memory we use 16 cores on each of 24 nodes with the MPI option -np 384. In fact, all our tests in this paper are performed on either 16 cores on one node or 16 cores on each of the 24 nodes, so in the rest of the paper we always omit the "mpiexec -np 16(384)" part of the execution command for brevity.

Our comprehensive studies are performed on a 24-node cluster where each node is a dual-socket octo-core machine based on an Intel Xeon E5-2670 Sandy Bridge processor running at 2.60 GHz with 64 GB RAM of shared memory per node. Switched communications fabric between the nodes is provided by an InfiniBand 4X QDR Interconnect. Each experiment is performed using 16 cores on one node via shared memory and also using 16 cores on all 24 nodes via distributed memory.

In Figure 1, for the CG method without preconditioning, we see a 20-25% cost overhead incurred due to the extra storage and calculation for the flexible variant, (2), relative to the standard variant, (1). Note that in each instance of the problem, the number of iterations of the CG method is the same regardless of whether (1) or (2) is used for β_k.

From Figure 1 one observes that the distributed memory (DM) case is not perfectly scalable, as there is about a 50% increase when the size of the problem is increased 24 times, proportionally to the number of processors. The same comparison actually holds for all other (a) and (b) figures up to Figure 8.

[Figure 1: Timing of unpreconditioned CG using (1) and (2). (a) Shared memory; (b) Distributed memory.]

In Figure 2 we see, as expected without preconditioning, that the relative final residuals for both CG variants are identical, and that the SD algorithm of course performs much worse than either. The number of iterative steps is capped by 100, so most of Figure 2 shows the residual after 100 iterations, except for a small straight part of the CG line for $n=10,20,30, where the iterations have converged.

[Figure 2: Accuracy comparison of SD and CG using (1) and (2).]

3. PRECONDITIONING WITH SMG
In order to help overcome the slow convergence of the CG method, as observed in Figure 2, it is customary to introduce preconditioning. Here, we use the SMG solver as a preconditioner, provided by hypre. The SMG solver/preconditioner uses plane-relaxation as a smoother at each level in the V-cycle [1]. The number of pre- and post-relaxation smoothing steps is controlled by a command line parameter in the struct test driver. The data in Figures 3 and 4 is obtained by

    ./struct -n $n $n $n -solver 10 -v 1 1

in which the -solver 10 option refers to the SMG preconditioning, and the number of pre- and post-relaxation smoothing steps is specified by the -v flag; one step each of pre- and post-relaxation is used in this call.

Although using (2) introduces some extra overhead, in the case of the SMG preconditioning this overhead is negligible, as can be seen in Figure 3, since the SMG preconditioning, first, is relatively expensive computationally and, second, makes the PCG converge much faster, see Figure 4, compared to the non-preconditioned case displayed in Figure 2. The comparison is a little confusing, since Figure 2 is a residual error plot (iterations were capped at 100) while Figure 4 shows the iteration number (convergence was obtained prior to 100 iterations). As in Figure 2, for each instance of the PCG method the number of iterations is equal, as depicted in Figure 4. The PSD method performs a bit worse (only 2-3 more iterations) than either variant of PCG, and not nearly as much worse as in Figure 2, which clearly indicates the high quality of the SMG preconditioner for these problems.

[Figure 3: Cost for storage and calculation in PCG-SMG with (1) or (2). (a) Shared memory; (b) Distributed memory.]

[Figure 4: Iteration count for PSD-SMG and PCG-SMG with (1) or (2).]

Using the same number of pre- and post-relaxation steps creates a symmetric preconditioner, in this test, an SPD preconditioner. Thus both PCG-SMG variants, (1) and (2), are expected to generate identical (up to round-off errors) iterative approximations. We indeed observe this effect in Figure 4, where the data points for the PCG-SMG with (1) and (2) are indistinguishable.

If no post-relaxation is performed within the SMG preconditioner, the convergence of the standard PCG method, i.e., with (1), is significantly slowed down, almost to the level where preconditioning is useless. The command used for this comparison is:

    ./struct -n 80 80 80 -solver 10 -v 1 0

in which the unbalanced relaxation is specified by the -v flag and the grid is 1280x80x80. In Figure 5, we compare the convergence of PCG-SMG using (1) and (2) and of PSD-SMG, where the SMG preconditioner has no post-relaxation. In PCG-SMG using (2) without post-relaxation of the SMG preconditioner, we still achieve a very good convergence rate, as well as for PSD-SMG.

[Figure 5: Iteration error for PCG-SMG and PSD-SMG.]

The convergence behavior in Figure 5 is similar to that observed in [8]. There, a variable SPD preconditioner makes the standard PCG, i.e., using (1), almost stall, while the convergence rates of PSD and flexible PCG, i.e., with (2), are good and close to each other.

However, the SMG preconditioner is fixed and linear, according to its description in [13] and our numerical verification, so the theoretical explanation of such a convergence behavior in [8] is not directly applicable here. Moreover, turning off the post-relaxation smoothing in a multigrid preconditioner makes it nonsymmetric, a case not theoretically covered in [8], where the assumption is always made that the preconditioner is SPD.
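The symmetry statements above are easy to check explicitly on a toy problem. The sketch below is our own illustration, unrelated to hypre's SMG code: it assembles a two-grid V-cycle for the 1D Laplacian as an explicit preconditioner matrix T, with a weighted Jacobi pre-smoother, linear interpolation, and an exact Galerkin coarse-grid solve. With one pre- and one post-smoothing step the resulting T is symmetric; dropping the post-smoothing makes T nonsymmetric.

    # Toy two-grid V-cycle for the 1D Laplacian as an explicit matrix T;
    # our own illustration, unrelated to hypre's SMG.
    import numpy as np

    n = 31                                           # fine grid points; coarse grid: 15
    A = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
    S = (2.0 / 3.0) * np.diag(1.0 / np.diag(A))      # weighted Jacobi smoother
    nc = (n - 1) // 2
    P = np.zeros((n, nc))                            # linear interpolation
    for j in range(nc):
        i = 2 * j + 1
        P[i - 1, j], P[i, j], P[i + 1, j] = 0.5, 1.0, 0.5
    C = P @ np.linalg.inv(P.T @ A @ P) @ P.T         # exact Galerkin coarse correction
    I, Ainv = np.eye(n), np.linalg.inv(A)

    # error propagation E = I - T A, hence T = (I - E) A^{-1}
    E_bal = (I - S @ A) @ (I - C @ A) @ (I - S @ A)  # V(1,1): balanced smoothing
    E_pre = (I - C @ A) @ (I - S @ A)                # V(1,0): no post-smoothing
    T_bal = (I - E_bal) @ Ainv
    T_pre = (I - E_pre) @ Ainv

    print(np.linalg.norm(T_bal - T_bal.T))           # ~1e-16: symmetric
    print(np.linalg.norm(T_pre - T_pre.T))           # clearly nonzero: nonsymmetric

Either matrix can be passed to the psd_pcg sketch of Section 2, e.g., as apply_T = lambda r: T_pre @ r, to experiment with the behavior reported in Figure 5.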

We add the data for the absence of post-relaxation (option -v 1 0) in the SMG preconditioner to Figures 3 and 4 to obtain Figures 6 and 7. The number of iterations of the solver does increase a bit, for both (2) and for the PSD method with no post-relaxation in the SMG preconditioner, but not enough to outweigh the cost savings of not using the post-relaxation in the SMG preconditioner. The overall improvement is 43% for all problems tested.

[Figure 6: Cost comparison of relaxation in PCG-SMG using (1) or (2). (a) Shared memory; (b) Distributed memory.]

[Figure 7: Iteration count for PSD-SMG and PCG-SMG using (1) or (2).]

4. PRECONDITIONING OF LOBPCG WITH SMG
The LOBPCG method [7] computes the m smallest eigenvalues of a linear operator and is implemented within hypre through BLOPEX; see [2]. We conclude our numerical experiments with a comparison of the use of balanced and unbalanced relaxation in the SMG preconditioner for the LOBPCG method with m = 1.

In these tests, the matrix remains the same as in the previous section, i.e., it corresponds to the three-dimensional negative Laplacian in a brick with homogeneous Dirichlet boundary conditions, approximated by the standard finite difference scheme using the 7-point stencil with the grid size of one in all three directions. But in this section we compute the smallest eigenvalue and the corresponding eigenvector, rather than solve a linear system. The initial approximation to the eigenvector is (pseudo)random.

We generate the data for Figures 8 and 9 with the commands:

    ./struct -n $n $n $n -solver 10 -lobpcg -v 1 0
    ./struct -n $n $n $n -solver 10 -lobpcg -v 1 1

where $n runs from 10 to 120. We also use the -P option in struct to make the brick more even-sided. In the shared memory experiment we use -P 4 2 2, which creates a 4n-by-2n-by-2n brick, and in the distributed case we use -P 8 8 6, which creates an 8n-by-8n-by-6n brick.


For this experiment, as seen in Figure 9, the number of LOBPCG iterations for the non-balanced SMG preconditioner is roughly 50% more than that for the balanced SMG preconditioner. However, the cost savings, which are about 30-50%, of not using the post-relaxation in the SMG preconditioner outweigh the higher number of iterations, as shown in Figure 8, and thus justify turning off the post-relaxation in this case. From Figure 8 one observes that the distributed memory case is not perfectly scalable, as there is about a 50% increase when the size of the problem is increased 24 times, proportionally to the number of processors. Finally, we note that the existing LOBPCG convergence theory in [7, 10] requires an SPD preconditioner T, and does not explain convergence if T is nonsymmetric.

[Figure 8: Cost for higher relaxation level for LOBPCG-SMG. (a) Shared memory; (b) Distributed memory.]

[Figure 9: Iteration comparison of relaxation levels for LOBPCG-SMG. (a) Shared memory; (b) Distributed memory.]
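As a self-contained illustration of the kind of iteration LOBPCG performs for m = 1, the sketch below implements a textbook single-vector variant: a Rayleigh-Ritz procedure on span{x_k, T r_k, x_{k-1}}, here with a deliberately nonsymmetric preconditioner (one forward Gauss-Seidel sweep), to show that the iteration is well defined for nonsymmetric T. It is not the BLOPEX/hypre implementation used for the experiments, which employs a numerically more stable recurrence; the 1D model problem and all names below are our own.

    # numpy-only, single-vector (m = 1) LOBPCG-type sketch; not BLOPEX/hypre.
    import numpy as np

    def lobpcg_m1(A, apply_T, x0, maxiter=100):
        x, x_prev = x0 / np.linalg.norm(x0), None
        for _ in range(maxiter):
            lam = x @ (A @ x)                     # Rayleigh quotient
            r = A @ x - lam * x                   # eigenresidual
            w = apply_T(r)                        # preconditioned residual
            cols = [x, w] if x_prev is None else [x, w, x_prev]
            Q, _ = np.linalg.qr(np.column_stack(cols))
            vals, vecs = np.linalg.eigh(Q.T @ (A @ Q))   # Rayleigh-Ritz
            x_new = Q @ vecs[:, 0]                # Ritz vector of smallest Ritz value
            x_prev, x = x, x_new / np.linalg.norm(x_new)
        return x @ (A @ x), x

    n = 100                                       # small 1D negative Laplacian
    A = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
    L = np.tril(A)                                # D plus strictly lower triangle
    forward_gs = lambda r: np.linalg.solve(L, r)  # nonsymmetric preconditioner
    lam, x = lobpcg_m1(A, forward_gs, np.random.default_rng(0).standard_normal(n))
    print(lam, 2 * (1 - np.cos(np.pi / (n + 1))))  # computed vs exact smallest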

5. THEORETICAL JUSTIFICATION OF THE OBSERVED NUMERICAL BEHAVIOR
For linear systems with SPD coefficient matrices, the use of nonsymmetric preconditioning has been justified, e.g., in [3, Section 12.3], [4, 12], and [14, Section 10.2]. The arguments used there are applicable for nonsymmetric and variable preconditioning.

Specifically, it is shown in [14, Section 10.2] that the flexible PCG, i.e., using (2), is locally optimal, i.e., on every step it converges not slower than PSD. The convergence rate bound for the PSD with nonsymmetric preconditioning established in [14, Section 10.2] is

    ||r_{k+1}||_{A^{-1}} ≤ δ ||r_k||_{A^{-1}},    (3)

under the assumption

    ||I − AT||_{A^{-1}} ≤ δ < 1,    (4)

where ||·||_{A^{-1}} denotes the operator norm induced by the corresponding vector norm ||x||_{A^{-1}} = √(x' A^{-1} x). The key identity for PSD, which easily leads to bound (3), is presented in the following theorem.

Theorem 1. The following identity holds:

    ||r_{k+1}||_{A^{-1}} / ||r_k||_{A^{-1}} = sin( ∠_{A^{-1}}{r_k, A T r_k} ),    (5)

where the right-hand side is defined via

    cos( ∠_{A^{-1}}{r_k, A T r_k} ) = |(r_k)' T r_k| / ( ||r_k||_{A^{-1}} ||A T r_k||_{A^{-1}} ).

Proof. Identity (5) is actually proved, although not explicitly formulated, in the proof of [14, Theorem 10.2]. Alternatively, identity (5) is equivalent to

    ||e_{k+1}||_A / ||e_k||_A = sin( ∠_A{e_k, T A e_k} ),    (6)

where A e_k = r_k, which is the statement of [8, Lemma 4.1]. We note that [8] generally assumes that the preconditioner T is SPD, but this assumption is not actually used in the proof of [8, Lemma 4.1].
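Identity (5) is easy to confirm numerically, which also illustrates that no symmetry of T is used. The check below is our own: it takes a random SPD A, a random (nonsymmetric) T, performs a single PSD step, and compares the two sides of (5); the explicit inverse of A is affordable only at this toy size.

    # Numerical check of identity (5) for one PSD step with a nonsymmetric T.
    import numpy as np

    rng = np.random.default_rng(1)
    n = 50
    M = rng.standard_normal((n, n))
    A = M @ M.T + n * np.eye(n)                  # SPD coefficient matrix
    T = rng.standard_normal((n, n))              # deliberately nonsymmetric
    Ainv = np.linalg.inv(A)
    norm_Ainv = lambda v: np.sqrt(v @ (Ainv @ v))

    r = rng.standard_normal(n)                   # residual r_k
    s = T @ r                                    # s_k = T r_k
    alpha = (s @ r) / (s @ (A @ s))              # PSD step: p_k = s_k
    r_new = r - alpha * (A @ s)                  # r_{k+1}

    lhs = norm_Ainv(r_new) / norm_Ainv(r)
    cos = abs(r @ s) / (norm_Ainv(r) * norm_Ainv(A @ s))
    print(lhs, np.sqrt(max(0.0, 1.0 - cos**2)))  # both sides of (5) agree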

Assumption (4) is very simple, but has one significant drawback: it does not allow arbitrary scaling of the preconditioner T, while the PCG and PSD methods are invariant with respect to scaling of T. The way around this is to scale the preconditioner T before assumption (4) is verified. We now illustrate such a scaling under an additional assumption that T is SPD, following [8]. We start with a theorem connecting assumption (4) with its equivalent, and probably more traditional, form.


Theorem 2. Let the preconditioner T be SPD. Then assumption (4) is equivalent to

    ||I − TA||_{T^{-1}} ≤ δ < 1.    (7)

Proof. Since T is SPD, on the one hand, the matrix product AT is also SPD, but with respect to the A^{-1} scalar product. This implies that assumption (4) is equivalent to the statement that Λ(AT) ⊂ [1 − δ, 1 + δ] with δ < 1, where Λ(·) denotes the matrix spectrum. On the other hand, the matrix product TA is SPD as well, with respect to the T^{-1} scalar product. Thus, assumption (7) is equivalent to the statement that Λ(TA) ⊂ [1 − δ, 1 + δ]. This means the equivalence of assumptions (4) and (7), since Λ(AT) = Λ(TA).

Let us now, without loss of generality, as in [9, p. 96] and [8, pp. 1268–1269], always scale the SPD preconditioner T in such a way that

    max{Λ(TA)} + min{Λ(TA)} = 2.

Then we have δ = (κ(TA) − 1)/(κ(TA) + 1) and, vice versa, κ(TA) = (1 + δ)/(1 − δ), where κ(·) denotes the matrix spectral condition number. The convergence rate bound (3) for the PSD with nonsymmetric preconditioning in this case turns into the standard PSD convergence rate bound for the case of an SPD preconditioner T; see, e.g., [8, Bound (1.3)]. Moreover, [8, Theorem 5.1] shows that this convergence rate bound is sharp for PSD, and cannot be improved for the flexible PCG, i.e., using (2), if the SPD preconditioner T changes on every iteration. The latter result naturally extends to the case of nonsymmetric preconditioning of [14, Section 10.2].
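The effect of this prescaling can again be verified numerically. In the sketch below (our own check, with random SPD A and T), T is rescaled so that max Λ(TA) + min Λ(TA) = 2, and the quantity ||I − AT||_{A^{-1}} from assumption (4) is then seen to coincide with (κ(TA) − 1)/(κ(TA) + 1).

    # Numerical illustration of the prescaling argument; our own check.
    import numpy as np
    from numpy.linalg import eigh, eigvals, norm

    rng = np.random.default_rng(2)
    n = 40
    def rand_spd():
        M = rng.standard_normal((n, n))
        return M @ M.T + n * np.eye(n)
    A, T = rand_spd(), rand_spd()

    lams = np.sort(eigvals(T @ A).real)          # Lambda(TA) is real and positive
    T *= 2.0 / (lams[0] + lams[-1])              # scaling: max + min of Lambda(TA) = 2
    lams = np.sort(eigvals(T @ A).real)
    kappa = lams[-1] / lams[0]

    w, V = eigh(A)                               # ||M||_{A^{-1}} = ||A^{-1/2} M A^{1/2}||_2
    A_half = V @ np.diag(np.sqrt(w)) @ V.T
    A_mhalf = V @ np.diag(1.0 / np.sqrt(w)) @ V.T
    delta = norm(A_mhalf @ (np.eye(n) - A @ T) @ A_half, 2)

    print(delta, (kappa - 1.0) / (kappa + 1.0))  # equal up to round-off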
Compared to linear systems, eigenvalue problems are significantly more complex. Sharp convergence rate bounds for symmetric eigenvalue problems have been obtained in the last decade, and only for the simplest preconditioned method; see [9, 10] and references therein. A possibility of using nonsymmetric preconditioning for symmetric eigenvalue problems has not been considered before, to our knowledge. However, our check of the arguments of [9] and preceding works, where a PSD convergence rate bound is proved assuming (4) and SPD preconditioning, reveals that the latter assumption, SPD, is actually never significantly used, and can be dropped without affecting the bound.

The arguments above lead us to a surprising determination that whether or not the preconditioner is SPD is of no importance for PSD convergence, given the same quality of preconditioning, measured by (4) after preconditioner prescaling. If the preconditioner is fixed SPD, then the standard PCG is the method of choice. The cases where the preconditioner is variable or nonsymmetric are similar to each other: the standard non-flexible PCG, i.e., using (1), stalls, while the flexible PCG converges, due to its local optimality, but may not be much faster compared to PSD. This explains the numerical results using nonsymmetric preconditioning reported in this work, as related to the results of [8] for variable SPD preconditioning.

6. CONCLUSION
Although the flexible PCG linear solver does require a bit more computational effort and storage as compared to the standard PCG, within the scope of preconditioning the extra effort can be worthwhile, if the preconditioner is not fixed SPD. Moreover, our numerical tests showed that the behavior is similar in both shared and distributed memory. Thus our conclusion is that the effect of acceleration is independent of the speed of the node interconnection. The use of geometric multigrid without post-relaxation is demonstrated to be surprisingly efficient as a preconditioner for locally optimal iterative methods, such as the flexible PCG for linear systems and LOBPCG for eigenvalue problems.

7. ACKNOWLEDGMENT
The authors would like to thank Rob Falgout, Van Henson, Panayot Vassilevski, and other members of the hypre team for their attention to our numerical results reported here, and the Center for Computational Mathematics, University of Colorado Denver, for the use of the cluster. This work is partially supported by NSF awards CNS 0958354 and DMS 1115734.
8. REFERENCES
[1] HYPRE User's Manual. Center for Applied Scientific Computing, Lawrence Livermore National Lab, 2012.
[2] M. E. Argentati, I. Lashuk, A. V. Knyazev, and E. E. Ovtchinnikov. Block locally optimal preconditioned eigenvalue Xolvers (BLOPEX) in HYPRE and PETSc. SIAM J. Sci. Comput., 29(4):2224–2239, 2007.
[3] O. Axelsson. Iterative Solution Methods. Cambridge University Press, Cambridge, 1994.
[4] R. Blaheta. GPCG-generalized preconditioned CG method and its use with non-linear and non-symmetric displacement decomposition - saddle point solvers. Numer. Linear Algebra Appl., 9(6-7):527–550, 2002.
[5] J. H. Bramble and X. Zhang. The analysis of multigrid methods. In P. Ciarlet and J. Lions, editors, Solution of Equation in Rn (Part 3), Techniques of Scientific Computing, volume 7 of Handbook of Numerical Analysis, pages 173–415. Elsevier, 2000.
[6] G. H. Golub and Q. Ye. Inexact preconditioned conjugate gradient method with inner-outer iteration. SIAM J. Sci. Comput., 21(4):1305–1320, 1999.
[7] A. V. Knyazev. Toward the optimal preconditioned eigensolver: Locally optimal block preconditioned conjugate gradient method. SIAM J. Sci. Comput., 23(2):517–541, 2001.
[8] A. V. Knyazev and I. Lashuk. Steepest descent and conjugate gradient methods with variable preconditioning. SIAM J. Matrix Anal. Appl., 29(4):1267–1280, 2007.
[9] A. V. Knyazev and K. Neymeyr. A geometric theory for preconditioned inverse iteration. III. A short and sharp convergence estimate for generalized eigenvalue problems. Linear Algebra Appl., 358:95–114, 2003.
[10] A. V. Knyazev and K. Neymeyr. Gradient flow approach to geometric convergence analysis of preconditioned eigensolvers. SIAM J. Matrix Anal. Appl., 31(2):621–628, 2009.
[11] D. Krishnan, R. Fattal, and R. Szeliski. Efficient preconditioning of Laplacian matrices for computer graphics. In ACM Transactions on Graphics (Proc. SIGGRAPH 2013), volume 32(4). ACM SIGGRAPH, July 2013.
[12] Y. Notay. Flexible conjugate gradients. SIAM J. Sci. Comput., 22(4):1444–1460 (electronic), 2000.
[13] S. Schaffer. A semicoarsening multigrid method for elliptic partial differential equations with highly discontinuous and anisotropic coefficients. SIAM J. Sci. Comput., 20(1):228–242, 1998.
[14] P. S. Vassilevski. Multilevel Block Factorization Preconditioners. Springer, New York, 2008.