Bull. Korean Math. Soc. 53 (2016), No. 3, pp. 853–874 http://dx.doi.org/10.4134/BKMS.b150384 pISSN: 1015-8634 / eISSN: 2234-3016

A PARALLEL FINITE ELEMENT ALGORITHM FOR SIMULATION OF THE GENERALIZED STOKES PROBLEM

Yueqiang Shang

Abstract. Based on a particular overlapping domain decomposition technique, a parallel finite element discretization algorithm for the gener- alized Stokes equations is proposed and investigated. In this algorithm, each processor computes a local approximate solution in its own subdo- main by solving a global problem on a mesh that is fine around its own subdomain and coarse elsewhere, and hence avoids communication with other processors in the process of computations. This algorithm has low communication complexity. It only requires the application of an existing sequential solver on the global meshes associated with each subdomain, and hence can reuse existing sequential software. Numerical results are given to demonstrate the effectiveness of the parallel algorithm.

1. Introduction The generalized Stokes problem arises naturally in the time discretization of time-dependent Stokes problem when implicit Euler scheme is employed. It also consists of the key and most time-consuming part of the solution process of time-dependent Navier-Stokes equations adopting, for instance, semi-implicit schemes or else splitting methods. Therefore, the development of efficient al- gorithms for the generalized Stokes problem is of great significance in the sim- ulation of incompressible flows. Thanks to the development of modern high performance parallel computers, parallel computations attract more and more attentions nowadays. To effi- ciently utilize the computational power of modern parallel computers, efficient parallel algorithms are indispensable, and play a critical role in exploiting the full potential of modern parallel computers in use. Many of today’s parallel

Received May 21, 2015; Revised September 8, 2015. 2010 Mathematics Subject Classification. Primary 65N30, 65N55, 76D07, 76M10. Key words and phrases. generalized Stokes problem, finite element, parallel algorithm, parallel computing, domain decomposition. This work was supported by the Natural Science Foundation of China (No. 11361016), the Project Sponsored by the Scientific Research Foundation for the Returned Overseas Chinese Scholars, State Education Ministry, the Fundamental Research Funds for the Central Uni- versities (No. XDJK2014C160, SWU113095), and the Science and Technology Foundation of Guizhou Province of China (No. [2013]2212).

c 2016 Korean Mathematical Society 853 854 Y. SHANG computers are distributed memory parallel architectures with high latency and low bandwidth (i.e., the network connection is one that generally suffers from long delays in processing of network data, and the width of the communications band between processors is small) such as a cluster of workstations. In such environments, the communication cost is high compared to the computation cost. As a result, for an application, the communication frequency and size of messages transmitted between processors must be kept small for good parallel performance. Motivated by this, we present a parallel finite element discretiza- tion algorithm with low communication complexity for the generalized Stokes problem where the zeroth order term dominates the equations. In this algo- rithm, each subproblem is defined in the entire domain with the majority of the degrees of freedom associated with the particular subdomain that it deals with, and hence can be solved in parallel with other subproblems using an existing sequential solver without extensive recoding. The idea of computing a local approximation in a given subdomain by using a locally refined global mesh was first proposed by Xu and Zhou in [40, 41] for a class of linear and nonlinear elliptic boundary value problems. This approach was then applied to the standard steady Stokes equations [14, 34] and the steady Navier-Stokes equations [13] by He et al. In [35], by combining this ap- proach with classical iterative methods for the steady Navier-Stokes equations, Shang and He designed several parallel iterative algorithms for the Navier- Stokes equations. It was also combined with the defect-correction method and the subgrid stabilization method in [33] and [32], respectively. Similar meth- ods, for instance, the Bank-Holst parallel adaptive paradigm (cf. [3, 4, 5]) and the full domain partition approach (cf. [25, 26]) were also proposed in different contexts. Numerical tests showed the effectiveness of this type of algorithms. However, to our knowledge, there is a few reports on time-dependent problems or those problems where the zeroth order term dominates the equation(s). It is well known that exact solution of the problem may contain sharp layers where gradient of the solution changes abruptly when the problem is zeroth order term-dominated. By combining this local finite element computation approach with domain decomposition, we design a parallel algorithm for the generalized Stokes prob- lem. Concretely speaking, we first decompose the solution domain into disjoint subdomains, and then enlarge each subdomain by a small amount and assign each processor one enlarged subdomain. Each processor computes a local finite element solution of the generalized Stokes equations in its own subdomain us- ing a global grid that is fine with size h in the particular enlarged subdomain and coarse with size H > h far away from the subdomain. This algorithm has low communication complexity. It can be implemented easily on top of existing sequential software. Moreover, by suitably choosing a ratio of coarse mesh size H to fine mesh size h, our parallel algorithm can yield an approximate solution with an accuracy comparable to that of the standard Galerkin finite element A PARALLEL ALGORITHM FOR STOKES PROBLEM 855 solution. However, compared to the standard Galerkin finite element method, our algorithm can save a large amount of computational time. There have been some domain decomposition methods for the incompressible Stokes equations and related problems in literature, for examples, the overlap- ping Schwarz methods (cf. [10, 20, 31]), the non-overlapping Schwarz-type algorithms (cf. [17]), FETI [22, 38], FETI-DP [16, 18, 39] and BDDC methods [24, 37], balancing Neumann-Neumann methods (cf. [29]), block preconditioner methods (cf. [19, 21]), the iterative substructuring methods [6, 7, 27, 28]) and those in [2, 8, 15, 23, 36], among others. Our method differs from those men- tioned methods in that it is designed as a discretization scheme, rather than an iterative approach for the discretized system. The paper is organized as follows. In the next section, the generalized Stokes problem and its mixed finite element approximation are provided. In Section 3, a parallel finite element algorithm based on fully overlapping domain decom- position is designed and discussed. In Section 4, numerical results are given to demonstrate the effectiveness of the algorithm. Finally, the article is concluded in Section 5.

2. Preliminaries Let Ω be a bounded domain with Lipschitz-continuous boundary ∂Ω in Rd (d = 2, 3). As usual, for a nonnegative integer m, we denote by Hm(Ω) the Sobolev space of functions with square integrable distribution derivatives up to order m in Ω, equipped with the standard norm k · km,Ω; while we denote 1 1 by H0 (Ω) the closed subspace of H (Ω) consisting of functions with zero trace on ∂Ω; see, e.g., [1, 9]. 2.1. The generalized Stokes problem We consider the following generalized Stokes problem (1) αu − ν∆u + ∇p = f in Ω, (2) div u = 0 in Ω, (3) u = 0 on ∂Ω, T T where u = (u1, . . . , ud) is the velocity, p the pressure, f = (f1, . . . , fd) the prescribed body force, ν the kinematic viscosity and α a positive constant which plays the role of the inverse of time-step size when equations (1)-(3) are derived from the time discretization of time-dependent Stokes or Navier-Stokes equations. 1 d 2 The variational form of (1)-(3) reads: find a pair (u, p) ∈ H0 (Ω) × L0(Ω) such that 1 d 2 (4) a(u, v) − b(v, p) + b(u, q) = (f, v), ∀(v, q) ∈ H0 (Ω) × L0(Ω), where Z 2 2 L0(Ω) = {q ∈ L (Ω) : qdx = 0}, Ω 856 Y. SHANG

(·, ·) is the standard inner-product of L2(Ω)l (l = 1, 2, 3), the bilinear terms a(·, ·) and b(·, ·) are defined as 1 d 2 a(u, v) = (αu, v)+ν(∇u, ∇v), b(v, q) = (divv, q), ∀u, v ∈ H0 (Ω) , q ∈ L0(Ω). 2.2. Mixed finite element spaces Assume T h(Ω) = {K} to be a shape-regular triangulation (see, e.g., [9, 12]) of Ω into triangles or quadrilaterals (if d = 2), or tetrahedrons or hexahedrons 1 d 2 (if d = 3) with mesh size 0 < h < 1, and Xh(Ω) ⊂ H0 (Ω) ,Mh(Ω) ⊂ L0(Ω) two finite element subspaces satisfying the following approximation properties: for each (u, p) ∈ Hk+1(Ω)d × Hk(Ω)(k ≥ 1), there exists an approximation (πhu, ρhp) ∈ Xh(Ω) × Mh(Ω) such that s s (5) ku − πhuk1,Ω ≤ ch kuk1+s,Ω, kp − ρhpk0,Ω ≤ ch kpks,Ω, 0 ≤ s ≤ k, together with the so-called inf-sup inequality (LBB condition): there exists a constant β > 0 such that (div v, q) (6) βkqk0,Ω ≤ sup , ∀q ∈ Mh(Ω). v∈Xh(Ω),v6=0 k∇vk0,Ω Under the above assumptions, the mixed finite element approximation of problem (4) reads: find a pair (uh, ph) ∈ Xh(Ω) × Mh(Ω) such that

(7) a(uh, v) − b(v, ph) + b(uh, q) = (f, v), ∀(v, q) ∈ Xh(Ω) × Mh(Ω). k−1 d 1 k+1 d It is well known that given f ∈ H (Ω) , if (u, p) ∈ (H0 (Ω) ∩ H (Ω)) × 2 k (L0(Ω) ∩ H (Ω)), then the finite element solution (uh, ph) of problem (7) has the following error estimate (cf. [12, 30]): (8) s s ku−uhk1,Ω + kp−phk0,Ω ≤ ch (kuks+1,Ω +kpks,Ω) ≤ ch kfks−1,Ω, 1 ≤ s ≤ k.

3. Parallel finite element algorithm 3.1. Fully overlapping domain decomposition technique

Let us first divide Ω into a number of disjoint subdomains D1,...,DJ , and then enlarge each Dj to obtain Ωj such that Dj ⊂⊂ Ωj ⊂ Ω (here Dj ⊂⊂ Ωj ⊂ Ω means that dist(∂Dj\∂Ωj, ∂Ωj\∂Ω) > 0). Each processor is assigned one subdomain and independently generates a global mesh that is fine (with size h) around its own subdomain and coarse (with size H > h) far away H,h from the subdomain. We denote by Tj (Ω) these global meshes associated H,h 0 with subdomains Ωj (j = 1, 2,...,J). These meshes Tj (Ω) s are different triangulations of Ω. They can be independently obtained by locally refining a H coarse grid T (Ω) in subdomains Ωj (j = 1, 2,...,J), then using some adaptive processes to ensure compatibility and shape-regularity (a mesh or triangulation is said to be compatible if the intersection of two elements is either empty, a common vertex, a common side or a common face; see, e.g., [9, 12]). These A PARALLEL ALGORITHM FOR STOKES PROBLEM 857

H,h 0 Tj (Ω) s are a fully overlapping domain decomposition of Ω; see Figure 1 for a case of J = 2.

Figure 1. A fully overlapping domain decomposition with two subdomains.

3.2. Parallel finite element algorithm

j 1 d j 2 Let XH,h(Ω) ⊂ H0 (Ω) ,MH,h(Ω) ⊂ L0(Ω) be the corresponding finite H,h element spaces associated with the meshes Tj (Ω) (j = 1, 2,...,J). Our parallel algorithm reads: Algorithm 1. Parallel finite element algorithm. H,h H,h j j 1. Find (uj , pj ) ∈ XH,h(Ω) × MH,h(Ω) (j = 1, 2,...,J) in parallel: H,h H,h H,h j j (9) a(uj , v)−b(v, pj )+b(uj , q)=(f, v), ∀(v, q) ∈ XH,h(Ω) × MH,h(Ω). h h H,h H,h 2. Set (u , p ) = (uj , pj ) in Dj (j = 1, 2,...,J). From the above algorithm, we can see that all subproblems in the algo- rithm are global problems defined in the entire domain with the majority of the degrees of freedom associated with the particular subdomain that they are responsible for, and hence can be solved using an existing sequential solver with a small investment in recoding: given an existing or black-box sequential solver, our algorithm only requires the application of the solver on the global meshes H,h Tj (Ω) (j = 1, 2,...,J). Moreover, there is no communication between pro- cessors during the solution process, and hence good parallel performance can be obtained. This is a very attractive property of our algorithm because, for a parallel algorithm, in addition to the efficient parallelization, the communi- cation cost between processors and the simplicity of its implementation must be taken into account since the merit of efficient parallelization may be can- celed out by large communication cost and extreme human effort required for writing, debugging and modifying a code based on a complex algorithm. Remark 1. Unlike those parallel algorithms where domain decomposition meth- ods are used to devise iterative methods for solving a given discretization 858 Y. SHANG scheme, Algorithm 1 is not iterative where the fully overlapping domain de- composition technique is used to design a discretization scheme. Remark 2. In Algorithm 1, each processor (or subproblem) computes a local finite element solution in its own (disjoint) subdomain using a global multiscale mesh that is locally refined around its own (enlarged) subdomain (see Figure 1 for example). Usually, the coarse grid regions have an effect on the quality of the local finite element solution in the interested subdomain. This is why we extend each disjoint subdomain with a certain distance to construct an enlarged subdomain, and use a global mesh that is fine around the enlarged subdomain to compute a local finite element solution in the disjoint subdomain, so as to alleviate the effect of coarse grid regions. Moreover, theory shows that by taking a suitable ratio of coarse mesh size H to fine mesh size h, we can obtain an asymptotically optimal error locally (see, e.g., [13, 35, 40]). 3.3. Error estimate Define piecewise norms 1/2  J  X k|u − uhk| = ku − uhk2 , 1,Ω  1,Dj  j=1 (10) 1/2  J  X k|p − phk| = kp − phk2 . 0,Ω  0,Dj  j=1 In [14], by using a two-grid method and local refinement technique, He et al. proposed a similar algorithm for the Stokes problem. Under some assumptions on the mesh and the mixed finite element spaces (namely, approximation, in- verse estimate, superapproximation and stability), they derived the following error estimate for their algorithm: h h s s+1 (11) k|u − u k|1,Ω + k|p − p k|0,Ω ≤ c(h +H )(kuks+1 +kpks), 1 ≤ s ≤ k, where h and H are the sizes of fine and coarse grids, respectively. Under similar assumptions, following the framework of [14] and making some simple modifications, we can also derive the following error estimate for our parallel algorithm: h h s s+1 (12) k|∇(u − u )k|0,Ω + k|p − p k|0,Ω ≤ c(h + H )kfks−1,Ω, 1 ≤ s ≤ k, where c = c(ν, α, Ω) and 1/2  J  X k|∇(u − uh)k| = k∇(u − uh)k2 . 0,Ω  0,Dj  j=1 Here we skip the proof details of (12); the interesting readers are referred to [14] for a reference. Instead, we put our emphasis on the numerical verification of the proposed algorithm. A PARALLEL ALGORITHM FOR STOKES PROBLEM 859

Equation (12) shows that if the ratio of coarse mesh size H to fine mesh size h is suitably chosen, Algorithm 1 can yield the same order of convergence rate as the standard Galerkin finite element method and may obtain an asymptotically optimal error for the approximate solution.

4. Numerical results In this section, we shall carry out some numerical tests to demonstrate the ef- fectiveness of our parallel algorithm. The numerical experiments are performed on a Dawning parallel cluster composed of 16 nodes, each with 2.0 GHz CPU, 2G×8 DRAM and connected together by 20Gbps InfiniBand. The Message Passing Interface (MPI) is used for data exchanges between processors. 4.1. Analytical solution

In this example, the solution domain Ω = [0, 1] × [0, 1] ⊂ R2. We set f and the boundary conditions such that the exact solution of the generalized Stokes equations is given by 2 2 u1 = 10x (x − 1) y(y − 1)(2y − 1), 2 2 (13) u2 = −10y (y − 1) x(x − 1)(2x − 1), p = 10(2x − 1)(2y − 1). The unstructured meshes consist of triangular elements; see Figure 1. The MINI mixed finite elements are used. Firstly, in order to test the asymptotic errors provided by our algorithm, we divide Ω = [0, 1] × [0, 1] into four disjoint subdomains

D1 = (0, 1/2) × (0, 1/2),D2 = (0, 1/2) × (1/2, 1),

D3 = (1/2, 1) × (0, 1/2),D4 = (1/2, 1) × (1/2, 1), and then extend each Dj (j = 1, 2, 3, 4) with an extra layer of size H to con- struct Ωj (j = 1, 2, 3, 4); see Figure 2.

D2 D4

D1 D3

Figure 2. A decomposition of Ω into subdomains.

−2 We compute the finite element solutions on meshes with√ size h = n (n = 1 6, 8, 10, 12, 14, 16) and corresponding H satisfying H = 4 h, where α = 1000 860 Y. SHANG

Table 1. Errors of the approximate solutions: α = 1000, ν = 0.01, MINI elements.

h h k|∇(u−u )k|0,Ω k|p−p k|0,Ω Method h H NbT NbV CPU(s) u 1 rate p 2 rate k∇uk0,Ω kpk0,Ω H L Algorithm 1 1/36 1/24 2078 1094 2.15 0.177252 7.36155e-04 1/64 1/32 5426 2795 4.5 0.0933046 2.02289e-04 1.1153 2.24509 1/100 1/40 11434 5829 12.72 0.0502795 8.32615e-05 1.38537 1.98911 1/144 1/48 21776 11035 42.52 0.0279435 4.02857e-05 1.61093 1.99096 1/196 1/56 36798 18584 135.93 0.0175232 2.39248e-05 1.51364 1.69016 1/256 1/64 60108 30282 337.36 0.0115908 1.51299e-05 1.54765 1.71588 Standard FE 1/36 - 3112 1629 3.15 0.169594 5.96814e-04 1/64 - 9674 4966 13.73 0.0920593 1.78959e-04 1.06189 2.09336 1/100 - 24058 12230 99.04 0.045929 6.58539e-05 1.55805 2.24008 1/144 - 48962 24770 435.47 0.0272096 3.36659e-05 1.43572 1.84003 1/196 - 90474 45630 1416.02 0.0173638 1.88225e-05 1.45696 1.88592 1/256 - 154628 77827 3948 0.0111088 1.0479e-05 1.67244 2.19306 and ν = 0.01. The corresponding linear algebraic system is solved by LU factorization. The numerical results are listed in Table 1 (top), where the CPU time is the wall clock time of the parallel program which includes the mesh generation time, the time spent on solving the problem and computing the error H,h time. In Table 1, NbT is the minimum number of elements in Tj (Ω) (j = 1, 2,...,J) and NbV the corresponding number of vertices. Let (u, p) be the exact solution of the generalized Stokes equations and (uh, ph) be the approximate solution computed by our parallel finite element algorithm. According to the mixed finite element spaces we chosen and the relationship between the mesh sizes H and h, i.e., H = O(h1/2), by (12), we have h h (14) k|∇(u − u )k|0,Ω + k|p − p k|0,Ω ≈ ch. The results shown in Table 1 support the above estimate. Actually, our nu- merical test shows a better order of convergence rate than theory predicted. To compare our parallel algorithm with the standard finite element method (i.e., the corresponding serial version of our parallel algorithm on the whole domain), in Table 1 (bottom), we also listed the errors of the standard finite element solutions computed on a quasi-uniform unstructured mesh with size h, in the same software and bardware environment. From Table 1, we can see that our algorithm yielded a convergence rate of the same order as the standard finite element method. The accuracy of the computed solutions by our parallel algorithm and the standard finite element method is very comparable. How- ever, our parallel algorithm saves a large amount of CPU time, especially, when the scale of computation is big. It saves 31.7%, 67.2%, 87.1%, 90.2%, 90.4% and 91.5% CPU time when h = 1/36, 1/64, 1/100, 1/144, 1/196 and 1/256, respectively. When h = 1/100, 1/144, 1/196 and 1/256, the computational A PARALLEL ALGORITHM FOR STOKES PROBLEM 861

Table 2. Errors of the approximate solutions: α = 1000, h = 1/100, H = 1/40.

h h k|∇(u−u )k|0,Ω k|p−p k|0,Ω Method ν CPU(s) Kdiv k∇uk0,Ω kpk0,Ω Algorithm 1 0.1 12.65 0.0226763 1.20165e-04 4.57258e-07 0.01 12.72 0.0502795 8.32615e-05 9.7653e-07 0.001 12.72 0.183681 1.05554e-04 3.58243e-06 0.0001 11.92 0.275122 1.21988e-04 5.54328e-06 0.00001 11.43 0.290692 1.24595e-04 5.96312e-06 Standard FE 0.1 99.36 0.0212084 8.00298e-05 6.14478e-07 0.01 99.04 0.045929 6.58539e-05 7.77365e-07 0.001 99.19 0.172309 7.49327e-05 1.47614e-06 0.0001 100.79 0.261903 8.6294e-05 1.80486e-06 0.00001 101.4 0.277283 8.84194e-05 1.87375e-06 time is much less than the expected ideal time that 75% could be saved since four subdomains (or processes) are used by our parallel algorithm. Secondly, to check the influence of the parameter ν on the accuracy of the solution, we set α = 1000, h = 1/100,H = 1/40, and then compute the solution by our parallel algorithm and the standard finite element method with ν = 10−n (n = 1, 2, 3, 4, 5), respectively. Table 2 reports the numerical results, in which Z h Kdiv = max | div u dx| H,h K∈T (D ) j j K j=1,2,...,J which measures the approximation of the incompressibility of the fluid. The results show that with different ν, there is no obvious difference in error of the solutions and approximation of the incompressibility of the fluid between our parallel algorithm and the standard finite element method. An interesting finding is that as the value of the parameter ν decreases, both the accuracy of the velocity and the approximation of incompressibility of the fluid slowly decrease, while that of the pressure has no obvious change. As mentioned in Section 2, the parameter α in equation (1) plays a role of the inverse of time-step size when the generalized Stokes problem results from the time discretization of non-stationary Stokes equations or Navier-Stokes equations. In our numerical tests, we also compute the finite element solutions with different values of the parameter α to test the robustness of our parallel algorithm. Table 3 reports the numerical results which shows that there is no obvious difference among small and moderate values of α both for our parallel algorithm and the standard finite element method. However, as α varies from 1000 to 10000, the accuracy of the approximate velocity was greatly improved. To further investigate this phenomenon and relationship between α and h, we compute the finite element solutions with α = 0.1 × 1/h, 1/h, 10 × 1/h, 1/h2, 862 Y. SHANG

Table 3. Errors of the approximate solutions: ν = 0.01, h = 1/100, H = 1/40.

h h k|∇(u−u )k|0,Ω k|p−p k|0,Ω Method α CPU(s) Kdiv k∇uk0,Ω kpk0,Ω Algorithm 1 0 12.65 0.0601913 7.47231e-05 1.3809e-06 1 12.6 0.0601712 7.47272e-05 1.38029e-06 10 13.15 0.0600445 7.47462e-05 1.37497e-06 100 13.14 0.0588533 7.49844e-05 1.32406e-06 1000 12.72 0.0502795 8.32615e-05 9.7653e-07 10000 12.64 0.029718 1.93024e-04 7.69132e-07 Standard FE 0 99.15 0.0540983 6.57799e-05 1.08859e-06 1 100.69 0.0540878 6.57792e-05 1.08806e-06 10 100.66 0.0539892 6.57696e-05 1.08295e-06 100 100.28 0.0530346 6.56834e-05 1.03427e-06 1000 99.04 0.045929 6.58539e-05 7.77365e-07 10000 99.2 0.0273947 7.82808e-05 7.20375e-07

Table 4. Errors of the computed velocity with ν = 0.01 and different values of α and h.

h H α = 0.1 × 1/h α = 1/h α = 10 × 1/h α = 1/h2 α = 10 × 1/h2 1/36 1/24 0.440764 0.414238 0.274619 0.154732 0.0761833 1/64 1/32 0.138473 0.1340757 0.105351 0.0578962 0.0408333 1/100 1/40 0.0600445 0.0588533 0.0502795 0.029718 0.0260313 1/144 1/48 0.0305499 0.0301885 0.0273758 0.0183488 0.0175911 1/196 1/56 0.0181929 0.0180572 0.0169662 0.0126677 0.0128094 1/256 1/64 0.011724 0.0116734 0.0112453 0.0092267 0.00956686 and 10×1/h2, respectively, with different values of h. The numerical results are shown in Table 4 which shows that the bigger the α, the better the computed velocity. Noting that α plays the role of inverse of time-step size, this can be explained well by the theory of discretization methods for time-dependent Stokes or Navier-Stokes equations. Finally, to investigate the parallel performance of our parallel algorithm, we decompose the solution domain into 1 × 2, 2 × 2, 2 × 3, 3 × 4 and 4 × 4 sub- domains, respectively, then assign each processor one subdomain to compute the corresponding solution by Algorithm 1. Table 5 reports the wall clock time of the parallel program and the corresponding speedup and parallel efficiency computed by T (2) 2 × T (2) (15) S = ,E = , p T (J) p J × T (J) where J ≥ 2 is the number of processors, T (2) and T (J) are the wall time of the parallel program with 2 and J processors, respectively. From Table 5, we A PARALLEL ALGORITHM FOR STOKES PROBLEM 863

T (2) Table 5. Wall time T (J) in seconds, speedup Sp = T (J) and 2×T (2) parallel efficiency Ep = J×T (J) : α = 1000, ν = 0.01.

J h = 1/196,H = 1/56 h = 1/256,H = 1/64 NbT NbV T(J) Sp Ep NbT NbT T(J) Sp Ep 2 53864 27187 260.87 1.0 1.0 92542 46595 699.49 1.0 1.0 4 36798 18584 135.93 1.92 0.96 60108 30282 337.36 2.07 1.04 6 29115 14730 75.95 3.43 1.14 46419 23421 203.47 3.44 1.15 8 25799 13067 64.96 4.02 1.00 40706 20557 169.89 4.19 1.05 12 21178 10744 36.75 7.10 1.18 32793 16584 90.19 7.76 1.29 16 19220 9760 25.71 10.15 1.27 28986 14673 70.23 9.96 1.24 can see that our parallel algorithm has good parallel performance. Superlinear speedups and parallel efficiencies bigger than one have been observed. 4.2. Backward-facing step problem The second example we consider here is the backward-facing step problem defined on [−4, 16] × [−1, 2]. The flow region and boundary conditions are shown in Figure 3(a). We decompose the domain into five disjoint subdomains Dj (j = 1,..., 5) (see Figure 3(b)), then extend each Dj (j = 1,..., 5) outside with an extra layer of size H to get Ωj (j = 1,..., 5). We decompose the domain into five subdomains just because that we want the backward-facing step corner of the channel to lie on the interface of two subdomains and let the subdomains be in equal size with respect to x-coordinate. The meshes consist of triangular elements and the mesh sizes are set as h = 1/24,H = 1/12. The body force f = 0 and the kinematic viscosity ν = 0.001. The Taylor-Hood mixed finite elements are used.

(a) (b)

Figure 3. (a) The backward-facing step flow problem and (b) nonoverlapping domain decomposition.

To compare methods, we compute a reference solution by the standard Galerkin finite element method on a mesh with size h = 1/24. Like [11], we compare the computed velocities and pressures at locations of 1/2 and 1/4 of the channel length (here x = 1 and x = 6). Figures 4-6 plot the computed results with α = 100, 1000 and 10000, respectively, from which we can see that there are only very small differences between the computed velocities by our algorithm and the standard finite element method around y = 0.1 and near 864 Y. SHANG

2 2 2

Standard FE (x=1) 1.5 1.5 Present (x=1) 1.5 Standard FE (x=6) Standard FE (x=1) Present (x=6) Present (x=1) Standard FE (x=1) 1 1 1 Standard FE (x=6) Present (x=1) Present (x=6) Standard FE (x=6) Present (x=6) y 0.5 y 0.5 y 0.5

0 0 0

−0.5 −0.5 −0.5

−1 −1 −1 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 −0.14 −0.12 −0.1 −0.08 −0.06 −0.04 −0.02 0 0.02 400 450 500 550 600 650 700 u velocity u velocity pressure 1 2

(a) u1-velocity (b) u2-velocity (c) pressure

Figure 4. Comparison of computed results at x = 1 and x = 6 for backward-facing step flow with α = 100.

2 2 2

1.5 1.5 1.5

1 Standard FE (x=1) 1 1 Standard FE (x=1) Present (x=1) Present (x=1) Standard FE (x=6) Standard FE (x=1) Standard FE (x=6) y 0.5 Present (x=6) y 0.5 Present (x=1) y 0.5 Present (x=6) Standard FE (x=6) Present (x=6) 0 0 0

−0.5 −0.5 −0.5

−1 −1 −1 0 0.1 0.2 0.3 0.4 0.5 −0.12 −0.1 −0.08 −0.06 −0.04 −0.02 0 0.02 3500 4000 4500 5000 5500 6000 u −velocity u −velocity pressure 1 2

(a) u1-velocity (b) u2-velocity (c) pressure

Figure 5. Comparison of computed results at x = 1 and x = 6 for backward-facing step flow with α = 1000.

2 2 2

1.5 1.5 1.5

Standard FE (x=1) Present (x=1) 1 1 1 Standard FE (x=6) Standard FE (x=1) Present (x=6) Present (x=1) y 0.5 Standard FE (x=6) y 0.5 Standard FE (x=1) y 0.5 Present (x=6) Present (x=1) Standard FE (x=6) 0 0 Present (x=6) 0

−0.5 −0.5 −0.5

−1 −1 −1 0 0.05 0.1 0.15 0.2 0.25 0.3 0.35 0.4 −0.08 −0.06 −0.04 −0.02 0 1.5 2 2.5 3 u −velocity u −velocity pressure 4 1 2 x 10

(a) u1-velocity (b) u2-velocity (c) pressure

Figure 6. Comparison of computed results at x = 1 and x = 6 for backward-facing step flow with α = 10000. y = 2 when x = 1. The agreement indicates the effectiveness of our parallel n 2 algorithm. Figure 7 describes the evolution of kinetic energy kuhk0/2 with α, while Figures 8-10 depict the contours of computed velocity and pressure, com- pared with those computed the standard finite element method, with different A PARALLEL ALGORITHM FOR STOKES PROBLEM 865

8

Standard FE 7 Present

6

5 Kinetic energy 4

3

2 0 2000 4000 6000 8000 10000 α

Figure 7. Kinetic energy of backward-facing step flow with different values of α. values of α. It is noted that the vertical lines at x = 0, 4, 8, 12 in the left sub- figures of Figures 8-10 are the artificial boundaries of the disjoint subdomains Dj (j = 1,..., 5). This test case illustrated the effectiveness of the proposed algorithm. It is shown that the parallel algorithm proposed in this paper has no problem in handling such backward-facing step problems.

4.3. Flow around a cylinder In this test case, we consider the benchmark problem of 2D channel flow around a cylinder. The problem is defined on Ω = [0, 6] × [0, 1], where a circle of radius 0.15 centers at (x, y) = (1, 0.5). The inflow velocity profile are given by

u1(x, y) = 6y(1 − y), u2(x, y) = 0, and the boundary condition of the outlet is set as ∂u −pI + = 0, ∂n while no-slip boundary conditions are imposed on the top and bottom of the channel as well as the surface of the cylinder. The Taylor-Hood elements are used for finite element discretization. The viscosity is set as ν = 0.001 and the mesh size is h = 1/32,H = 1/16, where the solution domain is decomposed into two subdomains. Table 6 lists the com- puted kinetic energies with different values of α, compared with those computed by the standard finite element method on a mesh with size h = 1/32, while Figure 11-Figure 13 describe the contours of velocity and pressure computed by our parallel algorithm and the standard finite element method, which show that the proposed algorithm in this paper predicts the flow structures well and demonstrate the effectiveness of the algorithm. 866 Y. SHANG

(a) α = 0

(b) α = 1

(c) α = 10

(d) α = 102

(e) α = 103

(f) α = 104

Figure 8. Contours of the u1 component of velocity for backward-facing step flow: Algorithm 1(left) and standard fi- nite element method (right). A PARALLEL ALGORITHM FOR STOKES PROBLEM 867

(a) α = 0

(b) α = 1

(c) α = 10

(d) α = 102

(e) α = 103

(f) α = 104

Figure 9. Contours of the u2 component of velocity for backward-facing step flow: Algorithm 1 (left) and standard finite element method (right). 868 Y. SHANG

(a) α = 0

(b) α = 1

(c) α = 10

(d) α = 102

(e) α = 103

(f) α = 104

Figure 10. Contours of the computed pressure for backward- facing step flow: Algorithm 1 (left) and standard finite element method (right). A PARALLEL ALGORITHM FOR STOKES PROBLEM 869

(a) α = 0

(b) α = 1

(c) α = 10

(d) α = 102

(e) α = 103

(f) α = 104

Figure 11. Contours of the u1 component of velocity for flow around a cylinder: Algorithm 1 (left) and standard finite ele- ment method (right). 870 Y. SHANG

(a) α = 0

(b) α = 1

(c) α = 10

(d) α = 102

(e) α = 103

(f) α = 104

Figure 12. Contours of the u2 component of velocity for flow around a cylinder: (a) Algorithm 1; and (b) standard finite element method. A PARALLEL ALGORITHM FOR STOKES PROBLEM 871

(a) α = 0

(b) α = 1

(c) α = 10

(d) α = 102

(e) α = 103

(f) α = 104

Figure 13. Contours of the computed pressure for flow around a cylinder: (a) Algorithm 1; and (b) standard finite element method. 872 Y. SHANG

Table 6. Computed kinetic energies for flow around a cylinder.

Method α = 0 α = 1 α = 10 α = 100 α = 1000 α = 10000 Algorithm 1 3.68425 3.20653 3.13231 3.1178 3.11698 3.11064 Standard FE 3.68425 3.2065 3.13187 3.11553 3.11324 3.10661

5. Conclusions Using a particular overlapping domain decomposition technique, we pro- posed a parallel finite element algorithm for the generalized Stokes problem in this work. This algorithm is easy to implement on top of existing sequential software and exhibits good parallel performance. It can yield an approximate solution with an accuracy comparable to that of the standard Galerkin finite el- ement solution. Numerical tests demonstrated the effectiveness of the proposed algorithm.

References [1] R. Adams, Sobolev Spaces, Academic Press Inc, New York, 1975. [2] M. Ainsworth and S. Sherwin, Domain decomposition preconditioners for p and hp finite element approximations of Stokes equations, Comput. Methods Appl. Mech. Engrg. 175 (1999), no. 3-4, 243–266. [3] R. E. Bank, Some variants of the Bank-Holst parallel adaptive meshing paradigm, Com- put. Vis. Sci. 9 (2006), no. 3, 133–144. [4] R. E. Bank and M. Holst, A new paradigm for parallel adaptive meshing algorithms, SIAM J. Sci. Comput. 22 (2000), no. 4, 1411–1443. [5] R. E. Bank and P. K. Jimack, A new parallel domain decomposition method for the adaptive finite element solution of elliptic partial differential equations, Concurrency Computat.: Pract. Exper. 13 (2001), 327–350. [6] J. H. Bramble and J. E. Pasciak, A domain decomposition technique for Stokes problems, Appl. Numer. Math. 6 (1990), no. 4, 251–261. [7] M. A. Casarin, Schwarz preconditioners for the spectral element discretization of the steady Stokes and Navier-Stokes equations, Numer. Math. 89 (2001), no. 2, 307–339. [8] V. Dolean, F. Nataf, and G. Rapin, Deriving a new domain decomposition method for the Stokes equations using the Smith factorization, Math. Comp. 78 (2009), no. 266, 789–814. [9] H. C. Elman, D. J. Silvester, and A. J. Wathen, Finite Elements and Fast Iterative Solvers with Applications in Incompressible Fluid Dynamics, Oxford University Press, Oxford, 2005. [10] P. F. Fischer, N. I. Miller, and H. M. Tufo, An overlapping Schwarz method for spectral element simulation of three-dimensional incompressible flows, Parallel solution of partial differential equations (Minneapolis, MN, 1997), 159–180, IMA Vol. Math. Appl., 120, Springer, New York, 2000. [11] D. K. Gartling, A test problem for outflow boundary conditions-flow over a backward- facing step, Internat. J. Numer. Methods Fluids 11 (1990), 953–967. [12] V. Girault and P. A. Raviart, Finite Element Methods for Navier-Stokes Equations: Theory and Algorithms, Springer-Verlag, Berlin, Heidelberg, 1986. [13] Y. N. He, J. C. Xu, and A. H. Zhou, Local and parallel finite element algorithms for the Navier-Stokes problem, J. Comput. Math. 24 (2006), no. 3, 227–238. A PARALLEL ALGORITHM FOR STOKES PROBLEM 873

[14] Y. N. He, J. C. Xu, A. H. Zhou, and J. Li, Local and parallel finite element algorithms for the Stokes problem, Numer. Math. 109 (2008), no. 3, 415–434. [15] M. Q. Jiang and P. L. Dai, A parallel nonoverlapping domain decomposition method for Stokes problems, J. Comput. Math. 24 (2006), no. 2, 209–224. [16] H. H. Kim, E. T. Chung, and C. S. Lee, FETI-DP preconditioners for a staggered discontinuous Galerkin formulation of the two-dimensional Stokes problem, Comput. Math. Appl. 68 (2014), no. 12, 2233–2250. [17] H. H. Kim and C. O. Lee, A two-level nonoverlapping Schwarz algorithm for the Stokes problem without primal pressure unknowns, Internat. J. Numer. Methods Eng. 88 (2011), no. 13, 1390–1410. [18] H. H. Kim, C. O. Lee, and E. H. Park, A FETI-DP formulation for the Stokes problem without primal pressure components, SIAM J. Numer. Anal. 47 (2010), no. 6, 4142–4162. [19] A. Klawonn, An optimal preconditioner for a class of saddle point problems with a penalty term, SIAM J. Sci. Comput. 19 (1998), no. 2, 540–552. [20] A. Klawonn and L. F. Pavarino, Overlapping Schwarz methods for mixed linear elasticity and Stokes problems, Comput. Methods Appl. Mech. Engrg. 165 (1998), no. 1-4, 233– 245. [21] , A comparison of overlapping Schwarz methods and block preconditioners for saddle point problems, Numer. Linear Algebra Appl. 7 (2000), no. 1, 1–25. [22] J. Li, A dual-primal FETI method for incompressible Stokes equations, Numer. Math. 102 (2005), no. 2, 257–275. [23] J. Li and X. Tu, A nonoverlapping domain decomposition method for incompressible Stokes equations with continuous pressures, SIAM J. Numer. Anal. 51 (2013), no. 2, 1235–1253. [24] J. Li and O. Widlund, BDDC algorithms for incompressible Stokes equations, SIAM J. Numer. Anal. 44 (2006), no. 6, 2432–2455. [25] W. F. Mitchell, A parallel using the full domain partition, Electron. Trans. Numer. Anal. 6 (1997), 224–233. [26] , The full domain partition approach to distributing adaptive grids, Appl. Numer. Math. 26 (1998), no. 1-2, 265–275. [27] J. E. Pasciak, Two domain decomposition techniques for Stokes problems, Domain de- composition methods (Los Angeles, CA, 1988), 419–430. SIAM, Philadelphia, 1989. [28] L. F. Pavarino and O. B. Widlund, Iterative substructuring methods for spectral element discretizations of elliptic systems II: Mixed methods for linear elasticity and Stokes flow, SIAM J. Numer. Anal. 37 (2000), no. 2, 375–402. [29] , Balancing Neumann-Neumann methods for incompressible Stokes equations, Comm. Pure Appl. Math. 55 (2002), no. 3, 302–335. [30] A. Quarteroni and A. Valli, Numerical Approximation of Partial Differential Equations, Spring-Verlag, Berlin, 1994. [31] E. M. Rønquist, Domain decomposition methods for the steady Stokes equations, Pro- ceedings of the 11th International Conference on Domain Decomposition Methods in Greenwich, England. DDM.org, 1999. Available at www.ddm.org/DD11/index.html. [32] Y. Q. Shang, A parallel subgrid stabilized finite element method based on fully overlap- ping domain decomposition for the Navier-Stokes equations, J. Math. Anal. Appl. 403 (2013), no. 2, 667–679. [33] , Parallel defect-correction algorithms based on finite element discretization for the Navier-Stokes equations, Comput. & Fluids 79 (2013), 200–212. [34] Y. Q. Shang and Y. N. He, Parallel finite element algorithm based on full domain partition for stationary Stokes equations, Appl. Math. Mech. (English Ed.) 31 (2010), no. 5, 643–650. [35] , Parallel iterative finite element algorithms based on full domain partition for the stationary Navier-Stokes equations, Appl. Numer. Math. 60 (2010), no. 7, 719–737. 874 Y. SHANG

[36] P. Le Tallec and A. Patra, Non-overlapping domain decomposition methods for adaptive hp approximations of the Stokes problem with discontinuous pressure fields, Comput. Methods Appl. Mech. Engrg. 145 (1997), no. 3, 361–379. [37] X. Tu, A three-level BDDC algorithm for a saddle point problem, Numer. Math. 119 (2011), no. 1, 189–217. [38] X. Tu and J. Li, A unified dual-primal finite element tearing and interconnecting ap- proach for incompressible Stokes equations, Internat. J. Numer. Methods Eng. 94 (2013), no. 2, 128–149. [39] C. M. Wang, A preconditioner for FETI-DP method of Stokes problem with mortar-type discretization, Math. Prob. Eng. 2013 (2013), Art. ID 485628, 1–11. [40] J. C. Xu and A. H. Zhou, Local and parallel finite element algorithms based on two-grid discretizations, Math. Comp. 69 (2000), no. 231, 881–909. [41] , Local and parallel finite element algorithms based on two-grid discretizations for nonlinear problems, Adv. Comput. Math. 14 (2001), no. 4, 293–327.

Yueqiang Shang School of Mathematics and Statistics Southwest University Chongqing, 400715, P. R. China E-mail address: [email protected]