<<

¢¢¢¢¢¢¢¢¢¢¢ ¢¢¢¢¢¢¢¢¢¢¢ ¢¢¢¢¢¢¢¢¢¢¢ ¢¢¢¢¢¢¢¢¢¢¢ Eidgen¨ossische Ecole polytechnique f´ed´erale de Zurich ¢¢¢¢¢¢¢¢¢¢¢ ¢¢¢¢¢¢¢¢¢¢¢ ¢¢¢¢¢¢¢¢¢¢¢ ¢¢¢¢¢¢¢¢¢¢¢ Technische Hochschule Politecnico federale di Zurigo ¢¢¢¢¢¢¢¢¢¢¢ ¢¢¢¢¢¢¢¢¢¢¢ ¢¢¢¢¢¢¢¢¢¢¢ ¢¢¢¢¢¢¢¢¢¢¢ Z¨urich Swiss Federal Institute of Technology Zurich

Fully Discrete hp-Finite Elements: Fast Quadrature

J.M. Melenk, K. Gerdes∗ and C. Schwab

Research Report No. 99-15 September 1999

Seminar f¨urAngewandte Mathematik Eidgen¨ossische Technische Hochschule CH-8092 Z¨urich Switzerland

∗Department of Mathematics, Chalmers University, SE-412 96 G¨oteborg, Sweden Fully Discrete hp-Finite Elements: Fast Quadrature

J.M. Melenk, K. Gerdes∗ and C. Schwab

Seminar f¨urAngewandte Mathematik Eidgen¨ossische Technische Hochschule CH-8092 Z¨urich Switzerland

Research Report No. 99-15 September 1999

Abstract

A fully discrete hp finite element method is presented. It combines the features of the standard hp finite element method (conforming Galerkin Formulation, variable order quadrature schemes, geometric meshes, static condensation) and of the spectral element method (special shape functions and spectral quadrature techniques). The speed-up (relative to standard hp elements) is analyzed in detail both theoretically and computationally.

Keywords: hp finite element method, spectral element method, numerical integration

Subject Classification: Primary: 65N30, 65N35 Secondary: 65N50

∗Department of Mathematics, Chalmers University, SE-412 96 G¨oteborg, Sweden 1 Introduction

The (FEM) is today the most widely used discretization technique in solid and fluid mechanics. The classical approach still dominant in today’s industrial applications is to partition the domain into many small subdomains of diameter O(h) and to approximate the unknown solution by a piecewise polynomial of low order p (typically in applications, 1 ≤ p ≤ 3). Convergence is achieved through mesh refinement, i.e., by letting h → 0. In the early eighties, the so-called p-version FEM emerged where the polynomial degree p →∞on a fixed mesh in order to produce convergent sequences of FE-solutions (see, e.g., [2, 24] and the references there). Very closely related to the p-version FEM are spectral methods and Spectral Element Methods developed at the same time (cf. [10, 17, 5, 18, 13, 14, 3] together with the references therein), which may be viewed as weighted collocation methods. It was noted early on that some of the Spectral Methods could be interpreted as p-version FEMs with certain types of quadrature. Being collocation methods, Spectral methods are much faster than the standard p-version FEM as the generation of the stiffness is considerably cheaper. Additionally, various fast transform techniques have been devised to speed up the generation of the stiffness matrix. Many problems in fluid and solid mechanics have piecewise analytic solutions. For the approximation of such functions, the h-version, p-version FEM, and the Spectral methods can only lead to algebraic rates of convergence. In the context of the FEM, however, it is possible to combine the h and the p-version FEM into the hp-FEM where mesh refinement and an increase of the polynomial degree p can be done simultaneously. The proper combination of geometric mesh refinement towards the singularities and increase of the polynomial degree in regions where the solution is smooth (analytic) then leads to exponential rates of convergence (see [2, 19]). Commercial FEM codes with p and hp capabilities were developed since the mid 80ies. We mention here the pioneering code PROBE by B. Szab´oand his co-workers for heat transfer and elasticity problems, now superseded by STRESSCHECK, [21], the code Stripe, [20], and PHLEX, [22]. A general 2- D hp-FEM code for various problems from solid and fluid mechanics was developed in the late 80ies by [6, 7]. There, general meshes consisting of triangles and/or quadrilaterals and even certain types of irregular (“hanging”) nodes were admissible, which facilitate mesh refinement and allow the construction of geometrically refined meshes in a simple manner. More recently an improved implementation of this hp-FEM code has been presented in [8]. As mentioned, the key to the success of the hp-FEM is the ability to combine p refinement with mesh refinement. In contrast to this, the (original) is based on a fixed partitioning of the computational domain into (curvilinear) rectangles (hyper cubes in more than 2 dimensions) and then increasing the spectral order. Among the spectral element community, the observed need for mesh refinement has lead to various non-conforming approaches such as the use of “mortar elements” (see, e.g., [4] and the references therein), which also exhibit hp-like features. From an approximation theory point of view, the flexibility of the hp-FEM to combine h and p refinement makes it clearly superior to more traditional h-, p- version FEMs and Spectral methods. The popularity of h-FEM is due to simple stiffness matrix generation and available fast multilevel solvers (e.g., [11]), that of Spectral methods is due to the

1 availability of very efficient techniques for the stiffness matrix generation and precondi- tioners. In contrast, the perception of hp-FEMs is that they are computationally expensive in their stiffness matrix generation and that fast solution algorithms do not seem to exist, although domain decomposition methods [15], combined with parallelization, [16], can be very effective.

This paper is devoted to the development of hp-spectral Galerkin FEMs which unify and generalize “standard” hp-FEM technology and spectral methods. The most prominent features shared with hp-FEM are the decoupling of the quadrature rule from the ele- mental polynomial degrees and the ability to employ mesh refinement via hanging nodes. In the solution process, local static condensation and the element stiffness matrices are obtained completely in parallel. Geometry representation is uncoupled from the ansatz space, accomodating exact representations of the geometry via blending methods and the use of hp-isoparametric elements that approximate the geometry. Spectral techniques, are well-known to speed up the element stiffness matrix generation considerably. The main idea in spectral methods is combining Gauss-Lobatto quadrature with the use of shape functions that are Lagrange interpolation polynomials in the quadrature points. In spectral methods, the quadrature order is thus linked to the elemental polynomial degree and Gauss-Lobatto quadrature implies underintegration even for affine element maps in 2-D. Here, we generalize the spectral quadrature to polynomial degrees p and quadrature rules of order p + q, q ≥ 0 arbitrary. We show that the work per element is O(p4(1 + q)) and O(qp6 + p5) for problems in R2, R3 respectively (see Table 5 ahead for the details) 1. This is to be compared with O(p3d) for problems in Rd for the standard quadrature algorithm. We also show numerically that this speed-up is already visible for spectral orders p in the range of 5 ≤ p ≤ 10. We finally mention that similar algorithms have been proposed in, e.g., [12] for meshes based on triangles/tetrahedra; the hp convergence analysis of these algorithms will be the topic of a future paper. The outline of our presentation is as follows: In Section 2, the spectral Galerkin Algorithm (Algorithm 2.12) is presented. For clarity of exposition, the algorithm is presented for a 2D scalar model problem. The extension to higher dimensions is discussed in Section 3.1. Section 3.2 demonstrates for a 3D hexahedral element that Algorithm 2.12 does indeed realize significant computational savings already at moderate polynomial degrees p. Some concluding remarks finish the paper.

2 hp Spectral Quadrature Algorithm

2.1 Model problem

We will illustrate the hp-spectral Galerkin algorithm for the following scalar model prob- lem in two dimensions

−∇ · (A(x)∇u)=f(x) on Ω ⊂ R2,u= 0 on ∂Ω (2.1)

1here and in what follows the implied constant in the O(·) notation is independent of both p and q

2 where the matrix A is symmetric, positive definite, and piecewise smooth. The weak form 1 is: Find u ∈ H0 (Ω) such that

1 a(u, v)=l(v) ∀v ∈ H0 (Ω) where the bilinear form a and the right hand side l are given by

a(u, v)= ∇v · (A(x)∇u) dx dy, l(v)= fv dx dy. !Ω !Ω We restrict here our attention to the model problem (2.1) for simplicity of notation. The hp-spectral Galerkin algorithm adapts straight forwardly for equations involving lower order terms and vector valued problems.

2.2 hp-FEM

In the hp-FEM the domain Ω is partitioned into elements K; each element K is the image of the master element Kˆ = (0, 1)2 under the (bijective) element map

2 FK : Kˆ = (0, 1) → K. (2.2)

The elements K are collected into the triangulation T = {K}. The hp FE-space is then a space of continuous, mapped polynomials:

1 ˆ S = {u ∈ H0 (Ω) | u|K ◦ FK ∈ SpK (K)} (2.3) ˆ where the spaces SpK (K) are spaces of polynomials of degree pK on the master element Kˆ . The subscript K in pK indicates that we allow for variable polynomial degree (see ˆ Section 2.4 for a detailed description of the spaces SpK (K)). The hp-FEM then reads: find u ∈ S such that a(u, v)=l(v) for all v ∈ S. The bilinear form a and the right hand side functional l can be written as a sum of element integrals and the element maps FK provide a means to express all integrals over elements K as integrals of the master element Kˆ : find u ∈ S s.t.

aK(u, v)= lK (v) ∀v ∈ S K∈T K∈T " " 2 ∂uˆ ∂vˆ aK(u, v)= ∇uˆ · (Aˆ∇vˆ)dξ1dξ2 = Aˆij dξ1dξ2 ˆ ˆ ∂ξ ∂ξ K K i,j=1 i j ! ! " 2 ij =: aK (u, v) i,j=1 "

lK(u, v)= fˆvdˆ ξ1dξ2 ˆ !K ˆ " −T " −1 A =(FK ) (A ◦ FK )(FK ) JK (2.4)

fˆ = f ◦ FK JK (2.5)

uˆ = u ◦ FK, vˆ = v ◦ FK

3 " N ˆ with JK = detFK . If {ϕi}i=1 is a basis of SpK (K), then the

N N AK := (aK(ϕj, ϕi))i,j=1 ,LK := (lK (ϕi))i=1 are the element stiffness matrix AK and the element load vector LK . In the assembly process, these element stiffness matrices (and load vectors) are assembed into the global linear system which can then be solved. Especially in the context of iterative solvers, it may be advantageous to combine several elements into a mesh patch (also known as substructure or subdomain) and first subassemble the elements of the mesh patch before assembling the patches. The generic hp Galerkin FEM algorithm reads as follows:

Algorithm 2.1 (hp Galerkin FEM) loop over all mesh patches within each mesh patch loop over all elements generate element stiffness matrix AK (and element load vector LK) (optional) condense inner elemental dof assemble elemental dof into mesh patch stiffness matrix endloop over elements in mesh patch (optional) condense inner mesh patch dof assemble mesh patch dof into the global stiffness matrix endloop over all mesh patches solve global system backsolve for inner mesh patch and inner element dof (optional)

Several comments on Algorithm 2.1 are in order. The main amount of (alphanumerical) work arises in the generation of AK and LK , due to the numerical quadrature unless the FK are affine in which case the AK can be precomputed once and for all. In general, however, curved domains entail complicated element maps FK and numerical integration is required. Additionally, the evaluation of the element maps FK may be expensive and isoparametric or “quasi-regional” mappings, [1], are employed instead. For an error analysis of such approximate mappings we refer to [9]. The most elementary quadrature algorithm is

Algorithm 2.2 (standard Galerkin element quadrature algorithm) initialize AK =0 for all quadrature points

for all basis functions ϕ of SpK

for all basis functions ψ of SpK add contribution of aK(ϕ, ψ) at this quadrature point to AK end end end

It is easy to see that in two dimensions for a quadrature rule with O(p2) points and polynomials of degree p, Algorithm 2.2 requires O(p6) work. Fig. 1 shows the CPU time spent on various components of Algorithm 2.1 for the polynomial degree p ranging from

4 Element CPU time 1 10 standard total standard int. element mod. static cond. assemblage 0 10

!1 10 seconds

!2 10

!3 10

!4 10 0 1 10 10 approximation order

Figure 1: CPU time per quadrilateral element for standard element algorithms for scalar model problem.

p = 1 to p = 8 and the tensor product space Qp, the space of polynomials of degree p in each variable:

i j 2 Qp(Kˆ ) = span {x y | 0 ≤ i, j ≤ p}, dimQp(Kˆ ) = (p + 1) . (2.6)

We clearly see that indeed the majority of the work is spent on the numerical quadrature and not on other tasks such as condensation, enforcement of constraints, etc. Accordingly, the remainder of this paper is devoted to the development and analysis of various forms of accelerated quadrature routines. We propose to use Algorithm 2.12 below for the generation of AK. One of the advantages of this algorithm is that it only affects the so-called internal degrees of freedom; this implies that Algorithm 2.12 could easily be incorporated in existing hp-FEM codes.

Remark 2.3 We formulated Algorithm 2.1 in the spirit of classical hp-FEM based on direct solvers by using the notion of static condensation, mesh patches, etc. We point out, however, that Algorithm 2.1 can also be employed in an environment where the global solution process is done with iterative techniques. Even a “completely iterative” scheme where condensation is not performed at all, is possible. Akin to spectral and spectral element methods, Algorithm 2.12 can be reformulated to realize a matrix vector multiplication rather than to generate explicitly the (local) stiffness matrices.

2.3 Element Stiffness Matrix Generation

The goal of the present subsection is to derive Algorithm 2.12, the Spectral Galerkin Ele- ment Algorithm, unifying the standard Galerkin FEM and the Spectral Element Method. We will only describe the construction of the element stiffness matrix AK. The element

5 load vector LK can be constructed using the same ideas but will have to account for possibly non-smooth data f(x). In order to derive Algorithm 2.12, we start by recalling Gauss-Lobatto quadrature in Sub- section 2.3.1. Next, we describe a modified sum factorization technique (Algorithm 2.5). Algorithm 2.10, which is at the heart of Algorithm 2.12, can then be understood as a degenerate form of Algorithm 2.5. As mentioned above, we permit elements with gen- eral polynomial degree. However, in order to illustrate the complexity of the algorithms that we will describe below, we will always give an operation count for the case that ˆ ˆ SpK (K)=Qp(K).

2.3.1 Gauss-Lobatto Quadrature

To evaluate the entries of AK , one generally has to use quadrature rules. We will use tensor product Gauss-Lobatto quadrature rules. We recall that in one dimension, the q+1 Gauss- q " Lobatto points GL = {x0, . . . , xq} are the zeros of the polynomial x +→ x(1 − x)Lq(x) where Lq stands for q-th Legendre polynomial associated with the unit interval (0, 1). It is well-known [3], that these zeros are distinct and lie in the interval [0, 1]. Furthermore, for each q one can find positive weights wi, i =0, . . . , q such that the quadrature rule

q 1 GLq(u) := wku(xk) ≈ u(x) dx 0 "k=0 ! is exact for polynomials u of degree 2q − 1, [3]. Gauss-Lobatto quadrature rules in two dimensions are then obtained by tensor product constructions. For example, quadrature q1 q2 rules GLq1,q2 (u) based on the points GL × GL are given by

q1 q2

GLq1,q2 (u) := w1,kw2,lu(x1,k,x2,l) (2.7) "k=0 "l=0 For simplicity of notation in some of the ensuing algorithms, we assume that x1,0 = 0, x1,q1 = 1, x2,0 = 0, x1,q2 = 1.

Once a quadrature rule is chosen, the evaluation of the entries of the stiffness matrix AK (and of the load vector) proceeds by replacing integrals over Kˆ by (2.7). For example, for ˆ rs ϕ, ψ ∈ SpK (K) the evaluation of a (ϕ, ψ) is performed as ∂ ∂ ars(ϕ, ψ) ≈ GL ϕ · ψ · Aˆ , r, s ∈ {1, 2}, (2.8) q1,q2 ∂ξ ∂ξ rs # r s $ where the functions Aˆrs are defined in (2.4).

2.3.2 Sum Factorization

ˆ ˆ 0 Let us assume that a basis of the space SpK (K) is given in the form SpK (K) = span{E ∪ E 1 ∪ E 2 ∪ E 3 ∪ E 4 ∪ I}, where the sets E k, I have the following tensor product structure

k k k E = {χ1,i(ξ1)χ2,j(ξ2) |, 1 ≤ i ≤ Ik, 1 ≤ j ≤ Jk},k=0,..., 4, (2.9) I I I = {χ1,i(ξ1)χ2,j(ξ2) |, 1 ≤ i ≤ II , 1 ≤ j ≤ JI}. (2.10)

6 This structure is motivated by the usual decomposition into vertex shape functions, side shape functions, and internal shape functions. In that case, the numbers Ik, Jk, II , JI are a measure for the polynomial degree associated with each of these entities. For example, for the space Qp(Kˆ ), a standard choice of these numbers is (see Section 2.4 for a more specific example)

II = JI = p − 1,I0 = J0 = 2 (2.11)

(I2,J2) = (I4,I4) = (p − 1, 1) for the sides parallel to the y-axis (2.12)

(I1,J1) = (I3,J3) = (1,p− 1) for the sides parallel to the x-axis (2.13)

The element stiffness matrix AK has a 6×6 block structure, corresponding to the interac- tion of the 6 different types of shape functions introduced in (2.9), (2.10). An algorithm that exploits the tensor product structure of the basis functions and is based on sum factorization is the following:

Algorithm 2.4 initialize AK =0 for r,s=1:2 0 1 2 3 4 for B1 ∈ {E , E , E , E , E , I} 0 1 2 3 4 for B2 ∈ {E , E , E , E , E , I} add sum fact ∂ B , ∂ B , Aˆ to the block B -B of A ∂ξr 1 ∂ξs 2 rs 1 2 K end % & end end

Here, the operators ∂ , ∂ are understood to act elementwise on the shape function ∂ξr ∂ξs blocks B1, B2. Each call to sum fact returns a dim B1 × dim B2 matrix whose entries are just those calculated by the right hand side of (2.8) for all ϕ ∈ B1, ψ ∈ B2. We now specify the function sum fact in Algorithm 2.4:

Algorithm 2.5 (sum factorization) A := sum fact(B1, B2, a) %input: 1 2 % B1 = {ϕi (ξ1)ϕj (ξ2) | 1 ≤ i ≤ I, 1 ≤ j ≤ J} 1 2 " " % B2 = {ψi (ξ1)ψj (ξ2) | 1 ≤ i ≤ I , 1 ≤ j ≤ J } % function a : Kˆ → R %output: the matrix A with entries q1 q2 1 2 1 2 %A(i,j),(i!,j!) = k=0 l=0 ϕi (x1,k)ϕj (x2,l)ψi! (x1,k)ψj! (x2,l)w1,kw2,la(x1,k,x2,l)

' ' " " " S1 := (1 + q1)(1 + q2)J · J + (1 + q1)I · I · J · J % flop count for k l " " " S2 := (1 + q1)(1 + q2)I · I + (1 + q2)I · I · J · J % flop count for 'l 'k if S1 ≤ S2 then { % choose summation order to minimize flop count " q2 2 2 ' ' compute auxiliary array H(k, j, j ) := l=0 ϕj (x2,l)ψj! (x2,l)w2,la(x1,k,x2,l), " " k ∈ {0, . . . , q1}, j ∈ {1,...,J}, j ∈ {1,...,J } q1 1 ' 1 " compute entries A(i,j),(i!,j!) := k=0 ϕi (x1,k)ψi! (x1,k)w1,kH(k, j, j ), i ∈ {1,...,I}, j ∈ {1,...,J}, i" ∈ {1,...,I"}, j" ∈ {1,...,J "} } '

7 else { " q1 1 1 compute auxiliary array H(l, i, i ) := k=0 ϕj (x1,k)ψj! (x1,k)w1,ka(x1,k,x2,l), " " l ∈ {0, . . . , q2}, i ∈ {1,...,I}, i ∈ {1,...,I } q2 2 ' 2 " compute entries A(i,j),(i!,j!) := l=0 ϕj (x2,l)ψj! (x2,l)w2,lH(l, i, i ), i ∈ {1,...,I}, j ∈ {1,...,J}, i" ∈ {1,...,I"}, j" ∈ {1,...,J "} } ' return A

Remark 2.6 Algorithm 2.5 is essentially the classical sum factorization algorithm, e.g., [17], which is used in several commercial 3-D hp-FEM code. The new feature here over classical sum factorization is the comparison of S1 and S2. S1 and S2 estimate the oper- ation count for the evaluation of the double sum as either k l or as l k, and the smaller operation count is chosen. This modification of the classical sum factorization al- ' ' ' ' gorithm is particularly suited for elements SpK with highly anisotropic polynomial degree distribution and/or anisotropic quadrature rules.

Remark 2.7 We note in passing that another potential source of further savings is that the quadrature rule could be chosen for each pair B1-B2 separately. To illustrate this point, note that a typical choice for the vertex set E 0 is the set of the 4 bilinear functions only and that the set of internal shape functions I consists of polynomials of degree p. Assuming for the moment that a ≡ 1 in Algorithm 2.5 we see that the rule GL$p/2%+1,$p/2%+1 is sufficient to calculate sum fact(E 0, I, a) whereas p + 1 points in each direction are required for the calculation in sum fact(I, I, a).

Example 2.8 (work for Algorithm 2.5) For the simplified case q1 = q2 = p+q, q ≥ 0, inspection of Algorithm 2.5 shows that the complexity is

W = (1 + p + q)2 min {I · I",J · J "} + (1 + p + q)I · I" · J · J ". (2.14) ˆ ˆ For our 2D model case SpK (K)=Qp(K), we can be more explicit. Assuming that the sets I and E k satisfy (2.11)–(2.13), (2.14) reveals that, for example, the call sum fact ( ∂ I, ∂ I, Aˆ ) takes work W = O(p4(p + q)+p2(p + q)2). Similarly, the calls sum fact ∂ξ1 ∂ξ2 12 ( ∂ I, ∂ E 1, Aˆ ) and sum fact( ∂ E 1, ∂ E 2, Aˆ ) are of complexity W = O(p3(p + q)+ ∂ξ1 ∂ξ2 12 ∂ξ1 ∂ξ2 12 p(p + q)2) and W = O(p(p + q)2 + p2(p + q)), respectively.

Remark 2.9 Clearly, Algorithm 2.5 can also be formulated for other tensor product quadrature rule, e.g., the tensor product Gauss-Legendre quadrature rule.

2.3.3 Spectral Galerkin Sum Factorization

Algorithm 2.5 only assumes that the shape functions and the quadrature scheme have tensor product structure. If, however, the shape functions and the quadrature scheme are adapted to each other, then further savings are possible. The classical spectral method may serve as an example (which we will elaborate below) where the shape functions are just the base polynomials for the Lagrange interpolation in the quadrature points. The following variant of Algorithm 2.5 allows us to exploit the resulting degeneracy.

8 Algorithm 2.10 (spectral Galerkin sum factorization) A := spec sum fact(B1, B2, a) %input: 1 2 % B1 = {ϕi (ξ1)ϕj (ξ2) | 1 ≤ i ≤ I, 1 ≤ j ≤ J} 1 2 " " % B2 = {ψi (ξ1)ψj (ξ2) | 1 ≤ i ≤ I , 1 ≤ j ≤ J } % function a : Kˆ → R %output: the matrix A with entries q1 q2 1 2 1 2 %A(i,j),(i!,j!) = k=0 l=0 ϕi (x1,k)ϕj (x2,l)ψi! (x1,k)ψj! (x2,l)w1,kw2,la(x1,k,x2,l)

'" ' " 1 1 for each pair i, i compute set K(i, i ) := {k ∈ {0, . . . , q1}|ϕi (x1,k)ψi! (x1,k) =00 } " " 2 2 for each pair j, j compute set L(j, j ) := {l ∈ {0, . . . , q2}|ϕj (x2,l)ψj! (x2,l) =00 } " " " q1 q2 S1 := (1 + q1) j,j! | L(j, j ) | + J · J · i,i! | K(i, i ) | % flop count for k l " " " q2 q1 S2 := (1 + q2) ! | K(i, i ) | + I · I · ! | L(j, j ) | % flop count for 'i,i 'j,j 'l 'k if S1 ≤ S2 then { % choose summation order to minimize flop count ' ' " 2 2 ' ' compute auxiliary array H(k, j, j ) := l∈L(j,j!) ϕj (x2,l)ψj! (x2,l)w2,la(x1,k,x2,l), 1 1 " compute entries A ! ! := ! ϕ (x )ψ ! (x )w H(k, j, j ), (i,j),(i ,j ) k∈K(i,i ) 'i 1,k i 1,k 1,k } ' else { " 1 1 compute auxiliary array H(l, i, i ) := k∈K(i,i!) ϕj (x1,k)ψj! (x1,k)w1,ka(x1,k,x2,l), 2 2 " compute entries A ! ! := ! ϕ (x )ψ ! (x )w H(l, i, i ), (i,j),(i ,j ) l∈L(j,j )'j 2,l j 2,l 2,l } ' return A

Here, | L(j, j") | and | K(i, i") | stand for the number of elements of the sets L(j, j"), K(i, i"). We note that Algorithm 2.10 does indeed reduce to Algorithm 2.5 if the shape functions " " are not related to the quadrature rule: In that case L(j, j )={0, . . . , q2}, K(i, i )= {0, . . . , q1}, and hence Algorithm 2.10 indeed coincides with Algorithm 2.5. Let us now show that Algorithm 2.10 does indeed lead to savings if the shape functions are adapted to the quadrature rule.

Example 2.11 (spectral method) The classical spectral method (i.e., the use of GLp,p in conjunction with the space Qp(Kˆ )) is a special case of Algorithm 2.10. The shape functions E k and I are assumed to satisfy (2.9), (2.10). The internal shape functions I are additionally assumed to be of the form I (p) p χ1,i(x)=li (x, N ),i=1,...,II := p − 1, I (p) p χ2,j(x)=lj (x, N ),j=1,...,JI := p − 1. p p Here, the set N = GL = {xi |,i =0, . . . , p} is the set of Gauss-Lobatto nodes and the (p) p polynomials li (x, N ), i =0, . . . , p are the Lagrange interpolation polynomials of degree p associated with the set N p and they are given by p (p) p x − xi lj (x, N ) := . (2.15) i=0 xj − xi (i"=j We note that there holds (p) p li (xj, N )=δij, i, j =0, . . . , p.

9 Let us compute the cost of the call to spec sum fact( ∂ I, ∂ E 1, Aˆ ). In doing so, we ∂ξ1 ∂ξ2 12 do not assume that the shape functions comprising E 1 are related to the quadrature rule. The sets K(i, i"), L(j, j") are then seen to be

K(i, i")={0, . . . , p},L(j, j")={j}.

We now readily compute that this function call costs W = O(p4). For the call to the function spec sum fact( ∂ I, ∂ I, Aˆ ) we calculate ∂ξ1 ∂ξ2 12 K(i, i")={i"},L(j, j")={j}.

Hence, the work estimate also reduces to W = O(p4). The reader will convince himself that for any r, s ∈ {1, 2}, k ∈ {0,..., 4}, the work estimate for the calls to the functions spec sum fact( ∂ I, ∂ I, Aˆ ) and spec sum fact( ∂ I, ∂ E k, Aˆ ), is always O(p4). ∂ξr ∂ξs rs ∂ξr ∂ξs rs

Example 2.11 is in essence the classical spectral method (there, however, the external shape functions E k are also related to the quadrature rule thus allowing for improving the 4 constant in the work estimate O(p )). Note that the use of the Gauss-Lobatto rule GLp,p leads to underintegration as even for affine elements, the stiffness matrix is not computed exactly. For our scalar problem, we will show in [9] that the rule GLp,p in conjunction with the space Qp(Kˆ ) leads to a stable method and gives, in polygons and for piecewise analytic solution, exponential rates of convergence. However, this “minimal” spectral quadrature scheme performs poorly in the presence of very distorted meshes. For vector- valued problems such as the Lam´eequations, to the knowledge of the authors, stability of the Spectral Element Method (i.e., the use of the rule GLp,p) is an open problem on meshes containing non-affine elements. These considerations make it desirable to allow for overintegration, i.e., the ability to use the rule GLp+q,p+q (q ≥ 0) in conjunction with the space Qp(Kˆ ). We will show in the ensuing subsection that this is possible with work W = O(p4(1 + q)+p2q2).

2.3.4 Transition from Spectral to Galerkin via overintegration

Following Example 2.11, we restrict our attention first to the quadrature rules GLp+q,p+q k in conjunction with the space Qp(Kˆ ). The external shape functions E are again chosen to satisfy (2.9). For a set of nodal points N p satisfying

{0, 1} ⊂ N p |Np | = p +1, (2.16) the internal shape functions I are Lagrange interpolation polynomials for this nodal set N p (cf. (2.15)) and given by

I (p) p χ1,i(x)=li (x, N ),i=1,...,II := p − 1, I (p) p χ2,j(x)=lj (x, N ),j=1,...,JJ := p − 1.

p p+q It remains to choose the nodal set N . Let GL = {xi | i =0, . . . , p + q} be the Gauss- p p+q Lobatto quadrature points for the rule GLp+q and N be any subset of GL satisfying (2.16). For such a choice of N p there holds

p+q (p) p p+q p {xj ∈ GL | li (xj, N ) =00 } = ({i} ∪ GL \N ) = 1 + q, i =1, . . . , p − 1. ) ) ) ) ) ) ) ) ) 10 ) In this situation, let us calculate the cost of evaluating spec sum fact( ∂ I, ∂ E 2, Aˆ ). ∂ξ1 ∂ξ2 12 We see that | K(i, i") | = GLp+q = |{0, . . . , p + q}|= p + q +1, " p+q p | L(j, j ) | = ) {j} ∪)(GL \N ) = 1 + q. ) ) Inspection of Algorithm 2.10 shows) * that the work estimate+ ) is then W = O((p+q)p(p2+q)). ) ) Similarly, for the call of spec sum fact( ∂ I, ∂ I, Aˆ ) we get ∂ξ1 ∂ξ2 12 | K(i, i") | = {i"} ∪ GLp+q \Np = q +1, " p+q p | L(j, j ) | = ) *{j} ∪ (GL \N+)) = q +1. ) ) Thus, the work can be bounded by W) *= O(p2(p+q)(1+q))++ ) O(p4(q+1)) = O(p4(1+q)+ ) ) p2q2)). In two dimensions, the use of the Gauss-Lobatto rule of order p + q with a fixed q ≥ 0 does therefore have the same asymptotic complexity as the pure spectral method of Example 2.11. We presented these ideas here for the case of the rule GLp+q,p+q for the space Qp(Kˆ ). Similar arguments apply for more general polynomial degree distribution and anisotropic quadrature rules.

2.3.5 Choice of the interpolation nodes N p

In the preceding subsection 2.3.4, we merely assumed that the interpolation nodes N p were a subset of the quadrature points GLp+q. In the present subsection, we discuss criteria for the selection of the interpolation set N p and suggest some specific choices. For a given nodal set N p satisfying (2.16) we consider two types of 1-D shape functions on (0, 1):

1 1 1 (p) p χ0(x) := 1 − x, χp(x) := x, χi (x) := li (x, N ),i=1, . . . , p − 1, (2.17) 2 (p) p χi (x) := li (x, N ),i=0, . . . , p. (2.18) 1 The set χi is motivated by the traditional hp-FEM codes that always include the linear 2 (bilinear/trilinear shape functions in 2D/3D) shape functions. The set χi is closer to the typical choices in spectral methods. Note that these 1-D shape functions were used in the preceding subsection 2.3.4 to define the internal shape functions–this is the main motivation for imposing (2.16). For these two sets of shape functions (2.17), (2.18), we can define the mass matrix M m (m ∈ {1, 2}) in the usual fashion: 1 m p m m Mij (N ) := χi (x)χj (x) dx, i, j =0, . . . , p +1. !0 The two mass matrices M m(N p), m ∈ {1, 2} depend of course on the nodal set N p m p through the shape functions χi , and we included the argument N in the definition of M m in order to emphasize this dependence. Given p ≥ 1, q ≥ 0, the first criterion for the selection of the subset N p of GLp+q is that the mass matrix M m (m ∈ {1, 2}) have minimal condition number. The corresponding optimal choice of the nodal set is then p,q p,q denoted by N1 for the shape functions (2.17) and N2 for the shape functions (2.18): p,q m p p p+q p Nm := arg min {cond (M (N )) |N ⊂ GL and N satisfies (2.16)},m∈ {1, 2}. (2.19)

11 p,q p,q Tables 1, 2 give the optimal sets N1 , N2 (up to symmetry); in the interest of brevity, p,q Tables 1, 2 list the indices of the sets GLp+q \Nm (m ∈ {1, 2}) rather than the indices of p,q the sets Nm . Figs. 2, 3 show the condition numbers for these optimal choices of nodes. It is noteworthy that especially in the case of the shape functions (2.17), the condition number is fairly insensitive to q, that is, it is possible to simultaneously use overintegration and adapt the shape functions to the quadrature rule without an adverse effect on the condition number. Another measure of the quality of a set of nodal points N p is its Lebesgue number Λ(N p), which is the stability constant of the corresponding nodal interpolation operator:

p p (p) Λ(N ) := sup li (x) . (2.20) x∈(0,1) i=0 ) ) " ) ) ) ) p,q Figs. 4, 5 show the Lebesgue numbers for our optimal sets Nm , m ∈ {1, 2} (the Lebesgue numbers were calculated numerically by replacing the interval (0, 1) in (2.20) with 2000 uniformly distributed points). The Gauss-Lobatto set GLp (corresponding to q = 0 in Figs. 4, 5), is essentially the best possible choice with Λ(GLp)=O(ln p), [23]. Neverthe- p,q less, in the practical regime p =2,..., 20 the Lebesgue numbers Λ(Nm ) are at most one order of magnitude away from the “optimal” value Λ(GLp). We note that in both Tables 1, 2, the optimal points are not distributed symmetrically with respect to the mid-point 1/2. For implementational purposes during the assembly procedure, it is more convenient to have symmetrically distributed nodal points N p as this leads to shape functions that are either symmetric or anti-symmetric. This motivates to restrict the minimization in (2.19) to sets N p that are symmetric with respect to the midpoint 1/2:

p,q m p Nm,sym := arg min {cond (M (N )) | N p ⊂ GLp+q, N p satisfies (2.16), N p is sym. w.r.t. 1/2 },m∈ {1, 2}.

The optimal sets are listed in Tables 3, 4 (again, Tables 3, 4 actually list the indices of p,q GLp+q \Nm,sym). Figs. 6, 7 show the corresponding condition numbers. Comparing these with Figs. 2, 3 we note that imposing a symmetry condition on the nodal sets N p has practially no effect on the conditioning of the resulting mass matrix. A similar conclusion holds for the Lebesgue numbers as can be seen by comparing Figs. 8, 9 and Figs. 4, 5.

p 2 3 4 5 6 7 8 9 10 q=0 ------q=1 1 3 3 3 3 4 4 6 6 q=2 1,3 1,4 2,4 2,5 1,6 2,7 2,9 2,9 1,10 q=3 1,3,4 1,3,5 2,4,6 2,4,6 2,4,7 2,5,8 2,5,9 2,6,10 2,7,11 q=4 1,2,4,5 1,2,4,6 1,3,5,7 2,4,6,8 2,4,6,8 2,4,7,9 2,5,7,10 2,5,8,11 2,5,10,11 q=5 1,2,4,5,6 1,2,4,6,7 1,2,4,6,8 1,3,5,7,9 2,4,6,8,10 2,4,6,8,10 2,4,7,9,11 2,4,7,9,12 2,5,7,10,13 q=6 1,2,3,5,6,7 1,2,3,5,6,8 1,3,4,6,7,9 1,3,5,7,9,10 2,3,5,7,9,10 2,3,6,8,10,11 2,4,6,8,10,12 2,3,6,9,12,13 2,5,7,9,11,14

p+q p,q Table 1: index lists of GL \N1 .

12 p 2 3 4 5 6 7 8 9 10 q=0 ------q=1 2 2 2 3 3 4 4 5 6 q=2 1,3 1,4 2,5 2,5 2,6 2,7 3,7 3,8 3,9 q=3 1,2,4 1,3,5 1,3,6 1,4,7 2,5,8 2,5,8 1,5,9 2,6,10 2,6,10 q=4 1,2,4,5 1,3,5,6 1,3,5,7 1,3,5,8 1,3,6,9 1,4,7,10 1,4,7,10 2,5,8,11 1,5,9,13 q=5 1,2,3,5,6 1,2,4,6,7 1,3,5,7,8 1,3,5,7,9 1,3,5,7,10 1,3,6,9,11 1,3,6,9,12 1,4,7,10,13 2,5,8,11,14 q=6 1,2,3,5,6,7 1,2,4,5,7,8 1,2,4,6,8,9 1,2,4,6,8,10 1,3,5,7,9,11 1,3,5,7,9,12 1,3,6,8,11,13 1,3,6,9,12,14 1,3,6,9,12,15

p+q p,q Table 2: index lists of GL \N2 .

condition number for optimal choice of nodal set 4 10

q=0 q=1 q=2 q=3 q=4 3 10 q=5 q=6 D mass matrix

! 2 10 cond. number of 1 1 10

0 10 0 1 10 10 polynomial degree p

1 p,q Figure 2: cond (M (N1 )) as a function of p =2,..., 20 and amount of overintegration q

condition number for optimal choice of nodal set 4 10

q=0 q=1 q=2 q=3 q=4 3 10 q=5 q=6 D mass matrix

! 2 10 cond. number of 1 1 10

0 10 0 1 10 10 polynomial degree p

2 p,q Figure 3: cond (M (N2 )) as a function of p =2,..., 20 and amount of overintegration q

13 Lebesgue constants for optimal choice of nodal set

q=0 q=1 q=2 q=3 q=4 q=5 q=6

1 10 Lebesgue constant

0 10 0 1 10 10 polynomial degree p

p,q Figure 4: Lebesgue number of set N1 as a function of p =2,..., 20 and amount of overintegration q

Lebesgue constants for optimal choice of nodal set

q=0 q=1 q=2 q=3 q=4 q=5 q=6 1 10 Lebesgue constant

0 10 0 1 10 10 polynomial degree p

p,q Figure 5: Lebesgue number of set N2 as a function of p =2,..., 20 and amount of overintegration q

p 2 3 4 5 6 7 8 9 10 q=0 ------q=2 1,3 1,4 2,4 2,5 2,6 2,7 2,8 2,9 1,11 q=4 1,2,4,5 1,3,4,6 1,3,5,7 1,3,6,8 2,4,6,8 2,4,7,9 2,5,7,10 2,5,8,11 2,6,8,12 q=6 1,2,3,5,6,7 1,2,4,5,7,8 1,3,4,6,7,9 1,2,4,7,9,10 2,3,5,7,9,10 2,3,5,8,10,11 2,4,6,8,10,12 2,3,6,9,12,13 2,5,7,9,11,14

p+q p,q Table 3: index list of GL \N1,sym.

14 p 2 3 4 5 6 7 8 9 10 q=0 ------q=2 1,3 1,4 1,5 2,5 2,6 2,7 3,7 3,8 3,9 q=4 1,2,4,5 1,3,4,6 1,3,5,7 1,3,6,8 1,4,6,9 1,4,7,10 1,4,8,11 2,5,8,11 1,5,9,13 q=6 1,2,3,5,6,7 1,2,4,5,7,8 1,2,4,6,8,9 1,3,5,6,8,10 1,3,5,7,9,11 1,3,5,8,10,12 1,3,6,8,11,13 1,3,6,9,12,14 1,4,7,9,12,15

p+q p,q Table 4: index list of GL \N2,sym.

condition number for optimal symmetric choice of nodal set 4 10

q=0 q=2 q=4 q=6

3 10 D mass matrix

! 2 10 cond. number of 1 1 10

0 10 0 1 10 10 polynomial degree p

1 p,q Figure 6: cond M (N1,sym) as a function of p =2,..., 20 and amount of overintegration q * +

condition number for optimal symmetric choice of nodal set 4 10

q=0 q=2 q=4 q=6

3 10 D mass matrix

! 2 10 cond. number of 1 1 10

0 10 0 1 10 10 polynomial degree p

2 p,q Figure 7: cond M (N2,sym) as a function of p =2,..., 20 and amount of overintegration q * +

15 Lebesgue constants for optimal symmetric choice of nodal set

q=0 q=2 q=4 q=6

1 10 Lebesgue constant

0 10 0 1 10 10 polynomial degree p

p,q Figure 8: Lebesgue number of set N1,sym as a function of p =2,..., 20 and amount of overintegration q

16 Lebesgue constants for optimal symmetric choice of nodal set

q=0 q=2 q=4 q=6

1 10 Lebesgue constant

0 10 0 1 10 10 polynomial degree p

p,q Figure 9: Lebesgue number of set N2,sym as a function of p =2,..., 20 and amount of overintegration q

2.3.6 Spectral Galerkin Algorithm

We are now in position to combine all the above considerations together to formulate

Algorithm 2.12. The algorithm assumes that the space SpK is of the form (2.9), (2.10). The internal shape functions I are additionally adapted to the quadrature rule chosen:

Algorithm 2.12 (spectral Galerkin Element Algorithm) initialize AK =0 k choose external shape functions E , k =0,..., 4 for SpK I I determine poly. deg. p1 := I +1, p2 := J +1 of internal shape functions h v h v choose q , q ≥ 0 and define quadrature rule GLp1+q ,p2+q h v determine sets N p1,q , N p2,q according to one of Tables 1, 2, 3, 4 h v (p1) p1,q (p2) p2,q set I = {li (x, N ) · lj (y,N ) | 1 ≤ i ≤ p1 − 1, 1 ≤ j ≤ p2 − 1} for r,s=1:2 0 1 2 3 4 for B1 ∈ {E , E , E , E , E , I} 0 1 2 3 4 for B2 ∈ {E , E , E , E , E , I} add spec sum fact ∂ B , ∂ B , Aˆ to the block B -B of A ∂ξr 1 ∂ξs 2 rs 1 2 K end % & end end

17 ! 2

^a 43a^^7 a 1

a^ 9 a^ 6 ^a 8

a^ 2 ! 0 1 ^a 1 a^ 5 1

Figure 10: Quadrilateral master element Kˆ of order p =(p1,p2,p3,p4,p5).

Remark 2.13 In Algorithm 2.12 the same quadrature rule is used in the computation of all B1-B2 interactions. Of course, the quadrature rule could be chosen individually for each combination, cf. Remark 2.7.

In Algorithm 2.12, we did not specify the external shape functions. Specific choices are the topic of the following subsection.

2.4 hp quadrilateral elements

In this section, we present explicitly the shape functions that we use in Algorithm 2.12. In particular, we will illustrate the concept of anisotropic polynomial degree distribution. We start by introducing the 1-D hierarchical shape functions of degree p of [24] as

1 h1(x) = 1 − x, h2(x)=x, hi(x)= (Li(x) − Li−2(x)) ,i=3,... (2.21) ,4(2i − 1) where we recall that the polynomials Li are the Legendre polynomials associated with the interval (0, 1) normalized to satisfy Li(1) = 1. For each quadrilateral element K ∈ T we introduce the element polynomial degree vector pK in order to describe the polynomial approximation, i.e

1 5 pK = pK, ··· ,pK , (2.22)

i * + where pK , 1 ≤ i ≤ 4, represents the polynomial degree on the i-th edge of the element, i 5 i.e. pK represents the approximation order of the mid-side nodea ˆi+4 in Figure 10. pK = h v (pK,pK) is the polynomial degree in the interior of the element which we even allow to be h v 2 anisotropic, where pK represents the horizontal and pK the vertical approximation order . We only require that

h 1 3 v 2 4 pK ≥ max pK ,pK ,pK ≥ max pK ,pK , ∀ K ∈ T . (2.23) - . - . The nodesa ˆ1, ··· , aˆ9 represent the dof corresponding to the quadrilateral master element shown in Figure 10 and are divided into three categories:

2This feature is essential in the context of thin solids.

18 1. vertex nodesa ˆ1, aˆ2, aˆ3, aˆ4: the shape functions associated with the four vertices form the set E 0; they are always taken as the set of the four bilinear functions: 0 E = {hi(x) · hj(y) | (i, j) ∈ {1, 2} × {1, 2}} (2.24)

2. middle nodea ˆ9: the shape functions associated with this node form the set I; they are always taken to be of the form

(ph) ph,qh (pv) pv,qv h v I = {li (x, N ) · lj (y,N ) | 1 ≤ i ≤ p − 1, 1 ≤ j ≤ p − 1} (2.25) here, the sets N ph,qh , N pv,qv can be any of those from Tables 1, 2, 3, 4.

3. mid-side nodesa ˆ5, aˆ6, aˆ7, aˆ8: the shape functions associated with these four nodes form the side shape functions E k, k =1,..., 4; several possible choices will be given in the following two subsections.

2.4.1 Side shape functions of type (2.17)

The set of hierarchical side shape functions of type (2.17) is given by 1 E = {hi(x) · h1(y) | 1 ≤ i ≤ p1 − 1} (2.26a) 2 E = {h2(x) · hi(y) | 1 ≤ i ≤ p2 − 1} (2.26b) 3 E = {hi(x) · h2(y) | 1 ≤ i ≤ p3 − 1} (2.26c) 4 E = {h1(x) · hi(y) | 1 ≤ i ≤ p4 − 1} (2.26d) The side shape functions may also be taken as nodal-based functions. Representative for this approach is the following set:

1 (p1) p1 E = {li (x, GL ) · h1(y) | 1 ≤ i ≤ p1 − 1} (2.27a) 2 (p2) p2 E = {h2(x) · li (y,GL ) | 1 ≤ i ≤ p2 − 1} (2.27b) 3 (p3) p3 E = {li (x, GL ) · h2(y) | 1 ≤ i ≤ p3 − 1} (2.27c) 4 (p4) p4 E = {h1(x) · li (y,GL ) | 1 ≤ i ≤ p4 − 1}. (2.27d) Remark 2.14 The nodal sets GLp1 ,..., may be replaced by other sets, e.g., the sets p,q h h v v N . This is, for example, of interest if p1 = p or p3 = p or p2 = p or p4 = p as then (at least some of) the side shape functions are adapted to the quadrature as well, which is then automatically exploited by Algorithm 2.10.

2.4.2 side shape functions of type (2.18)

The side shape functions can of course also be taken of the form (2.18). Specifically, the hierarchical version is given by 1 (pv) pv,qv E = {hi(x) · l0 (y,N ) | 1 ≤ i ≤ p1 − 1} (2.28a) 2 (ph) (ph),qh E = {lph (x, N ) · hi(y) | 1 ≤ i ≤ p2 − 1} (2.28b) 3 (pv) pv,qv E = {hi(x) · lpv (y,N ) | 1 ≤ i ≤ p3 − 1} (2.28c) 4 (ph) ph,qh E = {l0 (x, N ) · hi(y) | 1 ≤ i ≤ p4 − 1} (2.28d) (2.28e)

19 h h v v where the sets N p ,q , N p ,q are now chosen from Table 2 or 4. Here, ph, pv, qh, qv are determined by the internal shape functions and the quadrature rule. Note that these side shape functions are (at least partially) adapted to the quadrature rule which is automatically exploited by Algorithm 2.10. Analogously, the version using Gauss-Lobatto side shape functions is given by

v v v 1 (p1) p1 (p ) p ,q E = {li (x, GL )l0 (y,N ) | 1 ≤ i ≤ p1 − 1} (2.29a) h h h 2 (p ) p ,q (p2) p2 E = {lph (x, N )li (y,GL ) | 1 ≤ i ≤ p2 − 1} (2.29b) v v v 3 (p3) p3 (p ) p ,q E = {li (x, GL )lpv (y,N ) | 1 ≤ i ≤ p3 − 1} (2.29c) h h h 4 (p ) p ,q (p4) p4 E = {l0 (x, N )li (y,GL ) | 1 ≤ i ≤ p4 − 1}. (2.29d)

Again, as in the case (2.27), the sets GLp1 ,..., GLp4 could be replaced with sets of the form N p,q.

Remark 2.15 Note that the vertex shape functions are always taken as the bilinear shape functions. This is motivated by implementational issues. Clearly, other choices are possible.

3 Extensions to 3D

3.1 Work estimates in 3D

The ideas presented for the 2D model problem can be extended to 3D. Table 5 collects the work estimates for the 2D and the 3D variants of the standard quadrature Algorithm 2.2, the sum factorization Algorithm 2.5, and the spectral-Galerkin Algorithm 2.10. In this subsection, we will briefly point at some of the differences between 2D and 3D and illus- trate how the work estimates of Table 5 were obtained.

A common decomposition of the spaces SpK is into vertex shape functions, edge shape functions, face shape functions, and internal shape functions with the following two prop- erties: first, all shape functions have tensor product structure and may be represented in the following fashion (note that a 3D hexahedron has 8 vertices, 12 edges, and 6 faces for a total of 26 geometric entities)

k m m m E = {χ1,i(ξ1)χ2,j(ξ2)χ3,k(ξ3) |, 1 ≤ i ≤ Im, 1 ≤ j ≤ Jm, 1 ≤ k ≤ Km},m=1,..., 26, I I I I = {χ1,i(ξ1)χ2,j(ξ2)χ3,k(ξ3) |, 1 ≤ i ≤ II , 1 ≤ j ≤ JI, 1 ≤ k ≤ KI}.

Second, the numbers Im, Jm, Km satisfy

∀m ∈ {1,..., 26} at least one of the numbers Im, Jm, Km is equal to 1. (3.30)

In order to fix ideas, we will again consider for our work estimates the case Qp(Kˆ ). The estimates, however, are valid for the general case as well. In that case, the values Im, Jm,

20 Km, and II, JI , KI would typically satisfy the following conditions:

for the 8 vertices, m ∈ {1,..., 8}: Im = Jm = Km =1

for the 12 edges, m ∈ {9,..., 20}: exactly one of the three numbers Im, Jm, Km equals p − 1 and the other two equal 1

for the 6 faces, m ∈ {21,..., 26}: exactly two of the three numbers Im, Jm, Km equal p − 1 and the remaining one equals 1

for I: II = JI = KI = p − 1. Since the shape functions have tensor product structure, the sum factorization Algo- rithm 2.5 can be employed to evaluate the stiffness matrix with an operation count of 4 3 O(p (p + q) ) (the use of GLp+q,p+q,p+q is assumed). Before proceeding to the analysis of the 3D analogon of Algorithm 2.12 we point out that the calculation of all the E m-E n blocks of the stiffness matrix is done with O(p4(p + q)+p2(p + q)2 + p(p + q)3) work by Algorithm 2.5. This operation count is due to the fact that our version of sum fac- torization re-orders the triple sums of the quadrature rule so as to minimize the work. Assumption (3.30) ensures that there is always one ordering of the triple sums which yields this operation count (which is O(p5) for fixed q). If no reordering is performed, it is easy to see that sum factorization techniques perform the calculation of the E m-E n blocks in O(p6) complexity. This re-ordering is therefore essential for the efficiency of our hp-spectral Galerkin algorithm in 3D. We now turn to the work estimates for the I-E and I-I blocks with internal shape functions that are adapted to the quadrature rule. In the 3-D version of Algorithm 2.10, 6 possible reorderings of the triple quadrature sum have to be checked. However, in order to obtain work estimates for the 3-D version of Algorithm 2.10, it suffices to consider the floating point operation count for one particular, judiciously chosen ordering. We illustrate this with an example, the calculation of spec sum fact( ∂ I, ∂ E 21,a). Here, ∂ξ1 ∂ξ2 21 E is a set of face shape functions satisfying I21 = J21 = p − 1, K21 = 1. We evaluate the triple quadrature sum as

I 21 " " " ! ! ! Aijk,i j k = χ2,j(x2,q2 ) χ2,j! (x2,q2 )H1(q2, k, k , i, i ), q2 " * + " " I 21 " where H1(q2, k, k , i, i )= χ3,k(x3,q3 )χ3,k! (x3,q3 )H2(q2,q3, i, i ), q "3 " I " 21 and H2(q2,q3, i, i )= χ1,i (x1,q1 )χ1,i! (x1,q1 )a(x1,q1 ,x2,q2 ,x3,q3 )w1,q1 w2,q2 w3,q3 . q1 " * + 2 2 Generating the auxiliary array H2 can be achieved with work W (H2)=O((p+q) p (p+q)). Next, as the internal shape functions are adapted to the quadrature rule, the sum in the definition of H1 has only q + 1 non-vanishing entries for each k, and we arrive at 3 W (H1)=O((p + q)p (1 + q)) for the generation of H1. Computing the final sum, we can again exploit that the internal shape functions are adapted to the quadrature rule to obtain W = O(p3p2(1+q)). The total cost is therefore W (A)=O(p5(1+q)+q2p3 +q3p2). We note that the essential feature of the above summation order is that the two outer sums have only 1 + q non-zero terms and that only the innermost sum has p + q + 1 non- zero terms; for our second-order model problem, the triple quadrature sums can always be arranged in this way for the I-E blocks.

21 Let us consider now the calculation of the I-I block. We consider spec sum fact( ∂ I, ∂ I,a). ∂ξ1 ∂ξ1 This is evaluated as

I I " " ! ! ! Aijk,i j k = χ2,j(x2,q2 )χ2,j! (x2,q2 )H1(q2, k, k , i, i ), q "2 " " I I " H1(q2, k, k , i, i )= χ3,k(x3,q3 )χ3,k! (x3,q3 )H2(q2,q3, i, i ) q "3 " I " I " H2(q2,q3, i, i )= χ1,i (x1,q1 ) χ1,i! (x1,q1 )a(x1,q1 ,x2,q2 ,x3,q3 )w1,q1 w2,q2 w3,q3 . q1 " * + * + 3 2 We check that generating the auxiliary array H2 costs W (H2)=O((p + q) p ). Next, for p+q the evaluation of the array H1 it is convenient to write the set of quadrature points GL p+q p,q p+q p,q as GL = N ∪ GL \N and express H1 as

* I + I I I H1 = χ3,k(x)χ3,k! (x)H2 + χ3,k(x)χ3,k! (x)H2 x∈N p,q p+q p,q " x∈GL"\N " I I = δk,k! H2(q2, k, i, i )+ χ3,k(x)χ3,k! (x)H2. p+q p,q x∈GL"\N The first term can be evaluated with work W = O((p + q)p3). The sum has only q terms 4 and it costs W = O(q(p + q)p ) flops. The cost for the evaluation of H1 is therefore 4 5 2 4 W (H1)=O(p + qp + q p ). Completely analogously, we conclude that the outermost sum is performed with work W = O(qp6 + p5). The total cost for the I − I block is therefore W (A)=O(qp6 + p5 + q2p4 + q3p2). Hence, the (p + 1)6 entries of the stiffness matrix can be computed in optimal complexity O(p6).

Remark 3.1 It is interesting to note that the case of “minimal” quadrature, q = 0, is special: The leading order term qp6 vanishes, and the complexity reduces again to O(p5) 3. This reduction of the complexity for q = 0 is due to the fact that adapting the internal shape functions to the minimal quadrature rule introduces a significant amount of sparsity in the I-I block of the stiffness matrix as only O(p5) entries of this block of size O(p6) are non-zero.

We collect the work estimates for the standard Algorithm 2.2, the sum factorization Algorithm 2.5, and our Algorithm 2.12 for the quadrature rule GLp+q,p+q,p+q in conjunction with the space Qp(Kˆ ) in Table 5. For Algorithm 2.12, we included in Table 5 only the work for the computation of the non-zero entries of the stiffness matrix.

3.2 Work estimates in 3D: numerical example

In the present section, we illustrate the quadrature speed-up of Algorithm 2.12 over both the standard quadrature Algorithm 2.2 and the sum factorization Algorithm 2.5. All

3 6 The work estimates in Table 5 ignore the initialization phase AK = 0, which is also an O(p ) 5 algorithm. To obtain a truely O(p ) algorithm for the case q = 0, a special storage format for AK has to be used, that stores the non-zero entries only. From a practical point of view, one can assume that 6 initializing AK = 0 is done very efficiently by the operating system so that the O(p ) contribution is negligible in the practical regime of polynomial degrees p.

22 algorithm 2D 3D standard O(p4(p + q)2) O(p6(p + q)3) sum fact. O(p4(p + q)+p2(p + q)2) O(p6(p + q)+p4(p + q)2 + p2(p + q)3) spect.-Gal. O(p4(1 + q)+p2q2) O(qp6 + p5 + q2p4 + q3p2)

Table 5: Asymptotic flop estimates for various quadrature schemes with overintegration of order q ≥ 0 three algorithms were used to generate the stiffness matrix for a 3D scalar Poisson prob- lem with zero-order term (hence, the mass matrix has to be generated also) with variable coefficients for the space Qp(Kˆ ), p =1,..., 9. The actual shape functions are taken as the 3D analogs of (2.24) for the vertex shape functions, of (2.25) for the internal shape functions, and of (2.26) for the edges and faces. The stiffness matrix calculation was done with the 3D-version of the hp-code HP90, [8], a general hp-code written in Fortran 90; actual calculations were performed on a SUN UltraSparc running at 336Mhz using the manufacturer provided F90 compiler with optimization flag (“-fast”) set. In Fig. 11 we compare three algorithms: the standard Algorithm 2.2 (with Gauss-Legendre quadrature with (p + 1)3 points), the sum factorization Algorithm 2.5 (again with Gauss-Legendre quadrature with (p + 1)3 points) and Algorithm 2.12 with Gauss-Lobatto quadrature rule GLp,p,p (i.e., q = 0). For comparison reasons, we additionally show the time needed to condense out the internal degrees of freedom as for environments where static con- densation is performed as part of the element stiffness matrix generation routine, this determines a lower bound for the speed-up. Fig. 11 in fact contains the timings for two different condensation schemes: the faster of the two (labelled “blas3 condensation”) uses the manufacturer-optimized LAPACK routine dposv and the BLAS3 routine dgemm. For the element generation, the Spectral-Galerkin algorithm outperforms both the sum- factorization and the standard algorithm: For p = 9, the stiffness matrix is generated in 228 secs. by the standard algorithm, in 46 secs. by the sum factorization and in only 19.6 secs. by the Spectral-Galerkin algorithm. The regular static condensation takes 28 secs. whereas the optimized one takes 2.7 secs. Thus, for our scalar model problem and the range p =1,..., 9, our Spectral-Galerkin algorithm does provide a means to significantly speed up the element CPU-time: compared with (even our improved version of) the sum factorization algorithm, the Spectral-Galerkin algorithm generates the condensed element matrix by a factor 2 faster (46 + 2.7 secs. vs. 19.6 + 2.7 secs.). Comparing the slopes of the curves in Fig. 11, we also see that the Spectral-Galerkin algorithm is of lower complexity than the sum factorization algorithm showing that our asymptotic complexity analysis above is valid for small p as well. A detailed break-down of the timings for the generation of the different blocks of the stiffness matrix for the sum factorization technique and the Spectral-Galerkin Algorithm is done in Figs. 12, 13. Adapting the internal shape functions to the quadrature rule has the biggest impact on the timings for the I-I block (the “bubble-bubble” block): Whereas for the sum factorization (cf. Fig. 12) it is the most expensive part, it is the cheapest for the Spectral-Galerkin algorithm in the range p = 2—9 (cf. Fig. 13). Some speed-up is also visible for the generation of the bubble-face block. Note, however, that the timings here are given for our sum factorization Algorithm 2.5, which is an improved version of the classical algorithm in that re-ordering of the quadrature sums is performed. Otherwise, the

23 Comparison of different algorithms 1000

100 spectral Galerkin sum factorization standard algorithm static condensation 10 BLAS3 condensation

1

0.1 seconds

0.01

0.001

0.0001

1e-05 1 10 100 1000 Degrees of Freedom

Figure 11: CPU time per hexahedral element for standard, sum factorization, and spectral-Galerkin algorithm for scalar 3D model problem; also included: standard con- densation and condensation based on optimized LAPACK and BLAS routines. discrepancy between the classical sum factorization algorithm and the Spectral-Galerkin algorithm for the bubble-face block would have been bigger. As the Spectral-Galerkin algorithm reduces to sum factorization techniques for the E n-E m blocks, the timings for the blocks not involving the internal shape functions are identical. Fig. 14 finally shows the timings for various auxiliary and book-keeping procedures: the evaluation of the 1-D shape functions in the quadrature points, dealing with constraint nodes, and the computation of the right hand side are seen to be negligible compared to the quadrature. In our algorithms, we did not focus on the cost of evaluating the geometry and the coefficient matrix at the quadrature points. In our calculations, the actual element map was a trilinear function and the coefficient matrix A was very simple, thus leading to an operation count of O((p + q)3) for sampling these data in the quadrature points. The timings for this part of the element computation are given in Fig. 14 and labelled “eval. Jacobian”. We point out that this part of the computation does depend strongly on the actual element maps (or approximations thereof) and the coefficients of the differential equations.

4 Conclusions

We have presented and discussed quadrature techniques for hp-FEM in two and three dimensional elliptic problems. The main idea consisted in the design of shapefunctions that are adapted to the quadrature rule and in the classical sum-factorization used. A special case is the spectral element method which is well-established in computational

24 detailed timings for sum factorization 100 total time vertex-vertex edge-vertex edge-edge 10 face-vertex face-edge face-face bubble-vertex bubble-edge 1 bubble-face bubble-bubble

0.1 seconds

0.01

0.001

0.0001 1 10 100 1000 Degrees of Freedom

Figure 12: Break-down of timing for sum factorization algorithm for scalar 3D model problem.

detailed timings for spectral Galerkin algorithm 100 total time vertex-vertex edge-vertex edge-edge 10 face-vertex face-edge face-face bubble-vertex 1 bubble-edge bubble-face bubble-bubble

0.1

seconds 0.01

0.001

0.0001

1e-05 1 10 100 1000 Degrees of Freedom

Figure 13: Break-down of timing for Spectral-Galerkin algorithm for scalar 3D model problem.

25 miscellaneous auxiliary function timings 1

eval. shape fcts eval. Jacobian reorder matrix right-hand side enforce constraints 0.1

0.01 seconds

0.001

0.0001 1 10 100 1000 Degrees of Freedom

Figure 14: Timings for auxiliary functions sum factorization/spectral Galerkin algorithm for scalar 3D model problem.

fluid dynamics. The spectral element method is well-known, however, to underintegrate the element stiffness matrices which is known to cause stability problems if elliptic systems are thus discretized. We generalized therefore the ideas in the spectral element method by decoupling the quadrature order from the elemental degree, as is customary in Galerkin hp-FEM. Lagrange shapefunctions based on a suitable subset of the Gauss-Lobatto nodes were proposed. New choices of nodal sets for these functions were given based on the criterion of small condition number for the resulting stiffness matrices. The condition numbers thus obtained were found to be marginally worse than those of the spectral element method, while the CPU-times for the element stiffness matrix generation were comparable to those of the spectral element method. The new algorithm was found to outperform the standard Galerkin scheme with sum-factorization. For polynomial degrees between 1 and 10, static condensation was found to be considerably cheaper than the quadrature for the stiffness matrix generation. The ideas in the present paper have natural generalizations to triangles and tetrahedra which will be presented elsewhere. Acknowledgement: We thank Philipp Frauenfelder for implementing the 3D quadrature rules in the code HP90, [8].

References

[1] R. Actis, B. Szab´o, C. Schwab, Hierarchic models for laminated plates and shells, Comput. Meths. Appl. Mech. Eng. vol. 172 (1999), p.79–107

[2] I. Babuˇska and M. Suri. The p and hp Versions of the Finite Element Methods, Basic Principles and Properties. SIAM review 36 (1994) 578-632.

26 [3] C. Bernardi and Y. Maday. Approximations spectrales de probl`emes aux limites elliptiques. Springer, 1992. [4] F. Ben Belgacem, Y. Maday. A spectral element methodology tuned to parallel im- plementations. Comput. Meths. Appl. Mech. Eng. 116 (1994), p. 59–67. [5] C. Canuto, M. Y. Hussaini, A. Quarteroni and T. A. Zhang. Spectral Methods in Fluid Dynamics, Springer Verlag, 1986. [6] L. Demkowicz, J.T. Oden, W. Rachowicz and O. Hardy. Toward a Universal hp Adaptive Finite Element Strategy. Part 1: Constrained Approximation and Data Structure. Computer Methods in Applied Mechanics and Engineering, 77 (1989), 79-112. [7] L. Demkowicz, W. Rachowicz, K. Banas and J. Kucwaj. 2-D hp Adaptive Package (2DhpAP). Technical Report, Section of Applied Mathematics, Technical University of Cracow, Poland, 1992. [8] L. Demkowicz, K. Gerdes, C. Schwab, A. Bajer and T. Walsh. HP90: A general and flexible Fortran 90 hp-FE code. Computing and Visualization in Science 1 (1998) 145-163. [9] K. Gerdes, J.M. Melenk, and C. Schwab. Fully Discrete hp-FEM: Error Analysis. (in preparation) [10] D. Gottlieb and S.A. Orszag, Numerical Analysis of Spectral Methods: Theory and Applications, CBMS 26, SIAM, 1977. [11] W. Hackbusch. Multigrid Methods and Applications. Springer Verlag, 1985. [12] G.E. Karniadakis and S.J. Sherwin, Spectral/hp Element Methods for CFD, Oxford University Press, 1999 [13] K.Z. Korczak and A.T. Patera, An Isoparametric Spectral Element Method for So- lution of the Navier-Stokes Equations in Complex Geometry. J. Comput. Phys. 62 (1986), p. 361–382 [14] Y. Maday and A.T. Patera. Spectral Element Method for Navier-Stokes Equations. In: State of the Art Surveys in Computational Mechanics (A.K. Noor and J.T. Oden, eds.), (1989), p. 71–143 [15] P. Le Tallec and A. Patra. Non-overlapping domain decomposition methods for adap- tive hp approximations of the Stokes problem with discontinuous pressure fields. Comput. Methods Appl. Mech. Engrg. 145 (1997) 361-379. [16] J. T. Oden, A. Patra and Y. Feng. Parallel Domain Decomposition Solver for Adap- tive hp Finite Element Methods. SIAM J. Numer. Anal. vol. 34 (1997) 2090-2118. [17] S.A. Orszag, spectral methods for problems in complex geometries, J. Comput. Phys., vol. 37 (1980), 70–92. [18] A.T. Patera, Spectral Element Method for Fluid Dynamics: Laminar Flow in a Channel Epansion, J. Comput. Phys. 54 (1984), p. 568–488

27 [19] C. Schwab. The p and hp FEM. Oxford University Press, 1998.

[20] STRIPE, Flygtekniska F¨ors¨oksanstalten, Box 11021, S-16111 Bromma, Sweden.

[21] Stresscheck, ESRD Inc., 10845 Olive Boulevard, Suite 170, St. Louis, MO 63141-7760, http://www.esrd.com

[22] PHLEX, Computational Mechanics Company, Inc. 7800 Shoal Creek Blvd, Suite 290E, Austin, TX 78757-1024, http://www.comco.com

[23] Burkhard S¨undermann. Lebesgue Constants in Lagrangian Interpolation at the Fekete points. Ergebnisberichte der Lehrst¨uhleMathematik III und VIII (Ange- wandte Mathematik) 44, Universit¨at Dortmund, 1980.

[24] B. Szab´oand I. Babuˇska. Finite Element Analysis. Wiley 1991.

28 Research Reports

No. Authors Title

99-15 J.M. Melenk, K. Gerdes, Fully Discrete hp-Finite Elements: Fast C. Schwab Quadrature 99-14 E. S¨uli,P. Houston, hp-Finite Element Methods for Hyperbolic C. Schwab Problems 99-13 E. S¨uli,C. Schwab, hp-DGFEM for Partial Differential Equations P. Houston with Nonnegative Characteristic Form 99-12 K. Nipp Numerical integration of differential algebraic systems and invariant manifolds 99-11 C. Lage, C. Schwab Advanced boundary element algorithms 99-10 D. Sch¨otzau, C. Schwab Exponential Convergence in a Galerkin Least Squares hp-FEM for Stokes Flow 99-09 A.M. Matache, C. Schwab Homogenization via p-FEM for Problems with Microstructure 99-08 D. Braess, C. Schwab Approximation on Simplices with respect to Weighted Sobolev Norms 99-07 M. Feistauer, C. Schwab Coupled Problems for Viscous Incompressible Flow in Exterior Domains 99-06 J. Maurer, M. Fey A Scale-Residual Model for Large-Eddy Simulation 99-05 M.J. Grote Am Rande des Unendlichen: Numerische Ver- fahren f¨urunbegrenzte Gebiete 99-04 D. Sch¨otzau, C. Schwab Time Discretization of Parabolic Problems by the hp-Version of the Discontinuous Galerkin Finite Element Method 99-03 S.A. Zimmermann The Method of Transport for the Euler Equa- tions Written as a Kinetic Scheme 99-02 M.J. Grote, A.J. Majda Crude Closure for Flow with Topography Through Large Scale Statistical Theory 99-01 A.M. Matache, I. Babuˇska, Generalized p-FEM in Homogenization C. Schwab 98-10 J.M. Melenk, C. Schwab The hp Streamline Diffusion Finite Element Method for Convection Dominated Problems in one Space Dimension 98-09 M.J. Grote Nonreflecting Boundary Conditions For Elec- tromagnetic Scattering 98-08 M.J. Grote, J.B. Keller Exact Nonreflecting Boundary Condition For Elastic Waves 98-07 C. Lage Concept Oriented Design of Numerical Software 98-06 N.P. Hancke, J.M. Melenk, A Spectral for Hydrody- C. Schwab namic Stability Problems