Florida State University Libraries

Electronic Theses, Treatises and Dissertations The Graduate School

2008 Realtime Computing with the Parareal Algorithm Christopher R. Harden


FLORIDA STATE UNIVERSITY

COLLEGE OF ARTS AND SCIENCES

REALTIME COMPUTING WITH THE PARAREAL ALGORITHM

By

CHRISTOPHER R. HARDEN

A Thesis submitted to the School of Computational Science in partial fulfillment of the requirements for the degree of Master of Science

Degree Awarded: Spring Semester, 2008

The members of the Committee approve the Masters Thesis of Christopher R. Harden defended on April 8, 2008.

Janet Peterson Professor Directing Masters Thesis

Max Gunzburger Committee Member

Robert Van Engelen Committee Member

Approved:

Max Gunzburger, Director Department of School of Computational Science

The Office of Graduate Studies has verified and approved the above named committee members.

This thesis is dedicated to all of the people who have helped and guided me throughout my research including but not limited to Janet Peterson, Max Gunzburger, John Burkardt, Robert Van Engelen, and the many other professors who have provided me with an excellent level of instruction throughout all of my course work here at FSU. Also, I would like to dedicate this work to my wife, Jennifer Alligood, and my son, Youth, for their infinite depth of understanding of my situation as a graduate student and for their unending support throughout this endeavor.

ACKNOWLEDGEMENTS

I would like to acknowledge my gratitude to my committee members, who have taken the time to review my work and to pose the difficult questions which have kept me honest and thus have allowed me to grow as an academic throughout this process. A special acknowledgment is in order for my advisor who took a chance in bringing me under her wing and teaching me how to become a researcher and to John Burkardt, who took a lot of time out of his schedule to teach me many of the tools that were necessary for me to be able to complete this work. Also, some thanks are owed to Clayton Webster for all of his TEX support.

TABLE OF CONTENTS

List of Tables

List of Figures

Abstract

1. INTRODUCTION

2. The Parareal Algorithm
   2.1 The Basic Algorithm
   2.2 A Simple Example
   2.3 Comments on Some Mathematical Properties of the Parareal Algorithm

3. The Finite Element Method and The Parareal Algorithm
   3.1 A Finite Element Method
   3.2 The Finite Element Method and the Parareal Algorithm for Nonlinear PDEs

4. Combining the Parareal Algorithm and Reduced Order Modeling
   4.1 Reduced Order Modeling with Proper Orthogonal Decompositions
   4.2 Reduced Order Modeling and The Parareal Algorithm
   4.3 Implementation

5. Computational Experiments and Results
   5.1 FEM and The Parareal Algorithm Results
   5.2 ROM and The Parareal Algorithm Results

6. Performance Analysis and Scalability
   6.1 Introduction to performance analysis concepts and terminology
   6.2 Problem Parameters in our FEM and ROM Parareal Implementations
   6.3 Strong Scaling Trends of the Parareal Algorithm

7. Conclusions and Future Work
   7.1 Conclusions
   7.2 Future Work

REFERENCES

BIOGRAPHICAL SKETCH

LIST OF TABLES

5.1 Comparison of errors using standard finite element approach and the parareal/FEM approach
5.2 Speedup results for the parareal/FEM approach compared to the serial FEM approach
5.3 Comparison of errors for 4-parameter problem using standard ROM approach and the parareal/ROM algorithm. Errors are calculated by comparing to the full finite element solution.
6.1 Results of FEM Test Case 1
6.2 Results of FEM Test Case 2
6.3 Results of FEM Test Case 3
6.4 Results of FEM Test Case 4
6.5 Results of FEM Test Case 5
6.6 Results of FEM Test Case 6
6.7 Results of ROM Test Case 1
6.8 Results of ROM Test Case 2

LIST OF FIGURES

2.1 Illustration of the coarse and fine grids
2.2 B2, Exact Solution
2.3 B2, Parareal Solution on the Coarse Grid After 2 iterations of the Correction Scheme
2.4 B1, Phase Portrait
2.5 B1, Coarse Grid Solution and Divergent Refined Solution
4.1 The H-cell domain of the building ventilation problem
5.1 Speedup of Parareal/FEM Implementation: Blue-∆t = 0.01 and Red-∆t = 0.005
5.2 Speedup of Parareal/ROM Implementation: Blue-∆t = 0.01 and Red-∆t = 0.005
5.3 The H-cell domain of the building ventilation problem, with boundary parameters illustrated
5.4 Speedup of Parareal/ROM Implementation of the Navier-Stokes Problem: Blue-∆t = 0.01 and Red-∆t = 0.005
6.1 Suite A, Speedup vs. Processors
6.2 Speedup of Parareal/FEM with h = 0.1, ∆t = 0.005, and T = 1
6.3 Speedup of Parareal/FEM with h = 0.1, ∆t = 0.001, and T = 1
6.4 Speedup of Parareal/FEM with h = 0.05, ∆t = 0.005, and T = 1
6.5 Speedup of Parareal/FEM with h = 0.05, ∆t = 0.001, and T = 1
6.6 Speedup of Parareal/FEM with h = 0.1, ∆t = 0.001, and T = 10
6.7 Speedup of Parareal/FEM with h = 0.05, ∆t = 0.001, and T = 10
6.8 Speedup of Parareal/ROM with h = 0.1, ∆t = 0.005, and T = 1
6.9 Speedup of Parareal/ROM with h = 0.1, ∆t = 0.001, and T = 10

ABSTRACT

This thesis presents and evaluates a particular algorithm used for the real time computation of time dependent ordinary and partial differential equations, which employs a parallelization strategy over the temporal domain. We also discuss the coupling of this method with another popular technique used for real time computations, model reduction, which will be shown to provide more gains than either method alone. In particular, we look at reduced order modeling based on proper orthogonal decompositions. We present some applications in terms of solving time dependent nonlinear partial differential equations and solving these equations with a coupled approach that combines model reduction and the parareal algorithm. The performance of this method, both numerically and computationally, is discussed in terms of the gains in speedup and efficiency, and in terms of the scalability of the parallelization of the temporal domain on a larger and larger set of compute nodes or processors.

CHAPTER 1

INTRODUCTION

Many of the equations and systems governing the various mechanisms of nature implicitly involve time evolution and time dependencies in general. Many successful attempts have been made to construct computational schemes for integration in time. Until recently, many of these schemes have shared the common theme of being purely serial in implementation. Traditional schemes seem to have taken the view that time, itself, is sequential, and many schemes have been designed such that computations at a current time step rely on values computed at previous time steps. In contrast to this traditionally sequential view of the temporal domain, researchers have dealt with the spatial domain in a wide variety of ways: serial, parallel, or a combination of both. In the context of time dependent ordinary and partial differential equations, many parallel schemes have been proposed in which discretizations of the spatial domain are implemented in parallel. One class of methods in which such implementations are prevalent is domain decomposition methods. These methods have been shown to be useful in both serial and parallel implementations. The scheme presented here shares much of the flavor of these domain decomposition methods. There are two significant points of departure though. The first is the focus on decomposing the temporal domain and not the spatial domain. The second lies in the fact that this scheme has absolutely no value in a serial computation and is, therefore, a purely parallel scheme. Also, many people will notice its similarity to some of the various flavors of methods for iterative improvement for linear systems.

The Parareal algorithm was introduced by Lions, Maday, and Turinici in 2001 [13] as a numerical method to solve time-evolution problems in parallel. The name of the algorithm already indicates the intention of its design. The purpose is for parallel, real time

computations involving time evolution equations whose solutions may not be obtainable in real time using only a single processor machine. Similar to what takes place in spatial decompositions, they introduced a decomposition of the temporal domain. Also, in the spirit of domain decompositions, they introduced both coarse and fine grid resolutions in time. These grids are then combined in a corrector scheme which allows for the coarse resolution to be updated iteratively while preserving the accuracy and stability of the time discretization scheme being used over the coarse and fine grid resolutions. The coarse grid approximation and the application of the corrector scheme are purely serial in the implementation. The fine grid approximations are serial only within each sub-domain of the coarse grid, thus allowing for the parallel implementation of the fine grid scheme on each of these sub-domains. The corrector scheme is then used to update the coarse grid approximation using the results of the fine grid approximations on each sub-domain, which have been computed simultaneously in parallel, and this procedure is iterated until convergence.

In principle the Parareal algorithm is completely problem independent. Further, the algorithm leaves much flexibility in the choice of a time discretization scheme. It is important to note that the algorithm can never exceed the accuracy or the stability of the numerical scheme being employed. Also, when we talk about the convergence of the algorithm, we do so in terms of its approach towards the solution that would have been obtained had the problem been solved directly using the fine grid over the entire temporal domain.
Another upshot of this algorithm, which makes it especially promising for real time computations, and in stark contrast to the more traditional spatial decompositions, is that in a true parallel implementation the algorithm requires a very minimal amount of communication between any of the processors carrying out the fine grid approximations. In this work, we explain the computational capabilities of this algorithm and then analyze the practical results of the computations of some numerical experiments involving a variety of complex systems. The original contribution being made in this work is the combining of the parareal algorithm and model reduction to obtain more computational gains than either method alone. We will look at the effective speedup and efficiency of the parallel computations versus their serial implementations as a function of the number of processors employed and with this provide an analysis of the scalability of these parallel implementations. We begin with an introduction to the standard parareal algorithm. In this introductory

section we illustrate the algorithm with a very simple ordinary differential equation (ODE) to help make the idea clear. In the following chapter we discuss our implementation of the algorithm using a Finite Element Method (FEM) for the solution of nonlinear partial differential equations (PDEs). The chapter after that details our coupling of the algorithm with Reduced Order Modeling (ROM) to obtain further gains in speedup and efficiency. We then devote an entire section to the numerical experiments that were implemented. Then we provide a section on performance and scalability analysis. Finally, a summary of this thesis and a discussion of future work are given in the concluding chapter.

CHAPTER 2

The Parareal Algorithm

2.1 The Basic Algorithm

In general we are interested in solving time dependent differential equations of the general form,

\[
\frac{\partial u}{\partial t} + Au = f, \quad t \in [0,T],\ x \in \Omega, \qquad u(0,x) = u_0, \tag{2.1}
\]

where, in general, $A$ is an operator from a Hilbert space $S$ into another Hilbert space $S'$. A typical example would be the standard heat equation, in which case $A = -\nabla^2 = -\Delta$, where $\Delta$ is the Laplace operator. Although, for the sake of the introduction, we focus primarily on the case of $A$ being a linear operator, the algorithm also works well for nonlinear cases, which is the class of primary interest in our research. Nonlinearities can also enter this general form through $f$, when it is a function of the solution $u$ and/or any of its derivatives. To illustrate the basic idea of the method, in this chapter our focus will be on an implementation involving ODEs, in which the equation depends solely on time; in this case one can think of $A$ as either a constant or some single function of time, in the case of scalar equations, or as some coefficient matrix in the case of a system of equations. We then introduce a decomposition of the time interval $[0,T]$ into $N$ equal subintervals $[T^n, T^{n+1}]$ of size $\Delta T = T/N$ such that,

\[
0 = T^0 < T^1 < \cdots < T^n = n\,\Delta T < T^{n+1} < \cdots < T^N = T. \tag{2.2}
\]

It is not necessary for the time step to be uniform, but a uniform step makes the introduction to the method a bit simpler. We then solve the discretized problem over this coarse grid with the large time

step $\Delta T$ and thus obtain a cheap approximation with a coarse resolution at each $T^n \in [0,T]$. We denote this coarse grid solution by $U^n \approx u(T^n)$, and of course $U^0 = u_0$. Next, we pose a similar, yet connected, problem on each subinterval $[T^n, T^{n+1}]$, which is the fine grid resolution with the much smaller time step $\delta t$. In the case of uniform time steps the coarse step is an integer multiple of the fine step, i.e. $\Delta T = M\,\delta t$, where $M$ is the number of steps to be taken on each of the $N$ subintervals $[T^n, T^{n+1}]$. On each of these subintervals we solve the problem,

\[
\frac{\partial u^n}{\partial t} + Au^n = f, \quad t \in [T^n, T^{n+1}], \qquad u^n(T^n) = U^n, \tag{2.3}
\]

where $u^n$ is the fine grid approximation of $u$ for $t \in [T^n, T^{n+1}]$. The fine grid resolution on each subinterval can be computed independently and thus in parallel. If we let $k$ denote the iterate and $U_1^n = U^n$, $u_1^n = u^n$, then we can iterate to improve the accuracy of the scheme by using the known values of $U_k^n$ and $u_k^n$ over each $[T^n, T^{n+1}]$ in a corrector scheme to obtain better approximations $U_{k+1}^n$ and $u_{k+1}^n$ over these subintervals. Specifically we,

(i) introduce the defect: $S_k^n = u_k^{n-1}(T^n) - U_k^n$;
(ii) then on the coarse grid solve for the correction, $\delta_k^n$, to the fine grid solution at $T^n$ using the defect $S_k^n$;
(iii) then update $U_{k+1}^n = u_k^{n-1}(T^n) + \delta_k^n$ and solve (2.3) again in parallel.

This procedure is then iterated until the accuracy that would have been attained by solving the full fine grid resolution problem over [0,T] with time step δt is achieved. The important thing to keep in mind is that the steps involving calculations on the fine grid within each subinterval are independent and thus parallelizable. To illustrate the method, let us look at an implementation of the algorithm employed on a simple ODE.

2.2 A Simple Example

Here is an example of the use of this algorithm on a very simple, first order linear ODE. To further aid in simplicity I will employ an implicit Euler scheme for the discretization of the temporal domain. This example will help to illustrate the mechanical aspect of implementing

Figure 2.1: Illustration of the coarse and fine grids (coarse grid nodes $T^0, T^1, \ldots, T^n, T^{n+1}, \ldots, T^N$ with step $\Delta T$; fine grid on $[T^n, T^{n+1}]$ with step $\delta t$)

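this algorithm in practice. Consider the ODE,

\[
\frac{dy}{dt} - a\,y(t) = \sin(t), \quad t \in [0,T], \qquad y(0) = y_0, \tag{2.4}
\]

and then employ the backward Euler scheme over the coarse grid, $0 = T^0 < T^1 < \cdots < T^N = T$, with time step $\Delta T$,

\[
\frac{Y^{n+1} - Y^n}{\Delta T} - a\,Y^{n+1} = \sin(T^{n+1}), \qquad Y^0 = y_0, \tag{2.5}
\]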

where $n = 0, 1, \ldots, N-1$. Next, we pose a similar problem on each sub-domain $[T^n, T^{n+1}]$ using the fine grid resolution with time step $\delta t$. Here we use the nodal values computed over the coarse grid for each $T^n$ as the initial value for this new but similar problem on each $[T^n, T^{n+1}]$ as follows,

\[
\frac{dy^n}{dt} - a\,y^n(t_{\delta t}) = \sin(t_{\delta t}), \quad t_{\delta t} \in [T^n, T^{n+1}], \qquad y^n(t_{\delta t} = T^n) = Y^n, \tag{2.6}
\]

where $Y^n$ is the solution of (2.5) at the value of $T^n$ and $t_{\delta t}$ denotes the values of $t$ within each of the subintervals $[T^n, T^{n+1}]$ with fine time step $\delta t$. The next step is to use the information computed on each of these sub-domains to update the coarse grid solution at each $T^n$. In fact, the only information needed from the computations done on these subintervals is the value of each $y^n(T^{n+1})$, $n = 0, 1, \ldots, N-1$, that is, the last (right most) value computed on each subinterval $[T^n, T^{n+1}]$. This value is compared with the corresponding nodal value of the solution to (2.5) on the coarse grid and then used to update these nodal values over the coarse grid through the corrector scheme as follows,

(i) introduce the defects $S_k^n = y_k^{n-1}(T^n) - Y_k^n$;
(ii) then propagate the defect with a coarse resolution of the $\delta_k^n$ as follows:

\[
\frac{\delta_k^{n+1} - \delta_k^n}{\Delta T} - a\,\delta_k^{n+1} = \frac{S_k^n}{\Delta T}, \qquad \delta_k^0 = 0. \tag{2.7}
\]

Note: we define $y_k^{-1}(T^0) = y^0$. Also, we introduce $k$ as the iteration number, hence if this is the first application of this procedure then $k = 1$. The computations done with the fine grid resolution on each subinterval are all independent problems and thus parallelizable. The only information required globally by the various processors is the nodal values of the coarse grid approximation $Y_k^n$. Once each $y^n(T^{n+1})$ has been computed, they may be returned to the master processor and used to correct the original nodal values over the coarse grid approximation by solving (2.7) and then setting,

\[
Y_{k+1}^n = y_k^{n-1}(T^n) + \delta_k^n. \tag{2.8}
\]
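The coarse solve, the independent fine solves, the defect propagation (2.7), and the update (2.8) can be sketched in a few lines of Python. This is a minimal serial sketch for the model ODE (2.4) only, not the implementation used in this thesis: the function names `backward_euler` and `parareal`, the choice to repeat the correction a fixed number of times, and the list comprehension standing in for the parallel fine solves are all assumptions made for illustration.

```python
import math


def backward_euler(y0, t0, h, steps, a):
    """Advance dy/dt - a*y = sin(t) from t0 by `steps` backward Euler steps of size h."""
    y, t = y0, t0
    for _ in range(steps):
        t += h
        # (y_new - y)/h - a*y_new = sin(t)  =>  y_new = (y + h*sin(t)) / (1 - a*h)
        y = (y + h * math.sin(t)) / (1.0 - a * h)
    return y


def parareal(a, y0, T, N, M, iterations):
    """Return coarse-node values Y[0..N] after `iterations` corrector sweeps."""
    dT = T / N   # coarse step
    dt = dT / M  # fine step, so that dT = M * dt

    # Initial coarse solve over the whole interval (one backward Euler step per node).
    Y = [y0]
    for n in range(N):
        Y.append(backward_euler(Y[n], n * dT, dT, 1, a))

    for _ in range(iterations):
        # Fine solves on each [T^n, T^{n+1}], started from Y[n].
        # These are independent, so in a parallel code each would run on its own
        # processor; here they are evaluated one after another (an assumption).
        fine_end = [backward_euler(Y[n], n * dT, dt, M, a) for n in range(N)]

        # Defects S^n = y^{n-1}(T^n) - Y^n, with S^0 = 0 since y^{-1}(T^0) = y0 = Y^0.
        S = [0.0] + [fine_end[n - 1] - Y[n] for n in range(1, N + 1)]

        # Propagate the defect on the coarse grid, equation (2.7), with delta^0 = 0.
        delta = [0.0]
        for n in range(N):
            delta.append((delta[n] + S[n]) / (1.0 - a * dT))

        # Update the coarse nodal values, equation (2.8).
        Y = [y0] + [fine_end[n - 1] + delta[n] for n in range(1, N + 1)]
    return Y
```

Calling, for example, `parareal(-1.0, 1.0, 2.0, 10, 20, 3)` (parameter values chosen here purely for illustration) returns the eleven coarse-node values after three corrections. With `iterations=0` the function returns the plain coarse solution, and after `N` corrections the coarse-node values agree with a serial fine-grid backward Euler solve up to rounding.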

Finally, we iterate this procedure by next solving the problem,

\[
\frac{dy_{k+1}^n}{dt} - a\,y_{k+1}^n(t_{\delta t}) = \sin(t_{\delta t}), \quad t_{\delta t} \in [T^n, T^{n+1}], \qquad y_{k+1}^n(t_{\delta t} = T^n) = Y_{k+1}^n, \tag{2.9}
\]

and we continue iterating in this fashion until convergence is reached. Note that convergence here is used in the sense of the coarse approximation approaching the accuracy of the fine grid approximation. Once the error in the coarse grid approximation gets to be within machine epsilon of the error in the fine grid approximation, further iterations become superfluous. There is, therefore, an optimal number of iterations required to reach the desired accuracy of the fine grid approximation.

This algorithm is rich in computation over communication. Many of the methods used to solve PDEs numerically, such as our FEM and ROM methods, are dominated in computation and in storage by having to solve large linear systems of equations. In the linear case there is one linear solve for each time step, and the nonlinear solve typically involves many linear solves. In an implementation of the parareal algorithm the expensive portion of the algorithm, i.e. the many solves done on the fine grid, is done in parallel, while the serial portion requires one initial coarse solve, to generate initial data for each of the fine solves, and only one coarse solve for the correction term per iteration. Furthermore, we save on storage and communication by only storing and then communicating the solution of the final step in the fine solve with step $\delta t$. Thus, if we have $J$ spatial unknowns at each of the $N$ time steps in the coarse grid, we will be passing arrays of size $J$ a total of $N$ times, and hence we pass $JN$ doubles or floats, or some type of data structure, per iteration of this algorithm. Since $N$ is typically small compared to $M$ (the number of steps in the fine grid), we are only passing a small number of arrays per iteration but doing a large number of linear solves in parallel. So, the method is dominated by computation, while communication is minimal.

2.3 Comments on Some Mathematical Properties of the Parareal Algorithm

The primary focus of this work is to highlight the computational aspects of this algorithm both in terms of performance and implementation. The purpose here is simply to highlight some of the basic mathematical properties of the algorithm which have already been given a great deal of attention in the literature.

In principle the convergence and stability of the method are essentially inherited from the numerical scheme that is being employed over the coarse and fine grids. There are some cases where this fails though. For a thorough discussion of these results the reader is referred to [1], where one will find the most general and abstract analysis of the mathematical details on the convergence and stability criteria for the parareal algorithm. Also, one can find similar results in [5]. For a nice general overview of all of the basic mathematical results of this algorithm, without all of the analysis and formal proofs, the reader should take a look at [17].

First, we provide a simple example problem which works well with this algorithm, and then we take a look at a case where the method breaks down and discuss why it fails in that case. Both of these problems were taken from a test suite of dynamical systems found in [12]. Consider the following dynamical system, which models a linear chemical reaction; denote the system by B2,

\[
\begin{aligned}
y_1' &= -y_1 + y_2, & y_1(0) &= 2, \\
y_2' &= y_1 - 2y_2 + y_3, & y_2(0) &= 0, \\
y_3' &= y_2 - y_3, & y_3(0) &= 1.
\end{aligned} \tag{2.10}
\]

First, take a look at the exact solution in Figure (2.2). Next, take a look at the parareal solution on the coarse grid after two iterations of the correction scheme in Figure (2.3). A simple backward in time Euler method was used on both the coarse and the fine grid. This is an example of a problem where the parareal algorithm works very well. Now, let us consider the following dynamical system. These equations model the growth of two conflicting populations, denote the system by B1,

\[
\begin{aligned}
y_1' &= 2(y_1 - y_1 y_2), & y_1(0) &= 1, \\
y_2' &= -(y_2 - y_1 y_2), & y_2(0) &= 3.
\end{aligned} \tag{2.11}
\]

First, take a look at a phase portrait of the system (2.11) in Figure (2.4). Again, a simple backward in time Euler method is employed here, but the parareal algorithm breaks down for this system. Take a look at Figure (2.5), where we see the initial

Figure 2.2: B2, Exact Solution (three-dimensional plot; axes y1, y2, y3)

coarse solve in red and, in navy blue, one iteration of the corrector scheme, which is diverging from the initial coarse solution. The question then is what was different about these two systems such that the parareal algorithm worked well for B2 but then falls apart for B1? If you're thinking that it has to do with B1 being nonlinear, it is important to note that the nonlinearity has absolutely nothing to do with why the correction scheme diverged. B1 is a dynamical system whose eigenvalues are not just complex but very near to being purely imaginary. So, for systems whose eigenvalues are very near to being purely imaginary or when the magnitude of the imaginary part is much greater than the magnitude of the real

Figure 2.3: B2, Parareal Solution on the Coarse Grid After 2 iterations of the Correction Scheme (three-dimensional plot; axes y1, y2, y3)

part, then the correction scheme in the parareal algorithm becomes dissipative and quickly begins to diverge from the original coarse grid solution. If one looks closely at Figure (2.5) it will be seen that initially the corrector scheme starts out doing fine and only after a few corrections does it start to veer away from the initial coarse grid solution. So far what researchers have been proposing in the literature is to do the most obvious thing, which is to monitor when the corrector begins to veer away from the previous iterated solution, particularly when one has some a priori knowledge of the

Figure 2.4: B1, Phase Portrait (t in [0,20]; axes y1, y2)

eigenvalues associated with the problem, such as when working with hyperbolic equations or equations with very large amplitude oscillations. When the corrector begins to veer, they halt the iteration and restart it from the previous step, at which it began to veer away from the desired trajectory. For more details on the performance of a restarted version of the parareal algorithm the reader is referred to [2], [4], and [17].
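The eigenvalue structure of the two test systems can be checked directly. The sketch below is an illustration under stated assumptions: for B1 we linearize about the equilibrium point (1, 1), which is our choice of linearization point and is not specified in the text, and the helper names `eig2`, `jacobian_b1`, and `det3` are hypothetical. For B2 the candidate eigenvalues 0, -1, and -3 are verified by evaluating the characteristic determinant.

```python
import cmath


def eig2(m):
    """Eigenvalues of a 2x2 matrix computed from its trace and determinant."""
    (a, b), (c, d) = m
    tr, det = a + d, a * d - b * c
    root = cmath.sqrt(tr * tr - 4.0 * det)
    return (tr + root) / 2.0, (tr - root) / 2.0


def jacobian_b1(y1, y2):
    """Jacobian of B1: y1' = 2(y1 - y1*y2), y2' = -(y2 - y1*y2)."""
    return [[2.0 * (1.0 - y2), -2.0 * y1],
            [y2, -(1.0 - y1)]]


def det3(m):
    """Determinant of a 3x3 matrix by cofactor expansion along the first row."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


# Coefficient matrix of the linear system B2, read off from (2.10).
B2 = [[-1.0, 1.0, 0.0],
      [1.0, -2.0, 1.0],
      [0.0, 1.0, -1.0]]


def char_det_b2(lam):
    """det(B2 - lam*I); this is zero exactly when lam is an eigenvalue of B2."""
    shifted = [[B2[i][j] - (lam if i == j else 0.0) for j in range(3)]
               for i in range(3)]
    return det3(shifted)


# B1 linearized at the equilibrium (1, 1): a purely imaginary conjugate pair.
lam_plus, lam_minus = eig2(jacobian_b1(1.0, 1.0))
```

Under this linearization the Jacobian of B1 has zero trace and determinant 2, so its eigenvalues form a purely imaginary conjugate pair, while the characteristic determinant of B2 vanishes at the real values 0, -1, and -3. This is consistent with the observation above that the difficulty with B1 comes from its nearly imaginary spectrum rather than from its nonlinearity.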

Figure 2.5: B1, Coarse Grid Solution and Divergent Refined Solution. (axes y1, y2; curves: Coarse, Refined)

CHAPTER 3

The Finite Element Method and The Parareal Algorithm

In the previous chapter we illustrated the method using a very simple linear ODE. It is our primary goal, though, to utilize the parareal algorithm for the real time computation of much larger and more complex multivariate nonlinear differential equations. Our method for discretizing PDEs in space is to utilize a Finite Element Method. In this chapter we provide an extremely brief overview of the FEM for time dependent partial differential equations. Then we explain how we implement the parareal algorithm to parallelize the temporal domain coupled with a FEM discretization of the spatial domain.

3.1 A Finite Element Method

The most significant difference between a Finite Element Method (FEM) and a more typical Finite Difference Method (FDM) is that in the FDM you directly discretize the derivatives in the equation by replacing them with finite difference quotients, whereas in a FEM we instead work with the weak or integral form of the equation and then seek an approximation to the solution,

\[
u(x) \approx \sum_{j=1}^{N} c_j \phi_j(x), \tag{3.1}
\]

where the $\phi_j$'s are a finite set of basis functions, typically piecewise polynomials, and $N$ is the cardinality of this set of basis functions. The $c_j$'s are then the unknown coefficients of our basis functions, to be determined so that the function expansion, in terms of these basis functions, holds for the problem of interest. Although we are ultimately interested in time dependent PDEs, it is enough, for this work, to have a basic understanding of how we implement our FEM discretization in space.

As an illustration consider an elliptic problem which can be thought of physically as the steady state form of the linear heat or diffusion equation in two spatial dimensions over the unit square. If we let $\Omega$ be our domain and $\partial\Omega$ be the boundary of our domain, then the classic or strong problem takes the familiar form,

\[
-\Delta u = f(x,y) \quad \text{with } (x,y) \in \Omega = (0,1) \times (0,1), \tag{3.2}
\]
\[
u = g \quad \text{on } \partial\Omega, \tag{3.3}
\]

where $g$ is our Dirichlet boundary data. This example is referred to as an elliptic PDE with Dirichlet boundary conditions; FEMs are very adaptable to a much larger variety of boundary conditions and domains, which is another area where we see the benefits of using a FEM.

To formulate the corresponding integral or weak form of the problem we begin by choosing an appropriate space of test functions. In general, this choice will be a type of Hilbert space, which is a complete inner product space further equipped with a norm or metric induced by this inner product. The use of such spaces gives us and, in general, the mathematical analyst all of the tools necessary to prove the properties needed for the numerical convergence, stability, and consistency of these FEM schemes. Mathematically, this provides another desirable advantage over many FDMs. The choice of the underlying Hilbert space is most directly relevant to the degree of smoothness required of the solution. For second order PDEs it is very common to work with the Sobolev space $H^1(\Omega)$, and in fact this is our choice for the underlying function space in the implementations presented in this work. Informally, this space is simply the space of all functions which are square integrable and further have first derivatives which are also square integrable over their domain $\Omega$. Another way to put this is, the space of functions which are in the more familiar Hilbert space $L^2(\Omega)$ of square integrable functions and whose gradient, in magnitude, is also in this space, i.e.,

\[
H^1(\Omega) = \{ u \in L^2(\Omega) : |\nabla u| \in L^2(\Omega) \}. \tag{3.4}
\]

Clearly, $H^1(\Omega)$ is a subspace of $L^2(\Omega)$ and we define the inner product on this space as,

\[
(u,v)_{H^1(\Omega)} = (u,v)_{L^2(\Omega)} + (\nabla u, \nabla v)_{L^2(\Omega)}, \quad \forall u, v \in H^1(\Omega), \tag{3.5}
\]

where $(u,v)_{L^2(\Omega)}$ denotes the $L^2(\Omega)$ inner product given by,

\[
(u,v)_{L^2(\Omega)} = \int_\Omega uv, \quad \forall u, v \in L^2(\Omega). \tag{3.6}
\]

Using this definition of the inner product we have the naturally induced norm,

\[
\|u\|_{H^1(\Omega)} = (u,u)_{H^1(\Omega)}^{1/2}, \quad \forall u \in H^1(\Omega). \tag{3.7}
\]

Many of the error estimates used in FEMs involve the behavior of the solution in the $H^1$ semi-norm, which we denote by $|\cdot|_{H^1}$ and which is defined as,

\[
|u|_{H^1} = (\nabla u, \nabla u)_{L^2(\Omega)}^{1/2}, \quad \forall u \in H^1(\Omega). \tag{3.8}
\]

This is referred to as a semi-norm because it fails to conform to one property that defines a norm, namely the property that $\|u\| = 0$ iff $u = 0$; any nonzero constant function, for instance, has zero semi-norm. Another commonly used space that one often encounters in the literature, and which is used extensively for the analysis of a FEM, arises when dealing with homogeneous Dirichlet boundary conditions, i.e. when the value of the solution to the differential equation is specified to be zero on the boundary of the domain over which it is defined. The space is the Sobolev space, denoted $H_0^1(\Omega)$, of all functions which are $H^1(\Omega)$ functions and which also satisfy the zero boundary conditions on the boundary of the domain $\Omega$. This is a case where the boundary conditions can be safely imposed directly on the underlying space of test functions, which can't be done in general because doing so compromises the closure of that space and thus violates the requirement that it be a Hilbert space. Thus,

\[
H_0^1(\Omega) = \{ v \in H^1(\Omega) : v = 0 \text{ on } \partial\Omega \}. \tag{3.9}
\]

Now that we have the notion of our solution space clearly defined, we can look at how we develop the weak form of the problem. First, we take an arbitrary function in H1(Ω) and multiply our strong problem by this test function call it v. Then, we integrate this equation over all of Ω so we have,

\[
-\int_\Omega \Delta u\, v = \int_\Omega f v, \quad \forall v \in H^1(\Omega). \tag{3.10}
\]

Next, we use integration by parts, or rather Green's theorem, on the term involving the Laplacian to obtain the full integral or weak form of the problem. By doing so we will have reduced the order of the derivatives in the equation from two to one, and thus we can relax the level of smoothness required of our solution, allowing the method to capture a larger class of solutions than in the FDM, for example. So the continuous weak problem is to seek $u \in H^1(\Omega)$ such that,

\[
\int_\Omega \nabla u \cdot \nabla v - \int_{\partial\Omega} \frac{\partial u}{\partial n}\, v = \int_\Omega f v, \quad \forall v \in H^1(\Omega). \tag{3.11}
\]

The immediate question arises as to how well this relates to our original classical problem. It can be shown that if $u$ satisfies the classical problem in differential form then it will indeed satisfy the corresponding weak or integral form of the problem. There is a small caveat in stating the converse though. It can also be shown that the weak solution does satisfy the classical formulation in the case that $u$ is smooth enough. The details of the proof essentially boil down to being able to show that the forcing term on the right-hand side, $f$, is in fact an $L^2(\Omega)$ function.

From the boundary integral term in (3.11) one can see why Neumann boundary conditions are satisfied naturally by a FEM, because then $\partial u / \partial n$ is specified and thus becomes a contribution to the right-hand side. For simplicity let us consider the case where $g$ from (3.3) is zero, i.e. homogeneous Dirichlet boundary conditions. This allows us to impose the boundary conditions directly on our underlying space of test functions. If $v = 0$ over $\partial\Omega$ then the boundary integral term in (3.11) becomes zero. We will thus restrict ourselves to the problem of finding $u \in H_0^1(\Omega)$ such that,

\[
\int_\Omega \nabla u \cdot \nabla v = \int_\Omega f v, \quad \forall v \in H_0^1(\Omega). \tag{3.12}
\]

Inhomogeneous Dirichlet boundary data can be handled in several ways. One of the easiest ways to handle this case is to transform the problem into one which does have homogeneous Dirichlet boundary data, solve this transformed problem, and then transform its solution back to the solution that satisfies the original inhomogeneous conditions. Another approach is to deal with the boundary integral directly, which will add a term to our bilinear form and additional contributions to the right-hand side of the equation.

For the purposes of analysis the problem is usually cast into a more general and abstract form. If we let $V$ denote a Hilbert space, let $A(\cdot,\cdot)$ denote a bilinear form on $V \times V$, and let $F$ denote a linear functional on $V$, then the general weak problem we consider is to seek $u \in V$ satisfying,

\[
A(u,v) = F(v), \quad \forall v \in V. \tag{3.13}
\]

Many weak formulations encountered in applications can be posed within this general framework with an appropriate choice of the Hilbert space, the bilinear form, and the linear functional. In terms of our example concerning the elliptic problem with homogeneous Dirichlet boundary conditions we have as our Hilbert space $V = H_0^1(\Omega)$ and the bilinear form,

$$A(u, v) = \int_\Omega \nabla u \cdot \nabla v, \quad \forall v \in H_0^1(\Omega) \tag{3.14}$$

and for the linear functional we have,

$$F(v) = \int_\Omega f v, \quad \forall v \in H_0^1(\Omega). \tag{3.15}$$

If $F$ is a bounded linear functional on the given Hilbert space $V$, and the bilinear form $A(\cdot,\cdot)$ is bounded (which is equivalent to being continuous on $V$) and furthermore satisfies a property referred to as coercivity, also called $V$-ellipticity in the literature, then the Lax-Milgram theorem guarantees the existence and uniqueness of the solution to (3.13). For our example problem these properties can be readily demonstrated, see [19]. Moreover, this theorem provides us with a bound on the solution in terms of the data given in the original problem.

Theorem 3.1.1 (Lax-Milgram Theorem) Let $V$ be a Hilbert space with norm $\|\cdot\|$ and let $A(\cdot,\cdot): V \times V \to \mathbb{R}$ be a bilinear form on $V$ which satisfies

$$|A(u, v)| \le M \|u\| \|v\| \quad \forall u, v \in V \tag{3.16}$$

and

$$A(u, u) \ge m \|u\|^2 \quad \forall u \in V, \tag{3.17}$$

where $M$ and $m$ are positive constants independent of $u, v \in V$. Let $F: V \to \mathbb{R}$ be a bounded linear functional on $V$. Then there exists a unique $u \in V$ satisfying (3.13). Moreover,

$$\|u\| \le \frac{1}{m} \|F\|. \tag{3.18}$$
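As an illustration of how the hypotheses (3.16) and (3.17) can be checked for the model bilinear form (3.14) on $H_0^1(\Omega)$, one standard argument uses the Cauchy-Schwarz and Poincaré-Friedrichs inequalities. The constant $C_P$ below is notation introduced only for this sketch, and the norm on $V$ is assumed to be the full $H^1$ norm.

```latex
% Continuity (3.16) with M = 1, by the Cauchy-Schwarz inequality:
|A(u,v)| = \left| \int_\Omega \nabla u \cdot \nabla v \right|
  \le \|\nabla u\|_{L^2}\, \|\nabla v\|_{L^2}
  \le \|u\|_{H^1}\, \|v\|_{H^1}.

% Coercivity (3.17), using the Poincare-Friedrichs inequality
% \|u\|_{L^2} \le C_P \|\nabla u\|_{L^2} for u in H_0^1(\Omega):
\|u\|_{H^1}^2 = \|u\|_{L^2}^2 + \|\nabla u\|_{L^2}^2
  \le (1 + C_P^2)\, \|\nabla u\|_{L^2}^2
\quad\Longrightarrow\quad
A(u,u) = \|\nabla u\|_{L^2}^2 \ge \frac{1}{1 + C_P^2}\, \|u\|_{H^1}^2 .
```

Under these assumptions one may take $M = 1$ and $m = 1/(1 + C_P^2)$ in the theorem.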

Once we have done the appropriate analysis to convince ourselves that we are indeed working with a problem that bears a unique solution, and thus have some information about a priori bounds, we can begin to think about how to compute such a solution.

The first computational issue that arises is that our general problem was posed over an infinite dimensional Hilbert space, but for computational purposes we are always restricted to working within finite dimensional spaces. Our first task in making this a viable computational method is therefore to choose an appropriate finite dimensional subspace of the underlying infinite dimensional Hilbert space $V$. It is, in fact, the choice of this finite dimensional subspace that will make or break the computational efficiency of our FEM. Since the computational effort in a FEM simulation is usually dominated by solving linear systems of equations, it is extremely important to construct the method so that the resulting linear systems are well structured, e.g. sparse or banded. The key to achieving nicely structured matrices is one's choice of the basis functions used to construct the finite dimensional subspace: one seeks basis functions with small, compact support. A popular way to achieve this is to use continuous piecewise polynomials constructed over the discretized computational domain $\Omega$. Many applications use piecewise linears or piecewise quadratics. The choice of polynomial degree relates to one's desired degree of regularity and accuracy in the approximation (we will see this illustrated clearly in what follows).

Once we have chosen an appropriate subspace $S^h \subset V$, which is the case for conforming FEMs, we have a nice theorem that tells us about the existence and uniqueness of our solution in this new finite dimensional space and gives us an estimate of the deviation of our finite dimensional approximation from the infinite dimensional one.

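Theorem 3.1.1 (Galerkin's or Cea's Lemma) Let $V$ be a Hilbert space with norm $\|\cdot\|$ and let $A(\cdot,\cdot): V \times V \to \mathbb{R}$ be a bilinear form on $V$ satisfying (3.16) and (3.17), and let $F(\cdot)$ be a bounded linear functional on $V$. Let $u$ be the unique solution of $A(u, v) = F(v)$, $\forall v \in V$, guaranteed by the Lax-Milgram theorem. Let $\{S^h\}$ be a family of finite dimensional subspaces of $V$ parameterized by $h > 0$. Then for every $h$ there exists a unique $u^h \in S^h$ such that

$$A(u^h, v^h) = F(v^h), \quad \forall v^h \in S^h, \tag{3.19}$$

and moreover,

$$\|u - u^h\| \le \frac{M}{m} \inf_{\chi^h \in S^h} \|u - \chi^h\| \tag{3.20}$$

where $M, m$ are the constants appearing in the Lax-Milgram theorem and $\|\cdot\|$ denotes the norm on $V$.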

The bound provided in this theorem may not seem immediately useful, but if we open a text on approximation theory, we shall find that if we let $I^h u$ be the $S^h$ interpolant of $u$ we have,

$$\inf_{\chi^h \in S^h} \|u - \chi^h\| \le \|u - I^h u\| \le C h^r, \tag{3.21}$$

where $h$ is the maximum step size in our discretization, $C$ is a constant independent of $h$, and $r$ is a constant determined by the degree of the polynomial interpolant used as a basis for $S^h$. For example, if we use piecewise linear polynomials we have,

$$|u - I^h u|_{H^1} \le C_1 h \|u\|_{H^2}, \quad \text{provided } u \in H^2(\Omega) \tag{3.22}$$

$$\|u - I^h u\|_{L^2} \le C_2 h^2 \|u\|_{H^2}, \quad \text{provided } u \in H^2(\Omega) \tag{3.23}$$

and for the case of piecewise quadratic polynomials we have,

$$|u - I^h u|_{H^1} \le C_3 h^2 \|u\|_{H^3}, \quad \text{provided } u \in H^3(\Omega) \tag{3.24}$$

$$\|u - I^h u\|_{L^2} \le C_4 h^3 \|u\|_{H^3}, \quad \text{provided } u \in H^3(\Omega) \tag{3.25}$$

and thus we are now equipped with some extremely useful a priori error estimates for approximating our solution with these choices of basis functions in the finite dimensional space $S^h$.

Now that we have a reasonable notion of what kinds of subspaces may be used and how a set of basis functions can be constructed, we may introduce the fully discrete form of the weak problem, which is the actual problem to be solved computationally. First let us recall that having a set of basis functions for the subspace $S_0^h$ implies that all functions within that space can be represented as a linear combination of the elements of that set of basis functions. In particular, if we let $\{\phi_i\}_{i=1}^{N} \subset S_0^h$ be a nodal basis for $S^h$, where $N = \dim(S^h)$ is the dimension of $S^h$, then we have,

$$u^h = \sum_{j=1}^{N} c_j \phi_j, \quad u^h \in S^h. \tag{3.26}$$

Taking our test functions $v$ to be the nodal basis itself, we have as the discrete weak problem to seek a $u^h \in S^h$ such that,

$$\int_\Omega \nabla u^h \cdot \nabla \phi_i = \int_\Omega f \phi_i, \quad \forall \phi_i \in S^h. \tag{3.27}$$

If we make use of the series expansion of $u^h$ in (3.26), we have for the fully discrete weak problem,

$$\sum_{j=1}^{N} c_j \int_\Omega \nabla \phi_j \cdot \nabla \phi_i = \int_\Omega f \phi_i, \quad i = 1, \dots, N. \tag{3.28}$$

It is these equations that produce the linear system of equations to be solved. In general the system is often written in the form,

$$Mc + Kc = F \tag{3.29}$$

where $M$ is referred to as the mass matrix and $K$ the stiffness matrix. In our example we have for the mass matrix,

$$M_{ij} = 0 \tag{3.30}$$

and the stiffness matrix is,

$$K_{ij} = \int_\Omega \nabla \phi_j \cdot \nabla \phi_i \tag{3.31}$$

and the right-hand side vector is,

$$F_i = \int_\Omega f \phi_i. \tag{3.32}$$

In a time dependent FEM implementation, the computational cost lies in solving these systems of equations at each time step. This is our motivation for turning to the parareal algorithm: it offers the hope of dividing this cost in parallel over multiple processors so that we can solve these problems faster, in real time, and approach larger, more complex, and thus more realistic problems in a much more reasonable time frame than previously available. In practice one commonly discretizes the spatial domain as described here and then uses a finite difference approximation to discretize the temporal domain. For the details of the mathematical analysis of a time dependent FEM the reader is referred to [19]. For our purposes it is enough to understand what we mean when we talk about the FEM errors, which was the major point of the linear analysis, since we use these errors to show the convergence of the parallel implementation of the parareal algorithm when applied to time dependent nonlinear PDEs, i.e. that the parallel implementation can produce the same results as a sequential version of the numerical scheme. Similar results can be shown for nonlinear problems as well, but this requires considerably more background in functional analysis. So we present just the linear analysis here, to provide a flavor of the results and to give the reader a basic understanding of how the error estimates are obtained.
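To make the assembly of (3.28)-(3.32) concrete, the following is a minimal sketch for a one dimensional analogue, $-u'' = f$ on $(0,1)$ with homogeneous Dirichlet data and piecewise linear hat functions. The function name `assemble_poisson_1d`, the dense storage, and the two-point Gauss rule are choices made only for this illustration and are not taken from the implementation used in this work.

```python
import numpy as np

def assemble_poisson_1d(n_elems, f):
    """Assemble K c = F for -u'' = f on (0,1), u(0)=u(1)=0, with hat functions."""
    h = 1.0 / n_elems
    nodes = np.linspace(0.0, 1.0, n_elems + 1)
    n = n_elems - 1                      # interior nodes only (Dirichlet data built into the space)
    K = np.zeros((n, n))                 # dense here for brevity; the matrix is tridiagonal
    F = np.zeros(n)
    k_local = np.array([[1.0, -1.0], [-1.0, 1.0]]) / h   # element integrals of phi_j' phi_i'
    for e in range(n_elems):
        xl, xr = nodes[e], nodes[e + 1]
        # two-point Gauss rule on the element for the load integral
        gp = 0.5 * (xl + xr) + 0.5 * h * np.array([-1.0, 1.0]) / np.sqrt(3.0)
        for a_loc in range(2):
            i = e + a_loc - 1            # interior index of the local node
            if i < 0 or i >= n:
                continue                 # boundary node: no test function there
            phi_i = (xr - gp) / h if a_loc == 0 else (gp - xl) / h
            F[i] += 0.5 * h * np.sum(f(gp) * phi_i)
            for b_loc in range(2):
                j = e + b_loc - 1
                if 0 <= j < n:
                    K[i, j] += k_local[a_loc, b_loc]
    return nodes, K, F

# Example: f = 1, whose exact solution is u(x) = x (1 - x) / 2.
nodes, K, F = assemble_poisson_1d(16, lambda x: np.ones_like(x))
c = np.linalg.solve(K, F)                # a banded or sparse solver would be used in practice
```

The compact support of the hat functions is what makes `K` tridiagonal here, which is the one dimensional counterpart of the sparse or banded structure discussed above.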

3.2 The Finite Element Method and the Parareal Algorithm for Nonlinear PDE’s

In the previous section we looked at a linear analysis to get a sense of what the error calculations mean and what norms are to be used in measuring the spatial error. It is our primary interest to use the parareal algorithm for large, complex nonlinear systems of time dependent differential equations.

In this section we describe how we have implemented the parareal algorithm in conjunction with the FEM to solve time dependent nonlinear partial differential equations. First, let us recall an overview of the basic algorithm. We introduce the notation $C$ to denote the coarse grid solve, $F$ to denote the fine grid solves, and $\delta$ to denote the correction solves. This is fairly consistent with what is commonly found in the literature.

• Step (i) Decompose the time interval $[0,T]$ into $N$ coarse time intervals $[T_n, T_{n+1}]$, $n = 0, \dots, N-1$, of length $\Delta T$ and solve the discretized problem sequentially in time; denote the solution at each $T_n$ by $C_n$, $n = 0, \dots, N$, where $C_0$ denotes the initial condition of the problem;

• Step (ii) decompose each coarse interval $[T_n, T_{n+1}]$, $n = 0, \dots, N-1$, into $M$ subintervals of length $\Delta t$; in parallel solve the discretized problem over each subinterval using $C_i$, $i = 0, \dots, N-1$, as the initial condition at $T_i$; denote this solution at the points $T_i$, $i = 1, \dots, N$, by $F_i$;

• Step (iii) define the errors or defects at the points $T_n$, $n = 1, \dots, N$, by $S_n = F_n - C_n$; solve the coarse grid problem for the correction to $F_n$ where the right-hand side of the equation is the jump $S_n / \Delta T$; call these corrections $\delta_n$, $n = 1, \dots, N$;

• Step (iv) set $C_n = F_n + \delta_n$; return to step (ii) if satisfactory convergence has not been achieved.

In our implementation of this algorithm we are careful to implement it in such a way that the sequential solves are linear, and thus the full nonlinear solves are done in parallel only.

We achieve this in two different ways. First, when we do our initial coarse solve for $C_n$ we simply lag the nonlinearity in the equation, making it a known term so that the equation reduces to a linear solve. Second, when we solve for the correction term $\delta_n$ we solve a linearized version of the differential equation. To illustrate an implementation of the algorithm, consider an initial boundary value problem for a nonlinear parabolic equation where we use a FEM for the spatial discretization and a backward Euler approximation for the temporal domain. Specifically, consider the problem,

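$$u_t - \Delta u + f(u) = g(x,t), \quad (x,t) \in \Omega \times (0,T] \tag{3.33}$$

$$\begin{aligned} u(x,0) &= u_0(x), & x &\in \Omega \\ u(x,t) &= 0, & (x,t) &\in \partial\Omega \times (0,T]. \end{aligned} \tag{3.34}$$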

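• Step (i) For $n = 1, \dots, N$, solve the linear problem with $\phi \in H_0^1$

$$\int_\Omega \frac{C_n^0 - C_{n-1}^0}{\Delta T}\, \phi + \int_\Omega \nabla C_n^0 \cdot \nabla \phi = \int_\Omega g(x, T_n)\, \phi - \int_\Omega f(C_{n-1}^0)\, \phi, \quad \text{where } C_0^0 = u_0.$$

For $k = 0, 1, \dots$ perform the following steps until satisfactory convergence is achieved.

• Step (ii) for each interval $[T_n, T_{n+1}]$, $n = 0, \dots, N-1$, either set $\Delta t = \frac{T_{n+1} - T_n}{M}$ with a fixed $M$, or instead fix $\Delta t$ and set $M = \frac{T_{n+1} - T_n}{\Delta t}$; set $F_{n,0}^k = C_n^k$ and for $m = 1, \dots, M$ solve in parallel the nonlinear problems over each subdomain,

$$\int_\Omega \frac{F_{n,m}^k - F_{n,m-1}^k}{\Delta t}\, \phi + \int_\Omega \nabla F_{n,m}^k \cdot \nabla \phi + \int_\Omega f(F_{n,m}^k)\, \phi = \int_\Omega g(x, t_m)\, \phi;$$

denote this solution at each point $T_n$, $n = 1, \dots, N$, by $F_n^k$;

• Step (iii) for $n = 0, \dots, N-1$, define $S_n^k = F_n^k - C_n^k$, set $\delta_0^k = 0$ and for $n = 1, \dots, N$ solve the linear problem

$$\int_\Omega \frac{\delta_n^k - \delta_{n-1}^k}{\Delta T}\, \phi + \int_\Omega \nabla \delta_n^k \cdot \nabla \phi + \int_\Omega f'(F_n^k)\, \delta_n^k\, \phi = \int_\Omega \frac{S_{n-1}^k}{\Delta T}\, \phi;$$

• Step (iv) set

$$C_n^{k+1} = F_n^k + \delta_n^k, \quad n = 1, \dots, N.$$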
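The structure of steps (i)-(iv) can be sketched on a scalar stand-in for the PDE, $u' + a u + f(u) = g(t)$, where the linear term $a u$ plays the role of the discretized $-\Delta u$. Everything in this sketch (the function names, the scalar model, the Newton tolerance, and the serial loop that stands in for the parallel fine solves) is an assumption made for illustration and is not the code used in this work.

```python
import numpy as np

def fine_step(u_prev, t_new, dt, a, f, df, g, tol=1e-12, max_newton=50):
    """One backward Euler step of u' + a*u + f(u) = g(t), solved by Newton's method."""
    u = u_prev
    for _ in range(max_newton):
        r = (u - u_prev) / dt + a * u + f(u) - g(t_new)
        if abs(r) < tol:
            break
        u -= r / (1.0 / dt + a + df(u))
    return u

def fine_propagate(u0, t0, t1, M, a, f, df, g):
    """M nonlinear fine steps across one coarse interval [t0, t1]."""
    dt = (t1 - t0) / M
    u = u0
    for m in range(1, M + 1):
        u = fine_step(u, t0 + m * dt, dt, a, f, df, g)
    return u

def parareal(u0, T, N, M, a, f, df, g, iters):
    dT = T / N
    Tn = np.linspace(0.0, T, N + 1)
    C = np.empty(N + 1)
    C[0] = u0
    # Step (i): sequential coarse solve, nonlinearity lagged so each step is linear
    for n in range(1, N + 1):
        C[n] = (C[n - 1] / dT + g(Tn[n]) - f(C[n - 1])) / (1.0 / dT + a)
    for k in range(iters):
        # Step (ii): independent nonlinear fine solves (a serial loop stands in for parallel)
        F = np.empty(N + 1)
        F[0] = C[0]
        for n in range(N):
            F[n + 1] = fine_propagate(C[n], Tn[n], Tn[n + 1], M, a, f, df, g)
        # Step (iii): defects and the linearized sequential correction solve
        S = F - C
        delta = np.zeros(N + 1)
        for n in range(1, N + 1):
            delta[n] = (delta[n - 1] + S[n - 1]) / dT / (1.0 / dT + a + df(F[n]))
        # Step (iv): update the coarse values
        C[1:] = F[1:] + delta[1:]
    return Tn, C
```

In this sketch only `fine_propagate` contains a nonlinear solve, and its calls within one iteration are independent of each other, so they are the part that would be distributed over processors. After $k$ iterations the first $k$ coarse values coincide with the sequential fine solution, so running $N$ iterations reproduces it at every $T_n$.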

Note the subtle distinction in how we implement the nonlinear solves in parallel. In one case we fix the number of steps $M$ taken within each of the subintervals $[T_n, T_{n+1}]$, which then allows $\Delta t$ to vary with $P$. The other option is to fix the underlying fine grid resolution by fixing $\Delta t$; in this case $M$ varies with the number of processors $P$. This choice has a significant effect on how the algorithm scales as we increase $P$. We explore these implications in detail in our chapter on performance analysis.

It is important to keep in mind that sequentially we solve linear problems only, either by lagging the nonlinear terms or by solving a linearized version of the equation. It is only in parallel, over the fine grid, that we solve the fully nonlinear equations using a nonlinear solver such as Newton's method or one of its variants. Performing all of the necessary sequential solves linearly is key to keeping the execution time of the sequential portions of our algorithm at a minimum. In the parareal algorithm the sequential computation time depends on the number of processors: since we typically take the number of coarse intervals $N$ to be equal to the number of processors $P$ by letting $\Delta T = \frac{T}{P}$, the initial sequential coarse solve and the one coarse solve done per iteration for the correction term $\delta_n^k$ become more expensive as we scale the algorithm to larger $P$.
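The two options can be written down as a small helper, assuming $N = P$ and $\Delta T = T/P$ as above. The function `fine_grid` and its rounding rule for $M$ are hypothetical and serve only to show how $\Delta t$ and $M$ respond to $P$.

```python
def fine_grid(T, P, M=None, dt=None):
    """Return (dT, dt, M) for P coarse intervals on [0, T], fixing exactly one of M or dt."""
    if (M is None) == (dt is None):
        raise ValueError("fix exactly one of M or dt")
    dT = T / P
    if M is not None:
        # fixed M: the fine step shrinks as P grows
        return dT, dT / M, M
    # fixed dt: the number of fine steps per processor shrinks as P grows
    M = max(1, round(dT / dt))   # rounding rule is an assumption of this sketch
    return dT, dT / M, M
```

With a fixed $M$ the total number of fine steps, $PM$, grows with $P$, whereas with a fixed $\Delta t$ the total stays at roughly $T/\Delta t$ and each processor does less work as $P$ increases.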

CHAPTER 4

Combining the Parareal Algorithm and Reduced Order Modeling

The decrease in computational time that is achieved by the parareal algorithm alone is oftentimes still not sufficient to perform real time calculations. Model reduction, or reduced order modeling (ROM), techniques have been shown to be effective in reducing computational costs and therefore compute time. In this chapter we couple the techniques of ROM and the parallel in time integration method to achieve a further reduction in compute time, thus bringing us closer to achieving true real time calculations. In what follows we first give a brief overview of the ROM methods we used, which are based on proper orthogonal decomposition (POD), and then describe our implementation of this ROM technique along with the parareal algorithm.

4.1 Reduced Order Modeling with Proper Orthogonal Decompositions

4.1.1 Main Idea

In model reduction the goal is to intelligently sample over pre-computed snapshots of the state equations to extract important samples that illustrate the dynamics of the problem, and then use these to construct a reduced basis $\{\psi_i(x)\}_{i=1}^{d}$ for the reduced state space. If the number of vectors in this reduced basis, $d$, is much smaller than the dimension of the state space, then one ends up solving a $d \times d$ dense linear system instead of an $n \times n$ sparse linear system, where $d \ll n$, at each time step, thus reducing the computational costs significantly. Once the reduced basis is constructed one seeks an approximation $u_{rom}(x,t)$ to the state $u$ of the form,

$$u_{rom}(x,t) = \sum_{j=1}^{d} a_j(t)\, \psi_j(x) \in W \equiv \operatorname{span}\{\psi_1, \dots, \psi_d\}. \tag{4.1}$$

Then one determines the coefficients $a_j$, $j = 1, \dots, d$, by solving the equation in the set $W$, e.g. one could find a Galerkin solution of the equation in a standard way, using $W$ for the finite element space of approximations. Here we use the approach where the reduced basis is generated using proper orthogonal decomposition (POD) for a set of snapshots obtained by computing approximations of the solution of the differential equation using a sampling of parameter values over the computed time steps.

4.1.2 Generating the Snapshot Set

One of the key steps in achieving a good reduced model is being able to generate good snapshot sets of the state equations that capture the dynamics of the problem one is interested in investigating. A reduced model, in fact, is only as good as the snapshot set that one samples from; if the dynamics of interest are not in the snapshot set then they most certainly will not be present in the ROM solution to the system. Unfortunately, at present, the generation of snapshot sets is not a science but still much more of an art, and many ad hoc approaches are used in a large variety of situations. Most snapshot sets will likely contain large amounts of redundant information with respect to the dynamics of interest, and so any a priori knowledge of the system in question can and should be used to close in on these interests in an attempt to reduce, as much as possible, the amount of redundant information present in the snapshot set. For example, one may have knowledge of certain bounds on allowable parameter values, there may be known constraints on the parameters or even on the independent variables themselves, one may have knowledge of correlations between parameters and variables, or one may know how the parameters are distributed in terms of a probability distribution function. These are examples of the types of a priori information that should be considered and used when setting out to generate snapshot sets of one's state simulations.

Regardless of the details of how the snapshot sets are generated, almost every approach will involve computing full, high dimensional state solves, or adjoint state solves in the case of control or optimization. It is the hope that the computational cost of a single or very few full high fidelity system simulations, paid to provide adequate snapshot sets, can be amortized by the ability to perform many more reduced order computations in the case of real time design or in instances where emergencies arise and actual lives depend on the capabilities of real time computations.

In practice the steps involving snapshot generation are considered an offline, pre-processing step, the idea again being that once a good snapshot set is in hand many ROM simulations can be performed with it. Thus, when gains such as speedup or reduced computational complexity are achieved by the ROM methods and reported, the steps involving snapshot generation are not included as part of the computational cost of the method being discussed.

Many approaches have been taken to obtain good snapshot sets. In this work and within the experiments reported we use the rather direct approach of computing full, high fidelity state solutions to the differential equations using a sampling of the parameter values involved in the system over a sampling of time steps.

4.1.3 Proper Orthogonal Decomposition

Now we assume that we have a well-generated snapshot set that captures the dynamics of interest within the system we are studying. The question becomes that of determining a reduced basis for our state space based on the snapshot set we have available. We now describe the method of proper orthogonal decomposition (POD) for achieving this goal.

Let $\{\vec{a}_k\}_{k=1}^{K}$ denote the set of points in parameter space that are chosen for generating the snapshots. Let $\Delta t$ denote the time-sampling interval, which is usually some multiple of the actual time step used to discretize the state system and may even be that same step itself. Although there is no need for this sampling step to be uniform, we consider the uniform case in this introduction for simplicity. Let $l\Delta t$, $l = 1, \dots, L$, denote the corresponding sampling times. Let $\tilde{\vec{S}}_{k,l}$ denote the solution (e.g., a vector of nodal values) of the discretized (e.g., by a high-dimensional finite element method) state system corresponding to the parameter point $\vec{a}_k$ sampled at time $l\Delta t$. The snapshot set could consist of the $N = KL$ vectors,

$$\vec{S}_i = \tilde{\vec{S}}_{k,l}, \quad k = 1, \dots, K, \quad l = 1, \dots, L, \quad i = (k-1)L + l. \tag{4.2}$$

Given $N$ snapshots $\vec{S}_j \in \mathbb{R}^M$, let $S$ denote the $M \times N$ snapshot matrix whose columns are the snapshots, i.e.

$$S = (\vec{S}_1, \vec{S}_2, \dots, \vec{S}_N). \tag{4.3}$$

Let $S = U \Sigma V^T$ denote the singular value decomposition (SVD) of the snapshot matrix $S$. The POD basis vectors $\vec{\psi}_i \in \mathbb{R}^M$, $i = 1, \dots, N$, are the first $N$ left singular vectors of the snapshot matrix $S$, i.e.

$$\vec{\psi}_i = \vec{U}_i, \quad \text{for } i = 1, \dots, N. \tag{4.4}$$

The $d$-dimensional POD-reduced basis ($d < N$) consists of the first $d$ left singular vectors of the snapshot matrix $S$, i.e.

$$\vec{\psi}_i = \vec{U}_i, \quad \text{for } i = 1, \dots, d. \tag{4.5}$$

By construction the POD basis is an orthonormal basis, i.e.

$$\vec{\psi}_i^T \vec{\psi}_j = 0 \ \text{for } i \ne j \quad \text{and} \quad \vec{\psi}_i^T \vec{\psi}_i = 1. \tag{4.6}$$
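The construction in (4.3)-(4.6) can be sketched with a dense SVD. The function `pod_basis` is hypothetical, and the rule it uses to pick $d$, keeping the smallest number of vectors whose squared singular values account for a fraction `gamma` of the total, is one common choice that is adopted here as an assumption of the sketch.

```python
import numpy as np

def pod_basis(S, gamma=0.99):
    """Columns of S are snapshots; return the first d left singular vectors, sigma, and d."""
    U, sigma, _ = np.linalg.svd(S, full_matrices=False)
    energy = np.cumsum(sigma ** 2) / np.sum(sigma ** 2)
    # smallest d whose retained energy fraction reaches gamma, capped at the number of vectors
    d = min(int(np.searchsorted(energy, gamma)) + 1, len(sigma))
    return U[:, :d], sigma, d

# Example with synthetic snapshots of rank three (not data from this work).
rng = np.random.default_rng(1)
S = rng.standard_normal((50, 3)) @ rng.standard_normal((3, 20))
Psi, sigma, d = pod_basis(S, gamma=0.999999)
residual = S - Psi @ (Psi.T @ S)          # part of the snapshots not captured by the basis
```

For this synthetic matrix at most three basis vectors are retained, the columns of `Psi` are orthonormal, and the squared Frobenius norm of `residual` equals the sum of the squares of the discarded singular values.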

It can be shown that the energy error of the $d$-dimensional POD subspace is given by the sum of the squares of the singular values corresponding to the singular vectors not used in the reduced POD basis, i.e.

$$\varepsilon_{pod} = \sum_{j=d+1}^{N} \sigma_j^2 \tag{4.7}$$

where $N$ is the number of snapshots in the snapshot set and $d$ is the dimension of the POD subspace. If one wishes the relative error to be less than a prescribed tolerance $\delta$, i.e. if one wants,

$$\varepsilon_{pod} \le \delta \sum_{j=1}^{N} |\vec{S}_j|^2, \tag{4.8}$$

then one should choose the smallest integer $d$ such that

$$\frac{\sum_{j=1}^{d} \sigma_j^2}{\sum_{j=1}^{N} \sigma_j^2} \ge \gamma = 1 - \delta. \tag{4.9}$$

In this way we have an approach to guide us in choosing the number, $d$, of POD vectors to be used in our reduced basis. In general, we can look at when the decay of the singular values becomes rapid enough that the inclusion of more singular vectors from the snapshot matrix becomes redundant, in the sense that they no longer help us capture more information than what we have already included in the reduced basis. In the other case, when we know a priori what tolerances we need to meet, we can again use the singular values of the snapshot matrix to guide us in how many POD vectors we need in our reduced basis.

4.2 Reduced Order Modeling and The Parareal Algorithm

Although much work has been done independently on both the parareal algorithm and model reduction, to date no one has looked into combining these two separate real time computing methods to investigate their combined performance capabilities. This section presents the work that was done on combining the parareal algorithm and reduced order modeling for the real time computation of time dependent nonlinear partial differential equations.

The ROM techniques used have a purely spatial effect in that they drastically reduce the size of the linear system or systems that are solved at each of the time steps used in the temporal evolution scheme. The system being solved is due to the spatial discretization at a fixed time step. So ROM immediately provides a more efficient way of computing time evolution equations. Traditionally, though, the numerical temporal evolution schemes have all been sequential in nature and have thus been viewed as providing a limited means of performance gains. In the following chapters on numerical experiments and performance analysis, we shall demonstrate that one can observe significant amounts of speedup in the high fidelity FEM simulations with the parareal implementations alone, as desired. It is well documented in the literature that significant speedup can be achieved using ROM instead of the high fidelity FEM simulations, see [7], [8], [9], and [10]; we view the application of the parareal algorithm to the ROM setting as a way to exploit performance gains that have been overlooked and that are inherent in our choice of how we implement our numerical schemes in the temporal domain. We will see that the performance gains of our ROM implementations, such as speedup and scalability, follow the performance trends of the full FEMs very closely.

In combining the ideas of the parareal algorithm and ROM, we simply use ROM to solve both the coarse and fine grid calculations in space. We will see, in the next chapter, that the parareal algorithm behaves as expected when applied in the ROM setting. In this work we consider the problem of computing POD-based ROM solutions to nonlinear PDEs with multiple parameters on the boundary. So, in general, suppose we obtain an approximation $u_{pod}(t_n, \vec{x})$ to the solution $u(t, \vec{x})$ of a nonlinear partial differential equation defined in a domain $\Omega$ with boundary $\Gamma$ and evaluated at the time $t_n$. The general form of the boundary conditions considered is given by

$$u(t, \vec{x}) = \beta_k(t)\, g_k(\vec{x}) \quad \text{on } \Gamma_k, \quad k = 1, \dots, K \tag{4.10}$$

and

$$u(t, \vec{x}) = 0 \quad \text{on } \Gamma - \bigcup_{k=1}^{K} \Gamma_k \tag{4.11}$$

where $\Gamma_k \cap \Gamma_l = \emptyset$ if $k \ne l$ and $\bigcup_{k=1}^{K} \Gamma_k$ may be a portion of the boundary $\Gamma$ or the entire boundary. The functions $\{g_k(\vec{x})\}_{k=1}^{K}$ are assumed to be given, so that there are $K$ (time dependent) parameters $\{\beta_k\}_{k=1}^{K}$ that serve to specify the problem. Having time dependent parameters over the boundary complicates the ROM process; for more details on how this is handled see [8] and [10]. We consider two examples in this work.

4.2.1 4 - Parameter ROM Problem

We started with a reaction diffusion problem with multiple parameters. The problem is to solve,

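$$u_t - \Delta u + u^2 = 0, \quad (x,t) \in \Omega \times (0,T] \tag{4.12}$$

$$u(x, 0) = 0, \quad x \in \Omega, \tag{4.13}$$

where we take $\Omega = [0,1] \times [0,1]$ and $T = 1$, with the following boundary parameters,

$$y = 1: \quad u = 4x(1-x)\,\beta_1 \quad \text{where } \beta_1 = \begin{cases} 2t & \text{if } t < 0.5 \\ 2(1-t) & \text{if } t \ge 0.5 \end{cases} \tag{4.14}$$

and,

$$y = 0: \quad u = 4x(1-x)\,\beta_2 \quad \text{where } \beta_2 = 4t(1-t) \tag{4.15}$$

$$x = 0: \quad u = 4y(1-y)\,\beta_3 \quad \text{where } \beta_3 = |\sin(2\pi t)| \tag{4.16}$$

$$x = 1: \quad u = 4y(1-y)\,\beta_4 \quad \text{where } \beta_4 = |\sin(4\pi t)|. \tag{4.17}$$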
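For reference, the boundary data (4.14)-(4.17) can be evaluated as follows. The function names `beta` and `boundary_value` and the labels used for the four sides are introduced only for this sketch.

```python
import numpy as np

def beta(t):
    """Boundary parameters (beta_1, ..., beta_4) of (4.14)-(4.17) at time t."""
    b1 = 2.0 * t if t < 0.5 else 2.0 * (1.0 - t)
    b2 = 4.0 * t * (1.0 - t)
    b3 = abs(np.sin(2.0 * np.pi * t))
    b4 = abs(np.sin(4.0 * np.pi * t))
    return np.array([b1, b2, b3, b4])

def boundary_value(side, s, t):
    """Dirichlet value on one side of the unit square; s is the coordinate along that side."""
    index = {"top": 0, "bottom": 1, "left": 2, "right": 3}[side]   # y=1, y=0, x=0, x=1
    return 4.0 * s * (1.0 - s) * beta(t)[index]
```

Each profile $4s(1-s)$ vanishes at the corners of the square, so the data on adjacent sides match there for every value of the parameters.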

For the snapshot generation we sampled points in the four-dimensional parameter space, then solved the full finite element model with h = 0.1 for the equation by impulsively jumping between the sampled parameters; snapshots were generated from the solution at various time intervals for the choice of parameters. A total of 300 snapshots were generated and a POD technique was used to determine the basis vectors. Satisfying the inhomogeneous Dirichlet boundary data requires extra work and was handled by generating basis vectors, which satisfied general inhomogeneous boundary data; again see [8] for more details.

4.2.2 ROM - Navier Stokes Equation with 6 Boundary Parameters

The next problem we looked at was one for which a ROM method had just been developed for a particular application, and which was much more complicated and thus more practical. Our interest was to verify that we could observe the same performance trends as in both the first, simpler ROM implementation and the FEM implementations. In this problem a gas, assumed toxic, has been released in a building, and the question is to find the optimal configuration of the ventilation system, with its inflow/outflow orifices, that clears the building of the toxin as quickly as possible. We model the flow of the gas through the building with the time dependent incompressible Navier-Stokes equations over an H shaped domain, to emulate a particular portion of the building, and with multiple boundary parameters to emulate the options for the inlet/outlet orifices. Formally, the flow problem is stated as

$$\begin{aligned} \frac{\partial \vec{u}}{\partial t} - \nu \Delta \vec{u} + \vec{u} \cdot \nabla \vec{u} + \nabla p &= 0 && \text{in } \Omega \times (0,T] \\ \nabla \cdot \vec{u} &= 0 && \text{in } \Omega \times (0,T] \\ \vec{u}(\vec{x}, 0) &= \vec{u}_0 && \text{in } \Omega \end{aligned} \tag{4.18}$$

for the velocity $\vec{u}$ and the pressure $p$; here the Reynolds number $\frac{1}{\nu}$ is chosen to be 100. The physical domain for this problem is illustrated in Figure 4.1; along the boundary of this flow domain one should note the six sets of inlet/outlet orifices $\Gamma_i$, $i = 1, \dots, 6$, and a main outlet orifice. The remainder of the flow domain's boundary is a solid wall. We enforce a zero stress outflow boundary condition at the main outlet orifice, as indicated in Figure 4.1, and homogeneous zero velocity boundary conditions along the solid portions of the wall.

Figure 4.1: The H-cell domain of the building ventilation problem

At the six sets of inlet/outlet orifices $\Gamma_i$, $i = 1, \dots, 6$, we impose the following boundary conditions:

$$\begin{array}{lll}
\Gamma_1 \text{ (inlets)} & x_1 = 0,\ 8 \le x_2 \le 9 & u = 0.48\,\beta_1 (x_2 - 8)(9 - x_2) \\
 & a_i \le x_1 \le b_i,\ x_2 = 6 & v = 0.48\,\beta_1 (x_1 - a_i)(b_i - x_1) \\
\Gamma_2 \text{ (inlets)} & x_1 = 105,\ 8 \le x_2 \le 9 & u = -0.5\,\beta_2 (x_2 - 8)(9 - x_2) \\
 & c_i \le x_1 \le d_i,\ x_2 = 6 & v = 0.5\,\beta_2 (x_1 - c_i)(d_i - x_1) \\
\Gamma_3 \text{ (inlets)} & x_1 = 0,\ 2 \le x_2 \le 3 & u = 0.44\,\beta_3 (x_2 - 2)(3 - x_2) \\
 & a_i \le x_1 \le b_i,\ x_2 = 5 & v = -0.44\,\beta_3 (x_1 - a_i)(b_i - x_1) \\
\Gamma_4 \text{ (inlets)} & x_1 = 105,\ 2 \le x_2 \le 3 & u = -0.352\,\beta_4 (x_2 - 2)(3 - x_2) \\
 & c_i \le x_1 \le d_i,\ x_2 = 6 & v = -0.352\,\beta_4 (x_1 - c_i)(d_i - x_1) \\
\Gamma_5 \text{ (outlets)} & a_i \le x_1 \le b_i,\ x_2 = 11 & u = 0.612\,\beta_5 (x_1 - a_i)(b_i - x_1) \\
\Gamma_6 \text{ (outlets)} & c_i \le x_1 \le d_i,\ x_2 = 11 & u = 0.896\,\beta_6 (x_1 - c_i)(d_i - x_1),
\end{array}$$

where $(a_i, b_i) \in \{(10,11), (22,23), (34,35)\}$ and $(c_i, d_i) \in \{(70,71), (82,83), (94,95)\}$, $i = 1, 2, 3$.

Approximate solutions of the Navier-Stokes equations are obtained using the standard Taylor-Hood finite element method for the spatial discretization, i.e., continuous piecewise linear functions on triangles are used to approximate the pressure and continuous piecewise quadratic functions on the same triangles are used to approximate the components of the velocity. The backward Euler approximation is used for the temporal discretization. A uniform grid consisting of 8,520 triangles is used, resulting in 35,730 unknowns, and a uniform time step is also used in the full high fidelity FEM approximation. The full high fidelity FEM approximation is then used to generate our snapshot set, which is in turn used to generate the POD bases. Again, having multiple parameters on the boundary complicates the ROM procedure; see [10] for the full details on how this is achieved.

In the paper [10] it is shown that the 35,730 unknowns in the full FEM solution can be reduced to 14 unknowns and still produce a very accurate approximation to the full finite element solution. On FSU's IBM SP3 supercomputer this simulation ran in about fifty minutes with the ROM method. When we implemented the parareal algorithm we obtained speedup factors of up to six with the problem parameters we used. Thus, a potentially life saving simulation that was taking close to an hour to compute could then be computed in under ten minutes, making it much more practical in terms of real time use.

4.3 Implementation

Assuming that one has produced a snapshot set and then generated from the snapshots a reduced basis $\{\psi_i(x)\}_{i=1}^{d}$ of cardinality $d$, we apply this basis as a direct substitute for the finite dimensional finite element basis functions $\{\phi_i(x)\}_{i=1}^{d}$. We apply the same algorithm as in the FEM case but with the ROM bases instead, i.e. let $C$ denote the coarse grid solve, $F$ denote the fine grid solves, and $\delta$ denote the correction solves; then:

• Step (i) Decompose the time interval $[0,T]$ into $N$ coarse time intervals $[T_n, T_{n+1}]$, $n = 0, \dots, N-1$, of length $\Delta T$ and solve the differential equation sequentially in time; denote the solution at each $T_n$ by $C_n$, $n = 0, \dots, N$, where $C_0$ denotes the initial condition of the problem;

• Step (ii) decompose each coarse interval $[T_n, T_{n+1}]$, $n = 0, \dots, N-1$, into $M$ subintervals of length $\Delta t$; in parallel solve the differential equation over each subinterval using $C_i$, $i = 0, \dots, N-1$, as the initial condition at $T_i$; denote this solution at the points $T_i$, $i = 1, \dots, N$, by $F_i$;

• Step (iii) define the errors or defects at the points $T_n$, $n = 0, \dots, N-1$, by $S_n = F_n - C_n$; solve the coarse grid problem for the correction to $F_n$ where the right-hand side of the equation is the jump $S_n$; call these corrections $\delta_n$, $n = 0, \dots, N-1$;

• Step (iv) set $C_n = F_n + \delta_n$; return to step (ii) if satisfactory convergence has not been achieved.

And thus on the 4 parameter ROM problem we have,

• Step (i) For $n = 1, \dots, N$, solve the linear problem with test functions $\psi$ from the reduced basis

$$\int_\Omega \frac{C_n^0 - C_{n-1}^0}{\Delta T}\, \psi + \int_\Omega \nabla C_n^0 \cdot \nabla \psi = -\int_\Omega (C_{n-1}^0)^2\, \psi, \quad \text{where } C_0^0 = u_0.$$

For $k = 0, 1, \dots$ perform the following steps until satisfactory convergence is achieved.

• Step (ii) for each interval $[T_n, T_{n+1}]$, $n = 0, \dots, N-1$, either set $\Delta t = \frac{T_{n+1} - T_n}{M}$ with a fixed $M$, or instead fix $\Delta t$ and set $M = \frac{T_{n+1} - T_n}{\Delta t}$; set $F_{n,0}^k = C_n^k$ and for $m = 1, \dots, M$ solve in parallel the nonlinear problems over each subdomain,

$$\int_\Omega \frac{F_{n,m}^k - F_{n,m-1}^k}{\Delta t}\, \psi + \int_\Omega \nabla F_{n,m}^k \cdot \nabla \psi + \int_\Omega (F_{n,m}^k)^2\, \psi = 0;$$

denote this solution at each point $T_n$, $n = 1, \dots, N$, by $F_n^k$;

• Step (iii) for $n = 0, \dots, N-1$, define $S_n^k = F_n^k - C_n^k$, set $\delta_0^k = 0$ and for $n = 1, \dots, N$ solve the linear problem

$$\int_\Omega \frac{\delta_n^k - \delta_{n-1}^k}{\Delta T}\, \psi + \int_\Omega \nabla \delta_n^k \cdot \nabla \psi + \int_\Omega 2 (F_n^k)\, \delta_n^k\, \psi = \int_\Omega \frac{S_{n-1}^k}{\Delta T}\, \psi;$$

• Step (iv) set

$$C_n^{k+1} = F_n^k + \delta_n^k, \quad n = 1, \dots, N.$$

Again, we do only linear sequential solves so that the full nonlinear solves are done in parallel only thus providing us with some performance gains.

CHAPTER 5

Computational Experiments and Results

In this chapter we report the results of three separate trial problems, primarily to show that the parareal algorithm is sequentially consistent and to give some basic evidence of the speedup capabilities of this algorithm. The speedup reported here is relative speedup, i.e., $S_P = T_1 / T_P$, where $T_1$ is the time to execute on a single processor and $T_P$ is the execution time over $P$ processors. In the next chapter we take a much closer look at the speedup, scalability, and overall parallel performance of this algorithm. All of the results reported in this chapter were computed on the Teragold supercomputer at Florida State University, which is an IBM SP2 system.

5.1 FEM and The Parareal Algorithm Results

We first consider a nonlinear parabolic problem, the model problem discussed in chapter (3), to illustrate the parareal algorithm applied in the FEM setting.

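$$
\begin{aligned}
u_t - \Delta u + f(u) &= g(x,t), & (x,t) &\in \Omega \times (0,T], \\
u(x,0) &= u_0(x), & x &\in \Omega, \\
u(x,t) &= 0, & (x,t) &\in \partial\Omega \times (0,T].
\end{aligned}
\tag{5.1}
$$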

As a test we computed the case where $f(u) = u^2$ and $g(x,t)$ was chosen so that the exact solution is $u(x,y,t) = \cos(10t)\tan(x^2 + y^2 - 1)$ with Dirichlet boundary data. All results use continuous piecewise quadratic elements on a triangular grid for the spatial finite element approximation. The backward Euler approximation is used for the temporal discretization, and Newton's method is used to solve the nonlinear equations with a serial direct banded linear solver. We take $\Omega = [0,1] \times [0,1]$ and $T = 1$.
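For reference, the forcing term implied by this manufactured solution can be worked out by hand. The following is a sketch of that calculation and is not taken from the thesis; the shorthand $s = x^2 + y^2 - 1$ is introduced here only for readability, and $\Delta$ is taken to be the two-dimensional Laplacian.

$$
\begin{aligned}
u &= \cos(10t)\tan s, \qquad s = x^2 + y^2 - 1, \\
u_t &= -10\sin(10t)\tan s, \\
u_{xx} &= \cos(10t)\,\sec^2 s\,\bigl(2 + 8x^2\tan s\bigr), \qquad
u_{yy} = \cos(10t)\,\sec^2 s\,\bigl(2 + 8y^2\tan s\bigr), \\
\Delta u &= \cos(10t)\,\sec^2 s\,\bigl(4 + 8(x^2 + y^2)\tan s\bigr), \\
g &= u_t - \Delta u + u^2
   = -10\sin(10t)\tan s - \cos(10t)\,\sec^2 s\,\bigl(4 + 8(x^2 + y^2)\tan s\bigr) + \cos^2(10t)\tan^2 s .
\end{aligned}
$$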

The first goal is to demonstrate the convergence of the parareal algorithm by showing that we do indeed obtain the same convergence properties as in a sequential implementation of a FEM. In Table 5.1, we summarize the errors obtained by solving this problem using a standard serial finite element approach with a timestep of length $\Delta t$ and the parareal algorithm using the same timestep for the fine grid calculation but a much larger timestep, $\Delta T$, for the coarse grid calculations. As can be seen from the table, it takes only two iterations of the parareal algorithm to obtain accuracy equivalent to that of the serial finite element approach. The errors reported are the maximum $L^2$ and $H^1$ errors over the time interval in which the problem was computed.

Table 5.1: Comparison of errors using standard finite element approach and the parareal/FEM approach

| Norm | $h$ | $\Delta T$ | $\Delta t$ | FEM error | Parareal: initial coarse | Parareal: iteration 1 | Parareal: iteration 2 |
|---|---|---|---|---|---|---|---|
| $L^2$ | 1/10 | 0.1 | 0.01 | $9.873\times10^{-3}$ | $8.424\times10^{-2}$ | $1.749\times10^{-2}$ | $7.339\times10^{-3}$ |
| $L^2$ | 1/20 | 0.1 | 0.0025 | $2.501\times10^{-3}$ | $8.424\times10^{-2}$ | $1.760\times10^{-2}$ | $3.116\times10^{-3}$ |
| $H^1$ | 1/10 | 0.1 | 0.01 | $6.117\times10^{-2}$ | 0.4382 | $9.156\times10^{-2}$ | $5.152\times10^{-2}$ |
| $H^1$ | 1/20 | 0.1 | 0.0025 | $1.588\times10^{-2}$ | 0.4372 | $8.457\times10^{-2}$ | $1.704\times10^{-2}$ |

In Table 5.2, we give a brief example of the speedup potential for this algorithm. Here $T/\Delta T$ is the number of steps taken in the coarse grid, which typically corresponds to the number of processors; $T/\Delta t$ is the number of time steps taken in the fine grid over the whole domain; $\Delta T/\Delta t$ is the number of time steps taken in the fine grid over each sub-domain of size $\Delta T$; and $1/h$ is the number of elements used in the FEM spatial discretization. In this table the problem size is specified so that the number of time steps taken within each sub-interval is a fixed constant, but since $\Delta T/\Delta t = M$ and $\Delta T = T/P$, where $P$ is the number of processors, the underlying fine grid problem changed size for each $P$. Thus, the near linear scaling that is present in the table is only a weak scaling which tends to be linear. The details of this phenomenon will be the subject of the next chapter, where we will take a much closer look at the scalability of this algorithm and the problem parameters that affect these trends.

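Table 5.2: Speedup results for the parareal/FEM approach compared to the serial FEM approach.

| $h$ | $T/\Delta t$ | $T/\Delta T$ | $\Delta T/\Delta t$ | No. of processors | Speedup |
|---|---|---|---|---|---|
| 1/10 | 100 | 10 | 10 | 10 | 3.62 |
| 1/10 | 400 | 20 | 20 | 20 | 4.82 |
| 1/10 | 6400 | 80 | 80 | 80 | 8.39 |
| 1/20 | 100 | 10 | 10 | 10 | 3.66 |
| 1/20 | 400 | 20 | 20 | 20 | 4.97 |
| 1/20 | 6400 | 80 | 80 | 80 | 9.13 |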

In Figure 5.1, one can observe the nonlinear strong scaling trend of this algorithm. Whenever the underlying fine grid spacing $\Delta t$ is fixed, the algorithm does not scale linearly in the strong sense, which is not necessarily bad. As the coarse grid is refined to the point that it approaches the resolution of the fine grid, we begin to lose speedup. The result is that there is always a sweet spot, or optimal number of processors, for a given problem. We investigate the scalability, and how it depends on the problem one is working with, in the next chapter. Some basic speedup results are presented here simply to provide a flavor of what follows in the next chapter.

[Figure 5.1: Speedup of Parareal/FEM implementation. Horizontal axis: Processors (0 to 60); vertical axis: SpeedUp (0 to 6). Blue: ∆t = 0.01; Red: ∆t = 0.005.]

5.2 ROM and The Parareal Algorithm Results

In this section we look at some of the results of combining the parareal algorithm with model reduction.

5.2.1 4 - Parameter Problem, Reaction Diffusion

To illustrate the behavior of the algorithm in the ROM setting, we consider the following nonlinear reaction diffusion example, which involves four time dependent parameters on the boundary. To be clear, we state the problem once again, as in chapter (4).

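$$
u_t - \Delta u + u^2 = 0, \qquad (x,t) \in \Omega \times (0,T], \tag{5.2}
$$

$$
u(x,0) = 0, \qquad x \in \Omega, \tag{5.3}
$$

where we take $\Omega = [0,1] \times [0,1]$ and $T = 1$, with the following boundary parameters:

$$
y = 1: \quad u = 4x(1-x)\,\beta_1, \qquad \text{where } \beta_1 =
\begin{cases}
2t & \text{if } t < 0.5, \\
2(1-t) & \text{if } t \ge 0.5,
\end{cases}
\tag{5.4}
$$

$$
y = 0: \quad u = 4x(1-x)\,\beta_2, \qquad \text{where } \beta_2 = 4t(1-t), \tag{5.5}
$$

$$
x = 0: \quad u = 4y(1-y)\,\beta_3, \qquad \text{where } \beta_3 = |\sin(2\pi t)|, \tag{5.6}
$$

$$
x = 1: \quad u = 4y(1-y)\,\beta_4, \qquad \text{where } \beta_4 = |\sin(4\pi t)|. \tag{5.7}
$$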
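The boundary data in (5.4)-(5.7) is simple enough to evaluate directly. The following is a small sketch of how it might be coded; the function names and the `boundary_value` helper are hypothetical and are not part of the thesis code.

```python
import math


def beta1(t):
    """Piecewise linear 'hat' parameter from (5.4)."""
    return 2.0 * t if t < 0.5 else 2.0 * (1.0 - t)


def beta2(t):
    """Parabolic parameter from (5.5)."""
    return 4.0 * t * (1.0 - t)


def beta3(t):
    """Rectified sine parameter from (5.6)."""
    return abs(math.sin(2.0 * math.pi * t))


def beta4(t):
    """Rectified sine parameter from (5.7), twice the frequency of beta3."""
    return abs(math.sin(4.0 * math.pi * t))


def boundary_value(x, y, t):
    """Dirichlet datum on the boundary of the unit square at time t.

    The spatial profiles 4x(1-x) and 4y(1-y) vanish at the corners, so the
    order in which the four sides are tested does not matter there.
    """
    if y == 1.0:
        return 4.0 * x * (1.0 - x) * beta1(t)
    if y == 0.0:
        return 4.0 * x * (1.0 - x) * beta2(t)
    if x == 0.0:
        return 4.0 * y * (1.0 - y) * beta3(t)
    if x == 1.0:
        return 4.0 * y * (1.0 - y) * beta4(t)
    raise ValueError("point is not on the boundary of the unit square")
```

Each parameter takes values in $[0, 1]$ on $t \in [0, 1]$, so the boundary data is bounded by one in magnitude on every side.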

For the snapshot generation we sampled points in the four-dimensional parameter space, then solved the full finite element model with h =0.1 for the equation by impulsively jumping between the sampled parameters; snapshots were generated from the solution at various time intervals for the choice of parameters. A total of 300 snapshots were generated and a POD technique was used to determine the basis vectors. Satisfying the inhomogeneous Dirichlet boundary data requires extra work and was handled by generating additional basis vectors, which satisfied general inhomogeneous boundary data; again see [8] for more details. The calculations reported in Table 5.3 are for the spatial discretization of h = 0.1. The errors are computed by comparing with the full FEM solution of the problem since an exact solution is not known; once again the errors measure the maximum L2 error over all time. In Table 5.3, column four gives the error between the ROM solution and the standard FEM solution and the remaining columns give the errors at each stage of the parareal/ROM algorithm. Note that similar to the FEM example, it takes only two iterations of the parareal/ROM algorithm to match the error in the serial ROM solution. Again, Figure 5.2 is included to show that indeed the speedup trend is maintained for the ROM implementation.

Table 5.3: Comparison of errors for 4-parameter problem using standard ROM approach and the parareal/ROM algorithm. Errors are calculated by comparing to the full finite element solution.

| # basis vectors | $\Delta T$ | $\Delta t$ | ROM $L^2$-error | Parareal/ROM: initial coarse $L^2$-error | Parareal/ROM: iteration 1 $L^2$-error | Parareal/ROM: iteration 2 $L^2$-error |
|---|---|---|---|---|---|---|
| 4 | 0.1 | 0.01 | $2.442\times10^{-2}$ | 0.1064 | $1.899\times10^{-2}$ | $1.344\times10^{-2}$ |
| 8 | 0.1 | 0.01 | $2.338\times10^{-3}$ | 0.1054 | $1.405\times10^{-2}$ | $1.940\times10^{-3}$ |
| 16 | 0.1 | 0.01 | $1.126\times10^{-4}$ | 0.1053 | $1.403\times10^{-2}$ | $1.569\times10^{-3}$ |

5.2.2 ROM and the Navier-Stokes Equations

We also wanted to test the algorithm on a more complex and thus more realistic system. To be clear, we state the problem once again, as in chapter (4). For this example, we solve the time dependent incompressible Navier-Stokes equations given by

$$
\begin{aligned}
\frac{\partial \vec{u}}{\partial t} - \nu \Delta \vec{u} + \vec{u} \cdot \nabla \vec{u} + \nabla p &= 0 && \text{in } \Omega \times (0,T], \\
\nabla \cdot \vec{u} &= 0 && \text{in } \Omega \times (0,T], \\
\vec{u}(\vec{x}, 0) &= \vec{u}_0 && \text{in } \Omega,
\end{aligned}
\tag{5.8}
$$

for the velocity $\vec{u}$ and the pressure $p$; here the Reynolds number $1/\nu$ is chosen to be 100. The physical domain for this problem is illustrated in Figure 5.3; along the boundary of this flow domain, one should note the six sets of inlet/outlet orifices $\Gamma_i$, $i = 1,\ldots,6$, and a main outlet orifice. The remainder of the flow domain's boundary is a solid wall. We enforce a zero stress outflow boundary condition at the main outlet orifice, as indicated in Figure 5.3, and homogeneous zero velocity boundary conditions along the solid portions of the wall.

At the six sets of inlet/outlet orifices $\Gamma_i$, $i = 1,\ldots,6$, we impose the following boundary conditions:

[Figure 5.2: Speedup of Parareal/ROM implementation. Horizontal axis: Processors (0 to 90); vertical axis: SpeedUp (0 to 6). Blue: ∆t = 0.01; Red: ∆t = 0.005.]

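$$
\begin{array}{lll}
\Gamma_1 \text{ (inlets)} & x_1 = 0,\ 8 \le x_2 \le 9 & u = 0.48\,\beta_1 (x_2 - 8)(9 - x_2) \\
 & a_i \le x_1 \le b_i,\ x_2 = 6 & v = 0.48\,\beta_1 (x_1 - a_i)(b_i - x_1) \\[4pt]
\Gamma_2 \text{ (inlets)} & x_1 = 105,\ 8 \le x_2 \le 9 & u = -0.5\,\beta_2 (x_2 - 8)(9 - x_2) \\
 & c_i \le x_1 \le d_i,\ x_2 = 6 & v = 0.5\,\beta_2 (x_1 - c_i)(d_i - x_1) \\[4pt]
\Gamma_3 \text{ (inlets)} & x_1 = 0,\ 2 \le x_2 \le 3 & u = 0.44\,\beta_3 (x_2 - 2)(3 - x_2) \\
 & a_i \le x_1 \le b_i,\ x_2 = 5 & v = -0.44\,\beta_3 (x_1 - a_i)(b_i - x_1) \\[4pt]
\Gamma_4 \text{ (inlets)} & x_1 = 105,\ 2 \le x_2 \le 3 & u = -0.352\,\beta_4 (x_2 - 2)(3 - x_2) \\
 & c_i \le x_1 \le d_i,\ x_2 = 6 & v = -0.352\,\beta_4 (x_1 - c_i)(d_i - x_1) \\[4pt]
\Gamma_5 \text{ (outlets)} & a_i \le x_1 \le b_i,\ x_2 = 11 & u = 0.612\,\beta_5 (x_1 - a_i)(b_i - x_1) \\[4pt]
\Gamma_6 \text{ (outlets)} & c_i \le x_1 \le d_i,\ x_2 = 11 & u = 0.896\,\beta_6 (x_1 - c_i)(d_i - x_1),
\end{array}
\tag{5.9}
$$

[Figure 5.3: The H-cell domain of the building ventilation problem, with boundary parameters illustrated.]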

where $(a_i, b_i) \in \{(10,11), (22,23), (34,35)\}$ and $(c_i, d_i) \in \{(70,71), (82,83), (94,95)\}$, $i = 1, 2, 3$. Approximate solutions of the Navier-Stokes equations are obtained using the standard Taylor-Hood finite element method for the spatial discretization, i.e., continuous piecewise linear functions on triangles are used to approximate the pressure, and piecewise quadratic functions on the same triangles are used to approximate the components of the velocity. The backward Euler approximation is used for the temporal discretization. A uniform grid consisting of 8,520 triangles is used, resulting in 35,730 unknowns for the full high fidelity FEM approximation. The full high fidelity FEM approximation is then used to generate our snapshot set, which is in turn used to generate the POD basis. Again, having multiple parameters on the boundary complicates the ROM procedure; see [10] for the full details on how this is achieved. In [10], it is shown that the 35,730 unknowns in the full FEM solution can be reduced to 14 unknowns and still produce a very accurate approximation to the full finite element solution. In Figure 5.4, we see that the combination of our ROM approach with the parareal algorithm further accelerates the generation of the desired solution. At this point one should also note the trend displayed within each of the examples presented, namely that by using a higher resolution in the fine grid of the parareal implementation we achieve a higher speedup factor. It is this property that makes the parareal a workhorse algorithm, in that the more work we give it the more performance gains we get back out of it. The goal of the next chapter is to quantify how these performance gains are affected by the problem parameters, such as the underlying fine grid resolution.

[Figure 5.4: Speedup of Parareal/ROM implementation of the Navier-Stokes problem. Horizontal axis: Processors (0 to 90); vertical axis: SpeedUp (0 to 6). Blue: ∆t = 0.01; Red: ∆t = 0.005.]

CHAPTER 6

Performance Analysis and Scalability

In this chapter, we take a closer look at the parallel performance of the parareal algorithm. The focus here will be on the interesting way that this algorithm scales with a larger number of processors and how this scalability depends on specific aspects of the problem one is interested in solving. Efficiency and cost effectiveness of the algorithm will also be addressed. All of the results reported in this chapter were computed on the new Florida State University shared High Performance Computing (HPC) facility. The FSU HPC consists of four head nodes, 128 quad core compute nodes (512 cores), 78 TB of usable storage, and non-blocking Infiniband and IP communication fabrics.

6.1 Introduction to performance analysis concepts and terminology

6.1.1 Quantities of Interest

Speedup

One of the primary quantities of interest in this chapter is the speedup we obtain by applying the parareal algorithm to both our FEM and ROM implementations. In order to make the results clear, we first need to define precisely what we mean here by speedup and exactly how this quantity is measured. In the previous chapter we gave some hints towards the speedup capabilities of this algorithm, and now we seek to make these results more precise.

In practice there are two basic classes of speedup, absolute and relative. In the case of absolute speedup the parallel algorithm is compared directly to the fastest known serial implementation of the method. Thus, if we let $T_A$ be the wall clock time of this serial implementation and $T_P$ be the wall clock time of the parallel implementation using $P$ processors, then the absolute speedup is

$$
S_{PA} = \frac{T_A}{T_P}. \tag{6.1}
$$

Often it isn't clear what the fastest serial implementation is, and many times these highly optimized versions may only be able to reach this peak on very specific architectures. In the case of the parareal algorithm we are dealing with a purely parallel algorithm, so it is difficult to compare directly to a serial FEM or ROM implementation, for example. In these types of situations it is very common to study and report the relative speedup of an algorithm's performance. In this case we let $T_1$ be the wall clock time of the code running on a single processor, while $T_P$ is again the wall clock time of the implementation over $P$ processors, and thus the relative speedup with respect to one processor is defined as

$$
S_{PR}^{1} = \frac{T_1}{T_P}. \tag{6.2}
$$

Similarly, if we let $T_k$ be the wall clock time of the code running on $k$ processors, while $T_P$ is again the wall clock time of the implementation over $P$ processors, then a generalized relative speedup with respect to $k$ processors is defined as

$$
S_{PR}^{k} = \frac{T_k}{T_P}, \tag{6.3}
$$

where it is assumed that $k < P$. In this work all of the speedup results reported are relative speedup as defined by (6.2), which will typically be denoted as simply $S$, $S_P$, or just speedup. For the parareal algorithm this is reasonable because it is a purely parallel algorithm, and computing on a single processor is essentially just doing a single nonlinear solve over the full fine grid with the fine time step $\Delta t$. Speedup is considered to be linear whenever

$$
S_P \approx P, \tag{6.4}
$$

where $S_P$ denotes the speedup obtained with $P$ processors, and it is called super linear whenever

$$
S_P > P. \tag{6.5}
$$

One common cause of super linear speedup is the effect of cache aggregation. In parallel computations, not only does the number of processors change, but so does the size of the accumulated caches from the different processors. With the larger accumulated cache size, more or even all of the data set can fit into the caches, which can dramatically reduce memory access time and produce additional speedup beyond what is provided purely by the computations being done in parallel. When neither (6.4) nor (6.5) applies, the speedup trend is said to be nonlinear.

Efficiency

The efficiency of an algorithm using $P$ processors is

$$
E_P = \frac{S_P}{P}. \tag{6.6}
$$

The efficiency of an algorithm estimates how well-utilized the processors are in solving the problem, compared to how much effort is lost in communication and synchronization. Algorithms with linear speedup and algorithms on a single processor have $E_P = 1$. Many difficult to parallelize algorithms have an efficiency that approaches zero as $P$ increases.
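As a small worked illustration of (6.6), the sketch below applies the definition to the $h = 1/10$ rows of Table 5.2. The function name and the tuple layout are hypothetical and chosen only for this example.

```python
def efficiency(speedup, processors):
    """Efficiency E_P = S_P / P as defined in (6.6)."""
    if processors < 1:
        raise ValueError("processors must be at least 1")
    return speedup / processors


# (number of processors, measured speedup) for h = 1/10, copied from Table 5.2.
table_5_2_h10 = [(10, 3.62), (20, 4.82), (80, 8.39)]

if __name__ == "__main__":
    for P, S in table_5_2_h10:
        print(f"P = {P:3d}  S_P = {S:5.2f}  E_P = {efficiency(S, P):.3f}")
```

The efficiency falls as $P$ grows even though the speedup itself keeps increasing, which is the behavior the scalability discussion below is concerned with.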

Scalability

The speedup trend is a statement about the scalability of the algorithm, which is our primary interest in this chapter. Scalability analysis is the process of analyzing the speedup trend as the number of processors is varied. Many times though the scaling of the algorithm depends on other problem specific parameters such as the problem size, for example. It is then also of great importance to explore the problem parameters to see how they can affect the scaling trend as the number of processes P is varied. We will look closely at what problem parameters are playing a significant role in affecting the scalability of our parareal implementations. In general, speedup describes how the parallel algorithm’s performance changes with an increasing number of processors P . Scalability is concerned with the efficiency of the algorithm with changing problem parameters, such as problem size, by choosing a P dependent on the problem parameter so that the efficiency of the algorithm is bounded below. If we take the problem size to be the parameter of interest in the scalability analysis, then we may define scalability formally.

Definition 6.1.1 An algorithm is scalable if there is a minimal efficiency $\epsilon > 0$ such that, given any problem size $N$, there is a number of processors $P(N)$, which tends to infinity as $N$ tends to infinity, such that the efficiency $E_{P(N)} \ge \epsilon > 0$ as $N$ is made arbitrarily large.

There are two classes of scalability, strong and weak. Weak scalability is a type of scaling study where the problem size is allowed to change as $P$ is varied. Strong scalability, on the other hand, is a type of scaling study where all of the parameters used to specify the problem size are fixed as the number of processes $P$ is increased. The hope with most parallel algorithms is to achieve at least linear speedup, or rather a linear scaling trend; the idea is that if the parallel implementation provides any speedup whatsoever, one would like to observe more speedup as the number of processes is increased. There are many parallel algorithms that by design don't scale in a linear fashion, and this doesn't imply in any way that these types of algorithms are bad or not useful. In fact, there are applications where one needs a computation with a set problem size, and with a nonlinear scaling trend there typically exists an optimal number of processors to be used. One then needs only invest in a cluster resource of size $X$ instead of size $Y$, where $X < Y$, perhaps saving a large sum of money if this is going to be a routine type of calculation, as in many real time computing needs. We will see that with the parareal algorithm we can achieve a nearly linear scaling, but only in the weak sense, while the strong scaling trend is nonlinear. It is a goal of this chapter to explain how this occurs and the role that the problem parameters play in the scalability of the parareal algorithm with respect to our FEM and ROM implementations.

6.1.2 Metrics

In the practice of parallel computing there are some classic results that give us a means of understanding the capabilities and limitations of the implementations of parallel algorithms. Two such classic results are Amdahl's law and Gustafson's law.

Amdahl's law is a model for the theoretical relationship between the expected speedup of parallelized implementations of an algorithm relative to the serial algorithm, under the assumption that the problem size remains the same when the algorithm is parallelized. More specifically, the law is concerned with the speedup achievable from an improvement to a portion of the computation, typically the parallelized fraction, where that improvement leads to an observable speedup in the overall execution of the algorithm. It is often used in the practice of parallel computing to predict the theoretical maximum speedup of an algorithm using multiple processors. Amdahl's law can be stated in a variety of ways, each of which reveals something different about the meaning of the law. If we let $f$ denote the fraction of the computation that is purely sequential and cannot be parallelized into concurrent tasks, and let $T_s$ denote the execution time of a sequential run of the algorithm, then we can state Amdahl's law in terms of the expected speedup.

Law 6.1.1 (Amdahl's Law v1) Amdahl's law states that the speedup given $P$ processors is

$$
S_P = \frac{T_s}{f \cdot T_s + \dfrac{(1-f) \cdot T_s}{P}} = \frac{P}{1 + (P-1)f}. \tag{6.7}
$$

Something immediately evident from this formulation is that the maximum speedup is limited by $f^{-1}$, since

$$
\lim_{P \to \infty} S_P = \frac{1}{f}. \tag{6.8}
$$
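The bound in (6.8) is easy to see numerically. The following is a minimal sketch of (6.7) in both of its forms; the function names and the sample value $f = 0.1$ are assumptions made for illustration only.

```python
def amdahl_speedup(P, f):
    """Closed form of (6.7): S_P = P / (1 + (P - 1) f)."""
    return P / (1.0 + (P - 1) * f)


def amdahl_speedup_from_times(P, f, Ts=1.0):
    """Time-ratio form of (6.7): S_P = Ts / (f Ts + (1 - f) Ts / P)."""
    return Ts / (f * Ts + (1.0 - f) * Ts / P)


if __name__ == "__main__":
    f = 0.1  # hypothetical serial fraction
    for P in (1, 10, 100, 1000, 10**6):
        print(f"P = {P:>7d}  S_P = {amdahl_speedup(P, f):.4f}  (bound 1/f = {1.0 / f:.1f})")
```

However many processors are used, the computed speedup stays below $1/f$, which is why reducing the sequential fraction matters so much in practice.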

If we let $T_{P_1}$ denote the execution time of the parallelized portion of the algorithm using a single processor and $T_s$ denote the execution time of the sequential portion of the algorithm, then we can formulate Amdahl's law in terms of the total parallel execution time.

Law 6.1.2 (Amdahl's Law v2) Amdahl's law states that the wall clock time of the parallel execution, $T(P)$, of the algorithm given $P$ processors is

$$
T(P) = T_s + \frac{T_{P_1}}{P}. \tag{6.9}
$$

Again, one can immediately see that in the limit the serial portion of the algorithm is still present as a lower bound on the total execution time. This is part of the motivation for many researchers to pay a great deal of attention to the optimization of the serial fractions of their algorithms. In our implementations of the parareal algorithm we achieve sequentially consistent results by doing only linearized state solves within the sequential portions of the algorithm, precisely to reduce the amount of time spent in the sequential fractions of the code.

Gustafson's law addresses the shortcomings of Amdahl's law, which cannot scale to match the availability of computing power as the machine size increases. Amdahl's law is based on the assumption that there is a fixed problem size or fixed problem parameters. Gustafson's law tries to account for a changing problem size by instead fixing the parallel execution time.

Law 6.1.3 (Gustafson's Law) Gustafson's law defines the scaled speedup by keeping the parallel execution time constant by adjusting $P$ as the problem size $N$ changes,

$$
S_{P,N} = P + (1 - P)\,\alpha(N), \tag{6.10}
$$

where $\alpha(N)$ is the non-parallelizable fraction of the normalized parallel time, i.e. $T_{P,N} = 1$, given the problem size $N$. Assuming that the serial function $\alpha(N)$ diminishes with the problem size $N$, the speedup approaches $P$ as $N$ approaches infinity, as desired in a linear scaling.

An immediate problem with applying these classic parallel computing principles to the parareal algorithm is that some of the assumptions implicit within each of these laws fail to hold for this algorithm. In the case of Amdahl's law, it is assumed that the execution time of the sequential portion of the algorithm remains constant as the number of processors is varied. In the case of Gustafson's law, it is assumed that the sequential execution time decreases as the problem size grows. Both of these assumptions fail to hold for the parareal algorithm. In all of our implementations of the parareal algorithm the sequential portions of the algorithm, namely the initial coarse grid solve and the solve for the correction term upon each iteration, each grow with the number of processors because we set the coarse grid time step to be $\Delta T = T/P$; as we increase $P$ we refine the coarse grid time step, thus making each of the sequential solves more expensive computationally.

A more practical problem exists for using these principles as a performance measure or estimator. In each case these laws require a priori knowledge of the execution time of the sequential portion of the algorithm. In most cases this is not known or is difficult to gauge, especially when this value changes with any of the problem parameters.

Fortunately, this isn't the end of the story. In 1990, Alan H. Karp and Horace P. Flatt proposed a metric that allows one to empirically measure the amount of time consumed by the sequential portion of the execution time of an implementation of a parallel algorithm. This is now well-known as the Karp-Flatt metric, which is an a posteriori empirical measurement. Given a parallel computation exhibiting a relative speedup $\psi$ using $P > 1$ processors, the experimentally determined sequential fraction $e$ is defined to be the Karp-Flatt metric, viz.,

$$
e = \frac{\dfrac{1}{\psi} - \dfrac{1}{P}}{1 - \dfrac{1}{P}}. \tag{6.11}
$$

It is easy to see that this metric is consistent with Amdahl's law. Consider the case where $P = 1$ in Law 6.1.2, so that $T(1) = T_s + T_{P_1}$. If we define the serial fraction $e = T_s / T(1)$, then (6.9) can be rewritten as

$$
T(P) = T(1)\,e + \frac{T(1)(1 - e)}{P}. \tag{6.12}
$$

The relative speedup is defined as before, $\psi = T(1)/T(P)$; thus dividing (6.12) by $T(1)$ we get

$$
\frac{1}{\psi} = e + \frac{1 - e}{P}, \tag{6.13}
$$

and by simply solving for the serial fraction we obtain the Karp-Flatt metric. Keep in mind that this only shows that the metric is consistent with Amdahl's law and is not a formal derivation, since $e$ is a metric and not merely a mathematically derived quantity. We will see that this metric is useful for tracking the points at which the parareal algorithm is operating at or near its peak performance for a given set of parameters in the strong scalability setting, i.e. when the problem size over the temporal domain remains fixed. What we will demonstrate is that the parareal algorithm achieves its near peak performance when the serial fraction $e$ of the algorithm is at its minimum.
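The consistency argument in (6.12) and (6.13) can be checked numerically. In the sketch below, which is illustrative only, a speedup is generated from Amdahl's law (6.7) with a known serial fraction and then recovered with (6.11); the metric is also applied to one measured entry of Table 5.2. The function names are hypothetical.

```python
def karp_flatt(psi, P):
    """Experimentally determined serial fraction e from (6.11)."""
    if P <= 1:
        raise ValueError("the Karp-Flatt metric requires P > 1")
    return (1.0 / psi - 1.0 / P) / (1.0 - 1.0 / P)


def amdahl_speedup(P, f):
    """Amdahl's law (6.7), used here only to generate consistent test data."""
    return P / (1.0 + (P - 1) * f)


if __name__ == "__main__":
    # Round trip: a speedup produced by Amdahl's law returns its own serial fraction.
    for f in (0.05, 0.2, 0.5):
        psi = amdahl_speedup(16, f)
        print(f"f = {f:.2f}  ->  psi = {psi:.4f}  ->  e = {karp_flatt(psi, 16):.4f}")
    # Measured value from Table 5.2 (h = 1/10, P = 10, speedup 3.62).
    print(f"Table 5.2, P = 10: e = {karp_flatt(3.62, 10):.4f}")
```

Unlike $f$ in Amdahl's law, $e$ is computed from measured timings, so it can change with $P$; tracking that change is how the metric is used in the remainder of the chapter.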

6.2 Problem Parameters in our FEM and ROM Parareal Implementations

In this section the goal is to explain in detail the problem parameters that define the problem size of the simulation. It is these parameters and how they are treated that affect both the strong and weak scalability of the parareal algorithm. Thus, in order to understand how this algorithm scales, we first must shed light on precisely how the input parameters define the problem size of a particular implementation of the algorithm.

6.2.1 Input Parameters and Problem Size

Parareal

The general input parameters for any implementation of the parareal algorithm over a temporal domain are the number of processors being used $P$, the coarse grid time step $\Delta T$, the fine grid time step $\Delta t$, the number of steps taken within each subinterval $M$, the size of the temporal domain $T$ (assuming that we begin at $T = 0$), and the number of iterations $k$ needed to reach the desired level of accuracy, which we take to be the accuracy that would be obtained with a solve over the entire temporal domain with the fine grid time step. Since we observed convergence to the desired accuracy of the fine grid resolution in two iterations of the correction scheme with all of our implementations, we fix $k = 2$ in our performance analysis to eliminate this variable and thus simplify the scalability study. Of course, these parameters are not all independent of one another. Let us recall the following relations. In all of the implementations designed for this scaling study we let

$$
\Delta T = \frac{T}{P}, \tag{6.14}
$$

$$
\Delta t = \frac{\Delta T}{M}, \tag{6.15}
$$

or

$$
M = \frac{\Delta T}{\Delta t}. \tag{6.16}
$$

The last two equations, (6.15) and (6.16), at a glance don't appear to be significantly different but, in fact, the distinction is very important in terms of how the algorithm will scale. It is a seemingly subtle distinction in how the algorithm is implemented, but the resulting behavior is drastically different.

Weak Scalability

In the case of (6.15) we fix $M$ and essentially tell the algorithm that, regardless of the number of processors we are using, and thus the number of subintervals $[T_n, T_{n+1}]$ we generate, we want to perform $M$ state solves over each of these subintervals with the fine time step $\Delta t$. When $M$ is fixed in this way we observe a linear scaling trend, i.e. $S_P \approx P$. This is not very surprising once one looks closer at what is happening with this configuration. It's important to realize that under this configuration the fine grid resolution is a function of $P$, i.e. we have $\Delta t(P) = \frac{T}{M}\left(\frac{1}{P}\right)$. Thus, as we let $P$ increase to larger sets of processors, the fine grid resolution, which is characterized by the fine time step $\Delta t$, gets increasingly smaller, so the problem size in the temporal domain grows such that the fine step $\Delta t$ is always considerably smaller than the coarse step $\Delta T$. This is a case of weak scalability, and so we can conclude that the parareal algorithm can be made to scale linearly but, as we will demonstrate, only in the weak sense.

To illustrate the weak scaling of the parareal algorithm under this type of problem parameter configuration, consider the following test suite of time dependent scalar ODEs; this test suite is taken from [12], and the solutions to the ODEs in the suite are designed to exhibit various types of behavior.

This problem, A1, is the simplest case, exhibiting basic monotonic decay,

$$
\frac{dy}{dt} + y(t) = 0, \qquad y(0) = 1, \quad t \in [0, 10]. \tag{6.17}
$$

This problem, A2, is a special case of the Riccati equation,

$$
\frac{dy}{dt} = \frac{-y^3}{2}, \qquad y(0) = 1, \quad t \in [0, 10]. \tag{6.18}
$$

This problem, A3, exhibits oscillatory behavior,

$$
\frac{dy}{dt} = y\cos(t), \qquad y(0) = 1, \quad t \in [0, 10]. \tag{6.19}
$$

This problem, A4, is a basic logistic growth model,

$$
\frac{dy}{dt} = \frac{y}{4}\left(1 - \frac{y}{20}\right), \qquad y(0) = 1, \quad t \in [0, 10]. \tag{6.20}
$$

The solution to this problem, A5, is a spiral curve,

$$
\frac{dy}{dt} = \frac{y - t}{y + t}, \qquad y(0) = 4, \quad t \in [0, 10]. \tag{6.21}
$$
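For readers who want to experiment with the suite, the right-hand sides (6.17)-(6.21) can be written down directly. The sketch below is not the thesis code: the dictionary layout, the classical Runge-Kutta integrator used as a quick check, and the closed-form solutions for A1 to A4 (obtained here by separation of variables) are all additions for illustration.

```python
import math

# name -> (right-hand side f(t, y), initial value y(0)); all problems use t in [0, 10].
SUITE_A = {
    "A1": (lambda t, y: -y, 1.0),
    "A2": (lambda t, y: -(y ** 3) / 2.0, 1.0),
    "A3": (lambda t, y: y * math.cos(t), 1.0),
    "A4": (lambda t, y: (y / 4.0) * (1.0 - y / 20.0), 1.0),
    "A5": (lambda t, y: (y - t) / (y + t), 4.0),
}

# Closed-form solutions for the separable problems (A5 is omitted).
EXACT = {
    "A1": lambda t: math.exp(-t),
    "A2": lambda t: 1.0 / math.sqrt(1.0 + t),
    "A3": lambda t: math.exp(math.sin(t)),
    "A4": lambda t: 20.0 / (1.0 + 19.0 * math.exp(-t / 4.0)),
}


def rk4(f, y0, t_end, steps):
    """Integrate y' = f(t, y) from t = 0 to t_end with the classical RK4 scheme."""
    h = t_end / steps
    t, y = 0.0, y0
    for _ in range(steps):
        k1 = f(t, y)
        k2 = f(t + h / 2.0, y + h * k1 / 2.0)
        k3 = f(t + h / 2.0, y + h * k2 / 2.0)
        k4 = f(t + h, y + h * k3)
        y += h * (k1 + 2.0 * k2 + 2.0 * k3 + k4) / 6.0
        t += h
    return y


if __name__ == "__main__":
    for name, (f, y0) in SUITE_A.items():
        y_end = rk4(f, y0, 10.0, 2000)
        if name in EXACT:
            print(f"{name}: y(10) = {y_end:.8f}, error = {abs(y_end - EXACT[name](10.0)):.2e}")
        else:
            print(f"{name}: y(10) = {y_end:.8f}")
```

Any of these right-hand sides can be substituted into a parareal driver in place of a PDE solve, which is what makes the suite convenient for scaling experiments.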

In Figure 6.1 we have plotted the speedup results for this test suite of ODEs. The number of processors $P$ is plotted along the vertical axis and the speedup $S$ is plotted along the horizontal axis. We don't currently have the capability to compute on a machine with 1,000 processors, so the parallel speedup was emulated by a series of sequential runs in which the parallel section of the code was looped $P$ times. The total time spent in these loops was divided by $P$, and the timings from the sequential portions were added to obtain the total.

[Figure 6.1: Suite A, Speedup vs. Processors. Vertical axis: P (0 to 1,000); horizontal axis: S (0 to 350). Curves: A1, A2, A3, A4, A5.]

There is an analytical formula that does a fairly good job of capturing the weak scaling phenomenon. The result was derived by G. Bal in 2003; see [2] for the details of the derivation of the speedup and efficiency.

Theorem 6.2.1 (Speed Up for Weak Scaling)

$$
S = \frac{\dfrac{T}{\Delta t}}{k\,\dfrac{T}{\Delta T} + (k-1)\,\dfrac{\Delta T}{\Delta t}}
  = \frac{1}{k\,\dfrac{\Delta t}{\Delta T} + (k-1)\,\dfrac{\Delta T}{T}}, \tag{6.22}
$$

where $\Delta T$ and $\Delta t$ are the coarse and fine grid time steps respectively, $T$ is the maximum time, and $k$ is the number of iterations.

Theorem 6.2.2 (Efficiency for Weak Scaling)

$$
E = \frac{1}{(k-1) + k\,\dfrac{T\,\Delta t}{(\Delta T)^2}}, \tag{6.23}
$$

where again $\Delta T$ and $\Delta t$ are the coarse and fine grid time steps respectively, $T$ is the maximum time, and $k$ is the number of iterations.

As a demonstration of how close this result comes to capturing the actual speedup and efficiency, consider the tabulated results for problem A5, (6.21).

| Processors $P$ (problem A5) | Result | Speed-Up | Efficiency | Iterations Required |
|---|---|---|---|---|
| 1,000 | Theoretical | 333.33 | 33% | 2 |
| 1,000 | Actual | 323.66 | 32% | 2 |
| 500 | Theoretical | 166.67 | 33% | 2 |
| 500 | Actual | 157.21 | 31% | 2 |
| 100 | Theoretical | 33.3 | 33% | 2 |
| 100 | Actual | 27.27 | 27% | 2 |
| 50 | Theoretical | 10 | 20% | 2 |
| 50 | Actual | 16.67 | 33% | 2 |
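Formulas (6.22) and (6.23) are straightforward to evaluate. The sketch below does so under one loudly stated assumption: the text does not give the value of $M$ used for these runs, so the example takes $M = P$ fine steps per subinterval, a choice made here only because it yields a theoretical speedup of about 333.33 for $P = 1{,}000$ and $k = 2$. The function names are hypothetical.

```python
def weak_scaling_speedup(T, dT, dt, k):
    """Speedup (6.22): S = 1 / (k dt/dT + (k - 1) dT/T)."""
    return 1.0 / (k * dt / dT + (k - 1) * dT / T)


def weak_scaling_efficiency(T, dT, dt, k):
    """Efficiency (6.23): E = 1 / ((k - 1) + k T dt / dT^2)."""
    return 1.0 / ((k - 1) + k * T * dt / dT ** 2)


if __name__ == "__main__":
    T, k = 10.0, 2  # time horizon of the test suite, two parareal iterations
    for P in (100, 500, 1000):
        M = P            # ASSUMPTION: fine steps per subinterval, not stated in the text
        dT = T / P       # coarse step, (6.14)
        dt = dT / M      # fine step, (6.15)
        S = weak_scaling_speedup(T, dT, dt, k)
        E = weak_scaling_efficiency(T, dT, dt, k)
        print(f"P = {P:4d}  S = {S:7.2f}  E = {100.0 * E:.0f}%")
```

Because $P = T/\Delta T$, the two formulas satisfy $E = S/P$, in agreement with the definition of efficiency in (6.6).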

Note that again we were able to reach the accuracy of the underlying fine grid resolution in each case with only two iterations of the corrector scheme.

The weak scalability of the parareal algorithm is fairly well-understood and straightforward to explain. One problem with a weak scaling type configuration is that there is no fixed underlying accuracy that is being sought for any $P$. It's true that for a fixed $P$ we have a fixed underlying fine grid resolution, but this reference point changes as we vary $P$, and if we simply let $P$ get too large, then $\Delta t$ becomes so small that round off becomes a serious concern.

Strong Scalability

On the other hand, if we consider the case of (6.16), where we fix $\Delta t$, then $M$ varies instead, again as a function of $P$: we have $M = \frac{T}{\Delta t}\left(\frac{1}{P}\right)$. With this type of configuration we now have a fixed underlying fine grid resolution, and thus the problem size over the temporal domain remains fixed as we vary $P$. The price we pay is that we lose our nearly linear speedup and the scaling trend becomes nonlinear. This makes sense because as $P$ becomes larger, at some point $\Delta T$ begins to approach the fixed value of $\Delta t$, and one is essentially solving the full fine grid problem with a purely parallel algorithm; hence speedup is lost and eventually slowdown will begin to occur. What we will demonstrate is that with such a configuration, for a specified set of problem parameters in which the problem size remains fixed in time, there is always an optimal choice of processors that achieves the maximum speedup for the given problem configuration.
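The difference between the two configurations can be made concrete by tabulating the parameters they produce. The following sketch is illustrative only; the function names and the sample values $T = 1$, $M = 10$, and $\Delta t = 0.01$ are assumptions chosen so that $M$ comes out as a whole number.

```python
def weak_config(T, P, M):
    """Fix M as in (6.15): the fine step shrinks as P grows."""
    dT = T / P
    dt = dT / M
    return dT, dt, M


def strong_config(T, P, dt):
    """Fix dt as in (6.16): the number of fine steps per subinterval shrinks as P grows."""
    dT = T / P
    M = round(dT / dt)  # assumes dt divides dT, up to floating point round off
    return dT, dt, M


if __name__ == "__main__":
    T = 1.0
    print("weak scaling (M = 10 fixed)")
    for P in (10, 20, 50, 100):
        dT, dt, M = weak_config(T, P, 10)
        print(f"  P = {P:3d}  dT = {dT:.4f}  dt = {dt:.5f}  total fine steps = {P * M}")
    print("strong scaling (dt = 0.01 fixed)")
    for P in (10, 20, 50, 100):
        dT, dt, M = strong_config(T, P, 0.01)
        print(f"  P = {P:3d}  dT = {dT:.4f}  M = {M:2d}  total fine steps = {P * M}")
```

In the weak configuration the total number of fine steps grows in proportion to $P$, while in the strong configuration it stays fixed and $M$ drops towards one, at which point the coarse and fine grids coincide.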

FEM

The parareal algorithm applied to time dependent PDEs is sensitive to all of the general problem parameters of the standard parareal algorithm in the temporal domain, plus the additional parameters that specify the problem sizes in each of the spatial dimensions. In all of the implementations within this work we are solving PDEs in two spatial dimensions over a unit square. In principle, we could have different sized spatial discretizations in each of the dimensions, but to simplify the performance analysis we implement a uniform discretization in each spatial dimension. In this work the problem size in space is specified by the number of finite elements used in the spatial discretization, which we denote by 1/h, and thus h is the actual step size used to discretize the domain Ω = (0, 1) × (0, 1), i.e. the unit square. So, in our parareal/FEM implementations the problem parameters that affect the scaling trend in general are P, T, ∆T = T/P, ∆t, h, and M = ∆T/∆t. Since we are now interested in investigating the strong scaling trend we keep ∆t fixed.

ROM

The parameters of the parareal algorithm applied to time dependent PDEs where ROM has been applied are very similar to the parareal/FEM parameters, with a slightly different involvement of h. Recall that in the ROM setting we use POD to generate a reduced basis from a snapshot set that was generated from a high fidelity FEM solution, which in turn was computed using a specific h. What ultimately influences the scaling trend is the number of POD-basis vectors used in the reduced basis in space. In this scaling study we fix the number of effective basis vectors to be eight, but in the problems with multiple parameters we generate extra basis vectors to handle each of the inhomogeneous parameters. In this scalability study we look specifically at the ROM problem with four parameters on the boundary, so if the effective basis is eight, then for this problem the total number of POD-basis vectors used is twelve. Thus, the only parameters that we probe in the ROM setting are the standard parareal parameters that affect the temporal domain, i.e. P, T, ∆T = T/P, ∆t, h, and M = ∆T/∆t. We will see, though, that the strong scaling trend in the ROM setting is very similar to what we will observe in the FEM cases. If a value of h is specified in a ROM result, it refers specifically to the value of h used by FEM to generate the snapshot set from which the POD-basis is generated.

6.2.2 Test Suite for Strong Scalability Study

In each of the test suites described here we are simply stating, explicitly, the values that were set within each of the separate strong scaling experiments. In each case we choose two values for each of the input parameters and then run experiments with nearly every combination of those parameters to observe the scalability trend as we vary P .

FEM

The FEM test suite consists of the following six cases.

FEM case 1: h = 0.1, ∆t = 0.005, and T = 1.0, Figure (6.2).
FEM case 2: h = 0.1, ∆t = 0.001, and T = 1.0, Figure (6.3).
FEM case 3: h = 0.05, ∆t = 0.005, and T = 1.0, Figure (6.4).
FEM case 4: h = 0.05, ∆t = 0.001, and T = 1.0, Figure (6.5).
FEM case 5: h = 0.1, ∆t = 0.001, and T = 10.0, Figure (6.6).
FEM case 6: h = 0.05, ∆t = 0.001, and T = 10.0, Figure (6.7).
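As a rough illustration of what "nearly every combination" means here, the following sketch enumerates the full grid of two values per parameter and compares it with the six FEM cases listed above. The helper `problem_size` is hypothetical, and it assumes 1/h subdivisions per side of the unit square and T/∆t fine time steps in total.

```python
from itertools import product

# Two values for each input parameter, as used in the strong scaling study.
H_VALUES = (0.1, 0.05)
DT_VALUES = (0.005, 0.001)
T_VALUES = (1.0, 10.0)

# The six FEM cases as listed in the text, stored as (h, dt, T).
FEM_CASES = {
    1: (0.1, 0.005, 1.0),
    2: (0.1, 0.001, 1.0),
    3: (0.05, 0.005, 1.0),
    4: (0.05, 0.001, 1.0),
    5: (0.1, 0.001, 10.0),
    6: (0.05, 0.001, 10.0),
}


def problem_size(h, dt, T):
    """Return (subdivisions per side, total fine time steps).

    Assumes 1/h subdivisions per side of the unit square (illustrative).
    """
    return round(1.0 / h), round(T / dt)


all_combinations = set(product(H_VALUES, DT_VALUES, T_VALUES))
unused = sorted(all_combinations - set(FEM_CASES.values()))

if __name__ == "__main__":
    for case, (h, dt, T) in FEM_CASES.items():
        per_side, fine_steps = problem_size(h, dt, T)
        print(f"FEM case {case}: {per_side} subdivisions per side, "
              f"{fine_steps} fine time steps")
    print("combinations not in the suite:", unused)
```

Of the eight possible combinations, the two that are not run are those pairing the coarser fine step ∆t = 0.005 with the longer interval T = 10.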

ROM

The ROM test suite consists of the following two cases.

ROM case 1: h = 0.1, ∆t = 0.005, and T = 1.0, Figure (6.8).
ROM case 2: h = 0.1, ∆t = 0.001, and T = 10.0, Figure (6.9).

6.3 Strong Scaling Trends of the Parareal Algorithm

In each case, all of the timings were taken with the standard MPI wall-clock timer provided by the Open MPI distribution.

6.3.1 FEM Suite

Table 6.1: Results of FEM Test Case 1

P     Speed-Up   Efficiency   Karp-Flatt
4     1.51       38%          55%
5     1.873      37%          42%
10    3.51       35%          21%
15    1.42       9.5%         68%
20    0.807      4%           125%
40    0.266      0.67%        383%

Table 6.2: Results of FEM Test Case 2

P     Speed-Up   Efficiency   Karp-Flatt
4     1.535      38%          54%
5     2.03       41%          37%
10    4.486      45%          14%
20    2.58       13%          36%
40    1.67       4.2%         59%
50    0.876      1.8%         114%

Notice in comparing cases one and two that, although the maximum speedup occurs at around ten processors in each case, case two, which has a more refined fine grid resolution, is able to achieve more speedup for about the same price. This is typical of the strong scaling trend. In general, as we make the problem sizes larger we obtain a better performance gain, although typically at the cost of using more processors. Also, note that the Karp-Flatt metric does indeed attain its minimum value at the optimal choice of processors, which in these cases is about ten. Finally, note that with the larger problem size in time, case two, a higher percentage of the implementation is benefitting from the parallelization.
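The efficiency and Karp-Flatt columns can be recomputed from the measured speedups alone. The sketch below assumes the standard definitions, efficiency E = S/P and Karp-Flatt metric e = (1/S - 1/P)/(1 - 1/P), and applies them to the speedups reported in Table 6.1; the function and variable names are illustrative only.

```python
def efficiency(speedup, P):
    """Parallel efficiency E = S / P."""
    return speedup / P


def karp_flatt(speedup, P):
    """Karp-Flatt metric (experimentally determined serial fraction):
    e = (1/S - 1/P) / (1 - 1/P), defined for P > 1."""
    if P < 2:
        raise ValueError("the Karp-Flatt metric requires at least two processors")
    return (1.0 / speedup - 1.0 / P) / (1.0 - 1.0 / P)


# Measured speedups from Table 6.1 (FEM test case 1), keyed by P.
TABLE_6_1 = {4: 1.51, 5: 1.873, 10: 3.51, 15: 1.42, 20: 0.807, 40: 0.266}

if __name__ == "__main__":
    print(" P   speedup   efficiency   Karp-Flatt")
    for P, S in TABLE_6_1.items():
        print(f"{P:2d}   {S:7.3f}   {efficiency(S, P):10.1%}   {karp_flatt(S, P):10.1%}")
```

Values above 100% for the metric correspond to the rows where the measured speedup has dropped below one, i.e. where the parallel run is slower than the serial one.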

Table 6.3: Results of FEM Test Case 3

P     Speed-Up   Efficiency   Karp-Flatt
4     1.53       38%          54%
5     1.88       37.6%        41%
10    2.35       24%          36%
40    3.54       8.9%         26%
50    2.92       5.8%         33%
80    1.92       2.4%         51%
100   1.56       1.6%         64%

Table 6.4: Results of FEM Test Case 4

P     Speed-Up   Efficiency   Karp-Flatt
4     1.96       49%          35%
5     2.55       51%          24%
10    2.38       24%          36%
20    2.89       14%          31%
40    1.86       4.7%         53%

In cases three and four we have increased the problem size in space. In adjusting this parameter we notice that we can obtain similar factors in speedup, but the cost is that it takes roughly two to four times as many processors to achieve this result. Also, the Karp-Flatt metric is still at a minimum near the sweet spot, i.e. the optimal number of processors where the maximum speedup occurs.

Table 6.5: Results of FEM Test Case 5

P     Speed-Up   Efficiency   Karp-Flatt
5     2.03       41%          37%
10    4.06       41%          16%
15    2.95       20%          29%
20    3.95       20%          21%
40    8.003      20%          10%
50    6.93       14%          13%
60    5.61       14%          16%
80    5.81       7.3%         16%
100   4.71       4.7%         20%

Table 6.6: Results of FEM Test Case 6

P     Speed-Up   Efficiency   Karp-Flatt
4     2.09       52%          30%
5     2.6        52%          23%
10    2.57       26%          32%
20    4.74       24%          17%
40    18.69      47%          2.9%
50    22.55      45%          2.5%
80    32.191     40%          1.9%
100   36.925     37%          1.7%
200   39.7016    20%          2%

In cases five and six we observe something very interesting. In each of these cases we have T = 10 and ∆t = 0.001. The only difference is the problem size in space, h. So far it seems that the strong scaling of the algorithm is most sensitive to the length of the time interval. Also, one can note that when the time interval is made larger, the problem size in space has a much more drastic effect on the scaling trend. In these cases one can observe that when h = 0.1, as in Table 6.5, the maximum speedup occurs at P = 40, with a factor of about 8. For the case in Table 6.6, where h = 0.05, the maximum speedup was computed for P = 200, with a factor of about 40, which is a jump by a factor of five in both the speedup and the optimal number of processors needed to achieve the maximum speedup. Also, we again see that the Karp-Flatt metric is near its minimum value near the optimal number of processors required to achieve the maximum amount of speedup. It may be that for Table 6.6 the true maximum speedup actually occurs at some number of processors between 100 and 200, since the value of the Karp-Flatt metric increases slightly from P = 100 to P = 200.
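To illustrate how the sweet spot can be read off a posteriori, the sketch below locates the processor count with the largest measured speedup and the one with the smallest Karp-Flatt value for FEM cases five and six. It assumes the standard Karp-Flatt definition e = (1/S - 1/P)/(1 - 1/P); the helper names are hypothetical, and the data are the speedups from Tables 6.5 and 6.6.

```python
def karp_flatt(speedup, P):
    """Karp-Flatt metric e = (1/S - 1/P) / (1 - 1/P), for P > 1."""
    return (1.0 / speedup - 1.0 / P) / (1.0 - 1.0 / P)


def best_by_speedup(table):
    """Processor count at which the measured speedup is largest."""
    return max(table, key=table.get)


def best_by_karp_flatt(table):
    """Processor count at which the Karp-Flatt metric is smallest."""
    return min(table, key=lambda P: karp_flatt(table[P], P))


# Measured speedups keyed by P, taken from Tables 6.5 and 6.6.
TABLE_6_5 = {5: 2.03, 10: 4.06, 15: 2.95, 20: 3.95, 40: 8.003,
             50: 6.93, 60: 5.61, 80: 5.81, 100: 4.71}
TABLE_6_6 = {4: 2.09, 5: 2.6, 10: 2.57, 20: 4.74, 40: 18.69,
             50: 22.55, 80: 32.191, 100: 36.925, 200: 39.7016}

if __name__ == "__main__":
    for name, table in (("FEM case 5", TABLE_6_5), ("FEM case 6", TABLE_6_6)):
        print(f"{name}: max speedup at P = {best_by_speedup(table)}, "
              f"min Karp-Flatt at P = {best_by_karp_flatt(table)}")
```

For case five both criteria select P = 40. For case six the largest measured speedup is at P = 200 while the metric is smallest at P = 100, which is consistent with the remark that the true optimum may lie between the two sampled values.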

6.3.2 ROM Suite

Table 6.7: Results of ROM Test Case 1

P     Speed-Up   Efficiency   Karp-Flatt
4     1.96       49%          35%
5     2.417      48%          27%
10    4.29       43%          15%
20    5.96       30%          12%
40    4.465      11%          20%
50    3.723      7.4%         25%
100   1.98       2%           50%

Table 6.8: Results of ROM Test Case 2

P     Speed-Up   Efficiency   Karp-Flatt
4     1.003      25%          99.6%
10    4.945      49%          11%
20    9.91       50%          5.4%
40    19.003     48%          2%
80    32.373     40%          1.9%
100   36.74      37%          1.7%

An important thing to note with both of the ROM cases is that they do indeed have a scaling trend very similar to that of the FEM cases. One can observe that in case 1, Table 6.7, the maximum speedup occurs when P = 20, and in Table 6.8 we again see how drastically the lengthening of the temporal domain affects the strong scaling trend of the implementation. Once again we see that the Karp-Flatt metric attains its minimum at the point where we achieve the maximum speedup of our implementation.

It is encouraging to have found a metric that aids us in tracking where a given problem will provide optimal performance in the case of strong scaling. The problem is that in its current formulation it is only useful as an a posteriori empirical measurement, so it is not yet clear how we can make use of it in a predictive way. These results are at least promising in the sense that this metric may lead us towards an analytical approach for predicting, a priori, the number of processors required to achieve the maximum performance from a given problem configuration. More work is required to determine how fruitful this metric could be for this purpose.

[Plot: speedup (vertical axis) versus number of processors (horizontal axis, 0 to 40).]

Figure 6.2: Speedup of Parareal/FEM with h = 0.1, ∆t = 0.005, and T = 1

[Plot: speedup (vertical axis) versus number of processors (horizontal axis, 0 to 50).]

Figure 6.3: Speedup of Parareal/FEM with h = 0.1, ∆t = 0.001, and T = 1

[Plot: speedup (vertical axis) versus number of processors (horizontal axis, 0 to 100).]

Figure 6.4: Speedup of Parareal/FEM with h = 0.05, ∆t = 0.005, and T = 1

[Plot: speedup (vertical axis) versus number of processors (horizontal axis, 0 to 40).]

Figure 6.5: Speedup of Parareal/FEM with h = 0.05, ∆t = 0.001, and T = 1

[Plot: speedup (vertical axis) versus number of processors (horizontal axis, 0 to 100).]

Figure 6.6: Speedup of Parareal/FEM with h = 0.1, ∆t = 0.001, and T = 10

[Plot: speedup (vertical axis) versus number of processors (horizontal axis, 0 to 200).]

Figure 6.7: Speedup of Parareal/FEM with h = 0.05, ∆t = 0.001, and T = 10

[Plot: speedup (vertical axis) versus number of processors (horizontal axis, 0 to 100).]

Figure 6.8: Speedup of Parareal/ROM with h = 0.1, ∆t = 0.005, and T = 1

[Plot: speedup (vertical axis) versus number of processors (horizontal axis, 0 to 100).]

Figure 6.9: Speedup of Parareal/ROM with h = 0.1, ∆t = 0.001, and T = 10

CHAPTER 7

Conclusions and Future Work

7.1 Conclusions

In conclusion, first note that we have demonstrated the potential of the parareal algorithm to provide significant performance gains in computing solutions of nonlinear partial differential equations, and thus its potential for real time computing applications. Further, we have shown that in combining the parareal algorithm with model reduction we do indeed obtain greater performance gains than either method can provide alone. Also, we provide a fairly detailed performance analysis of the weak and strong scalability of the parareal algorithm. The subtle implementation differences are explained so that one is aware of how to achieve either weak or strong scaling with this algorithm, depending on which is preferred for a particular application. Demonstrations of each type of scalability trend have been provided.

While we have provided explanations of why the two types of scalability trends behave either linearly or nonlinearly, only the weak case has been captured analytically so far. In the literature, however, the analytic formulas for the speedup and efficiency of the parareal algorithm, which apply to the case of weak scalability only, are reported as the speedup and efficiency formulas in general, and no distinction is made between configurations that lead to weak or strong scaling; see [17], for example. We have demonstrated that these analytical results are only applicable to the case of weak scalability.

A test suite of problems for both the FEM and ROM implementations was employed to explore the sensitivity of the strong scaling trends to the input parameters of these problems. The results demonstrate that there is no significant difference in the strong scaling trends of either implementation, FEM or ROM. We have identified a metric that allows one to track the performance trend in the case of strong scalability, but only in the sense of an a posteriori empirical measurement. We believe that this provides a potential avenue for investigation that may make it possible to capture the strong scaling trends of the parareal algorithm analytically.

7.2 Future Work

Scalability Analysis

The holy grail, so to speak, of the strong scalability analysis is to capture the phenomenon analytically, so that one can predict a priori how many processors will be needed to achieve the maximum speedup possible for a given problem configuration and application. There is a lot of work left to be done on fully understanding and quantifying, in particular, the strong scalability trends of the parareal algorithm. One possibility is to explore what more can be done with the Karp-Flatt metric to produce an analytic formula that captures these trends. Also, there is more work to be done in trying to further understand the algorithm's sensitivity to all of the problem parameters. Recall that for the strong scaling study we fixed the number of iterations k for each problem, and in the ROM setting we fixed the number of POD-basis vectors. This leaves some question as to how much these parameters might affect the performance trends of the algorithm.

Coupling with other Parallel Algorithms

It would be interesting to see how much of a performance gain would result from coupling this algorithm with other parallel algorithms. Parallelization in space, i.e. domain decomposition, might have a significant effect on the performance of high fidelity FEM simulations. Also, the use of parallel linear solvers could be significant in both the FEM and ROM settings for gaining some performance, since the computational complexity of each method tends to be dominated by solving systems of linear equations. After parallelizing so many routines in an FEM simulation, for example, the actual assembly procedure for constructing the linear systems associated with the FEM discretization may quickly become a significant bottleneck to gaining more performance. There has been some interest in researching techniques to parallelize the assembly routines within an FEM implementation.

Adaptivity

In this work all of our discretizations were uniform in both the spatial and temporal domains. It would be interesting to investigate the use of non-uniform discretizations, particularly in the temporal domain, to see how this might affect the performance of the parareal algorithm. If non-uniform time stepping in the coarse grid, the fine grid, or perhaps both does not have any significant effect on the performance of the algorithm, then this would open the door to investigating the use of various adaptive methods. Adaptive methods could save computational time by focusing the computational effort of the algorithm on regimes that need a finer resolution or further iterations of the corrector scheme.

Applications

The real point of this work is to explicate the computational aspects of the parareal algorithm to see if the performance gains are suitable for applications that require real time computation of complex nonlinear partial differential equations. It would then be really fun to begin exploring what types of real-time computing applications in science and engineering would work well with the parareal algorithm. Some examples that come to mind are real time control of PDEs, real time design simulations driven by PDEs, interactive physics simulators for various types of career training, and diagnostic safety systems that have to compute PDEs very quickly to determine the appropriate configuration of a safety mechanism of some sort. Of course, many people might be interested in simply being able to compute their PDE or ODE driven simulations more quickly, but not necessarily in real time. Exploring the rich span of application areas to determine which settings are well suited for the parareal algorithm is essential and of great interest.

REFERENCES

[1] G. Bal, On the Convergence and the Stability of the Parareal Algorithm to Solve Partial Differential Equations, in Proceedings of the 15th International Domain Decomposition Conference, R. Kornhuber, R.H.W. Hoppe, J. Périaux, O. Pironneau, O.B. Widlund, and J. Xu, eds., Springer LNCSE, 2003, pp. 426-432.

[2] G. Bal, Parallelization in time of (stochastic) ordinary differential equations, Math. Meth. Anal. Num. (submitted), 2003b. Preprint: www.columbia.edu/ gb2030/PAPERS/ParTimeSODE.ps, 2003.

[3] G. Bal and Y. Maday, A Parareal Time Discretization for Nonlinear PDEs with applications to the pricing of an American put, vol. 23 of Lect. notes Compt. Sci. Eng., Springer, 2002, pp. 189-202.

[4] P.F. Fischer, F. Hecht, Y. Maday, The parareal in time semi-implicit approximation of the Navier-Stokes equations, in Proceedings of the 15th International Conference on Domain Decomposition Methods, Berlin, Vol. 40, LNCSE series, Springer Verlag, 2004.

[5] M.J. Gander and S. Vandewalle, Analysis of the Parareal Time-Parallel Time-Integration method, Technical Report TW 443, Katholieke Universiteit Leuven, Department of Computer Science, Belgium, 2005.

[6] W. Gropp, E. Lusk, A. Skjellum, Using MPI: Portable Parallel Programming with the Message-Passing Interface, MIT Press, Cambridge, 1997.

[7] M. Gunzburger, J. Burkardt, Q. Du, H.-C. Lee, Reduced-order modeling of complex systems, in Proceedings of the 20th Biennial Conference on Numerical Analysis, University of Dundee, Dundee, 2003, pp. 29-38.

[8] M. Gunzburger, J. Peterson, Reduced-order modeling of complex systems with multiple system parameters, in Large Scale Scientific Computing: 5th International Conference, LSSC 2005, Sozopol, Bulgaria, June 6-10, 2005, Springer, Berlin, 2006, pp. 15-27.

[9] M. Gunzburger, J. Burkardt, H.-C. Lee, POD and CVT-based reduced-order modeling of Navier-Stokes flows, Comput. Methods Appl. Mech. Engrg. 196, 2006, pp. 337-355.

[10] M. Gunzburger, J. Peterson, J. Shadid, Reduced-order modeling of time-dependent PDEs with multiple parameters in the boundary data, Comput. Methods Appl. Mech. Engrg. 196, 2007, pp. 1030-1047.

[11] C. Harden, J. Peterson, Combining the parareal algorithm and reduced order modeling for real-time calculations of time dependent PDEs, International Conference on Recent Advances in Scientific Computation, Beijing, China, June 2006.

[12] T.E. Hull, W.H. Enright, B.M. Fellen, A.E. Sedgwick, Comparing Numerical Methods for Ordinary Differential Equations, SIAM Journal on Numerical Analysis, Vol. 9, No. 4 (Dec. 1972), pp. 603-637.

[13] J.-L. Lions, Y. Maday, and G. Turinici, A parareal in time discretization of PDEs, C.R. Acad. Sci. Paris, Serie I, 332, 2001, pp. 661-668.

[14] G.E. Karniadakis, R.M. Kirby II, Parallel Scientific Computing in C++ and MPI: A Seamless Approach to Parallel Algorithms and Their Implementation, Cambridge University Press, 2003.

[15] Y. Maday and G. Turinici, The Parareal in Time Iterative Solver: a further direction to parallel implementation, in Proceedings of the 15th International Domain Decomposition Conference, R. Kornhuber, R.H.W. Hoppe, J. Périaux, O. Pironneau, O.B. Widlund, and J. Xu, eds., Springer LNCSE, 2003, pp. 441-448.

[16] Y. Maday and G. Turinici, A parareal in time procedure for the control of partial differential equations, C.R. Acad. Sci. Paris, Ser. I 335, 2002, pp. 387-392.

[17] G.A. Staff, The Parareal Algorithm: A survey of present work, Technical Report, Norwegian University of Science and Technology, Department of Mathematical Sciences, Norway, 2003.

[18] G.A. Staff and E.M. Rønquist, Stability of the Parareal Algorithm, in Proceedings of the 15th International Domain Decomposition Conference, R. Kornhuber, R.H.W. Hoppe, J. Périaux, O. Pironneau, O.B. Widlund, and J. Xu, eds., Springer LNCSE, 2003, pp. 449-456.

[19] V. Thomée, Galerkin Finite Element Methods for Parabolic Systems, Springer Series in Computational Mathematics, 2006.

BIOGRAPHICAL SKETCH

Christopher R. Harden

Christopher R. Harden was born on the twenty-ninth of December, in the year of nineteen seventy-six, in Jacksonville, Florida. In the spring of 2005, he completed a Bachelor's degree in Philosophy and a Bachelor's degree in Pure Mathematics at the University of North Florida. In the fall of 2005, he entered the Ph.D. program for Applied and Computational Mathematics at Florida State University. At the end of his first semester in the program he came under the advisement of Prof. Janet Peterson and was supported by her as a research assistant for the following two semesters in the School of Computational Science. When the new degree programs in Computational Science were approved, he happily and excitedly switched over into this new program, and upon the defense of this thesis he expects to be the first to obtain his Master's degree under the new program in Computational Science, towards the end of the 2008 spring semester. He plans to continue in pursuit of his Ph.D. in the newly approved Ph.D. program in Computational Science and to continue under the advisement of Prof. Janet Peterson. Chris' research interests include numerical ordinary and partial differential equations, finite elements, reduced order modeling, general real time computing strategies, parallel computing and algorithms, the analysis of the performance of parallel algorithms, and code verification and validation. Chris currently lives in Tallahassee, FL, with his wife, Jennifer; his son, Youth; their dog, Eris; and their cat, Arthur.
