Parallelization of Ensemble Kalman Filter (EnKF) for Oil Reservoirs

Md. Khairullah

Parallelization of Ensemble Kalman Filter (EnKF) for Oil Reservoirs

Master’s Thesis in Computer Simulations for Science and Engineering (COSSE)

Delft Institute of Applied Mathematics
Faculty of Electrical Engineering, Mathematics, and Computer Science
Delft University of Technology
The Netherlands

Md. Khairullah
Student id: 4187326

20th June 2012

Author: Md. Khairullah

Title: Parallelization of Ensemble Kalman Filter (EnKF) for Oil Reservoirs

MSc presentation: July 5, 2012

Graduation Committee:
Prof. dr. ir. A.W. Heemink (chair), Delft University of Technology
Prof. dr. ir. C. Vuik, Delft University of Technology
Prof. dr. ir. H.X. Lin, Delft University of Technology
dr. R.G. Hanea, Delft University of Technology
dr. ir. Harald Köstler, Friedrich-Alexander University

Abstract

This thesis describes the design and implementation of a parallel algorithm for the ensemble Kalman filter (EnKF) for oil reservoir management. The implemented application works with a large number of observations from time-lapse seismic data, which leads to a large turnaround time for the analysis step, in addition to the time consuming simulations of the realizations. Provided that parallel resources are used for the parallel simulations of the realizations, the analysis step also deserves parallelization. Our experiments show that parallelization of the analysis step, in addition to the forecast step, also scales well, exploiting the same set of resources with some additional effort.

Preface

The work lying before you is my Master of Science thesis, in which a description is given of the work done and the results achieved since September 2011 at the Delft Institute of Applied Mathematics (DIAM), Faculty of EEMCS, Delft University of Technology. Having obtained my bachelor's degree in Computer Science and Engineering, I feel lucky to be able to apply my programming and analytical abilities to an applied and practical real-world topic in my current studies. I became very fond of parallel computing after I learnt socket programming in Java during my undergraduate studies. It was a nice opportunity for me to work on the parallelization of data assimilation for reservoir management.

Before continuing with my thesis, I would like to thank some people who have been of great help during my Master's project. First of all, I would like to thank my supervisors Dr. Lin and Dr. Hanea for giving me their valuable time to answer all my stupid and wise questions related to the project, and also for their comments, suggestions and support at different stages of the project. I want to thank the graduation committee members for their collective and individual contributions. I would like to thank Dr. J.D. Jansen for his support at the very beginning of the project in teaching me some basics of the simsim simulator. I would also like to thank ir. C.W.J. Lemmens for his technical support. I would like to thank my family members for their patience and support during my life abroad. Special thanks go to Dr. Harald Köstler, FAU, Erlangen for his support during my time in Erlangen. Last but not least, I express my deep gratitude to the European Union for funding my studies through the Erasmus Mundus Scholarship.

Md. Khairullah

Delft, The Netherlands 20th June 2012

Contents

Preface

1 Introduction

2 Parallel Data Assimilation with EnKF
2.1 Problem settings
2.2 Previous works
2.3 Proposed parallel algorithm
2.4 Parallel performance

3 Basics of Reservoir Engineering
3.1 Definitions
3.2 Recovery processes
3.3 Different data sets
3.3.1 Production history data
3.3.2 Time-lapse seismic data
3.3.3 Prior knowledge

4 Data Assimilation with EnKF
4.1 Review of the Kalman Filter
4.1.1 Variance minimizing analysis scheme
4.1.2 Kalman filter
4.2 Ensemble Kalman filter
4.2.1 Representation of error statistics
4.2.2 Analysis scheme
4.3 Practical implementation
4.3.1 Ensemble representation of the covariance
4.3.2 Measurement perturbations
4.3.3 Analysis equation
4.3.4 EnKF for combined parameter and state estimation

5 Parallel Computing with MPI and ScaLAPACK
5.1 Parallel computing
5.2 MPI
5.3 Parallel numerical libraries for SVD
5.3.1 ScaLAPACK

6 Implementation
6.1 Reservoir model
6.2 Software structure
6.3 Optimization
6.3.1 Load balancing
6.3.2 Minimizing data communication

7 Results and Discussions
7.1 Verification
7.2 Performance of the parallel algorithm
7.2.1 Difference in speedup: forecast vs analysis step
7.2.2 Super-linear speedup
7.2.3 Parallel scalability
7.2.4 Difference in performance: 48 realizations vs 96 realizations
7.2.5 Difference in speedup and performance: LAN cluster vs SARA Lisa cluster
7.2.6 Effect of ScaLAPACK process grid
7.2.7 Effect of ScaLAPACK block size
7.2.8 Effect of non blocking communication

8 Conclusion and Future Work
8.1 Conclusion
8.2 Future work

List of Figures

1.1 The forecast of global demand of energies [2]
1.2 The forecast of global production of energies [2]
2.1 Typical logical domain decomposition for the forecast step (top) and for the analysis step (bottom)
2.2 An example of physical domain decomposition used for the ocean circulation model in [21]
2.3 Serial implementation of the EnKF for data assimilation
2.4 1st level parallelization of the simulator
2.5 2nd level parallelization of the simulator
2.6 A multi level parallelization
2.7 Proposed parallel EnKF for data assimilation
3.1 Reservoir management depicted as a closed loop model-based controlled process [19]
4.1 Schematic of how data assimilation (DA) works and adds value to observational and model information. The data shown are various representations of ozone data of a particular day [22]
4.2 The ongoing discrete Kalman filter cycle. The time update projects the current state estimate ahead in time. The measurement update adjusts the projected estimate by an actual measurement at that time.
4.3 The procedure of data assimilation with EnKF for the ensemble member j.
5.1 Shared memory architecture
5.2 Distributed memory architecture: MPI's work place
5.3 ScaLAPACK Software Structure
5.4 Data distribution examples: column blocked (left) and column cyclic (right)
5.5 Data distribution examples: column blocked cyclic (left) and 2D blocked cyclic (right)
5.6 An example of view of data distribution in ScaLAPACK [3]
6.1 Location of the wells in the model reservoir field
6.2 Assumed constant porosity field for all ensemble members
6.3 Software structure of the developed parallel application
7.1 rms differences of log permeability over time
7.2 log permeability fields with 16 realizations: initial (top), after 6 assimilation steps in 3 years (middle), the true log permeability (bottom)
7.3 log permeability fields with 48 realizations: initial (top), after 6 assimilation steps in 3 years (middle), the true log permeability (bottom)
7.4 log permeability fields with 96 realizations: initial (top), after 6 assimilation steps in 3 years (middle), the true log permeability (bottom)
7.5 Speedup of the parallel implementation for 48 realizations in the LAN cluster for different parts
7.6 Speedup of the parallel implementation for 48 realizations on the SARA Lisa cluster for different parts
7.7 Speedup of the parallel implementation for 96 realizations in the LAN cluster for different parts
7.8 Speedup of the parallel implementation for 96 realizations on the SARA Lisa cluster for different parts
7.9 Execution time of the forecast and analysis step for 48 and 96 realizations on the LAN cluster
7.10 Execution time of the forecast and analysis step for 48 and 96 realizations on the SARA Lisa cluster
7.11 Comparison of execution time of the forecast and analysis step for 48 realizations on different clusters
7.12 Execution time of the analysis step for different process grid orientation for 48 realizations
7.13 Execution time of the analysis step for different block sizes for 48 realizations
7.14 Execution time of the analysis step for blocking and non-blocking data communication on the LAN cluster (top) and SARA Lisa cluster (bottom)

List of Tables

7.1 rms differences between the true and assimilated log permeabilities at different time steps
7.2 Hardware specification of the LAN cluster for initial tests and measurements
7.3 Performance of the parallel implementation for 48 realizations in the LAN cluster
7.4 Performance of the parallel implementation for 48 realizations in the SARA Lisa cluster
7.5 Performance of the parallel implementation for 96 realizations in the LAN cluster
7.6 Performance of the parallel implementation for 96 realizations in the SARA Lisa cluster
7.7 Execution time (in minutes) of the analysis step for different process grid orientation for 48 realizations
7.8 Execution time (in minutes) of the analysis step for different block sizes for 48 realizations
7.9 Execution time (in minutes) of the analysis step for blocking and non-blocking data communication

List of Algorithms

1 A single step of basic EnKF
2 A single step of basic parallel EnKF
3 A single step of the optimized parallel EnKF
4 A single step of the final parallel EnKF

Chapter 1

Introduction

Because of the rapidly increasing world population and the accompanying industrialization, natural resources such as petroleum are becoming scarcer day by day. More than 150 years after its discovery, oil continues to play an essential role for mankind and in the global economy. At the same time, renewable energy is becoming more important. Its production is less harmful to the environment and is starting to become economically competitive with the production of energy from fossil fuels. Although the use of renewable fuels is expected to grow strongly over the coming years, oil and natural gas will remain the main energy sources until at least 2050.

Both history and future projections show how strongly the world depends on oil. Since the energy shocks of the 1970s and 1980s [1], which exposed the extreme dependency of the developed world on petroleum products and its extreme vulnerability to shortfalls in supply, oil has remained the top source of energy, even though it has fallen off its position of high adoration. Figures 1.1 and 1.2 from the EIA give the forecast of demand and production of oil and other energy sources for the next 20 years, showing a clear deficiency of production to meet the demand, which underlines the urgency of improving the production of petroleum reservoirs. The exploration and production industry is struggling to keep up with this demand. Oil companies find it more and more difficult each year to meet this huge demand for fossil fuels. First, a lot of large fields are already at a mature stage. Other fields, discovered recently, are often too small to be exploited efficiently. It is therefore essential to develop new technologies that allow for a reduction of the costs of maintaining oil fields and an increase of the rate of recovery of oil from existing fields. Sound reservoir management is a key requirement for increasing the production from petroleum reservoirs.

Figure 1.1: The forecast of global demand of energies [2]

Figure 1.2: The forecast of global production of energies [2]

Data assimilation is an important tool for closed-loop oil reservoir management. The ensemble Kalman filter (EnKF) has gained huge popularity for data assimilation for nonlinear systems in recent years. However, the EnKF has a large turnaround time in terms of computing, particularly for running the simulations of hundreds of realizations. The situation worsens when we have a large number of observations, because the analysis step then also becomes time consuming. Fortunately, both the forecast part and the analysis part of the EnKF have inherent parallelism.

We implement a parallel EnKF for data assimilation for oil reservoirs and present the experimental results. The results show that by parallelizing the whole EnKF process, a significant speedup can be achieved using the same set of parallel hardware that in previous works was used for the forecast step only, or for the forecast step and only partially for the analysis step. The remainder of the thesis contains 7 chapters and is organized as follows. In chapter 2, we describe the problem setting, survey previous work, and formulate a new parallel algorithm for data assimilation with EnKF. In chapter 3 we study some necessary basics and foundations of modern reservoir engineering in order to become familiar with some common terminologies. Chapter 4 gives a general background and overview of data assimilation and the ensemble Kalman filter (EnKF). Chapter 5 describes the parallel computing tools MPI and ScaLAPACK, which we use for the parallel EnKF. In chapter 6 we describe the detailed implementation of the parallel EnKF. In chapter 7 we present the experimental results: the verification of the implemented parallel EnKF as well as its performance and efficiency. Chapter 8 contains the concluding remarks, as well as some recommendations for potential future research directions.

Chapter 2

Parallel Data Assimilation with EnKF

In this chapter we define our problem setting. We then go through a brief discussion of previous work on this topic and present a parallel algorithm for data assimilation with EnKF for oil reservoirs. Finally, we briefly discuss some metrics for measuring the benefits of parallel applications, which we will use in chapter 7.

2.1 Problem settings

Data assimilation is an important component of modern reservoir management systems. The Kalman filter, a widely used data assimilation algorithm, uses a series of measurements observed over time, containing noise and other inaccuracies, and produces estimates of unknown variables that tend to be more precise than those based on a single measurement alone. However, the Kalman filter is optimal only for linear systems, and for nonlinear systems it requires the derivation of a tangent linear operator, which is impossible or very compute intensive for some systems. The ensemble Kalman filter (EnKF), a suboptimal approximation of the Kalman filter, is a recursive filter suitable for problems such as geophysical models with a large number of variables. The EnKF consists of two major parts:

1. forecast or simulation: A group or ensemble of realizations/models of the system is run independently, generating an ensemble of predictions or state vectors of the system. This step is computing intensive both for systems with a small and with a large number of observations.

2. analysis or update: After completion of the forecast/simulation, the ensemble of predicted state vectors is gathered and compared with the available measurements. Assuming Gaussian noise both in the measurements and in the model forecasts, some computations are done to decide the weights of the forecast and of the measurements in the updated prediction, yielding a better approximation.

The forecast step is embarrassingly parallel, as each ensemble member can run independently on a separate processing element. Hence most parallel implementations rely solely on this step. The analysis or update step is critical only for systems with a large number of observations, and its parallel implementation is not simple because of the intermediate inter-process communication it requires. Most implementations of this step depend on some sort of domain decomposition to avoid communicating large volumes of data. Our goal is to develop a parallel EnKF for oil reservoirs that works with the whole domain.
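Schematically, the two parts can be written as the following Python sketch. It is an illustration only: forecast and analyse are placeholders for a reservoir simulator and for the update derived in chapter 4, and the names are ours, not those of the thesis code.

def enkf_cycle(ensemble, observations, forecast, analyse):
    """One EnKF cycle: forecast every realization independently
    (embarrassingly parallel), then one collective analysis/update."""
    predicted = [forecast(member) for member in ensemble]   # forecast step
    return analyse(predicted, observations)                 # analysis step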

2.2 Previous works

In [20], data assimilation experiments are performed using a parallel ensemble Kalman filter (EnKF) implemented for a two-layer shallow water model, where advantage is taken of the inherent parallelism in the EnKF by running each ensemble member on a different processor of a parallel computing system. The Kalman filter analysis step is parallelized by letting each processor handle the observations from a different division of the whole domain. Figure 2.1 depicts the logical domain decomposition used in [20], which is also a common decomposition technique for any parallel EnKF. Each shade of a different color in the figure represents the state vector of a different ensemble member. In the top picture each bounded cell represents a processor. The alphabetical letters in a cell represent the division of the state vector based on some physical decomposition of the domain. The ensemble members (six in this example) are distributed one per processor. The bottom picture shows the domain decomposition used for the analysis step: the memory of each processor contains the same state-vector elements (of the same physical division of the domain, one of the six in this example) from each ensemble member. In [21] a multivariate ensemble Kalman filter (MvEnKF), implemented on a massively parallel computer architecture and developed for an ocean circulation model, is presented. Parallelism for the analysis step is achieved by regionalization of the error covariances of the ensemble. Each processing element (PE) collects elements of a matrix measurement functional from nearby PEs. Figure 2.2 depicts the concept of physical domain decomposition used for the analysis step of the ocean circulation model in [21]. The outer rectangle delimits the area, A, from which the data assimilated on one processor are collected. The innermost rectangle depicts the boundary of the processor's private area, B. The ellipse delimits the influence region of the south-eastern corner cell of the processor's private area. The shaded area contains the ellipses for all grid cells contained in B. The region C, popularly known as the 'halo' region, contains the regions that overlap with adjacent processors and must be updated after all private regions of each processor have been updated. The authors implemented domain decomposition to avoid allocating and loading large state vectors on each processor, on the basis that observations from a distant location should have an insignificant effect on the covariances compared to observations from a nearby

region. However, with the advances in computer memory technology, allocating large state vectors on each processing element is not a big concern at present, nor will it be in the near future. In [23] a parallel implementation of a weighted EnKF for oil reservoirs is presented, which parallelizes each ensemble member over a number of processors; this is described as two levels of parallelism. In [29] a multilevel parallelization for oil reservoir simulation with EnKF is proposed. The first level of parallelization is, as usual, in the forecast step, where each ensemble member runs on a separate processor. A second level of parallelization, which uses a parallel reservoir simulator for each realization, is implemented. The paper also proposes an algorithm that partially parallelizes the analysis step and describes it as the third level of parallelization. The main computational gain of parallelizing the analysis step comes from the fact that the matrix-vector multiplications can be parallelized efficiently. Figure 2.3 depicts the work flow of a serial implementation of the method, whereas figures 2.4, 2.5 and 2.6 show the simplified work flows of the parallel implementations realized so far. In these examples the EnKF method works with N realizations. In figure 2.3 all realizations and the analysis step are executed on a single processor P1. In figure 2.4 each realization runs on a different processor, requiring N processors in total, namely P1, P2, ..., PN, and the analysis step runs on the single processor P1. In figure 2.5 each realization runs on three processors, involving a total of 3N processors, namely P1, P2, ..., P3N, and the analysis step runs on the single processor P1. In figure 2.6 each realization runs on three processors, involving a total of 3N processors, namely P1, P2, ..., P3N, and the analysis step is partially parallelized on N processors, the so-called master processors.

2.3 Proposed parallel algorithm

In the detailed discussion of the EnKF in chapter 4, we will see that the EnKF update can be expressed as a matrix multiplication

A^a = A X,    (2.1)

where the matrix A is the ensemble of forecasted state vectors, A^a is the ensemble of updated state vectors, and X, which we call the updating coefficient matrix, is computed in the analysis step from the forecast and the measurements so as to minimize the variances of the state vectors among the ensemble members. The EnKF analysis step thus consists of the construction of the updating coefficient matrix X and the matrix multiplication of A and X in (2.1). Clearly, in the parallel algorithms mentioned above, the first two levels of parallelization serve only the construction of the matrix A, and the third level is used only for the matrix-matrix multiplication in (2.1).

The construction of the matrix X in (2.1), however, is itself very compute intensive for a large number of observations. We will see in later chapters that the construction of X involves a series of singular value decomposition (SVD) computations and matrix-matrix multiplications, and both operations can be carried out efficiently on parallel computers thanks to the availability of good parallel algorithms and corresponding libraries. Hence the construction of X can also be parallelized, and a significant reduction in assimilation time can be obtained. There are techniques to compute X with fewer computations, based on additional assumptions; but provided that parallel computing resources are available for running the simulations in parallel, we can exploit them for the large computations of the analysis step and obtain exact results, instead of depending on those assumptions. As noted above, the matrix X is termed the updating coefficient matrix in the flowcharts, and by 'update' in the flowcharts we mean the matrix-matrix multiplication in (2.1).

Figure 2.1: Typical logical domain decomposition for the forecast step (top) and for the analysis step (bottom)

Figure 2.2: An example of physical domain decomposition used for the ocean circulation model in [21]

Figure 2.3: Serial implementation of the EnKF for data assimilation

Figure 2.4: 1st level parallelization of the simulator

Figure 2.5: 2nd level parallelization of the simulator

Figure 2.6: A multi level parallelization

Figure 2.7: Proposed parallel EnKF for data assimilation
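To make the computational structure of the analysis step concrete, the following NumPy sketch assembles X from the ensemble perturbations, the innovations and an SVD-based pseudo-inverse, along the lines of the equations derived in chapter 4. It is illustrative only: the function and variable names are ours, and the thesis implementation distributes these operations over MPI processes with ScaLAPACK rather than computing them serially.

import numpy as np

def updating_coefficient_matrix(A, d, M, E):
    """Build X such that the EnKF update is A_a = A @ X, cf. (2.1).
    A: n x N forecast ensemble, d: m measurements,
    M: m x n measurement operator, E: m x N measurement perturbations."""
    N = A.shape[1]
    A_prime = A - A.mean(axis=1, keepdims=True)            # ensemble perturbations
    S = M @ A_prime                                        # measured perturbations
    D_prime = (d[:, None] + E) - M @ A                     # innovations
    C = S @ S.T + E @ E.T                                  # m x m matrix to (pseudo-)invert
    X = np.eye(N) + S.T @ (np.linalg.pinv(C) @ D_prime)    # SVD-based pseudo-inverse
    return X

# the update itself is then a single matrix-matrix product:
# A_analysis = A @ updating_coefficient_matrix(A, d, M, E)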

2.4 Parallel performance

A measurement is needed to express the performance gain achieved by parallelising a given application over a sequential implementation. Here we list some common parallel performance metrics and related terminologies [15].

• execution time: The time elapsed between the beginning and the end of execution on a sequential computer is called the serial runtime and is usually denoted by T_S. The time elapsed from the start of the parallel computation to the end of execution by the last processing element (PE) is called the parallel runtime and is usually denoted by T_P. These are not used as metrics by themselves in practice.

• speedup (S_P): It expresses the relative benefit of solving a problem in parallel and is calculated as the ratio of the serial runtime T_S of the best sequential algorithm to the time T_P taken to solve the same problem with P processors:

S_P = T_S / T_P.    (2.2)

This is the most widely used metric for parallel performance. Some authors, for simplicity, define the speedup as the ratio of the single-processor runtime T_1 of the parallel application to the P-processor runtime T_P:

S_P = T_1 / T_P.    (2.3)

The maximum speedup theoretically achievable is P (linear speedup), although super-linear speedup might occur in practice due to a non-optimal sequential algorithm or hardware characteristics. Cache memory is a very fast memory in the computer memory hierarchy. When the collective cache memory of a group of processors can hold the required data of the application, which is impossible for a single processor or a smaller number of processors, super-linear speedup can be observed due to the cache performance, in addition to the gain from the larger number of processors [16]. However, the well-known Amdahl's law states an important general limitation on the speedup of a parallel application with an increasing number of processors [6]. If N is the number of processors, s is the amount of time spent (by a serial processor) on the serial parts of a program, and p is the amount of time spent (by a serial processor) on the parts of the program that can be done in parallel, then Amdahl's law says that the speedup achieved by parallel processing is given by

Speedup = (s + p) / (s + p/N) = 1 / (s + p/N),    (2.4)

where we have set the total time s + p = 1 for algebraic simplicity. When s ≠ 0, the speedup remains bounded by 1/s, no matter how large N, the number of processors, becomes. However, Gustafson's law, a counterpoint to Amdahl's law, states that with an increased number of processors a larger problem can be solved with the same efficiency [17]. Gustafson proposes a new metric, 'scaled speedup', in this context. A small numerical illustration of these metrics is given after this list.

• cost (C): It is the sum of the times that all processors spend solving the problem in a parallel system. Cost is used for comparing the performance of parallel algorithms to their sequential counterparts, and is defined by

C = P T_P.    (2.5)

• efficiency (E_P): It is another performance metric, closely related to speedup, which is the fraction of the time for which each processor is usefully utilized:

E_P = S_P / P = T_S / (P T_P) = T_S / C.    (2.6)

It is mostly used in conjunction with speedup.

• parallel scalability: A common task in parallel computing is measuring the scalability of an application. This measurement indicates how efficient an application is when using increasing numbers of parallel processing elements. Strong scaling is defined as how the solution time varies with the number of processors for a fixed total problem size. On the other hand, weak scaling is defined as how the solution time varies with the number of processors for a fixed problem size per processor.

• overhead: The performance depends on the overhead inflicted on a parallel algorithm during execution. The overhead of a parallel algorithm is the difference between its cost and the runtime of the fastest known serial algorithm for solving the same problem:

T_O = C − T_S = P T_P − T_S.    (2.7)

Some major sources that inevitably cause overhead in parallel computing are:

1. communication overhead: caused by the time taken to exchange information between processors, which obviously does not occur in sequential algorithms.

2. synchronisation overhead: caused by idling processors, because there is no work in the work queue or a processor has to wait at a synchronisation point. The most important factor in synchronisation overhead is load imbalance.

• granularity: It is the number and size of the basic tasks into which a problem is decomposed. It therefore bounds the maximum level of concurrency and has a great influence on the overhead factors. A decomposition into a large number of small tasks is called fine-grained and a decomposition into a small number of large tasks is called coarse-grained. Generally, a coarser-grained decomposition reduces the communication overhead but increases the synchronisation overhead.
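The following small Python illustration, referred to in the speedup item above, computes these metrics for given timings; the function names and the example numbers are ours, for illustration only.

def parallel_metrics(t_serial, t_parallel, n_procs):
    """Speedup (2.2), cost (2.5) and efficiency (2.6) from measured runtimes."""
    speedup = t_serial / t_parallel
    cost = n_procs * t_parallel
    efficiency = speedup / n_procs        # equals t_serial / cost
    return speedup, cost, efficiency

def amdahl_speedup(s, n_procs):
    """Amdahl's bound (2.4), with the total serial time s + p normalized to 1."""
    return 1.0 / (s + (1.0 - s) / n_procs)

# a 10% serial fraction keeps the speedup below 10, however many processors are used
for n in (4, 16, 64, 256):
    print(n, round(amdahl_speedup(0.1, n), 2))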

Chapter 3

Basics of Reservoir Engineering

In this chapter we describe some very fundamental concepts of reservoir engineering, which are essential for the discussion of any reservoir application, and introduce some common terminologies used in reservoir engineering and in this thesis. We start with some formal definitions of reservoir engineering, then briefly describe the recovery processes, and conclude by describing the data sets used in data assimilation.

3.1 Definitions

Reservoir engineering grew out of the recognition that recovery of oil and gas from reservoirs could be made more predictable (and actually could be increased) if the reservoir were analyzed and well managed. Reservoir engineering started in the 1930s with some milestone research works concerning the measurement of fundamental rock properties, including porosity and permeability; the measurement of fluid properties with subsurface samples of oil and gas; and the energies driving oil and gas out of the reservoirs [31].

Petroleum reservoir management is the application of state-of-the-art technologies to a known reservoir system within a given management environment. Reservoir management can be thought of as the set of operations and decisions by which a reservoir is identified, measured, produced, developed, monitored and evaluated from its discovery through depletion and final abandonment [35].

Sound reservoir management practices rely on the utilization of available human, technological and financial resources to maximize profits from a reservoir by optimizing recovery while minimizing capital investments and operating expenses. "Reservoir management involves making certain choices: either let it happen, or make it happen" [27].

Though before 1970 reservoir engineering was strongly believed to be the major technical discipline in reservoir management, in the 1970s and 1980s the collaboration between geo-scientists and reservoir engineers proved to be very successful, and the value of a detailed reservoir description with geological, geophysical and reservoir simulation techniques was emphasized [30]. By reservoir simulation we mean the process of inferring the behaviour of a real reservoir from the performance of a model of that reservoir. The model may be physical, such as a scaled laboratory model, or mathematical [26].

3.2 Recovery processes

Sandstones and limestones are two general sources of hydrocarbon. Oil is trapped in small microscopic pores (the empty spaces among the sand grains) in the rock. A reservoir is well described by its rock and fluid properties. Rock properties are generally static and are called parameters in common practice. Several properties of the porous rock are important for hydrocarbon extraction. The two most important properties are:

1. porosity: the fraction of the rock volume that can be occupied by fluids; its common symbol is φ and it has no unit.

2. permeability: the ability of the rock to transmit fluids; its common symbol is k and its unit is the Darcy (1 Darcy = 10^{-12} m^2).

Fluid properties of a reservoir are generally dynamic and depend on the rock properties and the reservoir management. They are called state variables in common practice. The two most used state variables are:

1. pressure: the pressure of the fluids within the pores of a reservoir; the common symbol is P and the unit is the pascal, or newton per square meter (N/m^2).

2. saturation: the ratio of the volume that a fluid occupies to the pore volume, with symbols S_w, S_o, etc., meaning water saturation and hydrocarbon saturation respectively. For a two-phase flow consisting of only hydrocarbon and water, we have

S_w + S_o = 1.    (3.1)

Being a ratio of two volumes, saturation has no unit.

If the pores are connected such that the fluids can easily flow through these linked pore paths, the rock is called permeable. If hydrocarbons are present in the pores of a reservoir rock, they must be able to move out of them; hydrocarbons remain locked in place unless they can move from pore to pore. The operating company considers a reservoir suitable, economical and feasible to drill and produce if the rock is porous, permeable and, of course, contains enough hydrocarbons. The properties of the reservoir rock have a big influence on the ability of the fluids to flow through the reservoir and often determine the strategy used during hydrocarbon recovery. The hydrocarbon recovery process can in general be divided into three phases.

16 1. primary recovery: In this phase the pressure in the reservoir is often high enough to force hydrocarbon or gas to flow through the porous medium and eventually out of the well. Extraction of hydrocarbon causes a decrease of the pressure and subsequently a decrease of the recovery rates. At some point in time, when recovery rates become too small, the economic production can only be maintained after applying special techniques. Very often the primary recovery leaves 70-80% or more of hydrocarbons still in the reservoir.

2. secondary recovery: It consists of injecting gas or water into the reservoir to increase the pressure and to push the remaining hydrocarbon out of the pores. Usually injection occurs through wells that are located at some distance from the production wells. The injected fluid should then displace some of the hydrocarbon and push it towards the production wells. The secondary recovery phase increases the recovery factor, but still 50% or more of the hydrocarbons remain in the reservoir.

3. tertiary recovery or enhanced hydrocarbon recovery (EOR): In this phase the fluid properties are altered to ease the flow. Two tactics are mainly used. In thermal recovery, steam is injected into the reservoir to heat the hydrocarbon and increase its fluidity. In chemical recovery, several different chemicals like polymers or surfactants are used to reduce the oil viscosity, or increase the viscosity of the injected water, or lower the capillary pressure that stops oil droplets from flowing through the reservoir.

All these difficulties are a result of the spatial heterogeneity of the reservoir rock properties. The existence of preferential paths, through which the injected fluid moves toward the production well, is very common. Hydrocarbons located outside such a path are not influenced by the injection of fluids. This causes the production of injected fluid instead of hydrocarbons at an early stage. Having knowledge of the spatial distribution of the rock properties, one would be able to design the production strategy to postpone water breakthrough in the wells and maximize the recovery. However, the spatial heterogeneity and the lack of direct measurements of rock properties, which are only known at well locations, introduce a lot of uncertainties that need to be addressed if reliable future predictions of reservoir performance are to be expected.

After the exploration phase, in which potential reservoirs are identified and exploration wells are drilled, initial geological models are created based on the knowledge gathered from seismic surveys and well data. Initial predictions of the future reservoir performance are made, and if those predictions are financially profitable the reservoir enters a field development phase. When developing a field, the main target is to maximize an economic criterion, most often in terms of oil revenues. Choices are made about the number and locations of wells, the surface facilities that need to be built and the required infrastructure.

A detailed geological model of a given reservoir is derived based on all available information, and an 'upscaled', simpler version is used for flow simulation in practice. This numerical reservoir model should ideally mimic all the processes occurring in the reservoir. If the numerical model adequately described the real reservoir, it would be possible to predict the reservoir behavior properly and plan the best strategy to maximize the recovery from a given field. Unfortunately, a numerical reservoir model is only a crude approximation of the truth, mainly for two reasons. First, not all the processes occurring in a real reservoir can be modelled in an appropriate way. Very often some simplifications are imposed on the model to make the problem easier to tackle. Second, there is usually a large uncertainty in the parameter values of the simulation model. Many rock properties that influence reservoir flow are poorly known, while there are also uncertainties in the fluid properties and the amount of hydrocarbons present in the reservoir. The uncertainties involve the reservoir structure, the initial fluid contacts, and the values of permeabilities, porosities, fault transmissibilities, etc. These reservoir-related parameters are presumed to be known in numerical simulations. However, neglecting the uncertainties leads to contradictions between the results produced by numerical reservoir models and the data gathered from the real field. It is then difficult to make decisions based only on the output of a numerical model. Therefore, the measured data together with numerical simulations should be used in reservoir management for improving the production rates and increasing the recovery from a field.

The new concept of closed-loop model-based reservoir management, depicted in figure 3.1, seems to be a proper framework for reservoir monitoring and management [19]. It involves the use of (uncertain) reservoir and production models and combines them with available data from a real field, such as production history data or time-lapse seismic data, to continuously update the numerical models and reduce the uncertainties. The system at the top of the figure represents reality: the reservoir with wells and other facilities. This system is monitored with the help of various kinds of sensors, which give knowledge about the state of the reservoir, such as bottom hole pressures (BHP) and oil and water flow rates, in the form of measured outputs. However, the information coming from these measurements is not perfect. Also the inputs into the system can be uncertain and subject to noise. When production begins and data become available, it is possible to update the numerical reservoir and sensor models that simulate the behavior of the real reservoir. The gathered data are used to reduce the uncertainties of the parameters through identification and update in the data assimilation loop. The updated reservoir models are then used in an optimization loop where a new optimized production strategy can be determined. In the figure, we see that data assimilation is a vital part of the reservoir management work loop. Our goal is to enhance the performance of the data assimilation process by utilizing its inherent parallelism.

Figure 3.1: Reservoir management depicted as a closed loop model-based controlled process [19]

3.3 Different data sets

Data available from different sources can be used in the data assimilation loop to improve the characterization of the reservoir and the reliability of the flow predictions. Generally, data can be divided into two classes:

1. sparse data: available only at well locations

2. dense data: gathered everywhere in the reservoir

These data sets differ in accuracy and field coverage, but both contain valuable information about the reservoir and the parameters influencing the flow, and should be used together to obtain a better reservoir model.

3.3.1 Production history data

Usually production history data, obtained from wells in the form of bottom hole pressures and flow rates, are used in history matching algorithms to update the uncertain parameters. This type of data is typically acquired with an accuracy between 5% and 20%. However, because the number of model parameters to be estimated is very large, production history data has a limited resolving power. It provides some information on the unknown properties in the neighborhood of the wells, but not further away from them.

3.3.2 Time-lapse seismic data

Due to the developments in geophysics, especially in the field of seismics, it is possible not only to determine the position of the reservoir, but also to track the fluid movements in the reservoir itself. This additional information in the form of time-lapse seismic data can be utilized, together with production data, to narrow the solution space when minimizing the mismatch between the gathered measurements and their forecasts from the numerical model [28]. Time-lapse seismic is the process of carrying out a seismic survey before production begins and then repeating surveys over the producing reservoir. Seismic data are sensitive to static properties like lithology and pore volume, and also to dynamic (i.e. time varying) properties like fluid saturation and pore pressure. Sensors yield seismic data in the form of travel time information and amplitudes. These seismic attributes undergo an inversion process, as in [32], in which they are translated into time-lapse changes in density, acoustic propagation velocity and shear modulus. Time-lapse seismic data are available everywhere in the reservoir. Although less accurate than production data, they contain information about the reservoir properties everywhere and can be used to infer parameter values away from wells.

3.3.3 Prior knowledge

History matching is an ill-posed problem, i.e. many different parameter sets may result in the same measurement predictions. In petroleum engineering, due to the complexity and high dimensionality of the models, and due to the lack of direct measurements of the unknown parameters, the prior knowledge is usually described by an ensemble of models that all conform to the geological information. Most of the available data assimilation schemes take into account only the mean value and the covariance derived from the ensemble, indirectly assuming a Gaussian distribution for the unknown parameters.

Chapter 4

Data Assimilation with EnKF

In this chapter we go through the foundations of the data assimilation concept. We start by reviewing the Kalman filter, then present the ensemble Kalman filter (EnKF), and finish the chapter with the practical implementation details of the EnKF. For any dynamic system, we have two broad sources of information:

1. measurements of the system or observations

2. understanding of the temporal and spatial evolution of the system or models

Mathematics provides a foundation to address questions such as: "What combination of the observation and model information is optimal?", and provides an estimate of the errors of the "optimum" or "best" estimate [22]. This is known as "data assimilation". Consider a dynamical model along with initial and boundary conditions, and a set of measurements that can be related to the model state. The state estimation problem is then defined as finding the estimate of the model state that best fits the model equations, the initial and boundary conditions, and the observed data in some weighted measure. The problem may become overdetermined, and consequently no general solution exists, unless we relax the equations and allow some or all of the dynamical model, the conditions, and the measurements to contain errors. We often assume Gaussian distributions for the error terms. We also assume that errors in the measurements are uncorrelated with errors in the dynamical model. The problem is then formulated as a quadratic cost function whose minimum defines the best estimate of the state. Figure 4.1 shows the relation among the participants of a data assimilation process.

Figure 4.1: Schematic of how data assimilation (DA) works and adds value to observational and model information. The data shown are various representations of ozone data of a particular day [22]

4.1 Review of the Kalman Filter

4.1.1 Variance minimizing analysis scheme

The Kalman Filter (KF), a variance minimizing algorithm, updates the state estimate whenever measurements are available. Let us assume that Gaussian noise is present both in the model prediction and in the measured data, and that no correlation is present between these noises. Let a vector of variables stored in ψ(x, t) be defined on some spatial domain D with spatial coordinate x. When ψ(x, t) is discretized on a numerical grid representing the spatial model domain, it can be represented by the state vector ψ_k at each time instant t_k. Let new measurements be available at time t_k. Then we define the new estimate for the state vector as

ψ^a_k = ψ^f_k + K_k (d_k − M_k ψ^f_k),    (4.1)

where ψ^a_k and ψ^f_k are the analysed and forecast estimates respectively, d_k is the vector of measurements, M_k is the measurement operator mapping the model state ψ_k to the measurements d_k, and K_k is the weighting factor or Kalman gain. The covariance of the analysed state is then given by

(C_{ψψ})^a_k = (1 − K_k M_k)^2 (C_{ψψ})^f_k + K_k^2 (C_{εε})_k,    (4.2)

where (C_{ψψ})^f_k is the error covariance of the predicted model state and (C_{εε})_k is the measurement error covariance matrix. An optimal value of K_k can be found by choosing K_k such that the covariance of ψ^a_k is minimal. The solution is given by

K_k = (C_{ψψ})^f_k M_k^T (M_k (C_{ψψ})^f_k M_k^T + (C_{εε})_k)^{−1},    (4.3)

and the corresponding covariance is given by

(C_{ψψ})^a_k = (I − K_k M_k)(C_{ψψ})^f_k.    (4.4)

4.1.2 Kalman filter

We assume that the true state ψ^t evolves in time according to the dynamical model

ψ^t_k = F ψ^t_{k−1} + q_{k−1},    (4.5)

where F is a linear model operator, q_{k−1} is the unknown model error over one time step from k − 1 to k, and the superscript t stands for true. The numerical model then evolves according to

ψ^f_k = F ψ^a_{k−1},    (4.6)

where the superscripts a and f denote analysis and forecast respectively. That is, given the best possible estimate (traditionally called the analysis) for ψ at time t_{k−1}, a forecast is calculated at time t_k using the approximate equation (4.6). The error covariance equation is derived by subtracting (4.6) from (4.5), squaring the result and taking the expectation, which gives

C^f_{ψψ}(t_k) = F C^a_{ψψ}(t_{k−1}) F^T + C_{qq}(t_{k−1}),    (4.7)

where the error covariance matrices for the predicted and analysed estimates are defined as

C^f_{ψψ} = \overline{(ψ^f − ψ^t)(ψ^f − ψ^t)^T},    (4.8)

C^a_{ψψ} = \overline{(ψ^a − ψ^t)(ψ^a − ψ^t)^T}.    (4.9)

Here the overline is an expectation operator, which is equivalent to averaging over an ensemble of infinite size. Figure 4.2 depicts the working principle of data assimilation with the Kalman filter.
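As a compact illustration of the cycle in figure 4.2, the following NumPy sketch performs one forecast step, (4.6)-(4.7), followed by one measurement update, (4.3), (4.1) and (4.4). It is a sketch under the assumption that F, M and the covariance matrices are available as dense arrays; the variable names are ours.

import numpy as np

def kalman_step(psi_a, C_a, F, C_q, M, C_eps, d):
    """One Kalman filter cycle: time update (4.6)-(4.7),
    then measurement update (4.3), (4.1), (4.4)."""
    # forecast
    psi_f = F @ psi_a                                        # (4.6)
    C_f = F @ C_a @ F.T + C_q                                # (4.7)
    # analysis
    K = C_f @ M.T @ np.linalg.inv(M @ C_f @ M.T + C_eps)     # (4.3)
    psi_new = psi_f + K @ (d - M @ psi_f)                    # (4.1)
    C_new = (np.eye(len(psi_f)) - K @ M) @ C_f               # (4.4)
    return psi_new, C_new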

4.2 Ensemble Kalman filter

While the classical Kalman filter provides a complete and rigorous solution for the state estimation of linear systems under Gaussian noise, the estimation problem for non-linear systems remains difficult. Rigorous solutions to the non-linear problem are either too narrow in applicability or computationally expensive [14]. The ensemble Kalman filter is a Monte Carlo [34] approach to approximating the Kalman filter that overcomes the above mentioned limitations.



Figure 4.2: The ongoing discrete Kalman filter cycle. The time update projects the current state estimate ahead in time. The measurement update adjusts the projected estimate by an actual measurement at that time.

4.2.1 Representation of error statistics

The error covariance matrices C^f_{ψψ} and C^a_{ψψ} for the predicted and analysed estimates in the Kalman filter are defined in terms of the true state in (4.8) and (4.9). However, since the true state is never known, we redefine the ensemble covariance matrices in terms of the ensemble mean \overline{ψ} instead of the true state, according to

(C^e_{ψψ})^f = \overline{(ψ^f − \overline{ψ^f})(ψ^f − \overline{ψ^f})^T},    (4.10)

(C^e_{ψψ})^a = \overline{(ψ^a − \overline{ψ^a})(ψ^a − \overline{ψ^a})^T},    (4.11)

where now the overline denotes an average over the ensemble. Thus, we can use the interpretation that the ensemble mean is the best estimate and that the spread of the ensemble around the mean is a natural definition of the error in the ensemble mean.

4.2.2 Analysis scheme

We now derive the update scheme of the KF using the ensemble covariances as defined by (4.10) and (4.11), dropping the time index k in the following equations for convenience. It is essential that the observations be treated as random variables having a distribution with mean equal to the observed value and covariance equal to C_{εε} [8]. We start by defining an ensemble of observations

d_j = d + ε_j,  j = 1 ... N,    (4.12)

where N is the number of ensemble members. Next, the ensemble covariance matrix of the measurement errors is defined as

C^e_{εε} = \overline{ε ε^T}.    (4.13)

This matrix converges to the actual error covariance matrix C_{εε} used in the Kalman filter when we have an infinite number of realizations. In the analysis step of the EnKF, updates are performed on each of the ensemble members, and are given by

ψ^a_j = ψ^f_j + (C^e_{ψψ})^f M^T (M (C^e_{ψψ})^f M^T + C^e_{εε})^{−1} (d_j − M ψ^f_j).    (4.14)

We should note that with a finite ensemble size, the use of the ensemble covariances is an approximation of the true covariances. Furthermore, the matrices M (C^e_{ψψ})^f M^T and C^e_{εε} are singular when the number of measurements is larger than the number of ensemble members, and we must then use pseudo inversion instead of inversion. Equation (4.14) implies that

\overline{ψ^a} = \overline{ψ^f} + (C^e_{ψψ})^f M^T (M (C^e_{ψψ})^f M^T + C^e_{εε})^{−1} (\overline{d} − M \overline{ψ^f}),    (4.15)

where \overline{d} = d, since we have a zero mean ensemble of measurement perturbations. Thus, we have the same relation between the analysed and predicted ensemble means as we have between the analysed and predicted states in the standard Kalman filter, apart from using (C^e_{ψψ})^{f,a} and C^e_{εε} instead of C^{f,a}_{ψψ} and C_{εε} respectively. We should also note that the introduction of an ensemble of observations does not affect the update of the ensemble mean.

We now show that, by updating each of the ensemble members using the perturbed observations, we can create a new ensemble with the correct error statistics. We derive the analysed error covariance estimate resulting from the above analysis scheme, while keeping the standard Kalman filter form for the analysis equations. First, using (4.14) and (4.15) we obtain

ψ^a_j − \overline{ψ^a} = (I − K_e M)(ψ^f_j − \overline{ψ^f}) + K_e (d_j − \overline{d}),    (4.16)

where the Kalman gain is

K_e = (C^e_{ψψ})^f M^T (M (C^e_{ψψ})^f M^T + C^e_{εε})^{−1}.    (4.17)

Using equation (4.11), we derive the error covariance update as

(C^e_{ψψ})^a = (I − K_e M)(C^e_{ψψ})^f.    (4.18)

Figure 4.3: The procedure of data assimilation with EnKF for the ensemble member j.

For this derivation, we had to assume that the model uncertainties and the measurement uncertainties are independent. However, using a finite ensemble size and neglecting the cross term introduces sampling errors. When the matrix to be inverted in (4.17) is singular, the inverse can be replaced with the pseudo inverse, and we can write the Kalman gain as

K_e = (C^e_{ψψ})^f M^T (M (C^e_{ψψ})^f M^T + C^e_{εε})^+.    (4.19)

When the matrix in the inversion is of full rank, (4.19) becomes identical to (4.17). Figure 4.3 depicts the working principle for a single ensemble member in the EnKF.

4.3 Practical implementation

4.3.1 Ensemble representation of the covariance

We define the matrix A ∈ R^{n×N}, whose columns are the ensemble members ψ_i ∈ R^n, by

A = (ψ_1, ψ_2, ..., ψ_N) ∈ R^{n×N},    (4.20)

where N is the number of realizations and n is the size of the model state vector. The ensemble mean is stored in each column of \overline{A}, which is defined as

\overline{A} = A 1_N,    (4.21)

where 1_N ∈ R^{N×N} is the matrix whose entries all equal 1/N. The ensemble perturbation matrix is defined as

A' = A − \overline{A} = A(I − 1_N).    (4.22)

The ensemble covariance matrix C^e_{ψψ} ∈ R^{n×n} can then be defined as

C^e_{ψψ} = \frac{1}{N−1} A'(A')^T.    (4.23)
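A direct NumPy transcription of (4.20)-(4.23), for illustration only (the function and variable names are ours):

import numpy as np

def ensemble_statistics(A):
    """A is n x N: one ensemble member per column, as in (4.20)."""
    n, N = A.shape
    ones_N = np.full((N, N), 1.0 / N)        # the matrix 1_N of (4.21)
    A_mean = A @ ones_N                      # ensemble mean in every column
    A_prime = A - A_mean                     # perturbations, (4.22)
    C_e = A_prime @ A_prime.T / (N - 1)      # ensemble covariance, (4.23)
    return A_mean, A_prime, C_e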

4.3.2 Measurement perturbations

Given a vector of measurements d ∈ R^m, where m is the number of measurements, we define the N vectors of perturbed observations as

d_j = d + ε_j,  j = 1 ... N,    (4.24)

which are stored in the columns of the matrix

D = (d_1, d_2, ..., d_N) ∈ R^{m×N},    (4.25)

while the zero mean ensemble of perturbations is stored in the matrix

E = (ε_1, ε_2, ..., ε_N) ∈ R^{m×N},    (4.26)

from which we construct the ensemble representation of the measurement error covariance matrix

C^e_{εε} = \frac{1}{N−1} E E^T.    (4.27)

4.3.3 Analysis equation

The analysis equation (4.14), expressed in terms of the ensemble matrices, is

A^a = A + C^e_{ψψ} M^T (M C^e_{ψψ} M^T + C^e_{εε})^{−1} (D − MA),    (4.28)

where the measurement matrix is M ∈ R^{m×n}. We define the ensemble of innovation vectors as

D' = D − MA,    (4.29)

where D' ∈ R^{m×N}. Using D', along with the definitions of the ensemble error covariance matrices in (4.23) and (4.27), the analysis can be expressed as

A^a = A + A'(A')^T M^T (M A'(A')^T M^T + E E^T)^{−1} D',    (4.30)

where all references to the error covariance matrices are eliminated.

We now define the matrix S ∈ R^{m×N}, holding the measurements of the ensemble perturbations, by

S = M A',    (4.31)

and the matrix C ∈ R^{m×m},

C = S S^T + (N − 1) C_{εε}.    (4.32)

Here we can use the full-rank, exact measurement error covariance matrix C_{εε} or the low-rank representation C^e_{εε} defined in (4.27). The analysis equation (4.30) can then be written as

A^a = A + A' S^T C^{−1} D' = A + A(I − 1_N) S^T C^{−1} D' = A(I + (I − 1_N) S^T C^{−1} D') = AX,    (4.33)

where we used the identity in (4.22). The matrix X ∈ R^{N×N} is defined as

X = I + (I − 1_N) S^T C^{−1} D'.    (4.34)

Based on the assumption 1_N S^T ≡ 0, (4.33) can also be written as

A^a = A(I + S^T C^{−1} D') = AX.    (4.35)

Then (4.34) becomes

X = I + S^T C^{−1} D'.    (4.36)

When the number of measurements is larger than the number of realizations, the matrix C is singular, and instead of the inverse we should use the pseudo-inverse of C in the above expressions. One popular way to compute the pseudo-inverse of a matrix C is by the SVD, using the formulation

C^+ = V Σ^+ U^T,    (4.37)

where U and V are the matrices of left and right singular vectors of C respectively, Σ is the diagonal matrix containing the singular values of C in descending order, and Σ^+ is defined as

Σ^+_{ij} = 1/Σ_{ij}  if i = j and Σ_{ij} ≠ 0,  and  Σ^+_{ij} = 0  otherwise.    (4.38)

Another way to write (4.32) is

C = S S^T + E E^T    (4.39)
  = (S + E)(S + E)^T,    (4.40)

assuming that the ensemble perturbations and the measurement perturbations are uncorrelated, that is,

S E^T = M A' E^T = 0.    (4.41)

If we compute the SVD

S + E = U Σ V^T,    (4.42)

then (4.30) becomes

A^a = A + A' S^T U Λ^{−1} U^T D',    (4.43)

where

Λ = Σ Σ^T.    (4.44)

This is another approach for calculating the analysis that avoids the inversion of a large matrix when m ≫ N, based on the above mentioned assumption. There are two major sources of sampling errors in the EnKF: the use of a finite ensemble of stochastic model realizations and the introduction of stochastic measurement perturbations [12], [13].
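The following NumPy sketch follows this second route, (4.42)-(4.44): instead of forming and pseudo-inverting C explicitly, the SVD of S + E is used. It is an illustration under the assumption (4.41); the names are ours and zero singular values are treated by a pseudo-inverse.

import numpy as np

def analysis_via_svd(A, A_prime, S, E, D_prime):
    """EnKF analysis using the SVD of S + E, equations (4.42)-(4.44)."""
    U, sigma, _ = np.linalg.svd(S + E, full_matrices=False)   # (4.42)
    lam_inv = np.where(sigma > 0, 1.0 / sigma**2, 0.0)        # pseudo-inverse of Lambda = Sigma Sigma^T
    X1 = lam_inv[:, None] * (U.T @ D_prime)                   # Lambda^-1 U^T D'
    return A + A_prime @ (S.T @ (U @ X1))                     # (4.43)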

4.3.4 EnKF for combined parameter and state estimation

When using the EnKF to estimate poorly known model parameters, we start by representing the probability density functions (pdf) of the parameters by an ensemble of realizations, which is augmented to the state ensemble matrix $A$ at the update steps. The poorly known parameters are then updated using the variance-minimizing analysis scheme, where the covariances between the predicted data and the parameters are used to update the parameters.

Algorithm 1, based on (4.33) and (4.34), summarizes the above discussion of the working procedure of a single step of an ensemble Kalman filter. For quick reference, we recapitulate the dimensions of the matrices used in the algorithm: $A_1, A_2, \ldots, A_N \in \mathbb{R}^n$; $A, A^a, A' \in \mathbb{R}^{n \times N}$; $\mathbf{1}_N, X_2, X_3, X_4 \in \mathbb{R}^{N \times N}$; $D, D', E, S, X_1 \in \mathbb{R}^{m \times N}$; $d, d_j, \epsilon_j \in \mathbb{R}^m$; $M \in \mathbb{R}^{m \times n}$; $C, C^+, U, \Sigma, V \in \mathbb{R}^{m \times m}$.

Algorithm 1 A single step of basic EnKF

 1: procedure ENKF                                                 Cost
 2:   Run simulators to generate state vectors A1, A2, ..., AN
 3:   Construct A = [A1 A2 ... AN]
 4:   1_N = (1/N) ones(N, N)
 5:   A' = A(I - 1_N)                                              O(nN^2)
 6:   S = M A'                                                     O(mnN)
 7:   Load measurement d
 8:   Generate Gaussian noise ε_j, for j = 1, 2, ..., N
 9:   Compute d_j = d + ε_j, for j = 1, 2, ..., N
10:   Construct D = [d1 d2 ... dN]
11:   D' = D - M A                                                 O(mnN)
12:   Construct E = [ε1 ε2 ... εN]
13:   C = S S^T + E E^T                                            O(m^2 N)
14:   [U, Σ, V] = svd(C)                                           O(m^3)
15:   C^+ = V Σ^+ U^T                                              O(m^3)
16:   X1 = C^+ D'                                                  O(m^2 N)
17:   X2 = S^T X1                                                  O(m N^2)
18:   X3 = (I - 1_N) X2                                            O(N^3)
19:   X4 = I + X3                                                  O(N)
20:   A^a = A X4                                                   O(n N^2)
21:   return A^a
22: end procedure
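Steps 16-20 of Algorithm 1 are a chain of dense matrix products. The sketch below is a serial illustration of that chain, assuming CBLAS with column-major storage (the thesis implementation performs the same products with the ScaLAPACK routine pdgemm); the work arrays X1, X2 and X4 are allocated by the caller.

#include <cblas.h>

void enkf_update(int n, int m, int N,
                 const double *A,      /* n x N  forecast ensemble         */
                 const double *S,      /* m x N  S = M A'                  */
                 const double *Cplus,  /* m x m  pseudo-inverse of C       */
                 const double *Dprime, /* m x N  innovations D' = D - M A  */
                 const double *ImN,    /* N x N  (I - 1_N)                 */
                 double *X1, double *X2, double *X4,
                 double *Aa)           /* n x N  analysed ensemble         */
{
    /* X1 = C^+ D'                              O(m^2 N) */
    cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                m, N, m, 1.0, Cplus, m, Dprime, m, 0.0, X1, m);
    /* X2 = S^T X1                              O(m N^2) */
    cblas_dgemm(CblasColMajor, CblasTrans, CblasNoTrans,
                N, N, m, 1.0, S, m, X1, m, 0.0, X2, N);
    /* X4 = I + (I - 1_N) X2                    O(N^3)   */
    for (int i = 0; i < N * N; i++) X4[i] = 0.0;
    for (int i = 0; i < N; i++) X4[i * N + i] = 1.0;
    cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                N, N, N, 1.0, ImN, N, X2, N, 1.0, X4, N);
    /* A^a = A X4                               O(n N^2) */
    cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                n, N, N, 1.0, A, n, X4, N, 0.0, Aa, n);
}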

Algorithm 2 is a very primitive translation of Algorithm 1 to a parallel version. We call this algorithm 'very primitive' because it simply states the order in which the statements are executed in parallel, without any consideration of optimization. We see that several chunks of communication are interleaved with the computation, that some of these chunks transfer large volumes of data, and that much serial work is done by the root, indicating that this is certainly an inefficient parallel algorithm. The several levels of optimization which make this algorithm efficient are described in detail in chapter 6.

Algorithm 2 A single step of basic parallel EnKF

 1: procedure ENKF                                                 Cost
 2:   Run simulator to generate state vector Ai
 3:   (root) Collect states A1, A2, ..., AN
 4:   (root) Construct A = [A1 A2 ... AN]
 5:   Communicate A                                                O(nN)
 6:   (root) 1_N = (1/N) ones(N, N)
 7:   Communicate 1_N                                              O(N^2)
 8:   A' = A(I - 1_N)                                              O(nN^2)
 9:   Communicate A'                                               O(nN)
10:   S = M A'                                                     O(mnN)
11:   Communicate S                                                O(mN)
12:   Load measurement d
13:   (root) Generate Gaussian noise ε_j, for j = 1, 2, ..., N
14:   (root) Construct E = [ε1 ε2 ... εN]
15:   Communicate E                                                O(mN)
16:   (root) Compute d_j = d + ε_j, for j = 1, 2, ..., N
17:   (root) Construct D = [d1 d2 ... dN]
18:   D' = D - M A                                                 O(mnN)
19:   Communicate D'                                               O(mN)
20:   C = S S^T + E E^T                                            O(m^2 N)
21:   Communicate C                                                O(m^2)
22:   [U, Σ, V] = svd(C)                                           O(m^3)
23:   Communicate U, V, and Σ                                      O(m^2)
24:   C^+ = V Σ^+ U^T                                              O(m^3)
25:   Communicate C^+                                              O(m^2)
26:   X1 = C^+ D'                                                  O(m^2 N)
27:   Communicate X1                                               O(mN)
28:   X2 = S^T X1                                                  O(m N^2)
29:   Communicate X2                                               O(N^2)
30:   X3 = (I - 1_N) X2                                            O(N^3)
31:   Communicate X3                                               O(N^2)
32:   X4 = I + X3                                                  O(N)
33:   Communicate X4                                               O(N^2)
34:   A^a = A X4                                                   O(n N^2)
35:   (root) Distribute A^a for further use by the simulators      O(nN)
36: end procedure

Chapter 5

Parallel Computing with MPI and ScaLAPACK

In this chapter we briefly introduce parallel computing and some related common technologies. First we go through the basic parallel architecture concepts. Then we describe the Message Passing Interface (MPI), a key software technology for parallel computing on distributed memory architectures. Finally we list some parallel numerical computing libraries, with a special and detailed description of ScaLAPACK.

5.1 Parallel computing

Limits on frequency scaling due to physical constraints, together with the power consumption and heat generation of computers, have become key concerns in recent years. As a result, parallel computing has become the dominant paradigm in computer architecture. In parallel computing, many computations are carried out simultaneously, based on the principle that large problems can often be divided into smaller ones which are then solved concurrently. Some broad forms of parallelism in computing are:

1. Bit level: Doubling the computer word size can halve the number of instructions required to carry out operations on operands of twice the word size. This was a common source of speedup until 1986. At present a similar approach is taken in vector processing, where a single processor register is large enough to hold multiple scalars and a single instruction operates on multiple scalars (hence the term vector), significantly speeding up single-processor operations.

2. Instruction level: To carry out a single operation, an instruction goes through several stages such as instruction fetch (IF), instruction decode (ID), execute (EX), memory access (MEM) and register write back (WB). These stages are handled by different hardware units in the processor, and a sequence of instructions can be pipelined to enhance the overall throughput of a single processor.

Figure 5.1: Shared memory architecture

3. Data level: This type of parallelism is inherent in program loops and focuses on distributing the data across different processing elements to be processed in parallel. Data dependency is a typical challenge for data level parallelism. This is a very conventional form of parallelism from the point of view of multiple processors.

4. Task level: For some applications, a parallel program can perform different calculations on either the same or different sets of data. This is another conventional form of parallelism for multiple processors.

On the other hand, the following are the major categories of parallel computing from a hardware point of view:

1. Shared memory multiprocessing or symmetric multiprocessing (SMP): Here multiple identical processors share the same memory (at least the same logical address space) and are connected via a bus. The number of processors is limited in order to ensure sufficient memory bandwidth, and hence scaling is limited.

2. Distributed memory multiprocessing: Here the processing elements are connected via a communication network. Performance depends strongly on the network type and on the volume as well as the number of data communications required. There is no hard limit on the number of processors, so it is highly scalable. It is the most widely used architecture for parallel computing.

Figures 5.1 and 5.2 depict the basic differences between these two hardware architectures.


Figure 5.2: Distributed memory architecture: MPI’s work place

5.2 MPI

The Message Passing Interface (MPI), a standardized and portable message passing system, defines the syntax and semantics of a core of library routines for writing portable message-passing programs in the Fortran or C programming languages. The initial implementation of the MPI 1.x standard was MPICH; LAM/MPI was another early open implementation. LAM/MPI and a number of other MPI efforts later merged to form Open MPI, while commercial organizations such as IBM, HP, Intel, SGI and Microsoft have their own implementations of MPI. Although MPI is a complex and multifaceted system with more than 500 subroutines, typically 10 to 30 are useful, and a wide range of problems can be solved using just six of its functions: MPI_Init, MPI_Finalize, MPI_Comm_size, MPI_Comm_rank, MPI_Send and MPI_Recv [18]. The tasks carried out during a program's execution can be divided into four categories; a minimal example using these routines is sketched after the list.

1. Start-up phase: launches the tasks and establishes the communication context among all tasks. The routines we need for our application are:

• MPI_Init: initiates an MPI computation.
• MPI_Comm_size: determines the number of processes.
• MPI_Comm_rank: determines the process identifier.

2. Point-to-point data transfer: usually takes place between pairs of processes, usually in a coordinated fashion. The transfer may be blocking or non-blocking; explicit synchronization is needed for non-blocking transfers. Here we list the routines which we require for our application:

• MPI_Send: sends a message to a particular destination. This routine is blocking, in that it returns only after the transfer is completed.
• MPI_Recv: receives a message from a particular source. The routine, a blocking one, returns after the transfer is completed.
• MPI_Isend: sends a message to a particular destination. The routine returns immediately after the call and does not guarantee completion of the transfer. We need a special mechanism to be sure that the data transfer has completed.
• MPI_Irecv: receives a message from a particular source. This non-blocking routine returns immediately after the call and does not guarantee completion of the data transfer. We need a special mechanism to be sure that the transfer has completed.
• MPI_Wait: the sender and the receiver wait for the completion of the non-blocking data transfer routines MPI_Isend and MPI_Irecv, respectively.

3. Collective communication: happens between all tasks or a subgroup of tasks, and only in a blocking fashion. Broadcasts, reductions and scatter/gather operations are some examples. When appropriate, they are preferred as efficient library calls. We need the following five routines for our application:

• MPI_Gather: all the processes send a fixed amount of data to the root, and the root arranges the received data sequentially.
• MPI_Gatherv: each process sends a different amount of data to the root, and the root arranges the received data sequentially.
• MPI_Scatter: the root process evenly distributes data to all the processes.
• MPI_Scatterv: the root process unevenly distributes data to all the processes.
• MPI_Barrier: blocks until all processes in the communicator have reached this routine.

4. Clean-up phase: reverses the tasks done in the start-up phase. This is done with the simplest function, MPI_Finalize, which takes no arguments.
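As a compact illustration of the start-up, collective-communication and clean-up categories above, the following self-contained C program gathers unevenly sized blocks on the root with MPI_Gatherv. It is only a toy sketch, not part of the thesis application; the per-process block sizes are an invented example.

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);                       /* start-up phase           */
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* every process contributes (rank + 1) values: uneven block sizes       */
    int nloc = rank + 1;
    double *local = malloc(nloc * sizeof(double));
    for (int i = 0; i < nloc; i++) local[i] = 100.0 * rank + i;

    /* the root needs the counts and displacements for MPI_Gatherv           */
    int *counts = NULL, *displs = NULL;
    double *global = NULL;
    if (rank == 0) {
        counts = malloc(size * sizeof(int));
        displs = malloc(size * sizeof(int));
        int total = 0;
        for (int p = 0; p < size; p++) { counts[p] = p + 1; displs[p] = total; total += p + 1; }
        global = malloc(total * sizeof(double));
    }

    MPI_Gatherv(local, nloc, MPI_DOUBLE,
                global, counts, displs, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    MPI_Barrier(MPI_COMM_WORLD);                  /* not required, shown for completeness */
    if (rank == 0) printf("root gathered %d blocks\n", size);

    free(local);
    if (rank == 0) { free(counts); free(displs); free(global); }
    MPI_Finalize();                               /* clean-up phase           */
    return 0;
}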

5.3 Parallel numerical libraries for SVD

For the parallelization of the EnKF analysis step we mainly depend on matrix-matrix multiplications and SVD computations. Matrix multiplication is less expensive than the SVD computation with respect to computation time. There are several parallel numerical libraries that handle these two problems, and there are differences among the algorithms implemented in these libraries. Some depend on iterative methods,

some use preconditioning, some use block algorithms. Among the eigenvalue algorithms, some compute only a partial set of eigenvalues/eigenvectors. Some are useful for special types of matrices (e.g. sparse), whereas the matrices in our application are dense. Most of the libraries in the following list have special syntaxes (e.g. data types, array and matrix structures) and handle data communication themselves. We choose ScaLAPACK because of its closeness to plain C and MPI syntax and because it gives us some grip on the data communication, so that we can use non-blocking communication, whereas the other libraries mostly use blocking communication internally. Moreover, ScaLAPACK uses the popular and easy-to-use LAPACK-style interfaces.

1. SLEPC: The Scalable Library for Eigenvalue Problem Computations is a specialized software library for the solution of large-scale sparse eigenvalue problems on parallel computers. It is an extension of PETSc and can be used for either standard or generalized eigenproblems. It can also be used for computing a partial SVD of a large, sparse, rectangular matrix, and to solve quadratic eigenvalue problems.

2. BLOPEX: The Block Locally Optimal Preconditioned Eigenvalue Xolvers is a suite of routines for the scalable (parallel) solution of eigenvalue problems.

3. Lis: The Library of Iterative Solvers for linear systems is a scalable paral- lel library for solving systems of linear equations and standard eigenvalue problems with real sparse matrices using iterative solvers.

4. ELPA: The Eigenvalue soLvers for Petaflop Applications is a Fortran-based high-performance computational library for the (massively) parallel solution of symmetric or Hermitian, standard or generalized eigenvalue problems.

5. The NAG C Library: a collection of numerical analysis routines developed by the Numerical Algorithms Group which can be called from user applications running on a wide variety of hardware platforms.

6. ParPACK: a portable parallel version of the ARPACK library. ARPACK is a collection of Fortran77 subroutines designed to solve large-scale eigenvalue problems. The package is designed to compute a few eigenvalues and corresponding eigenvectors of a general n by n matrix A.

5.3.1 ScaLAPACK

ScaLAPACK (Scalable Linear Algebra PACKage) is a library of high-performance linear algebra routines for distributed-memory message-passing MIMD (Multiple Instruction Multiple Data) computers and networks of workstations supporting PVM and/or MPI. It is provided by a consortium of universities and industry, consisting of the University of Tennessee; the University of California, Berkeley; the University of Colorado Denver; and NAG Ltd. ScaLAPACK solves dense and banded (but not sparse) linear systems, least squares problems, eigenvalue problems, and singular value problems. It can also handle many associated computations such as matrix factorizations or the estimation of condition numbers. The fundamental building blocks of the ScaLAPACK library are distributed-memory versions of the Level 1, Level 2, and Level 3 BLAS, called the Parallel BLAS (PBLAS), and a set of Basic Linear Algebra Communication Subprograms (BLACS) for communication tasks that arise frequently in parallel linear algebra computations. Figure 5.3 depicts the structure of ScaLAPACK.

Figure 5.3: ScaLAPACK Software Structure

ScaLAPACK includes the following key ideas [4]:

1. block cyclic data distribution (for dense matrices) and block data distribution (for banded matrices): the distribution information is passed as parameters at runtime;

2. block-partitioned algorithms: in order to minimize the frequency of data movement;

3. well-designed low-level modular components: to simplify the task of parallelizing the high-level routines by keeping their source code the same as in the sequential case.

The goals of ScaLAPACK are the same as those of LAPACK:

1. efficiency (to run as fast as possible),

2. scalability (as the problem size and number of processors grow),

3. reliability (including error bounds),

4. portability (across all important parallel machines),

5. flexibility (so users can construct new routines from well-designed parts),

6. ease of use (by making the interface to LAPACK and ScaLAPACK look as similar as possible).

In ScaLAPACK, generally there are three steps required to divide the computations among the parallel processors [7]. They are:

1. Initialize the library (using standard routines).

2. Set up a processor grid.

3. Map the data structure used (array, vector, etc.) onto the processor grid.

• Create an array descriptor vector for each global array.
• Map the global array elements in a 2D block-cyclic manner onto the processor grid.

After these steps are completed, each processor is responsible for the calculations on some subset of the data. This general approach is referred to by many names, such as data decomposition, domain decomposition, data distribution, and so on.

Process grid

ScaLAPACK uses a processor grid for its parallel work distribution. One of the member libraries under the ScaLAPACK umbrella is the BLACS (Basic Linear Algebra Communication Subprograms) library. In addition to array-operand communication calls, the BLACS library contains important routines for creating and examining the processor grid, which is expected and used by all ScaLAPACK routines. The processor grid is a two-dimensional array whose size and shape are controlled by the program. A processor is identified by its row and column number in the processor grid rather than by its traditional MPI rank [9]. The processor grid for eight processors arranged in a 2×4 grid with row-major ordering is

            column 0   column 1   column 2   column 3
   row 0       0          1          2          3
   row 1       4          5          6          7

The MPI rank for a processor is contained within each element of the grid. For example, the processor with MPI rank 6 has grid coordinates (1, 2).

39 0 1 2 3 0 1 2 3 0 1 2 3 0 1 2 3 0 1 2 3

Figure 5.4: Data distribution examples: column blocked (left) and column cyclic (right)

Data distribution

An important performance criterion for ScaLAPACK is the data distribution scheme. We said earlier that the ScaLAPACK data distribution method is block cyclic for dense matrices. To clarify this we show the differences among the column block, the column cyclic, the column block cyclic and the (2D) block cyclic distributions of a 16×16 array among 4 processors in figures 5.4 and 5.5. In the simple column block distribution each processor gets 4 consecutive columns and all 16 rows of the original array. In the column cyclic distribution each processor again gets 4 columns and 16 rows, but the columns are no longer consecutive. In the column block cyclic distribution each processor gets 4 columns and 16 rows, where every two columns are consecutive in the original matrix. In the (2D) block cyclic distribution, which is simultaneously a column block cyclic and a row block cyclic distribution, each processor gets 8 rows and 8 columns, where every two rows and every two columns are consecutive in the original matrix. The rules for block cyclic data distribution are:

1. Divide the global array into blocks with mb rows and nb columns.

2. Place the first row of array blocks across the first row of the processor grid in order. If you run out of processor grid columns, cycle back to the first column of the first row.

3. Repeat the previous step with the second row of array blocks, using the second row of the processor grid, cycling back if needed.

4. Continue for the remaining rows of array blocks.

5. If you run out of processor grid rows, cycle back to the first processor row and keep repeating the steps above.

40 0 1 0 1 0 1 0 1 2 3 2 3 2 3 2 3 0 1 0 1 0 1 0 1 2 3 2 3 2 3 2 3 0 1 2 3 0 1 2 3 0 1 0 1 0 1 0 1 2 3 2 3 2 3 2 3 0 1 0 1 0 1 0 1 2 3 2 3 2 3 2 3

Figure 5.5: Data distribution examples: column blocked cyclic (left) and 2D blocked cyclic (right)

The diagram in figure 5.6 illustrates a two-dimensional, block-cyclic distribution of a 9×9 global array with 2×2 array blocks over a 2×3 processor grid. (The colors represent the IDs of the six different processors.)
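The 2D block-cyclic mapping can be expressed in a few lines of integer arithmetic. The sketch below is an illustration (not ScaLAPACK code) that returns, for a global element (i, j), the coordinates of the owning process and the local indices inside that process, assuming zero-based indices and that the block containing element (0, 0) is mapped to process (0, 0).

/* Block-cyclic ownership of global element (i, j) for mb x nb blocks
 * on an nprow x npcol process grid (all indices zero-based). */
typedef struct { int prow, pcol, li, lj; } owner_t;

owner_t block_cyclic_owner(int i, int j, int mb, int nb, int nprow, int npcol)
{
    owner_t o;
    o.prow = (i / mb) % nprow;                     /* process row holding row i    */
    o.pcol = (j / nb) % npcol;                     /* process column holding col j */
    o.li   = ((i / mb) / nprow) * mb + i % mb;     /* local row index              */
    o.lj   = ((j / nb) / npcol) * nb + j % nb;     /* local column index           */
    return o;
}

For the 9×9 example of figure 5.6 (2×2 blocks on a 2×3 grid) this mapping places, for instance, global element (4, 5) on process (0, 2) at local position (2, 1), counting from zero.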

Useful routines

We need the following ScaLAPACK routines for our application (a small set-up sketch follows the list):

1. Cblacs_pinfo: It reports the rank of the calling process and the total number of processes available before the BLACS environment is set up.

2. Cblacs_get: It reports information about the BLACS state. It can be used to return the system default context needed to construct a first BLACS grid and its associated context. The default context corresponds to the MPI default communicator MPI_COMM_WORLD.

3. Cblacs_gridinit: The BLACS grid created is identified by a context handle that is subsequently used to identify that particular process grid among the many that may be generated. This routine takes the available processes and maps them onto a BLACS process grid. In other words, it establishes how the BLACS coordinate system maps onto the native machine's process numbering system.

4. Cblacs_gridinfo: It reports information about a BLACS context. It returns information on the current grid, in particular the coordinates of the calling process.

5. Cblacs_barrier: This routine holds up execution of all processes within the indicated scope until they have all called the routine.

Figure 5.6: An example view of data distribution in ScaLAPACK [3]

6. Cblacs_exit: This routine should be called when a process has finished all use of the BLACS. It frees all BLACS contexts and releases all memory the BLACS have allocated.

7. numroc: This function computes the local number of rows or columns of a block-cyclically distributed matrix owned by the calling process (for a process row or a process column, respectively).

8. descinit: It initializes the array descriptor vector from 8 input arguments: the numbers of rows and columns of the operand matrix, the row- and column-wise block sizes, the row and column indices of the process containing the first row and the first column of the matrix, the BLACS context handle, and the leading dimension of the local array.

9. indxl2g: It computes the global row or column index of a local element of a block-cyclically distributed matrix.

10. pdgesvd : This subroutine computes the singular values and, optionally, the left and/or right singular vectors of a real or complex general matrix.

11. pdgemm : It computes a scalar-matrix-matrix product and adds the result to a scalar-matrix product.
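A compressed sketch of the three set-up steps with these routines is given below. The C prototypes of the BLACS and ScaLAPACK calls vary between installations (some provide headers, others require manual extern declarations), so the declarations and exact argument lists shown here are assumptions to be checked against the installed library; only the sequence of calls follows the description above.

#include <stdlib.h>

/* Assumed prototypes; actual headers differ between ScaLAPACK installations. */
extern void Cblacs_pinfo(int *mypnum, int *nprocs);
extern void Cblacs_get(int ctxt, int what, int *val);
extern void Cblacs_gridinit(int *ctxt, char *order, int nprow, int npcol);
extern void Cblacs_gridinfo(int ctxt, int *nprow, int *npcol, int *myrow, int *mycol);
extern int  numroc_(int *n, int *nb, int *iproc, int *isrcproc, int *nprocs);
extern void descinit_(int *desc, int *m, int *n, int *mb, int *nb,
                      int *irsrc, int *icsrc, int *ictxt, int *lld, int *info);

/* Initialize a process grid and allocate the local part of an m x n matrix
 * distributed in mb x nb blocks; the descriptor is returned in desc[9]. */
double *setup_distributed_matrix(int m, int n, int mb, int nb,
                                 int nprow, int npcol, int desc[9], int *ctxt)
{
    int iam, nprocs, myrow, mycol, izero = 0, info;
    Cblacs_pinfo(&iam, &nprocs);                  /* who am I, how many of us  */
    Cblacs_get(-1, 0, ctxt);                      /* system default context    */
    Cblacs_gridinit(ctxt, "Row", nprow, npcol);   /* map processes to a grid   */
    Cblacs_gridinfo(*ctxt, &nprow, &npcol, &myrow, &mycol);

    /* local sizes of the block-cyclically distributed m x n matrix           */
    int mloc = numroc_(&m, &mb, &myrow, &izero, &nprow);
    int nloc = numroc_(&n, &nb, &mycol, &izero, &npcol);
    int lld  = (mloc > 1) ? mloc : 1;
    descinit_(desc, &m, &n, &mb, &nb, &izero, &izero, ctxt, &lld, &info);

    return malloc((size_t)mloc * nloc * sizeof(double));  /* caller fills and frees */
}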

Chapter 6

Implementation

In this chapter we describe the details of the development of our application. First we describe the oil reservoir model which we use in our application. Then we describe the various components of the application's software structure. We conclude by describing some optimization techniques adopted for better performance of the parallel application.

We stated in chapter 2 that only a few studies have been carried out on the parallelization of EnKF for oil reservoirs. Moreover, they work only with sparse measurement data, such as BHP and oil and water flow rates from the wells, and they parallelize only the embarrassingly parallel forecast step; as a result they obtain a limited speedup with a large number of processors and consequently show limited scaling. On the other hand, dense data such as saturation, calculated from time-lapse seismic in every grid block, though noisier than the direct measurement data from the wells, give a better estimate of the reservoir state when used in data assimilation. We incorporate time-lapse seismic data in a parallel implementation of EnKF for oil reservoirs for the first time. The inclusion of time-lapse seismic data makes the EnKF process more time consuming, as the analysis step becomes larger than the forecast step in terms of execution time. In addition, the analysis step is more complex than the forecast step, as it requires intermediate inter-process communication. It is challenging to maintain or improve the speedup of the previous research works with this added, more complex analysis step, which requires communication of large matrices. However, by intelligent use and runtime configuration of the ScaLAPACK routines we effectively minimize the required data communication and achieve very good speedup values. Even with a large number of processors, our implementation shows good scaling.

6.1 Reservoir model

We work with a rectangular reservoir field of length 1980 m, width 1020 m and height 5 m. We assume all fluid and rock properties of the reservoir to be constant along the height and hence we treat it as a two-dimensional reservoir. For computational purposes the field is logically divided into 5049 blocks in a 51×99 grid of square blocks of size 20 m on each side. We have two injection wells located at the coordinates (2, 2) and (50, 17) and two production wells located at the coordinates (17, 62) and (41, 98), as shown in figure 6.1. The black dots on the left side of the figure are the injection wells, whereas the red dots on the right side are the production wells.

Figure 6.1: Location of the wells in the model reservoir field

The typical state vector for an oil reservoir consists of the following variables:

1. pressure (p)

2. saturation (s)

3. bottom hole pressures in the wells (BHP )

4. oil flow rate in the wells (qo)

5. water flow rate in the wells (qw)

We consider observations of saturation (in the form of time-lapse seismic) in every block, and of bottom hole pressures and oil and water flow rates in the wells (in the form of production measurements). We assume the following standard deviations for the (Gaussian) measurement noise of the observations: 15% for saturation, 5% for BHP and oil flow rate, and 10% for water flow rate. We also want to estimate the log permeability ($\log k$). Our state vector then has the following form:

$$A = \begin{pmatrix} p \\ s \\ BHP \\ q_o \\ q_w \\ \log k \end{pmatrix}$$


Figure 6.2: Assumed constant porosity field for all ensemble members

We assume the constant porosity field of figure 6.2 for all ensemble members and do not consider it in the analysis process. Our measurement network excludes pressure values and permeability values. Hence, for the computation of S and D in the analysis steps described in chapter 4, we use the following portion (of length 5061) of the state vector:

$$A = \begin{pmatrix} s \\ BHP \\ q_o \\ q_w \end{pmatrix}$$

And since estimating the log permeability values ($\log k$) is the primary goal of the assimilation, while updating the other state variables is of minor importance (we neglect them), in the update statement in (4.33) we use only the following portion (of length 5049) of the state vector:

$$A = \begin{pmatrix} \log k \end{pmatrix}$$

6.2 Software structure

For the forecast or simulation step we use simsim (simple simulator), written in MATLAB and developed by Prof. Jan Dirk Jansen, TU Delft. For the parallel data assimilation we use MPI in the C language. To interface with simsim we developed a driver named simsimdriver, which contains all the data assimilation computations as well as the function calls for running the simulator in the forecast step. To keep the interface simple we developed two wrapper functions in MATLAB:

1. initSimsim: It initializes and prepares all the required resources, such as model, control and parameter information, on each processor. It also provides the assumed permeabilities of the ensemble.


Figure 6.3: Software structure of the developed parallel application

2. runSimsim: It runs the simulator with the updated permeability estimates for a defined time period and returns the seismic data for saturation as well as the measurement data of bottom hole pressure and oil and water flow rates measured in the wells.

These wrapper functions and the original simsim functions are compiled into a library, usable from C, with the MATLAB compiler tool mcc with appropriate settings. The driver program in C and the compiled library then need to be compiled into an executable with the MATLAB mbuild utility. Figure 6.3 depicts the relation and interaction among the different components of the parallel application in the simplest environment of 2 processors and 4 realizations, where each processor processes 2 realizations. The state vectors and the permeability data of each realization are represented with different colors. The figure shows the data and work flow sequence of the application. The vertical dotted line marks the division between the processors P0 and P1. The horizontal dotted line indicates the division of the execution platform between MATLAB and C on each processor. The MATLAB part of the application is handled by the MATLAB Compiler Runtime (MCR): the original simulator simsim and the wrapper functions initSimsim and runSimsim are executed by the MCR. We also observe that the simulation part on each processor is independent of the others, whereas the analysis part needs to be carried out in a coordinated fashion. Note also that the data (state vectors A0 or A1) returned by one processor after the simulation step need to be distributed among all the processors for the analysis step, and that the data (updated permeabilities K0 or K1) fed to the simulators on one processor need to be collected from all the processors. The reason is explained in the next section. Steps 4 to 7 in the figure represent a single cycle of a single assimilation step.

6.3 Optimization

We already mentioned that we have two parts to parallelize:

1. Forecast or simulation: This part is embarrassingly parallel. Each processor runs a number of ensemble members to generate an ensemble of forecasts without any need to communicate with other processors or other members of the ensemble. Only the updated permeabilities need to be collected from, and the predicted state values sent to, the root processor for gathering before the analysis step.

2. Analysis or update: We saw that the analysis step involves an SVD computation as well as a series of matrix multiplications. These operations also scale nicely in a parallel computing environment, though not as nicely as the forecast step, because of the need for inter-processor communication. There are also some preparatory serial tasks for the analysis step.

We adopt the two most common ways to optimize parallel algorithms, load balancing and communication minimization, which we discuss below.

6.3.1 Load balancing

We statically distribute an equal number of realizations to each processor to run the simulations. As we need no processor-to-processor communication during the forecast step, dynamic load balancing would be very complex in this step. We ensure proper load balancing in the analysis step by choosing small block sizes for the operand matrices of the ScaLAPACK operations. We choose block size 64 for the long dimensions (for example the rows of the matrices A, S, D, X1, k, and both the rows and the columns of the matrices C, C^+, U, V) and block size N/nproc for the short dimensions (for example the columns of the matrices A, S, X1, k, and both the rows and the columns of the matrices I - 1_N, X2, X3, X4), where N is the number of realizations and nproc is the number of processes used for the parallel computation. In fact the smallest possible block size, 1, ensures perfect load balancing, but then the number of communications for the ScaLAPACK operations increases, leading to bad performance. ScaLAPACK suggests 64×64 two-dimensional blocks as a good compromise between balanced load and tolerable communication overhead.
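In code, the work distribution and the block-size rule above amount to a few integer divisions. The sketch below is illustrative only (it assumes, as in the thesis experiments, that N is divisible by nproc, e.g. 48 realizations on 4 processes).

/* Static distribution of realizations over processes and the block sizes
 * used for the ScaLAPACK descriptors, following the rule described above. */
void load_balance(int N, int nproc, int rank,
                  int *first_member, int *n_members,
                  int *nb_long, int *nb_short)
{
    *n_members    = N / nproc;       /* equal number of realizations per process   */
    *first_member = rank * (*n_members);
    *nb_long      = 64;              /* rows of A, S, D, X1, k; rows/cols of C, C^+ */
    *nb_short     = N / nproc;       /* cols of A, S, X1, k; rows/cols of X2..X4    */
}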

6.3.2 Minimizing data communication

To get the maximum performance we must reduce data communication. The embarrassingly parallel forecast step of the EnKF is executed in a column block distribution (figure 5.4) with respect to the predicted state ensemble A. If we also executed the analysis step in this orientation, we would lose significant performance in the ScaLAPACK matrix operation routines pdgesvd and pdgemm, as ScaLAPACK suggests a square, or at least close-to-square, rectangular processor grid arrangement for the best performance, because this arrangement requires the least internal communication for the ScaLAPACK operations. So we accept some communication overhead in order to gain significant performance from the ScaLAPACK routines: we use a column block orientation for the forecast step and a two-dimensional block cyclic distribution for the analysis step. To do this we have to gather the forecast results (state vectors) and convert them from the column block distribution to the 2D block cyclic distribution, and convert back after the analysis step (to feed the updated parameter values to the simulators). Apart from the internal communications of the ScaLAPACK operations, we put significant effort into minimizing data communication in two ways.

Construction and computation of the local matrices locally:

In the basic parallel EnKF step, listed in algorithm 2, there are many data communication statements. The reason is an over-dependence on the master-slave relationship among the available processors: we assumed that each operand and result matrix of the analysis step must either be constructed by the root processor, or be assembled by the root processor after a parallel operation, and then be transferred to the other processors for further parallel operations on it. But the local parts of the matrices can be constructed locally on each processor, removing the need for that data communication. For example, in algorithm 2, the local parts of the matrix (I - 1_N) can be constructed locally; it only requires computing which local matrix elements correspond to the diagonal elements of the global matrix. After the construction of (I - 1_N) and the transfer of the matrices A and E, the local parts of the matrices A', S, D and D' can be computed locally, significantly reducing the communication required in steps 7, 9, 11 and 19 of the reference algorithm. An important point to note is that this is only valid if the later matrix operations involving these matrices can be carried out with the current ScaLAPACK data distribution; this requirement can be ensured by fixing the block sizes, as discussed earlier. In the same way we get rid of communicating the large matrices C, U, V, C^+, and also X1, X2, X3 and X4, in steps 21, 23, 25, 27, 29, 31 and 33 respectively. All these improvements lead to algorithm 3, with very little remaining data communication. Note that in the practical implementation of this algorithm we do not use the measurement operator M in steps 8 and 15, as we consider observations of saturation in every grid block of the reservoir; we included the measurement operator M only for notational purposes and to comply with the previous discussion and algorithms. Also note that in the practical implementation the dimension and contents of the matrix A may change in step 24, due to the facts described in section 6.1. For quick reference, we recapitulate the dimensions of the matrices used in the algorithm: $A_1, A_2, \ldots, A_N \in \mathbb{R}^n$; $A, A^a, A' \in \mathbb{R}^{n \times N}$; $\mathbf{1}_N, X_2, X_3, X_4 \in \mathbb{R}^{N \times N}$; $D, D', E, S, X_1 \in \mathbb{R}^{m \times N}$; $M \in \mathbb{R}^{m \times n}$; $d, d_j, \epsilon_j \in \mathbb{R}^m$; $C, C^+, U, \Sigma, V \in \mathbb{R}^{m \times m}$.
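The local construction of (I - 1_N) only requires knowing which global row and column a local entry corresponds to. The following sketch is an illustration (not the thesis source) assuming a zero-based 2D block-cyclic layout as in section 5.3.1, with the first block mapped to process (0, 0); it fills the local part without any communication.

/* Fill the local part of (I - 1_N), where 1_N is the N x N matrix with all
 * entries 1/N.  mloc x nloc is the local array size (e.g. as returned by
 * numroc), mb x nb the block sizes, (myrow, mycol) the process coordinates
 * in an nprow x npcol grid; zero-based indexing throughout. */
void fill_local_ImN(double *loc, int mloc, int nloc, int N,
                    int mb, int nb, int myrow, int mycol, int nprow, int npcol)
{
    for (int lj = 0; lj < nloc; lj++) {
        int gj = ((lj / nb) * npcol + mycol) * nb + lj % nb;     /* local -> global column */
        for (int li = 0; li < mloc; li++) {
            int gi = ((li / mb) * nprow + myrow) * mb + li % mb; /* local -> global row    */
            double v = -1.0 / N;                                 /* -1/N off the diagonal  */
            if (gi == gj) v += 1.0;                              /* 1 - 1/N on the diagonal */
            loc[lj * mloc + li] = v;                             /* column-major local storage */
        }
    }
}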

Algorithm 3 A single step of the optimized parallel EnKF

 1: procedure ENKF                                                 Cost
 2:   Run simulator to generate state vector Ai
 3:   (root) Collect states A1, A2, ..., AN
 4:   (root) Construct A = [A1 A2 ... AN]
 5:   Communicate A                                                O(nN)
 6:   1_N = (1/N) ones(N, N)
 7:   A' = A(I - 1_N)                                              O(nN^2)
 8:   S = M A'                                                     O(mnN)
 9:   Load measurement d
10:   (root) Generate Gaussian noise ε_j, for j = 1, 2, ..., N
11:   (root) Construct E = [ε1 ε2 ... εN]
12:   Communicate E                                                O(mN)
13:   Compute d_j = d + ε_j, for j = 1, 2, ..., N
14:   Construct D = [d1 d2 ... dN]
15:   D' = D - M A                                                 O(mnN)
16:   C = S S^T + E E^T                                            O(m^2 N)
17:   [U, Σ, V] = svd(C)                                           O(m^3)
18:   C^+ = V Σ^+ U^T                                              O(m^3)
19:   X1 = C^+ D'                                                  O(m^2 N)
20:   X2 = S^T X1                                                  O(m N^2)
21:   X3 = (I - 1_N) X2                                            O(N^3)
22:   X4 = I + X3                                                  O(N)
23:   A^a = A X4                                                   O(n N^2)
24:   (root) Distribute A^a for further use by the simulators      O(nN)
25: end procedure

Choosing non-blocking communication:

This leaves only four data communication statements in algorithm 3: step 3 collects the forecast results from the column-block-oriented processor grid, step 5 transfers the matrix A, step 12 transfers the matrix E, and step 24 distributes the updated permeability values back to the column-block-oriented processor grid for the forecast step. The communications of the matrices A and E can be done in a non-blocking fashion by overlapping communication and computation. To do so we have to rearrange the steps of the algorithm without affecting the computational results. The finally optimized version is algorithm 4. Apart from the rearrangements, it splits the single statement C = SS^T + EE^T into the two statements E_E = EE^T and C = SS^T + E_E for a better rearrangement. Unfortunately the communications in steps 3 and 24 of algorithm 3 cannot be made non-blocking, due to the lack of significant work that could be overlapped with them. The root processor also does some unavoidable work serially, but fortunately these tasks are very small in terms of computation time. To achieve non-blocking communication we have to do the data transfers manually, using the routines MPI_Isend and MPI_Irecv, instead of the simple and easy-to-use collective transfers MPI_Gather and MPI_Scatter. There are also convenient CBLACS and ScaLAPACK data transfer routines (Cdgesd2d, Cdgerv2d, Cdgebs2d, Cdgebr2d, and pdgemr2d, respectively), but we cannot use these handy tools because they are all blocking communication methods. Instead we use the commonly used triad of routines for non-blocking communication: MPI_Isend, MPI_Irecv and MPI_Wait.
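The overlap of communication and computation in algorithm 4 boils down to the standard MPI_Isend/MPI_Irecv/MPI_Wait pattern. The following self-contained toy program illustrates it (it is not the thesis source and assumes at least two MPI processes): the transfer of a matrix is started, independent work is done, and the request is completed only when the data is actually needed.

#include <mpi.h>
#include <stdio.h>

#define M 5061      /* number of observations in the four-well model */
#define N 48        /* number of realizations                        */

int main(int argc, char **argv)
{
    static double E[M * N];      /* matrix to transfer (stand-in for E)      */
    static double local[M];      /* buffer for unrelated, overlapped work    */
    int rank;
    MPI_Request req;

    MPI_Init(&argc, &argv);      /* run with at least 2 processes            */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        for (int i = 0; i < M * N; i++) E[i] = 1.0;   /* pretend this is E   */
        MPI_Isend(E, M * N, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD, &req);
    } else if (rank == 1) {
        MPI_Irecv(E, M * N, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, &req);
    }

    /* independent computation overlapped with the transfer (in algorithm 4
     * this would be, for example, E E^T or d_j = d + eps_j)                 */
    for (int i = 0; i < M; i++) local[i] = (double)i * i;

    if (rank <= 1) MPI_Wait(&req, MPI_STATUS_IGNORE);  /* E usable from here */
    if (rank == 1) printf("rank 1 received E, first entry %.1f\n", E[0]);

    MPI_Finalize();
    return 0;
}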

Algorithm 4 A single step of the final parallel EnKF

 1: procedure ENKF                                                 Cost
 2:   Load measurement d
 3:   (root) Generate Gaussian noise ε_j, for j = 1, 2, ..., N
 4:   (root) Construct E = [ε1 ε2 ... εN]
 5:   Communicate E (asynchronously)                               O(mN)
 6:   Run simulator to generate state vector Ai
 7:   (root) Collect states A1, A2, ..., AN
 8:   (root) Construct A = [A1 A2 ... AN]
 9:   Communicate A (asynchronously)                               O(nN)
10:   1_N = (1/N) ones(N, N)
11:   E_E = E E^T                                                  O(m^2 N)
12:   Compute d_j = d + ε_j, for j = 1, 2, ..., N
13:   Construct D = [d1 d2 ... dN]
14:   A' = A(I - 1_N)                                              O(nN^2)
15:   S = M A'                                                     O(mnN)
16:   C = S S^T + E_E                                              O(m^2 N)
17:   [U, Σ, V] = svd(C)                                           O(m^3)
18:   C^+ = V Σ^+ U^T                                              O(m^3)
19:   D' = D - M A                                                 O(mnN)
20:   X1 = C^+ D'                                                  O(m^2 N)
21:   X2 = S^T X1                                                  O(m N^2)
22:   X3 = (I - 1_N) X2                                            O(N^3)
23:   X4 = I + X3                                                  O(N)
24:   A^a = A X4                                                   O(n N^2)
25:   (root) Distribute A^a for further use by the simulators      O(nN)
26: end procedure

Chapter 7

Results and Discussions

In this chapter we present and discuss some experimental results. For any parallel algorithm, the first requirement is correctness compared to the serial counterpart. First we present data and graphs demonstrating the correctness of our parallel implementation of the EnKF. Then we present data and graphs showing the speedup and efficiency, and other related issues, of the implemented algorithm on different numbers of processors. We worked on two different clusters: the first is a local area network (LAN) cluster with heterogeneous hardware, the other is the SARA Lisa high-performance compute cluster. On both clusters we used a single core of each processor, to have homogeneous data communication among all the processing elements.

7.1 Verification

The main goal of data assimilation is to minimize the difference between the predictions and the true values of the reference variables or parameters. So we should show that with time (or with the assimilation steps) this difference generally decreases. Also, the assimilation performance of the EnKF should be better with a large number of realizations (say, a hundred) than with a few realizations (say, a dozen). The root mean square (rms) difference between the assimilated parameters and the true values is a good measure to assess the result.

In our experiments the models simulate the reservoir operations for ten years, whereas data assimilation takes place in the first three years at six-month intervals, resulting in six assimilation steps. Table 7.1 shows the rms difference between the assimilated log permeability and the true log permeability when executed on four processors with 16, 48 and 96 realizations. A decaying curve indicates the correctness of the implemented parallel algorithm. A significant difference between the curves can be observed for 16 and 96 realizations, and a similar difference between the curves for 16 and 48 realizations. As said earlier, with a large number of realizations the EnKF performs better; additionally we observe that with a large number of realizations the behaviour of the EnKF is smoother. However, the difference between 48 and 96 realizations is not so obvious, other than the smoother curve for 96 realizations. In fact, in the presence of a large number of observations, 48 realizations seem to be sufficient to perform a good assimilation, and doubling the number of realizations does not increase the EnKF performance significantly. The graph in figure 7.1 visualizes the results in table 7.1. Figures 7.2, 7.3 and 7.4 present the log permeability fields for visual comparison.

Table 7.1: rms differences between the true and assimilated log permeabilities at different time steps

Assimilation time (year)    rms (16 rea)   rms (48 rea)   rms (96 rea)
initial (no assimilation)   0.577885       0.53903        0.553806
0.5                         0.583252       0.514202       0.507765
1                           0.498312       0.430884       0.453633
1.5                         0.524888       0.441685       0.409763
2                           0.511347       0.412005       0.379721
2.5                         0.431661       0.369525       0.382082
3                           0.440173       0.350035       0.337289
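The rms difference reported in table 7.1 is the usual root mean square of the entry-wise difference between the assimilated and the true log permeability fields. A minimal sketch of that measure is given below (illustrative only; it assumes both fields are given as flat arrays of the 5049 grid-block values).

#include <math.h>

/* Root mean square difference between two fields of n values
 * (here n = 5049 log-permeability values, one per grid block). */
double rms_difference(const double *assimilated, const double *truth, int n)
{
    double sum = 0.0;
    for (int i = 0; i < n; i++) {
        double d = assimilated[i] - truth[i];
        sum += d * d;
    }
    return sqrt(sum / n);
}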

7.2 Performance of the parallel algorithm

Initial tests and measurements were carried out on a LAN cluster with heterogeneous processing speeds and memory sizes. Table 7.2 lists the specifications of the 16 computers used. For the single-processor run we used the 2.66 GHz Intel(R) Core(TM)2 Quad CPU Q8400 machine with 6 MB cache and 4 GB main memory; this machine has roughly the average configuration of the available hardware. The machine with the best available configuration, a 3.33 GHz Intel(R) Core(TM) i5 660 with 4 MB cache and 8 GB main memory, was used only for the run with 16 processors, so that the combinations of processors used in the experiments with varying processor numbers are well balanced in terms of hardware configuration. The interconnection network is switch based with a data rate of 100 Mbps. Measurements were then taken on the SARA Lisa cluster, which has homogeneous hardware and a faster interconnection network. On the Lisa cluster we used 8-core processors with 2.26 GHz clock speed, 8 MB cache and 24 GB main memory, and a 4x DDR InfiniBand network with a bandwidth of 1600 MB/s and a latency below 6 µs. In the following discussion the forecast time, the analysis time and the total time are related as

total time = initialization time + forecast time + analysis time. (7.1)

So we should be aware that the total time is not simply the sum of the forecast time and the analysis time.


Figure 7.1: rms differences of log permeability over time

Table 7.2: Hardware specification of the LAN cluster for initial tests and measurements

Processor Name                   Speed (GHz)   Cache   RAM    # Nodes
Intel(R) Core(TM)2 Duo E8400     3.00          6 MB    4 GB   3
Intel(R) Core(TM)2 Duo E4500     2.20          2 MB    8 GB   7
Intel(R) Core(TM)2 Duo E8500     3.16          6 MB    4 GB   1
Intel(R) Core(TM)2 Quad Q8400    2.66          6 MB    4 GB   1
Intel(R) Core(TM)2 Quad Q8400    2.66          6 MB    8 GB   2
Intel(R) Core(TM)2 Duo E8400     3.00          6 MB    8 GB   1
Intel(R) Core(TM) i5 660         3.33          4 MB    8 GB   1


Figure 7.2: log permeability fields with 16 realizations: initial (top), after 6 assimilation steps in 3 years (middle), the true log permeability (bottom)


Figure 7.3: log permeability fields with 48 realizations: initial (top), after 6 assimilation steps in 3 years (middle), the true log permeability (bottom)


Figure 7.4: log permeability fields with 96 realizations: initial (top), after 6 assimilation steps in 3 years (middle), the true log permeability (bottom)

Table 7.3: Performance of the parallel implementation for 48 realizations in the LAN cluster

# processors              1      2      4       8       12      16
forecast time (h)         3.04   1.54   0.75    0.37    0.25    0.19
analysis time (h)         4.62   2.49   1.31    0.85    0.65    0.58
total time (h)            7.66   4.25   2.16    1.29    0.94    0.81
speedup (forecast)        1      1.97   4.06    8.11    12.02   16.23
speedup (analysis)        1      1.85   3.53    5.46    7.14    7.95
speedup (total)           1      1.80   3.54    5.96    8.12    9.51
efficiency (forecast) %   100    98.67  101.48  101.36  100.13  101.46
efficiency (analysis) %   100    92.64  88.13   68.25   59.48   49.69
efficiency (total) %      100    90.16  88.59   74.51   67.69   59.41

Table 7.4: Performance of the parallel implementation for 48 realizations in the SARA Lisa cluster

# processors        1      2      4       8       16      24     48
forecast time (h)   2.99   1.55   0.75    0.39    0.20    0.13   0.06
analysis time (h)   3.84   2.08   0.94    0.44    0.21    0.18   0.09
total time (h)      6.85   3.65   1.71    0.84    0.42    0.33   0.20
SP (forecast)       1      1.93   3.97    7.73    15.34   23.01  47.86
SP (analysis)       1      1.84   4.10    8.76    17.97   21.86  42.17
SP (total)          1      1.88   4.00    8.19    16.22   20.94  34.43
EP (forecast) %     100    96.44  99.14   96.66   95.85   95.88  99.70
EP (analysis) %     100    92.25  102.47  109.47  112.31  91.08  87.85
EP (total) %        100    93.92  99.91   102.37  101.36  87.26  71.72

Tables 7.3 and 7.5 show the time measurements and the corresponding speedup and parallel efficiency for 48 and 96 realizations on the LAN cluster. The graphs in figures 7.5 and 7.7 show the speedup obtained in these initial measurements on the LAN cluster with 48 and 96 realizations. Tables 7.4 and 7.6 list the time measurements and the corresponding speedup and parallel efficiency for 48 and 96 realizations on the SARA Lisa cluster. Figures 7.6 and 7.8 depict the speedup on the SARA Lisa cluster.
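The speedup and efficiency entries in tables 7.3 to 7.6 follow directly from the measured times, SP(p) = T(1)/T(p) and EP(p) = SP(p)/p. A minimal sketch of that computation is shown below; it is illustrative only, and the sample values are the forecast times for 48 realizations on the SARA Lisa cluster from table 7.4.

#include <stdio.h>

/* Speedup and parallel efficiency from measured wall-clock times. */
void report(const char *label, const double *time, const int *procs, int runs)
{
    for (int i = 0; i < runs; i++) {
        double sp = time[0] / time[i];              /* SP(p) = T(1) / T(p)   */
        double ep = 100.0 * sp / procs[i];          /* EP(p) = SP(p) / p (%) */
        printf("%s p=%2d  SP=%6.2f  EP=%7.2f%%\n", label, procs[i], sp, ep);
    }
}

int main(void)
{
    int    procs[] = {1, 2, 4, 8, 16, 24, 48};
    double fcast[] = {2.99, 1.55, 0.75, 0.39, 0.20, 0.13, 0.06};
    report("forecast", fcast, procs, 7);
    return 0;
}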

7.2.1 Difference in speedup: forecast vs analysis step

We said several times that the forecast step is embarrassingly parallel and needs no intermediate inter-process communication; hence the interconnection network plays no role in the speedup of this step and the speedup should follow the ideal case.


Figure 7.5: Speedup of the parallel implementation for 48 realizations in the LAN cluster for different parts

Table 7.5: Performance of the parallel implementation for 96 realizations in the LAN cluster

# processors              1      2       4       8      12     16
forecast time (h)         6.22   3.03    1.52    0.98   0.65   0.49
analysis time (h)         4.13   2.66    1.41    0.78   0.61   0.55
total time (h)            10.37  6.11    3.15    1.77   1.29   1.06
speedup (forecast)        1      2.05    4.09    6.37   9.52   12.68
speedup (analysis)        1      1.55    2.92    5.30   6.73   7.46
speedup (total)           1      1.70    3.29    5.84   8.04   9.76
efficiency (forecast) %   100    102.74  102.31  79.68  79.35  79.25
efficiency (analysis) %   100    77.71   73.02   66.23  56.05  46.65
efficiency (total) %      100    84.90   82.37   73.05  67.01  60.98


Figure 7.6: Speedup of the parallel implementation for 48 realizations on the SARA Lisa cluster for different parts

Table 7.6: Performance of the parallel implementation for 96 realizations in the SARA Lisa cluster

# processors        1      2       4      8       16      24     48
forecast time (h)   6.10   3.03    1.54   0.80    0.37    0.26   0.12
analysis time (h)   3.72   1.91    1.00   0.43    0.21    0.16   0.09
total time (h)      9.84   4.95    2.58   1.24    0.62    0.46   0.24
SP (forecast)       1      2.02    3.97   7.63    16.43   23.43  49.93
SP (analysis)       1      1.95    3.74   8.76    17.44   22.79  41.09
SP (total)          1      1.99    3.81   7.91    15.80   21.46  41.67
EP (forecast) %     100    100.83  99.26  95.33   102.70  97.62  104.02
EP (analysis) %     100    97.57   93.50  109.50  109.00  94.94  85.61
EP (total) %        100    99.44   95.19  98.81   98.75   89.40  86.82


Figure 7.7: Speedup of the parallel implementation for 96 realizations in the LAN cluster for different parts


Figure 7.8: Speedup of the parallel implementation for 96 realizations on the SARA Lisa cluster for different parts

On the other hand, the computation of the SVD and the matrix-matrix multiplications naturally require a large amount of intermediate inter-process communication, which implies that the speedup of the analysis step will be lower than that of the forecast step. This is observed in all the experimental results and the corresponding figures.

7.2.2 Super-linear speedup

On both clusters super-linear speedup is observed for the forecast step; on the SARA Lisa cluster we observe it also for the analysis step. We described in chapter 2 that this occurs when the number of processors becomes so large that their combined cache memory is enough to hold the required data, which is impossible for a single processor or a small number of processors. On the Lisa cluster every processor has a cache memory of 8 MB, whereas on the LAN cluster the maximum available cache size is 6 MB and the minimum is 2 MB. The shortfall in cache memory size and the slow interconnection network mentioned earlier prevent super-linear speedup for the analysis step on the LAN cluster. However, it is important to note that the forecast time for the same realization may vary slightly between runs. In our implementation of the analysis step we need to generate random Gaussian noise, and the system time works as the seed for the random number generators. Thus the noise generated in different runs of the Gaussian noise generator is not the same, which may result in slightly different assimilated parameters for different runs of the same realization. So, in addition to the effect of the larger combined cache memory, this may also contribute to the super-linear speedup of the forecast step on both clusters.

7.2.3 Parallel scalability

For both 48 and 96 realizations on both clusters, the efficiency of the forecast step almost always follows the linear speedup. This implies that the forecast step is strongly scalable, regardless of the interconnection network of the cluster. In the forecast step with 96 realizations the problem size is double that with 48 realizations, which means that the forecast step is also weakly scalable. However, for both 48 and 96 realizations on both clusters the efficiency of the analysis step gradually decreases with an increasing number of processors, which implies that the analysis step is not strongly scalable. For proper scaling of the SVD computation and the matrix-matrix multiplications, ScaLAPACK suggests at least a 1000×1000 local matrix per processor after data distribution. By this rule of thumb, roughly 25 is the upper limit on the number of processors for the model we are working with if we want good efficiency in the analysis step. We observe, for both 48 and 96 realizations, that on the SARA Lisa cluster the efficiency is well above 90% for up to 24 processors. We said earlier that increasing the number of realizations does not increase the problem size of the analysis step; to increase the problem size of the analysis step we would need a larger number of observations. So weak scaling is not relevant in this context.

7.2.4 Difference in performance: 48 realizations vs 96 realizations

With a larger number of realizations the problem size of the forecast step increases proportionally. This is observed in figures 7.9 and 7.10: in both cases the forecast time for 96 realizations is almost double that for 48 realizations. On the other hand, with a larger number of realizations the problem size of the analysis part remains almost the same, as the dimension of the matrix in the SVD computation and in the large matrix multiplication immediately after it remains the same (m×m, or 5061×5061 for our working model). We observe in the figures that the analysis time indeed remains almost the same for 48 and 96 realizations. Some small matrices do become two or four times as large when the number of realizations is doubled, but the computation time of the operations involving these matrices is not very significant; in fact we observe that in most cases the analysis time decreases with an increased number of realizations. In the last chapter we described that we used two types of block length for the ScaLAPACK operations: a block length of 64 for the long dimensions (with length 5061 or 5049) and a block length of N/nproc for the short dimensions (with length 48 or 96), where N is the number of realizations and nproc is the number of processors used. With an increased number of realizations we can use a somewhat larger block length, closer to the optimal block length 64, for the short dimensions, which leads to some performance gain, as seen in the figures.

7.2.5 Difference in speedup and performance: LAN cluster vs SARA Lisa cluster

The forecast step needs no intermediate inter-process communication, so the speedup of this step should not be affected by the interconnection network of the cluster used. We see in the figures that on both clusters the speedup of the forecast step follows the ideal case almost exactly. As SARA Lisa uses a faster interconnection network, the effect of the intermediate inter-process communications of the analysis step should be less severe there than on the LAN cluster; indeed we see in the figures that on the SARA Lisa cluster the analysis step speedup is only slightly less than the forecast step speedup. We also see in figure 7.11 that the forecast time is almost equal on both clusters, but, as the SARA Lisa cluster uses a faster interconnection network, it needs much less time for the analysis step compared to the LAN cluster.

7.2.6 Effect of ScaLAPACK process grid

The process grid orientation is an important performance issue for ScaLAPACK operations, as the number and volume of the internal intermediate inter-process communications of the various ScaLAPACK operations depend on it.


Figure 7.9: Execution time of the forecast and analysis step for 48 and 96 realizations on the LAN cluster


Figure 7.10: Execution time of the forecast and analysis step for 48 and 96 realizations on the SARA Lisa cluster


Figure 7.11: Comparison of execution time of the forecast and analysis step for 48 realizations on different clusters

Table 7.7: Execution time (in minutes) of the analysis step for different process grid orientations for 48 realizations

process grid   LAN cluster   SARA Lisa cluster
1x16           77.43         15.97
2x8            44.48         13.78
4x4            35.66         12.83

ScaLAPACK suggests a square grid or, in case it is impossible to construct a square grid, an arrangement of the available processors as close to square as possible. The performance impact of the process grid orientation should be more pronounced for slow interconnection networks. Table 7.7 lists the analysis times for the 3 possible process grid arrangements of 16 processors on both clusters. Here we compare only the analysis time, as only this is affected by the process grid arrangement. Figure 7.12 depicts the data in the table. We observe that the 1×16 grid, the worst arrangement compared to a square grid, shows the worst performance, whereas the 4×4 grid, the best arrangement, shows the best performance. As we said, the process grid arrangement affects the intermediate internal inter-process communications of the ScaLAPACK operations; hence the performance of the LAN cluster, with its slow interconnection network, is affected severely.

7.2.7 Effect of ScaLAPACK block size

ScaLAPACK logically divides the data of the original matrix into two-dimensional blocks and distributes the matrix in units of these blocks among the available processors in the process grid. A block of length and width 1, that is a block with just one element, ensures the highest load balance of the ScaLAPACK operations but increases the amount of intermediate internal inter-process communication. So there is always a trade-off between load balancing and the amount of data communication in choosing the block length and width that give the best performance. ScaLAPACK suggests 64 as the optimal choice; however, with the larger cache memories of current processors, the optimal block length may vary between applications. Table 7.8 lists the analysis time for different block lengths on both clusters. We varied the block length only for the long dimensions (with length 5061 or 5049 for the used model) of the matrices described earlier, as the short dimensions (with length N/nproc) of the matrices are too short for the optimal choice of block length to apply. Here too the block length affects only the analysis step, and we compare only the analysis time. Figure 7.13 depicts the data in the table. We observe that varying the block length has less effect on the performance than varying the process grid arrangement. Here too the LAN cluster, with its slow interconnection network, is more sensitive to the


Figure 7.12: Execution time of the analysis step for different process grid orientations for 48 realizations

Table 7.8: Execution time (in minutes) of the analysis step for different block sizes for 48 realizations

    Block length    SARA Lisa cluster    LAN cluster
    1               14.71                37.99
    16              12.95                35.83
    32              12.84                36.12
    64              12.83                35.66
    128             13.14                37.40
    256             14.81                37.36
    512             13.23                40.15

On the LAN cluster a block length of 64 gives the best performance and 512 the worst. On the SARA Lisa cluster block lengths of 1 and 256 are the worst choices, whereas 64 gives the best performance, with 32 a very close second.
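To show where the block length enters, the sketch below (in C, using the standard ScaLAPACK routines descinit_ and numroc_) builds a descriptor for a tall m×n ensemble matrix, blocking only the long row dimension with a chosen block length and keeping the short dimension in one block per process column; the helper name and the column-blocking choice are illustrative assumptions rather than the exact settings of our code.

    /* ScaLAPACK tools, Fortran symbols called from C. */
    extern void descinit_(int *desc, int *m, int *n, int *mb, int *nb,
                          int *irsrc, int *icsrc, int *ictxt, int *lld, int *info);
    extern int numroc_(int *n, int *nb, int *iproc, int *isrcproc, int *nprocs);

    /* Descriptor for an m x n matrix (e.g. 5061 x 48) on the grid in ictxt.
     * Only the long row dimension is blocked, with block length mb.          */
    static int make_descriptor(int desc[9], int m, int n, int mb,
                               int ictxt, int myrow, int nprow, int npcol)
    {
        int izero = 0, info;
        int nb = (n + npcol - 1) / npcol;     /* short dimension: one block per column */
        int locrows = numroc_(&m, &mb, &myrow, &izero, &nprow);
        int lld = locrows > 1 ? locrows : 1;  /* local leading dimension */

        descinit_(desc, &m, &n, &mb, &nb, &izero, &izero, &ictxt, &lld, &info);
        return info;                          /* 0 on success */
    }

    /* e.g. make_descriptor(descA, 5061, 48, 64, ictxt, myrow, nprow, npcol); */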

7.2.8 Effect of non-blocking communication

Data communication is generally considered a bottleneck in parallel computing. One remedy is to overlap communication and computation through non-blocking communication, and we implemented this approach in our application. It is important to note, however, that the four-well model used in our application works with 5061 observations. The large matrices of dimension 5061×5061 need no communication, as they are constructed and used locally on each processor. In each assimilation step we only need to communicate three matrices of dimension 5061×N, where N is the number of realizations. For this model, therefore, data communication is not a very serious issue and we see only a small difference between the performance of blocking and non-blocking communication. To obtain the blocking counterpart of our application we replace MPI_Isend with MPI_Send and MPI_Irecv with MPI_Recv and omit the MPI_Wait statements, keeping all other settings, the sequence of computation, the correctness and, above all, the simplicity intact. Table 7.9 lists the execution times of the analysis step on the LAN cluster for 48 realizations and on the SARA Lisa cluster for 96 realizations. We use 96 realizations for the Lisa cluster because its interconnection network is very fast and larger messages are needed to make the difference visible. Figure 7.14 shows the data in the table. On the SARA Lisa cluster we indeed observe the behaviour described above, but on the LAN cluster we get unexpected results: instead of reducing the analysis time, non-blocking communication increases it. This may be an effect of the particular MPI implementation on the LAN cluster, and of its heterogeneous hardware, which may lead to unexpected delays in the synchronization.


Figure 7.13: Execution time of the analysis step for different block sizes for 48 realizations

Table 7.9: Execution time (in minutes) of the analysis step for blocking and non-blocking data communication

    LAN cluster (48 realizations)        SARA Lisa cluster (96 realizations)
    # p    blocking    non-blocking      # p    blocking    non-blocking
    2      149.46      192.82            2      115.38      114.50
    4      78.56       101.90            4      60.71       59.74
    8      50.71       62.05             8      25.66       25.51
    12     38.80       46.19             16     14.54       12.81
    16     34.83       35.66             24     11.84       9.81
    -      -           -                 48     7.72        5.44

For the LAN cluster we therefore used the execution times obtained with blocking communication, since it performs better, in tables 7.5 and 7.7 and the corresponding figures.
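The change between the two variants essentially amounts to the sketch below, which assumes a simple point-to-point exchange of one of the 5061×N matrices with a partner process; the buffer names and the surrounding call structure are placeholders.

    #include <mpi.h>

    /* Non-blocking exchange: the transfer can overlap with independent local
     * work.  Replacing Isend/Irecv/Waitall by MPI_Send/MPI_Recv (and dropping
     * the overlapped work) gives the blocking variant used for comparison.    */
    void exchange_block(double *sendbuf, double *recvbuf, int count,
                        int partner, MPI_Comm comm)
    {
        MPI_Request req[2];

        MPI_Isend(sendbuf, count, MPI_DOUBLE, partner, 0, comm, &req[0]);
        MPI_Irecv(recvbuf, count, MPI_DOUBLE, partner, 0, comm, &req[1]);

        /* ... independent local computation here overlaps with the transfer ... */

        MPI_Waitall(2, req, MPI_STATUSES_IGNORE);
    }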


Figure 7.14: Execution time of the analysis step for blocking and non-blocking data communication on the LAN cluster (top) and SARA Lisa cluster (bottom)

Chapter 8

Conclusion and Future Work

8.1 Conclusion

We mentioned in chapter 1 that oil companies are struggling to meet the increasing demand for energy. It is therefore important to develop new technologies and strategies that optimize recovery from existing fields. Whatever the technology or strategy, it requires better knowledge and characterization of the reservoir. The reservoir description is improved by adjusting uncertain parameters such as permeability through history matching or data assimilation. The ensemble Kalman filter (EnKF) is a useful tool for data assimilation, but it is very time consuming, and reducing the data assimilation time in the overall reservoir management cycle is an important issue. The inherent parallelism in the EnKF makes parallelization a natural step.

In this thesis we described the parallelization of the whole EnKF process for oil reservoirs in the presence of sparse measurement data and dense time-lapse seismic data. In other parallelizations of the EnKF for a variety of applications, the forecast step is parallelized completely while the analysis step is parallelized only by relying on some form of localisation. Past parallel implementations of the EnKF for oil reservoirs parallelize only the forecast step, and at best the analysis step is parallelized partially. The parallel performance of our implementation compares favourably with other implementations of the parallel EnKF for oil reservoirs. In [29], with 100 realizations, parallel efficiencies of 50.8% and 47.2% are reported on two different clusters with 50 processors; on the SARA Lisa cluster with 48 processors and 96 realizations we achieved 86.8% efficiency. In [23], with 200 realizations on 16 processors, a parallel efficiency of 76.8% is reported, whereas our implementation reaches 98.8% efficiency with 16 processors on the SARA Lisa cluster for 96 realizations. As noted earlier, both [29] and [23] work with small measurement data sets from the wells, so their analysis step is less critical and they mostly parallelize the forecast step only. Below we summarize other important conclusions from our implementation of the parallel EnKF for oil reservoirs.

1. The parallel EnKF, like the traditional serial EnKF, performs data assimilation by variance minimization and gives better predictions by combining past predictions with past and present measurements. A large number of realizations assimilates better than a very small number; however, in the presence of a large number of observations, once the ensemble is sufficiently large, further increasing the number of realizations does not improve the data assimilation performance significantly.

2. The forecast step shows linear speedup on any type of cluster and scales strongly, whereas the analysis step shows sub-linear speedup and has an upper limit on the number of processors (25 for our model with 5061 observations) beyond which the scaling degrades.

3. Super-linear speedup may be observed for the forecast step on any type of cluster, whereas for the analysis step it may be observed only on clusters with a very fast interconnect, and only for small numbers of processors (from 4 or 8 up to 16).

4. Increasing the number of realizations proportionally increases the problem size of the forecast step, and its execution time increases proportionally; the problem size of the analysis step, however, remains almost the same and its execution time does not change significantly.

5. The forecast step is embarrassingly parallel and needs no inter-process communication, so its execution time is almost independent of the hardware and interconnection network of the cluster used. The analysis step, in contrast, requires a large amount of inter-process communication and its execution time depends significantly on the interconnection network; a cluster with a fast interconnect completes the analysis step faster.

6. The ScaLAPACK grid orientation has a significant effect on the analysis step. As the grid orientation determines the number and volume of data communications, it mostly affects the performance of clusters with a slow interconnection network.

7. The ScaLAPACK block length has a minor effect on the analysis time. Varying the block length changes the number and volume of data communications, so here, too, the performance of a cluster with a slow interconnection network is affected most.

8. With small communication volumes, non-blocking communication improves performance only insignificantly, but it should be of considerable benefit for models producing a very large number of observations and for larger numbers of realizations.

8.2 Future work

Implementation on GPU clusters

Further performance gains can be achieved by exploiting GPGPU, or GPU computing, on each node. GPGPU stands for general-purpose computation on graphics processing units: the GPU is used as a co-processor to accelerate general-purpose scientific and engineering computing. With its massively parallel processing power, the GPU accelerates an application by taking over some of its compute-intensive, time-consuming portions while the rest of the application continues to run on the CPU [5]. To obtain the maximum performance from the GPU hardware, four key issues have to be considered: memory optimizations, execution configuration optimization, instruction optimizations and control flow [25]. To benefit from GPUs the problem size must be large enough that the gains outweigh the inherent overheads of GPU computing. In our EnKF analysis step, only the SVD computation and the matrix multiplication immediately following it are suitable for execution on GPUs, given the sizes of the operand matrices. To keep all other settings of our implementation unchanged, we would have to implement GPU versions of the routines pdgesvd and pdgemm (SVD computation and matrix-matrix multiplication respectively) that preserve the syntax and semantics of ScaLAPACK, and call these instead of the originals.

Since we use simsim for the forecast as third-party software, a complete GPU implementation of the forecast step would be very complex. Instead, we should look for functions that consume most of the computation time and can easily be converted to CUDA kernels callable from the original MATLAB code through the MATLAB Parallel Computing Toolbox (PCT). MATLAB profiling shows that the functions simsim_trans and simsim_rel consume about 90% and 7% of the overall computation time of simsim respectively. The first computes the oil and water transmissibility matrices based on graph algorithms, and the second computes relative permeabilities using Corey functions and performs some truncation operations; both are well suited to a GPU implementation. It is also observed that for a 10-year simulation with a runtime of around 290 seconds on a 2.4 GHz Core 2 Duo machine with 2 GB RAM, these functions are called 7042 and 7227 times respectively, which corresponds to roughly 37064 and 2809 microseconds per call. With a rough estimate of 20 microseconds for a CUDA kernel launch, these figures suggest a potential performance gain from a GPU implementation of these two functions. There are, of course, additional overheads for uploading data to GPU memory before kernel execution and downloading results afterwards. Moreover, double precision floating point is supported only on GPUs of compute capability 1.3 or higher, and double precision operations are slow because each multiprocessor contains only a single double precision unit. Double precision operations can be replaced by single precision at the cost of accuracy. The achievable performance and accuracy therefore depend on the practical implementation of the kernels and could be an independent research topic in themselves. Finally, [24], [10], [33] and [11] describe independent implementations of reservoir simulation on GPUs that might be used in place of simsim on GPU clusters.

Dynamic load balancing

Another research direction is dynamic load balancing of the forecast step. Initially an equal number of realizations can be assigned to each processor, and the completion time of each realization can be tracked per processor. In subsequent steps the realizations can then be distributed so as to equalize the completion times observed in the previous step rather than simply the counts, as is common in load-balancing schemes. To do this, however, a considerable amount of model, control and parameter information has to be communicated back and forth between the MATLAB Compiler Runtime (MCR) and the simsimdriver, as discussed in chapter 6 and shown in figure 6.3. Major modifications to the wrapper functions initSimsim and runSimsim are also required, and some additional MATLAB functions may be needed.
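A minimal sketch of the redistribution logic is given below (in C, with hypothetical names; the MCR/simsimdriver data exchange described above is not shown): each processor receives a share of the realizations proportional to the speed it showed in the previous forecast step.

    /* speed[p] is an observed rate, e.g. realizations completed per hour by
     * processor p in the previous forecast step.  share[p] receives the number
     * of realizations to assign to p in the next step (shares sum to nreal).   */
    void rebalance(int nreal, int nproc, const double *speed, int *share)
    {
        double total = 0.0;
        int assigned = 0;

        for (int p = 0; p < nproc; p++)
            total += speed[p];

        for (int p = 0; p < nproc; p++) {
            share[p] = (int)(nreal * speed[p] / total);
            assigned += share[p];
        }

        /* Distribute the remainder left by rounding, one realization at a time. */
        for (int p = 0; assigned < nreal; p = (p + 1) % nproc) {
            share[p]++;
            assigned++;
        }
    }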

Verifying weak scaling

We could not verify weak scaling of the analysis step, as we worked with a fixed model with a constant number of observations. Weak scaling of the analysis step could be verified using a model that allows a varying number of observations.

Other EnKF schemes

EnKF schemes fall into two broad categories, stochastic and deterministic, each with inherent advantages and disadvantages. The typical EnKF that we parallelized is a stochastic one. Closely related ensemble-based filters include the double EnKF (DEnKF), the singular evolutive interpolated Kalman filter (SEIK), the ensemble adjustment Kalman filter (EAKF), the serial ensemble square root filter (ESRF), the maximum likelihood ensemble filter (MLEF), the deterministic EnKF (DEnKF), the ensemble randomised maximum likelihood filter (EnRML), the ensemble transform Kalman filter (ETKF), the left-multiplied ensemble square root filter (LESRF) and ensemble optimal interpolation (EnOI). Depending on their applicability to oil reservoirs, they could be parallelized with small modifications to our application and compared on the various parallel performance metrics.

Bibliography

[1] 1970s energy crisis, from Wikipedia, the free encyclopedia. http://en.wikipedia.org/wiki/1970s_energy_crisis. Last visited on 09-05-2012.
[2] International energy outlook 2011. http://www.eia.gov/oiaf/aeo/tablebrowser/. Last visited on 12-05-2012.
[3] ScaLAPACK data distribution. http://scv.bu.edu/~kadin/PACS/numlib/chapter2/parlib.2006.html. Last visited on 07-05-2012.
[4] ScaLAPACK - Scalable Linear Algebra PACKage. http://www.netlib.org/scalapack/. Last visited on 29-04-2012.
[5] What is GPU Computing? http://www.nvidia.com/object/what-is-gpu-computing.html. Last visited on 22-05-2012.
[6] G.M. Amdahl. Validity of the single-processor approach to achieving large scale computing capabilities. In AFIPS Conference Proceedings, volume 30, pages 483–485, Atlantic City, N.J., April 1967. AFIPS Press, Reston, Va.
[7] L. S. Blackford, J. Choi, A. Cleary, E. D'Azevedo, J. Demmel, I. Dhillon, J. Dongarra, S. Hammarling, G. Henry, A. Petitet, K. Stanley, D. Walker, and R. C. Whaley. ScaLAPACK Users' Guide. Society for Industrial and Applied Mathematics, Philadelphia, PA, 1997.
[8] G. Burgers, P. J. van Leeuwen, and G. Evensen. Analysis scheme in the ensemble Kalman filter. Mon. Weather Rev., 126:1719–1724, 1998.
[9] Jack J. Dongarra and R. Clint Whaley. LAPACK Working Note 94, A User's Guide to the BLACS, v1.1. http://www.netlib.org/blacs/lawn94.ps. May 05, 1997.
[10] K.P. Esler, S. Atan, B. Ramirez, and V. Natoli. Accelerating reservoir simulation with GPUs. In 73rd EAGE Conference & Exhibition, Vienna, Austria, May 2011. European Association of Geoscientists & Engineers.
[11] Ken Esler, Vincent Natoli, Safian Atan, and Benjamin Ramirez. Accelerating reservoir simulation with GPUs. http://hpcoilgas.citris-uc.org/system/files/private/SEG_HPC2011.pdf.
[12] G. Evensen. Sampling strategies and square root analysis schemes for the EnKF. Ocean Dyn., 54:539–560, 2004.
[13] G. Evensen. Data Assimilation: The Ensemble Kalman Filter. Springer, New York, 2007.
[14] S. Gillijns, O. Barrero Mendoza, J. Chandrasekar, B. L. R. De Moor, D. S. Bernstein, and A. Ridley. What is the ensemble Kalman filter and how well does it work? In Proceedings of the 2006 American Control Conference, pages 4448–4453, 2006.
[15] A. Grama, A. Gupta, G. Karypis, and V. Kumar. Introduction to Parallel Computing. Addison Wesley, 2003.

[16] J. Gustafson. Fixed time, tiered memory, and superlinear speedup. In Proceedings of the Fifth Distributed Memory Computing Conference (DMCC5), October 1990.
[17] John L. Gustafson. Reevaluating Amdahl's law. Commun. ACM, 31(5):532–533, May 1988.
[18] G. Hager and G. Wellein. Introduction to High Performance Computing for Scientists and Engineers. CRC Press, July 2010.
[19] J.D. Jansen, S.D. Douma, D.R. Brouwer, P.M.J. Van den Hof, O.H. Bosgra, and A.W. Heemink. Closed loop reservoir management. In SPE Reservoir Simulation Symposium, The Woodlands, Texas, U.S.A., February 2009. Society of Petroleum Engineers.
[20] Christian L. Keppenne. Data assimilation into a primitive-equation model with a parallel ensemble Kalman filter. Monthly Weather Review, 128(6):1971–1981, June 2000.
[21] Christian L. Keppenne and Michele M. Rienecker. Initial testing of a massively parallel ensemble Kalman filter with the Poseidon isopycnal ocean general circulation model. Monthly Weather Review, 130(12):2951–2965, December 2002.
[22] William Lahoz, Boris Khattatov, and Richard Menard, editors. Data Assimilation: Making Sense of Observations. Springer, 2010.
[23] B. Liang, K. Sepehrnoori, and M. Delshad. An automatic history matching module with distributed and parallel computing. Petroleum Science and Technology, 27(10).
[24] Hui Liu, Song Yu, Zhangxin Chen, Ben Hsieh, and Lei Shao. Parallel preconditioners for reservoir simulation on GPU. In SPE Latin American and Caribbean Petroleum Engineering Conference, Mexico City, Mexico, April 2012. Society of Petroleum Engineers.
[25] NVIDIA. Optimization: OpenCL best practices guide. http://developer.download.nvidia.com/compute/cuda/3_2/toolkit/docs/OpenCL_Best_Practices_Guide.pdf. Last visited on 04-02-2012.
[26] Donald W. Peaceman, editor. Fundamentals of Numerical Reservoir Simulation. Elsevier, 1977.
[27] Abdus Satter and Ganesh C. Thakur. Integrated Petroleum Reservoir Management: A Team Approach. PennWell Publishing Company, 1994.
[28] J.-A. Skjervheim, G. Evensen, Norsk Hydro, S.I. Aanonsen, B.O. Ruud, and T.A. Johansen. Incorporating 4D seismic data in reservoir simulation models using ensemble Kalman filter. In 2005 SPE Annual Technical Conference and Exhibition.
[29] Reza Tavakoli, Gergina Pencheva, and Mary F. Wheeler. Multi-level parallelization of ensemble Kalman filter for reservoir history matching. In 2011 SPE Reservoir Simulation Symposium.
[30] Ganesh C. Thakur. What is Reservoir Management? Society of Petroleum Engineers, June 1996.
[31] Brian F. Towler. Fundamental Principles of Reservoir Engineering. Society of Petroleum Engineers, Richardson, Texas, 2002.
[32] Helene Hafslund Veire, Hilde Grude Borgos, and Martin Landrø. Stochastic inversion of pressure and saturation changes from time-lapse multi component data. Geophysical Prospecting, 2007.
[33] Mark Wakefield. The use of GPUs for reservoir simulation. http://static.og-hpc.org/Rice2011/Workshop-Presentations/OG-HPC%20PDF-4-WEB/Wakefield-GPUs-for-reservoir-simulation.pdf.
[34] Eric W. Weisstein. Monte Carlo method. http://mathworld.wolfram.com/MonteCarloMethod.html. Last visited on 02-02-2012.
[35] M.L. Wiggins and R.A. Startzman. An approach to reservoir management. In 65th Annual Technical Conference and Exhibition of the Society of Petroleum Engineers.
