UNIVERSITY OF CALGARY

Preconditioned Iterative Solvers on GPU and an In-Situ Combustion Simulator

by

Bo Yang

A THESIS

SUBMITTED TO THE FACULTY OF GRADUATE STUDIES

IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE

DEGREE OF DOCTOR OF PHILOSOPHY

GRADUATE PROGRAM IN CHEMICAL AND PETROLEUM ENGINEERING

CALGARY, ALBERTA

JANUARY, 2020

© Bo Yang 2020

Abstract

This thesis consists of two parts: preconditioned iterative solvers on GPU and an in-situ combustion simulator.

The purpose of the first study is to develop a new parallel solution platform based on the features of the GPU.

The application of HPC (high-performance computing) technology to reservoir simulation has become an inevitable trend. As an HPC platform, the GPU can provide an effective solution for personal computers and workstations. In this research, a series of special CPR (constrained pressure residual) preconditioned solvers is developed for black oil models, and a variety of other preconditioned solvers is implemented for comparison.

The numerical experiments verify a significant improvement in the parallel performance of the solvers on GPU. They also provide an overall comparison among combinations of different GPUs, solvers, and preconditioners. The results demonstrate that the developed CPR preconditioner has excellent advantages in both parallelism and convergence for the solution of a benchmark reservoir model.

The purpose of the second study is to develop a new comprehensive ISC (in-situ combustion) simulator with the PER (pseudo equilibrium ratio) method and to compare its functionality with that of a benchmark simulator based on the VS (variable substitution) method.

ISC is considered a promising recovery method because of its low cost and low environmental impact. However, an ISC simulator is regarded as one of the most complex simulators to develop. The PER method can reduce the complexity of simulator development because it lowers the influence of phase disappearance and appearance on the mathematical system of reservoir simulation.

The ISC simulator in this study is developed with a comprehensive set of typical functions of the ISC process. To verify the equivalence of the numerical results between the PER method and the VS method, numerical experiments are carried out over a wide range of cases. Because the results show a very close match, the research provides reliable experimental support for popularizing the use of the PER method to develop an in-situ combustion simulator.

Acknowledgements

I would like to express my most genuine gratitude to my supervisor, Dr. Zhangxing Chen. It is fortunate for me to be his student. His extensive knowledge and good personality deeply influence me. He will always be a model for me.

I would like to show my sincerest appreciation to my co-supervisor, Dr. Wenyuan Liao. It is my great pleasure to receive guidance on mathematics and career concerns from him. His advice and care will be unforgettable to me.

I would like to present special recognition to Dr. Antony Ware for his valuable suggestions, tolerant attitude, and patient explanations as an indispensable part of this research.

I would like to give my great gratefulness to my committee members and examiners, Drs. Qingye Lu, Jalel Azaiez, Antony Ware, and Pengtao Sun, for their experience, comments, and time.

I am very grateful to Dr. Hui Liu for his help in the past years. The technical discussions with him enabled many difficulties to be overcome. For the ISC simulator research, I wish to thank Min Yang, who cooperated in the collation of test cases, and Ruijian He, who participated in the coding of the dry combustion part.

Thanks go to the Department of Chemical and Petroleum Engineering and the Reservoir Simulation Group, Schulich School of Engineering.

To my wife Xiaoling Zhong

Table of Contents

Abstract
Acknowledgements
Table of Contents
List of Tables
List of Figures
List of Symbols
1 RESERVOIR SIMULATOR OVERVIEW
1.1 Petroleum
1.2 Reservoir Numerical Simulation
1.3 Reservoir Simulators
1.4 Layout
2 INTRODUCTION (1)
2.1 High-Performance Computing
2.2 Preconditioned Iterative Solvers on GPU
2.3 Literature Review
2.4 Research Objectives
3 GPU AND BLACK OIL MODEL
3.1 GPU
3.1.1 GPU Architecture
3.1.2 CUDA
3.2 Black Oil Model
3.2.1 Mathematical Model
3.2.2 SPE10 Model 2
4 PRECONDITIONED ITERATIVE SOLVERS
4.1 Sparse Matrices
4.1.1 Discretization
4.1.2 Permutation
4.1.3 Storage
4.2 Iterative Solvers
4.2.1 Linear Solution Methods
4.2.2 Krylov Subspace Methods
4.2.3 AMG Method
4.3 Preconditioners
4.3.1 Principle
4.3.2 ILU
4.3.3 BILU
4.3.4 CPR
4.4 Parallel Implementation
4.4.1 Parallel Components
4.4.2 Parallel Triangular Solver
4.4.3 RAS
4.4.4 Parallel Algorithms
5 NUMERICAL EXPERIMENTS (1)
5.1 Introduction
5.2 Environments
5.3 Solvers
5.4 ILU
5.5 BILU
5.6 RAS
5.7 AMG
5.8 CPR
6 CONCLUSIONS (1)
6.1 Conclusions
6.2 Future Work
7 INTRODUCTION (2)
7.1 Thermal Recovery
7.2 In-Situ Combustion Process
7.3 Literature Review
7.4 Research Objectives
8 IN-SITU COMBUSTION SIMULATOR
8.1 Model Description
8.1.1 Phases and Components
8.1.2 Chemical Reactions
8.1.3 Physical Properties
8.1.4 Assumptions
8.2 Mathematical System
8.2.1 Mass Conservation
8.2.2 Energy Conservation
8.2.3 Constraints
8.2.4 Well Equation
8.2.5 Phase Change
8.2.6 Heat Loss
8.2.7 PDE System
8.3 Numerical Methods
8.3.1 Discretization Scheme
8.3.2 Specific Treatments
8.3.3 Nonlinear System
8.3.4 Newton-Raphson Method
8.3.5 Simulator Flow
9 NUMERICAL EXPERIMENTS (2)
9.1 Introduction
9.2 Dry Combustion Tube
9.3 Wet Combustion Tube
9.4 Full-Scale Heat Loss
9.5 2-D Multi-Perf Wells
9.6 3-D Inverted Five-Spot Pattern
10 CONCLUSIONS (2)
10.1 Conclusions
10.2 Future Work
REFERENCES

List of Tables

1.1 Total petroleum and other liquids production of major oil-producing countries.
1.2 Open source reservoir simulators.
1.3 Commercial reservoir simulators.
1.4 Reservoir simulator general classification.

2.1 Evolution of computing architecture.
2.2 Performance comparison between the GPU and CPU.
2.3 Linear solvers.
2.4 Preconditioners.

3.1 Micro-architecture and CUDA cores.
3.2 Relationship between concepts of software and hardware in GPU.
3.3 Execution configuration arguments for kernel.
3.4 Phases and components for black oil model.

4.1 Application of FDM, FEM, and FVM.
4.2 Basic storage format of sparse matrices.
4.3 Linear solution methods for sparse systems.

5.1 Workstation environments.
5.2 3-D Poisson matrices.
5.3 Matrices from SPE10 model 2.
5.4 [Env] BiCGSTAB, ILU(0), Poisson, Env 1.
5.5 [Env] BiCGSTAB, ILU(0), Poisson, Env 2.
5.6 [Env] BiCGSTAB, ILU(0), SPE10, Env 1.
5.7 [Env] BiCGSTAB, ILU(0), SPE10, Env 2.
5.8 [Svr] GMRES, ILU(0), Poisson, Env 2.
5.9 [Svr] GMRES, ILU(0), SPE10, Env 2.
5.10 [ILU] BiCGSTAB, ILU(k), SPE10 1, Env 2.
5.11 [ILU] GMRES, ILU(k), Poisson 1, Env 1.
5.12 [ILU] BiCGSTAB, ILUT(fixed p), SPE10 1, Env 2.
5.13 [ILU] BiCGSTAB, ILUT(fixed τ), SPE10 1, Env 2.
5.14 [BILU] GMRES, BILU(0), Poisson 1, Env 1.
5.15 [BILU] GMRES, BILUT(7, 0.03), SPE10 1, Env 1.
5.16 [RAS] BiCGSTAB, RAS ILU(0), Poisson 1, Env 2.
5.17 [RAS] BiCGSTAB, RAS ILUT(7, 0.03), SPE10 1, Env 2.
5.18 [AMG] AMG, Poisson 1, Env 2.
5.19 [AMG] AMG, Poisson 2, Env 2.
5.20 [AMG] AMG, Poisson 3, Env 2.
5.21 [AMG] AMG, Poisson 4, Env 2.
5.22 [AMG] AMG, V-cycle, Poisson 1, Env 2: level.
5.23 [AMG] AMG, V-cycle, Poisson 1, Env 2: smoothing iteration.
5.24 [CPR] BiCGSTAB, CPR ILU(0), SPE10, Env 2.
5.25 [CPR] GMRES, CPR ILU(0), SPE10, Env 2.
5.26 [CPR] εrel comparison for BiCGSTAB, SPE10 3, Env 2.
5.27 [CPR] εrel comparison for GMRES, SPE10 3, Env 2.
5.28 [CPR] BiCGSTAB, CPR ILUT(fixed p), SPE10 1, Env 2.
5.29 [CPR] BiCGSTAB, CPR ILUT(fixed τ), SPE10 1, Env 2.

8.1 Phases and components for ISC model.
8.2 Equations and unknowns.
8.3 Component indexing.

9.1 [Dry tube] Input data.
9.2 [Dry tube] Swt.
9.3 [Dry tube] Slt.
9.4 [Full-scale] Input data.
9.5 [Full-scale] Swt.
9.6 [Full-scale] Slt.

List of Figures and Illustrations

1.1 A breakdown of the products made from a typical barrel of USoil...... 2 1.2 Recoveryprocess...... 4 1.3 General stages of reservoir simulation...... 5 1.4 Asamplereservoirmodel...... 6 1.5 Reservoir simulator and linear solver...... 7

2.1 Summit(OLCF-4)...... 14 2.2 Single-coreCPUcomputer...... 15 2.3 Multi-coreCPUcomputer...... 15 2.4 GPUcomputer...... 16 2.5 Computercluster...... 16 2.6 Hybridcluster(1)...... 16 2.7 Hybridcluster(2)...... 17 2.8 Intel Core i7-6950X processor...... 17 2.9 NVIDIATeslaK40GPU...... 17 2.10 Floating-point operations per second for the CPU and GPU...... 18 2.11 Memory bandwidth for the CPU and GPU...... 19 2.12 Patterns of coefficient ...... 19

3.1 CPUandGPUarchitecture...... 27 3.2 Fermi streaming multiprocessor (SM)...... 28 3.3 FermiSMwarpscheduling...... 29 3.4 Threads, blocks, and grids, with memory spaces...... 31 3.5 ProcessingflowonCUDA...... 33 3.6 Blackoilmodel...... 34 3.7 SPE10 model 2: φ forwholemodel...... 37 3.8 SPE10 model 2: φ forUpperNess...... 37 3.9 SPE10 model 2: logarithm of Kh forTarbert...... 37 3.10 SPE10 model 2: logarithm of Kh forUpperNess...... 37 3.11 SPE10 model 2: well configuration...... 38

4.1 Sparsematrixpattern...... 41 4.2 A sample pattern of matrix A...... 44 4.3 COOformat...... 44 4.4 ELLformat...... 45 4.5 CSRformat...... 45 4.6 HYBformat...... 46 4.7 HEC format for matrix A...... 47 4.8 ELLpartstoredbycolumn...... 47 4.9 HECdatastructure...... 48 4.10 BCSR (3 3)format...... 48 4.11 BCSRdatastructure.× ...... 49 4.12 ICSRdatastructure...... 49

xi 4.13 Error component in fine grid or coarse grid...... 54 4.14Two-grid...... 55 4.15Multigrid...... 55 4.16 AMGhierarchy...... 56 4.17 AMG:examplepattern...... 57 4.18 AMG:examplegraph...... 57 4.19AMG:V-cycle...... 60 4.20 AMG:W-cycle...... 60 4.21 AMG: operations in V-cycle...... 60 4.22 AMG: F-cycle with µ = 1...... 61 4.23 AMG: F-cycle with µ = 2...... 61 4.24 Gaussian elimination: LU factorization...... 65 4.25 Pattern NZ(A)...... 66 4.26 ILU(0)factorization...... 66 4.27 ILU(1)factorization...... 67 4.28 ILU(2)factorization...... 67 4.29 Decoupled ILU(k):symbolicphase...... 68 4.30 Decoupled ILU(k):factorizationphase...... 69 4.31 ILUTfactorization...... 69 4.32 Decoupled block-wise ILU(k):symbolicphase...... 72 4.33 Decoupled block-wise ILU(k):factorizationphase...... 72 4.34 Block-wise lower part and upper part...... 73 4.35 Block-wise diagonal part and modified upper part...... 74 4.36 Feature analysis of Jacobian matrix...... 77 4.37CPRprocedure...... 78 4.38 RAS ILU(0)process...... 83

5.1 [Env] Env 1 vs. Env 2 for Poisson...... 91 5.2 [Env] Env 1 vs. Env 2 for SPE10...... 91 5.3 [Svr] BiCGSTAB vs. GMRES for Poisson...... 93 5.4 [Svr] BiCGSTAB vs. GMRES for SPE10...... 93 5.5 [ILU] BiCGSTAB, ILU(k)...... 94 5.6 [ILU] GMRES, ILU(k)...... 94 5.7 [ILU] BiCGSTAB, ILUT(fixed p)...... 95 5.8 [ILU] BiCGSTAB, ILUT(fixed τ)...... 95 5.9 [BILU] BILU(0), Poisson 1...... 96 5.10 [BILU] BILUT, SPE10 1...... 96 5.11 [RAS] Poisson 1,GPUite...... 98 5.12 [RAS] Poisson 1,Speedup...... 98 5.13 [RAS] SPE10 1,GPUite...... 99 5.14 [RAS] SPE10 1,Speedup...... 99 5.15 [AMG] Poisson,GPUite...... 101 5.16 [AMG] Poisson,Speedup...... 101 5.17[AMG]Level...... 101 5.18 [AMG] Smoothing iteration...... 101

xii 5.19 [CPR] CPR ILU(0) vs. ILU(0) for BiCGSTAB ...... 103 5.20 [CPR] CPR ILU(0)vs.ILU(0)forGMRES ...... 103 5.21 [CPR] lgεrel forBiCGSTAB...... 103 5.22 [CPR] lgεrel forGMRES...... 103 5.23 [CPR] CPR ILUT(fixed p)...... 107 5.24 [CPR] CPR ILUT(fixed τ)...... 107

7.1 Thermalrecoveryclassification...... 113 7.2 Hotwaterflooding...... 114 7.3 Steamflooding...... 114 7.4 CSS...... 115 7.5 SAGD...... 115 7.6 In-situ combustion process...... 116

8.1 Chemicalreactions...... 123 8.2 Poreoccupancyvs.poresize...... 125 8.3 Two-phase relative permeabilities for an oil-water system...... 126 8.4 Two-phase relative permeabilities for an oil-gas system...... 126 8.5 Radialflow...... 134 8.6 Heatloss...... 139 8.7 Gridandcell...... 143 8.8 Adjacent cells and upstream weighting...... 144 8.9 Wellmodel...... 145 8.10Flowinwell...... 146 8.11 Reservoirmodel...... 147 8.12 Newton’s iteration...... 148 8.13 PatternofJacobianmatrix...... 149 8.14 Simulatorflow...... 152

9.1 [Dry tube] T at0.001days...... 163 9.2 [Dry tube] T at0.25days...... 163 9.3 [Dry tube] T at0.50days...... 163 9.4 [Dry tube] T at0.75days...... 163 9.5 [Dry tube] T at1.00day...... 163 9.6 [Dry tube] T at1.25days...... 163 9.7 [Dry tube] Pbh of injection well vs. t...... 164 9.8 [Dry tube] Cumulative oil production vs. t...... 164 9.9 [Dry tube] Po vs. t...... 165 9.10 [Dry tube] xo,LO vs. t...... 165 9.11 [Dry tube] xo,HO vs. t...... 166

9.12 [Dry tube] xg,O2 vs. t...... 166

9.13 [Dry tube] xg,COx/N2 vs. t...... 167 9.14 [Dry tube] Cc vs. t...... 167 9.15 [Dry tube] T vs. t...... 168 9.16 [Dry tube] Sw vs. t...... 168 9.17 [Dry tube] Sg vs. t...... 169

xiii 9.18 [Dry tube] Sw vs.distance...... 169 9.19 [Dry tube] So vs.distance...... 170 9.20 [Dry tube] Sg vs.distance...... 170 9.21 [Dry tube] Cc vs.distance...... 171

9.22 [Dry tube] T, xg,H2O, xg,LO, and xg,HO at 0.25 days vs. distance...... 172 9.23 [Dry tube] Occupation of pore volume at 0.25 days vs. distance...... 172

9.24 [Dry tube] T, xg,H2O, xg,LO, and xg,HO at 0.50 days vs. distance...... 173 9.25 [Dry tube] Occupation of pore volume at 0.50 days vs. distance...... 173

9.26 [Dry tube] T, xg,H2O, xg,LO, and xg,HO at 0.75 days vs. distance...... 174 9.27 [Dry tube] Occupation of pore volume at 0.75 days vs. distance...... 174 9.28 [Wet tube] T at0.001days...... 178 9.29 [Wet tube] T at0.25days...... 178 9.30 [Wet tube] T at0.50days...... 178 9.31 [Wet tube] T at0.75days...... 178 9.32 [Wet tube] T at1.00day...... 178 9.33 [Wet tube] T at1.25days...... 178 9.34 [Wet tube] T vs. t ofCell5...... 179 9.35 [Wet tube] T vs.distanceat0.50days...... 179 9.36 [Wet tube] Sw vs.distanceat0.50days...... 180 9.37 [Wet tube] So vs.distanceat0.50days...... 180 9.38 [Wet tube] Sg vs.distanceat0.50days...... 181 9.39 [Wet tube] Cc vs.distanceat0.50days...... 181 4 9.40 [Wet tube] Occupation of pore volume of 2.5 10− water at 0.50 days vs. distance.182 × 4 9.41 [Wet tube] Occupation of pore volume of 5.0 10− water at 0.50 days vs. distance.182 × 4 9.42 [Wet tube] Occupation of pore volume of 7.5 10− water at 0.50 days vs. distance.183 × 3 9.43 [Wet tube] Occupation of pore volume of 1.0 10− water at 0.50 days vs. distance.183 9.44 [Full-scale] T at0.001days...... 193× 9.45 [Full-scale] T at30days...... 193 9.46 [Full-scale] T at60days...... 193 9.47 [Full-scale] T at90days...... 193 9.48 [Full-scale] T at120days...... 193 9.49 [Full-scale] T at150days...... 193 9.50 [Full-scale] T vs.t...... 194 9.51 [Full-scale] T vs.distance...... 194 9.52 [Full-scale] Sw vs.distance...... 195 9.53 [Full-scale] So vs.distance...... 195 9.54 [Full-scale] Sg vs.distance...... 196 9.55 [Full-scale] Cc vs.distance...... 196 9.56 [Full-scale] Occupation of pore volume without heat loss at 90 days vs. distance. . 197 9.57 [Full-scale] Occupation of pore volume with heat loss at 90 days vs. distance. . . . 197 9.58 [2-D 1] T at0.001days...... 200 9.59 [2-D 1] T at60days...... 200 9.60 [2-D 1] T at120days...... 200 9.61 [2-D 1] T at180days...... 200 9.62 [2-D 1] T at240days...... 200

xiv 9.63 [2-D 1] T at300days...... 200 9.64 [2-D 1] T vs.toftoplayer...... 201 9.65 [2-D 1] T vs.tofmiddlelayer...... 201 9.66 [2-D 1] T vs.tofbottomlayer...... 202 9.67 [2-D 1] So vs.toftoplayer...... 202 9.68 [2-D 1] So vs.tofmiddlelayer...... 203 9.69 [2-D 1] So vs.tofbottomlayer...... 203 9.70 [2-D 2] T at0.001days...... 204 9.71 [2-D 2] T at60days...... 204 9.72 [2-D 2] T at120days...... 204 9.73 [2-D 2] T at180days...... 204 9.74 [2-D 2] T at240days...... 204 9.75 [2-D 2] T at300days...... 204 9.76 [2-D 2] T vs.toftoplayer...... 205 9.77 [2-D 2] T vs.tofmiddlelayer...... 205 9.78 [2-D 2] T vs.tofbottomlayer...... 206 9.79 [2-D 2] So vs.toftoplayer...... 206 9.80 [2-D 2] So vs.tofmiddlelayer...... 207 9.81 [2-D 2] So vs.tofbottomlayer...... 207 9.82 [3-D] T at0.001days...... 209 9.83 [3-D] T at60days...... 209 9.84 [3-D] T at120days...... 209 9.85 [3-D] T at180days...... 209 9.86 [3-D] T at240days...... 209 9.87 [3-D] T at300days...... 209 9.88 [3-D] Slice view of T at0.001days...... 210 9.89 [3-D] Slice view of T at60days...... 210 9.90 [3-D] Slice view of T at120days...... 210 9.91 [3-D] Slice view of T at180days...... 210 9.92 [3-D] Slice view of T at240days...... 210 9.93 [3-D] Slice view of T at300days...... 210 9.94 [3-D] T at60daysoftoplayer...... 211 9.95 [3-D] T at60daysofmiddlelayer...... 211 9.96 [3-D] T at60daysofbottomlayer...... 212 9.97 [3-D] So at60daysoftoplayer...... 212 9.98 [3-D] So at60daysofmiddlelayer...... 213 9.99 [3-D] So at60daysofbottomlayer...... 213 9.100[3-D] T at120daysoftoplayer...... 214 9.101[3-D] T at120daysofmiddlelayer...... 214 9.102[3-D] T at120daysofbottomlayer...... 215 9.103[3-D] So at120daysoftoplayer...... 215 9.104[3-D] So at120daysofmiddlelayer...... 216 9.105[3-D] So at120daysofbottomlayer...... 216

List of Symbols, Abbreviations and Nomenclature

()avg,j average between the j-th adjacent cell and the current cell

α phase: w, o, or g

∆t time step

η desired unknown change

ρ̂α (mass) density of phase α in ISC

ρ̂well average (mass) density of flow in a well

ĝ gravitational constant

κ thermal conductivity

λ thermal diffusivity

A j pre-exponential factor of chemical reaction j

K*l∼g,i pseudo K value of component i between phase l = w, o and the gas phase

Km search subspace of dimension m

Kl∼g,i K value of component i between phase l = w, o and the gas phase

Lm left subspace of dimension m

R universal gas constant

Z average compressibility for gas phase

µα viscosity of phase α

∇· divergence operator, (∂/∂x, ∂/∂y, ∂/∂z) ·

∇ gradient operator, (∂/∂x)~i + (∂/∂y)~j + (∂/∂z)~k

ω damping factor between 0.0 and 1.0

Φ fluid potential

φ porosity

ρc molar density of coke

ρα density of phase α in black oil; molar density of phase α in ISC

ρG∗s density of free gas at standard conditions

ρGo density of gas component in oil phase

ρGs density of solution gas at standard conditions

ρOo density of oil component in oil phase

ρOs density of oil phase at standard conditions

ρWs density of water phase at standard conditions

~b vector of right-hand side

~F equation vector of nonlinear system

U~ unknown vector of nonlinear system

~x vector of unknowns

A⁻¹ inverse of matrix A

Aᵀ transpose of matrix A

Aij a block-wise element of matrix A

aij an element at row i and column j of matrix A

Bα formation volume factor of phase α

Cc coke concentration

Cp isobaric heat capacity

Cv volumetric heat capacity

COx CO and CO2 components

Coke coke component

E j activation energy of chemical reaction j

fg,i fugacity of component i in gas phase

fl,i fugacity of component i in liquid phase l = w,o

H2O water component

xvii Hα enthalpy of phase α

Hr,j reaction enthalpy of chemical reaction j

HO heavy oil component

I unit matrix

K tensor of absolute permeability

Krα relative permeability of phase α

LO light oil component

M a preconditioning matrix or a preconditioner

Mα average molecular weight of phase α

Mi molecular weight of component i

MSCF 1000 ft3

Np number of perforations

Nα number of components in phase α

ncell number of cells in a reservoir model

Nr number of chemical reactions

nwell number of wells in a reservoir model

NZ′(A) filled-in pattern of matrix A

NZ(A) nonzero pattern of matrix A

O2 oxygen component

Pα pressure of phase α

Pb oil bubble point pressure

Pc critical pressure

Pbh bottom hole pressure

Pcgo capillary pressure between gas phase and oil phase

Pcow capillary pressure between oil phase and water phase

Qα flow rate of phase α in well

xviii qi well term

qHr reaction enthalpy

qH enthalpy through a well

qloss heat loss

ri chemical reaction term

R j reaction rate of chemical reaction j

Rso gas solubility

s1,...,s12 stoichiometric coefficient

Sα saturation of phase α

Sc coke saturation

Sl saturation of liquid

span{} linear span

T reservoir temperature

t time

T^o, P^o reference pressure and temperature (77 °F, 14.7 psi)

Tc critical temperature

Uα internal energy of phase α

uα Darcy velocity of phase α

Uc internal energy of coke

Ur internal energy of rock

V control volume

Vg volume of gas phase

Vo volume of oil phase

Vw volume of water phase

VG∗s volume of free gas at standard conditions

VGs volume of solution gas at standard conditions

xix VOs volume of oil phase at standard conditions

VWs volume of water phase at standard conditions

W, O, G components of water, oil, and gas

w, o, g phases of water, oil, and gas

WI well index

xα,i mole fraction of component i in phase α

Z depth

Abbreviations

Env 1 workstation environment with K20X GPU

Env 2 workstation environment with V100-PCIe-16GB GPU

Poisson 3-D Poisson matrices used for testing

SPE10 Jacobian matrices of SPE10 model 2 used for testing

speedup acceleration ratio of GPU algorithm to serial CPU algorithm

AMG algebraic multigrid

API application programming interface

API gravity American petroleum institute gravity

BCSR block-wise CSR

BiCGSTAB biconjugate gradient stabilized

BILU block-wise ILU

CG conjugate gradient

CGCR constrained generalized conjugate residual

CMG computer modelling group ltd

COO coordinate format

CPR constrained pressure residual

CPR ILU(k) CPR, with ILU(k) as the ILU part

CPR ILUT CPR, with ILUT as the ILU part

xx CPU central processing unit

CSC compressed sparse column format

CSR compressed sparse row format

CSS cyclic steam stimulation

CUDA compute unified device architecture

ELL Ellpack-Itpack format

ELLPACK a high-level system for solving elliptic boundary value problems

EOR enhanced oil recovery

EOS equation of state

FASP fast auxiliary space preconditioning

FDM finite difference method

FEM finite element method

FMG full multigrid

FORTRAN a programming language for scientific computing

FVF formation volume factor

FVM finite volume method

GB 10^9 bytes

GCR generalized conjugate residual

GFLOPS 10^9 floating-point operations per second

GMG geometric multigrid

GMRES generalized minimal residual

GOR gas oil ratio

GPU graphics processing unit

HEC hybrid of ELL and CSR format

HPC high-performance computing

HYB hybrid of ELL and COO format

xxi ICSR data structure specially designed for storing a nonzero pattern

ILU incomplete LU factorization

ILU(k) ILU with level of fill

ILU(0) ILU with level of fill, k = 0

ILUT ILU with threshold strategy

ILUT(p,τ) ILU with parameters p and τ

IMPES implicit pressure explicit saturation

ISC in-situ combustion or the in-situ combustion simulator

METIS a software package for graph partitioning

MG multigrid

MPI message passing interface

MSR modified sparse row format

NVIDIA an American technology company designing GPUs

OPEC organization of the petroleum exporting countries

OpenMP open multi-processing

ORTHOMIN orthogonal minimum residual

PAMG parallel algebraic multigrid

PDE partial differential equation

PER pseudo equilibrium ratio

QMR quasi minimal residual

RAS restricted additive Schwarz

RS Ruge-Stüben coarsening strategy

RSSTD standard RS prolongation strategy

SAGD steam-assisted gravity drainage

SM streaming multiprocessor

SOR successive over-relaxation

xxii SP streaming processor

SPD symmetric positive definite

SPE society of petroleum engineers

SPE10 the tenth SPE comparative solution project

SpMV -vector multiplication

SSOR symmetric successive over-relaxation

STARS CMG advanced processes & thermal reservoir simulator

TFQMR transpose-free QMR

UML unified modeling language

VS variable substitution

Chapter 1

RESERVOIR SIMULATOR OVERVIEW

This chapter gives an overview of reservoir simulators related to the research topics involved in the thesis. A brief introduction is given to the formation, products, and recovery processes of petroleum. Then an overview of the stages of reservoir numerical simulation is presented, including the mathematical model, discretization, and numerical solution. The extensive application of reservoir simulation in the petroleum industry promotes research on reservoir simulators. Some open-source and commercial simulators are collected. The black oil approach and the compositional approach are introduced for the classification of simulators from the perspective of fluid composition and property calculations. At the end of this chapter, the layout of the thesis is presented.

1.1 Petroleum

Petroleum plays an indispensable role in modern industry and people's daily lives. The products from crude oil are used for transportation, power generation, and a variety of daily chemical products. Figure 1.1 presents a pie chart of petroleum products based on data from the U.S. Energy Information Administration. Table 1.1 shows the total petroleum and other liquids production per day of the world's leading oil-producing countries in 2018.

It is now widely believed that petroleum is formed naturally from the debris of animals and plants over a long period in sedimentary beds. Generally speaking, a petroleum reservoir is a hydrocarbon deposit located underground in a porous medium or a fractured rock formation. These deposits are located by petroleum geologists using hydrocarbon exploration methods.

Figure 1.1: A breakdown of the products made from a typical barrel of US oil (adapted from [1]): gasoline 46%, diesel and other fuel 26%, other products 11%, jet fuel 9%, heavy fuel oil 4%, asphalt 3%, lubricants 1%.

The methods of oil extraction are usually divided into primary recovery, secondary recovery, and enhanced recovery; see Figure 1.2. The primary recovery stage uses natural energy to drive oil and gas from a reservoir through a well to the surface. The recovery factor (the recoverable fraction of the hydrocarbons initially in place) in the primary recovery stage is typically 5-15% [3]. As primary recovery proceeds, the reservoir pressure decreases gradually. When the natural reservoir drive no longer provides enough energy to produce oil and gas, a secondary recovery method is applied, in which external energy is used to increase the reservoir pressure and thus promote oil recovery. The most commonly used secondary recovery method is water flooding. Normally, secondary recovery can increase the recovery factor to 35-45% [3]. Enhanced recovery, known as tertiary oil recovery, improves oil recovery by reducing the viscosity of the oil or increasing its fluidity. For example, thermal recovery is a typical enhanced recovery method: the temperature in a reservoir is increased by injecting steam or hot water, thereby enhancing the fluidity of the oil. Enhanced recovery usually adds an additional 5-15% to the recovery factor [3].

1.2 Reservoir Numerical Simulation

Reservoir numerical simulation is a process that uses a computer to solve a reservoir mathematical model, simulate the underground oil-water-gas flow, and give the oil-water-gas distributions at a specific time. From a commercial point of view, the reason for using reservoir simulation is its ability to predict oil production and cash flow. In reservoir research, the primary purpose of numerical simulation is to predict the oil and gas production under different operating conditions and to figure out an optimal production scheme.

Table 1.1: Total petroleum and other liquids production of major oil-producing countries (from [2]).

  Rank  Country                Production (1000 bbl/day, 2018)
  1     United States          17936
  2     Saudi Arabia           12419
  3     Russia                 11401
  4     Canada                 5328
  5     China                  4810
  6     Iraq                   4616
  7     Iran                   4456
  8     United Arab Emirates   3791
  9     Brazil                 3428
  10    Kuwait                 2908

Figure 1.3 describes the six phases involved in a typical reservoir simulation case from modeling to solving. Based on the research of a physical model and related parameters, a corresponding mathematical model can be established. The mathematical model is usually composed of mass conservation equations, Darcy's law, well equations, and constraints. For non-isothermal models, such as thermal recovery, the energy conservation equation should be included. For polymer drives, a model should also include the relevant mass conservation of polymer. A mathematical model must be a complete, solvable mathematical system. The accuracy of a model in describing physical problems determines whether a numerical simulation is practical or not.
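As a hedged illustration of how these pieces fit together (the full black oil equations are given in §3.2.1; the symbols follow the nomenclature list), combining mass conservation with Darcy's law for, e.g., the water phase of the black oil model gives an equation of the form

\[
\frac{\partial}{\partial t}\!\left(\frac{\phi S_w}{B_w}\right)
= \nabla \cdot \left(\frac{K K_{rw}}{\mu_w B_w}\left(\nabla P_w - \rho_w \hat{g}\,\nabla Z\right)\right) + q_w ,
\]

with analogous equations for the oil and gas components (the gas equation also carries a solution-gas term), plus well equations and the saturation and capillary pressure constraints that close the system.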

Figure 1.2: Recovery process (adapted from [4]). Primary recovery: gas cap expansion, solution gas drive, rock expansion, water drive, gravity drainage. Secondary recovery: water flood, pressure maintenance. Enhanced recovery: miscible (hydrocarbon flood, CO2 flood, alcohol flood, enriched gas drive, vaporizing gas drive), thermal (steam injection, in-situ combustion, wellbore heating, hot water injection), chemical (alkaline, surfactant, polymer, foam).

Because the variables describing a fluid usually change with time and space, the mass conservation and energy conservation are generally described by PDEs (partial differential equations) combined with specific boundary and initial conditions. Moreover, these PDEs cannot be solved analytically. Therefore, a numerical solution by discretization has become an alternative way to solve the PDEs. The commonly used discrete methods are the FDM (finite difference method), FVM (finite volume method), and FEM (finite element method). For structured meshes, the FDM is a common choice. It can be easily programmed with high efficiency. However, for reservoirs with complicated geometric features, the use of unstructured meshes is inevitable. The FEM and FVM are the necessary choices. Figure 1.4 displays a sample reservoir model established by a simulator of CMG (Computer Modelling Group Ltd.).

Figure 1.3: General stages of reservoir simulation (from [5]): physical process, nonlinear PDEs, discretization (FDM, FVM, FEM), nonlinear algebraic system, linear algebraic system, and numerical solution.

After the PDEs are discretized, the original mathematical system is transformed into a nonlinear algebraic system that can be solved by an explicit method or an implicit method. As long as the time step in an explicit method is smaller than a specific restriction on its size, the solution can converge. However, because the time step required by this convergence condition is usually small, the solution process would be very long. An implicit method usually has no restriction on the time step. Moreover, it is generally more stable. Therefore, the implicit method is widely used in contemporary reservoir simulation.

The Newton-Raphson method is used to solve the nonlinear system in the implicit method. It finds successive approximations to the root of a nonlinear system by solving a series of Jacobian systems. Indeed, a conventional Jacobian system, as a linear system, can be solved by direct methods. However, for large and complex models, the Jacobian systems are always large-scale and sparse, and they cannot be solved effectively by a conventional direct method. Linear iterative solvers become the key to solving this kind of linear system. Generally, a complicated simulation case takes several hours, or even days, to run, and most of the time is spent on solving the Jacobian systems. For instance, over 70% of the time is occupied by the solution processes of linear systems for a black oil model [7]. So it is necessary to research and develop more effective linear iterative solvers for next-generation simulators. Because of the importance of the solution of linear systems, we extract linear solvers as a separate part to be discussed as a particular subject and call the rest the reservoir simulator part in this thesis. Figure 1.5 gives a brief structure of such a division.
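To make the role of the Jacobian systems concrete, a generic statement of the iteration (using the notation of the nomenclature list; the detailed form used in this work appears in §8.3.4) is

\[
J\!\left(\vec{U}^{(k)}\right)\delta\vec{U} = -\vec{F}\!\left(\vec{U}^{(k)}\right),
\qquad
\vec{U}^{(k+1)} = \vec{U}^{(k)} + \omega\,\delta\vec{U},
\]

where \(\vec{F}\) is the equation vector of the nonlinear system, \(\vec{U}\) is the unknown vector, \(J = \partial\vec{F}/\partial\vec{U}\) is the Jacobian matrix, and \(\omega\) is a damping factor. Each Newton step therefore requires the solution of one large sparse linear system, which is exactly the task handed to the linear solver.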

Figure 1.4: A sample reservoir model (from [6]).

Figure 1.5: Reservoir simulator and linear solver (reservoir simulator, Jacobian matrix, linear solver).

1.3 Reservoir Simulators

A reservoir simulator is a piece of highly professional software. Based on different reservoir models or mathematical models, there are different reservoir simulator components. Reservoir simulator development requires knowledge of reservoir engineering, computer science, and computational mathematics. How to develop a more extensive, more applicable, and more effective reservoir simulator is also a challenging task for researchers. Table 1.2 summarizes some of the existing open-source reservoir simulators. Table 1.3 summarizes some of the existing commercial reservoir simulators on the market.

From the view of fluid composition and property calculations, reservoir simulators can be divided into two categories: the black-oil approach and the compositional approach. Table 1.4 collects several commonly used reservoir simulator types.

Table 1.2: Open source reservoir simulators.
  U.S. Department of Energy [8], BOAST: black oil, IMPES, for educational purposes.
  SINTEF, MRST [9]: MATLAB Reservoir Simulation Toolbox, unstructured grids, large and complex models.
  Open Porous Media initiative [10], OPM: CO2 sequestration, enhanced oil recovery.

Table 1.3: Commercial reservoir simulators.
  Schlumberger, INTERSECT [11]: high-resolution modeling, chemical enhanced oil recovery, thermal enhanced oil recovery.
  Schlumberger, ECLIPSE [12]: black oil, compositional, streamline, local grid refinements.
  Stone Ridge Technology, ECHELON [13]: black oil, fully implicit, GPU accelerated.
  Rock Flow Dynamics [14], tNavigator: black oil, compositional, thermal compositional, high-performance computing.
  Computer Modelling Group [15], CMG Suite: black oil (IMEX), compositional/unconventional (GEM), thermal processes (STARS).
  Coats Engineering [16], Sensor: black oil, compositional.

Table 1.4: Reservoir simulator general classification.
  Black oil approach, black oil simulator: three phases (oil, gas, and water), three components (oil, gas, and water), based on FVF, not sensitive to composition changes.
  Black oil approach, geomechanical simulator: stress pattern in a reservoir, solid mechanics stress equations.
  Compositional approach, compositional simulator: multi-component and multi-phase, based on EOS modelling, sensitive to composition changes, volatile oil, CO2 and N2 injection.
  Compositional approach, thermal simulator: hot water floods, steam floods/CSS/SAGD.
  Compositional approach, in-situ combustion simulator: air/O2/steam injection, chemical reactions.
  Compositional approach, chemical flooding simulator: polymer and surfactant floods, chemical mass conservation equations.
  Compositional approach, geochemical flooding simulator: minerals in the formation, formation water geochemistry.

Hydrocarbons are a multi-component mixture. They are divided into an oil component and a gas component in the black oil model. The volume variation or density variation of the fluid phases between a reservoir and the surface is reflected in the FVF (formation volume factor). A black oil simulator is suitable for situations where the recovery processes are not sensitive to the fluid components, such as primary recovery, gravity drainage, gas cap expansion, solution gas drive, water injection, and gas injection, which involve no mass transfer. The black-oil simulator is considered a popular and universal simulator in reservoir simulation research and industry. A geomechanical simulator is usually developed based on the black oil approach.

A compositional simulator provides more refined simulations of hydrocarbon components and other fluid components. Calculations of the density and state of a fluid phase depend on the EOS (equation of state) modeling. A compositional simulator is used for cases where the recovery processes are sensitive to compositional changes in fluids, such as the primary depletion of volatile oil, gas condensate reservoirs, miscible gas injection, CO2 injection, and N2 injection. The thermal recovery mechanism is generally used in heavy oil reservoirs: steam or hot water is injected into a reservoir as a driving fluid to reduce the viscosity of the heavy oil. A thermal simulator is usually built on the compositional approach. The convection-diffusion equation of heat is added to the model to account for heat transfer and convection during the recovery process. An ISC simulator is usually regarded as a further development of a thermal recovery simulator. However, the ISC simulator has its distinctive characteristics. A combustion process creates an ultra-high temperature region at the combustion front. The chemical reactions lead to the transformation of specific components. The drive process is more of a gas-phase-dominated process, with phase state changes. All these problems necessitate significant research for the construction and solution of an ISC model in the ISC simulator. A chemical flooding simulator is used for simulating the process of polymer and surfactant floods. It is also built on the compositional approach but requires additional equations to describe and track the injected chemicals. A geochemical reservoir simulator is suitable for cases where minerals in a reservoir participate in a recovery process. The formation water geochemistry is also affected by CO2 injection.

1.4 Layout

This chapter has given an overview of reservoir simulators. The research in the thesis includes two separate topics related to the reservoir simulator technology. The first one is the development of a new GPU-based parallel solution platform of preconditioned iterative solvers for large-scale heterogeneous black oil simulation. The other one is the development of a new in-situ combustion simulator with integrated functions based on the PER method.

For the two topics, their theories, implementation techniques, numerical experiments, discussions, analyses, and conclusions are detailed in the following chapters individually. The layout of the thesis is given below, where "(1)" and "(2)" in the chapter titles are used to differentiate the repeated chapter titles between the two topics.

In §2, this chapter introduces the concepts of HPC (high-performance computing) and preconditioned iterative solvers on GPU. The literature review and research objectives of the first topic are presented.

In §3, this chapter explains the GPU architecture, GPU parallel techniques, CUDA programming, the black oil model, and the SPE10 model 2, because the work belongs to the cross research of the GPU technology, the large-scale heterogeneous black oil model, and the parallel preconditioned solvers.

In §4, this chapter introduces the source and storage of sparse matrices first. Then it details iterative solvers, preconditioners, and parallel implementation techniques.

In §5, this chapter presents the numerical experiments and the performance comparison and analysis of different preconditioned solvers performed on GPUs.

In §6, this chapter gives the research conclusions and future work of the first topic.

In §7, this chapter introduces thermal recovery and the ISC process. Research on related simulators is reviewed. The research objectives of the second topic are detailed.

In §8, this chapter presents the model description, mathematical system, and numerical methods of the ISC simulator.

In §9, this chapter gives a comprehensive set of experiments verifying the results of the ISC simulator against the benchmark simulator CMG STARS.

In §10, this chapter summarizes the research conclusions and future work of the second topic.

Chapter 2

INTRODUCTION (1)

With the development of reservoir simulation towards sophisticated, fine, and large-scale models, it has become an increasingly critical research task to improve the linear solution speed of reservoir simulators. This chapter first describes the evolution of computer architecture, including the CPU, multi-core CPU, GPU, cluster, and hybrid cluster, and presents the concept of HPC (high-performance computing). As the GPU provides a promising parallel computing solution for personal computers and workstations, linear solvers based on GPU technology have become a hot-spot research area for next-generation reservoir simulators. The chapter then introduces GPU performance and gives an overview of iterative solvers and preconditioners. This research aims to develop a new parallel solution platform based on the characteristics of GPU hardware. The literature review and research objectives are detailed in the last part.

2.1 High-Performance Computing

Scientific computing has been widely used in various industrial and research fields, such as macromolecular computing, geological mapping, oil exploitation, electronic design, climate modeling, media, and entertainment, to name a few. Modern reservoir models are developed in the direction of sophisticated, fine, and large-scale features. As demands grow, higher requirements are placed on computing performance, especially on computing speed.

Computer architecture, including hardware and software technologies, has evolved with time. Table 2.1 briefly summarizes the evolution of computer architecture.

Table 2.1: Evolution of computing architecture.
  Serial program: CPU, traditional programming.
  Parallel program in a single node: multi-core CPU (OpenMP); GPU (CUDA, ROCm).
  Parallel program on a cluster: cluster or hybrid cluster (MPI).

The CPU (central processing unit) is traditionally the core of computer computing. It is responsible for the execution of algorithmic instructions, logical control, and input-output operations. It is relatively simple to implement an algorithm on a CPU. Although serial programs have the advantages of being easy to design, implement, and port, they are far from satisfying the requirements of complex computing tasks, such as fine-scale or large-scale models.

Most modern CPUs are multi-core CPUs; that is, there are multiple processors on a single CPU. Due to the change of hardware platform and computing structure, an original serial program or algorithm cannot be directly transplanted to a new parallel platform. For this reason, the algorithm needs to be re-implemented with new programming techniques. OpenMP (open multi-processing) technology provides support for multi-threaded programming on multi-core CPUs. OpenMP is defined as an application programming interface (API) that supports shared-memory multi-processing programming [17]. OpenMP supports multiple platforms and can be used from languages such as C, C++, and FORTRAN.
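As a minimal, illustrative sketch (invented for this discussion, not taken from the thesis code), the following C++ fragment shows the typical OpenMP pattern: a single pragma asks the runtime to divide the loop iterations of a dot product among the CPU cores of a shared-memory node and to combine the partial sums.

    // Illustrative OpenMP example: the iterations are split across CPU
    // threads, and partial sums are combined by the reduction clause.
    #include <cstdio>
    #include <omp.h>

    int main() {
        const int n = 1000000;
        double *x = new double[n], *y = new double[n];
        for (int i = 0; i < n; ++i) { x[i] = 1.0; y[i] = 2.0; }

        double dot = 0.0;
        #pragma omp parallel for reduction(+:dot)
        for (int i = 0; i < n; ++i)
            dot += x[i] * y[i];

        printf("dot = %.1f using up to %d threads\n", dot, omp_get_max_threads());
        delete[] x;
        delete[] y;
        return 0;
    }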

The GPU (graphics processing unit), as a parallel computing device, can be installed in a node (computer) to undertake parallel computing tasks. The CPU sends the data and instructions that need to be executed in parallel to the GPU, and the results are sent back to the CPU after the GPU completes the parallel computations. NVIDIA has launched the CUDA architecture and language to support the development of parallel programs on its GPUs. With the support of multi-core CPUs and GPUs, the computing power of a single node or workstation is significantly enhanced. The advantages of easy installation, portability, and low cost enable the rapid development of this parallel technology.
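The host-device workflow just described can be sketched with a minimal CUDA example (illustrative only; the kernel and variable names are invented for this sketch): the CPU allocates device memory, copies the input over, launches a kernel, and copies the result back.

    // Minimal CUDA sketch of the CPU -> GPU -> CPU workflow.
    #include <cstdio>
    #include <cuda_runtime.h>

    // y = a*x + y, computed with one GPU thread per vector entry.
    __global__ void axpy(int n, double a, const double *x, double *y) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) y[i] = a * x[i] + y[i];
    }

    int main() {
        const int n = 1 << 20;
        const size_t bytes = n * sizeof(double);

        double *hx = new double[n], *hy = new double[n];
        for (int i = 0; i < n; ++i) { hx[i] = 1.0; hy[i] = 2.0; }

        // The CPU sends the data to GPU memory.
        double *dx, *dy;
        cudaMalloc((void **)&dx, bytes);
        cudaMalloc((void **)&dy, bytes);
        cudaMemcpy(dx, hx, bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(dy, hy, bytes, cudaMemcpyHostToDevice);

        // Kernel launch: enough 256-thread blocks to cover all n entries.
        int threads = 256, blocks = (n + threads - 1) / threads;
        axpy<<<blocks, threads>>>(n, 3.0, dx, dy);

        // The results are copied back to the CPU after the GPU finishes.
        cudaMemcpy(hy, dy, bytes, cudaMemcpyDeviceToHost);
        printf("y[0] = %.1f (expected 5.0)\n", hy[0]);

        cudaFree(dx); cudaFree(dy);
        delete[] hx; delete[] hy;
        return 0;
    }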

A cluster of computers can work as an alternative solution; it is a set of connected computers used to perform a common task. A cluster can consist of several nodes, for example, a small business cluster, or of thousands of computers, such as a supercomputer. In most cases, each node has the same hardware and operating system. The nodes of a supercomputer are usually connected through a fast local network. The Summit (OLCF-4) system, shown in Figure 2.1, developed by IBM and operated at Oak Ridge National Laboratory, was the world's fastest supercomputer as of November 2018.

Figure 2.1: Summit (OLCF-4).

HPC (high-performance computing) refers to the use of parallel computing techniques to solve complex computing problems. From a hardware view, HPC indicates computing systems and environments that usually use many processors as part of a single machine, or multiple computers organized in a cluster operating as a single computing resource. From a software view, parallel computing is the primary technology to implement HPC. Because traditional serial algorithms cannot be used on parallel hardware directly, the development of parallel algorithms suitable for an HPC platform is the main obstacle and challenge. Parallel computing needs to divide a massive problem into many small subproblems according to specific rules and compute them on multiple processors in one node or a cluster. The processing results of these small problems are then merged into the final results of the original problem. Because the computation of these small problems is accomplished in parallel, the processing time of the overall problem is reduced.

Different computing architectures are represented by different distributions of processors (cores) and memory. Figure 2.2 shows a traditional single-core CPU computer. Figure 2.3 presents a multi-core CPU computer with a shared-memory structure; the multiple CPU cores access and share the same memory, and the OpenMP technology is used for programming on such a shared-memory structure. Figure 2.4 describes a GPU computer. The GPU acts as a device that works under the instructions of the CPU and has its own memory, which communicates with the main memory. The CUDA technology is designed for programming on NVIDIA GPUs, and the ROCm technology is used for programming on AMD GPUs. Figure 2.5 shows a computer cluster, in which all nodes are connected through the network and form a scalable distributed-memory structure; communication between nodes can be accomplished with MPI technology. Besides enlarging the number of nodes, each node of a cluster can also be a multi-core CPU computer or even a GPU computer, presenting a hybrid structure that enhances the computing capability of each node; see Figures 2.6 and 2.7. Shared-memory systems and distributed-memory systems are the two major types of parallel architecture.

Figure 2.2: Single-core CPU computer. Figure 2.3: Multi-core CPU computer. Figure 2.4: GPU computer. Figure 2.5: Computer cluster. Figure 2.6: Hybrid cluster (1).

Modern reservoir simulations are concentrating on developing reservoir models towards complexity, refinement, and a large scale. A cluster or supercomputer has the advantages of extensibility and parallelism, suitable for super-large-scale computations. However, it is more convenient to use a personal computer or a workstation to speed up general large-scale models. In terms of hardware portability, software installation and deployment, and cost, a personal computer or a workstation has advantages over a cluster or a supercomputer. Because GPU technology can significantly improve the parallel computing ability of personal computers and workstations, the development of the next-generation reservoir simulator based on GPU has become a mainstream research direction.

2.2 Preconditioned Iterative Solvers on GPU

The main strength of the CPU is handling complex logic. Although the CPU can also provide a certain amount of parallel computing, this ability is limited. The CPU alone is no longer competent for heavy parallel computing, so the emergence and development of the GPU have become an inevitable trend. In terms of hardware platforms, GPUs are now widely used in personal computers, mobile phones, embedded systems, servers, and supercomputer systems.

Figure 2.7: Hybrid cluster (2).

It is necessary to note that the GPU is not used as a substitute for the CPU, but as an auxiliary computing device that deals with parallel tasks scheduled by the CPU. Usually, a GPU is installed on the motherboard, or it can be integrated into the CPU package. The CPU and GPU work together in a division of labor: management, control, and complex logical work are usually executed by the CPU, but if large amounts of parallel computing are encountered, such tasks are assigned by the CPU to the GPU for execution. Figure 2.8 shows an Intel Core i7-6950X processor, and Figure 2.9 shows an NVIDIA Tesla K40 GPU; both were advanced computing devices in the past few years.

Figure 2.8: Intel Core i7-6950X processor. Figure 2.9: NVIDIA Tesla K40 GPU.

GPU technology has undergone rapid development in the past decade. Figure 2.10 shows a comparison between the GPU and CPU in floating-point computation, and Figure 2.11 shows a comparison between the GPU and CPU in memory bandwidth. In both aspects, the performance and development speed of the GPU are much better than those of the CPU. For instance, Table 2.2 provides a performance comparison between the NVIDIA Tesla K40 GPU and the Intel Core i7-5960X CPU.

Figure 2.10: Floating-point operations per second for the CPU and GPU (theoretical GFLOP/s at base clock versus year, for Intel CPUs and NVIDIA GPUs in single and double precision); 1 GFLOPS = 10^9 floating-point calculations per second (adapted from [18]).

Table 2.2: Performance comparison between the GPU and CPU.
                                Memory speed    Double-precision performance
  GPU NVIDIA Tesla K40          288 GB/s        1660 GFLOPS
  CPU Intel Core i7-5960X       68 GB/s         384 GFLOPS

Because of the outstanding performance of the GPU in high-performance parallel computing and its high memory bandwidth, GPU computing is applied in more and more industrial fields and scientific disciplines.

Figure 2.11: Memory bandwidth for the CPU and GPU (theoretical peak GB/s versus year, for Intel CPUs and NVIDIA Tesla and GeForce GPUs; adapted from [18]).

For reservoir simulation models, the linear systems discretized from the PDEs usually have two distinct characteristics: they are large-scale and sparse. Figure 2.12 describes the non-zero patterns of coefficient matrices of linear systems discretized by a standard finite difference method. For such matrices, practical storage and solution are the key factors to consider. The storage space of a matrix can be reduced by exploiting the fact that the vast majority of its elements are zero. Because of the enormous cost of solving and of the parallel implementation of a direct method, iterative techniques have become the more practical choice for solving large-scale systems [19].
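As a small illustration of both points, the sketch below (hypothetical names, not the thesis code; the storage formats actually used are detailed in §4.1.3) stores a sparse matrix in the common CSR format, keeping only the nonzero values, and performs a sparse matrix-vector product on the GPU with one thread per row, the basic kernel inside every iterative solver discussed later.

    // CSR storage: row_ptr has n+1 entries; col_idx and val hold the nnz
    // nonzeros row by row, so zero entries occupy no memory at all.
    #include <cuda_runtime.h>

    struct CSRMatrix {
        int n;          // number of rows
        int *row_ptr;   // device array, size n + 1
        int *col_idx;   // device array, size nnz
        double *val;    // device array, size nnz
    };

    // y = A * x, one GPU thread per matrix row (scalar CSR SpMV).
    __global__ void spmv_csr(CSRMatrix A, const double *x, double *y) {
        int row = blockIdx.x * blockDim.x + threadIdx.x;
        if (row < A.n) {
            double sum = 0.0;
            for (int k = A.row_ptr[row]; k < A.row_ptr[row + 1]; ++k)
                sum += A.val[k] * x[A.col_idx[k]];
            y[row] = sum;
        }
    }

The other sparse formats listed in the nomenclature (COO, ELL, HYB, HEC, BCSR) are variations on the same idea with different memory layouts.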

Figure 2.12: Patterns of the coefficient matrix: (1) tri-diagonal for a 1-D problem; (2) penta-diagonal for a 2-D problem; (3) hepta-diagonal for a 3-D problem.

Krylov subspace methods are currently the most effective iterative techniques for solving large sparse linear systems [19]. These methods are based on a Krylov subspace projection technique to find an approximate solution to a problem. Based on this idea, a series of Krylov subspace methods has been developed. Table 2.3 lists some conventional Krylov subspace methods: CG (conjugate gradient), GMRES (generalized minimal residual), and BiCGSTAB (biconjugate gradient stabilized). The efficiency of the Krylov subspace methods generally decreases gradually as the mesh size of a system or the number of blocks in a discretization grid increases.
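For reference, the standard construction behind these methods (as in [19]) starts from an initial guess \(\vec{x}_0\) with residual \(\vec{r}_0 = \vec{b} - A\vec{x}_0\) and searches for an approximation in the affine space \(\vec{x}_0 + K_m\), where the search subspace of the nomenclature is the Krylov subspace

\[
K_m = \operatorname{span}\{\vec{r}_0,\, A\vec{r}_0,\, A^{2}\vec{r}_0,\, \ldots,\, A^{m-1}\vec{r}_0\},
\]

and each particular method (CG, GMRES, BiCGSTAB) differs in the condition it imposes on the residual, typically orthogonality to a left subspace \(L_m\).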

Multigrid methods are designed specially for solving elliptic PDEs, or Poisson-like systems [19]; see Table 2.3. Because they exploit the characteristics of the meshes or matrices, multigrid methods show excellent solution performance. Since a hierarchy of multiple grids is used, the loss of efficiency caused by the growth of the mesh size can, in theory, be avoided. Therefore, multigrid methods have superior scalability for the implementation of parallel algorithms. As the original multigrid method, GMG (geometric multigrid) requires the physical characteristics of a problem, especially the nodes of the underlying mesh. Different from GMG, AMG (algebraic multigrid) applies algebraic methods to generate the multi-layered grids; that is to say, it removes the dependence on a physical mesh and uses only the characteristics of the coefficient matrix to solve the problem. As an extension and generalization of GMG, AMG has a wider range of use than GMG and still maintains excellent performance benefits.
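The building block that all multigrid variants share is the two-grid correction, which can be written generically as follows (R and P denote restriction and prolongation operators between the fine and coarse levels; they are not part of the thesis nomenclature and are used here only for illustration):

\[
\vec{x} \leftarrow \vec{x} + P\,A_c^{-1}\,R\,(\vec{b} - A\vec{x}),
\qquad
A_c = R\,A\,P ,
\]

where a few smoothing sweeps (e.g., weighted Jacobi or Gauss-Seidel) are applied before and after the correction. Applying this step recursively on the coarse level yields the V-cycle discussed in §4.2.3.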

The lack of efficiency and robustness is a frequently encountered problem when the Krylov subspace iterative methods are applied to very large-scale sparse systems. The preconditioning technique can transform the original linear system into an equivalent system that is much easier to solve [19]. The transformation is applied by a preconditioner that reduces the condition number of the original system. A preconditioner is very specialized; that is to say, for different problems, the performance of the same preconditioner varies greatly. Researchers try to search for effective preconditioners based on the physical characteristics of a problem, the matrix features of a system, or in any other possible way. The search for and implementation of effective preconditioners are still considered a challenging activity, which is also a critical problem in the study of contemporary iterative methods.
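In the standard left-preconditioned form, for example, a preconditioning matrix M (the M of the nomenclature, chosen so that it approximates A but is cheap to invert) replaces the system \(A\vec{x} = \vec{b}\) by the equivalent system

\[
M^{-1}A\,\vec{x} = M^{-1}\vec{b},
\]

whose coefficient matrix \(M^{-1}A\) ideally has a much smaller condition number (a better-clustered spectrum) than A, so the Krylov iteration converges in far fewer steps.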

Table 2.4 collects some widely used preconditioners. ILU (incomplete LU factorization) is a

family of general-purpose preconditioners, which can be constructed by the process of incomplete

Gaussian elimination. According to the way the elements are filtered, ILU is usually divided into

ILU(k) and ILUT. ILU(0) is the most commonly used form of ILU(k). If the coefficient matrix of a system has block characteristics, the BILU (block-wise ILU) is another choice. In addition to the solution of elliptic PDEs, the AMG method is also used to solve the preconditioning matrices with symmetric positive definite features. For a reservoir simulation system, a Jacobian matrix block of the pressure variable is nearly symmetric positive definite so that this part can be solved by the

AMG method. Based on this idea, Wallis et al. proposed the CPR (constrained pressure residual) as a two-stage preconditioner using both AMG and ILU for improving the solution process of reservoir models [20]. RAS (restricted additive Schwarz) is a domain decomposition method. It can be considered a preconditioner to optimize the parallel structure of a preconditioning matrix.

Table 2.3: Linear solvers.
  Type        Name
  Krylov      CG
              GMRES
              BiCGSTAB
  Multigrid   GMG
              AMG

Table 2.4: Preconditioners.
  Type    Name
  ILU     ILU(k)
          ILUT
          Blockwise-ILU
  Other   RAS
          CPR
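To make the two-stage idea concrete, the sketch below outlines how a CPR-type preconditioner step is commonly organized: an AMG solve on the pressure sub-system followed by an ILU smoothing of the full coupled system. The routine names (restrict_pressure, amg_vcycle, prolong_pressure, full_residual, ilu_solve_add) and the data layout are hypothetical placeholders for illustration only, not the implementation developed in this thesis.

#include <stdlib.h>

/* Hypothetical helpers (placeholders, not the thesis code). */
void restrict_pressure(const double *r, double *rp, int n, int np);   /* pressure residual      */
void amg_vcycle(const double *rp, double *xp, int np);                /* AMG on pressure block  */
void prolong_pressure(const double *xp, double *z, int n, int np);    /* embed xp into z        */
void full_residual(const double *r, const double *z, double *s, int n); /* s = r - A z          */
void ilu_solve_add(const double *s, double *z, int n);                /* z = z + M_ILU^{-1} s   */

/* Two-stage CPR preconditioner step z = M_CPR^{-1} r. */
void cpr_apply(int n, int np, const double *r, double *z)
{
    double *rp = (double *)malloc(np * sizeof(double));
    double *xp = (double *)malloc(np * sizeof(double));
    double *s  = (double *)malloc(n  * sizeof(double));

    for (int i = 0; i < n; i++) z[i] = 0.0;

    restrict_pressure(r, rp, n, np);   /* pick out the pressure equations        */
    amg_vcycle(rp, xp, np);            /* stage 1: AMG on the pressure block     */
    prolong_pressure(xp, z, n, np);    /* pressure correction embedded into z    */

    full_residual(r, z, s, n);         /* updated residual of the coupled system */
    ilu_solve_add(s, z, n);            /* stage 2: ILU smoothing of full system  */

    free(rp); free(xp); free(s);
}

The two stages mirror the description above: AMG handles the nearly symmetric positive definite pressure part, and the ILU stage treats the remaining coupled unknowns.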

2.3 Literature Review

Barrett et al. summarized the commonly used iterative methods and preconditioners and wrote them into language-independent templates, which allow developers to implement the algorithms on their target devices more easily without having to master the principles of each specific iteration [21].

Saad described the latest theory and practices for iterative methods and preconditioners in his

book [19]. As the most functional group of techniques used in applications, the Krylov subspace methods, preconditioners, and domain decomposition have been described in detail, as well as parallel implementations.

Li and Saad presented an overview of preliminary experience in developing iterative linear solvers on GPU [22]. They observed that GPU provides a much lower performance advantage for irregular (sparse) computations than for regular (dense) computations. However, GPU can still be beneficial as a co-processor to CPU to speed up complex computations if used carefully.

Klie et al. proposed a multi-algorithm preconditioned solver on GPU, which was applied to the black oil and compositional models. The solver was designed to select an appropriate solver configuration at each time step according to the simulation situation. They proved that the GPU-based solver provided a feasible acceleration scheme for future research in reservoir simulation [23].

The FASP (fast auxiliary space preconditioning) method provides a general framework for constructing numerical solutions of PDEs. Hu et al. used the FASP method to design a preconditioner for reservoir simulation, in which they selected an appropriate auxiliary space for a Jacobian system derived from a fully implicit scheme coupled with implicit wells [24].

Wallis proposed the CGCR (constrained generalized conjugate residual) method for reservoir simulation, which is GCR (generalized conjugate residual) with a multi-step preconditioner. The

CGCR can also be viewed as a new variant of the Petrov-Galerkin-Krylov method combining the least-squares and Galerkin approaches [20].

Cao et al. developed a two-stage CPR (constrained pressure residual) preconditioner for large-scale parallel, structured, and unstructured linear systems of equations associated with reservoir simulation. They used their in-house PAMG (parallel algebraic multigrid) solver for the first stage and a parallel ILU-type scheme for the second [25].

Liu et al. developed a family of CPR-like preconditioners on the distributed-memory platform.

The preconditioners are used for highly heterogeneous reservoir simulations, including two new

three-stage preconditioners and one four-stage preconditioner [26].

With the increasing scale of a solution system, the traditional one-level methods reach their

limits. As a hierarchical multi-level method, AMG has become a feasible scheme for solving

elliptic partial differential equations. A considerable amount of work has focused on the introduction of the

concepts and strategies of the AMG method [27, 28, 29, 30, 31].

Stüben presented a review of algebraic multigrid, in which he noted that the main problem in designing efficient AMG algorithms is the trade-off between convergence and numerical work, and keeping the balance between the two should be the ultimate goal of any practical algorithm [32].

Cleary et al. studied the application scope and algorithmic scalability of the AMG method.

They also discussed some cases in which AMG does not work well and suggested possible remedies for these problems [33].

Henson et al. implemented a parallel AMG code with some approaches to parallelize the coarse-grid selection. They considered three basic coarsening schemes and certain modifications to the basic schemes for addressing specific performance issues [34].

Yang presented an overview of the parallel implementation of the AMG method, including the schemes of coarsening and smoothing. The paper also included references to existing software packages such as Hypre, LAMG, ML, and pARMS [35].

For the implementation of the Krylov subspace solvers and AMG solvers on GPUs, the ap- proach of matrix storage and the operation of SpMV (sparse matrix-vector multiplication) are important components. NVIDIA proposed a hybrid matrix format named HYB for GPU compu- tations [36, 37]. Saad et al. designed a JAD matrix and its SpMV algorithm for GPU computing

[22, 23].

Chen et al. completed some work of the SpMV, ILU, RAS, and AMG on GPU [38, 39, 40].

This thesis continues and strengthens the work. We have systematically investigated and developed a set of preconditioners and solvers. The ILU(k) and ILUT were further investigated. The BILU was developed and discussed. The implementation of AMG was advanced and analyzed. Some

work, which is presented in the thesis, was published; see [41, 42, 43].

2.4 Research Objectives

In order to provide an acceleration solution for large-scale heterogeneous black oil simulation in a single node or workstation, this research develops a new parallel solution platform based on the characteristics of GPU hardware. The platform needs to provide effective preconditioned iterative solvers specifically for the solution of a large-scale heterogeneous black oil model. Moreover, the platform must provide a comprehensive performance analysis and evaluation in comparison with general preconditioned solvers.

This work lies at the intersection of GPU technology, the parallel algorithm design of preconditioned solvers, and the feature analysis of a large-scale heterogeneous black oil model.

The research includes the development of a series of CPR preconditioned solvers for the black oil model, and also a variety of general preconditioned iterative solvers as a contrast.

The work of this research is listed below.

1. According to the features of GPU hardware, an appropriate matrix storage structure

is adopted, which can provide the advantage of saving storage space and efficient

access.

2. Using the batch access characteristics of GPU, the specific vector parallel operations and the SpMV need to be realized. These parallel operations serve as building blocks for the parallel development of linear solvers and preconditioners designed from their corresponding serial algorithms.

3. Based on the storage structure and the parallel operations, the general linear solvers and preconditioners are implemented on GPU. GMRES and BiCGSTAB are used for general matrices. AMG is specialized for Poisson-like matrices. The ILU preconditioners include ILU(k), ILUT, and BILU. For the domain decomposition, RAS is realized.

4. The CPR preconditioners, designed especially for a large-scale heterogeneous black oil model, are developed, including CPR ILU(k) and CPR ILUT.

5. The large-scale black oil model, SPE10 model 2, serves as a benchmark and is used to perform tests on the solution platform. The improvement in the parallel performance of preconditioned solvers based on GPU technology needs to be tested. The performance differences among different preconditioned solvers should be compared and analyzed. The feasibility and advantages of the CPR series preconditioners as efficient large-scale black oil simulation solvers on the GPU platform need to be verified.

Chapter 3

GPU AND BLACK OIL MODEL

The purpose of the first research is to develop a new parallel solution platform for large-scale heterogeneous black oil simulation based on the characteristics of GPU hardware. This work lies at the intersection of the large-scale heterogeneous black oil model, the GPU technology, the preconditioned solvers, and the parallel algorithm design. In this chapter, we introduce GPU, CUDA, the black oil model, and the SPE10 model 2. As a separate part, the solvers, preconditioners, and parallel algorithm design are presented in the next chapter. As HPC hardware, GPU is adopted in the development of this solution platform. The CUDA language is a particular programming API designed for NVIDIA GPUs. The black oil model is a classical model for reservoir simulation. The SPE10 model 2, an oil-water model based on the black oil approach, is selected as a benchmark for testing the developed solution platform in the research because it has a very fine grid and heterogeneous geological features.

3.1 GPU

3.1.1 GPU Architecture

The NVIDIA GPU is used for the development of the current research. GPU is the core component of a device or graphics card. A device cannot work alone, and it needs to work with a host on which CPU plays a leading role. Figure 3.1 shows a Host-Device architecture with CPU, GPU, and memory. CPU acts as a leader with a strong ability of scheduling, management, and coordination, while GPU is likened to an employee who accepts an assignment from CPU. GPU has the characteristics of multiple cores, fast memory access, and high floating-point calculation performance. It is appropriate for parallel algorithms that deal with a large amount of data. In this research, the tasks with complex logic, such as the main flow of an iterative algorithm and preconditioner assembly, are executed on CPU. In contrast, the tasks with strong parallelism, such as vector operations, matrix-vector multiplications, and the solution of a preconditioner system, are carried out by GPU.


Figure 3.1: CPU and GPU architecture (adapted from [44]).

An SP (streaming processor) is the primary processing unit, which is responsible for the exe- cution of instructions on GPU. The parallel computation on GPU is virtually handled by multiple

SPs simultaneously. An SP is also called a CUDA core. A large number of cores is the most typical feature of GPU. Table 3.1 gives the micro-architecture and the number of CUDA cores of the GPUs used in the research. A CUDA core is a lightweight processor that cannot be compared with a CPU core.

Table 3.1: Micro-architecture and CUDA cores.
  Model                       Micro-architecture   CUDA cores (total)
  K20X GPU Accelerator [45]   Kepler               2688
  V100-PCIe-16GB [46]         Volta                5120

An SM (streaming multiprocessor) is the part to perform actual parallel computations in GPU.

An SM is composed of a number of SPs, and it also has its control units, registers, execution pipelines, and caches. Figure 3.2 shows an architecture of the Fermi streaming multiprocessor.

Each Fermi SM features 32 CUDA cores, and each CUDA core has a fully pipelined integer ALU

(arithmetic logic unit) and FPU (floating-point unit) [47].

An SM organizes threads in groups of warps. A warp contains 32 parallel threads. Figure 3.3 shows a sample of Fermi SM warp scheduling. Each Fermi SM owns two warp schedulers and two

Figure 3.2: Fermi streaming multiprocessor (SM) (from [47]).

instruction dispatch units. So two warps can be performed concurrently. An instruction is executed twice on 16 cores to acquire a warp of 32 threads, with two active warps at the same time. Fermi SM can achieve nearly peak hardware performance by using this dual-issue model. Most instructions can be dual-issued, such as two integer instructions, two floating-point instructions, or a mixture, but double-precision instructions do not support dual-issue [47].

Figure 3.3: Fermi SM warp scheduling (from [47]).

3.1.2 CUDA

CUDA (compute unified device architecture) is defined as a parallel computing platform and API

(application programming interface) model by NVIDIA. CUDA enables developers to write pro-

grams, which are executed on NVIDIA GPUs, with C, C++, FORTRAN, OpenCL, DirectCompute,

and other languages [47]. Usually, a CUDA platform is considered a software layer that makes it

convenient to access the virtual instruction set and parallel computational elements of GPU pro-

gramming. Figure 3.4 describes the concepts of threads, blocks, and grids, which are used in

CUDA programming, with the corresponding memory spaces.

Multiple threads execute a parallel program. A thread maps to an SP/CUDA core to execute. A

thread is a concept of software, while an SP/CUDA core is a concept of the hardware. Each thread

is assigned a private memory space used for register spills, function calls, and C automatic array

variables [47]. The memory space has the lifetime of a thread and is accessible only by that thread. It

is composed of registers and local memory. The access speed of registers is the fastest. The local

memory is relatively slow and used for what does not fit in registers.

A block is composed of a set of concurrently executing threads that cooperate through synchronization and shared memory. A shared memory space has the lifetime of a block and is used for inter-thread communication, data sharing, and result sharing in parallel algorithms [47]. The shared memory can be accessed by any thread of the block for which the shared memory is created.

The access speed can be comparable to that of registers, provided there are no bank conflicts or all threads read from the same address.

Multiple blocks are organized into a grid. All the blocks in the same grid have the same number of threads. A grid is used for parallel computing that requires a number of thread blocks. Global memory space is allocated for a grid to read inputs and write results. The global memory can be accessed from either a host or a device. It has the lifetime of an application. The access speed of global memory is much slower than that of register or shared memory. The global memory is used for storing matrices and vectors in the research. It is better to read or write the global memory in as few transactions as possible by combining memory access requests. To achieve the global memory coalescing, we use a specific matrix storage approach to guarantee that the threads in a warp can follow the same execution path with the least branch divergence.

Besides the memories introduced above, GPU also provides constant memory and texture memory to optimize memory performance. A summary and relationship between the concepts of software and hardware are listed in Table 3.2.

Table 3.2: Relationship between concepts of software and hardware in GPU.
  Software     Hardware
  Grid         GPU/Device
  Block        SM
  32 threads   Warp
  Thread       SP/CUDA Core

A kernel is defined by the __global__ declaration specifier in CUDA languages. It provides the approach for the host to execute parallel tasks using threads on the device. Developers organize the threads by blocks and grids in a kernel. Any call to a kernel function must specify its execution configuration of the dimension of the grid and blocks. The execution configuration is set by inserting an expression of <<< ... >>> between the function name and the argument list. Table 3.3 collects the explanation of the configuration arguments. A call to a kernel is asynchronous. In other words, the kernel function returns before the device has completed its execution. For example, if a kernel is declared as

__global__ void func(float* parameter);

, it should be called by

func<<< Dg, Db >>>(parameter);

Figure 3.4: Threads, blocks, and grids, with memory spaces (from [47]).

Table 3.3: Execution configuration arguments for kernel.
  Name   Type           Brief
  Dg     dim3           Dg.x × Dg.y × Dg.z, dimension of the grid
  Db     dim3           Db.x × Db.y × Db.z, dimension of the blocks
  Ns     size_t         Dynamically allocated shared memory; optional
  S      cudaStream_t   Associated stream; optional

The dim3 is an integer vector struct of {x, y, z}. In the research, one-dimensional blocks are used. As threads are executed in warps, the block size represented by Db.x is defined as 256, a multiple of 32. A two-dimensional grid is configured for kernel calls. So each block in the grid has a unique block coordinate (blockIdx.x, blockIdx.y), and each thread in its block has a unique thread id, threadIdx.x. Each thread also is assigned a unique global thread id in the scope of the grid. The global thread id is defined as

Id = blockDim.x * (blockIdx.x + blockIdx.y * gridDim.x) + threadIdx.x,    (3.1)

where blockIdx, threadIdx, blockDim, and gridDim are built-in variables.
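As a minimal illustration of equation (3.1), the kernel sketch below computes the global thread id for a two-dimensional grid of one-dimensional blocks and uses it to guard an element-wise vector update; the kernel name and the axpy-style operation are chosen only for illustration.

// Illustrative kernel: each thread derives its global id with equation (3.1)
// and updates one vector entry. Threads whose id falls outside the vector
// length simply return, so the grid may be slightly larger than n.
__global__ void vec_axpy(int n, double alpha, const double *x, double *y)
{
    int id = blockDim.x * (blockIdx.x + blockIdx.y * gridDim.x) + threadIdx.x;
    if (id < n)
        y[id] = alpha * x[id] + y[id];
}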

Figure 3.5 gives a CUDA processing flow that is composed of five steps. The third step of the kernel launch is crucial. The GPU instantiates a kernel on a grid of blocks bound to the execution configuration. Each thread runs an instance of the kernel and is assigned a thread ID that can be accessed within the kernel through the threadIdx variable. Designing and writing an appropriate kernel determine the efficient solution to the problems studied.
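The five steps of this processing flow can be summarized by the host-side sketch below; it is a simplified example rather than the platform's actual driver code, and the kernel scale_kernel and block size 256 are assumptions for illustration.

#include <cuda_runtime.h>

__global__ void scale_kernel(int n, double a, double *x)
{
    int id = blockDim.x * (blockIdx.x + blockIdx.y * gridDim.x) + threadIdx.x;
    if (id < n) x[id] *= a;
}

void scale_on_gpu(int n, double a, double *h_x)
{
    double *d_x;
    size_t bytes = n * sizeof(double);

    cudaMalloc((void **)&d_x, bytes);                     /* 1. allocate memory on device      */
    cudaMemcpy(d_x, h_x, bytes, cudaMemcpyHostToDevice);  /* 2. copy data from host to device  */

    dim3 Db(256);                                         /* 3. execute kernel                 */
    dim3 Dg((n + Db.x - 1) / Db.x);
    scale_kernel<<<Dg, Db>>>(n, a, d_x);

    cudaMemcpy(h_x, d_x, bytes, cudaMemcpyDeviceToHost);  /* 4. copy data from device to host  */
    cudaFree(d_x);                                        /* 5. free memory on device          */
}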

3.2 Black Oil Model

3.2.1 Mathematical Model

The black oil model is a classical model for reservoir simulation. The reservoir is assumed to be isothermal. Table 3.4 gives the components, phases, and the relationships of density between

Figure 3.5: Processing flow on CUDA.

components and phases. Lower-case and upper-case letter subscripts denote the phases and com-

ponents, respectively. The subscript s indicates the standard conditions [4]. ρ, B, V, and Rso represent molar density, formation volume factor, volume and gas solubility, respectively. The black oil model assumes that the fluids are composed of three phases: water (aqueous phase), oil

(oleic phase), and gas (gaseous phase), and three components: water component, oil component, and gas component. The water component and oil component can only stay in the water phase and oil phase, respectively. No mass transfer exists between the water phase and the other two phases

(oil and gas). The gas component is allowed to exist in both the gas phase and the oil phase. If existing in the form of the gas phase, the gas component is called free gas. Alternatively, it is called solution gas in the oil phase. The gas solubility Rso is the standard condition volume of gas

dissolved in a unit volume of stock-tank oil at given reservoir pressure and temperature.

Table 3.4: Phases and components for black oil model.
  Component W (water), phase w: $\rho_w = \rho_{Ws}/B_w$
  Component O (oil), phase o: $\rho_{Oo} = \rho_{Os}/B_o$
  Component G (gas), phase o: $\rho_{Go} = R_{so}\,\rho_{Gs}/B_o$; phase g: $\rho_g = \rho_{Gs}/B_g$

with the gas solubility defined as $R_{so} = V_{Gs}/V_{Os}$.

The oil FVF (formation volume factor) Bo is defined as the ratio of the volume Vo of the oil phase at reservoir conditions to the volume VOs of the oil phase, in which only oil component is left, at standard conditions. Bw and Bg have similar definitions. FVF provides a link to the volume or density between the conditions of the reservoir and the surface. Figure 3.6 shows the structure of the black oil model under the two conditions. The superscript $*$ in $V_{G^*s}$ and $\rho_{G^*s}$ is used for distinguishing free gas from solution gas. Actually, $\rho_{G^*s}$ equals $\rho_{Gs}$.

Figure 3.6: Black oil model.

The equations of mass conservation, Darcy’s law, saturation, and capillary pressure are given,

respectively, as

\frac{\partial}{\partial t}(\phi S_w \rho_w) = -\nabla\cdot(u_w \rho_w) + q_w \rho_w
\frac{\partial}{\partial t}(\phi S_o \rho_{Oo}) = -\nabla\cdot(u_o \rho_{Oo}) + q_o \rho_{Oo}    (3.2)
\frac{\partial}{\partial t}\left[\phi(S_o \rho_{Go} + S_g \rho_g)\right] = -\nabla\cdot(u_o \rho_{Go} + u_g \rho_g) + q_o \rho_{Go} + q_g \rho_g

u_\alpha = -\frac{K K_{r\alpha}}{\mu_\alpha}\left(\nabla P_\alpha - \rho_\alpha \hat{g}\,\nabla Z\right), \quad \alpha = w, o, g,    (3.3)

\sum_{\alpha = w,o,g} S_\alpha = 1,    (3.4)

P_{cow} = P_o - P_w, \qquad P_{cgo} = P_g - P_o,    (3.5)

which form a closed system. P, u, φ, S, and K represent pressure, Darcy velocity, porosity, saturation and permeability, respectively. The black oil model has different states under different situations. If the three components all exist and Po is less than Pb (oil bubble point pressure) in a reservoir, the three phases coexist. The state is called the saturated state. The variables Po, Sw, and

Sg are chosen as unknowns. If the three components all exist and Po is greater than Pb, only the water phase and oil phase exist in the reservoir; i.e., there is no free gas. The state is said to be the undersaturated state. The density and viscosity of the oil phase depend on both Po and Pb at this state. The variables Po, Sw, and Pb are used as unknowns. If there is no gas component, the black oil model degrades into a two-phase (oil-water) model.

Generally, Po and Sw are selected as unknowns [4]. We will use the oil-water model from the SPE10 model 2 to generate the matrices for our solver tests.

It is often convenient for the conservation equations to work with “standard volumes”,

\frac{\partial}{\partial t}\left(\frac{\phi S_w}{B_w}\right) = -\nabla\cdot\frac{u_w}{B_w} + \frac{q_w}{B_w}
\frac{\partial}{\partial t}\left(\frac{\phi S_o}{B_o}\right) = -\nabla\cdot\frac{u_o}{B_o} + \frac{q_o}{B_o}    (3.6)
\frac{\partial}{\partial t}\left[\phi\left(\frac{S_o R_{so}}{B_o} + \frac{S_g}{B_g}\right)\right] = -\nabla\cdot\left(\frac{u_o R_{so}}{B_o} + \frac{u_g}{B_g}\right) + \frac{q_o R_{so}}{B_o} + \frac{q_g}{B_g},

rather than "mass" [4].

3.2.2 SPE10 Model 2

The SPE Comparative Solution Projects are a series of comparative solution projects organized by the Society of Petroleum Engineers. The purpose of the projects is to provide benchmark datasets used to compare the performance of different simulators or algorithms. The first to ninth comparative solution projects focus on black-oil, compositional, dual-porosity, thermal, and miscible models. The purpose of SPE10 (the tenth SPE comparative solution project) is to compare upgridding and upscaling approaches for two problems: a 2-D 2-phase (oil and gas) model that has a vertical cross-section of 2000 cells and a 3-D 2-phase (oil and water) model of 1.1 million cells.

The second model of SPE10 (SPE10 model 2) was chosen to be sufficiently detailed that it would be hard, though not impossible, to solve with classical methods [48].

As the finest grid and heterogeneous model among the SPE Comparative Solution Projects, the SPE10 model 2 is often selected as a benchmark to test the computing performance of an

HPC simulator or solver. The SPE10 model 2 was originally created for the PUNQ project [49].

It is composed of part of a Brent sequence. The top part of the model is a Tarbert formation representing a prograding nearshore environment. The lower part, Upper Ness, is fluvial. The model is described in a Cartesian grid. It has a dimension of 1200 × 2200 × 170 ft. The top 70 ft, 35 layers, represents the Tarbert formation, and the bottom 100 ft, 50 layers, represents Upper Ness. The fine-scale model has 60 × 220 × 85 cells (1.122 × 10^6 cells), where each cell has a dimension of 20 ft × 10 ft × 2 ft [48].

The Upper Ness contains channels that increase the difficulty of upscaling. About 2.5% of the cells are inactive with zero porosity. The porosity field and the permeability field have a strong correlation. Figure 3.7 shows the porosity for the whole SPE10 model 2. Figure 3.8 shows the porosity for Upper Ness. The permeability is a diagonal tensor with equal values in the horizontal directions: Kx = Ky. Both the Tarbert and the Upper Ness are characterized by large variations, 8-12 orders of magnitude. The model has a uniform Kv/Kh of 0.3 in the channels and a Kv/Kh of 10^{-3} in the background [48, 50]. Figure 3.9 shows the logarithm of horizontal permeability for Tarbert. Figure 3.10 shows the logarithm of horizontal permeability for Upper Ness.

A water drive recovery method is applied to the reservoir. The oil is produced from four vertical wells at the corners, each of which has a bottom hole pressure of 4000 psi, and water is injected at a constant rate of 5000 STB/day into a vertical well at the center of the reservoir [48]. Figure 3.11 presents the well configuration.

Figure 3.7: SPE10 model 2: φ for whole model (from [51]).

Figure 3.8: SPE10 model 2: φ for Upper Ness (from [51]).

Figure 3.9: SPE10 model 2: logarithm of Kh for Tarbert (from [50]).

Figure 3.10: SPE10 model 2: logarithm of Kh for Upper Ness (from [50]).

Figure 3.11: SPE10 model 2: well configuration (adapted from [50]).

Chapter 4

PRECONDITIONED ITERATIVE SOLVERS

The process of a reservoir simulation usually generates a series of Jacobian systems in Newton's iteration. When a geological model is heterogeneous and its grid has an enormous number of grid blocks, the Jacobian matrices are always large-scale, sparse, and hard to solve. Currently, the Krylov subspace iterative methods are adopted to solve the Jacobian systems. However, the efficiency of an iterative solver depends critically on the quality of the preconditioner applied. In this chapter, the different matrix storage formats are presented first. Then, the Krylov subspace methods and the AMG method are investigated; the preconditioners ILU(k), ILUT, and BILU are analyzed; and the CPR mechanism is studied. In the parallel implementation part, the parallel components, the

parallel triangular solver, the RAS, and the parallel algorithms of BiCGSTAB and GMRES are

introduced.

4.1 Sparse Matrices

4.1.1 Discretization

The FDM (finite difference method), FEM (finite element method), and FVM (finite volume

method) are widely used in the numerical calculation of partial differential equations. Table 4.1

lists some applications of these methods.

The FDM is the most direct method for the discretization of PDEs. It approximates PDEs

by difference equations. The FDM is well suited for structured meshes. For large-scale grids

and parallel computing, the FDM has excellent advantages in implementation and efficiency over

the FVM and FEM. Therefore, the FDM plays a dominant role in numerical solutions of PDEs

nowadays.

The FEM works as a general discrete method for PDEs. The first step of the FEM is to divide

Table 4.1: Application of FDM, FEM, and FVM (adapted from [52]).
  Method   Application
  FDM      Weather calculations, astrophysics, seismology, physical realism
           in computer graphics, and special effects
  FEM      All kinds of structural analysis, heat transfer, chemical engineering,
           electromagnetics, multi-physics, and computational fluid dynamics
  FVM      Heat transfer, chemical engineering, and computational fluid dynamics

a domain into small but finite-sized elements of simple shapes such as triangles in 2-D, or tetrahedrons in 3-D. The next step approximates the solution to the PDEs on each element by a simple function, such as a linear or quadratic polynomial. In this way, the PDE system in the whole field can be described by a simple set of linear or nonlinear equations. Because the accuracy of the approximation can be improved by increasing the order of the polynomials of the approximation on each element, the FEM is a very accurate discrete method. However, the implementation of the

FEM is usually more laborious than the FDM.

The FVM has excellent strength in dealing with conservation laws appearing in transport prob- lems. Like the FEM, the FVM also needs to divide a domain into many small, simple shapes.

Then flux conservation equations are applied to the volume of each small block. Because the flux entering a volume equals that leaving the adjacent volume, the FVM is conservative. It is typical to exploit upstream schemes in finite volume approximations [19]. The difficulty of implementing the FVM lies between that of the FDM and the FEM.

Based on the development of modern high-performance computing technology, the fully im- plicit scheme is widely used in the new generation simulators. The discretization approaches eventually convert an original PDE system into a nonlinear system, which can be solved by the

Newton-Raphson method. The Jacobian matrix of a linear system arising from the Newton-

Raphson method is usually large and sparse. Such a matrix is composed of a tiny portion of nonzero elements and a large number of zero elements. The pattern structure of the nonzero elements divides the sparse matrices into two general types: structured matrix and unstructured matrix. Figure 4.1 gives a rough principle for the use of discretization methods.


Figure 4.1: Sparse matrix pattern.

Whether the pattern of a sparse matrix is structured has little effect on the efficiency of the direct method, but it is vital for iterative methods [19]. The main reason is due to the extensive use of SpMV (sparse matrix-vector multiplication) in iterative methods. For high-performance computing or parallel computing, the SpMV of a regularized matrix is more efficient because its nonzero elements are mainly distributed on diagonals.

4.1.2 Permutation

Permuting or reordering rows and columns is considered to be one of the most effective operations in the implementation of parallel computing algorithms [19]. By permutation of rows or columns, the elements of a sparse matrix are concentrated more closely in the regions near the diagonal, benefiting the implementation of partitioning techniques in parallel computing. Let $\pi = \{i_1, i_2, \ldots, i_n\}$ represent a permutation of the set $\{1, 2, \ldots, n\}$. We define the permutation of matrix A as

A_{\pi,\star} = \{a_{\pi(i),j}\}, \quad i = 1,\ldots,n; \; j = 1,\ldots,n,

and

A_{\star,\pi} = \{a_{i,\pi(j)}\}, \quad i = 1,\ldots,n; \; j = 1,\ldots,n,

where A is a square matrix. Aπ,⋆ and A⋆,π are called row π-permutation and column π-permutation of A, respectively [19].

Based on the definition, the permutation matrices $P_\pi$ and $P_\pi^T$ are written as

P_\pi = I_{\pi,\star},    (4.1)

and

P_\pi^T = I_{\star,\pi}.    (4.2)

Thus, the row π-permutation and column π-permutation of A are expressed as

A_{\pi,\star} = P_\pi A,    (4.3)

and

A_{\star,\pi} = A P_\pi^T.    (4.4)

When the rows of A and the elements of the right-hand side $\vec{b}$ are permuted in the same way, the new system,

(P_\pi A)\vec{x} = P_\pi \vec{b},    (4.5)

is equivalent to the original system with the same solution. When the columns of A and the elements of the vector of unknowns, $\vec{x}$, are permuted in the same way, the new system,

(A P_\pi^T)(P_\pi \vec{x}) = \vec{b},    (4.6)

and the original system share the same solution. These two types of permutations are used to redistribute the pattern of A and keep the solution of the original system unchanged concurrently.

When the same π-permutation is performed for both rows and columns of A as

(P_\pi A P_\pi^T)(P_\pi \vec{x}) = P_\pi \vec{b},    (4.7)

the diagonal elements of the matrix remain on the diagonal even if not in their original positions.

The permutation, called a symmetric permutation, is very useful for solving discrete systems which are usually diagonally dominant.
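To illustrate how a symmetric permutation such as equation (4.7) is carried out on a sparse matrix in practice, the sketch below builds the permuted CSR arrays from the original ones. The CSR layout follows the description in Section 4.1.3; the function name and the convention perm[new_i] = old_i are assumptions made for this example only.

/* Apply a symmetric permutation to a CSR matrix: row i of the result is row
 * perm[i] of A, and every column index is renumbered through inv[old_j] = new_j.
 * The columns within a row are not re-sorted here; illustrative sketch only.   */
void csr_sym_permute(int n, const int *Ap, const int *Aj, const double *Ax,
                     const int *perm, const int *inv,
                     int *Bp, int *Bj, double *Bx)
{
    Bp[0] = 0;
    for (int i = 0; i < n; i++) {                 /* new row i <- old row perm[i] */
        int old = perm[i];
        int len = Ap[old + 1] - Ap[old];
        Bp[i + 1] = Bp[i] + len;
        for (int k = 0; k < len; k++) {
            Bj[Bp[i] + k] = inv[Aj[Ap[old] + k]]; /* renumber the column index    */
            Bx[Bp[i] + k] = Ax[Ap[old] + k];
        }
    }
}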

42 4.1.3 Storage

A matrix element carries the following information: value, row index, and column index. Because

a large-scale sparse matrix has a large number of zero elements, it is unwise to store all the values

and positions of each element. The general storage methods are to store the values and positions

of nonzero elements. Table 4.2 collects some basic sparse matrix storage formats that are COO

(coordinate format), ELL (Ellpack-Itpack format), and CSR (compressed sparse row format).

Table 4.2: Basic storage format of sparse matrices.
  Format   Information stored                   Space required
  COO      Nonzero values                       Large
           Row indices
           Column indices
  ELL      Nonzero values                       Large or medium
           Some zero values and row numbers
           Column indices
  CSR      Nonzero values                       Small
           Beginning position of each row
           Column indices

Figure 4.2 shows a sample pattern of a square matrix A with a size of 9. The pattern has 21

nonzero elements. The indices of rows and columns are numbered from 0. According to this

pattern, these basic formats are further explained below.

The COO format provides the easiest way for implementation, as shown in Figure 4.3. It

consists of three arrays, whose length is the number of nonzero elements. One array is used to store

nonzero values, and the other two store row indices and column indices, respectively. Because the

location of each element is recorded, the storage order of the elements can be in any order.

The ELL format is proposed in ELLPACK [53]. Figure 4.4 shows two rectangular arrays used for the ELL format. The nonzero values are stored in the first array. The column indices are stored

Figure 4.2: A sample pattern of matrix A.

Figure 4.3: COO format.

in the other one. The number of rows of the rectangular arrays is determined by the number of rows of matrix A. The number of columns of the rectangular array is determined by the maximum number of nonzero elements in all rows. For data structure design, a one-dimensional array can be used to store a rectangular array as long as the number of columns or rows is recorded. The blank positions in the column index array can be filled with any integer between 1 and n, such as row numbers. Saad suggested not to insert the same integers too often, e.g., a constant number, based on performance considerations [19]. The ELL format is especially suitable for storing sparse matrices with a roughly uniform number of nonzero elements per row. However, if a few rows have too many nonzero elements compared to other rows, a lot of storage or memory would be wasted.

Figure 4.5 shows the CSR format, one of the most commonly used storage formats. The CSR consists of three arrays: the first one storing the nonzero values, the second one storing the column indices of the values, and the third one storing the starting position of the first nonzero element

Figure 4.4: ELL format.

of each row in the first two arrays. It should be noted that the row offset array contains a total of n + 1 locations, and in addition to storing the starting position of each row, the array uses the last location to store the starting position of a fictitious n + 1 row.

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20

value 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21

column index 0 0 1 1 2 0 3 6 1 3 4 7 2 4 5 8 6 6 7 7 8

row offset 0 1 3 5 8 12 16 17 19 21

Figure 4.5: CSR format.
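As an illustration of how the CSR arrays are traversed, the kernel below performs a row-per-thread SpMV, y = A x, using the value, column index, and row offset arrays just described. It is a minimal sketch rather than the SpMV kernel developed for the platform.

// Minimal CSR SpMV sketch: one thread computes one row of y = A * x.
// Ap, Aj, Ax follow the CSR layout of Figure 4.5.
__global__ void spmv_csr(int n, const int *Ap, const int *Aj,
                         const double *Ax, const double *x, double *y)
{
    int row = blockDim.x * (blockIdx.x + blockIdx.y * gridDim.x) + threadIdx.x;
    if (row < n) {
        double sum = 0.0;
        for (int k = Ap[row]; k < Ap[row + 1]; k++)
            sum += Ax[k] * x[Aj[k]];
        y[row] = sum;
    }
}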

According to the idea of the compressed row, some variations of CSR are also used commonly, such as the CSC (compressed sparse column) format and the MSR (modified sparse row) format

[19]. The CSC stores data by column rather than by row, compared to CSR. The MSR is used to store the diagonal elements separately in cases where diagonal elements are nonzero and are frequently accessed.

It can be seen that if the nonzero elements of the matrix packed together can form a structure close to a rectangle, the matrix is suitable for ELL storage. If the packed nonzero elements still

form an irregular pattern, it is more appropriate to use the COO or CSR format. Because of its

regular storage mode, the ELL format is especially suitable for parallel access. Therefore, some

hybrid storage formats have been designed. Bell and Garland proposed an HYB (hybrid of ELL

and COO) format; see Figure 4.6. The data stored in the ELL format denoted by Figure 4.4 is split

into two parts: a regular part using ELL and an irregular part using COO.

Figure 4.6: HYB format.

In the current research, another hybrid storage format, HEC (hybrid of ELL and CSR), is adopted [38]. The HEC format consists of an ELL part and a CSR part used for regular data and irregular data, respectively; see Figure 4.7. The superiority of both HEC and CSR is utilized.

The row length l of the ELL part can be calculated from solving a minimization problem expressed by

w(l) = l × n + Pr × nz(l),    (4.8)

where n is the number of rows of the original matrix, Pr, a typical value being 20, represents the relative performance of the ELL part and the CSR part, and nz(l) stands for the number of nonzero elements of the CSR part [36, 37, 38].
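A direct way to evaluate equation (4.8) is to scan the distribution of row lengths and pick the width with the smallest cost. The helper below is an illustrative host-side sketch; the function name and the default Pr = 20 are assumptions rather than the platform's actual code.

/* Choose the ELL width l that minimizes w(l) = l*n + Pr*nz(l), where nz(l)
 * counts the nonzeros that would overflow into the CSR part. row_len[i]
 * holds the number of nonzeros in row i. Illustrative sketch only.         */
int hec_choose_ell_width(int n, const int *row_len, int Pr)
{
    int max_len = 0;
    for (int i = 0; i < n; i++)
        if (row_len[i] > max_len) max_len = row_len[i];

    long best_cost = -1;
    int  best_l    = 0;
    for (int l = 0; l <= max_len; l++) {
        long nz_csr = 0;                       /* nonzeros left for the CSR part */
        for (int i = 0; i < n; i++)
            if (row_len[i] > l) nz_csr += row_len[i] - l;
        long cost = (long)l * n + (long)Pr * nz_csr;
        if (best_cost < 0 || cost < best_cost) { best_cost = cost; best_l = l; }
    }
    return best_l;
}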

NVIDIA forms 32 threads into a unit called warp that is responsible for scheduling and running

threads. All threads in the same warp will execute the same instructions in parallel. When dealing

Figure 4.7: HEC format for matrix A.

with matrix algorithms, we often use a thread to manipulate elements in a row. Based on the two considerations, an appropriate method is to store the ELL part by column, and the elements of each column are supplemented with zero to be a multiple of 32. Figure 4.8 shows the storage of the ELL part. The length, stride, of the column of the ELL part with additional elements can be calculated by

stride = floor[(n + ls − 1)/ls] × ls,    (4.9)

where ls is a multiple of 32, such as 64. The floor function provides the return value of the largest integer that is smaller than or equal to the input parameter.

Figure 4.8: ELL part stored by column.

A HEC data structure expressed by UML is shown in Figure 4.9. Ax, Aj, and Ap represent the pointers to the array of values, indices, and row offsets, respectively.

Figure 4.9: HEC data structure.
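Based on the fields listed in Figure 4.9, the HEC container can be sketched as the C structures below, together with a column-major SpMV over the ELL part that uses the stride of equation (4.9). The member names are taken from the figure, while the kernel itself and the extra width parameter are only an illustrative sketch, not the thesis implementation.

// Sketch of the HEC data structure of Figure 4.9 and of a column-major
// ELL-part SpMV. Storing the ELL part by column with a padded stride lets
// the 32 threads of a warp read consecutive addresses (coalesced access).
typedef struct { double *Ax; int *Aj;            /* values, column indices   */
                 int num_rows, num_cols, num_nonzeros, stride; } ELL;
typedef struct { double *Ax; int *Aj; int *Ap;   /* values, columns, offsets */
                 int num_rows, num_cols, num_nonzeros; } CSR;
typedef struct { ELL ell; CSR csr; } HEC;

__global__ void spmv_ell_part(ELL ell, int width, const double *x, double *y)
{
    /* width = number of stored entries per row (the l of equation (4.8));
       padded entries are assumed to hold a zero value and a valid column. */
    int row = blockDim.x * (blockIdx.x + blockIdx.y * gridDim.x) + threadIdx.x;
    if (row < ell.num_rows) {
        double sum = 0.0;
        for (int c = 0; c < width; c++) {
            int j = ell.Aj[c * ell.stride + row];    /* column-major access */
            sum += ell.Ax[c * ell.stride + row] * x[j];
        }
        y[row] = sum;   /* the CSR part is accumulated by a separate kernel */
    }
}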

The BCSR (block-wise CSR) is used to store a matrix in a block format. The blocks are

classified into two categories: zero block and nonzero block. If a block has at least one nonzero

element, it is called a nonzero block. Otherwise, it is called a zero block. For the convenience of

illustration, we still use the matrix denoted in Figure 4.2 and treat it as a 3 × 3 block matrix, where each block contains 9 elements. A BCSR format stores all elements in a nonzero block, including zero elements; see Figure 4.10.

Figure 4.10: BCSR (3 × 3) format.

Figure 4.11 shows the data structure for the BCSR format. The property num_bs represents the block size, and the properties num_rows, num_cols, num_nonzeros, Ap and Aj are all counted based on the block concept. According to the way of value storage shown in Figure 4.10, the size of array Ax equals num_nonzeros × num_bs². All the nonzero positions of a matrix form a nonzero pattern if the actual values of elements

are ignored. For the implementation of preconditioned algorithms, it is often necessary to operate

on the nonzero pattern. A data structure, ICSR, is specially designed for storing a nonzero pattern;

see Figure 4.12. n represents the dimension of a matrix, and nz represents an array storing the length of all rows. Aj, a two-dimensional array, is used to store column indices.

Figure 4.11: BCSR data structure.

Figure 4.12: ICSR data structure.

4.2 Iterative Solvers

4.2.1 Linear Solution Methods

The direct methods attempt to solve linear systems through a series of finite operations, such as

Gauss elimination or its variations. Standard direct methods include the LU factorization, the

QR factorization, and the Cholesky factorization. In theory, a direct method provides an accurate

solution up to machine precision. The data structures used for direct methods are quite complicated

[19]. Because Gaussian elimination often leads to large fill-ins, the extra elements always make

the direct methods computationally expensive. For large-scale linear systems, such as those with

millions or more of unknowns, the direct methods can consume enormous computer resources, or the systems cannot even be solved on the most advanced computers.

Table 4.3: Linear solution methods for sparse systems.
  Class                          Methods        Features
  Sparse direct methods          LU             Complicated data structure
                                 QR             Theoretical exact solution
                                 Cholesky       High computer resources
  Stationary iterative methods   Jacobi         Low storage
                                 Gauss-Seidel   Easy to implement
                                 SOR            Convergence for limited matrices
                                 SSOR
  Krylov subspace methods        CG             Use matrix-vector operations
                                 GMRES          Usually combined with preconditioners
                                 BiCGSTAB       Fast converging
  Multigrid methods              GMG            For elliptic PDEs
                                 AMG            Excellent performance
                                                Superior scalability
                                                Difficult to implement

In contrast to the direct methods, the iterative methods (relaxation methods) gradually acquire an approximate solution of a system by a sequence of iterative steps, starting with an initial guess.

With the increasing scale of computing, iterative methods have played a significant role in application fields. Low storage requirements and ease of implementation on parallel computers are the main advantages over direct methods [19]. Generally speaking, the convergence rate of an iterative method may not be satisfactory. However, if the iterative method is used in conjunction with a proper preconditioner, it may show an unusually high convergence efficiency. Therefore, the choice of a preconditioner is often more important than the choice of an iterative method [19].

At present, the two main types of iterative methods are stationary iterative methods and Krylov subspace methods. The four main stationary iterative methods are the Jacobi method, the Gauss-Seidel method, the SOR (successive over-relaxation) method, and the SSOR (symmetric successive over-relaxation) method. Stationary methods are easier to understand and implement compared with Krylov subspace methods but are often less effective [21]. Most applications have used the more general and more effective Krylov subspace methods today. Though stationary methods are rarely used alone, they are adopted as functional building blocks for more advanced methods, such as AMG.

Krylov subspace methods are considered one of the most successful methods currently avail- able in numerical computations. Some well known Krylov subspace methods are the CG (conju- gate gradient) method, GMRES (generalized minimal residual) method, BiCGSTAB (biconjugate gradient stabilized) method, and QMR (quasi minimal residual) method. Krylov subspace meth- ods utilize matrix-vector multiplication operations for replacing costly matrix-matrix operations that often appear in traditional iterative methods, thus providing faster and more efficient conver- gence performance.

MG (multigrid) methods are currently the most efficient methods for solving Poisson-like linear systems. Because MG methods use a hierarchy of discretizations in their solution processes, they offer effectively enhanced scalability. However, MG methods are also considered challenging to implement. Based on these features, the development cost and the algorithmic payoff should be weighed in practical use. The GMG (geometric multigrid) method discretizes a PDE problem on meshes, whereas the AMG (algebraic multigrid) method constructs the hierarchy of operators directly from a matrix. As it is always a tough job to define appropriate meshes on unstructured grids, the AMG method is relatively easier to apply than the GMG method.

4.2.2 Krylov Subspace Methods

A general sparse linear system with a dimension of n is written as

A\vec{x} = \vec{b}.    (4.10)

Let $\vec{x}_0$ represent an initial guess vector to the solution. Then,

\vec{r}_0 = \vec{b} - A\vec{x}_0    (4.11)

stands for the residual vector of $\vec{x}_0$. After a simple deduction, the solution vector $\vec{x}$ of the system is expressed by

\vec{x} = \vec{x}_0 + A^{-1}\vec{r}_0.    (4.12)

Therefore, finding a solution becomes how to seek $A^{-1}\vec{r}_0$ effectively.

Let~xm represent an approximate solution. A general projection method for solving the system is described by conditions

\vec{x}_m \in \vec{x}_0 + K_m    (4.13)

and

\vec{r} = \vec{b} - A\vec{x}_m \perp L_m,    (4.14)

where $K_m$ and $L_m$ are two subspaces of dimension m. The subspace $K_m$ is called the search subspace, and it is used for searching $A^{-1}\vec{r}_0$. The subspace $L_m$ is called the left subspace. The condition (4.14) is called the Petrov-Galerkin condition.

An order-m Krylov subspace constructed by matrix A and vector r0 is written as

K_m(A, \vec{r}_0) = \mathrm{span}\{\vec{r}_0, A\vec{r}_0, A^2\vec{r}_0, \ldots, A^{m-1}\vec{r}_0\}.    (4.15)

The Krylov subspace provides a convenient and effective way to search for an approximation of $A^{-1}\vec{r}_0$. The iterative methods based on this approach are called Krylov subspace methods, named after Alexei Krylov, a Russian applied mathematician and naval engineer [54].

Although the dimension n of the system is large, the order m of a valid Krylov subspace is usually very small, such as 20 for the GMRES method. Because the spanning set is composed of a sequence of terms using a power iteration of A, the degree of linear dependence between these terms increases rapidly with the increase of the power of A. In order to avoid this problem, some

Krylov methods use orthogonal measures, such as the Arnoldi iteration or Lanczos iteration, to construct an orthogonal basis for the Krylov subspace.

The different versions of Krylov subspace methods are derived from different choices of the subspace $L_m$ [19]. The first class is $L_m = K_m(A, \vec{r}_0)$ or $L_m = A K_m(A, \vec{r}_0)$. The CG and GMRES methods belong to this class. The second class is $L_m = K_m(A^T, \vec{r}_0)$. The BiCGSTAB method falls into this category.

The CG method runs well for sparse SPD (symmetric positive definite) linear systems. The

GMRES method and BiCGSTAB method are used for a nonsymmetric system. For a well-conditioned system, the GMRES method is fast and needs just a few steps to converge. However, for systems requiring hundreds of steps to converge, the BiCGSTAB is a better choice [19].

Apart from the Krylov subspace methods introduced above, there are also other Krylov subspace methods, such as ORTHOMIN (orthogonal minimum residual), QMR (quasi minimal residual), and TFQMR (transpose-free QMR). However, none of them perform efficiently for all matrices. There are always examples for which one method is better than another. In general, a good strategy in the design and development of effective linear solvers is to find a suitable preconditioner rather than the best iterative solver [19].

4.2.3 AMG Method

Let A with elements $a_{ij}$ be the matrix derived from the discretization of a self-adjoint boundary value problem. A generally has the properties given below.

1. Sparse.

2. Symmetric: $A = A^T$.

3. Positive definite: for all vectors $\vec{u} \neq \vec{0}$, $\vec{u}^T A \vec{u} > 0$.

4. Weakly diagonally dominant: $\sum_{j \neq i} |a_{ij}| \leq a_{ii}$ for $1 \leq i \leq n$.

A symmetric positive definite matrix has real and positive eigenvalues. It can also be shown that if

A is symmetric and diagonally dominant with positive diagonal elements, A is an SPD (symmetric positive definite) matrix. An SPD matrix with positive elements on the diagonal and nonpositive off-diagonal elements is called an M-matrix [55].

The original AMG theory was developed to solve M-matrices. However, these properties do not need to be satisfied for the AMG method to work. In practice, most applications do not strictly comply with the requirements of an M-matrix. The weak diagonal dominance can be ignored, or slight deviations from the M-matrix properties are allowed. It is easy to see that the closer the structure of a matrix is to an M-matrix, the more efficient the AMG method can be

[55, 56]. The matrices arising from the discretization of many (not all) scalar elliptic differential equations are M-matrices [55].

The purpose of solving a positive definite system by iterative methods is to damp its error with as few iterations as possible. The error can be decomposed into a series of components of different frequencies by a Fourier transform. The standard iterative methods all possess the smoothing property. That means the high-frequency or oscillatory components of the error are much easier to be eliminated by iterative methods than the low-frequency or smooth components [55]. If we apply a coarse grid instead of a fine grid, a high-frequency characteristic would appear on the same error component, as illustrated in Figure 4.13. Then a smoother (stationary or relaxation-type iterative method) can be used to smooth these error components. This feature provides a theoretical basis for the multigrid method.

Figure 4.13: Error component in fine grid or coarse grid.

Figure 4.14 shows a two-grid multigrid used to explain the solution principle of the multigrid method. First, a smoother, a stationary iterative method such as the Gauss-Seidel method, is used to solve the linear system on a fine grid. The high-frequency error concerning this fine grid is eliminated after several iterations. Then, the system is transferred to a coarse grid. The smoother is used on the coarse grid to smooth the errors once more. Whether a frequency is high is relative to a particular grid. A grid can effectively eliminate the high-frequency corresponding to its scale.

Although the two-grid can reflect the principle of multigrid, in order to obtain a better convergence effect, the use of multigrid is a necessary choice; see Figure 4.15. Each grid is responsible for the error components that are the high frequency for its scale. The coarsest grid, the bottom one, should eventually be solved by a direct method at a negligible cost.

Figure 4.14: Two-grid.

Figure 4.15: Multigrid.

A set of hierarchy grids is necessary for performing a multigrid method. Unlike the GMG

(geometric multigrid) method, which refers to geometry, the AMG (algebraic multigrid) method only depends on a linear system or a coefficient matrix. If the problem to be solved is A~x =~b, the

AMG method can operate on a hierarchy of increasingly coarse matrix problems, as illustrated in

Figure 4.16. All the matrices are established automatically by the numerical algorithm based on the original linear system [56].

Figure 4.16: AMG hierarchy of increasingly coarse systems, $N_1 \gg N_2 \gg \cdots$ (adapted from [56]).

The approach to creating a hierarchy of AMG grids is briefly described by the following anal-

ysis referenced from [55]. For GMG, the unknown variables are defined on known discrete spatial

points from a fine grid so that a coarse grid can be built based on a subset of these locations. The

sub-vector of unknown variables on the coarse grid is used for the solution on the coarse grid.

Similarly, we can build fine and coarse grids for AMG based on the indices of unknown variables.

For the problem of $A\vec{x} = \vec{b}$, let $\vec{x}$, expressed by

\vec{x} = (x_1, \ldots, x_n)^T,    (4.16)

represent its solution and assume A has the nonzero pattern shown in Figure 4.17. The points of the fine grid can be defined as the indices of unknowns as

points = \{1, 2, \ldots, n\}.    (4.17)

If each nonzero element on the pattern of A represents an edge, an undirected adjacency graph of matrix A can be generated, as shown in Figure 4.18. The connections between points in the grid are the edges on the graph. Therefore, the grid is entirely defined based on the information of matrix

A.

Figure 4.17: AMG: example pattern (adapted from [55]).

Figure 4.18: AMG: example graph (adapted from [55]).

In AMG, an error component is considered algebraically smooth when it converges slowly with

respect to a selected smoother. In the direction of strong couplings, an error component changes

slowly. Thus, the error component can be identified through the strength of the connection between

points. If $|a_{ij}|$ is large, there is a strong coupling from i to j. This judgment provides the basis for the coarsening process. However, we also need a fast coarsening, which produces low fill-ins on coarser levels in later steps. The principle of coarsening can be summarized in two criteria [56].

1. The set of coarsening points should not be strongly coupled with each other.

2. Under the previous constraint, the set of coarsening points should be a maximal set.

After a coarsening process is designed, a refinement process from a coarsened level to a finer level also needs to be defined. As stated, a smooth error component appears in the direction of strong couplings. According to this principle, we can interpolate points that have a strong relationship with their neighbor points on the coarse level.

Assume the hierarchy of grid has L levels numbered from 1 to L. On each level l < L, we define a restriction operator Rl and a prolongation operator Pl used for the coarsening process and

the refinement process, respectively. Usually, Rl is the transpose of Pl, expressed as

R_l = P_l^T.    (4.18)

By the Galerkin projection, a coarser grid $A_{l+1}$ is constructed by

A_{l+1} = R_l A_l P_l.    (4.19)

After the system is transferred from a fine grid to a coarse grid, a pre-smoother is used to damp the high-frequency error components appearing on the new grid. In the reverse process, a post-smoother is used. Some lightweight stationary iterative methods (relaxation-type methods) such as the Gauss-Seidel method are competent for the smoothing task through only a few iterations.

An AMG solver is composed of two phases: a setup phase and a solution phase. The setup

phase is responsible for completing the construction of the restriction operator Rl, the prolongation

operator Pl, the coarse grid Al+1, the pre-smoother Sl and the post-smoother Tl. It is a preparation step for the solution phase. The solution phase needs to transfer data between the grids of different levels. The process is always designed in a recursive nature shown in Algorithm 1. Algorithm 1 is a general form called µ-cycle. Different µ gives different recursive algorithms. However, only two types of cycles, V-cycle and W-cycle, are used in practice [55].

When µ = 1 and the initial value l = 1, the cycle is named the V-cycle. The scheme of the

V-cycle algorithm is relatively simple; see Figure 4.19. When µ = 2 and the initial value l = 1, the cycle is named the W-cycle; see Figure 4.20. Figure 4.21 gives a detailed flow chart of V- cycle operations with L = 4. The pre-smoothing operation invokes a relaxation method, such as the Gauss-Seidel method, for a few iterations until the convergence deteriorates on the fine grid.

Then, we compute the residual vector and restrict it onto a coarser grid. The restricted residual is used as the right-hand side vector to obtain an approximation to the error on the next coarser grid. These operations are repeated until the coarsest grid is reached. The residual equation can be solved on the coarsest grid by a direct method with a negligible cost. During the return procedure, the solved error vector on a grid is prolongated to the next finer grid through the interpolation operation. The approximation obtained previously on the current level can be corrected by the prolongated error vector. Then, the post-smoothing operation invokes the relaxation method to smooth the oscillatory components of the error. These return operations are applied repeatedly until the finest grid is reached, on which we finally finish a whole iteration of AMG and get an approximate solution of the original linear system.

This procedure described is called the correction scheme, relative to another scheme, nested iteration, stated next. The nested iteration seeks to find an approximation to the solution of a

linear system with only one sweep through the levels from bottom to top [19]. It uses a coarse

grid to obtain an improved initial guess for the next finer grid problem. The implementation of the

nested iteration refers to the full multigrid (FMG) method shown in algorithm 2. The symbol Pˆl

used in algorithm 2 is to distinguish from the symbol Pl in algorithm 1.

Algorithm 1 AMG µ-cycle solution algorithm: amg_solve(l, µ) (adapted from [42]).
Require: b_l, x_l, A_l, R_l, P_l, S_l, T_l for 1 ≤ l < L, and µ.
  if l < L then
    x_l = S_l(x_l, A_l, b_l)            ⊲ Pre-smoothing
    r_l = b_l - A_l x_l
    b_{l+1} = R_l r_l                   ⊲ Restriction
    for i = 0; i < µ; i++ do
      amg_solve(l + 1, µ)               ⊲ Recursive invocation
    end for
    x_l = x_l + P_l x_{l+1}             ⊲ Prolongation
    x_l = T_l(x_l, A_l, b_l)            ⊲ Post-smoothing
  else
    x_l = A_l^{-1} b_l                  ⊲ Direct solution of level L
  end if
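To make the recursion in Algorithm 1 concrete, the following is a minimal, self-contained C++ sketch of a µ-cycle for a 1-D model problem. It is only a geometric analogue of the recursion: the smoother, restriction and interpolation are the standard choices for the 1-D Poisson equation, not the algebraic setup (RS coarsening, Galerkin products) used by the AMG solvers in this work, and all function names and parameter values are choices of the sketch.

// 1-D geometric analogue of the mu-cycle in Algorithm 1 (mu = 1: V-cycle, mu = 2: W-cycle).
// Solves -u'' = f on (0,1) with homogeneous Dirichlet boundaries; n interior points, n = 2^k - 1.
#include <cmath>
#include <cstdio>
#include <vector>

using Vec = std::vector<double>;

// r = b - A x for (A x)_i = (2 x_i - x_{i-1} - x_{i+1}) / h^2, with x_{-1} = x_n = 0.
static Vec residual(const Vec& x, const Vec& b, double h) {
    int n = (int)x.size();
    Vec r(n);
    for (int i = 0; i < n; ++i) {
        double xl = (i > 0) ? x[i - 1] : 0.0;
        double xr = (i + 1 < n) ? x[i + 1] : 0.0;
        r[i] = b[i] - (2.0 * x[i] - xl - xr) / (h * h);
    }
    return r;
}

// Damped Jacobi sweeps, playing the role of the smoothers S_l and T_l.
static void smooth(Vec& x, const Vec& b, double h, int sweeps) {
    int n = (int)x.size();
    const double w = 2.0 / 3.0;
    for (int s = 0; s < sweeps; ++s) {
        Vec xn(n);
        for (int i = 0; i < n; ++i) {
            double xl = (i > 0) ? x[i - 1] : 0.0;
            double xr = (i + 1 < n) ? x[i + 1] : 0.0;
            xn[i] = (1.0 - w) * x[i] + w * 0.5 * (xl + xr + h * h * b[i]);
        }
        x = xn;
    }
}

// One mu-cycle on the current level (cf. Algorithm 1).
static void mu_cycle(Vec& x, const Vec& b, double h, int mu) {
    int n = (int)x.size();
    if (n <= 3) { smooth(x, b, h, 50); return; }   // "direct" solve on the coarsest level
    smooth(x, b, h, 3);                            // pre-smoothing
    Vec r = residual(x, b, h);
    int nc = (n - 1) / 2;                          // coarse points sit at fine indices 1, 3, 5, ...
    Vec bc(nc), ec(nc, 0.0);
    for (int i = 0; i < nc; ++i)                   // restriction (full weighting)
        bc[i] = 0.25 * r[2 * i] + 0.5 * r[2 * i + 1] + 0.25 * r[2 * i + 2];
    for (int m = 0; m < mu; ++m) mu_cycle(ec, bc, 2.0 * h, mu);   // recursive invocation
    for (int j = 0; j < n; ++j) {                  // prolongation (linear interpolation) + correction
        if (j % 2 == 1) x[j] += ec[(j - 1) / 2];
        else {
            double el = (j > 0) ? ec[(j - 2) / 2] : 0.0;
            double er = (j + 1 < n) ? ec[j / 2] : 0.0;
            x[j] += 0.5 * (el + er);
        }
    }
    smooth(x, b, h, 3);                            // post-smoothing
}

int main() {
    int n = 255;                                   // interior points; h = 1/(n+1)
    double h = 1.0 / (n + 1);
    Vec x(n, 0.0), b(n, 1.0);                      // -u'' = 1, u(0) = u(1) = 0
    for (int it = 1; it <= 8; ++it) {
        mu_cycle(x, b, h, 1);                      // V-cycle; use mu = 2 for a W-cycle
        Vec r = residual(x, b, h);
        double nrm = 0.0;
        for (double v : r) nrm += v * v;
        std::printf("cycle %d: ||r||_2 = %.3e\n", it, std::sqrt(nrm));
    }
    return 0;
}

Running the sketch shows the residual norm dropping by roughly an order of magnitude per cycle, which is the qualitative behaviour the recursive scheme is designed to deliver.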

Figure 4.22 gives a scheme of F-cycle with µ = 1. From the coarsest grid, the scheme finds an

approximation to the solution that can be prolongated to an improved initial guess used on the next

finer grid. An arrow in the figure denotes this operation. Then a V-cycle is applied on the finer

level. After the V-cycle obtains a solution, the solution is used as the next improved initial guess on the next finer grid. The iteration continues to the top level, where the largest V-cycle

Figure 4.19: AMG: V-cycle (adapted from [19]).  Figure 4.20: AMG: W-cycle (adapted from [19]).

Figure 4.21: AMG: operations in V-cycle (pre-smooth, compute residual, restrict on the way down; direct solve on the coarsest level; interpolate, correct and post-smooth on the way back).

is applied to acquire the final solution of the current AMG iteration. Figure 4.23 gives another

scheme of F-cycle with µ = 2. The procedure adopts a W-cycle instead of a V-cycle. The FMG

method is considered a powerful algorithm. Under certain conditions, the error of the resulting

approximation is guaranteed to be of the order of the discretization error, beyond which no more accuracy

is required in practice [19].

Because the AMG methods are limited to Poisson-like problems, they cannot be used directly on a whole Jacobian system, which is nonsymmetric and indefinite. However, the AMG methods may play a critical role in accelerating the solution to the pressure part of a preconditioner system.

The related CPR preconditioner is introduced in a later section.

Figure 4.22: AMG: F-cycle with µ = 1 (adapted from [19]).  Figure 4.23: AMG: F-cycle with µ = 2 (adapted from [19]).

Algorithm 2 AMG F-cycle solution algorithm: amgf_solve(l, µ) (adapted from [55]).
  l = L                          ⊲ start from the coarsest grid
  x_l = A_l^{-1} b_l             ⊲ direct solution of level L
  for l = L - 1, ..., 1 do
    x_l = x_l + P̂_l x_{l+1}      ⊲ prolongation
    amg_solve(l, µ)
  end for

4.3 Preconditioners

4.3.1 Principle

In general, the performance of a Krylov subspace iterative method depends heavily on the quality

of preconditioners. The preconditioning technique transforms the original linear system into a

solution equivalent system that would be easier to solve by an iterative solver. The application of

preconditioners can enhance both the efficiency and robustness of the solution of a system [19]. For a linear system represented by

A\vec{x} = \vec{b}, \qquad (4.20)

let M be a preconditioning matrix. According to different strategies, M can be applied to the left of A as

(M^{-1}A)\vec{x} = M^{-1}\vec{b}, \qquad (4.21)

or to the right of A as

(AM^{-1})(M\vec{x}) = \vec{b}. \qquad (4.22)

When M is split into two factors M_L and M_R,

M = M_L M_R, \qquad (4.23)

the third strategy of split preconditioning is written as

(M_L^{-1} A M_R^{-1})(M_R\vec{x}) = M_L^{-1}\vec{b}. \qquad (4.24)

In a preconditioned Krylov subspace method, a preconditioner system expressed by

M\vec{y} = \vec{f} \qquad (4.25)

is required to be solved at least once in each iteration.

The preconditioned systems of (4.21), (4.22) and (4.24) require a Krylov subspace to search for a solution. Let \vec{r}_0 be the initial residual of a preconditioned system. The Krylov subspaces for the preconditioned systems of (4.21), (4.22) and (4.24) are given, respectively, by

K_m(M^{-1}A, \vec{r}_0) = \mathrm{span}\{\vec{r}_0, (M^{-1}A)\vec{r}_0, (M^{-1}A)^2\vec{r}_0, \ldots, (M^{-1}A)^{m-1}\vec{r}_0\}, \qquad (4.26)

K_m(AM^{-1}, \vec{r}_0) = \mathrm{span}\{\vec{r}_0, (AM^{-1})\vec{r}_0, (AM^{-1})^2\vec{r}_0, \ldots, (AM^{-1})^{m-1}\vec{r}_0\}, \qquad (4.27)

and

K_m(M_L^{-1}AM_R^{-1}, \vec{r}_0) = \mathrm{span}\{\vec{r}_0, (M_L^{-1}AM_R^{-1})\vec{r}_0, \ldots, (M_L^{-1}AM_R^{-1})^{m-1}\vec{r}_0\}. \qquad (4.28)

Based on the above analysis, some requirements given below are apparent for an effective preconditioner in practice.

1. M is nonsingular.

2. A preconditioner system can be solved, or M^{-1} can be calculated, at a very low cost.

3. M is close to A.

For a split preconditioner, both M_L and M_R need to meet the first two requirements. M_L and M_R are typically designed as triangular matrices. It is an excellent choice to use the split preconditioner on a nearly symmetric system. There is little difference among the three preconditioning strategies. However, when M is very ill-conditioned, the different versions of the residual in each strategy may affect the stopping criterion, which determines when an algorithm stops [19].

4.3.2 ILU

Gaussian elimination creates the LU factorization of a matrix A expressed by

A = LU, \qquad (4.29)

where L is a unit lower triangular matrix and U is an upper triangular matrix. In order to facilitate the implementation on a computer, the Gaussian elimination IKJ variant given in Algorithm 3 is a common choice. Figure 4.24 explains the features of the algorithm. At the i-th step of the i loop, the i-th row, composed of a_{ik} and a_{ij}, is the only place where elements are being updated. The previous rows are composed of an accessed zone that was read-only in previous steps but is not accessed in the current step, and a read-only zone that is read-only in the current step. The not-accessed zone is factorized by an updating procedure row by row. We can see that the accessed zone and the read-only zone are generated at the same time as the updating row moves forward. The accessed zone and the read-only zone form the lower part L and the upper part U of the factorization result, respectively. Because the lower part L is a unit lower triangular matrix, it does not need to store a diagonal. After the entire procedure, the resulting matrices L and U occupy the same space where the original matrix A was stored.

Algorithm 3 Gaussian elimination IKJ variant (adapted from [19]).
1: for i = 2, ..., n do
2:   for k = 1, ..., i - 1 do
3:     a_{ik} = a_{ik} / a_{kk}
4:     for j = k + 1, ..., n do
5:       a_{ij} = a_{ij} - a_{ik} a_{kj}
6:     end for
7:   end for
8: end for

The ILU series preconditioners are based on the incomplete LU factorization that derives from the LU factorization with rules of dropping unwanted elements. The definition of the drop rule is generally divided into two categories. The first kind depends on the location of the nonzero elements of the original matrix, while the second kind depends on the magnitude of the nonzero elements of the original matrix. Therefore, there are two kinds of incomplete decomposition:

ILU(k) and ILUT.

A pair set that represents the nonzero pattern of matrix A is defined by

NZ(A) = \{(i, j) \mid a_{ij} \neq 0,\ 1 \le i, j \le n\}. \qquad (4.30)

Figure 4.24: Gaussian elimination: LU factorization (the updating i-th row separates the accessed and read-only zones above from the not-accessed zone below).

The ILU(0) factorization uses the nonzero pattern NZ(A) without any fill-ins. Figure 4.25 shows a sample of NZ(A). Figure 4.26 describes the pattern of the ILU(0) factorization result. The ILU(0) algorithm refers to Algorithm 4. Because of its simplicity and efficiency, ILU(0) is adopted as a general ILU decomposition method.

Algorithm 4 ILU(0) factorization (adapted from [19]).
1: for i = 2, ..., n do
2:   for k = 1, ..., i - 1 and (i, k) ∈ NZ(A) do
3:     a_{ik} = a_{ik} / a_{kk}
4:     for j = k + 1, ..., n and (i, j) ∈ NZ(A) do
5:       a_{ij} = a_{ij} - a_{ik} a_{kj}
6:     end for
7:   end for
8: end for
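The row-wise updates in Algorithm 4 translate almost directly into code. The following is a minimal serial sketch of ILU(0) on a dense array restricted to a given nonzero pattern; it is for illustration only (a production version would operate on a sparse format such as CSR or the HEC format used in this work), and the array layout and the nz pattern flag are assumptions of the sketch.

// Minimal serial ILU(0) sketch: in-place IKJ factorization of a dense n x n array `a`,
// restricted to the nonzero pattern `nz` (nz[i*n+j] != 0 means (i,j) is in NZ(A)).
// Afterwards the strictly lower part of `a` holds L (unit diagonal implied)
// and the upper part holds U, as in Algorithm 4.
#include <vector>

void ilu0_dense(int n, std::vector<double>& a, const std::vector<char>& nz) {
    for (int i = 1; i < n; ++i) {                  // rows 2..n (0-based: 1..n-1)
        for (int k = 0; k < i; ++k) {
            if (!nz[i * n + k]) continue;          // only positions in NZ(A)
            a[i * n + k] /= a[k * n + k];
            for (int j = k + 1; j < n; ++j) {
                if (!nz[i * n + j]) continue;      // no fill-in outside NZ(A)
                a[i * n + j] -= a[i * n + k] * a[k * n + j];
            }
        }
    }
}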

ILU(k) is a version of ILU factorization with fill-ins, compared to ILU(0). The parameter k is

used to control the number of fill-ins. A larger k allows more fill-ins than a lower k. ILU(0) can

be seen as a special case of ILU(k) when k is set to zero. A level Li j used for comparison with k is

Figure 4.25: Pattern NZ(A).  Figure 4.26: ILU(0) factorization.

assigned to each element. The rules for initializing and updating the Li j are given by

L_{ij} =
\begin{cases}
0,      & (i, j) \in NZ(A) \\
\infty, & (i, j) \notin NZ(A)
\end{cases}
\qquad (4.31)

and

L_{ij} = \min\{L_{ij},\ L_{ik} + L_{kj} + 1\}, \qquad (4.32)

respectively [19]. If an element a_{ij} is nonzero, its initial level is assigned zero. Otherwise, the initial level is infinity. Because the updating rule cannot change a zero level, which is already the minimal value, a nonzero element a_{ij} keeps its zero level unchanged during the entire

ILU(k) factorization procedure. If a_{ij} is zero, its level may be updated to a positive integer or remain infinity. Algorithm 5 gives the mechanism of ILU(k). In order to distinguish it from the loop variable k, the parameter k is written as k̂ in the algorithm. Lines 2 and 10 provide the filter criteria. A larger k̂ allows more elements to enter the calculation, and fewer elements are set to zero or deleted.

Figure 4.27 and Figure 4.28 show the pattern of ILU(1) and ILU(2). Compared with Figure 4.26, a pattern with a larger k has more fill-ins.

For more convenient implementation, we split Algorithm 5 into two phases: a symbolic phase and a factorization phase. The symbolic phase is only responsible for the generation of the filled-in pattern. Algorithm 6 explains the procedure. The data structure ICSR is used for this phase. After the symbolic phase, the filled-in pattern is stored in NZ′(A). Figure 4.29 shows a symbolic-phase example of ILU(1). The factorization phase needs data to fill back. All original nonzero locations

Algorithm 5 ILU(k) factorization (adapted from [19]).
1: for i = 2, ..., n do
2:   for k = 1, ..., i - 1 and L_{ik} ≤ k̂ do
3:     a_{ik} = a_{ik} / a_{kk}
4:     for j = k + 1, ..., n do
5:       a_{ij} = a_{ij} - a_{ik} a_{kj}
6:       L_{ij} = min{L_{ij}, L_{ik} + L_{kj} + 1}
7:     end for
8:   end for
9:   for j = 1, ..., n do
10:    if L_{ij} > k̂ then
11:      a_{ij} = 0
12:    end if
13:  end for
14: end for

are populated with the original data, but all the filled-in positions are filled with zeros. By now, the ILU(k) factorization is converted into an ILU(0) factorization, illustrated by Figure 4.30.

Figure 4.27: ILU(1) factorization.  Figure 4.28: ILU(2) factorization.

The dropping rule of ILU(k) is based entirely on the original nonzero pattern and k. The magnitude of elements has not been taken into account. In contrast, ILUT provides a threshold strategy based on the size of elements for ILU decomposition.

Algorithm 6 Decoupled ILU(k): symbolic phase.
1: Define NZ′(A) as an n × n nonzero pattern
2: Initialize NZ′(A) by filling it with all entries
3: for i = 2, ..., n do
4:   for k = 1, ..., i - 1 and L_{ik} ≤ k̂ do
5:     for j = k + 1, ..., n do
6:       L_{ij} = min{L_{ij}, L_{ik} + L_{kj} + 1}
7:     end for
8:   end for
9:   for j = 1, ..., n do
10:    if L_{ij} > k̂ then
11:      remove entry (i, j) from NZ′(A)
12:    end if
13:  end for
14: end for
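A minimal serial sketch of this symbolic phase is shown below, using a dense level table purely for clarity (the platform itself works with the ICSR structure mentioned above); the function and array names are assumptions of the sketch.

// Sketch of the ILU(k) symbolic phase (Algorithm 6): compute fill levels and return the
// filled-in pattern NZ'(A). `nz` marks the original nonzero pattern, `khat` is the fill level.
#include <algorithm>
#include <limits>
#include <vector>

std::vector<char> iluk_symbolic(int n, const std::vector<char>& nz, int khat) {
    const int INF = std::numeric_limits<int>::max() / 2;   // "infinity" that survives +1
    std::vector<int> lev(n * n);
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j)
            lev[i * n + j] = nz[i * n + j] ? 0 : INF;       // initialization rule (4.31)
    for (int i = 1; i < n; ++i) {
        for (int k = 0; k < i; ++k) {
            if (lev[i * n + k] > khat) continue;            // line 4 of Algorithm 6
            for (int j = k + 1; j < n; ++j)                  // updating rule (4.32)
                lev[i * n + j] = std::min(lev[i * n + j],
                                          lev[i * n + k] + lev[k * n + j] + 1);
        }
    }
    std::vector<char> nz_filled(n * n);
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j)
            nz_filled[i * n + j] = (lev[i * n + j] <= khat) ? 1 : 0;  // drop levels above k̂
    return nz_filled;
}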

Figure 4.29: Decoupled ILU(k): symbolic phase ((1) NZ(A), (2) NZ′(A)).

Algorithm 7 gives the ILUT procedure, which uses two parameters and two dropping rules to screen the elements. τ is a tolerance criterion, such as 1 × 10^{-2}. p is a counting criterion, typically defined as the average number of nonzero elements per row of the factorized matrix or another reasonable value. The two dropping rules are designed as below [19].

Figure 4.30: Decoupled ILU(k): factorization phase ((1) data fill back, (2) ILU(0)).

1. The first rule drops an element a_{ij} (sets it to zero) if its magnitude is less than τ_i, where τ_i = τ · ||\vec{a}_{i*}||_2 and \vec{a}_{i*} is the i-th row vector.

2. If 1 ≤ j ≤ i - 1, a_{ij} belongs to the L part; if i ≤ j ≤ n, a_{ij} belongs to the U part. The second rule keeps only the p largest elements in the L part and the p + 1 largest elements in the U part; the other, relatively small, elements are all dropped.

ILUT cannot utilize and keep the structure of the original matrix. The resulting pattern of ILUT is always irregular, which would affect the parallel performance. However, ILUT might acquire some improvement in the convergence performance because it keeps the qualified heavyweight elements and removes the lightweight ones. Given a sample of NZ(A) shown on Figure 4.25,

Figure 4.31 shows the ILUT factorization result, which is an irregular pattern in the area between the two outermost diagonals.

Figure 4.31: ILUT factorization (the irregular pattern zone lies between the two outermost diagonals).

Algorithm 7 ILUT(p, τ) factorization (adapted from [19]).
1: for i = 2, ..., n do
2:   for k = 1, ..., i - 1 and (i, k) ∈ NZ(A) do
3:     a_{ik} = a_{ik} / a_{kk}
4:     Apply dropping rule 1 to a_{ik}
5:     if a_{ik} ≠ 0 then
6:       for j = k + 1, ..., n do
7:         a_{ij} = a_{ij} - a_{ik} a_{kj}
8:       end for
9:     end if
10:  end for
11:  for j = 1, ..., n do
12:    Apply dropping rule 1 to a_{ij}
13:    Apply dropping rule 2 to a_{ij}
14:  end for
15: end for

4.3.3 BILU

Besides the ILU preconditioners introduced above, a block-wise ILU (BILU) is another option to establish preconditioners for iterative methods. Given a block size that is a divisor of the number of rows, a matrix can be regarded as a block-wise matrix in which each block is a submatrix whose number of elements is the square of the block size. Assume a matrix A of a discrete system has (nm)^2 elements:

A =
\begin{pmatrix}
a_{11}   & a_{12}   & \cdots & a_{1,nm}  \\
a_{21}   & a_{22}   & \cdots & a_{2,nm}  \\
\vdots   & \vdots   & \ddots & \vdots    \\
a_{i1}   & a_{i2}   & \cdots & a_{i,nm}  \\
\vdots   & \vdots   & \ddots & \vdots    \\
a_{nm,1} & a_{nm,2} & \cdots & a_{nm,nm}
\end{pmatrix}. \qquad (4.33)

Let m be the block size. The block-wise matrix A can be expressed by

A =
\begin{pmatrix}
A_{11} & A_{12} & \cdots & A_{1n} \\
A_{21} & A_{22} & \cdots & A_{2n} \\
\vdots & \vdots & \ddots & \vdots \\
A_{i1} & A_{i2} & \cdots & A_{in} \\
\vdots & \vdots & \ddots & \vdots \\
A_{n1} & A_{n2} & \cdots & A_{nn}
\end{pmatrix}, \qquad (4.34)

where each block A_{ij} (1 ≤ i, j ≤ n) is an m × m submatrix. In contrast, a matrix observed from the traditional view is called a point-wise matrix.

If there is at least one nonzero element among the m × m elements of a block A_{ij}, the block-wise element A_{ij}, called block A_{ij} for short, is classified as a nonzero block. If all the m × m elements of a block A_{ij} are zero, the block is classified as a zero block. The BCSR data structure for a block-wise matrix is used to store all nonzero blocks with their elements, including zero elements. The nonzero pattern of a block-wise matrix is generated based on the nonzero blocks. The left two graphs in Figure 4.32 show a block-wise matrix A and its pattern.

Both the symbolic phase and the factorization phase of block-wise ILU(k) operate on blocks. After applying the symbolic phase of the decoupled ILU(k) to the pattern of A, we get the filled-in pattern NZ′(A). The next step is data fill back: the nonzero blocks are all copied from the original matrix A, and the filled-in positions are filled with zeros, as illustrated in the left graph of Figure 4.33. The block-wise preconditioner introduces more filled-in positions, which can benefit the accuracy of calculations during the solution of a preconditioner system.

Figure 4.32: Decoupled block-wise ILU(k): symbolic phase ((1) block-wise matrix A, (2) pattern of A, (3) filled-in pattern NZ′(A)).

Figure 4.33: Decoupled block-wise ILU(k): factorization phase ((1) data fill back, (2) block-wise ILU(0) factorization).

Algorithm 8 gives the procedure of block-wise ILU(0), which differs slightly from Algorithm 4. The operations are all performed on the blocks, such as the multiplication between the submatrices A_{ik} and A_{kj}. The right graph in Figure 4.33 shows the factorization result, which is a combination of a block-wise lower triangular matrix L and a block-wise upper triangular matrix U. The procedure of block-wise ILU(k) factorization is more involved; it is briefly summarized in Algorithm 9.

The block-wise L can also be viewed as a point-wise triangular matrix because of its pure unit diagonal blocks. The left graph in Figure 4.34 shows such a feature. Thus, the parallel solver developed for point-wise triangular matrices can be used directly for L. However, the block-wise

U is not a point-wise triangular matrix. The right graph in Figure 4.34 shows the diagonal of U,

Algorithm 8 Block-wise ILU(0) factorization.
1: for i = 2, ..., n do
2:   for k = 1, ..., i - 1 and (i, k) ∈ NZ′(A) do
3:     A_{ik} = A_{ik} A_{kk}^{-1}
4:     for j = k + 1, ..., n and (i, j) ∈ NZ′(A) do
5:       A_{ij} = A_{ij} - A_{ik} A_{kj}
6:     end for
7:   end for
8: end for

Algorithm 9 Decoupled block-wise ILU(k) algorithm.
1: Generate the block-wise pattern

2: Apply symbolic factorization by Algorithm 6

3: Back fill data to the original positions and fill zero to the filled-in positions

4: Apply block-wise ILU(0) factorization by Algorithm 8

which would generally be an irregular shape. In order to adapt U for the point-wise parallel solver, we need to further factorize U into a block diagonal matrix D and an upper triangular matrix U′, as illustrated in Figure 4.35. This factorization is expressed by

U = D(D^{-1}U) = DU′. \qquad (4.35)

Figure 4.34: Block-wise lower part and upper part ((1) L, (2) U).

Figure 4.35: Block-wise diagonal part and modified upper part ((1) D, (2) U′).

Based on the analysis above, the preconditioner system

M\vec{y} = \vec{f} \qquad (4.36)

is written in the factorized form

LDU′\vec{y} = \vec{f}. \qquad (4.37)

The solution of the preconditioner system is composed of three steps:

L\vec{y}′ = \vec{f}, \qquad (4.38)

\vec{y}′′ = D^{-1}\vec{y}′, \qquad (4.39)

and

U′\vec{y} = \vec{y}′′. \qquad (4.40)

(4.38) and (4.40) can be solved by a parallel triangular solver. (4.39) is composed of an inverse operation on D and a matrix-vector multiplication D^{-1}\vec{y}′, which can also be easily implemented using parallel technology.
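Step (4.39) amounts to solving one small m × m dense system per block of D. A minimal serial sketch of this per-block solve is given below; it assumes the diagonal blocks are stored densely and are well conditioned enough for Gaussian elimination without pivoting, and the function and array names are choices of the sketch, not the platform's actual kernel (on the GPU, each block would typically be handled by its own thread).

// Apply y'' = D^{-1} y' for a block diagonal matrix D with nblocks dense m x m blocks.
// Block b is stored row-major starting at D[b*m*m]; y holds nblocks*m entries and is
// overwritten with the result. D is taken by value so the elimination does not clobber
// the caller's copy (a real implementation would pre-factorize D once).
#include <vector>

void block_diag_solve(int nblocks, int m, std::vector<double> D, std::vector<double>& y) {
    for (int b = 0; b < nblocks; ++b) {
        double* A = &D[b * m * m];
        double* x = &y[b * m];
        // Forward elimination on the local m x m system (right-hand side updated in place).
        for (int k = 0; k < m; ++k) {
            for (int i = k + 1; i < m; ++i) {
                double f = A[i * m + k] / A[k * m + k];
                for (int j = k; j < m; ++j) A[i * m + j] -= f * A[k * m + j];
                x[i] -= f * x[k];
            }
        }
        // Back substitution.
        for (int i = m - 1; i >= 0; --i) {
            double s = x[i];
            for (int j = i + 1; j < m; ++j) s -= A[i * m + j] * x[j];
            x[i] = s / A[i * m + i];
        }
    }
}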

4.3.4 CPR

Mass conservation equations play a significant role in reservoir simulation systems. The terms of transmissibility and potential differences contain connections between adjacent cells on a discrete

74 grid. The generalized elliptic operation, which represents the derivative in space, is written as

\nabla \cdot (\vec{a}\,\nabla\Phi), \qquad (4.41)

where the tensor \vec{a} can be viewed as the transmissibility, and the scalar Φ is the fluid potential.

We assume a water-oil model with n_w wells. The block-centered FDM (finite difference method) is applied to discretize the PDE system on a structured 3D grid consisting of n = n_x × n_y × n_z cells (grid blocks). A no-flow boundary condition is assumed. Upstream weighting is used for the discretization of fluid properties in \vec{a}. The harmonic average is used for the discretization

of rock and geometric properties in ~a. Then, let

\vec{F}(\vec{x}) = \vec{0} \qquad (4.42)

represent the nonlinear system based on the fully implicit discretization, which is composed of 2n + n_w equations. Each cell has an oil equation and a water equation. Each well has a well equation. A primary unknown variable of P_o, S_w, or P_{bh} corresponds to an oil equation, a water equation, or a well equation, respectively. So there are 2n + n_w unknowns. We sort and number these equations. The n oil equations are numbered from 1 to n. The n water equations are numbered from n + 1 to 2n. The well equations are arranged according to the number of wells and placed at the end of the system. The variables P_o, S_w and P_{bh} are represented by the vectors, respectively,

\vec{P}_o = (P_{o,1}, P_{o,2}, \ldots, P_{o,n})^T, \qquad (4.43)

\vec{S}_w = (S_{w,1}, S_{w,2}, \ldots, S_{w,n})^T, \qquad (4.44)

and

\vec{P}_{bh} = (P_{bh,1}, P_{bh,2}, \ldots, P_{bh,n_w})^T. \qquad (4.45)

The unknown vector \vec{x} is finally written as

\vec{x} = (\vec{P}_o, \vec{S}_w, \vec{P}_{bh})^T. \qquad (4.46)

The block form of a Jacobian matrix A derived at each Newton iteration can be written as

A =
\begin{pmatrix}
A_{pp} & A_{ps} & A_{pw} \\
A_{sp} & A_{ss} & A_{sw} \\
A_{wp} & A_{ws} & A_{ww}
\end{pmatrix}. \qquad (4.47)

The diagonal blocks A_{pp} \in R^{n \times n}, A_{ss} \in R^{n \times n} and A_{ww} \in R^{n_w \times n_w} correspond to the unknown vectors \vec{P}_o \in R^{n}, \vec{S}_w \in R^{n} and \vec{P}_{bh} \in R^{n_w}, respectively. The other blocks, the off-diagonal blocks, all represent the relationship between two types of unknowns and are called coupling blocks.

The subsequent analysis focuses on the blocks of oil equations and water equations. Let i and j

represent two adjacent cells. Cell i lies upstream of cell j. Figure 4.36-(1) shows the feature of App.

Although the pattern looks symmetric, A_{pp} is not a symmetric matrix, because adjacent cells have different weights in the transmissibility at the interface, and there are differences in the formation volume factors and GORs on each side. However, because pressure differences between adjacent cells are usually small, the matrix A_{pp} is nearly symmetric. As a relaxed version of a symmetric positive definite matrix, A_{pp} can still be solved effectively by AMG [57, 58]. Figure 4.36-(2) shows

the feature of Ass. Upstream weighting is adopted to calculate the flux by using the phase mobilities that are upstream with respect to the flow of the phases [59]. Because of upstream weighting, the

pattern presents a triangle with its right-angled vertex on the lower left.

Figure 4.36: Feature analysis of the Jacobian matrix ((1) A_{pp}, (2) A_{ss}).

Wallis et al. introduced the CPR (constrained pressure residual) method as a mixed preconditioner scheme of AMG and ILU [20]. The basic idea is to take advantage of the nearly SPD feature of the block A_{pp}, which allows a highly efficient solution of the pressure part of a preconditioner system by AMG. The CPR preconditioner consists of two phases. The first phase is to solve the pressure part \vec{P}_o of the preconditioner system through A_{pp} by AMG. The second phase is to apply the ILU factorization for the solution of the entire preconditioner system based on the pressure variables acquired in the first phase.

For the solution of the preconditioner system M\vec{y} = \vec{f}, some notation needs to be stated before elaborating on the CPR algorithm. A restriction operator R is defined to cut the segment \vec{f}_o, corresponding to the oil pressure part \vec{y}_o, out of the right-hand side vector \vec{f}. The opposite operation is defined by a prolongation operator P that restores \vec{y}_o to the original dimension by placing a zero vector \vec{0} at the rear. The two operations are expressed by

R\vec{f} = \vec{f}_o \qquad (4.48)

and

P\vec{y}_o =
\begin{pmatrix}
\vec{y}_o \\
\vec{0}
\end{pmatrix}. \qquad (4.49)

AMG(A_{pp})^{-1}\vec{f}_o represents the solution \vec{y}_o of a Poisson-like system A_{pp}\vec{y}_o = \vec{f}_o by the AMG method. ILU(A)^{-1}\vec{r} represents the solution of a residual system ILU(A)\vec{y}_r = \vec{r}, where the coefficient matrix ILU(A) is the ILU factorization of A. Algorithm 10 gives the algorithm of the CPR preconditioner. The CPR procedure is also displayed in Figure 4.37, where steps (1) and (2) make up the CPR.

Algorithm 10 The CPR preconditioner.
1: y = P[AMG(A_pp)^{-1}(R f)]
2: r = f - A y
3: y = y + ILU(A)^{-1} r

Figure 4.37: CPR procedure ((1) CPR: AMG, (2) CPR: ILU).
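The two-stage structure of Algorithm 10 can be sketched as a small host-side routine. In the sketch below the full-matrix SpMV, the AMG pressure solve and the global ILU solve are supplied as callables, because their internals are covered elsewhere in this chapter; the function names, the std::function interface and the toy stand-ins in main are assumptions of the sketch, not the platform's actual API.

// Sketch of the CPR application: stage 1 solves the pressure block by AMG,
// stage 2 corrects the whole system with a global ILU solve.
#include <cstddef>
#include <cstdio>
#include <functional>
#include <vector>

using Vec = std::vector<double>;
using Op  = std::function<Vec(const Vec&)>;

// np = number of pressure unknowns (ordered first, as in (4.46)).
Vec cpr_apply(const Vec& f, std::size_t np,
              const Op& spmv_A,             // y = A x for the full Jacobian
              const Op& amg_pressure_solve, // approximate solve of A_pp y_o = f_o
              const Op& ilu_solve)          // approximate solve of ILU(A) y_r = r
{
    // Stage 1: y = P [ AMG(A_pp)^{-1} (R f) ].
    Vec fo(f.begin(), f.begin() + np);
    Vec yo = amg_pressure_solve(fo);
    Vec y(f.size(), 0.0);
    for (std::size_t i = 0; i < np; ++i) y[i] = yo[i];

    // Stage 2: r = f - A y, then y = y + ILU(A)^{-1} r.
    Vec Ay = spmv_A(y);
    Vec r(f.size());
    for (std::size_t i = 0; i < f.size(); ++i) r[i] = f[i] - Ay[i];
    Vec dy = ilu_solve(r);
    for (std::size_t i = 0; i < f.size(); ++i) y[i] += dy[i];
    return y;
}

int main() {
    // Toy usage with stand-in operators: A = I, and both "solves" return their input.
    Op identity = [](const Vec& v) { return v; };
    Vec f = {1.0, 2.0, 3.0};
    Vec y = cpr_apply(f, 1, identity, identity, identity);
    std::printf("y = (%g, %g, %g)\n", y[0], y[1], y[2]);
    return 0;
}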

4.4 Parallel Implementation

4.4.1 Parallel Components

The Krylov subspace algorithms and the AMG algorithms are composed of a set of components.

It is necessary to implement these building blocks first because the algorithms are built upon them.

Let~x,~y, and~z represent vectors, and let α and β be scalars. The basic components are composed of

SpMV (sparse matrix-vector multiplication), scalar-vector multiplication, vector addition and dot product, which are written as, respectively,

~y = A~x, (4.50)

78 ~y = α~x, (4.51)

~z =~x +~y, (4.52)

α =(~x,~y). (4.53)

Based on these basic operations, some extended components are developed as

~y = α~x + β~y, (4.54)

~z = α~x + β~y, (4.55)

~z = αA~x + β~y, (4.56) and

α = (\vec{x}, \vec{x})^{1/2}. \qquad (4.57)

The last one is the Euclidean norm.
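As an illustration of how such components map onto the GPU, a minimal CUDA sketch of the extended update (4.54), y = αx + βy, is shown below. The kernel name, launch configuration and use of unified memory are choices of this sketch, not a description of the platform's actual kernels.

// One thread per vector entry: y[i] = alpha * x[i] + beta * y[i]  (component (4.54)).
#include <cstdio>
#include <cuda_runtime.h>

__global__ void axpby(int n, double alpha, const double* x, double beta, double* y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = alpha * x[i] + beta * y[i];
}

int main() {
    const int n = 1 << 20;
    double *x, *y;
    cudaMallocManaged(&x, n * sizeof(double));   // unified memory keeps the sketch short
    cudaMallocManaged(&y, n * sizeof(double));
    for (int i = 0; i < n; ++i) { x[i] = 1.0; y[i] = 2.0; }

    int block = 256, grid = (n + block - 1) / block;
    axpby<<<grid, block>>>(n, 0.5, x, 2.0, y);   // y = 0.5*x + 2.0*y
    cudaDeviceSynchronize();

    std::printf("y[0] = %f (expected 4.5)\n", y[0]);
    cudaFree(x); cudaFree(y);
    return 0;
}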

According to the mathematical definitions of the basic operations, all the components can be

designed and implemented in parallel easily except for the SpMV, which is a little complicated.

Because each CUDA core of GPU is responsible for a row in matrix A, we will calculate the SpMV

by

A\vec{x} =
\begin{pmatrix}
a_{11} & a_{12} & \cdots & a_{1n} \\
a_{21} & a_{22} & \cdots & a_{2n} \\
\vdots & \vdots & \ddots & \vdots \\
a_{n1} & a_{n2} & \cdots & a_{nn}
\end{pmatrix}
\begin{pmatrix}
x_1 \\ x_2 \\ \vdots \\ x_n
\end{pmatrix}
=
\begin{pmatrix}
(a_{11}, a_{12}, \ldots, a_{1n}) \cdot \vec{x} \\
(a_{21}, a_{22}, \ldots, a_{2n}) \cdot \vec{x} \\
\vdots \\
(a_{n1}, a_{n2}, \ldots, a_{nn}) \cdot \vec{x}
\end{pmatrix}. \qquad (4.58)

Therefore, each part on the right-hand side can be computed by a CUDA core. All the cores will run in parallel.

As the HEC format used for storing a matrix is composed of two parts, ELL and CSR, the SpMV algorithm also needs to be implemented in two parts. Algorithm 11 describes the SpMV algorithm.
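A minimal CUDA sketch of this two-part SpMV is given below, with one thread per row in each kernel. The storage layout is an assumption of the sketch (column-major ELL with -1 padding for unused slots, and standard CSR for the overflow entries), not necessarily the exact layout of the platform's HEC implementation; the launch configuration follows the same pattern as the previous sketch.

// ELL part: y[i] = sum over the fixed-width slots of row i (kernel writes y).
#include <cuda_runtime.h>

__global__ void spmv_ell(int n, int width, const int* ell_col, const double* ell_val,
                         const double* x, double* y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    double sum = 0.0;
    for (int k = 0; k < width; ++k) {
        int c = ell_col[k * n + i];            // column-major ELL: slot k of row i
        if (c >= 0) sum += ell_val[k * n + i] * x[c];
    }
    y[i] = sum;
}

// CSR part: accumulate the overflow entries of row i on top of the ELL result (kernel adds to y).
__global__ void spmv_csr_add(int n, const int* row_ptr, const int* col, const double* val,
                             const double* x, double* y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    double sum = 0.0;
    for (int k = row_ptr[i]; k < row_ptr[i + 1]; ++k)
        sum += val[k] * x[col[k]];
    y[i] += sum;
}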

4.4.2 Parallel Triangular Solver

An ILU preconditioner system finally needs to be transformed into triangular systems to solve. A parallel triangular solver was developed for this purpose. The principle of this solver comes from the level schedule method [19, 22].

Because an upper triangular system can be changed into a lower one to solve, we only discuss

Algorithm 11 SpMV algorithm [38].
1: for i = 1, ..., n do          ⊲ Use one GPU kernel to deal with this loop
2:   The i-th CUDA core calculates the i-th row of the ELL part
3: end for
4: for i = 1, ..., n do          ⊲ Use one GPU kernel to deal with this loop
5:   The i-th CUDA core calculates the i-th row of the CSR part
6: end for

the lower triangular system. The original idea of the level schedule method is to divide the equations into different levels by their unknowns. Different levels have different computing sequences, but the equations on the same level can be solved in parallel.

The lower triangular system to be solved is given as

L\vec{x} = \vec{b}. \qquad (4.59)

The level of an unknown x_i, 1 ≤ i ≤ n, is defined by

l(i) = 1 + \max_{j}\{\, l(j) : L_{ij} \neq 0 \,\}, \quad i = 1, 2, \ldots, n, \qquad (4.60)

where l(i) represents the level of the i-th unknown, zero initially; L_{ij} stands for the (i, j)-th element of L; and n is the number of rows [19].

Because all the equations in the same level can be solved in parallel, they are suitable for processing by a GPU kernel. After the levels are defined and L is reordered in each level, the level schedule method for solving (4.59) is described in Algorithm 12, where nlev stands for the number of levels. The degree of parallelism depends on the pattern of the matrix L. The two extreme situations are nlev = 1 and nlev = n, which correspond to a full parallel case and a full serial case, respectively.

Algorithm 12 Parallel triangular solver.
1: for i = 1, ..., nlev do

2: start = level(i) ⊲ the start row in level i

3: end = level(i + 1) - 1 ⊲ the end row in level i

4: for j = start ...end do ⊲ Use one GPU kernel to deal with this loop

5: solve the j-th row

6: end for

7: end for
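The levels in (4.60) can be computed in a single pass over the rows of L when the matrix is stored row by row. A minimal serial sketch for a CSR-stored lower triangular matrix is given below (the array names are assumptions of the sketch); the resulting level array is what Algorithm 12 uses to group rows into parallel batches.

// Compute the level l(i) of every row of a lower triangular matrix L stored in CSR,
// where row_ptr/col hold only the strictly lower off-diagonal entries of each row
// (columns j < i). Rows in the same level are independent and can be solved in parallel.
#include <algorithm>
#include <vector>

std::vector<int> compute_levels(int n, const std::vector<int>& row_ptr,
                                const std::vector<int>& col) {
    std::vector<int> level(n, 0);
    for (int i = 0; i < n; ++i) {
        int max_dep = 0;                          // max level among rows this row depends on
        for (int k = row_ptr[i]; k < row_ptr[i + 1]; ++k)
            max_dep = std::max(max_dep, level[col[k]]);
        level[i] = 1 + max_dep;                   // definition (4.60)
    }
    return level;
}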

4.4.3 RAS

The domain decomposition is adopted for dividing a matrix into some sub-matrices to implement the principle of divide-and-conquer. Cai et al. proposed the RAS (restricted additive Schwarz) method for domain decomposition [60], which can optimize the parallel structure of a precondi- tioner matrix.

Figure 4.38 shows a sample process of RAS with ILU(0). In general, the diagonal blocks are dense compared to the off-diagonal blocks; see Figure 4.38-(1). Before domain decomposition, the elements should be packed tightly along the diagonal by permutation. Some existing methods can complete the permutation and decomposition. METIS, which provides a quasi-optimal partition method, is used in this research [61]. If the sparse elements outside the diagonal blocks are removed, the remaining detached diagonal blocks, which belong to different domains, can be solved in parallel; see Figure 4.38-(2).

The element removal may cause a decrease in the solution accuracy of a preconditioner sys- tem. An overlap technique is used as a compensatory way to restore calculation accuracy. The idea is to include the neighbor elements into the diagonal blocks; see Figure 4.38-(3). Multiple layers of elements can be included in the overlap. However, more elements would deteriorate the parallel performance. Figure 4.38-(4) shows the new matrix composed of all the overlapped blocks arranged from head to tail. The next step is to apply an ILU factorization such as ILU(0) on the

82 new matrix; see Figure 4.38-(5). The whole procedure is summarized in Algorithm 13. After the solution of the ILU(0) system, we only keep the values of the unknowns in the solution, discarding the parts corresponding to the rows of overlap.

Figure 4.38: RAS ILU(0) process ((1) A, (2) decomposition, (3) overlap, (4) head to tail, (5) ILU(0)).

Algorithm 13 RAS ILU process.
1: Use permutation to make more elements lie along the diagonal

2: Remove the elements outside the diagonal block

3: Overlap the diagonal block by including neighbor elements

4: Arrange the diagonal blocks head to tail

5: Apply ILU factorization on the diagonal blocks

4.4.4 Parallel Algorithms

Based on the components and preconditioning schemes described above, different Krylov subspace methods can be developed. The component analysis of the BiCGSTAB and GMRES algorithms, which are the most commonly used ones, is presented in Algorithms 14 and 15. For detailed descriptions of the algorithms, refer to the books of Saad and Barrett et al. [19, 21].

Algorithm 14 BiCGSTAB algorithm.
1: r_0 = b - A x_0; x_0 is an initial guess vector      ⊲ SpMV; Vector update
2: for k = 1, 2, ... do
3:   ρ_{k-1} = (r_0, r)                                 ⊲ Dot product
4:   if ρ_{k-1} = 0 then
5:     Fails
6:   end if
7:   if k = 1 then
8:     p = r
9:   else
10:    β_{k-1} = (ρ_{k-1}/ρ_{k-2})(α_{k-1}/ω_{k-1})
11:    p = r + β_{k-1}(p - ω_{k-1} v)                    ⊲ Vector update
12:  end if
13:  Solve p* from M p* = p                              ⊲ Preconditioner system
14:  v = A p*                                            ⊲ SpMV
15:  α_k = ρ_{k-1}/(r_0, v)                              ⊲ Dot product
16:  s = r - α_k v                                       ⊲ Vector update
17:  if ||s||_2 is satisfied then                        ⊲ Dot product
18:    x = x + α_k p*                                    ⊲ Vector update
19:    Stop
20:  end if
21:  Solve s* from M s* = s                              ⊲ Preconditioner system
22:  t = A s*                                            ⊲ SpMV
23:  ω_k = (t, s)/||t||_2^2                              ⊲ Dot product
24:  x = x + α_k p* + ω_k s*                             ⊲ Vector update
25:  r = s - ω_k t                                       ⊲ Vector update
26:  if ||r||_2 is satisfied or ω_k = 0 then             ⊲ Dot product
27:    Stop
28:  end if
29: end for

Algorithm 15 GMRES(m) algorithm.
1: Given A, b, x_0
2: Assemble M
3: r_0 = b - A x_0                                       ⊲ SpMV; Vector update
4: for k = 1, 2, ... do
5:   Solve r from M r = r_0                              ⊲ Preconditioner system
6:   β = ||r||_2                                         ⊲ Dot product
7:   v_1 = r/β                                           ⊲ Vector update
8:   for j = 1 : m do
9:     Solve w from M w = A v_j                          ⊲ SpMV; Preconditioner system
10:    for i = 1 : j do
11:      h_{ij} = (w, v_i)                               ⊲ Dot product
12:      w = w - h_{ij} v_i                              ⊲ Vector update
13:    end for
14:    h_{j+1,j} = ||w||_2                               ⊲ Dot product
15:    if h_{j+1,j} = 0 then
16:      Stop
17:    else
18:      v_{j+1} = w/h_{j+1,j}                           ⊲ Vector update
19:    end if
20:  end for
21:  Define V_m and H̄_m
22:  Compute r_m and y_m of the minimization problem
23:  x_m = x_0 + V_m y_m                                 ⊲ SpMV; Vector update
24:  r_0 = b - A x_m                                     ⊲ Vector update
25:  if r_0 is satisfied then
26:    Stop
27:  end if
28: end for

Chapter 5

NUMERICAL EXPERIMENTS(1)

This chapter is composed of numerical experiments with preconditioned solvers on GPUs. Parallelism and convergence are the two major performance indicators considered. Two sets of algorithms are developed, for CPU and for GPU. The algorithm based on CPU serves as a reference to calculate the speedup of the algorithm based on GPU. As different GPU environments and matrices have different effects, we perform the tests on two GPU environments, K20X and V100-PCIe-16GB, and test two sets of matrices: Poisson and SPE10. The tests present a comprehensive performance analysis and comparison of the solvers BiCGSTAB, GMRES and AMG; the preconditioners ILU, BILU and CPR; and the domain decomposition realized by RAS. As the specialized preconditioners developed for the black oil model, CPR ILU(k) and CPR ILUT both show considerable improvement of parallel performance and convergence performance compared to a traditional preconditioner.

5.1 Introduction

This solution platform is developed using C/C++ and CUDA on the Linux operating system. The

platform only works on NVIDIA GPUs.

Two GPU environments are used in the following solver test experiments. The parameters of

the environments are given in Table 5.1. The specific machine for each experiment is automatically

assigned by a cluster, so the running results of different tests may have a little difference even if

the same experimental configuration is used. However, the difference is trivial and has no impact

on the conclusion of each experiment.

The tests involve two types of matrices. Table 5.2 lists four 3-D Poisson matrices that are derived from the discretization of Poisson equations. Table 5.3 collects a series of Jacobian matrices,

Table 5.1: Workstation environments.
Name    CPU                    GPU                CUDA cores
Env 1   Intel Xeon E5-2680     K20X               2688
Env 2   Intel Xeon Gold 6148   V100-PCIe-16GB     5120

which are difficult to solve, from the running of the SPE10 model 2 using a black oil simulator [62]. The water drive recovery requires two unknowns, P_o and S_w. As the SPE10 model 2 includes 1.122 × 10^6 cells (60 × 220 × 85 cells) and five wells, a Jacobian matrix has 2,244,005 rows.

Table 5.2: 3-D Poisson matrices.
Name       Dimensions         Rows        Nonzeros     Nonzeros/rows
Poisson 1  100 × 100 × 100    1,000,000    6,940,000   6.94
Poisson 2  130 × 130 × 130    2,197,000   15,277,600   6.95
Poisson 3  150 × 150 × 150    3,375,000   23,490,000   6.96
Poisson 4  200 × 200 × 200    8,000,000   55,760,000   6.97

Table 5.3: Matrices from SPE10 model 2. Name Nonzeros Nonzeros/rows

SPE10 1 16,501,854 7.35

SPE10 2 18,269,261 8.14

SPE10 3 20,410,098 9.10

SPE10 4 23,629,318 10.53

A large-scale sparse linear system is expressed by

A~x =~b. (5.1)

Let \vec{x}_0 be the initial guess. The initial residual is given as

\vec{r}_0 = \vec{b} - A\vec{x}_0. \qquad (5.2)

If \vec{x} represents a numerical solution, the corresponding residual can be written as

\vec{r} = \vec{b} - A\vec{x}. \qquad (5.3)

Then, the absolute error, the relative error to the initial residual and the relative error to the right-hand side are defined by, respectively,

ε_{abs} = \|\vec{r}\|_2, \qquad (5.4)

ε_{rel} = \frac{\|\vec{r}\|_2}{\|\vec{r}_0\|_2}, \qquad (5.5)

and

ε_{rhs} = \frac{\|\vec{r}\|_2}{\|\vec{b}\|_2}. \qquad (5.6)

The convergence tolerance of the experiments is set to 10^{-6}. When any of ε_{abs}, ε_{rel} and ε_{rhs} reaches

the tolerance, the iteration is considered convergent. Usually, εrel arrives first. As a reference for the evaluation of the parallel performance of solvers on GPU, the single

thread execution time, the serial time, of the same solver on the host (CPU) is used to compute the

speedup that is expressed by

speedup = \frac{\text{CPU time}}{\text{GPU time}}. \qquad (5.7)

The unit of time in the GPU experiments is the second. A considerable speedup represents high parallelism. The number of iterations is used to judge the convergence performance. From the

experiments in this chapter, we can see that the number of iterations of each test running on CPU

and GPU is basically the same, which shows that the convergence logic of the CPU algorithm and the GPU algorithm is consistent; one merely uses a serial implementation and the other a parallel implementation.

5.2 Environments

We first compare Env 1 and Env 2 in terms of performance in this experiment. The BiCGSTAB

with ILU(0) is tested. The results of the Poisson matrices are included in Tables 5.4 and 5.5. The

89 results of the SPE10 matrices are included in Tables 5.6 and 5.7. Figures 5.1 and 5.2 present

the comparison between Env 1 and Env 2 for Poisson and SPE10 separately. All the tables and

figures in this experiment are prefixed with “[Env]”.

For the same solver and matrix, the number of iterations is the same for Env 1 and Env 2. The

CPU time of the two environments is close, but the GPU time is quite different. So the speedup of

Env 1 and Env 2 is different. The parallel performance of GPU in Env 2 is much better than that

of GPU in Env 1.

Table 5.4: [Env] BiCGSTAB, ILU(0), Poisson, Env 1.
Matrix     CPU time   CPU Ite   GPU time   GPU Ite   Speedup

Poisson 1 4.50 56 0.84 55 5.34

Poisson 2 11.65 65 1.54 62 7.59

Poisson 3 21.67 79 2.56 75 8.46

Poisson 4 57.15 85 6.53 93 8.76

Table 5.5: [Env] BiCGSTAB, ILU(0), Poisson, Env 2. Matrix CPU time CPU Ite GPU time GPU Ite Speedup

Poisson 1 5.22 56 0.34 55 15.33

Poisson 2 13.83 65 0.55 62 24.95

Poisson 3 25.65 79 0.82 75 31.15

Poisson 4 66.15 85 1.71 93 38.76

Table 5.6: [Env] BiCGSTAB, ILU(0), SPE10, Env 1. Matrix CPU time CPU Ite GPU time GPU Ite Speedup

SPE10 1 21.44 122 3.23 122 6.64

SPE10 2 78.69 462 12.40 472 6.35

SPE10 3 140.45 790 24.83 911 5.66

SPE10 4 322.18 1534 46.09 1626 6.99

Table 5.7: [Env] BiCGSTAB, ILU(0), SPE10, Env 2. Matrix CPU time CPU Ite GPU time GPU Ite Speedup

SPE10 1 21.59 122 1.11 122 19.48

SPE10 2 85.92 462 4.21 472 20.42

SPE10 3 155.02 790 8.40 911 18.46

SPE10 4 322.16 1534 15.62 1626 20.62

Figure 5.1: [Env] Env 1 vs. Env 2 for Poisson.  Figure 5.2: [Env] Env 1 vs. Env 2 for SPE10.

5.3 Solvers

BiCGSTAB and GMRES are two popular iterative methods. We use GMRES with ILU(0) to run the Poisson and SPE10 matrices on Env 2. The results are included in Tables 5.8 and 5.9.

The restarted number of GMRES is set to 20. In order to reduce the communication resources between the host and device, the convergence condition is checked after the inner loop of GMRES is completed. That is, the number of iterations of GMRES on GPU is an integer multiple of 20.

Based on Tables 5.5, 5.7, 5.8, and 5.9, Figures 5.3 and 5.4 present the comparison of Poisson and SPE10 between BiCGSTAB and GMRES, respectively. The captions of the tables and graphs added in this experiment are prefixed with “[Svr]”.

From the aspect of running time, BiCGSTAB is more efficient than GMRES in terms of both

CPU time and GPU time. By GMRES, the SPE10 matrices take more iterations to converge and

91 the SPE10 4 cannot even converge after over 5,000 iterations; see Table 5.9. Figure 5.3 shows that

GMRES needs more iterations than BiCGSTAB for the Poisson matrices. So BiCGSTAB displays a better convergence performance. However, GMRES has a higher speedup than BiCGSTAB, which indicates that the GMRES has a better parallel performance. For the SPE10 matrices, Fig- ure 5.4 gives a similar conclusion.

Table 5.8: [Svr] GMRES, ILU(0), Poisson, Env 2. Matrix CPU time CPU Ite GPU time GPU Ite Speedup

Poisson 1 13.30 180 0.66 180 20.09

Poisson 2 46.18 260 1.44 260 32.09

Poisson 3 93.63 340 2.43 340 38.50

Poisson 4 356.62 540 6.85 540 52.04

Table 5.9: [Svr] GMRES, ILU(0), SPE10, Env 2. Matrix CPU time CPU Ite GPU time GPU Ite Speedup

SPE10 1 60.91 380 2.18 380 27.88

SPE10 2 255.24 1540 8.88 1560 28.76

SPE10 3 706.14 4120 26.94 4620 26.21

SPE10 4 over 5000 over 5000

5.4 ILU

The performance of ILU(k) and ILUT is tested in this experiment. The tables and graphs of this experiment are named with a prefixed “[ILU]”.

ILU(k)

Table 5.10 shows the results of SPE10 1 by BiCGSTAB with ILU(k) on Env 2. Figure 5.5 shows the changes of GPU iterations and speedup. The number of fill-ins increases as k becomes larger. More elements improve the accuracy of calculations but affect parallel performance. So the

Figure 5.3: [Svr] BiCGSTAB vs. GMRES for Poisson.  Figure 5.4: [Svr] BiCGSTAB vs. GMRES for SPE10.

iterations and speedup should both decrease when k increases. However, such tendencies are not

obvious in this test though the speedup has a general downward trend. Both the GPU iterations

and speedup show a zigzag pattern. Another test is done by using GMRES with ILU(k) to test

Poisson 1 on Env 1. Table 5.11 and Figure 5.6 present the same phenomenon.

Table 5.10: [ILU] BiCGSTAB, ILU(k), SPE10 1, Env 2.
k    CPU time   CPU ite   GPU time   GPU ite   Speedup

0 27.52 122 1.12 122 24.52

1 27.38 119 1.49 114 18.43

2 27.59 123 1.41 126 19.61

3 24.08 106 1.51 112 15.92

4 29.43 131 1.48 121 19.85

ILUT

ILUT has two parameters: p and τ. A large τ will screen out more elements, but a large p will keep more elements. Therefore, if we increase the value of τ, the parallel performance will increase; at the same time, the convergence performance will decrease. Similarly, if we increase the value of p, the parallel performance will decrease, but the convergence performance will increase.

Two tests are performed for verification of this inference. The first one is the solution of SPE10 1

Table 5.11: [ILU] GMRES, ILU(k), Poisson 1, Env 1.
k    CPU time   CPU ite   GPU time   GPU ite   Speedup

0 10.77 180 1.81 180 5.93

1 9.65 160 1.64 160 5.89

2 10.68 178 1.82 180 5.88

3 9.63 160 1.63 160 5.89

4 10.55 177 1.84 180 5.75

Figure 5.5: [ILU] BiCGSTAB, ILU(k).  Figure 5.6: [ILU] GMRES, ILU(k).

by BiCGSTAB with ILUT(fixed p = 7) on Env 2; see Table 5.12 and Figure 5.7. The value of p comes from the average nonzeros/rows of the matrix. The results show that the GPU iterations and speedup both go up with τ increasing. The second one is the solution of SPE10 1 by BiCGSTAB with ILUT(fixed τ = 0.03) on Env 2. Table 5.13 and Figure 5.8 show that the GPU iterations and speedup both go down when p increases.

Figure 5.5: [ILU] BiCGSTAB, ILU(k). Figure 5.6: [ILU] GMRES, ILU(k). by BiCGSTAB with ILUT(fixed p = 7) on Env 2; see Table 5.12 and Figure 5.7. The value of p comes from the average nonzeros/rows of the matrix. The results show that the GPU iterations and speedup both go up with τ increasing. The second one is the solution of SPE10 1 by BiCGSTAB with ILUT(fixed τ = 0.03) on Env 2. Table 5.13 and Figure 5.8 show that the GPU iterations and speedup both go down when p increases.

ILUT filters elements according to the value size of the elements, but ILU(k) filters elements according to the pattern of the matrix. So ILUT is generally more beneficial to improve conver- gence, but ILU(k) is more useful to improve parallelism. By a comparison of Table 5.10 with

Tables 5.12 and 5.13, ILU(k) has higher value in both GPU iterations and speedup than ILUT.

Table 5.12: [ILU] BiCGSTAB, ILUT(fixed p), SPE10 1, Env 2. p τ CPU time CPU ite GPU time GPU ite Speedup

7 0.01 11.28 44 3.34 44 3.38

7 0.02 11.85 47 2.84 47 4.17

7 0.03 14.50 59 2.77 59 5.24

7 0.04 16.54 68 2.66 70 6.22

7 0.05 17.20 71 2.26 69 7.61

Table 5.13: [ILU] BiCGSTAB, ILUT(fixed τ), SPE10 1, Env 2. p τ CPU time CPU ite GPU time GPU ite Speedup

5 0.03 19.68 80 2.72 81 7.23

6 0.03 17.34 71 2.47 61 7.04

7 0.03 14.44 59 2.76 59 5.24

8 0.03 12.56 51 2.60 50 4.83

9 0.03 11.40 46 2.79 49 4.09

Figure 5.7: [ILU] BiCGSTAB, ILUT(fixed p).  Figure 5.8: [ILU] BiCGSTAB, ILUT(fixed τ).

5.5 BILU

The block-wise ILU preconditioner is tested in this experiment. All the tables and figures are pre-

fixed with “[BILU]”. Table 5.14 shows the results of Poisson 1 solved by GMRES with BILU(0)

95 on Env 1. There are more fill-ins when the block size increases. Then the number of iterations should be reduced, and the speedup should go down. From Figure 5.9, the data do not show a decline in speedup and GPU iterations until the block size is larger than 4. Table 5.15 lists the results of SPE10 1 by GMRES with BILUT(7, 0.03) on Env 1. In the case where the block size equals 5, there is no change in parallel performance and convergence performance compared to the case where the block size is 1; see Figure 5.10. Therefore, although the BILU is an optional preconditioner for improving convergence, it requires a large block size to achieve this goal. Table 5.14: [BILU] GMRES, BILU(0), Poisson 1, Env 1. Block size CPU time CPU ite GPU time GPU ite Speedup

1 10.84 180 1.84 180 5.91

4 10.80 180 1.81 180 5.95

16 8.75 160 1.53 160 5.73

64 1.06 68 0.40 80 2.67

Table 5.15: [BILU] GMRES, BILUT(7, 0.03), SPE10 1, Env 1. Block size CPU time CPU ite GPU time GPU ite Speedup

1 17.57 120 7.29 120 2.41

5 17.50 120 7.21 120 2.43

Figure 5.9: [BILU] BILU(0), Poisson 1.  Figure 5.10: [BILU] BILUT, SPE10 1.

5.6 RAS

RAS is adopted in the research for domain decomposition. More partitions provide high parallel

performance. The tables and figures related to this experiment are prefixed with “[RAS]”. Ta-

ble 5.16 shows the results of the solution of Poisson 1 by BiCGSTAB with RAS ILU(0) on Env 2.

Figures 5.11 and 5.12 present the changes of GPU iterations and speedup when the number of par- titions increases. Because all the curves show an overall increase with more partitions, the parallel performance is improved. However, the convergence performance is affected. The overlap is used to compensate for the loss of computational accuracy due to partition. We expect to see a higher overlap in exchange for better convergence. Figure 5.11 does give such a result of GPU iterations.

However, Figure 5.12 does not show the expected speedup decrease. Table 5.16: [RAS] BiCGSTAB, RAS ILU(0), Poisson 1, Env 2. Part Overlap CPU time CPU ite GPU time GPU ite Speedup

1 0 5.10 56 0.34 55 14.97

8 0 6.60 73 0.28 77 23.27

64 0 7.34 82 0.21 80 34.14

512 0 6.04 70 0.17 82 35.67

1 1 5.12 56 0.32 55 15.76

8 1 5.97 64 0.23 63 25.86

64 1 5.24 54 0.17 60 30.02

512 1 6.15 60 0.14 58 45.01

1 2 5.08 56 0.33 55 15.63

8 2 5.66 59 0.23 62 24.78

64 2 5.58 53 0.16 54 34.29

512 2 6.66 55 0.16 60 42.09

The other test is the solution of SPE10 1 by BiCGSTAB with RAS ILUT(7, 0.03) on Env 2.

When the number of partitions goes up, Figures 5.13 and 5.14 present the increase of GPU itera-

Figure 5.11: [RAS] Poisson 1, GPU ite.  Figure 5.12: [RAS] Poisson 1, Speedup.

tions and speedup for each overlap configuration, respectively. The partition functionality provided by RAS has been achieved. However, the overlap does not have the anticipated influence.

Table 5.17: [RAS] BiCGSTAB, RAS ILUT(7, 0.03), SPE10 1, Env 2. Part Overlap CPU time CPU ite GPU time GPU ite Speedup

1 0 13.62 59 2.77 59 4.92

8 0 13.86 60 1.36 60 10.22

64 0 19.44 85 1.04 85 18.62

512 0 21.14 92 0.69 92 30.56

1 1 13.62 59 2.77 59 4.91

8 1 46.52 199 3.29 143 14.12

64 1 25.04 101 1.37 103 18.29

512 1 34.26 125 1.09 123 31.43

1 2 13.62 59 2.81 59 4.85

8 2 15.80 65 1.46 61 10.83

64 2 38.50 143 1.68 117 22.93

512 2 70.91 216 1.95 190 36.43

Figure 5.13: [RAS] SPE10 1, GPU ite.  Figure 5.14: [RAS] SPE10 1, Speedup.

5.7 AMG

The coarsening and prolongation algorithms in the AMG solvers use RS and RSSTD introduced

by Ruge and Stüben [32, 28]. The damped Jacobi is adopted as the smoother. The default value for

levels of AMG methods is set to 32. F-cycle with µ = 1 is used in the tests. The tables and figures

in this part are prefixed with “[AMG]”.

The first test focuses on the performance of different cycle types for the Poisson matrices. The

number of iterations of the pre-smoothing and post-smoothing is set to 3. The results are collected

in Tables 5.18 to 5.21. It is easy to see that the V-cycle has the best speedup, but it also needs

more iterations; see Figures 5.15 and 5.16.

We use the V-cycle for Poisson 1 to test the influence of different levels. Table 5.22 gives the results of the levels from 8 to 64. Figure 5.17 shows that the iterations stay constant and the speedup is influenced slightly when the levels are set at different values. Table 5.23 collects the results of different iterations of pre-smoothing and post-smoothing. Figure 5.18 shows that the number of

GPU iterations decreases rapidly and then reaches a stable value, while the speedup increases only slightly when the iterations of pre-smoothing and post-smoothing are increased from 1 to 5.

Table 5.18: [AMG] AMG, Poisson 1, Env 2. Cycle type CPU time CPU ite GPU time GPU ite Speedup

V 2.87 6 0.06 6 45.48

W 9.72 5 0.99 5 9.85

F 5.76 5 0.18 5 32.24

Table 5.19: [AMG] AMG, Poisson 2, Env 2. Cycle type CPU time CPU ite GPU time GPU ite Speedup

V 6.57 6 0.10 6 62.90

W 22.70 5 0.89 5 25.53

F 13.36 5 0.26 5 52.19

Table 5.20: [AMG] AMG, Poisson 3, Env 2. Cycle type CPU time CPU ite GPU time GPU ite Speedup

V 10.96 6 0.16 6 67.49

W 47.05 5 2.19 5 21.52

F 24.22 5 0.42 5 57.69

Table 5.21: [AMG] AMG, Poisson 4, Env 2. Cycle type CPU time CPU ite GPU time GPU ite Speedup

V 60.02 7 0.40 7 148.76

W 227.04 5 3.56 5 63.77

F 113.52 5 0.83 5 137.23

Table 5.22: [AMG] AMG, V-cycle, Poisson 1, Env 2: level Level CPU time CPU ite GPU time GPU ite Speedup

8 2.87 6 0.07 6 43.37

16 2.86 6 0.06 6 46.19

32 2.87 6 0.06 6 45.23

64 2.85 6 0.06 6 46.27

Figure 5.15: [AMG] Poisson, GPU ite.  Figure 5.16: [AMG] Poisson, Speedup.

Table 5.23: [AMG] AMG, V-cycle, Poisson 1, Env 2: smoothing iteration. Pre-smoothing Post-smoothing CPU time CPU ite GPU time GPU ite Speedup

1 1 3.14 13 0.07 13 46.44

2 2 2.58 7 0.06 7 46.10

3 3 2.93 6 0.06 6 46.87

4 4 3.05 5 0.06 5 47.42

5 5 3.62 5 0.08 5 47.81

Figure 5.17: [AMG] Level.  Figure 5.18: [AMG] Smoothing iteration.

5.8 CPR

The CPR preconditioner in this research is developed for the solution of the SPE10 matrices. This experiment tests the performance of the solvers using CPR. The AMG part of CPR is set as the

V-cycle with 32 levels and three iterations for both pre-smoothing and post-smoothing. The tables and figures of this experiment are prefixed with “[CPR]”.

CPR ILU(0) vs. ILU(0)

The current tests set the ILU part of CPR as ILU(0). Tables 5.24 and 5.25 include the results of

CPR ILU(0) preconditioned BiCGSTAB and GMRES on Env 2. Both BiCGSTAB and GMRES converge mostly in tens of iterations, with a speedup of over 20.

Based on Tables 5.7, 5.9, 5.24, and 5.25, the comparison curves are drawn and shown in

Figures 5.19 and 5.20. The curves present that CPR ILU(0) greatly reduces the number of iter- ations compared to ILU(0) for both BiCGSTAB and GMRES. Meanwhile, the speedups are also improved significantly.

Table 5.24: [CPR] BiCGSTAB, CPR ILU(0), SPE10, Env 2. Matrix CPU time CPU Ite GPU time GPU Ite Speedup

SPE10 1 6.16 6 0.18 6 33.57

SPE10 2 14.65 14 0.41 14 36.00

SPE10 3 24.47 23 0.68 23 35.88

SPE10 4 46.87 43 1.29 45 36.32

Table 5.25: [CPR] GMRES, CPR ILU(0), SPE10, Env 2. Matrix CPU time CPU Ite GPU time GPU Ite Speedup

SPE10 1 6.58 11 0.33 20 20.01

SPE10 2 25.20 41 0.64 40 39.64

SPE10 3 40.45 65 1.28 80 31.64

SPE10 4 102.10 160 2.53 160 40.34

102 6000 60 6000 60 BiCGSTAB, CPR_ILU(0), SPE10, Env_2: GPU Ite GMRES, CPR_ILU(0), SPE10, Env_2: GPU Ite BiCGSTAB, CPR_ILU(0), SPE10, Env_2: Speedup GMRES, CPR_ILU(0), SPE10, Env_2: Speedup BiCGSTAB, ILU(0), SPE10, Env_2: GPU Ite GMRES, ILU(0), SPE10, Env_2: GPU Ite 5000 BiCGSTAB, ILU(0), SPE10, Env_2: Speedup 50 5000 GMRES, ILU(0), SPE10, Env_2: Speedup 50

4000 40 4000 40

3000 30 3000 30 Speedup Speedup GPU ite GPU ite

2000 20 2000 20

1000 10 1000 10

0 0 0 0 1 2 3 4 1 2 3 4 SPE10 matrix SPE10 matrix

Figure 5.19: [CPR] CPR ILU(0) vs. ILU(0) for Figure 5.20: [CPR] CPR ILU(0) vs. ILU(0) for

BiCGSTAB GMRES

To take a close look at the convergence characteristics, we take SPE10 3 as an example for

further analysis. Tables 5.26 and 5.27 list the changes of the relative error εrel for BiCGSTAB

and GMRES as the number of iterations grows. The logarithmic values of ε_rel are also calculated. Figures 5.21 and 5.22 show that lg ε_rel of CPR ILU(0) falls below -6 in a steep trend, but lg ε_rel of ILU(0) declines very slowly.

Figure 5.21: [CPR] lg ε_rel for BiCGSTAB.  Figure 5.22: [CPR] lg ε_rel for GMRES.

103 Table 5.26: [CPR] εrel comparison for BiCGSTAB, SPE10 3, Env 2. CPR ILU(0) ILU(0)

GPU Ite εrel lgεrel εrel lgεrel 1 6.02E-02 -1.22 4.05E-02 -1.39

2 5.16E-02 -1.29 3.16E-02 -1.50

3 4.84E-02 -1.31 5.24E-03 -2.28

4 5.17E-02 -1.29 2.53E-03 -2.60

5 2.22E-02 -1.65 2.02E-03 -2.69

6 6.04E-03 -2.22 2.00E-03 -2.70

7 1.48E-03 -2.83 1.95E-03 -2.71

8 7.99E-04 -3.10 1.93E-03 -2.71

9 6.39E-04 -3.19 1.98E-03 -2.70

10 2.00E-04 -3.70 1.96E-03 -2.71

11 1.73E-04 -3.76 1.92E-03 -2.72

12 3.38E-05 -4.47 1.92E-03 -2.72

13 4.61E-04 -3.34 1.93E-03 -2.71

14 1.09E-05 -4.96 1.96E-03 -2.71

15 8.97E-06 -5.05 2.06E-03 -2.69

16 5.14E-06 -5.29 1.94E-03 -2.71

17 5.64E-07 -6.25 1.93E-03 -2.71

Table 5.27: [CPR] ε_rel comparison for GMRES, SPE10 3, Env 2. CPR ILU(0) ILU(0)

GPU Ite εrel lgεrel εrel lgεrel 20 4.67E-03 -2.33 6.03E-02 -1.22

40 1.35E-04 -3.87 1.61E-02 -1.79

60 1.42E-06 -5.85 1.50E-02 -1.83

80 2.10E-08 -7.68 1.22E-02 -1.92

CPR ILUT

The following tests study the influence of the characteristic parameters (p,τ) on CPR ILUT.

SPE10 1 is used in the tests. Table 5.28 contains the impact of τ on BiCGSTAB when p is fixed

at 7. Obviously, τ has little effect on the number of iterations and speedup; see Figure 5.23.

From another aspect, Table 5.29 collects the impact of p for a fixed τ. With the increase of p, speedup decreases; see Figure 5.24. Therefore, p continues to play a significant role in CPR

parallel performance.

Table 5.28: [CPR] BiCGSTAB, CPR ILUT(fixed p), SPE10 1, Env 2. p τ CPUtime CPUIte GPUtime GPUIte Speedup

7 0.01 6.58 5 0.68 5 9.72

7 0.02 6.56 5 0.68 5 9.72

7 0.03 6.59 5 0.67 5 9.88

7 0.04 6.59 5 0.67 5 9.78

7 0.05 6.53 5 0.68 5 9.64

Table 5.29: [CPR] BiCGSTAB, CPR ILUT(fixed τ), SPE10 1, Env 2. p τ CPUtime CPUIte GPUtime GPUIte Speedup

5 0.03 6.44 5 0.47 5 13.78

6 0.03 6.49 5 0.61 5 10.60

7 0.03 6.58 5 0.67 5 9.78

8 0.03 6.63 5 0.83 5 8.00

9 0.03 6.72 5 0.98 5 6.89

106 20 20 20 20 BiCGSTAB, CPR_ILUT(p = 7), SPE10_1, Env_2: GPU ite BiCGSTAB, CPR_ILUT(τ = 0.03), SPE10_1, Env_2: GPU ite τ 18 BiCGSTAB, CPR_ILUT(p = 7), SPE10_1, Env_2: Speedup 18 18 BiCGSTAB, CPR_ILUT( = 0.03), SPE10_1, Env_2: Speedup 18

16 16 16 16

14 14 14 14

12 12 12 12

10 10 10 10 GPU ite GPU ite Speedup Speedup 8 8 8 8

6 6 6 6

4 4 4 4

2 2 2 2

0 0 0 0 0.01 0.02 0.03 0.04 0.05 5 6 7 8 9 τ p

Figure 5.23: [CPR] CPR ILUT(fixed p). Figure 5.24: [CPR] CPR ILUT(fixed τ).

107 Chapter 6

CONCLUSIONS(1)

This research concentrates on the development of a new parallel solution platform based on GPU for accelerating the solution of Jacobian systems derived from large-scale heterogeneous black oil simulations. A series of general preconditioners are completed. The CPR preconditioned solvers specialized for black oil models are developed. The development of these iterative solvers makes full use of the parallel features provided by GPU. The conclusions, based on the theoretical analysis and experimental verification, are summarized in this chapter.

6.1 Conclusions

As an accelerator of scientific computing, GPU provides an effective HPC solution for personal computers or workstations. In order to make a complete evaluation scheme for the solution of large-scale heterogeneous black oil simulation based on the GPU technique, this research develops a new parallel solution platform by application of the characteristics of GPU hardware. This so- lution platform provides parallel implementation of a set of solvers and preconditioners, including the Krylov solvers BiCGSTAB and GMRES; the AMG solvers; the ILU preconditioners of ILU(k),

ILUT and BILUK; the CPR ILU(k) and CPR ILUT preconditioners specialized for the black oil

model; and the domain decomposition realized by RAS. The SPE10 model 2 is used as a bench-

mark to test the performance of the preconditioned solvers. The experimental results show that the

CPR preconditioners can provide efficient parallel performance and convergence performance for

large-scale heterogeneous black oil models based on the GPU technique. The CPR precondition-

ers are a much more effective solution than general ones on GPU. The platform also provides a

comprehensive comparative study of different solvers and preconditioners on GPUs. The specific

conclusions are described in detail below.

108 1. The algorithms implemented in this study achieves considerable parallel acceleration on GPU.

In general, several or dozens of times of acceleration compared to a serial program can be ob-

tained on GPU. The same implementation of parallel algorithms presents different performance

on different environment configurations. For instance, all the speedups of SPE10 solved by

BiCGSTAB with ILU(0) are over five on Env 1 but over 18 on Env 2.

2. Different solvers present different convergence and parallel performance. The variation in per-

formance is related to the algorithm mechanism and specific implementation. In the tests, GM-

RES with ILU(0) on Env 2 gives at least a speedup of 26 for the solution of SPE10 compared

to around 20 provided by BiCGSTAB in the same other conditions, but it needs more iterations

to achieve convergence than BiCGSTAB.

3. A large value of k makes ILU(k) keep more elements in a preconditioner matrix, which leads to

a decrease in parallel performance and an increase in convergence performance. ILU(0) gives

the best parallel performance for different k. The solution of SPE10 1 by BiCGSTAB with

ILU(k) on Env 2 obtains a speedup of 24.52 when k = 0, but the speedups are all less than 20

when k is set to 1 to 4.

For ILUT, a higher parameter p keeps more elements, but a higher τ screens out more elements.

So a high p improves convergence performance, but a high τ enhances parallel performance.

From the tests, the solution of SPE10 1 by BiCGSTAB with ILUT(9, 0.03) on Env 2 reaches

convergence after 49 iterations, but it requires 81 iterations with ILUT(5, 0.03) given other

conditions unchanged. For the same problem, a speedup of 3.38 obtained with ILUT(7, 0.01)

compares to a speedup of 7.61 achieved with ILUT(7, 0.05).

ILU(k) screens elements from the structure, but ILUT screens elements from the value size.

ILU(k) provides better parallelism than ILUT, in general.

4. BILU imports more elements, which benefits performance improvement in convergence. How-

ever, the effect is not apparent when the block size is relatively small, based on the tests.

5. RAS works as a useful decomposition in the research. For SPE10 1 solved by BiCGSTAB with

109 ILUT(7, 0.03) on Env 2, the solution achieves a speedup of 30.56 when RAS is set to 512. In

contrast, the solution is accelerated to only 4.92 times faster if no decomposition is used. Based

on the tests, the overlap does not play an expected role in the solution of SPE10 1.

6. The AMG method dramatically improves the efficiency of solving Poisson-like systems. The

V-cycle shows the best parallel performance, compared to the W-cycle and F-cycle from the

tests. The four Poisson matrices obtained at least a speedup of 45.48 by the V-cycle on Env 2.

The levels and smoothing iterations have a slight influence on the parallel performance of the

V-cycle.

7. CPR is a preconditioner specifically for the solution of a black oil system. The CPR ILU(k)

and CPR ILUT are developed in the research. For the solution of SPE10 with CPR ILU(0) on

Env 2, both BiCGSTAB and GMRES obtain a speedup of more than 20. For all the four SPE10

matrices, CPR ILU(0) demonstrates much better performance in running time, convergence,

and parallelism than ILU(0).

The work of this research has demonstrated that more effective black oil solvers can be devel-

oped based on GPU for personal computers or workstations. Besides the performance investigation

of the solution of black oil models, the work of this research can be used for a subsequent devel-

opment platform for the next generation black oil simulator.

6.2 Future Work

For further expansion of this research, some of the following aspects might be taken as future work.

1. The algorithms of this research are developed based on the hardware platform of a single GPU

in a single node. The work of multi-GPUs in a single node and multi-GPUs in multi-nodes

would be able to provide more favorable computing performance from a hardware point of

view. However, new parallel algorithms need to be designed and implemented for the new

parallel computing architectures.

110 2. From the perspective of computational mathematics, there is always some space for improve-

ment in preconditioners for specific problems. It is a direction to design new preconditioners

or study the combination of existing techniques according to the matrix characteristics of a

reservoir model.

3. As the fluid pressure always occupies one place of the unknowns during reservoir simulation,

the investigation of CPR-like preconditioned solvers for reservoir simulators on GPU is a topic

worthy of study.

111 Chapter 7

INTRODUCTION(2)

Heavy oil and oil sands have become a vital petroleum resource. Because of their high viscosity, thermal recovery techniques are widely adopted. Firstly, this chapter introduces hot water flooding, steam flooding, CSS, and SAGD. Because the hot water and steam recovery techniques consume a large amount of natural gas and water and produce significant greenhouse gas, ISC (in-situ com- bustion) that overcomes these disadvantages has been paid attention to as another thermal recovery technology. Then, the chapter gives a detailed description of the ISC process. A new in-situ com- bustion simulator with integrated functions based on the PER method is required to develop in this research. The remainder of this chapter presents the literature review and detailed research objectives.

7.1 Thermal Recovery

With conventional reserves decreasing, a broad spectrum of unconventional hydrocarbons, includ- ing heavy oil and oil sands, have emerged as potential resources to secure the energy demand in the foreseeable future. However, the majority of heavy oil fields remain untapped or undeveloped due to their recovery challenges and marginal economics. In general, crude oil with API gravity less than 20◦ is classified as heavy oil. High viscosity and density are the oil characteristics that give rise to the mobility problem of heavy oil. Oil sands are loose sands or partially consolidated sandstone, consisting of sand, clay, water, and bitumen.

Thermal recovery methods are EOR (enhanced oil recovery) processes, which optimize and improve the oil recovery by raising the temperature of a reservoir. Because the high temperature can reduce the viscosity and improve the fluidity of oil, thermal recovery methods are mainly used in a heavy oil or oil sand reservoir. The usual methods of thermal recovery are carried out by

112 injecting hot water or steam. Figure 7.1 shows the classification of thermal recovery methods.

Hot water flooding

Steam flooding

Thermal recovery Steam injection CSS

SAGD

In-situ combustion

Figure 7.1: Thermal recovery classification.

In a hot water flooding, some wells are used as hot water injection wells, and other wells are

used for oil production; see Figure 7.2. Crude oil is reduced in viscosity because it absorbs the en-

ergy released from hot water and is pushed into production wells by the hot water flooding. Steam

flooding uses a mechanism like the hot water flooding, but the injection fluid is steam rather than

hot water; see Figure 7.3. Steam carries more latent heat, resulting in higher efficiency in heating

a reservoir. CSS (cyclic steam stimulation), known as the Huff and Puff method, consists of 3

stages: injection, soaking, and production; see Figure 7.4. Steam is injected into a well for several

weeks, and then the well is shut-in to allow sufficient heat dissipation, which is expected to lead

to an enhanced production after a few days. The cycle of CSS is usually repeated until the eco-

nomic benefits become infeasible [63]. SAGD (steam-assisted gravity drainage) is a revolutionary

technology for oil recovery from oil sands; see Figure 7.5. Butler “developed the concept of using

horizontal pairs of wells and injected steam to develop certain deposits of bitumen considered too

deep for mining” [64]. High-pressure steam injected into an upper wellbore causes the heated oil

to drain into a lower wellbore where the oil is pumped out.

Steam injection is widely adopted in oil fields in California (USA), Venezuela, and Alberta

(Canada). Due to the rich oil sands found in Alberta that are about 70% oil sands in the world [65],

CSS and SAGD technologies have been thoroughly studied and developed in Canada.

113 However, limitations of the thermal methods stated above are apparent. Because hot water or steam requires heating water by burning fuel, natural gas usage, massive water consumption, and significant greenhouse gas emissions are unavoidable. The heat loss of steam transmission and injection also lower energy use efficiency. ISC, fire flooding, provides another thermal recovery method to heat heavy oil by burning the oil in a reservoir rather than injecting steam into the reservoir.

Hot water Steam

Hot water zone Hot oil and water Steam Zone Hot water zone Hot oil and water

Figure 7.2: Hot water flooding. Figure 7.3: Steam flooding.

7.2 In-Situ Combustion Process

Unlike the other thermal recovery processes that require the injection of steam or hot water, the

ISC process mainly requires the injection of air or oxygen. Because the combustion process takes

place in the reservoir, exhaust gases from combustion reactions, especially carbon dioxide, can

be retained in the reservoir, thus reducing the pollution to the environment. However, because

the burning process belongs to intense chemical reactions and the geological conditions of an

underground reservoir are often complicated, the ISC process has some problems that need to be

overcome, such as the control of the burning process and the safety of well operations.

114 Steam Steam

Steam Zone Initial Zone Steam Chamber Initial Zone

Figure 7.4: CSS. Figure 7.5: SAGD.

Figure 7.6 shows a typical ISC process. Air or oxygen-enriched air is injected into a reservoir

through an injection well. For a wet ISC process, water needs to be injected with air. After ignition

in the wellbore, a combustion front forms and moves forward by the propulsion of gas. It burns

a small fraction of the oil in the reservoir to heat the unburned fraction. The strong exothermic

oxidation reactions generate flue gases and increase the temperature of the adjacent part. Oil is

reduced in viscosity by several orders of magnitude and driven to produce. The pyrolysis of heavy

oil also improves the quality of the oil by the production of light oil components.

For movable heavy oil, the ISC process can be used as an alternative to steam flooding. For oil sands or bitumen, the ISC process can be used as a follow-up technology to CSS or SAGD.

The process is mostly a gas phase dominated oil displacement process with high temperature and chemical reactions. In-depth understanding and optimization of the process through numerical simulations have great significance to both industrial practice and scientific research.

7.3 Literature Review

Ramey et al. established the first ISC model in 1959 [66]. The model is a highly simplified one to study the movement of a combustion front and the transient temperature distribution caused by the

115 Air or O 2 Fluids Water

Burned zone Combustion zone Cracking/Vap zone

Steam Zone Altered Zone Initial Zone

Figure 7.6: In-situ combustion process. radial movement of a cylindrical heat source through an infinite homogeneous medium.

Gottfried developed the first general ISC model in 1965 [67]. He proposed a generalized math- ematical model describing the thermal recovery of oil with convective external heat loss. Three- phase fluid flow, conduction-convection heat transfer, and chemical reactions between oxygen and oil and aqueous phase change were all taken into account. Many characteristics of an ISC pro- cess, such as the propagation of a combustion front and the formation of water and oil banks, can be reflected in the running results. However, the effects of gravity and capillary forces, oil phase equilibrium, and the coke formation and oxidation were neglected. Limited by the performance of computers at that time, a model was designed in one dimension, and its nonlinear system was solved by a semi-implicitly method (implicit for pressure and temperature; explicit for saturation).

The model is considered a milestone in simulating the phenomena observed in thermal recovery experiments.

Coats presented a three-dimensional and highly implicit numerical model for simulating steam

116 flooding in 1978. The model is compared in stability and computing time with an earlier reported

model. The VS (variable substitution) method was used in the model formulation [68].

Crookston et al. proposed a consistent model to simulate thermal recovery processes under

complex conditions in 1979 [69]. The PER concept used for the process of the disappearance

and appearance of liquid phases was introduced in this work. The oil phase was divided into two

pseudo components: light oil and heavy oil. Six components and three fluid phases were involved

in the formulation. Four chemical reactions that are light oil oxidation, heavy oil oxidation, heavy

oil cracking, and coke oxidation were involved. Heat transfer of conduction, convection, and heat

loss to the surrounding rocks were all included. The overburden and underburden were treated

as finite regions, and the heat flow in both vertical and horizontal directions was allowed in the

regions. The effects of capillary pressure and gravity were also included.

Grabowski et al. developed a more general-purpose thermal model with a fully implicit for-

mulation in 1979 [70]. The model included four phases, a variable number of oil components, and

chemical reactions. The finite difference method, two-point upstream, and harmonic mobility were

adopted for numerical calculations.

With the development of research, the modeling of ISC became more and more rigorous. More

oil and gas components and chemical reactions were taken into account. The phase equilibrium,

reaction kinetics, and stoichiometry were all required. The solution methods evolved gradually

towards the fully implicit treatment of the finite difference scheme.

Based on the work of Crookston et al., Youngren developed an ISC combustion simulator for the scale of realistic reservoir problems in 1980 [71]. It presented the analysis of experimental and

field data in one, two, and three dimensions. As oxygen is assumed to be instantaneously consumed on contact with fuel, the rate of combustion is assumed to be equal to the oxygen flux. The simulator cannot model partially quenched wet combustion or be used to predict low-temperature and spontaneous ignition.

Coats extended the generality of previous models in 1980 [72]. He developed a comprehensive

117 and general-purpose thermal simulator. The simulator allowed any component to distribute in any phase. The model computed the coke concentration and considered the change of absolute permeability with coke concentration. Variable substitution and a fully implicitly solution scheme were adopted.

Hwang et al. introduced a black oil simulator for in-situ combustion processes in 1982 [73].

A combustion front was not only viewed as a moving heat source but as a displacement pump to enhance oil flow. The field-oriented features were illustrated by the history-matching of a dry- combustion pilot test in South Belridge, California.

Rubin et al. introduced a steam and combustion simulator in 1985, which is fully implicit, four-phase, multi-component, and multi-dimensional [74]. They also developed a suite of powerful iterative techniques to solve large-scale thermal problems.

Abou-Kassem et al. proposed a new set of pseudo equilibrium ratios (PERs) for a thermal simulator in 1985. They compared the proposed approach with the variable substitution (VS) method and considered the PER method to be an efficient and simple method of handling phase change. Their PERs can be used to deal with the disappearance and appearance of the oil, water, or gas phase [75].

Chen et al. developed a simulator to model thermal effects in naturally fractured reservoirs in 1987 [76]. The double porosity concept was used for fractures and matrix blocks. Chemical reactions were not involved in this model.

In 1988, Ito and Chow developed their model for field-scale combustion, in which a new pseudo-kinetic scheme was introduced. And they developed a new and numerically stable al- gorithm with an oil-flow-enhancement scheme to achieve the desirable fuel consumption [77].

Ewing and Lazarov published the techniques of adaptive local grid refinement in 1988 [78].

They explained several different types of grid refinement methods and suggested approaches to in- corporate these methods in existing large-scale simulators. The self-adaptive local grid refinement techniques can be a reasonable solution to resolve the local physical behavior such as a combustion

118 front and complex well models in large-scale fields.

Thiez and Lemonnier proposed a formulation that used a heat-release curve to describe the

combustion front in 1990. The method improved numerical stability and gave an accurate temper-

ature distribution with large grid blocks [79].

Oklany developed a multi-dimensional three-phase model with a numerical Jacobian solution

in FORTRAN 77 to provide a comprehensive and efficient model accessible to the academic com-

munity in 1992 [80]. The model was based on the assumptions made in the model proposed by

Crookston [69].

Abou-Kassem et al. gave some useful guidelines for numerical difficulties encountered in the

simulation of steam injection and in-situ combustion processes in 1996. A set of problems were

addressed, such as the elimination of constraint equations at the matrix level, phase change, steam

injection rate, and alternative treatments of heat loss. Both the PER method and the VS method

are discussed in this work [81].

Christensen et al. presented a new technique to decrease the computational times of thermal

simulations in 2004. To reduce the computational cost, they applied a dynamic gridding method

to keep a fine-scale representation near the thermal front, which is a combustion front or steam

chamber interface, and a coarser grid away from the front [82].

Huang designed a decoupled software framework with separate modules for physical models and discretization methods in 2009 [83]. Thermal recovery processes such as steam flooding, in-situ combustion, and steam-assisted gravity drainage were implemented under this framework.

The simulator also provided functions of unstructured grids, discrete fractures, and parallelization.

7.4 Research Objectives

The objectives of this research focus on developing a new in-situ combustion simulator with inte- grated functions based on the PER (pseudo equilibrium ratio) method and completing a compre- hensive comparison of the research simulator with a benchmark simulator implemented with the

119 VS (variable substitution) method.

Compared with the VS method, the PER method reduces the influence of phase disappearance and appearance on the mathematical system of the ISC process. The PER method can be used as an efficient option to reduce the complexity of development. In this research, the numerical results of the research simulator with the PER method are required to compare with those of a benchmark simulator with the VS method in an omnidirectional range. These functions contain dry combustion, wet combustion, heat loss, multi-perf wells, well schedule, simulation of a tube size model, and simulation of 1-D, 2-D and 3-D full-scale models. The equivalence of the two methods is expected to be verified from numerical experiments.

The significance of this research is not only to verify the equivalence of the research simulator with a benchmark simulator but also to provide complete experimental support for popularizing the use of the PER method to develop an in-situ combustion simulator.

The research also provides a secondary development platform for further research related to in-situ combustion development. Because an in-situ model integrates the function of a thermal model and chemical reactions, the conclusion of this research work can be directly applied to the development of a thermal recovery simulator.

A remarkable point to clarify is that using a simulator and developing a simulator are two com- pletely different things. For simulator applications, researchers are more concerned about how to build an appropriate model and find new ways to optimize the model using a developed simulator.

However, for simulator development, researchers focus on the analysis of mathematical systems, the use of numerical methods, the design and development of software, and the correctness of running results. The development of a modern simulator is an integrated discipline of reservoir engineering, software engineering, computational mathematics, and even high-performance com- puting.

120 Chapter 8

IN-SITU COMBUSTION SIMULATOR

This chapter expounds on the principle and implementation mechanism of the ISC simulator. The

first part introduces the model description, which includes the phases, components, chemical reac- tions, physical properties, and assumptions. In the second part, this chapter investigates the mathe- matical system of the ISC process, including mass conservation, energy conservation, constraints, and well equation. The principle of phase transition treated by the PER method is discussed in detail. A specific study is given to heat loss, which is essential for a non-isothermal model. The last part of this chapter presents the numerical methods of simulator implementation. The dis- cretization scheme and specific treatments are investigated. The Newton-Raphson method and the simulator flow are also presented.

8.1 Model Description

8.1.1 Phases and Components

The ISC (in-situ combustion) recovery is a very comprehensive process featured by multi-phase, multi-component, non-isothermal, and chemical reactions. In this research, we designed and devel- oped an ISC simulator based on the general functions majorly described by Crookston et al. [69].

Table 8.1 gives the phases, components, and their relationships, where the symbol “X” denotes a component that can exist in a phase.

The ISC model has three fluid phases: water, oil, and gas. The oil phase is composed of two

pseudo-component: light oil and heavy oil. The components in the gas phase are divided into con-

densable and non-condensable. The components of water, light oil, and heavy oil are condensable.

The components of oxygen, nitrogen, carbon dioxide, and carbon monoxide are assumed non-

condensable. Except for the oxygen component, the other non-condensable components in the gas

121 Table 8.1: Phases and components for ISC model. Phase water oil gas solid Component

Water (H2O) X X Light oil (LO) X X

Heavy oil (HO) X X

Oxygen (O2) X

Inert gas (COx,N2) X Coke (Coke) X

phase are combined into a pseudo-component: Inert gas. The combination reduces the computa-

tional complexity but loses the opportunity to study the properties of each member. Besides the

rock matrix, coke is also a solid, but it is considered an active solid phase, which is one product

of the combustion reaction. The symbols in parentheses will be used in the following chemical

reaction equations.

8.1.2 Chemical Reactions

The ISC model has four chemical reactions, which are light oil oxidation, heavy oil oxidation,

heavy oil cracking, and coke oxidation, as illustrated by Figure 8.1. The chemical reaction equa-

tions are given as

r1 LO + s1 O2 s2 COx + s3 H2O, (8.1) −→ r2 HO + s4 O2 s5 COx + s6 H2O, (8.2) −→ r3 HO s7 LO + s8 Coke + s9 COx, (8.3) −→ and

r4 Coke + s10 O2 s11 COx + s12 H2O. (8.4) −→ s1 s12 are stoichiometric coefficients. r1, r2, r3, r4 are reaction rates measured by a reactant with ∼ a unit stoichiometric coefficient. COx, a portion of the inert gas, is a mixture of carbon monoxide

122 (CO1) and carbon dioxide (CO2).

CO x +O 2

r 2 H 2 O

HO CO x +O 2 LO

r 1 H 2 O

CO x +O 2 Coke

r 3 r 4 H 2 O

CO x

Figure 8.1: Chemical reactions.

8.1.3 Physical Properties

The expressions of physical properties are collected from the publications of Crookston et al.,

Coats, and STARS USER GUIDE [69, 72, 84].

8.1.3.1 Porosity

Porosity, occupied by the fluid phases and coke, is the fraction of the void part over the total vol- ume. The porosity φ is impacted by pressure and temperature, and calculated based on a reference value φre f as

1+cφ φ = φre f e , (8.5) where

cφ = cφ p(P Pφ re f ) cφ t(T Tφ re f )+ cφ p t(P Pφ re f )(T Tφ re f ). (8.6) , − , − , − , , , − , − , cφ,p, cφ,t and cφ,p,t are the formation compressibility, the thermal expansion coefficient, and the P-T cross-term coefficient, respectively.

123 Coke is assumed to be immobile and accounts for a portion of the void volume. The coke

concentration is calculated by

Cc = φScρc, (8.7)

where Sc and ρc are coke saturation and coke molar density.

In order to avoid introducing Sc into the model calculations, we use a porosity reduction ap- proach suggested in STARS USER GUIDE [84]. The effective porosity

φ f = φ(1 Sc) (8.8) −

comes from the original porosity deducted by the volume of coke.

By combining (8.7) and (8.8), the final expression of φ f is written as

Cc φ f = φ . (8.9) − ρc

8.1.3.2 Absolute Permeability

The tensor of absolute permeability is given by

Kxx   K = Kyy , (8.10)      Kzz      where each diagonal entry stands for an absolute permeability measured in the direction of a coor- dinate axis.

8.1.3.3 Relative Permeability

Because the measurement of three-phase permeability is challenging, three-phase permeability is usually estimated by two-phase permeability data. Given a water-wetting system, Figure 8.2 shows the relationship between pore occupancy and pore size. Therefore, the relative permeabilities of the three phases are based on the functions of

Krw = Krw(Sw), (8.11)

124 Krg = Krg(Sg), (8.12)

and

Kro = Kro(Sw,Sg). (8.13)

Pore occupancy

water oil gas

Pore size

Figure 8.2: Pore occupancy vs. pore size (adapted from [85]).

Krw and Krg are obtained from an oil-water flow and oil-gas flow experiments, respectively; see Figure 8.3 and Figure 8.4. They can also be estimated by the two-phase models as

Zw Sw Swc Krw = Krwro − (8.14) 1 Swc Sorw  − −  and

Zg Sg Sgc Krg = Krgro − (8.15) 1 Swc Sorg Sgc  − − −  [86].

The Stone II model

Krow Krog Kro = Krocw + Krw + Krg Krg Krw (8.16) K K − −  rocw  rocw   provides the most common method for calculating Kro, where Krow and Krog are computed by

Zow 1 Sorw Sw Krow = Krocw − − (8.17) 1 Sorw Swc  − −  125 and Zog 1 Swc Sorg Sg Krog = Krocw − − − (8.18) 1 Swc Sorg Sgc  − − −  [72, 87, 88].

k row k rw k k krocw krocw rog rg

S + S Swc Sorw wc org 0 0 0 1 0 1 Sw Sg

Figure 8.3: Two-phase relative permeabilities Figure 8.4: Two-phase relative permeabilities for an oil-water system (from [88]). for an oil-gas system at connate water satura-

tion (from [88]).

Krwro: relative permeability to water at residual oil saturation in oil-water two-phase • Krgro: relative permeability to gas at residual oil saturation in oil-gas two-phase • Krocw: relative permeability to oil at connate water saturation in oil-water two-phase • Krow: relative permeability to oil in oil-water two-phase • Krog: relative permeability to oil in oil-gas two-phase • Swc: connate water saturation • Sorw: residual oil saturation in oil-water two-phase • Sorg: residual oil saturation in oil-gas two-phase • Sgc: critical gas saturation • Zw: fitting constants for relative permeability of water phase • Zow: fitting constants for relative permeability to oil in oil-water two-phase •

126 Zog: fitting constants for relative permeability to oil in oil-gas two-phase • Zg: fitting constants for relative permeability of gas phase •

8.1.3.4 Molar Density

Water and oil are liquid phases. Their molar densities, ρw and ρo, are calculated from a reference density. The molar density of gas needs to be determined from the EOS (Equation of State) as it is

sensitive to the properties and composition of its components.

Water Molar Density

The calculation of ρw is given as

cw ρw = ρre f ,we , (8.19)

o o where the reference density ρre f ,w is measured at (T , P ), the reference pressure and temperature.

cw is determined by density compressibility cw,p, 1st-order thermal expansion coefficient cw,t, 2nd-

order thermal expansion coefficient cw,t2, and P-T cross-term coefficient cw,p,t. The expression of

cw is written as o o 1 2 o2 cw = cw p(Pw P ) cw t1(T T ) cw t2(T T ) , − − , − − 2 , − o o + cw p t(Pw P )(T T ). (8.20) , , − − Oil Molar Density

As there are multiple components in the oil phase, the oil molar density

1 No − xo,i ρo = ∑ (8.21) i=1 ρo,i !

is calculated by mole fraction xo,i of each component i in the oil phase and molar density

co,i ρo,i = ρre f ,o,ie (8.22) of each component i in the oil phase, where

o o 1 2 o2 co i = co p i(Po P ) co t1 i(T T ) co t2 i(T T ) , , , − − , , − − 2 , , −

127 o o + co p t i(Po P )(T T ). (8.23) , , , − − Gas Molar Density

The gas molar density ρg is calculated by the EOS (Equation of State) of

Pg ρ = , (8.24) g ZR T where Z is the average compressibility for the gas phase and R is the universal gas constant.

The factor Z is obtained from the maximal real root of the Redlich-Kwong equation given as

Z3 Z2 +(A B2 B)Z AB = 0. (8.25) − − − −

Let Tc and Pc represent the critical temperature and critical pressure, respectively and xg,i be the mole fraction of component i in the gas phase. The coefficient calculations are given by

2.5 Pg T A = 0.427480 c , (8.26) P T  c   P T B = 0.086640 c , (8.27) P T  c   2 a2 3 T = , (8.28) c b   T P = c , (8.29) c b

1 Ng 2 Tc,i a = ∑ xg,iTc,iv , (8.30) uPc,i i=1 u t and Ng Tc,i b = ∑ xg,i . (8.31) i=1 Pc,i

8.1.3.5 Mass Density

When there are Nα components in phase α, the mass density ρˆ α of the phase is calculated by

ρˆ α = ραMα. (8.32)

128 Mα, the average molecular weight, is determined by

Nα Mα = ∑ xα,iMi, (8.33) i=1

where xα,i is the mole fraction of component i in phase α and Mi is molecular weight of component i.

8.1.3.6 Viscosity

Water Viscosity

The water viscosity µw is calculated by temperature T with coefficients avw and bvw. The expression is written as

bvw µw = avwe T . (8.34)

Oil Viscosity

The oil viscosity µo is calculated by

No ∑ xo ilnµo i µo = e i=1 , , , (8.35)

where

bvo,i µo,i = avo,ie T . (8.36)

Gas Viscosity

The gas viscosity is calculated by

Ng ∑i=1 µg,ixg,i√Mi µg = , (8.37) Ng ∑i=1 xg,i√Mi where

bvg,i µg,i = avg,iT . (8.38)

8.1.3.7 Enthalpy

The isobaric heat capacity and temperature determine the gas enthalpy. The liquid enthalpy can be

calculated from the gas enthalpy subtracted by vapourization enthalpy.

Gas Enthalpy

129 The gas enthalpy is given as Ng Hg = ∑ xg,iHg,i, (8.39) i=1 where T Hg,i = cp,g,idT (8.40) ZT o

is the enthalpy of component i in the gas phase. cp,g,i, the isobaric heat capacity of component i in the gas phase, is calcualted by

2 3 cp,g,i = cp,g1,i + cp,g2,iT + cp,g3,iT + cp,g4,iT . (8.41)

Water Enthalpy

The water enthalpy Hw is calculated by

Hw = Hg w Hvap w, (8.42) , − ,

where Hg,w is the enthalpy of the water component in a gas state, and Hvap,w is the vapourization enthalpy calculated by

ev,w Hvap w = hvr w(Tc w T) T Tc w , , , − ≤ ,  . (8.43) Hvap,w = 0 T > Tc,w

When the temperature T is greater than the critical temperature Tc,w, Hvap,w equals zero. Oil Enthalpy

The oil enthalpy No Ho = ∑ xo,iHo,i (8.44) i=1

is calculated by xo,i and the enthalpy of oil component i

Ho i = Hg o i Hvap o i, (8.45) , , , − , ,

where ev o i Hvap o i = hvr o i(Tc o i T) , , T Tc o i , , , , , , − ≤ , ,  . (8.46) Hvap,o,i = 0 T > Tc,o,i   130 8.1.3.8 Internal Energy

Fluid Internal Energy

The internal energy Uα of phase α is calculated by

Pα Uα = Hα . (8.47) − ρα

Solid Internal Energy

Since rock often occupies a significant fraction of the total volume, the rock internal energy

o 1 2 o2 Ur = Cvr1(T T )+ Cvr2(T T ) (8.48) − 2 − should be calculated by constant volumetric heat capacity.

The coke internal energy

o 1 2 o2 Uc = Cpc1(T T )+ Cpc2(T T ) (8.49) − 2 −

is calculated by the isobaric heat capacity, which is approximate but accurate enough.

8.1.3.9 Thermal conductivity

The thermal conductivity

κT = φ ∑ Sακα +(1 φ)κr (8.50) α=w,o,g ! − is calculated from the thermal conductivity of the fluid phases and rock with the weight of volume.

8.1.3.10 Capillary pressure

The capillary pressure Pcow between the oil phase and the water phase and the capillary pressure

Pcgo between the gas phase and the oil phase are calculated by

3 Pcow = pcw0 + pcw1(1 Sw)+ pcw2(1 Sw) (8.51) − − and

3 Pcgo = pcg0 + pcg1Sg + pcg2Sg (8.52) separately, which were proposed by Vaughn [89].

131 8.1.4 Assumptions

The following assumptions are made for the model:

1. Mass transport: the diffusion effect of the components between phases is ignored.

2. Energy transport: the heat conduction and convection are considered, while the heat radiation

effect is ignored. The flow of fluid (liquid and gas) is assumed to be a Darcy flow. We do not

take into account the energy changes caused by viscous forces and diffusion.

3. Phase equilibrium: phase equilibrium is reached instantly.

4. Chemical reaction equilibrium: chemical reaction equilibrium is reached instantly.

5. Rock: rock is a solid that forms the matrix of the model. The porosity of rock provides a space

for fluid and coke. Rock is not supposed to participate in chemical reactions.

6. Coke: coke is assumed as an immobile solid that occupies the volume of porosity, but its effect

on permeability is ignored.

8.2 Mathematical System

8.2.1 Mass Conservation

A mass conservation equation is a consecutive equation used to describe the mass conservation of a specific component. For a component i in phase α, its mass conservation is expressed by

∂∑α=w,o,g(φSαραxα,i) = ∇ ∑ uαραxα,i + qi + ri, (8.53) ∂t − · α=w,o,g ! where uα, qi and ri represent the Darcy velocity, the well term, and chemical reaction term, respec- tively.

As coke is solid, there is no divergence term and well term. The coke mass conservation is given by ∂C c = r , (8.54) ∂t i where Cc is the concentration of coke.

132 Darcy’s Law

Darcy’s law describes a fluid flow in a porous medium. The Darcy velocity uα of phase α is given by Krα uα = K ∇Φα, (8.55) − µα where the potential Φα is calculated by

Φα = Pα ρˆ αgZ. (8.56) −

Well Term

The well term qi represents the flow rate of component i in a well. The calculation of qi is

qi = ∑ Qαxα,i, (8.57) α=w,o,g where Qα is the flow rate of phase α in the well.

Qα is obtained from Peaceman’s model and written as

ρα Qα = WI Krα [(Pbh Pα) ρˆ wellgˆ(ZBH Z)] (8.58) · µα − − −

[90]. WI is the well index. Pbh is the bottom hole pressure. ρˆ well is the average mass density of

flow from the depth ZBH to Z in the well.

Peaceman’s model calculates the flow rate through a radial flow with an equivalent radius re, at which the steady-state flowing pressure for the actual well equals the numerically computed pressure for the well block [4]. Figure (8.5) presents an instance that shows a single layer centered well model with a well index WI given by

2πh k WI = z , (8.59) ln( re ) rw

where rw is the well radius and hz is the height of the well wall. We assume the reservoir to be homogeneous and isotropic; i.e., for the absolute permeability tensor K = kI, k is a constant. For

other well types, the well indices are detailed in [90, 4].

Chemical Reaction Term

133 re

rw

hz

Figure 8.5: Radial flow (adapted from [4]).

ri represents the rate of consumption or formation of a component i that acts as a reactant or a product in chemical reactions. The calculation of ri is given by

Nr ri = ∑ s j,iR j, (8.60) j=1 where component i participates in Nr chemical reactions. s j,i and R j are the stoichiometric coeffi- cient and the chemical reaction rate, respectively.

The chemical reaction rate R j is determined by the Arrhenius Equation

E j a b R A e− R T P C (8.61) j = j O2 f

[91, 92]. PO2 is the partial pressure of oxygen. Cf is the concentration of fuel (hydrocarbon).

The fuel molecules that have kinetic energy greater than the activation energy E j are qualified to react. The activation energy is assumed to have no relation with temperature on experience. A j is the pre-exponential factor. From a statistical point of view, the number of reaction molecules conforms to the Maxwell-Boltzmann distribution. A first-order dependence with respect to both oxygen partial pressure (a = 1) and carbon concentration (b = 0.5 1) was suggested by Bousaid ∼ et al. (1968) and Dabbous et al. (1974) [93, 94].

The reaction rate of hydrocarbon oxidation is calculated by

E j R T R j = A je− (Pgxg,O2 )(φSoρoxo,i), (8.62)

where Pgxg,O2 is the oxygen partial pressure and φSoρoxo,i is the concentration of the oil component

i in the oil phase. If the fuel is coke, φSoρoxo,i is replaced by Cc.

134 The reaction rate of heavy oil pyrolysis is calculated by

5 E j C R c R j = A je− T (φSoρoxo i) 1 . (8.63) , − C "  c,max  #

5 As coke exists in the resultants, a factor 1 Cc is used to set an upper bound for coke − Cc,max   concentration suggested by Oklany [80]. The reaction rate equals zero when the upper bound is

reached, but little influence affects the reaction rate when the coke concentration is low.

8.2.2 Energy Conservation

The energy conservation is expressed by

∂ ∑ φSαραUα +CcUc +(1 φ)Ur = ∇ (κT ∇T) ∇ ∑ uαραHα ∂t "α=w,o,g − # · − · α=w,o,g !

+ qH + qH qloss, (8.64) r − where

qH = ∑ QαHα (8.65) α=w,o,g and Nr qHr = ∑ R jHr, j. (8.66) j=1

The left-hand side is the accumulation term calculated by the internal energy of a fluid phase Uα, the internal energy of coke Uc, and the internal energy of rock Ur. It needs to note that Uα and Uc use the dimension of energy per mole, but Ur is in the dimension of energy per grid block volume.

On the right-hand side, two divergence terms are heat conduction and heat convection. κT and uα are thermal conductivity and Darcy velocity, repectively. Hα is the enthalpy for fluid phase

α. qH, qHr and qloss represent the heat flow through a well, chemical reactions and surroundings

(overburden or underburden), respectively. Hr, j in (8.66) stands for the reaction enthalpy of the chemical reaction j. The heat loss qloss will be introduced in Section 8.2.6.

135 8.2.3 Constraints

8.2.3.1 Gas Mole Fraction

The gas phase is composed of condensable components and non-condensable components. The

condensable components are the water component and the oil components. The non-condensable

components include oxygen, nitrogen, carbon monoxide, and carbon dioxide. The constraint for

mole fractions of components in the gas phase is given as

Ng ∑ xg,i = 1. (8.67) i=1

8.2.3.2 Oil Mole Fraction

The mole fractions of components in the oil phase satisfy the constraint of

No ∑ xo,i = 1. (8.68) i=1

8.2.3.3 Saturation

The saturations of fluid phases conform the constraint given by

∑ Sα = 1. (8.69) α=w,o,g

8.2.3.4 Capillary Pressure

The constraints of capillary pressure is given as

Pw = Po Pcow (8.70) −

and

Po = Pg Pcgo, (8.71) −

where Pcow represents the capillary pressure between oil and water, and Pcgo is the capillary pres- sure between gas and oil. Because the capillary pressures are usually much smaller than the phase

pressures, they can be reduced to zero in an approximate calculation.

136 8.2.4 Well Equation

Each well contributes an equation to the mathematical system. When the well has a fixed bottom

hole pressure Pbh,const, the equation is simple as

Pbh = Pbh,const. (8.72)

When a well is fixed at a flow rate qα,const of phase α, the equation is written as

Qα = qα,const. (8.73)

8.2.5 Phase Change

We only consider the phase change between liquid and gas. The equality of fugacity of a compo- nent i in the liquid and gas phases represents the phase equilibrium, written as

fl,i = fg,i l = w,o. (8.74)

The phase equilibrium can be written in another form of K values:

xg,i = Kl g,ixl,i. (8.75) ∼

Because it is accompanied by heat changes, the ISC process makes occasions for the disappear- ance and appearance of the liquid phases: the water phase or the oil phase. The phase equilibrium equation (8.75) will be invalid if its liquid phase disappears. Furthermore, we have to update the mathematical system to cater to different situations, such as removing the saturation variable of a liquid phase when the liquid phase disappears or changing the system back when the liquid phase reappears. Base on this idea, the VS (variable substitution) method was investigated by

Coats [68, 72]. To avoid many variable and system adjustments, Crookston et al. adopted the

PER (pseudo equilibrium ratio) method, which introduces a correction factor to keep the phase equilibrium equations effective no matter whether a liquid phase disappears or appears [69]. The

PER method is equivalent to acquiring the system solution with high accuracy under the state of a liquid phase vanish by giving a minimal saturation of the liquid phase. In other words, it is

137 equivalent to assuming that no liquid phase will disappear, but it is possible to obtain the solution

with high accuracy for the case of phase disappearance and appearance. Therefore, compared with

the VS method, the PER method reduces the influence of phase disappearance and appearance on

the mathematical system. It provides a simple and efficient technique to handle the phase changes

[75].

Kw∗ g, the pseudo K value for the water component, can be written as ∼

Sw Kw∗ g = Kw g , (8.76) ∼ ∼ Sw + ε where Sw is the correction factor. Sw+ε For the oil phase, it is unnecessary to create a pseudo K value for every component because

one correction factor related to oil saturation is enough to keep the presence of the oil phase. The

pseudo K value should be used for the heavy oil component [69]. Ko∗ g,H, the pseudo K value for ∼ the heavy oil component, is calculated by

So Ko∗ g,H = Ko g,H , (8.77) ∼ ∼ So + ε where So is the correction factor. So+ε Usually, ε is assigned a small value on the order of 10 4. Given ε = 1 10 4, the correction − × − factor is greater than 0.99 when the saturation is greater than 0.01. So there is little influence on a system solution for an existing liquid phase. From the other side, the correction factor ranges from 0 to 0.99 when the saturation is less than 0.01. This feature allows the correction factors to provide enough flexibility to satisfy a phase equilibrium equation and guarantee a sufficiently precise solution when a liquid phase approaches disappearance.

The K values are obtained from Pg, T and a series of laboratorial parameters:

kl,i,1 kl,i,4 Kl g,i = + kl,i,2Pg + kl,i,3 exp (8.78) ∼ Pg T kl i 5    − , ,  [4].

138 8.2.6 Heat Loss

The chemical reactions in an ISC process release a large amount of heat, which increases the

temperature of a reservoir, leading to an unavoidable heat loss to the surrounding rock, usually the

cap rock or base rock (overburden or underburden); see Figure 8.6.

cap rock

reservoir

base rock

Figure 8.6: Heat loss.

Because a temperature profile may go deep into the surrounding rock, a large number of grid

blocks based on a discrete method are required to be established in the surrounding rock to obtain

a numerical solution of heat loss. Therefore, the solution of such a discrete method would oc-

cupy considerable resources of computing and storage. Considering the uncertainty in the thermal

parameters of the surrounding environment, Vinsome and Westerveld proposed a semi-analytical

method to calculate the heat loss easily and efficiently [95].

The problem of heat loss can be seen as heat conduction in a semi-infinite solid defined by

∂T ∂2T ∂t = λsur ∂z2   T(t,0)= Tres  , (8.79)   T(t,z ∞)= T o  → sur  T(t,z)= T o  sur   where λsur (abbreviated as λ in the following) represents the thermal diffusivity of the surrounding rock and z is the depth starting at the interface and vertical into the surrounding rock. Let θ

represent the difference at the interface between the reservoir temperature and initial temperature

139 in the surrounding rock:

o θ = Tres T . (8.80) − sur Based on the fitting function

o 2 z T(t,z) T =(θ + pz + qz )e− d (8.81) − sur

chosen for the temperature profile into the cap or base rock, Vinsome and Westerveld derived the

heat loss expressions using a finite-difference discretization of the time derivative [95].

The method only requires quite standard parameters from the surrounding rock, such as the

thermal conductivity κsur and volumetric heat capacity Cv,sur (abbreviated as κ and Cv in the fol- lowing). The thermal diffusivity λ is determined by

κ λ = . (8.82) Cv

The heat loss flux q f lux is calculated by

∂T θ q f lux = κ = κ p . (8.83) − ∂z d −   d is the diffusion length in the surrounding rock:

√λt d = . (8.84) 2

p, I, and q are intermediate variables:

λ∆tθ n d3(θ θn) + I − p = d − λ∆t , (8.85) 3d2 + λ∆t

In = θndn + pn(dn)2 + 2qn(dn)3, (8.86)

and 2 θ θn 2pd θ + d − q = − λ∆t , (8.87) 2d2 where the superscript n denotes a value from the old time step.

The heat loss is obtained by

qloss = q f luxA f lux, (8.88)

140 where A f lux is the flux area between the reservoir and the surrounding rock. Because the calculation does not use any feature information of boundary grid blocks, the method can be used for both

structured or unstructured grids. The energy stored in the surrounding rock can be calculated by

∞ κ κ 2 Ec = Tdz = d(θ + pd + 2qd ). (8.89) λ Z0 λ

8.2.7 PDE System

Based on the equations and formulas introduced, a closed system can be established by selecting equal numbers of equations and unknowns. We select the natural variables, which refer to the

Po, T, Sα, and xα,i, for unknowns in the research. The natural variables benefit the derivation calculation of a Jacobian matrix and avoid the flash calculation to obtain mole fraction xα,i. Table 8.2 lists the equations and unknowns. The correspondence between an equation and an unknown is not required to be the same as in the table. Generally, a variable that has a great effect on an equation should be selected for the equation.

Po is selected as the unknown for pressure. Pw and Pg are obtained by the capillary pressure equations (8.70) and (8.71). Sw and Sg are selected as unknowns for saturation. So is calculated by the saturation constraint equation (8.69). xo,i, the mole fraction of the oil component i in the oil

phase, is selected as unknown for oil conservation. xg,i, the mole fraction of the non-condensable gas component i in the gas phase, is used as unknown for gas conservation. The phase equilibrium equation (8.75) determines the mole fraction of the water component and any oil component in the gas phase. Both a fixed pressure well and a constant rate well have Pbh as their unknowns. Two types of indexing rules are designed to track a component; see Table 8.3. The local index records a component in its local phases, such as an oil component in the oil phase and a non- condensable gas component in the gas phase. The global index is used to identify a component when all components are under consideration. For example, the global index is an appropriate way when we want to indicate an oil component in the gas phase.

141 Table 8.2: Equations and unknowns. No. Equation Unknown

1 water conservation Po

2 oil conservation xo,i

3 gas conservation xg,i

4 coke conservation cc 5 energy conservation T

6 gas mole fraction Sw

7 oil mole fraction Sg

8 well Pbh

Phase Number of components Local index Global index

water 1 1 1

oil No 1 ,..., No 2 ,..., No + 1

gas Ng 1 ,..., Ng No + 2 ,..., No + Ng + 1

coke (active solid) 1 1 No + Ng + 2 Table 8.3: Component indexing.

8.3 Numerical Methods

8.3.1 Discretization Scheme

The FDM (finite difference method) is used in discretization. The reservoir model is a three- dimensional block-centered grid with no flow boundary; see Figure 8.7. Input parameters define the number of cells (grid blocks) and cell size in each direction. Any number of wells are allowed to present. The wells are allowed to penetrate multi-layers in a vertical or horizontal direction. The cells can be an internal cell or a boundary cell, with or without a well perforation.

As high-level implicitness enhances numerical stability, the fully implicit scheme is adopted in the research. The FDM formulations on a given time interval ∆t and a given control volume V for

142 well 1

well 2

internal cell boundary cell well cell

Figure 8.7: Grid and cell.

the mass conservation equations and the energy conservation equation are listed in

n ∑ (φSαραxα,i) ∑ (φSαραxα,i) α=w,o,g − α=w,o,g V ∆t  6 A Krα = K ραxα i (Φα j Φα) ∑ L ∑ µ , , − j=1  avg, j α=w,o,g  α avg, j + qiV + riV, (8.90) n Cc (Cc) − V = r V, (8.91) ∆t i and n ∂ ∑ φSαραUα +CcUc +(1 φ)Ur ∑ φSαραUα +CcUc +(1 φ)Ur V ∆t " α=w,o,g − ! − α=w,o,g − ! #

6 A = κT (Tj T) ∑ L − j=1  avg, j 6 A Krα + K ραHα (Φα j Φα) ∑ L ∑ µ , − j=1  avg, j α=w,o,g  α avg, j + qHV + qH V qlossV. (8.92) r − The superscript n denotes the previous time step, and the other variables all belong to the new time step, n + 1.

Figure 8.8 shows an internal cell with its six adjacent cells. In the discretization of the diver- gence term, the subscript j denotes the six neighbor cells of the current cell. The ()avg, j terms stand for the average of the j-th adjacent cell and the current cell.

143 j = 5 x

y

j = 3 z

flow

j = 1 j = 2

cell j current cell

j = 4

j = 6

Figure 8.8: Adjacent cells and upstream weighting.

A The term L K avg, j is composed of the rock property K and the geometric properties A and L of the cells. Kis the absolute permeability in the j direction. A is the area of the interface between

two cells, cell j and the current cell. L is the length of a cell in the j direction. This term is

calculated by harmonic average:

KA KA KA 2 L L j = (8.93) L KA + KA  avg, j L  L  j [4].  

Krα The term ραxα,i involves the fluid properties ρα and µα, and the rock-fluid property µα avg, j   Krα. In this research, the average of them is determined by single-point upstream weighting:

Krα K µ ραxα,i Φα, j Φα rα ρ x = α j ≥ . (8.94) α α,i   µα avg, j     Krα ρ x Φ < Φ µα α α,i α, j α   8.3.2 Specific Treatments

8.3.2.1 Potential Difference

The calculation of the potential difference between cell j and the current cell is written as

Φα j Φα =(Pα j ρˆ α jgZˆ j) (Pα ρˆ αgZˆ ) , − , − , − −

144 ρˆ α, j + ρˆ α =(Pα j Pα) gˆ(Z j Z). (8.95) , − − 2 − The mass density is assumed to be in an arithmetic average, which offers enough accuracy.

8.3.2.2 Well Treatment

A well may penetrate multi-layers as illustrated in Figure 8.9. The reference depth ZBH, where Pbh locates, is selected at the first perforation. The flow rate of phase α in the well is calculated by

Np Krα Qα = WIper fi ρα (Pbh Pα celli) ρˆ wellgˆ(ZBH Zcelli) . (8.96) ∑ µ − , − − i=1  α avg,per fi   Np is the number of perforations. WIper fi is the well index for perforation i. Pα,celli is the phase α

pressure in the cell i where perforation i locates. Zcelli is the depth of cell i.

Pbh

Pbh

vertical well horizontal well

Figure 8.9: Well model.

Figure 8.10 shows the flow in an injection well and a production well. The calculation of the

rock-fluid part is given as:

Krα (ρα) (Pbh Pα celli) ρˆ wellgˆ(ZBH Zcelli) 0 Krα µα well , ρ =  downstream − − − ≥ . µ α   α avg,per fi  Krα    ρα (Pbh Pα,celli) ρˆ wellgˆ(ZBH Zcelli) < 0 µα cell,i − − −   (8.97)  When the well is an injector, ρα is determined by the upstream fluid in the well. Specific treatment for mobility Krα is necessary for perforations of an injection well. The mobility should be the total µα

145 mobility of all phases in a perforation cell at the downstream. This treatment avoids weak injection of a phase at the start when the saturation of the phase in the perforation cell is minimal. With the injection process of the phase α, the effect of the other phases in the mobility will become trivial.

When the well is a producer, the rock-fluid part is calculated by the fluid properties in the cell.

As the fluid in the well, especially for production, is a mixture of different phases, ρˆ well can be obtained from the mass flow rate divided by the volume flow rate in the well.

upstream

upstream

injection well production well

Figure 8.10: Flow in well.

8.3.3 Nonlinear System

Figure 8.11-(1) shows a schematic diagram of the reservoir model. There are nx, ny and nz cells

(grid blocks) in different coordinate directions. So the number of cells is ncell (nx ny nz). One × × water component, No oil components, and Ng non-condensable gas components exist. And each cell has its own set of unknowns: Po, xo,i, xg,i, Cc, T, Sw, and Sg, referring to Table 8.2. The model has nwell wells. Each well has one unknown, Pbh. The total number n of equations for the discrete system is determined by

n =(No + Ng + 5) ncell + nwell. (8.98) × The system is a non-linear system written as

~F(U~ )=~0, (8.99)

146 where ~F and U~ represent the equation vector and unknown vector, respectively.

well 1

well 1 well 2

16 17 18

13 14 15 well 2 10 11 12

n z

7 8 9

4 5 6

1 2 3

n y

n x

(1) (2)

Figure 8.11: Reservoir model.

8.3.4 Newton-Raphson Method

The Newton-Raphson method is applied to the solution of a nonlinear system. Figure 8.12 shows the Newton’s iteration from l to l + 1. The process is written as a linear system

(l) (l+1) (l) (l) ~F′(U~ )(U~ U~ )= ~F(U~ ), (8.100) − − where ~F (U~ (l)), the derivative, is called a Jacobian matrix. If U~ (l) is known, U~ (l+1) U~ (l) can be ′ − solved by the Jacobian system, which gives the update of U~ (l+1). Starting from an initial U~ (0),

Newton’s iteration continues until the requirement of tolerance ε is satisfied:

\[
\|\vec{F}(\vec{U}^{(l)})\|_2 < \varepsilon.
\tag{8.101}
\]

Figure 8.12: Newton’s iteration.

The calculation of the Jacobian matrix by the vector-by-vector derivative is given as

\[
\vec{F}'(\vec{U}) = \frac{d\vec{F}}{d\vec{U}} =
\begin{pmatrix}
\dfrac{\partial F_1}{\partial U_1} & \dfrac{\partial F_1}{\partial U_2} & \cdots & \dfrac{\partial F_1}{\partial U_n} \\
\dfrac{\partial F_2}{\partial U_1} & \dfrac{\partial F_2}{\partial U_2} & \cdots & \dfrac{\partial F_2}{\partial U_n} \\
\vdots & \vdots & \ddots & \vdots \\
\dfrac{\partial F_n}{\partial U_1} & \dfrac{\partial F_n}{\partial U_2} & \cdots & \dfrac{\partial F_n}{\partial U_n}
\end{pmatrix}.
\tag{8.102}
\]
Both numerical and analytical approaches to calculate the partial derivatives ∂F_i/∂U_j are developed in the research. The numerical approach, given by

\[
\frac{\partial F_i}{\partial U_j} =
\frac{F_i(U_1,\ldots,U_j+\Delta U_j,\ldots,U_n) - F_i(U_1,\ldots,U_j,\ldots,U_n)}{\Delta U_j},
\qquad i = 1,\ldots,n,\; j = 1,\ldots,n,
\tag{8.103}
\]
is easy to implement. ΔU_j is always assigned a small value, such as 10^{-6}, to approximate the derivatives.

The analytical approach utilizes the rules of calculus to obtain partial derivatives directly. The partial derivatives of each row in the Jacobian matrix are the partial derivatives from the exact differential for each equation Fi. The exact differential is written as

\[
dF_i = \sum_{j=1}^{n}\frac{\partial F_i}{\partial U_j}\,dU_j,
\qquad i = 1,\ldots,n.
\tag{8.104}
\]

Although the analytical approach can provide exact theoretical derivatives, the complicated derivation process needs to be considered in the development.
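The numerical approach (8.103) can be sketched in C as follows. For clarity the sketch builds a dense n-by-n matrix, whereas the simulator's Jacobian is sparse and assembled cell by cell; the callback type and names are assumptions for this example only.

#include <stdlib.h>

/* Sketch of the numerical Jacobian of (8.103): each column j is obtained by
 * perturbing unknown U_j and differencing the residual vector F.  A dense
 * matrix is used here only for clarity.                                      */
typedef void (*residual_fn)(const double *U, double *F, int n, void *ctx);

void numerical_jacobian(residual_fn F, double *U, int n,
                        double *J /* n*n, row-major */, void *ctx)
{
    const double dU = 1.0e-6;                /* perturbation, as in the text */
    double *F0 = malloc(n * sizeof(double));
    double *F1 = malloc(n * sizeof(double));
    F(U, F0, n, ctx);                        /* residual at the base point   */
    for (int j = 0; j < n; ++j) {
        double saved = U[j];
        U[j] += dU;                          /* perturb one unknown          */
        F(U, F1, n, ctx);
        for (int i = 0; i < n; ++i)
            J[i * n + j] = (F1[i] - F0[i]) / dU;   /* dF_i/dU_j              */
        U[j] = saved;                        /* restore                      */
    }
    free(F0);
    free(F1);
}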

If the nonlinear system is constructed by traversing all cells in sequence as illustrated in Figure 8.11, the pattern of the Jacobian matrix is as illustrated in Figure 8.13. The rows and columns correspond to the equations and unknowns, respectively. The symbol "×" of a cell stands for a submatrix because each cell has multiple equations. If the "×" lies on the diagonal, the submatrix is formed by the partial derivatives of the cell's equations with respect to its own unknowns. Otherwise, the "×" represents a submatrix formed by the cell's equations and its adjacent cell's unknowns.

Figure 8.13: Pattern of Jacobian matrix.

The Jacobian system is a linear system that is usually large-scale and sparse. The commonly used solvers are Krylov subspace solvers with appropriate preconditioners. An efficient solution of the Jacobian system is critical for the performance of a reservoir simulator. The Newton-Raphson method for solving the nonlinear system (8.99) is summarized in Algorithm 16.

Algorithm 16 Newton-Raphson method.
1: Set l = 0 and initialize U^(l)
2: while (true) do
3:   Assemble the right-hand side: -F(U^(l))
4:   if (||F(U^(l))||_2 < ε) then
5:     Break                                  ⊲ Solution succeeds
6:   end if
7:   if (l = lmax) then
8:     Break                                  ⊲ Solution fails
9:   end if
10:  Assemble the Jacobian matrix: F'(U^(l))
11:  Solve the Jacobian system: F'(U^(l)) ΔU = -F(U^(l))   ⊲ Needs a linear solver
12:  U^(l) = U^(l) + ΔU                        ⊲ Assign value for the next Newton iteration
13:  Set l = l + 1
14: end while
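Algorithm 16 maps onto a driver loop of the following form. The sketch assumes external routines for residual assembly, Jacobian assembly, and a preconditioned linear solve; these interfaces are hypothetical and stand in for the simulator's actual modules.

#include <math.h>

/* Assumed interfaces for the sketch (not the simulator's real API). */
void assemble_residual(const double *U, double *F, int n);   /* F(U)   */
void assemble_jacobian(const double *U, double *J, int n);   /* F'(U)  */
int  linear_solve(const double *J, const double *rhs, double *dU, int n);

static double norm2(const double *v, int n)
{
    double s = 0.0;
    for (int i = 0; i < n; ++i) s += v[i] * v[i];
    return sqrt(s);
}

/* Sketch of Algorithm 16: returns 0 on convergence, -1 on failure. */
int newton_solve(double *U, double *F, double *J, double *dU,
                 int n, double eps, int lmax)
{
    for (int l = 0; ; ++l) {
        assemble_residual(U, F, n);
        if (norm2(F, n) < eps) return 0;          /* solution succeeds       */
        if (l == lmax) return -1;                 /* solution fails          */
        assemble_jacobian(U, J, n);
        for (int i = 0; i < n; ++i) F[i] = -F[i]; /* right-hand side -F(U)   */
        if (linear_solve(J, F, dU, n) != 0) return -1;  /* needs linear solver */
        for (int i = 0; i < n; ++i) U[i] += dU[i];      /* next Newton iterate */
    }
}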

8.3.5 Simulator Flow

A simulation process always consists of a series of time steps. The initial time step is set to a small

interval, such as 0.001 days. As the simulation process advances, we need to increase the time step

gradually to shorten the total simulation time without losing the expected simulation accuracy. The principle of adjustment is to restrict the change of the unknowns between two adjacent time steps to a range around some desired values.

The next time step size Δt^{n+1} can be estimated from the previous time step size Δt^{n} by

\[
\Delta t^{n+1} = \Delta t^{n}\,\min_{i,j}\left[\frac{(1+\omega)\,\eta_{i,j}}{\Delta U^{n}_{i,j} + \omega\,\eta_{i,j}}\right],
\qquad i = 1,\ldots,n_{cell},\; j = 1,\ldots,N_o+N_g+5,
\tag{8.105}
\]

where ΔU^n_{i,j} stands for the previous change of unknown j in cell i, and η_{i,j} represents the desired change [70]. The suggested η_{i,j} for pressure, temperature, and mole fraction are 72.5 psi, 86 F, and 0.2, respectively [84]. ω is a damping factor between 0.0 and 1.0. When ω is assigned 0, (8.105) reduces to a linear extrapolation; as ω grows, the increment or decrement of the time step becomes more moderate. A default value of 0.75 for ω is proposed in the CMG STARS user guide [84]. Generally, the ratio Δt^{n+1}/Δt^{n} should be restricted between 0.5 and 2.
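A sketch of the time step selection rule (8.105), including the ratio restriction mentioned above, might look as follows in C; the function name and argument layout are assumptions for this example.

#include <math.h>

/* Sketch of the time step control of (8.105).  dU[k] holds the change of one
 * unknown over the previous step and eta[k] the corresponding desired change;
 * omega is the damping factor (0.75 by default).  The resulting ratio is
 * clamped to [0.5, 2.0] as suggested in the text.                            */
double next_time_step(double dt_prev, const double *dU, const double *eta,
                      int nvar, double omega)
{
    double rmin = 2.0;                        /* start from the upper limit   */
    for (int k = 0; k < nvar; ++k) {
        double change = fabs(dU[k]);
        double r = (1.0 + omega) * eta[k] / (change + omega * eta[k]);
        if (r < rmin) rmin = r;
    }
    if (rmin < 0.5) rmin = 0.5;               /* restrict dt ratio to [0.5, 2] */
    return dt_prev * rmin;
}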

Figure 8.14 summarizes the entire flow of the simulator. After loading the input data, the simulator initializes the unknowns and physical properties and sets the time t to zero. The module of time step control is responsible for updating Δt and the new time point t. At each time step, the state represented by the new unknown values and property values is calculated by solving a nonlinear system with the Newton-Raphson method. A linear iterative solver is used for the solution of the Jacobian system in each Newton iteration. When the simulation time reaches tmax, the simulation ends successfully. If the iteration number reaches an upper limit during Newton's iteration, we restore the system data to the previous time step, halve the time step, and repeat the calculation. If such failures continue to occur beyond a specified limit, the simulation process is considered unsuccessful. Three loops are involved in the simulation process: the time step loop, the Newton iteration loop, and the linear solution loop.

Figure 8.14: Simulator flow.

Chapter 9

NUMERICAL EXPERIMENTS(2)

This chapter carries out the verification of the developed ISC simulator against a benchmark simulator, CMG STARS. The ISC simulator is developed with the PER method, while CMG STARS is implemented with the VS method. The difference in phase transition treatment causes an essential difference in their mathematical systems. As a comprehensive in-situ combustion simulator, the ISC simulator must be fully verified to ensure its correctness. We use two tube experiments to test the combustion functions, including dry combustion and wet combustion. A full-scale model is used to verify the function of heat loss. Several full-scale 2-D and 3-D models are run in the last two experiments, which are an indispensable step toward field applications.

The equivalence of the two types of simulators provides strong evidence for promoting the application of the PER method, which is more efficient in design and development.

9.1 Introduction

The ISC simulator is developed in the C language on the Linux operating system. Darcy units are used as the internal units of the implementation; other unit systems, such as field units and SI units, are supported.

CMG STARS is a commercial simulator that has undergone extensive industrial testing. In the

following experiments, ISC and STARS are used to denote the ISC simulator and the CMG STARS,

respectively. All the cases are solved on the distributed-memory parallel platform [96].

9.2 Dry Combustion Tube

This experiment simulates dry forward combustion under steady conditions. The prototype comes from the template (stdrm001.dat) released with CMG STARS version 2015.10, which is a 1-D vertical combustion tube test. All figures and tables of this experiment are named with the prefix "[Dry tube]".

Table 9.1 collects the input parameters for the ISC simulator. Tables 9.2 and 9.3 give the changes of Krw and Krg with Sw and Sl, respectively. The tube has 12 cells (grid blocks) named "cell 1" to "cell 12" from top to bottom. Air is injected into cell 1 and the fluid is produced from cell 12. An external heater with a heat rate of 4800 Btu/day is used to raise the temperature in cell 1 for 0.5 hours as ignition.

Figures 9.1 to 9.6 show the combustion process of the tube. We can clearly see that there is a combustion front, which is the highest-temperature region. Over time, the combustion front keeps moving forward.

Figure 9.7 to Figure 9.17 show the curves of major variables with time. The results of ISC and

STARS are drawn with solid style and dot style, respectively. Figure 9.7 describes the bottom hole pressure of the injection well. Figure 9.8 presents the cumulative oil production of the production well. Figure 9.9 to Figure 9.17 show the curves of the unknowns of the mathematical system.

Only the results of cells 1, 5, 9, 12 are displayed to avoid the confusion caused by too many curves. Although ISC and STARS use different treatment methods in a phase transition, we can see that the ISC curves match the STARS curves very well. The correctness of the design and development of ISC is verified.

The injection well operates at a constant injection rate. It requires a high bottom hole pressure at the beginning, and then the bottom hole pressure decreases gradually; see Figure 9.7. The cumulative oil production rises steadily before 0.8 days, but very little oil is produced after that; see Figure 9.8.

The time for the combustion front to reach each cell is observed from the temperature curves; see

Figure 9.15.

Figure 9.11 shows the change of xo,HO, the mole fraction of the heavy oil component in the oil phase, with time. There is a difference between ISC and STARS: xo,HO of ISC goes to 1, while xo,HO of STARS approaches 0. The reason is that STARS assigns both xo,LO and xo,HO a value of 0 when So is 0. In the ISC simulator, however, So is assumed to be very small even when the oil phase disappears, so the mole fraction constraint xo,LO + xo,HO = 1 still needs to be satisfied. When the combustion front passes through a cell, the liquid phases disappear, and xg,LO and xg,HO go to 0. The heavy oil component uses the pseudo K value, which allows xo,HO to change flexibly. Thus xo,LO is allowed to become 0 because of the phase equilibrium, and xo,HO must be assigned 1 because of the mole fraction constraint. In this case, the value of xo,HO has no actual meaning because the oil phase has disappeared.

The changes of Sw, So, Sg and Cc with distance are given by Figures 9.18, 9.19, 9.20 and 9.21. Three representative times, 0.25 days, 0.50 days and 0.75 days, are selected for each plot. Based

on these data, three graphs of pore volume occupation are generated to describe the combustion

process more clearly; see Figures 9.23, 9.25 and 9.27. A corresponding graph is put above each

of the pore volume graphs to denote the combustion front and the mole fractions of condensable

components in the gas phase; see Figure 9.22, 9.24 and 9.26. For the dry in-situ combustion,

four zones exist between the injection well and the production well, as described by Latil [97]. We use Figure 9.23 to analyze these zones.

1. The burned zone (cell 1).

The zone lies behind the combustion front, and the combustion has taken place. It is clean and

occupied by the gas phase.

2. The combustion zone (cell 2).

The light oil, heavy oil, and coke participate in the oxidation reactions in this zone. The com-

bustion front has the highest temperature, but it is very thin, less than several inches.

3. The cracking/vaporization zone (cells 3 and 4).

The pyrolysis reaction of heavy oil takes place downstream of the combustion front. The heavy

oil is pyrolyzed into COx, light oil gases, and coke deposited on the rock. The light oil is vaporized by the heat released from the reaction and then transported downstream.

4. The steam plateau, water bank, oil bank, and initial zone (cells 5 to 12).

The steam plateau at the steam saturation temperature comes into being next to the cracking/vaporization zone. Continuous vaporization and condensation of water and light oil take place.

Most of the light oil vapor condenses further downstream as the steam condenses. A water bank

forms at the leading edge of the steam plateau because of the lower temperature. As the water

bank decreases in temperature and saturation downstream, it displaces a zone with a higher

oil saturation than that of the initial reservoir. This zone is an oil bank with most of the dis-

placed oil and the light oil from cracking reaction. Beyond the oil bank, the reservoir gradually

approaches the undisturbed original status.

Figures 9.25 and 9.27 present the pore volume occupation with the combustion front moving

forward. The high temperature caused by chemical reactions at the combustion front leads to the

disappearance of the water phase and oil phase. The in-situ combustion recovery is a process in

which the gas phase progressively occupies the dominant position.

Table 9.1: [Dry tube] Input data.

Unit Field

Grid

Origin (x, y, z) 0.0, 0.0, 0.0

nx, ny, nz 1, 1, 12 ∆x, ∆y, ∆z [ ft] 0.1602, 0.1602, 0.22048

Rock

Porosity

1 1 1 φre f , cφ,p psi , cφ,t F , cφ,p,t psi-F 0.4142, 0.0, 0.0, 0.0 h i   h Absolutei permeability

Kxx, Kyy, Kzz [mD] 12700, 12700, 12700 Thermal conductivity

Btu κr ft-day-F 24 h i Internal energy

Btu Btu Cvr1 ft3-F , Cvr2 ft3-F2 35.02, 0.0 h i h i Surrounding rock

Btu κcap,κbase ft-day-F 0.0, 0.0 h Btu i Cv,cap,Cv,base ft3-F 0.0, 0.0 h i Fluid thermal conductivity

Btu κw, κo, κg ft-day-F 8.64, 1.848, 1.9992 h i Reference conditions

Po [psi] 14.7

T o [F] 77

Continued on next page

Surface conditions(1)

Psur f [psi] 14.65

Tsur f [F] 62 Component i

Name H2O LO HO O2 COx Coke Molecular weight

lb Mi lbmol 18.0 156.7 675 32.0 40.8 13.0   Critical pressure, critical temperature

Pc,i [psi] 3155 305.7 120 730.0 500.0

Tc,i [F] 705.7 651.7 1138 -181.0 -232.0 molar density l = w,o

lbmol ρre f ,l,i ft3 3.466 0.3195 0.0914 4.4 h 1 i cl,p,i psi 3e-6 5e-6 5e-6 0.0 h 1 i cl,t1,i F 1.2e-4 2.839e-4 1.496e-4 0.0  1  cl,t2,i F2 0.0 0.0 0.0 0.0 h 1 i cl,p,t,i psi-F 0.0 0.0 0.0 0.0 h i Viscosity in gas phase

cp avg,i b 8.822e-6 2.166e-6 3.926e-6 2.1960e-4 2.1267e-4 F vg,i h i bvg,i 1.116 0.943 1.102 0.721 0.702 Viscosity in liquid phase l = w,o

avl,i [cp] 0.0047352 4.02e-4 4.02e-4

(2) bvl,i R 2728.2 6121.6 6121.6 h i Enthalpy in gas phase

Btu Cp,g1,i lbmol-F 7.613 -1.89 -8.14 6.713 7.44  Btu  Cp,g2,i lbmol-F2 8.616e-4 0.1275 0.549 -4.883e-7 -0.0018 h i Continued on next page

Name H2O LO HO O2 COx Coke

Btu Cp,g3,i lbmol-F3 0.0 -3.9e-5 -1.68e-4 1.287e-6 1.975e-6 h Btu i Cp,g4,i lbmol-F4 0.0 4.6e-9 1.98e-8 -4.36e-10 -4.78e-10 h i Enthalpy in liquid phase l = w,o

Btu hvr,l,i lbmol-F0.38 1657.0 1917 12198 h i ev,l,i 0.38 0.38 0.38 Internal energy for coke

Btu Cpc1 lbmol-F 4.06  Btu  Cpc2 lbmol-F2 0.0 h i K value, l = w,o

kl,i,1 [psi] 1.7202e6 1.4546e5 2.7454e5 1 kl,i,2 psi 0.0 0.0 0.0 h i kl,i,3 0.0 0.0 0.0

kl,i,4 [F] -6869.59 -4458.73 -8424.83

kl,i,5 [F] -376.64 -387.78 -205.69
Chemical reactions
r1 LO oxidation: LO + 14.06 O2 → 6.58 H2O + 11.96 COx
r2 HO oxidation: HO + 60.55 O2 → 28.34 H2O + 51.53 COx
r3 HO pyrolysis: HO → 2.154 LO + 25.96 Coke
r4 Coke oxidation: Coke + 1.18 O2 → 0.55 H2O + COx
Reaction rate R_j [mol/(ft³·day)], j = 1,...,4
LO oxidation: R1 = A1 e^(−E1/(RT)) Pg xg,O2 (φ So ρo xo,LO)
HO oxidation: R2 = A2 e^(−E2/(RT)) Pg xg,O2 (φ So ρo xo,HO)
HO pyrolysis: R3 = A3 e^(−E3/(RT)) (φ So ρo xo,HO)
Coke oxidation: R4 = A4 e^(−E4/(RT)) Pg xg,O2 Cc

A_j | E_j [Btu/lbmol] | H_{r,j} [Btu/lbmol]
LO oxidation: A1 = 7.248e11 [1/(psi·day)], E1 = 59450, H_{r,1} = 2.9075e6
HO oxidation: A2 = 7.248e11 [1/(psi·day)], E2 = 59450, H_{r,2} = 1.2525e7
HO pyrolysis: A3 = 1.00008e7 [1/day], E3 = 27000, H_{r,3} = 40000
Coke oxidation: A4 = 1.00008e4 [1/(psi·day)], E4 = 25200, H_{r,4} = 2.25e5
Initial conditions

Po [psi] 2014.7

xo,LO, xo,HO 0.744, 0.256

xg,H2O,xg,LO,xg,HO,xg,O2 ,xg,COx 0.0, 0.0, 0.0, 0.21, 0.79 lbmol Cc ft3 0.0 Th [F] i 100.0

Sw, Sg 0.178, 0.168 Well conditions

Injection well

WI [ ft-mD] 5.54

Direction Vertical

Injection fluid Gas

Gas composition (H2O,LO,HO,O2,COx) 0.0, 0.0, 0.0, 0.21, 0.79

Gas rate [MSCF(3)/day] 0.013296
Tinj [F] 70

Pin j [psi] 10000.0 Production well

WI [ ft-mD] 5.54

Direction Vertical

Pbh [psi] 2014.7 Continued on next page

(1): The surface conditions are used for calculating well rate.

(2): The liquid viscosity equations, (8.34) and (8.36), require the temperature in Rankine, R [84].

(3): MSCF = 1000 ft3 at the surface conditions.

Table 9.2: [Dry tube] Swt.

Sw Krw Krow Pcow
0.1 0.0 0.9 0.0

0.25 0.004 0.6 0.0

0.44 0.024 0.28 0.0

0.56 0.072 0.144 0.0

0.672 0.168 0.048 0.0

0.752 0.256 0.0 0.0

Table 9.3: [Dry tube] Slt.

Sl Krg Krog Pcgo
0.21 0.784 0.0 0.0

0.32 0.448 0.01 0.0

0.4 0.288 0.024 0.0

0.472 0.184 0.052 0.0

0.58 0.086 0.152 0.0

0.68 0.024 0.272 0.0

0.832 0.006 0.448 0.0

0.872 0.0 0.9 0.0

Figure 9.1: [Dry tube] T at 0.001 days. Figure 9.2: [Dry tube] T at 0.25 days.
Figure 9.3: [Dry tube] T at 0.50 days. Figure 9.4: [Dry tube] T at 0.75 days.
Figure 9.5: [Dry tube] T at 1.00 day. Figure 9.6: [Dry tube] T at 1.25 days.
Figure 9.7: [Dry tube] Pbh of injection well vs. t.
Figure 9.8: [Dry tube] Cumulative oil production vs. t.
Figure 9.9: [Dry tube] Po vs. t.
Figure 9.10: [Dry tube] xo,LO vs. t.
Figure 9.11: [Dry tube] xo,HO vs. t.
Figure 9.12: [Dry tube] xg,O2 vs. t.
Figure 9.13: [Dry tube] xg,COx/N2 vs. t.
Figure 9.14: [Dry tube] Cc vs. t.
Figure 9.15: [Dry tube] T vs. t.
Figure 9.16: [Dry tube] Sw vs. t.
Figure 9.17: [Dry tube] Sg vs. t.
Figure 9.18: [Dry tube] Sw vs. distance.
Figure 9.19: [Dry tube] So vs. distance.
Figure 9.20: [Dry tube] Sg vs. distance.
Figure 9.21: [Dry tube] Cc vs. distance.
Figure 9.22: [Dry tube] T, xg,H2O, xg,LO, and xg,HO at 0.25 days vs. distance.
Figure 9.23: [Dry tube] Occupation of pore volume at 0.25 days vs. distance.
Figure 9.24: [Dry tube] T, xg,H2O, xg,LO, and xg,HO at 0.50 days vs. distance.
Figure 9.25: [Dry tube] Occupation of pore volume at 0.50 days vs. distance.
Figure 9.26: [Dry tube] T, xg,H2O, xg,LO, and xg,HO at 0.75 days vs. distance.
Figure 9.27: [Dry tube] Occupation of pore volume at 0.75 days vs. distance.

9.3 Wet Combustion Tube

Wet combustion is a branch of in-situ combustion, in which water is injected with air simultane-

ously or alternately. The injected water absorbs heat from the area swept by the combustion front

and transfers the heat to ahead of the combustion front in the form of steam. The advantage of wet

combustion is that it makes more use of the heat generated by combustion to improve oil recovery

and save air injection. However, the cost of using water resources is inevitable.

The experiment of the wet combustion tube continues to use the parameters in Tables 9.1, 9.2

and 9.3. The vertical tube has 12 cells, with an injection well at cell 1 on the top and a production

well at cell 12 on the bottom. Four cases with different water injection rates are tested in the experiment. First, air is injected into the injector at a rate of 0.013296 MSCF/day. After 0.2 days, each case injects water at a different rate simultaneously with the air injection. The co-injection rate of the liquid water and air stays unchanged at 0.013296 MSCF/day, but case 1 to case 4 use 2.5 × 10⁻⁴, 5.0 × 10⁻⁴, 7.5 × 10⁻⁴, and 1.0 × 10⁻³ volume of water, respectively. The figures for this experiment are named with the prefix "[Wet tube]".

Figures 9.28 to 9.33 present the combustion effect graphs of the 7.5 × 10⁻⁴ water injection case. Compared with the effect diagrams of the dry combustion tube, Figures 9.1 to 9.6, the forward speed of wet combustion is close to that of the dry combustion tube. In order to see the difference between

wet combustion and dry combustion clearly, Figures 9.34 and 9.35 show the comparison of tem-

perature curves between different wet combustion cases and the dry combustion case from two

different angles. Figure 9.34 shows the temperature change of cell 5. On the whole, the time of

peak temperature reaching cell 5 is close to each other. However, as the injected water volume

increases, the peak temperatures of case 3 and case 4 become slightly lower. A large difference is

the temperature change behind the combustion front, which has a steep trend as the water amount

increases. The reason is, the higher the amount of water injected, the more heat carried forward

by the steam, and the faster the decrease of temperature. Figure 9.35 gives the temperature distri-

bution in the tube at 0.50 days. In addition to the characteristics mentioned, we can see that there

175 is a region ahead of the combustion front where the temperature drops slowly; see cell 7 to cell 8.

This plateau is called the vaporization-condensation zone, where steam has a saturated state. The

energy released by the steam heats the local oil and improves the sweep efficiency. Because case 4 has the largest amount of injected water, its plateau is also the longest and advances fastest.

Figures 9.36 to 9.39 display the changes of Sw, So, Sg, and Cc with distance at 0.50 days, respectively. Based on these data, we generate the graphs of pore volume occupation for the different cases; see Figures 9.40 to 9.43. Figure 9.42 is taken as an example for the analysis of the wet combustion zones.

1. The burned zone (cells 1 to 3).

The burned zone has already burned and been swept by the combustion front. It is filled with

the injected air and water because the temperature is below the boiling point of water at the

reservoir condition.

2. The water evaporation zone (cell 4).

The temperature goes up from the burned zone to the combustion front. The water vaporizes

from the liquid phase to steam in this region. A large amount of heat from combustion is

absorbed and carried by steam.

3. The combustion zone (cell 5).

Oxygen and hydrocarbons react in this region, creating the combustion front that has a peak

temperature.

4. The cracking/vaporization zone (cell 6).

The pyrolysis reaction of heavy oil generates coke mainly in this zone. Light oil vaporizes and

flows downstream.

5. The steam plateau (cells 7 to 9).

The steam generated in the water evaporation zone moves through the combustion front and

carries energy to heat this zone. The heat helps more light oil vaporize and preheat the place

before combustion. The steam plateau is a vaporization-condensation zone. With the advance

176 of the combustion front, water continues to condense and evaporate in this plateau, forming

a temperature platform area. Figure 9.35 shows that the length of the plateau expands as the

amount of water injected increases.

6. The water bank, oil bank, and initial zone (cells 10 to 12).

A water bank and an oil bank form in turn ahead of the steam plateau. The reservoir returns

to the initial status farther away.

By comparing the area occupied by oil in Figure 9.25 and Figures 9.40 to 9.43, we can see that

more oil is displaced at 0.50 days as we improve the amount of water injected. Wet combustion can

make more use of the heat generated by combustion than dry combustion to improve the production

of oil.

Figure 9.28: [Wet tube] T at 0.001 days. Figure 9.29: [Wet tube] T at 0.25 days.
Figure 9.30: [Wet tube] T at 0.50 days. Figure 9.31: [Wet tube] T at 0.75 days.
Figure 9.32: [Wet tube] T at 1.00 day. Figure 9.33: [Wet tube] T at 1.25 days.
Figure 9.34: [Wet tube] T vs. t of Cell 5.
Figure 9.35: [Wet tube] T vs. distance at 0.50 days.
Figure 9.36: [Wet tube] Sw vs. distance at 0.50 days.
Figure 9.37: [Wet tube] So vs. distance at 0.50 days.
Figure 9.38: [Wet tube] Sg vs. distance at 0.50 days.
Figure 9.39: [Wet tube] Cc vs. distance at 0.50 days.
Figure 9.40: [Wet tube] Occupation of pore volume of 2.5 × 10⁻⁴ water at 0.50 days vs. distance.
Figure 9.41: [Wet tube] Occupation of pore volume of 5.0 × 10⁻⁴ water at 0.50 days vs. distance.
Figure 9.42: [Wet tube] Occupation of pore volume of 7.5 × 10⁻⁴ water at 0.50 days vs. distance.
Figure 9.43: [Wet tube] Occupation of pore volume of 1.0 × 10⁻³ water at 0.50 days vs. distance.

9.4 Full-Scale Heat Loss

This experiment tests a 1-D full-scale horizontal reservoir. Because the reservoir has a large contact

area with surrounding rocks, the effect of heat loss must be taken into account, which is the focus of

this experiment. The figures and tables of this experiment are named with the prefix “[Full-scale]”.

The input data in Table 9.4 are collected and modified from the publication of Chung-Kan

Huang, Crookston, and Coats [83, 69, 72]. Krw and Krg in Tables 9.5 and 9.6 are generated from the relative permeability parameters. The reservoir has 10 cells named "cell 1" to "cell 10" from left to right. Each cell has a size of 16.4 ft × 115.0 ft × 21.0 ft. The cap rock and base rock are located on the upper and lower x-y surfaces of the reservoir. Oxygen is injected into the well in cell 1. Spontaneous ignition is feasible and is adopted in this experiment. The production well lies at cell 10.

Figures 9.44 to 9.49 show the combustion process of the reservoir when heat loss is considered. The initial temperature is 200.33 F. With the gradual advance of the combustion front, the temperature of the front slowly rises. The changes in temperature are presented in Figures 9.50 and 9.51 from different angles. The gray curves reflect the results of the situation without heat loss. In the case with heat loss, much heat is transferred to the surrounding rocks. The combustion front has a higher temperature and a faster moving speed in the situation without heat loss than in the situation with heat loss. In Figure 9.50, the cells with heat loss show a sharper decrease in the post-burning period than the cells without heat loss. In Figure 9.51, the maximum temperature with heat loss is close to 600 F, while the highest temperature without heat loss is approximately 900 F, which is a considerable gap. The comparison explains the necessity of considering heat loss when the reservoir has a large contact area with surrounding rocks.

The changes of Sw, So, Sg and Cc with distance are shown in Figures 9.52, 9.53, 9.54 and 9.55. A tiny value, or a value close to zero, is significantly affected by numerical processing. Due to the very low coke concentration, the match between ISC and STARS shows a certain deviation in Figure 9.55.

Based on the results at 90 days, Figures 9.56 and 9.57 show the pore volume occupation without heat loss and with heat loss, respectively. By comparing the two diagrams, we obtain the following conclusions. Because the combustion front moves faster in the case without heat loss, the oil bank in Figure 9.56 moves faster accordingly. Because the temperature is lower in the case with heat loss in Figure 9.57, the amount of coke oxidized is relatively small, and more coke is deposited on the rock.

Table 9.4: [Full-scale] Input data.

Unit Field

Grid

Origin (x, y, z) 0.0, 0.0, 0.0

nx, ny, nz 10, 1, 1 ∆x, ∆y, ∆z [ ft] 16.4, 115.0, 21.0

Rock

Porosity

1 1 1 φre f , cφ,p psi , cφ,t F , cφ,p,t psi-F 0.38, 0.0, 0.0, 0.0 h i   h Absolutei permeability

Kxx, Kyy, Kzz [mD] 4000.0, 4000.0, 4000.0 Thermal conductivity

Btu κr ft-day-F 38.4 h i Internal energy

Btu Btu Cvr1 ft3-F , Cvr2 ft3-F2 35.02, 0.0 h i h i Surrounding rock

Btu κcap,κbase ft-day-F 38.4, 38.4 h Btu i Cv,cap,Cv,base ft3-F 35.02, 35.02 h i Relative permeability

Krwro, Krocw, Krgro 0.25, 1.0, 0.7

Zg, Zow, Zog, Zw 1.0, 3.0, 3.0, 3.0

Sorg, Sorw, Swc, Sgc 0.09, 0.4, 0.2, 0.05 Fluid thermal conductivity

Btu κw, κo, κg ft-day-F 0.0, 0.0, 0.0 h i Continued on next page

Reference conditions

Po [psi] 14.7

T o [F] 77

Surface conditions(1)

Psur f [psi] 14.65

Tsur f [F] 62 Component i

Name H2O LO HO O2 COx Coke Molecular weight

lb Mi lbmol 18.0 44.0 170.0 32.0 44.0 13.0   Critical pressure, critical temperature

Pc,i [psi] 3206.2 615.9 264.6 730.0 1073.0

Tc,i [F] 705.73 205.93 725.23 -181.77 88.03 molar density l = w,o

lbmol ρre f ,l,i ft3 3.47 0.4545 0.25 6.154 h 1 i cl,p,i psi 1.0e-5 7.69e-4 3.8e-4 0.0 h 1 i cl,t1,i F 3.8e-4 2.2e-4 1.0e-5 0.0  1  cl,t2,i F2 0.0 0.0 0.0 0.0 h 1 i cl,p,t,i psi-F 0.0 0.0 0.0 0.0 h i Viscosity in gas phase

cp avg,i b 0.08822e-4 0.2166e-4 0.03926e-4 2.196e-4 2.1267e-4 F vg,i h i bvg,i 1.116 0.943 1.12 0.721 0.702 Viscosity in liquid phase l = w,o

avl,i [cp] 0.007520 0.02083 3.624e-4

(2) bvl,i R 2492.75 959.6 8485.0 h i Continued on next page

Name H2O LO HO O2 COx Coke
Enthalpy in gas phase

Btu Cp,g1,i lbmol-F 7.613 -52.0192 57.8 7.68 11.0  Btu  Cp,g2,i lbmol-F2 8.616e-4 0.15189 0.06018 0.0 0.0 h Btu i Cp,g3,i lbmol-F3 0.0 0.0 0.0 0.0 0.0 h Btu i Cp,g4,i lbmol-F4 0.0 0.0 0.0 0.0 0.0 h i Enthalpy in liquid phase l = w,o

Btu hvr,l,i lbmol-F0.38 1657.0 0.0 0.0 h i ev,l,i 0.38 0.0 0.0 Internal energy for coke

Btu Cpc1 lbmol-F 3.9  Btu  Cpc2 lbmol-F2 0.0 h i K value, l = w,o

kl,i,1 [psi] 1.7202e6 1.307e5 1.849e5 1 kl,i,2 psi 0.0 0.0 0.0 h i kl,i,3 0.0 0.0 0.0

kl,i,4 [F] -6869.59 -3370.0 -6739.0

kl,i,5 [F] -376.64 -414.38 -292.57
Chemical reactions
r1 LO oxidation: LO + 5.0 O2 → 4.0 H2O + 3.0 COx
r2 HO oxidation: HO + 18.0 O2 → 13.0 H2O + 11.64 COx
r3 HO pyrolysis: HO → 2.0 LO + 0.484 COx + 4.67 Coke
r4 Coke oxidation: Coke + 1.25 O2 → 0.5 H2O + COx

Continued on next page

Reaction rate R_j [mol/(ft³·day)], j = 1,...,4
LO oxidation: R1 = A1 e^(−E1/(RT)) Pg xg,O2 (φ So ρo xo,LO)
HO oxidation: R2 = A2 e^(−E2/(RT)) Pg xg,O2 (φ So ρo xo,HO)
HO pyrolysis: R3 = A3 e^(−E3/(RT)) (φ So ρo xo,HO)
Coke oxidation: R4 = A4 e^(−E4/(RT)) Pg xg,O2 Cc
A_j | E_j [Btu/lbmol] | H_{r,j} [Btu/lbmol]
LO oxidation: A1 = 1.0e6 [1/(psi·day)], E1 = 3.33e4, H_{r,1} = 9.48e5
HO oxidation: A2 = 1.0e6 [1/(psi·day)], E2 = 3.33e4, H_{r,2} = 3.49e6
HO pyrolysis: A3 = 0.3e6 [1/day], E3 = 2.88e4, H_{r,3} = 2.0e4
Coke oxidation: A4 = 1.0e6 [1/(psi·day)], E4 = 2.34e4, H_{r,4} = 2.25e5
Initial conditions

Po [psi] 65.0

xo,LO, xo,HO 0.0, 1.0

xg,H2O,xg,LO,xg,HO,xg,O2 ,xg,COx 0.178617, 0.0, 3.283e-3, 0.0, 0.8181 lbmol Cc ft3 0.0 Th [F] i 200.33

Sw, Sg 0.2, 0.5 Well conditions

Injection well

rw [ ft] 0.5 Direction Vertical

Injection fluid Gas

Gas composition (H2O,LO,HO,O2,COx) 0.0, 0.0, 0.0, 1.0, 0.0

Gas rate [MSCF(3)/day] 115
Tinj [F] 200.33

189 Pin j [psi] 10000.0 Production well

rw [ ft] 0.5 Direction Vertical

Pbh [psi] 60.0 (1): The surface conditions are used for calculating well rate.

(2): The liquid viscosity equations, (8.34) and (8.36), require the temperature in Rankine, R [84].

(3): MSCF = 1000 ft3 at the surface conditions.

Table 9.5: [Full-scale] Swt.

Sw Krw Krow Pcow
0.2 0 1 0.0

0.225 6.10352e-005 0.823975 0.0

0.25 0.000488281 0.669922 0.0

0.275 0.00164795 0.536377 0.0

0.3 0.00390625 0.421875 0.0

0.325 0.00762939 0.324951 0.0

0.35 0.0131836 0.244141 0.0

0.375 0.0209351 0.177979 0.0

0.4 0.03125 0.125 0.0

0.425 0.0444946 0.0837402 0.0

0.45 0.0610352 0.0527344 0.0

0.475 0.0812378 0.0305176 0.0

0.5 0.105469 0.015625 0.0

0.525 0.134094 0.0065918 0.0

0.55 0.16748 0.00195312 0.0

0.575 0.205994 0.000244141 0.0

0.6 0.25 0 0.0

Table 9.6: [Full-scale] Slt.

Sl Krg Krog Pcgo
0.29 0.7 0 0.0

0.33125 0.65625 0.000244141 0.0

0.3725 0.6125 0.00195313 0.0

0.41375 0.56875 0.0065918 0.0

0.455 0.525 0.015625 0.0

0.49625 0.48125 0.0305176 0.0

0.5375 0.4375 0.0527344 0.0

0.57875 0.39375 0.0837402 0.0

0.62 0.35 0.125 0.0

0.66125 0.30625 0.177979 0.0

0.7025 0.2625 0.244141 0.0

0.74375 0.21875 0.324951 0.0

0.785 0.175 0.421875 0.0

0.82625 0.13125 0.536377 0.0

0.8675 0.0875 0.669922 0.0

0.90875 0.04375 0.823975 0.0

0.95 0 1 0.0

Figure 9.44: [Full-scale] T at 0.001 days. Figure 9.45: [Full-scale] T at 30 days.
Figure 9.46: [Full-scale] T at 60 days. Figure 9.47: [Full-scale] T at 90 days.
Figure 9.48: [Full-scale] T at 120 days. Figure 9.49: [Full-scale] T at 150 days.
Figure 9.50: [Full-scale] T vs. t.
Figure 9.51: [Full-scale] T vs. distance.
Figure 9.52: [Full-scale] Sw vs. distance.
Figure 9.53: [Full-scale] So vs. distance.
Figure 9.54: [Full-scale] Sg vs. distance.
Figure 9.55: [Full-scale] Cc vs. distance.
Figure 9.56: [Full-scale] Occupation of pore volume without heat loss at 90 days vs. distance.
Figure 9.57: [Full-scale] Occupation of pore volume with heat loss at 90 days vs. distance.

9.5 2-D Multi-Perf Wells

In this part, two experiments are used to demonstrate the implementation of multi-perf wells. The

physical properties and kinetic data are taken from Table 9.4 except for the reservoir geometry

and the well locations. This experimental scheme refers to [80]. The reservoir is designed as a

2-D structure of 4 × 1 × 3 cells. Each cell has a size of 41 ft × 115 ft × 21 ft. All the cells are numbered from "cell 1" to "cell 12" in order from left to right and bottom to top. The injection well has a fixed oxygen rate of 220 MSCF/day. The bottom hole pressure of the production well is fixed at 60 psi.

Single-Perf Injection and Multi-Perf Production

The injection well has only one perforation at cell 1 of the bottom layer. The production well

has three perforations at cell 4, cell 8, and cell 12, which are the far-right cell of each layer. The

experiment is abbreviated to “[2-D 1]” in the captions of its figures. Figures 9.58 to 9.63 show

the effect of temperature change. The fire begins to burn from cell 1 and gradually spreads to the

right and upper cells. The ignition time at the top layer is relatively late, and the combustion front

speed is slow. Figures 9.64 to 9.66 describe the temperature curves of each cell from the top layer

to the bottom layer. An oil bank forms ahead of the combustion front. The oil phase disappears,

and the gas phase overwhelmingly dominates after the combustion front has passed. The changes of

oil saturation in Figures 9.67 to 9.69 illustrate this feature.

Multi-Perf Injection and Multi-Perf Production

This experiment uses a three-perf injection well. The perforations are located at cell 1, cell 5,

and cell 9, the left side of the reservoir. “[2-D 2]” is the abbreviation of this experiment on figure

captions. Figures 9.70 to 9.75 present the combustion process. Because there are perforations

at the left end of each layer, the combustion speed in the bottom layer does not have the same

advantage as the previous experiment. The ignition time in the middle layer starts early. The

middle layer has no heat loss compared with the top and the bottom layers, so its heat is well

preserved and has a fast burning speed. Although oxygen is injected into all layers, gas with

198 oxygen and steam moves from bottom to top under the effect of gravity. Therefore, the burning rate of the top layer is slightly faster than that of the bottom layer. Figures 9.76 to 9.81 show the temperature and oil saturation curves for reference.

Figure 9.58: [2-D 1] T at 0.001 days. Figure 9.59: [2-D 1] T at 60 days.
Figure 9.60: [2-D 1] T at 120 days. Figure 9.61: [2-D 1] T at 180 days.
Figure 9.62: [2-D 1] T at 240 days. Figure 9.63: [2-D 1] T at 300 days.
Figure 9.64: [2-D 1] T vs. t of top layer.
Figure 9.65: [2-D 1] T vs. t of middle layer.
Figure 9.66: [2-D 1] T vs. t of bottom layer.
Figure 9.67: [2-D 1] So vs. t of top layer.
Figure 9.68: [2-D 1] So vs. t of middle layer.
Figure 9.69: [2-D 1] So vs. t of bottom layer.
Figure 9.70: [2-D 2] T at 0.001 days. Figure 9.71: [2-D 2] T at 60 days.
Figure 9.72: [2-D 2] T at 120 days. Figure 9.73: [2-D 2] T at 180 days.
Figure 9.74: [2-D 2] T at 240 days. Figure 9.75: [2-D 2] T at 300 days.
Figure 9.76: [2-D 2] T vs. t of top layer.
Figure 9.77: [2-D 2] T vs. t of middle layer.
Figure 9.78: [2-D 2] T vs. t of bottom layer.
Figure 9.79: [2-D 2] So vs. t of top layer.
Figure 9.80: [2-D 2] So vs. t of middle layer.
Figure 9.81: [2-D 2] So vs. t of bottom layer.

9.6 3-D Inverted Five-Spot Pattern

An inverted five-spot pattern has four production wells located at the corners of a square and

an injection well sitting at the center. The symmetry of this pattern is usually used to optimize

modeling and reduce calculation cost. Nevertheless, in this experiment, we simulate a complete

inverted five-spot model to verify the 3-D combustion effect. The reservoir model has 7 × 7 × 3 cells with a cell size of 23 ft × 23 ft × 21 ft. The injection well is single-perf, and its perforation is situated at the center of the top layer. The oxygen injection rate is 220 MSCF/day. Each corner has a production well, which has a perforation in each layer. The bottom hole pressure for all production

wells is fixed at 60 psi. The other physical properties are taken from Table 9.4. The tables and

figures in this experiment are named after “[3-D]”.

Figures 9.82 to 9.87 display the 3-D profiles of temperature change. Because oxygen is injected from the top center, the temperature of the entire reservoir begins to increase from here. The combustion front gradually spreads around the center and to the layers below. The slice views in Figures 9.88 to 9.93 present the details of the temperature changes for each layer.

The middle layer has the fastest combustion speed, followed by the top layer and the bottom layer.

Two times, 60 days and 120 days, are selected as samples to study the match between the ISC and STARS results. The temperature changes at 60 days for the top layer, middle layer, and bottom layer are shown in Figures 9.94, 9.95 and 9.96, respectively. The close match verifies that the

ISC simulator has an accurate 3-D combustion calculation function. Figures 9.97 to 9.99 give

the curves of oil saturation at 60 days. Figures 9.100 to 9.105 describe the temperature and oil

saturation distribution at 120 days. We can see a clear combustion front of the middle layer in the temperature in Figure 9.101. Accordingly, an obvious oil bank ahead of the combustion front is found in the oil saturation in Figure 9.104.

Figure 9.82: [3-D] T at 0.001 days. Figure 9.83: [3-D] T at 60 days.
Figure 9.84: [3-D] T at 120 days. Figure 9.85: [3-D] T at 180 days.
Figure 9.86: [3-D] T at 240 days. Figure 9.87: [3-D] T at 300 days.
Figure 9.88: [3-D] Slice view of T at 0.001 days. Figure 9.89: [3-D] Slice view of T at 60 days.
Figure 9.90: [3-D] Slice view of T at 120 days. Figure 9.91: [3-D] Slice view of T at 180 days.
Figure 9.92: [3-D] Slice view of T at 240 days. Figure 9.93: [3-D] Slice view of T at 300 days.
Figure 9.94: [3-D] T at 60 days of top layer.
Figure 9.95: [3-D] T at 60 days of middle layer.
Figure 9.96: [3-D] T at 60 days of bottom layer.
Figure 9.97: [3-D] So at 60 days of top layer.
Figure 9.98: [3-D] So at 60 days of middle layer.
Figure 9.99: [3-D] So at 60 days of bottom layer.
Figure 9.100: [3-D] T at 120 days of top layer.
Figure 9.101: [3-D] T at 120 days of middle layer.
Figure 9.102: [3-D] T at 120 days of bottom layer.
Figure 9.103: [3-D] So at 120 days of top layer.
Figure 9.104: [3-D] So at 120 days of middle layer.
Figure 9.105: [3-D] So at 120 days of bottom layer.

Chapter 10

CONCLUSIONS(2)

In this research, a new in-situ combustion simulator with integrated functions is developed based on the PER method. Because the PER method reduces the influence of phase disappearance and appearance on the mathematical system describing the ISC process, it can reduce the complexity of development. In order to prove that the simulator implemented with the PER method and the one implemented with the VS method have equivalent results in full application, the simulator in this study is developed with comprehensive typical functions of the ISC process, and the numerical results are thoroughly compared and analyzed against those of a benchmark simulator implemented with the VS method.

10.1 Conclusions

The research simulator is developed based on the PER method. CMG STARS, a benchmark commercial thermal simulator implemented with the VS method, is used as the object of comparison. The equivalence of the PER method and the VS method in numerical results is verified over a comprehensive range of cases. The comparison includes a set of numerical experiments: a dry combustion tube, a wet combustion tube, full-scale heat loss, 2-D multi-perf wells, and a 3-D inverted five-spot pattern. The last two experiments are both full-scale with heat loss considered. The significance of this research work is not just to verify the application equivalence between the PER method and the VS method through numerical experiments, but also to provide comprehensive experimental support for popularizing the use of the PER method to develop related simulators, which will effectively reduce the complexity of development. Because an in-situ combustion model integrates the functions of a thermal model and chemical reactions, the conclusions of this research can be directly applied to the development of a thermal recovery simulator. The

specific conclusions of the numerical experiments are drawn below.

1. The in-situ combustion features are tested by a tube test of dry forward combustion. The exper-

imental results of ISC are in good agreement with those of STARS. Besides a combustion front,

the zones of the combustion procedure are identified and analyzed. Because ISC uses the PER

method, the disappearance and appearance processing of the liquid phases are different from

that of STARS. However, effective numerical results are consistent. The ISC process shows the

characteristics of high-temperature reactions, phase transition, and gradual dominance of the

gas phase.

2. As a method to improve the efficiency of heat use, wet combustion is studied by the ISC sim-

ulator. A certain amount of liquid water is injected, following a well schedule, a while after the combustion begins. The higher the amount of water injected, the more heat the steam carries away, and the faster the temperature behind the combustion front drops. The

steam that comes ahead of the combustion front heats the local oil to form a plateau by releasing

energy. More water injection might expand the length of the plateau. This energy transfer not

only preheats the unburned area but also evaporates the light oil, thus improving the recovery.

3. Because of the high-temperature characteristic of an in-situ combustion process, the calculation

of heat loss is essential for a full-scale reservoir. The heat loss is implemented in ISC through a

semi-analytical method. The results with heat loss are compared with those without heat loss.

The case with heat loss has a lower burning peak temperature, a slower front propulsion speed,

and a faster temperature drop behind the combustion front than the case without heat loss.

When the concentration of coke is minimal, differences between numerical treatments become noticeable, so there is a certain deviation between ISC and STARS at this point.

4. Tests of multi-perf wells and multi-dimensional full-scale models are necessary for the ISC simulator. The well implementation is based on Peaceman's model. The

functions of a fixed pressure well, a fixed flow rate well, a multi-perf well, and a well operation

218 schedule are all achieved for the ISC simulator. Two 2-D models and one 3-D inverted five-spot

pattern model, all with heat-loss in a full-scale size, are presented. The accurate result matching

between ISC and STARS verifies that the ISC simulator has real and practical computing power.
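As background for the semi-analytical heat-loss treatment named in conclusion 3, the expressions below sketch the profile form commonly used for cap- and base-rock heat loss in thermal simulators (a Vinsome-Westerveld-type approach); this is an illustrative assumption about the general technique, not a reproduction of the exact equations coded in the ISC simulator. The overburden temperature rise is approximated by a decaying profile away from the reservoir boundary,

T(z, t) = (\theta + p\,z + q\,z^{2})\, e^{-z/d}, \qquad d = \sqrt{\alpha t}/2,

where z is the distance into the cap rock, \theta is the temperature rise at the reservoir/cap-rock interface, \alpha is the cap-rock thermal diffusivity, and p and q are coefficients updated each time step from an energy balance on the heat stored in the cap rock. The heat-loss rate per unit interface area then follows from Fourier's law evaluated at z = 0:

q_{\mathrm{loss}} = -\lambda\, (\partial T/\partial z)\big|_{z=0} = \lambda\,(\theta/d - p),

with \lambda the cap-rock thermal conductivity.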

Because the simulator includes all the typical functions of the ISC process, it can readily serve as a secondary development platform for further research on ISC simulation. Reservoir engineering researchers can add functions that commercial simulators lack or for which they provide no programming interfaces.
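To make the Peaceman well model referenced in conclusion 4 concrete, the following C++ sketch computes a well index for a vertical well completed in a single Cartesian grid block with anisotropic horizontal permeability. It is a minimal illustration of the standard formula under common assumptions; the function name, argument names, and unit conventions are hypothetical and are not taken from the ISC source code.

#include <cmath>

// Minimal sketch of a Peaceman well index for a vertical well in one grid block.
// kx, ky : horizontal permeabilities of the block
// dx, dy : block dimensions perpendicular to the well
// dz     : completed (perforated) thickness
// rw     : wellbore radius, skin : skin factor
double peacemanWellIndex(double kx, double ky,
                         double dx, double dy, double dz,
                         double rw, double skin)
{
    const double pi = 3.14159265358979323846;

    // Peaceman equivalent radius for an anisotropic, rectangular block.
    double a   = std::sqrt(ky / kx);
    double b   = std::sqrt(kx / ky);
    double req = 0.28 * std::sqrt(a * dx * dx + b * dy * dy)
                 / (std::sqrt(a) + std::sqrt(b));

    // Effective permeability seen by radial flow around the well.
    double keff = std::sqrt(kx * ky);

    return 2.0 * pi * keff * dz / (std::log(req / rw) + skin);
}

In a simulator, the returned index would be multiplied by a phase mobility and by the block-to-wellbore pressure difference to obtain the phase flow rate for that perforation.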

10.2 Future Work

The following work could be pursued to make this research simulator more useful.

1. The chemical reaction model tested is a simple scheme composed of four reactions. Although this model can describe the main characteristics of a combustion process, the chemical reactions encountered in practice are often more complicated. Supporting advanced reaction models in the ISC simulator is therefore a necessary extension. Interfaces have been left for such developments, but additional model research and case verification are still required. (An illustrative rate expression of the kind such models build on is sketched after this list.)

2. The ISC simulator is developed using structured block-centered grids, which are the most commonly adopted. However, geologic models in field applications often do not fit this style of gridding well. Lacking flexibility in modeling irregular geometries, structured grids are poorly suited to simulating reservoirs with complex fractures, faults, and curved well trajectories. Support for additional gridding strategies is therefore a topic worth studying to improve simulation accuracy in such cases; corner-point and unstructured grids could provide viable solutions.
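As context for the reaction-model extension discussed in item 1 above, ISC kinetic schemes are typically assembled from Arrhenius-type rate expressions (e.g., for low-temperature oxidation, cracking, and coke burning). The following C++ sketch shows a generic rate of this form; the parameter names and the particular dependence on fuel concentration and oxygen partial pressure are illustrative assumptions, not the four-reaction model used in this thesis.

#include <cmath>

// Generic Arrhenius-type reaction rate of the kind used in thermal/ISC kinetics:
//   rate = A * exp(-E / (R * T)) * concFuel^a * pO2^b
// A : pre-exponential (frequency) factor
// E : activation energy [J/mol], T : absolute temperature [K]
// concFuel, a : fuel concentration and its reaction order
// pO2, b      : oxygen partial pressure and its reaction order
double arrheniusRate(double A, double E, double T,
                     double concFuel, double a,
                     double pO2, double b)
{
    const double R = 8.314;  // universal gas constant [J/(mol*K)]
    return A * std::exp(-E / (R * T))
             * std::pow(concFuel, a)
             * std::pow(pO2, b);
}

A richer reaction scheme would add more terms of this form, each with its own stoichiometry and reaction enthalpy, through the interfaces mentioned above.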
