Implementation and Performance Analysis of a Parallel Oil Reservoir Simulator Tool Using a CG Method on a GPU-Based System
2014 UKSim-AMSS 16th International Conference on Computer Modelling and Simulation

Implementation and Performance Analysis of a Parallel Oil Reservoir Simulator Tool using a CG Method on a GPU-Based System

Leila Ismail (Correspondence Author)
Computer and Software Engineering / HPGCL Research Lab, College of IT, UAE University, Al-Ain, UAE
Email: [email protected]

Jamal Abou-Kassem
Department of Chemical and Petroleum Engineering, College of Engineering, UAE University, Al-Ain, UAE
Email: [email protected]

Bibrak Qamar
HPGCL Research Lab, College of IT, UAE University, Al-Ain, UAE

© 2014 IEEE. DOI 10.1109/UKSim.2014.113

… described in section IV. Our experiments and analysis of results are presented in section V. Finally, section VI concludes our work.

II. OVERALL IMPLEMENTATION OF THE OIL RESERVOIR SIMULATOR

Figure 1 shows the flow chart representing our implementation of the oil reservoir simulator.

Figure 1. Oil Reservoir Simulator Algorithm.

The simulator first reads from an input data file the characteristics of the reservoir to be modeled. The characteristics consist of the following: the size of the reservoir, the number of grid blocks and the dimensions of each grid block, the location of the wells and their production rates, the reservoir rock properties (the porosity and the permeability), and the reservoir fluid properties (the formation volume factor and the viscosity).

The transmissibility for each grid block is then computed using the following formula:

$$T_x = \beta_c \, \frac{A_x K_x}{\mu B \Delta x} \tag{1}$$

The above formula is the transmissibility in the x-direction, where μ is the viscosity, B is the formation volume factor, Δx is the dimension of a grid block in the x-direction, Ax is the area of a grid block (y-dimension * height), Kx is the permeability in the x-direction, and βc is the transmissibility conversion factor.

Based on the values computed for the transmissibility and the production rates of the wells, the simulator then generates the system Ax = d, where A is the coefficient matrix, x is the solution vector for the value of pressure in each grid block of the reservoir, and d is a known vector which represents the right-hand side of the generated system of equations. The system Ax = d is generated using the following formula for each grid block, at the time step n + 1:

$$\sum_{l \in \psi_n} T_{l,n}^{\,n+1} \left[ \left( p_l^{\,n+1} - p_n^{\,n+1} \right) - \gamma_{l,n}^{\,n} \left( Z_l - Z_n \right) \right] + \sum_{l \in \xi_n} q_{sc_{l,n}}^{\,n+1} + q_{sc_n}^{\,n+1} = \frac{V_{b_n}}{\alpha_c \, \Delta t} \left[ \left( \frac{\phi S}{B} \right)_n^{\,n+1} - \left( \frac{\phi S}{B} \right)_n^{\,n} \right] \tag{2}$$

where T is the transmissibility, q_{sc_{l,n}} is the flow from boundary block l to block n, q_{sc_n} is the well production rate at block n, γ is the gravity, Z is the elevation, V_{b_n} is the bulk volume of block n, S is the saturation, φ is the porosity, and Δt is the time step.

In our implementation, A and d are offloaded to the GPU memory, and the solution vector x, which contains the pressure at each grid block, is offloaded only once to the GPU memory, since afterwards x is updated in GPU memory for consecutive simulator time steps. Once A, x and d are in the GPU memory, the simulator launches the solver (CG) on the GPU for the calculation of the pressure for the next time step at each grid block of the reservoir.

When the calculation of the new pressure for each grid block at the GPU level is completed, the vector x is transferred from the GPU memory to the CPU memory to be used to update the matrix A and the vector d, which are used for the next time step.
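This host-device traffic pattern can be summarized in a short sketch. The code below is a minimal illustration, not the authors' implementation: the routines cgSolve and updateSystem are hypothetical placeholders for the GPU CG solver and the host-side rebuild of A and d, and A is assumed to be stored as one flat array.

```cuda
#include <cuda_runtime.h>

// Hypothetical routines: a CG solver running entirely on the GPU, and the
// host-side routine that rebuilds A and d from the newly computed pressures.
void cgSolve(double *d_A, double *d_x, double *d_d);
void updateSystem(double *h_A, double *h_d, const double *h_x);

// Sketch of the per-time-step data movement described above: A and d are
// re-uploaded every time step, while x is uploaded once and then stays
// resident in GPU memory between time steps.
void simulate(double *h_A, double *h_x, double *h_d,
              size_t bytesA, size_t bytesX, int timeSteps) {
    double *d_A, *d_x, *d_d;
    cudaMalloc(&d_A, bytesA);
    cudaMalloc(&d_x, bytesX);
    cudaMalloc(&d_d, bytesX);

    // x is offloaded only once; afterwards it is updated in GPU memory.
    cudaMemcpy(d_x, h_x, bytesX, cudaMemcpyHostToDevice);

    for (int t = 0; t < timeSteps; ++t) {
        // A and d change every time step (transmissibilities, well rates).
        cudaMemcpy(d_A, h_A, bytesA, cudaMemcpyHostToDevice);
        cudaMemcpy(d_d, h_d, bytesX, cudaMemcpyHostToDevice);

        cgSolve(d_A, d_x, d_d);  // pressures for the next time step

        // Bring the new pressures back to update A and d on the CPU.
        cudaMemcpy(h_x, d_x, bytesX, cudaMemcpyDeviceToHost);
        updateSystem(h_A, h_d, h_x);
    }

    cudaFree(d_A); cudaFree(d_x); cudaFree(d_d);
}
```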
III. RELATED WORKS

A. Prior Implementations of Conjugate Gradient

Several algorithms have been published to parallelize CG [21], [22], [12], [15]. In references [21] and [22], algorithms have been implemented on a specialized event-driven multi-threaded platform. In references [12] and [15], algorithms have been implemented on a distributed shared-memory cluster. Reference [23] introduces a parallel algorithm for matrix-vector multiplication, considers a particular matrix with regular sparsity, and studies the impact of mesh partitioning on performance. References [26] and [27] introduce data decomposition strategies for matrix-vector multiplications to increase CG efficiency on hypercube and mesh networks for unstructured sparse matrices. Blocks of the matrix are assigned to processors to compute partial results of the matrix-vector multiplication, and the non-zeros are transferred using storage techniques that pack the non-zeros into a vector. They concluded that the broadcast-based approach does not scale with an increasing number of processors. By using an overlap mechanism for global summation to hide the communication cost involved in the parallel CG, a speedup of 2.5 was obtained on 128 processors compared to the original NAS parallel benchmark [19]. References [28] and [29] divide the overall CG algorithm into blocks of algorithms to reduce communication, but did not address the cost of communication among the blocks. Jordan et al. [24] reported experiments where matrix-vector multiplication loads are distributed based on processor speeds in a heterogeneous cluster, using a dense matrix. The parallel algorithm suggested by Kim et al. [11] explored the parallel CG with both Jacobi diagonal preconditioning and incomplete Cholesky factorization preconditioning on a number of different meshfree analysis applications. They investigated the parallel performance of the solver using a homogeneous cluster and obtained a speedup of 12 compared to sequential execution in both cases, using 12 processors.

B. Sparse Matrix-Vector Multiplication

Sparse matrices arise in different problems of a scientific and engineering nature. They can range from very well structured to highly irregular, where the non-zero values in different rows do not follow a distribution pattern. Methods have been developed for storing the non-zero values of a sparse matrix so as to reduce memory storage requirements.

There is a large spectrum of sparse matrix representations, each with different storage requirements, computational characteristics, and memory-access methods. Among the storage formats that have been introduced, the DIA (diagonal) format was found to be suited for GPU implementations when the non-zero values are restricted to a small number of matrix diagonals [13], which is the case for the matrix generated by an oil reservoir simulator. However, the downside of DIA is that it allocates storage for values which are not part of the matrix and explicitly stores zero values in the occupied diagonals. We then have to introduce extra information to differentiate between a zero value which comes from a diagonal and a zero value which comes from the DIA mechanism. The coordinate format and compressed sparse row formats, such as CSR (Compressed Storage Row), used for instance in reference [27], are general-purpose formats for storing the positions and the data of the non-zeros [13]. In this work, we use the DIA format for the GPU parallel implementation, as the coefficient matrices generated for 1-D, 2-D and 3-D oil reservoirs are diagonals-based structured matrices with few zeros in the diagonals. However, we use the CSR format for the sequential CPU implementation, as it gives better performance on the CPU than on the GPU [13].

In DIA, every matrix is represented by two arrays: an array which stores the non-zero values of the diagonals and an array which stores the offset of each diagonal from the main diagonal. By convention, the offset of the main diagonal is 0, while the offset of the k-th super-diagonal is k and that of the k-th sub-diagonal is -k. As the row and column indices of the non-zero entries are not stored explicitly, they are defined implicitly by the position within a diagonal and the offset between diagonals. The DIA format reduces the memory size for storing the matrix, decreases the amount of memory transfers, and provides coalesced memory access during sparse matrix-vector multiplication (SpMV) [13]. This makes the DIA format appropriate in our case, as GPUs have relatively little memory and perform poorly when memory is accessed irregularly [14]. A minimal sketch of a DIA-based SpMV kernel is shown below.
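The kernel below is a minimal sketch of SpMV for a DIA-stored matrix, in the spirit of the kernels discussed in [13]; it is an illustration, not the paper's implementation. The names (dia_spmv, offsets, diags) are our own, and the layout assumes each of the num_diags diagonals is padded to the matrix dimension n and stored contiguously.

```cuda
// Sketch of y = A * x for a matrix stored in DIA format. diags holds
// num_diags diagonals, each of length n; offsets[d] is the offset of
// diagonal d from the main diagonal (0 = main, +k = k-th super-diagonal,
// -k = k-th sub-diagonal).
__global__ void dia_spmv(int n, int num_diags,
                         const int *offsets, const double *diags,
                         const double *x, double *y) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= n) return;

    double sum = 0.0;
    for (int d = 0; d < num_diags; ++d) {
        int col = row + offsets[d];  // column index is implicit in DIA
        if (col >= 0 && col < n) {
            // Consecutive threads (rows) read consecutive elements of each
            // diagonal, which is what makes the accesses coalesced.
            sum += diags[(size_t)d * n + row] * x[col];
        }
    }
    y[row] = sum;
}
```

For a reservoir of N grid blocks, a typical launch would use a 1-D grid, e.g. dia_spmv<<<(N + 255) / 256, 256>>>(N, 7, offsets, diags, x, y) for the heptadiagonal matrices described in Section IV.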
IV. PROGRAMMING MODEL

A. CG Algorithm

As shown in Figure 5, the CG method starts with a random initial guess of the solution x0 (step 1). Then, it proceeds by generating vector sequences of iterates, i.e., successive approximations to the solution (step 10), residuals corresponding to the iterates (step 11), and search directions used in updating the iterates and residuals (step 14). Although the length of these sequences can become large, only a small number of vectors needs to be kept in memory. In every iteration of the method, two inner products (in steps 9 and 13) are performed in order to compute the update scalars (steps 9 and 13) that are defined to make the sequences satisfy certain orthogonality conditions. On a symmetric and positive definite linear system, these conditions imply that the distance to the true solution is minimized in some norm (step 12).

B. Matrix Structure

Figure 2 shows a computational mesh for a discretized 3-D oil reservoir, where Nx is the number of cells of the oil reservoir in the x direction, Ny is the number of cells of the reservoir in the y direction, and Nz is the number of cells of the reservoir in the z direction. N = NxNyNz is the generated matrix size. The mesh reveals the way the unknown vector is composed. By numbering the unknowns, the linear system of equations resulting from the discretization is represented by a heptadiagonal structured sparse matrix. A 2-D computational mesh is a simplified version of a 3-D computational mesh, where only Nx and Ny are considered to produce the matrix. The correspondence between this heptadiagonal structure and the DIA offsets array is sketched below.
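For a natural ordering of the unknowns in which the x index varies fastest, then y, then z, each grid block couples to itself and to its two neighbours in each direction, so the seven diagonals sit at fixed offsets from the main diagonal. The helper below is a hypothetical illustration of how this structure maps onto the DIA offsets array of Section III-B; the x-fastest numbering is our assumption, not stated in the paper.

```cuda
// Hypothetical helper: DIA offsets of the heptadiagonal matrix produced by
// an Nx * Ny * Nz reservoir whose blocks are numbered x-first, then y, then z.
void hepta_offsets(int Nx, int Ny, int offsets[7]) {
    offsets[0] = -Nx * Ny;  // neighbour below in z
    offsets[1] = -Nx;       // neighbour behind in y
    offsets[2] = -1;        // neighbour left in x
    offsets[3] = 0;         // main diagonal (the block itself)
    offsets[4] = +1;        // neighbour right in x
    offsets[5] = +Nx;       // neighbour ahead in y
    offsets[6] = +Nx * Ny;  // neighbour above in z
}
```

Under the same numbering, a 2-D reservoir drops the ±NxNy diagonals and yields a pentadiagonal matrix, and a 1-D reservoir yields a tridiagonal one, which is why the DIA format fits all three cases with few zeros in the diagonals.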