
Article GPU Acceleration of Hydraulic Transient Simulations of Large-Scale Water Supply Systems

Wanwan Meng 1, Yongguang Cheng 1,*, Jiayang Wu 1,2, Zhiyan Yang 1, Yunxian Zhu 3 and Shuai Shang 4

1 State Key Laboratory of Water Resources and Hydropower Engineering Science, Wuhan University, Wuhan 430072, China; [email protected] (W.M.); [email protected] (J.W.); [email protected] (Z.Y.)
2 Ministerial Key Lab of Transients, Ministry of Education, Wuhan University, Wuhan 430072, China
3 Construction Management Company for Chushandian Reservoir Project of Henan Province, Zhengzhou 450000, China; [email protected]
4 Zhangfeng Water Conservancy Management Company Ltd., Qinshui 048000, China; [email protected]
* Correspondence: [email protected]; Tel.: +86-139-7138-8746

 Received: 28 November 2018; Accepted: 23 December 2018; Published: 27 December 2018 

Abstract: Simulating hydraulic transients in ultra-long water (oil, gas) transmission or large-scale distribution systems is time-consuming, and exploring ways to improve simulation efficiency is an essential research direction. The parallel implementation of the method of characteristics (MOC) on graphics processing unit (GPU) chips is a promising approach for accelerating the simulations, because the GPU has a great capacity for parallelizing massive but simple computations, and the explicit and local features of MOC match the architecture of the GPU quite well. In this paper, we propose and verify a GPU implementation of MOC on a single chip for more efficient simulations of hydraulic transients. Details of the GPU-MOC parallel strategies are introduced, and the accuracy and efficiency of the proposed method are verified by simulating the benchmark single-pipe water hammer problem. The transient processes of a large-scale water distribution system and a long-distance water transmission system are simulated to investigate the computing capability of the proposed method. The results show that the GPU-MOC method can achieve significant performance gains, with speedup ratios of up to hundreds compared to the traditional method. This preliminary work demonstrates that GPU-MOC parallel computing has great prospects in practical applications with large computing loads.

Keywords: graphics processing unit (GPU); method of characteristics; hydraulic transients; large-scale water supply system; parallel computing; speedup ratio

1. Introduction

Hydraulic transients are the fluctuations of pressure and flow in water (oil, gas) transmission or distribution systems caused by accidents or operations, and their numerical simulation is essential for the design, operation, and management of engineering projects [1–3]. The existing simulation methods of hydraulic transients are adequate for many practical applications in terms of accuracy and efficiency [4], but in some special cases, simulation efficiency must be improved to better meet demands. First, more and more water transmission systems longer than hundreds of kilometers (e.g., the Da Huofang project and the Liaoning northwest water supply project in China are 354 km and 598.4 km long, respectively) and water distribution networks containing thousands of pipes, loops,

Appl. Sci. 2019, 9, 91; doi:10.3390/app9010091 www.mdpi.com/journal/applsci

and branches (e.g., the networks in Xiongan New Capital City and Wuhan New Yangtze River City, China) are newly built or planned, for which hydraulic transient simulations are very time-consuming, normally needing several hours to calculate one working condition. Second, some of these long-distance systems are equipped with real-time monitoring and need online transient simulations for instant decision-making on operations, for which the existing methods are not competent [5]. Third, in the design and optimization of many complex pipe systems, a large number of layouts and working conditions must be simulated in a limited time [6]. Therefore, it is necessary to find more efficient simulation methods for hydraulic transients.

Two approaches to improving the computing efficiency of hydraulic transient simulations have already been explored. The first is to adopt more efficient algorithms. Wood et al. [7,8] proposed the wave characteristic method (WCM) to efficiently simulate the transient processes of water distribution systems. Boulos et al. [9] and Jung et al. [10] further proved that WCM is more efficient than the widely-used method of characteristics (MOC) because of its larger time-step interval and exemption from interior-section computation. The second approach is to optimize or parallelize computing codes. Izquierdo and Iglesias [11,12] established a sub-procedure library for general hydraulic boundaries, which were invoked directly in simulations to avoid tiny segmentation and reduce resource occupancy. Li et al. [13] used a parallel genetic algorithm (PGA) in transient simulations and obtained about a tenfold speedup compared to the serial genetic algorithm. Martinez et al. [14] used an optimized genetic algorithm to redesign existing water distribution networks that do not operate properly. Fan et al.
[15] parallelized two computers to simulate the transients of several large hydraulic systems, and achieved an efficiency of 1.442 times that of a single computer. However, these efficiency improvements are far from meeting practical needs; therefore, new approaches should be explored.

The graphics processing unit (GPU) has shown powerful capability in non-graphical computations, and scientists started using it to conduct large-scale scientific calculations at the beginning of the 21st century [16]. Many successful GPU applications in fluid dynamics calculations have shown quite promising performance gains. Wu et al. [17] accelerated simulations of fluid-structure interaction (FSI) by implementing the immersed boundary-lattice Boltzmann coupling scheme on a single GPU chip, and the attained memory efficiencies for the kernels were up to 61%, 68%, and 66% for the testing case. Bonelli et al. [18] accelerated simulations of high-enthalpy flows by implementing the state-to-state approach on a single GPU chip; the comparison between GPU and CPU showed a speedup ratio of up to 156, and the speedup ratio increased with the problem size and complexity. Zhang et al. [19] implemented a multiple-relaxation-time lattice Boltzmann model (MRT-LBM) on a single GPU, and found that the GPU-implemented MRT-LBM on a fine mesh can efficiently simulate two-dimensional shallow water flows. Griebel et al. [20] simulated three-dimensional multiphase flows on multiple GPUs of a distributed-memory cluster, and the achieved speedup ratios on eight GPUs were 115.8 and 53.7 compared to a single CPU. The supercomputing power of the GPU comes from its thousands of cores [21]; for example, the core count of a single GTX Titan Z is up to 5760, and GPU core counts will continue to grow as the technology develops. Therefore, GPUs show broad prospects for accelerating computations.
Unfortunately, the use of GPUs is sometimes hampered by the fact that most scientific codes written in the past must be ported to the GPU architecture. Efforts in this direction are needed; an important example is the porting of a sophisticated CFD code for compressible flows, originally developed in Fortran and described in [22]. Implementing MOC on GPUs is another effort of this kind. By analyzing the explicit and local characteristics of MOC, we find that MOC has good data independence, which makes it very suitable for implementation on GPU architectures [23]. To our knowledge, there is no existing report on parallelizing MOC computations on GPUs to accelerate hydraulic transient simulations. In this paper, we propose and verify a GPU implementation of MOC (GPU-MOC), and apply it to transient simulations of practical systems.

2. Implementation of the GPU-MOC Method

The essence of GPU acceleration is that it allows thousands of threads to execute in parallel. To be specific, a GPU graphics card is made up of a series of streaming multiprocessors (SMs), and it allows thousands of threads to be executed at the same time [24]. Taking the TITAN X chip as an example, it is composed of 24 SMs, and each SM can host at most 2048 resident threads. Therefore, a single GPU card can run up to 49,152 threads at the same time.
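The resident-thread figure above is simple arithmetic, sketched here as a small helper (the function name is ours; the numbers are those quoted in the text for a TITAN X-class chip):

```c
#include <assert.h>

/* Maximum number of threads that can reside on the chip at once:
   (number of SMs) x (maximum resident threads per SM). */
int max_resident_threads(int num_sm, int threads_per_sm)
{
    return num_sm * threads_per_sm;
}
```

With the TITAN X figures, `max_resident_threads(24, 2048)` gives 49,152.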

2.1. Characteristics of a GPU

A distinguishing feature of a GPU is that the more threads it runs, the higher the efficiency it can achieve. The reason can be ascribed to the structure and instruction scheduling of the Compute Unified Device Architecture (CUDA). Thread switching is a zero-cost, fine-grained operation on a GPU. That is, when a warp (the basic execution unit) in one block has to wait for off-chip memory access or a synchronization instruction, a warp in another block that is in the ready state executes its calculation immediately. In this way, computational latency is efficiently hidden and a high throughput is achieved.

2.2. Parallel Characteristics of MOC

The basic equations for unsteady pipe flow are a pair of hyperbolic partial differential equations, as follows.

Continuity equation:

∂H/∂t + (a²/(gA)) ∂Q/∂x = 0 (1)

Momentum equation:

∂Q/∂t + gA ∂H/∂x + fQ|Q|/(2DA) = 0 (2)

where H(x,t) is the piezometric head; Q(x,t) is the discharge; and D, A, a, g, and f are the pipe diameter, the cross-sectional area of the pipe, the wave speed, the acceleration of gravity, and the Darcy-Weisbach friction factor, respectively. The continuity and momentum equations are transformed by the method of characteristics [25] into two pairs of ordinary differential equations, grouped and identified as C+ and C−, Equations (3) and (4):

C+: dH/dt + (a/(gA)) dQ/dt + (a/g) fQ|Q|/(2DA²) = 0, valid along dx/dt = +a (3)

C−: −dH/dt + (a/(gA)) dQ/dt + (a/g) fQ|Q|/(2DA²) = 0, valid along dx/dt = −a (4)

and the final formulas can be expressed as Equations (5) and (6):

QP = (Cn − Cm)/(Bn + Bm) (5)

HP = (Cn Bm + Cm Bn)/(Bn + Bm) (6)

where QL, QR, QP, HL, HR, and HP are the discharges and heads of sections L, R, and P, in which the discharges and heads of sections L and R are known; the coefficients Cn = HL + B QL, Bn = B + R|QL|, Cm = HR − B QR, Bm = B + R|QR|, R = f∆x/(2gDA²), and B = a/(gA) are known constants; and ∆x is the space interval. As shown in the Cartesian mesh in the x-t plane (Figure 1), through Equations (5) and (6), the values at section P can be calculated from those at sections L and R. The values at the other sections at the same time level as P can be solved by the same process.
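As a concrete illustration, Equations (5) and (6) can be wrapped in a small routine. This is a sketch with names of our own choosing (`Section`, `moc_interior`), not the paper's actual code; the constants B and R are assumed to be precomputed as defined above:

```c
#include <assert.h>
#include <math.h>

/* Discharge and head at one MOC section. */
typedef struct { double Q; double H; } Section;

/* Interior-section update by Equations (5) and (6): given the known
   sections L and R at the previous time level and the constants
   B = a/(gA) and R = f*dx/(2*g*D*A^2), return the new section P. */
static Section moc_interior(Section sL, Section sR, double B, double R)
{
    double Cn = sL.H + B * sL.Q;            /* C+ characteristic constant */
    double Bn = B + R * fabs(sL.Q);
    double Cm = sR.H - B * sR.Q;            /* C- characteristic constant */
    double Bm = B + R * fabs(sR.Q);
    Section P;
    P.Q = (Cn - Cm) / (Bn + Bm);            /* Equation (5) */
    P.H = (Cn * Bm + Cm * Bn) / (Bn + Bm);  /* Equation (6) */
    return P;
}
```

A quick sanity check: with friction neglected (R = 0) and a uniform steady state (HL = HR, QL = QR), the update returns the same discharge and head, as it should.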

The solution process described above reflects the explicit and local features of MOC, which are the most critical points for parallelizing MOC on a GPU. The explicit feature means that the values at section P at the unknown time level are related only to the values at sections L and R at the previous time level, and have nothing to do with the values at other sections Pi at the same time level as P. The local feature means that the values of each section are connected only with the values of its adjacent sections. These features match the multi-thread parallel characteristics of the GPU quite well. Therefore, MOC can be conveniently built on a GPU-based CUDA architecture platform, which may take full advantage of the GPU (its strong floating-point computing ability) to compensate for the drawbacks of MOC.
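This data independence is visible in a serial sweep over one time step: each output value reads only the two neighbouring inputs from the previous time level, so the loop iterations could all run concurrently, one GPU thread per section. The sketch below is our own frictionless (R = 0) simplification, with hypothetical names:

```c
#include <assert.h>

/* One MOC time step over the interior sections, frictionless case (R = 0):
   the new values (fq, fh) at section i depend only on the previous-level
   values (q, h) at sections i-1 and i+1, never on other new values. */
static void moc_step_interior(int n, const double *q, const double *h,
                              double *fq, double *fh, double B)
{
    for (int i = 1; i < n - 1; ++i) {
        double Cn = h[i - 1] + B * q[i - 1];   /* from section L = i-1 */
        double Cm = h[i + 1] - B * q[i + 1];   /* from section R = i+1 */
        fq[i] = (Cn - Cm) / (2.0 * B);         /* Eq. (5) with Bn = Bm = B */
        fh[i] = (Cn + Cm) / 2.0;               /* Eq. (6) with Bn = Bm = B */
    }
}
```

Because no iteration reads another iteration's output, the loop body maps directly onto one thread per section on the GPU.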


Figure 1. Schematic of the parallel strategy of the GPU-MOC method: the CPU code handles logical operations, the Cartesian mesh in the x-t plane for MOC, and the thread organization structure of the GPU.

2.3. Implementation of MOC on a GPU
The basic GPU-MOC parallel strategy is that the logical operations are conducted on the CPU and the floating-point operations are executed on the GPU. The logical operations, such as initializing data, allocating memory for arrays, transferring data between the host and the device, and invoking MOC kernels, are executed on the CPU, because the CPU has many logical processing units and is efficient at complex logical operations. The floating-point operations, including calculating the present time-level values and updating them into the previous time-level values, are conducted on the GPU, because the GPU is built around many arithmetic pipelines and is efficient at massive simple floating-point operations.

2.3.1. MOC Kernel on GPU

The MOC kernel, executed on the GPU, mainly calculates the values of all MOC sections. The execution configuration and parameters for the MOC kernel are outlined in Figure 1, in which the thread blocks and the thread grid are organized in a 1D linear form that fits the line structure of a pipeline well. The thread grid includes Y+1 thread blocks, and each thread block contains Z+1 threads. Each thread is allocated to one MOC section; thus, the total number of threads is equal to the total number of MOC sections. During the calculation, threads are located by the thread index tx, which is computed within the kernel from the built-in threadIdx.x and blockIdx.x variables. Threads with consecutive indices access global memory simultaneously to calculate the corresponding MOC sections. The problem of non-aligned access is negligible because the information of adjacent MOC sections such as L and R is stored in linear order; thus, shared memory may not bring an obvious improvement in computing efficiency. In Appendix A, an example of the MOC kernel for direct water hammer is given.

When deciding the threading arrangement, great care should be taken with the block size (the number of threads per block), which may affect computing efficiency to some degree. The shared memory and register resources of each block are limited, as are the number of threads per block and the number of blocks per SM, so too large or too small a block size may reduce the number of resident threads in each SM. Therefore, a reasonable choice of block size is important for achieving high computing efficiency.
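One detail such a launch configuration needs is the grid size: enough blocks of the chosen block size to cover every MOC section, with any leftover threads masked off by a bounds check inside the kernel. A sketch of that arithmetic (the helper name is ours):

```c
#include <assert.h>

/* Number of thread blocks needed so that blocks * block_size
   covers all MOC sections (ceiling division). */
int blocks_for(int n_sections, int block_size)
{
    return (n_sections + block_size - 1) / block_size;
}
```

For example, covering 9014 sections with a block size of 256 takes 36 blocks (9216 threads, the last 202 of which do no work).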

2.3.2. Implementation Procedures on CPU

GPUs and CPUs maintain their own dynamic random access memory (DRAM), and data transmission between GPU memory and CPU memory is the bridge between the two. In order to simplify the CPU code, unified memory is adopted: because unified memory can be accessed from both the CPU and the GPU, the explicit data-transfer code can be omitted. Although the transfer code is omitted, the transfers themselves still take place, so unified memory by itself does not improve computing efficiency. The data transmission process usually includes three steps: first, copy the input data from CPU memory to GPU memory; second, load the GPU code and execute it; and third, copy the output results from GPU memory back to CPU memory. It is worth mentioning that data transmission efficiency is a determining factor that limits the computational efficiency of the GPU, and many optimization strategies aim at accelerating these transfers or reducing the data transfer load.

Two sets of 1D arrays are used to store the discharge and head variables of the MOC sections: one stores the input data and the other stores the output data. Initially, the input data are stored in arrays q[tx] and h[tx], and the output data are stored in arrays fq[tx] and fh[tx]. After the calculation of one time step, the output data are used as the input data of the next time step; the input data are then stored in fq[tx] and fh[tx], and the output data are stored in q[tx] and h[tx] instead. As a result, the MOC kernel has to be invoked alternately according to the parity of the step count, to switch the arrays that store the input and output data. The synchronization function cudaDeviceSynchronize() is used to make sure all computations of the current time step are completed before the next time step starts. In general, the implementation procedures of the CPU code are as follows.

(1) Allocate unified memory, initialize the two sets of arrays, and copy the input data from the CPU memory to the GPU memory.
(2) Launch the MOC kernel for parallel computing.
(3) Synchronize the threads.
(4) Write the output data back from the GPU memory to the CPU memory, and set them as the input data for the next time step.
(5) Repeat steps (2)-(4) until the calculation for the last time step finishes.
(6) Output the results.
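The parity-based alternation in steps (2)-(5) can be sketched in plain C. A stand-in "kernel" replaces the MOC update so that the buffer swapping itself is visible; the names are ours, and on the GPU each launch would be followed by cudaDeviceSynchronize():

```c
#include <assert.h>

/* Stand-in "kernel": writes out[i] = in[i] + 1 for every section,
   taking the place of the real MOC update of one time step. */
static void kernel_stub(int n, const double *in, double *out)
{
    for (int i = 0; i < n; ++i) out[i] = in[i] + 1.0;
}

/* Parity-based ping-pong: even steps read q and write fq, odd steps
   read fq and write q, so no data are copied between time steps. */
static void run_steps(int n, int nsteps, double *q, double *fq)
{
    for (int j = 0; j < nsteps; ++j) {
        if (j % 2 == 0) kernel_stub(n, q, fq);
        else            kernel_stub(n, fq, q);
        /* on the GPU: cudaDeviceSynchronize() would go here */
    }
}
```

After an odd number of steps the latest values sit in fq; after an even number, in q — exactly the role-swapping the text describes.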

In Appendix B, the details of the CPU code for direct water hammer are given.

3. Verification of GPU-MOC by Benchmark Cases

Benchmark water hammer cases in a simple reservoir-single pipe-valve system are simulated to verify the performance of GPU-MOC. The water level of the reservoir is constant, and the hydraulic transients are caused by operations of the end valve. Four benchmark water hammer phenomena, including direct water hammer, first-phase water hammer, last-phase water hammer, and negative water hammer, are simulated. The CPU-MOC program is compiled in standard C with Microsoft Visual Studio 2010 (Microsoft Corporation: Redmond, WA, USA), and the GPU-MOC program is compiled in a combination of standard C and CUDA C (an extension of standard C) with Visual Studio 2010 and NVIDIA CUDA 8.0 (Nvidia Corporation: Santa Clara, CA, USA). The results of GPU-MOC and CPU-MOC show quite minor differences and agree well with the pressure variation laws of the four water hammer phenomena (Figure 2), which verifies the correctness of the proposed method.

Figure 2. Pressure curves at the valve for the GPU-MOC and CPU-MOC methods: (a) first-phase water hammer; (b) direct water hammer; (c) last-phase water hammer; and (d) negative water hammer. H denotes head and t denotes time.

Here, the computing efficiency of the direct water hammer simulation is analyzed. Since the efficiency of the GPU-MOC strategy is only sensitive to the number of calculations, it is assumed that the pipe is long enough to be partitioned into sufficient calculation sections. Three cases are simulated, in which the total section numbers are 2^14, 2^16, and 2^18, respectively. Three scenarios of block size (number of threads per block), 64, 128, and 256, are included in each case for the GPU-MOC method to find out the effect of block size on computing efficiency. The transient behavior of 20 s is simulated using single precision. All the calculations are conducted on a computer configured with one CPU (Intel Xeon E5-2620 v3, 2.4 GHz, 12 cores; Intel Corporation: Santa Clara, CA, USA) and one graphics card (NVIDIA GeForce GTX TITAN X; Nvidia Corporation: Santa Clara, CA, USA), running the Windows 10 64-bit operating system (Microsoft Corporation: Redmond, WA, USA). Moreover, all the simulations are run at least several times to avoid the effects of machine temperature variation and possible frequency down-stepping of the GPU, and each simulation starts only after the GPU has cooled down to room temperature.

The performances of GPU-MOC are much better than those of CPU-MOC. As shown in Figure 3, the maximum performance of GPU-MOC is up to 1760.7 MSUPS (millions of section updates per second), while that of CPU-MOC is only 7.5 MSUPS. The reason is that the GPU executes in parallel while the CPU executes serially. Moreover, the performances of GPU-MOC improve as the section numbers increase, because sufficient active warps in the chip can effectively hide the computational latency and achieve a high throughput.

The same trend can be found in Table 1, in which the speedup ratios (the ratios of the number of section updates per second of GPU-MOC to those of CPU-MOC) increase as the section numbers increase, ranging from 36 to 236.6 when the section numbers increase from 2^14 to 2^18. Moreover, for each case, GPU-MOC achieves its best performance when the block size is 256, because a block size of 256 makes full use of the device processing capacity. Therefore, a block size of 256 is adopted in the following application cases.
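The MSUPS figures quoted here follow directly from the section count, the number of time steps, and the wall-clock time; a small helper (our own naming) makes the metric explicit:

```c
#include <assert.h>
#include <math.h>

/* Millions of section updates per second:
   (sections per step) * (time steps) / elapsed seconds / 1e6. */
double msups(double sections, double steps, double seconds)
{
    return sections * steps / seconds / 1.0e6;
}
```

For example, updating 2^18 = 262,144 sections for 1000 time steps in one second of wall-clock time corresponds to about 262.1 MSUPS.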

Figure 3. Performances of CPU-MOC at different section numbers, and of GPU-MOC at different section numbers with different block sizes.

Table 1. Speedup ratios of GPU-MOC to CPU-MOC.

Section Numbers   GPU-MOC with Different Block Sizes
                  64        128       256
2^14              36        36.2      36.3
2^16              109.2     110.1     111.9
2^18              234.6     235.5     236.6

4. Application of GPU-MOC to a Water Distribution Network System

The transient process of an actual water distribution network system in Karney's work [26] is simulated. The length of the original system (Figure 4), namely Case 1, is about 9 km, with seven nodes along the pipeline, a surge tank at node 3, a relief valve at node 6, and a control valve at node 7 (partially closed with τ = 0.6 initially). The transient phenomena are caused by closing the control valve from opening τ = 0.6 to τ = 0.2 within 10 s. The relief valve is initially closed, but it is set to open linearly within 3 s once the pressure exceeds 210 m and then close linearly within 60 s. Details of the other parameters can be found in [26].

As shown in Figure 5, the results of GPU-MOC and Karney's work at nodes 3, 6, and 7 show quite minor differences. Thus, the GPU-MOC method is reliable. The high pressure waves initiated at node 7 by the closing valve propagate throughout the system. At t = 3 s water starts to flow into the surge tank and makes the tank water level rise gradually, and the pressure at junction 3 fluctuates in a relatively narrow range. The pressure at the relief valve exceeds 210 m at t = 6.4 s.
When the valve opens, the release of fluid at that location reduces the pressure. The water discharged through the relief valve at node 6 and the water level change in the surge tank at node 3 attenuate the transient pressure and allow the system to reach steady state quickly.

Figure 4. Sketch map of the simple hydraulic network. The network connects Reservoir 1 and Reservoir 2 through seven nodes, with a surge tank at node 3, a relief valve at node 6 (flow Qout2), a control valve at node 7, and demands Qout1 = 2.0 m³/s, Qout3 = 1.0 m³/s, and Qout4 = 2.028 m³/s. The conduit lengths (m) for the three cases are:

Pipe Number   Case 1   Case 2   Case 3
L1 (1 to 2)   1001.2   10,012   100,120
L2 (2 to 3)   2000     20,000   200,000
L3 (3 to 4)   2000     20,000   200,000
L4 (3 to 5)   502.5    5025     50,250
L5 (6 to 5)   502.5    5025     50,250
L6 (2 to 6)   1001.2   10,012   100,120
L7 (6 to 7)   2000.2   20,002   200,020

Figure 5. Comparison of head and discharge at the surge tank (Node 3), pressure relief valve (Node 6), and control valve (Node 7) with data from Karney [26].

As shown in Figure 5, the results of GPU-MOC and of Karney's work [26] at nodes 3, 6, and 7 show quite minor differences. Thus, the GPU-MOC method is reliable. The high-pressure waves initiated at node 7 by the closing valve propagate throughout the system. At t = 3 s, water starts to flow into the surge tank, making the tank water level rise gradually, and the pressure at junction 3 fluctuates in a relatively narrow range. The pressure at the relief valve exceeds 210 m at t = 6.4 s. When the valve opens, the release of water at that location reduces the pressure. The outflow through the relief valve at node 6 and the water-level change in the surge tank at node 3 attenuate the transient pressure and allow the system to reach a steady state quickly.

To demonstrate the strong computing capability of GPU-MOC in computing-intensive problems, the pipe lengths of the above Case 1 are increased tenfold and a hundredfold, giving Case 2 and Case 3 (Figure 4), respectively. The pipe segment lengths ∆x of Case 1, Case 2, and Case 3 are all set to 1 m, and the corresponding section numbers are 9014, 90,077, and 900,767, respectively. The first 60 s of transient behavior of the system caused by operating the control valve at node 7 is simulated. The block size is set to 256.
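To make the performance figures that follow concrete: MSUPS is taken here to mean million section updates per second, i.e. the total number of section updates (sections × time steps) per second of wall-clock time, and the speedup ratio is the quotient of CPU and GPU wall-clock times. A minimal C sketch of these two metrics (the function names are illustrative and not from the original code):

```c
/* MSUPS: million section updates per second, i.e. the total number of
   section updates (sections x time steps) divided by the wall-clock
   time in seconds, in millions. */
double msups(long sections, long steps, double seconds)
{
    return (double)sections * (double)steps / seconds / 1.0e6;
}

/* Speedup ratio of a slower method over a faster one, from the two
   wall-clock times of the same simulation. */
double speedup(double t_slow, double t_fast)
{
    return t_slow / t_fast;
}
```

With the Case 1 times of Table 2 (191.6 s by CPU-MOC versus 9.8 s by GPU-MOC), `speedup` gives about 19.55, consistent with the 19.5 reported there.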

The performance of GPU-MOC is much better than that of CPU-MOC (Figure 6). The maximum performance of GPU-MOC is up to 568 MSUPS, whereas that of CPU-MOC is only 3.6 MSUPS. In addition, the performance of GPU-MOC improves significantly as the scale of the system becomes larger: for Case 3 it reaches 568 MSUPS, but for Case 1 only 55 MSUPS. That is to say, the computing efficiency of Case 3 is more than 10 times that of Case 1.

Figure 6. Performances of GPU-MOC and CPU-MOC for different section numbers.

The consumed time of GPU-MOC is far less than that of CPU-MOC (Table 2). In Case 3 especially, the consumed time of GPU-MOC is less than two minutes, whereas that of CPU-MOC is more than four hours, and the achieved speedup ratio is up to 158. Therefore, finer calculation sections can be used to simulate large-scale problems with the GPU-MOC method. Note that the consumed time only includes the time cost of the floating-point calculation; it does not include the time spent preparing the initial data.

Table 2. Consumed time and speedup ratios of the two methods.

Cases                   Case 1    Case 2    Case 3
Time of GPU-MOC (s)     9.8       14.4      95.2
Time of CPU-MOC (s)     191.6     1761.4    15,039.2
Speedup ratio           19.5      122.2     158.0

5. Application of GPU-MOC to a Long-Distance Water Transmission System

The transient process of a long-distance water transmission project is simulated, in which the pumping part is taken from an example in the book [27] by Chaudhry. We modified it by adding a long trapezoidal channel between the upstream reservoir and the pump station (Figure 7). The parameters of the project are shown in Table 3. The discharge of the two parallel pumps is lumped together, and the suction lines, being short, are not included in the analysis. Pump trip is considered to be the cause of the transients, and there is no valve action.

Figure 7. Notation for the combined open-channel and pipe system (channel length L1 = 200 km between Reservoir 1 and the parallel pumps; pipe length L2 = 1.0 km between the pumps and Reservoir 2).

Table 3. Computation parameters.

Pipe: length = 1000 m; D = 0.75 m; wave speed a = 1000 m/s; Q = 0.5 m³/s; friction factor = 0.012.
Pump: QR = 0.25 m³/s; HR = 60 m; NR = 1100 rpm; WR² = 16.85 kg·m²; ηR = 0.84.
Trapezoidal channel: water level of Reservoir 1 = 1.5 m; length = 200 km; bottom width = 1 m; bottom slope = 0.0002; side slopes 0.5 horizontal to 1 vertical; Manning roughness = 0.012.

The differential equations for free-surface unsteady flow are also solved by MOC [25]. An interpolation procedure is necessary so that the calculations can proceed along the characteristic lines. Errors are introduced by the interpolation approximation, and high-order interpolation is often adopted to reduce them; however, the consumed time and the storage space of the simulation then increase significantly. To investigate the computing efficiency of the GPU-MOC method with high-order interpolation, first-order and second-order Lagrange interpolations [28] are used in simulating the transient processes of free-surface unsteady flow in this system.

Figure 8 presents the comparisons of the history curves of pump head, discharge, and speed, and of the head of the open channel at its downstream end, by the first-order and second-order interpolations. The results for the pump parameters are nearly the same for the two interpolations, because the treatments of the pressurized pipe and the pump are identical, and the water-level change in the open channel is small, which has little impact on the pump parameters. After power is lost, the reactive torque of the liquid on the impeller causes the rotational speed to decrease, which, in turn, reduces the pressure head and discharge. As a result, the discharge at the channel downstream end that delivers water to the pump decreases quickly, causing the water level to rise. The water levels at the channel downstream end computed by the two interpolation methods differ because of interpolation errors. Once the pump reaches its reverse-rotating steady-state conditions, the water level at the channel downstream end increases uniformly and slowly.
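The two interpolation schemes compared here are the usual Lagrange forms evaluated at the foot of a characteristic. A generic C sketch follows, with a local coordinate r measured in grid intervals; it illustrates the standard formulas behind [28] and is not the authors' code:

```c
/* First-order (linear) Lagrange interpolation at local coordinate
   r in [0, 1], between nodal values f0 (at r = 0) and f1 (at r = 1). */
double lagrange1(double f0, double f1, double r)
{
    return (1.0 - r) * f0 + r * f1;
}

/* Second-order Lagrange interpolation at local coordinate r in [0, 2],
   over three equally spaced nodal values f0, f1, f2 (at r = 0, 1, 2).
   Exact for quadratics, hence the smaller interpolation error, at the
   cost of more floating-point work per characteristic. */
double lagrange2(double f0, double f1, double f2, double r)
{
    return 0.5 * (r - 1.0) * (r - 2.0) * f0
         - r * (r - 2.0) * f1
         + 0.5 * r * (r - 1.0) * f2;
}
```

The extra multiplications per section update in the second-order form are exactly the kind of uniform floating-point work at which the GPU excels, which is consistent with the larger speedup reported below for the second-order case.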

Figure 8. History curves of pump and channel parameters for the first-order and second-order interpolations.

The computing efficiency of the GPU-MOC method with high-order interpolation is analyzed. The pipe segment length is ∆x = 1 m, the time step is ∆t = 0.001 s, and the corresponding section number is 201,000. The transient behavior of the system within 600 s after the pump trip is simulated. The block size is set to 256.

The GPU-MOC method shows a great advantage in computing-intensive problems, especially with high-order interpolations (Figure 9). For both the first-order and second-order interpolations, the performance of GPU-MOC is much better than that of CPU-MOC. Moreover, for GPU-MOC, the second-order interpolation achieves more significant performance gains than the first-order interpolation. For the first-order interpolation, the consumed time is 45.726 hours by CPU-MOC but only 0.217 hours by GPU-MOC, an achieved speedup ratio of 210.7. For the second-order interpolation, the consumed time is 288.9 hours by CPU-MOC but only 0.281 hours by GPU-MOC, a speedup ratio of up to 1027. That is to say, the speedup ratio of the second-order interpolation is 4.9 times that of the first-order. Therefore, high-order computing methods executed on the GPU can be used in simulating large-scale practical problems to obtain more accurate results.

Figure 9. Performances of GPU-MOC and CPU-MOC for the two interpolation methods.

6. Conclusions and Future Works

This paper presents a novel GPU implementation of MOC on a single GPU card for accelerating the simulation of hydraulic transients. The CPU is mainly responsible for executing logical operations, and the GPU is mainly responsible for executing floating-point operations. The MOC kernel, which performs the calculation, is executed on the GPU and invoked by the CPU in an odd-even alternating form. The configuration parameters of the MOC kernel are organized as a 1D linear structure to fit the line structure of the pipes.

The benchmark single-pipe water hammer problem is simulated to verify the accuracy and efficiency of the GPU-MOC method. It is found that the numerical results of GPU-MOC are identical to those of CPU-MOC, but the computational efficiency is much higher, and the obtained speedup ratios of GPU-MOC are up to hundreds compared with CPU-MOC. Two computing-intensive problems, a large-scale water distribution system and a long-distance water transmission system, are simulated, which successfully validates the high computational efficiency of the GPU-MOC method. Moreover, the GPU-MOC method achieves more significant performance gains as the size of the problem increases. Therefore, finer calculation sections and high-order methods can be used to simulate large-scale practical problems for more accurate results.

The GPU-MOC method has great value for practical applications because the algorithm is simple, highly efficient, easy to program, and convenient to popularize. Future work will focus on further optimization of the present algorithm and on more computationally intensive transient simulations, such as 2D or 3D simulations and simulations on multiple GPUs.

Author Contributions: Conceptualization: W.M. and Y.C.; methodology: W.M. and J.W.; software: W.M. and J.W.; validation: W.M., Y.C., and Z.Y.; writing—original draft: W.M. and Y.C.; writing—review and editing: W.M., Y.C., Z.Y., Y.Z., and S.S.

Funding: This research was funded by the National Natural Science Foundation of China (NSFC) (grant nos. 51839008, 51579187, and 11172219).

Acknowledgments: Our grateful thanks to Linsheng Xia for his valuable advice on revising the manuscript. Thanks also to Editor Lynne Xu for correcting the English and for editing and proofreading the text.

Conflicts of Interest: The authors declare no conflict of interest.



Appendix A

MOC Kernel Function

__global__ void MOC_kernel (float B, float R, float *q, float *h, float *fq, float *fh)
{
    // Declare variables
    float Cn, Cm, Bn, Bm;
    // Thread ID
    int tx = threadIdx.x + blockIdx.x * blockDim.x;
    // Calculate constant coefficients
    Cn = h[tx-1] + B * q[tx-1];
    Cm = h[tx+1] - B * q[tx+1];
    Bn = B + R * fabs(q[tx-1]);
    Bm = B + R * fabs(q[tx+1]);
    // Calculate head and discharge of MOC sections
    fq[tx] = (Cn - Cm) / (Bn + Bm);
    fh[tx] = (Cn * Bm + Cm * Bn) / (Bn + Bm);
}

Appendix B

Codes on CPU

int main()
{
    // Allocate unified memory accessible from both CPU and GPU
    float *q, *h, *fq, *fh;
    int size = N * sizeof(float);
    cudaMallocManaged ((void**) &h, size);
    cudaMallocManaged ((void**) &q, size);
    cudaMallocManaged ((void**) &fh, size);
    cudaMallocManaged ((void**) &fq, size);
    // Initialize variables
    ......
    // Configuration parameters
    int threads = 256;
    int blocks = N/threads + 1;
    // Launch the kernel in an odd-even alternating form, swapping the
    // roles of the old and new arrays each time step (the loop bound NT,
    // the number of time steps, is elided in the original listing)
    for (int i = 0; i < NT; i++)
    {
        if (i % 2 == 0)
            MOC_kernel <<< blocks, threads >>> (B, R, q, h, fq, fh);
        else
            MOC_kernel <<< blocks, threads >>> (B, R, fq, fh, q, h);
        // Synchronize threads
        cudaDeviceSynchronize();
    }
    // Output results
    ......
}
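For validation, the interior-section update performed by the MOC kernel of Appendix A can be cross-checked against a plain-C version on the CPU. The sketch below is a hypothetical helper, not part of the original code; it updates only the interior sections i = 1 to n−2, since the boundary sections are computed by the boundary-condition routines of the full model:

```c
#include <math.h>

/* One MOC time step for the interior sections of a single pipe,
   mirroring the GPU kernel in Appendix A: B is the characteristic
   impedance term and R the friction term; q and h hold the current
   discharge and head, and fq and fh receive the new values. */
void moc_step_cpu(int n, float B, float R,
                  const float *q, const float *h, float *fq, float *fh)
{
    for (int i = 1; i < n - 1; i++) {
        float Cn = h[i-1] + B * q[i-1];   /* C+ characteristic */
        float Cm = h[i+1] - B * q[i+1];   /* C- characteristic */
        float Bn = B + R * fabsf(q[i-1]);
        float Bm = B + R * fabsf(q[i+1]);
        fq[i] = (Cn - Cm) / (Bn + Bm);
        fh[i] = (Cn * Bm + Cm * Bn) / (Bn + Bm);
    }
}
```

A quick sanity check for both versions is that a frictionless steady state (R = 0 with uniform q and h) is reproduced unchanged after one step.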

References

1. Boulos, P.; Karney, B.; Wood, D.J.; Lingireddy, S. Hydraulic Transient Guidelines. J. Am. Water Resour. Assoc. 2005. [CrossRef]
2. Duan, H.F.; Che, T.C.; Lee, P.J.; Ghidaoui, M.S. Influence of nonlinear turbulent friction on the system frequency response in transient pipe flow modelling and analysis. J. Hydraul. Res. 2018, 56, 451–463. [CrossRef]
3. Pozos-Estrada, O.; Sánchez-Huerta, A.; Breña-Naranjo, J.A.; Pedrozo-Acuña, A. Failure analysis of a water supply pumping pipeline system. Water 2016, 8, 395. [CrossRef]
4. Capponi, C.; Zecchin, A.C.; Ferrante, M.; Gong, J. Numerical study on accuracy of frequency-domain modelling of transients. J. Hydraul. Res. 2017, 55, 813–828. [CrossRef]
5. Gonçalves, F.; Ramos, H. Hybrid energy system evaluation in water supply systems: Artificial neural network approach and methodology. J. Water Supply Res. Technol. AQUA 2012, 61, 59–72. [CrossRef]
6. Shibu, A.; Reddy, M.J. Optimal design of water distribution networks considering fuzzy randomness of demands using cross entropy optimization. Water Resour. Manag. 2014, 28, 4075–4094. [CrossRef]
7. Wood, D.J.; Dorsch, R.G.; Lighter, C. Wave-plan analysis of unsteady flow in closed conduits. J. Hydraul. Div. 1966, 102005, 1–9. [CrossRef]
8. Wood, D.J.; Lingireddy, S.; Boulos, P.F.; Karney, B.W.; Mcpherson, D.L. Numerical methods for modeling transient flow in distribution systems. J. Am. Water Resour. Assoc. 2005, 97, 104–115. [CrossRef]
9. Boulos, P.F.; Karney, B.W.; Wood, D.J.; Lingireddy, S. Hydraulic transient guidelines for protecting water distribution systems. J. Am. Water Work. Assoc. 2005, 97, 111–124. [CrossRef]
10. Jung, B.S.; Boulos, P.F.; Wood, D.J. Pitfalls of water distribution model skeletonization for surge analysis. J. Am. Water Work. Assoc. 2007, 99, 87–98. [CrossRef]
11. Izquierdo, J.; Iglesias, P.L. Mathematical modelling of hydraulic transients in single systems. Math. Comput. Model. 2002, 35, 801–812. [CrossRef]
12. Izquierdo, J.; Iglesias, P.L. Mathematical modelling of hydraulic transients in complex systems. Math. Comput. Model. 2004, 39, 529–540. [CrossRef]
13. Li, S.; Yang, C.; Jiang, D. Modeling of Hydraulic Pipeline Transients Accompanied With Cavitation and Gas Bubbles Using Parallel Genetic Algorithms. J. Appl. Mech. 2008, 75. [CrossRef]
14. Martínez-Bahena, B.; Cruz-Chávez, M.; Ávila-Melgar, E.; Cruz-Rosales, M.; Rivera-Lopez, R. Using a Genetic Algorithm with a Mathematical Programming Solver to Optimize a Real Water Distribution System. Water 2018, 10, 1318. [CrossRef]
15. Fan, H.; Chen, N.; Yang, L. Parallel transient flow computations of large hydraulic systems. J. Tsinghua Univ. 2006, 46, 696–699.
16. Owens, J.D.; Houston, M.; Luebke, D.; Green, S.; Stone, J.E.; Phillips, J.C. GPU computing. Proc. IEEE 2008, 96, 879–899. [CrossRef]
17. Wu, J.Y.; Cheng, Y.G.; Zhou, W.; Zhang, C.Z.; Diao, W. GPU acceleration of FSI simulations by the immersed boundary-lattice Boltzmann coupling scheme. Comput. Math. Appl. 2016. [CrossRef]
18. Bonelli, F.; Tuttafesta, M.; Colonna, G.; Cutrone, L.; Pascazio, G. An MPI-CUDA approach for hypersonic flows with detailed state-to-state air kinetics using a GPU cluster. Comput. Phys. Commun. 2017, 219, 178–195. [CrossRef]
19. Zhang, C.Z.; Cheng, Y.G.; Wu, J.Y.; Diao, W. Lattice Boltzmann simulation of the open channel flow connecting two cascaded hydropower stations. J. Hydrodyn. 2016, 28, 400–410. [CrossRef]
20. Griebel, M.; Zaspel, P. A multi-GPU accelerated solver for the three-dimensional two-phase incompressible Navier-Stokes equations. Comput. Sci. Res. Dev. 2010, 25, 65–73. [CrossRef]
21. Nvidia Corporation. NVIDIA CUDA C Programming Guide; Nvidia Corporation: Santa Clara, CA, USA, 2011; ISBN 9783642106712.
22. Bonelli, F.; Viggiano, A.; Magi, V. A Numerical Analysis of Hydrogen Underexpanded Jets. In Proceedings of the ASME 2012 Internal Combustion Engine Division Spring Technical Conference, Torino, Piemonte, Italy, 6–9 May 2012; American Society of Mechanical Engineers: New York, NY, USA, 2012; pp. 681–690.
23. Niemeyer, K.E.; Sung, C.J. Recent progress and challenges in exploiting graphics processors in computational fluid dynamics. J. Supercomput. 2014, 67, 528–564. [CrossRef]
24. Kirk, D.B.; Hwu, W.M.W. Programming Massively Parallel Processors: A Hands-on Approach, 2nd ed.; Morgan Kaufmann: San Francisco, CA, USA, 2013; ISBN 9780124159921.
25. Wylie, E.B.; Streeter, V.L.; Suo, L.S. Fluid Transients in Systems; Prentice Hall: Englewood Cliffs, NJ, USA, 1993; ISBN 9780133221732.
26. Karney, B.W.; McInnis, D. Efficient Calculation of Transient Flow in Simple Pipe Networks. J. Hydraul. Eng. 1992, 118, 1014–1030. [CrossRef]
27. Chaudhry, M.H. Applied Hydraulic Transients; Van Nostrand Reinhold: New York, NY, USA, 1979; ISBN 9781461485377.
28. Singh, A.K.; Bhadauria, B.S. Finite difference formulae for unequal sub-intervals using Lagrange's interpolation formula. Int. J. Math. Anal. 2009, 3, 815.

© 2018 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).