Computers & Fluids 110 (2015) 198–210


Parallel hybrid PSO with CUDA for 1D heat conduction equation

Aijia Ouyang a, Zhuo Tang a,*, Xu Zhou a, Yuming Xu a, Guo Pan a, Keqin Li a,b

a College of Information Science and Engineering, Hunan University, Changsha, Hunan 410082, China
b Department of Computer Science, State University of New York at New Paltz, New Paltz, NY 12561, USA

* Corresponding author. E-mail address: [email protected] (Z. Tang).
http://dx.doi.org/10.1016/j.compfluid.2014.05.020

Article history:
Received 8 December 2013
Received in revised form 6 May 2014
Accepted 15 May 2014
Available online 28 May 2014

Keywords:
Heat conduction equation
Spline difference method
Particle swarm optimization
Conjugate gradient method
GPU
CUDA

Abstract

Objectives: We propose a parallel hybrid particle swarm optimization (PHPSO) algorithm to reduce the computation cost, because solving the one-dimensional (1D) heat conduction equation requires a large computational cost that imposes a great challenge on both common hardware and software equipment.

Background: Over the past few years, GPUs have quickly emerged as inexpensive parallel processors due to their high computation power and low price. The CUDA library can be used from Fortran, C, C++, and other languages, and it is easily programmed. Using a GPU with CUDA can efficiently reduce the computation time of solving the heat conduction equation.

Methods: Firstly, a spline difference method is used to discretize the 1D heat conduction equation into a system of linear equations; secondly, the system of linear equations is transformed into an unconstrained optimization problem; finally, it is solved by using the PHPSO algorithm. The PHPSO is based on CUDA and hybridizes the PSO and the conjugate gradient method (CGM).

Results: A numerical case is given to illustrate the effectiveness and efficiency of our proposed method. Comparison of three parallel algorithms shows that the PHPSO is competitive in terms of speedup and standard deviation. The results also show that using the PHPSO to solve the one-dimensional heat conduction equation can outperform two parallel algorithms as well as the HPSO itself.

Conclusions: It is concluded that the PHPSO is an efficient and effective approach to the 1D heat conduction equation, as it is shown to have strong robustness and high speedup.

© 2014 Elsevier Ltd. All rights reserved.

1. Introduction

Recently, taking advantage of graphics processing capability, numerical simulation has become available via the utilization of a GPU (graphics processing unit) instead of a CPU (central processing unit) [1]. In its initial stage, however, the GPU could not communicate with an API (application programming interface) such as OpenGL and DirectX. In November 2006, NVIDIA unveiled the industry's first DirectX 10 GPU, the GeForce 8800 GTX. The GeForce 8800 GTX was also the first GPU built with NVIDIA's CUDA (compute unified device architecture). Over the past few years, graphics processing units (GPUs) have quickly emerged as inexpensive parallel processors due to their high computation power and low price [2,3]. The CUDA library can be used from Fortran, C, C++, and other languages, and it is easily programmed. GPU techniques have been applied to large-scale social networks [4], smoothed particle hydrodynamics simulations [5], volume visualization [6], Hopfield neural networks [7], particle filters [8], robotic map building [9], finite difference schemes [10,11], object recognition [12], thermal analysis [13], hydrodynamic simulations [14], thermodynamic systems [15], solidification process simulation [16], computational fluid dynamics [17,18], and particle simulation [19,20].

The one-dimensional heat conduction equation is a well-known simple second-order linear partial differential equation (PDE) [21-24]. PDEs like the heat conduction equation often arise in modeling problems in science and engineering. It is also used in financial mathematics in the modeling of options. For example, the differential equation of the Black-Scholes option pricing model can be transformed into the heat conduction equation [25,26].

Over the last few years, many physical phenomena have been formulated into nonlocal mathematical models. These phenomena are modeled by nonclassical parabolic initial-boundary value problems in one space variable which involve an integral term, over the spatial domain, of a function of the desired solution. This integral term may appear in a boundary condition, in which case the problem is called nonlocal, or in the governing partial differential equation itself, in which case the problem is often referred to as a partial integro-differential equation, or in both [27].

The presence of an integral term in a boundary condition can greatly complicate the application of standard numerical techniques such as finite difference procedures, finite element methods, spectral techniques, and boundary integral equation schemes. Therefore, it is essential that nonlocal boundary value problems be converted to a more desirable form to make them more applicable to problems of practical interest.

Recently, nonlocal boundary value problems have been covered in the literature [28-31]. For instance, Dehghan [32] presented a new finite difference technique for solving the one-dimensional heat conduction equation subject to the specification of mass, but his methods are only first-order accurate. Caglar et al. [33] developed a third degree B-spline method to solve the heat conduction Eqs. (1)-(3) with accuracy $O(k^2 + h^2)$, and at the endpoints the approximation order is only second order. Liu et al. [34] used a quartic spline method to solve one-dimensional telegraphic equations; we consider the 1D heat conduction equation in accordance with the above method. Tsourkas and Rubinsky used a parallel genetic algorithm for solving heat conduction problems [35,36].

It is well known that the genetic algorithm is a classic evolutionary algorithm. It has been successfully applied to all kinds of real problems, such as optimizing the system state in a cloud system [37]. PSO is also an evolutionary algorithm; it was developed by Dr. Eberhart and Dr. Kennedy in 1995, inspired by the social behavior of bird flocking and fish schooling. PSO has an outstanding performance in optimizing functions. Integrating these successful methods and efficient hybrid strategies, we present a PHPSO algorithm based on CUDA for the one-dimensional heat conduction equation. The contributions of this paper are described as follows.

1. A new method for solving the one-dimensional heat conduction equation with nonlocal boundary value conditions is constructed by using quartic spline functions. It is shown that the accuracy of the presented method can reach $O(k^2 + h^4)$ at the interior nodal points.
2. We also obtain a new method to deal with the nonlocal boundary conditions with the accuracy being $O(k + h^4)$. It is much better than the classical finite difference method with the accuracy of $O(k + h^2)$.
3. We design a parallel hybrid algorithm by hybridizing the PSO and CGM with the CUDA technique. Such an algorithm can speed up the convergence to the optimum solution.

The remainder of this paper is organized in the following way. Section 2 gives a detailed theoretical analysis and mathematical proof concerning the 1D heat conduction equation. Section 3 shows the process of transforming a system of linear equations into an unconstrained optimization problem. In Section 4, the serial algorithms of PSO and CGM are presented. In Section 5, the architecture of CUDA is introduced first, and then a parallel PSO algorithm, a parallel CGM algorithm, and a parallel hybrid PSO algorithm are proposed in detail. Section 6 displays the experimental results and discussions. Finally, the paper concludes with Section 7.

2. Method of numerical processing

2.1. Mathematical model

We consider the one-dimensional nonclassical heat conduction equation

\[ \frac{\partial u}{\partial t} = a\,\frac{\partial^2 u}{\partial x^2}, \qquad 0 < x < 1,\ 0 < t \le T, \tag{1} \]

with initial condition

\[ u(x,0) = f(x), \qquad 0 \le x \le 1, \tag{2} \]

boundary condition

\[ \frac{\partial u}{\partial x}(1,t) = g(t), \qquad 0 < t \le T, \tag{3} \]

and a nonlocal condition

\[ \int_0^b u(x,t)\,dx = m(t), \qquad 0 < t \le T,\ 0 < b \le 1. \tag{4} \]

In this problem, $f(x)$, $g(t)$ and $m(t)$ are given functions, and $a$ and $b$ are constants. If $b = 1$, Eq. (4) can be differentiated as

\[ m'(t) = \int_0^1 u_t\,dx = \int_0^1 a u_{xx}\,dx = a u_x(1,t) - a u_x(0,t). \tag{5} \]

The derivation holds only when $m$ and $u$ are differentiable.
2.2. Quartic spline and interpolation error

Let $P$ be a uniform partition of $[0,1]$ as follows:

\[ 0 = x_0 < x_1 < \cdots < x_n = 1, \tag{6} \]

where $x_i = ih$, $h = 1/n$.

The quartic spline space on $[0,1]$ is defined as

\[ S_4^3(P) = \left\{ s \in C^3[0,1] : s|_{[x_{i-1},x_i]} \in P_4,\ i = 1(1)n \right\}, \]

where $P_d$ is the set of polynomials of degree at most $d$.

It is easy to know that $\dim S_4^3(P) = n + 4$. For any $s \in S_4^3(P)$, the restriction of $s$ in $[x_{i-1}, x_i]$ can be expressed as

\[ s(x) = s_{i-1} + h s'_{i-1} t + h^2 s''_{i-1}\,\frac{t^2}{12}(6 - t^2) + h^2 s''_i\,\frac{t^4}{12} + h^3 s'''_{i-1}\,\frac{t^3}{12}(2 - t), \tag{7} \]

where $x = x_{i-1} + th$, $t \in [0,1]$, $s_i = s(x_i)$, $s'_i = s'(x_i)$, $s''_i = s''(x_i)$, $s'''_i = s'''(x_i)$, $i = 0(1)n$.

This leads to

\[ s_i = s_{i-1} + h s'_{i-1} + \frac{5}{12} h^2 s''_{i-1} + \frac{1}{12} h^2 s''_i + \frac{1}{12} h^3 s'''_{i-1}, \tag{8} \]

\[ s'_i = s'_{i-1} + \frac{2}{3} h s''_{i-1} + \frac{1}{3} h s''_i + \frac{1}{6} h^2 s'''_{i-1}, \tag{9} \]

\[ s'''_i = -s'''_{i-1} + \frac{2}{h}\left( s''_i - s''_{i-1} \right), \tag{10} \]

for $i = 1(1)n$. Based on the above three equations, we can have

\[ \frac{1}{12} s''_{i+1} + \frac{5}{6} s''_i + \frac{1}{12} s''_{i-1} = \frac{1}{h^2}\left[ s_{i+1} - 2 s_i + s_{i-1} \right], \qquad i = 1(1)n-1. \tag{11} \]

Likewise, if we write the quartic spline function $s$ in $[x_{i-1}, x_i]$ as follows:

\[ s(x) = s_i + h s'_i t + h^2 s''_i\,\frac{t^2}{12}(6 - t^2) + h^2 s''_{i-1}\,\frac{t^4}{12} + h^3 s'''_i\,\frac{t^3}{12}(2 + t), \tag{12} \]

where $x = x_i + th$, $t \in [-1, 0]$, this leads to

\[ s_{i-1} = s_i - h s'_i + \frac{5}{12} h^2 s''_i + \frac{1}{12} h^2 s''_{i-1} - \frac{1}{12} h^3 s'''_i, \tag{13} \]

\[ s'_{i-1} = s'_i - \frac{2}{3} h s''_i - \frac{1}{3} h s''_{i-1} + \frac{1}{6} h^2 s'''_i, \tag{14} \]

\[ s'''_{i-1} = -s'''_i + \frac{2}{h}\left( s''_i - s''_{i-1} \right), \tag{15} \]

for $i = 1(1)n$.
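For illustration, the spline piece (7) can be evaluated directly once the nodal values and derivatives are known. The following C sketch is our own illustration, not code from the original implementation; the function and parameter names are assumptions:

```c
/* Sketch: evaluate the quartic spline piece (7) on [x_{i-1}, x_i].
   s0, ds0, dds0, ddds0 are s, s', s'', s''' at x_{i-1}; dds1 is s'' at x_i;
   t = (x - x_{i-1}) / h must lie in [0, 1]. */
double quartic_spline_piece(double t, double h, double s0, double ds0,
                            double dds0, double dds1, double ddds0)
{
    double t2 = t * t, t3 = t2 * t, t4 = t3 * t;
    return s0
         + h * ds0 * t
         + h * h * dds0 * (t2 / 12.0) * (6.0 - t2)
         + h * h * dds1 * (t4 / 12.0)
         + h * h * h * ddds0 * (t3 / 12.0) * (2.0 - t);
}
```

As a quick consistency check, setting t = 1 in this expression reproduces the endpoint relation (8).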

Denote $D^i s(x) = s^{(i)}(x)$. We have

\[ s_{i-1} = s_i - h s'_i + \frac{1}{2} h^2 s''_i - \cdots = e^{-hD} s_i. \tag{16} \]

Then Eq. (11) can be rewritten as

\[ \frac{1}{12}\left[ e^{hD} + 10 + e^{-hD} \right] s''_i = \frac{1}{h^2}\left[ e^{hD} - 2 + e^{-hD} \right] s_i, \qquad i = 1(1)n-1. \tag{17} \]

Equivalently,

\[ s''_i = \frac{12}{h^2}\,\frac{e^{hD} - 2 + e^{-hD}}{e^{hD} + 10 + e^{-hD}}\, s_i, \qquad i = 1(1)n-1. \tag{18} \]

Theorem 1. Assume that $g(x)$ is sufficiently smooth in the interval $[0,1]$. Let $s \in S_4^3(P)$ be the quartic spline which interpolates $g(x)$ as follows:

\[ s_i = g(x_i), \quad i = 0(1)n, \tag{19} \]

\[ s''_n = g''(x_n), \qquad s'_0 + \frac{h^2}{12} s'''_0 = g'(x_0) + \frac{h^2}{12} g'''(x_0), \qquad s'_n - \frac{h^2}{12} s'''_n = g'(x_n) - \frac{h^2}{12} g'''(x_n). \tag{20} \]

Then we have

\[ s''_i = g''(x_i) - \frac{1}{240} h^4 g^{(6)}(x_i) + O(h^6), \qquad i = 1(1)n-1, \tag{21} \]

\[ g'_0 = \frac{g_1 - g_0}{h} - \frac{1}{12} h g''_1 - \frac{5}{12} h g''_0 - \frac{1}{12} h^2 g'''_0 + O(h^4), \tag{22} \]

\[ g'_n = \frac{g_n - g_{n-1}}{h} + \frac{5}{12} h g''_n + \frac{1}{12} h g''_{n-1} + \frac{1}{12} h^2 g'''_n + O(h^4), \tag{23} \]

where $g_i = g(x_i)$, $g'_i = g'(x_i)$, $g''_i = g''(x_i)$, $g'''_i = g'''(x_i)$, $i = 0(1)n$.

Proof. The proof of Eq. (21) can be seen in [38]. From Eqs. (11) and (19), we can get

\[ \left( e^{2Dh} + 10 e^{Dh} + 1 \right) s''_0 = \frac{12}{h^2}\left( e^{2Dh} - 2 e^{Dh} + 1 \right) g_0. \tag{24} \]

Then, similar to the proof of Eq. (21), we can obtain the following result:

\[ s''_0 = g''(x_0) + O(h^4). \tag{25} \]

Then, according to Eqs. (8) and (20), we have

\[ g'_0 + \frac{h^2}{12} g'''_0 = s'_0 + \frac{h^2}{12} s'''_0 = \frac{s_1 - s_0}{h} - \frac{1}{12} h s''_1 - \frac{5}{12} h s''_0 = \frac{g_1 - g_0}{h} - \frac{1}{12} h g''_1 - \frac{5}{12} h g''_0 + O(h^4). \tag{26} \]

Thus, Eq. (22) holds. Similarly, by using Eq. (13), we can prove Eq. (23). □

2.3. Quartic spline method

We consider the heat conduction Eq. (1) with the initial and boundary value conditions (2) and (3) and the derivative boundary condition (5). The domain $[0,1] \times [0,T]$ is divided into an $n \times m$ mesh with spatial step size $h = 1/n$ in the $x$ direction and time step size $k = T/m$, respectively.

Grid points $(x_i, t_j)$ are defined by

\[ x_i = ih, \quad i = 0(1)n, \tag{27} \]

\[ t_j = jk, \quad j = 0(1)m, \tag{28} \]

in which $n$ and $m$ are integers.

Let $s(x_i, t_j)$ and $U_i^j$ be approximations to $u(x_i, t_j)$, and let $M_i(t) = \frac{\partial^2 s(x,t)}{\partial x^2}\big|_{(x_i,t)}$, respectively.

Assume that $u(x,t)$ is the exact solution to Eq. (1). For any fixed $t$, let $s(x,t) \in S_4^3(P)$ be the quartic spline interpolating $u(x,t)$ as in

\[ s(x_i, t) = u(x_i, t), \quad i = 0(1)n, \tag{29} \]

\[ \frac{\partial^2 s}{\partial x^2}(x_n, t) = \frac{\partial^2 u}{\partial x^2}(x_n, t), \tag{30} \]

\[ \frac{\partial s}{\partial x}(x_0, t) + \frac{1}{12} h^2 \frac{\partial^3 s}{\partial x^3}(x_0, t) = \frac{\partial u}{\partial x}(x_0, t) + \frac{1}{12} h^2 \frac{\partial^3 u}{\partial x^3}(x_0, t), \tag{31} \]

\[ \frac{\partial s}{\partial x}(x_n, t) - \frac{1}{12} h^2 \frac{\partial^3 s}{\partial x^3}(x_n, t) = \frac{\partial u}{\partial x}(x_n, t) - \frac{1}{12} h^2 \frac{\partial^3 u}{\partial x^3}(x_n, t). \tag{32} \]

Then it follows from Theorem 1 that

\[ \frac{\partial^2 u}{\partial x^2}(x_i, t) = \frac{\partial^2 s}{\partial x^2}(x_i, t) + O(h^4), \qquad i = 1(1)n-1. \tag{33} \]

For any fixed $t$, by using the Taylor series expansion, Eq. (1) can be discretized at the point $(x_i, t_j)$ into

\[ \frac{u(x_i, t_{j+1}) - u(x_i, t_j)}{k} = \frac{a}{2}\left[ \frac{\partial^2 s(x,t)}{\partial x^2}\bigg|_i^j + \frac{\partial^2 s(x,t)}{\partial x^2}\bigg|_i^{j+1} \right] + O(k^2 + h^4), \tag{34} \]

where $i = 1(1)n-1$, $j = 0(1)m-1$.

Substituting (34) into the spline relation (11) and using Eq. (29), we conclude

\[ (1 - 6ar)\,u(x_{i+1}, t_{j+1}) + (10 + 12ar)\,u(x_i, t_{j+1}) + (1 - 6ar)\,u(x_{i-1}, t_{j+1}) = (1 + 6ar)\,u(x_{i+1}, t_j) + (10 - 12ar)\,u(x_i, t_j) + (1 + 6ar)\,u(x_{i-1}, t_j) + O(k^2 + h^4), \tag{35} \]

where $r = k/h^2$.

Neglecting the error term, we can get the following difference scheme:

\[ (1 - 6ar)\,U_{i+1}^{j+1} + (10 + 12ar)\,U_i^{j+1} + (1 - 6ar)\,U_{i-1}^{j+1} = (1 + 6ar)\,U_{i+1}^j + (10 - 12ar)\,U_i^j + (1 + 6ar)\,U_{i-1}^j, \qquad i = 1(1)n-1,\ j = 0(1)m-1, \tag{36} \]

with the accuracy being $O(k^2 + h^4)$.

It is evident that the difference scheme (36) coincides with the classical fourth-order compact difference scheme, which is unconditionally stable.
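To make the structure of scheme (36) concrete, the following C sketch assembles the constant-coefficient tridiagonal system that advances the interior unknowns by one time level. This is our own illustration under the notation above, not code from the paper's implementation; the function and array names are assumptions, and the rows for i = 0 and i = n come from the boundary schemes derived in Section 2.4:

```c
/* Sketch: build the interior rows of the tridiagonal system (36) for one
   time step. r = k/h^2; U_old holds U^j; lower/diag/upper hold the three
   nonzero coefficients of row i; rhs holds the right-hand side. */
void assemble_interior_rows(int n, double a, double r,
                            const double *U_old,
                            double *lower, double *diag, double *upper,
                            double *rhs)
{
    double cl = 1.0 - 6.0 * a * r;    /* coefficient of U^{j+1}_{i-1} and U^{j+1}_{i+1} */
    double cd = 10.0 + 12.0 * a * r;  /* coefficient of U^{j+1}_i */
    for (int i = 1; i <= n - 1; i++) {
        lower[i] = cl;
        diag[i]  = cd;
        upper[i] = cl;
        rhs[i] = (1.0 + 6.0 * a * r) * (U_old[i - 1] + U_old[i + 1])
               + (10.0 - 12.0 * a * r) * U_old[i];
    }
}
```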

2.4. The difference scheme on the boundary

From the interpolation conditions (30)-(32) and Theorem 1, for each $t > 0$, we have

\[ \frac{\partial u(x_0,t)}{\partial x} = \frac{u(x_1,t) - u(x_0,t)}{h} - \frac{1}{12} h \frac{\partial^2 u(x_1,t)}{\partial x^2} - \frac{5}{12} h \frac{\partial^2 u(x_0,t)}{\partial x^2} - \frac{1}{12} h^2 \frac{\partial^3 u(x_0,t)}{\partial x^3} + O(h^4), \tag{37} \]

\[ \frac{\partial u(x_n,t)}{\partial x} = \frac{u(x_n,t) - u(x_{n-1},t)}{h} + \frac{5}{12} h \frac{\partial^2 u(x_n,t)}{\partial x^2} + \frac{1}{12} h \frac{\partial^2 u(x_{n-1},t)}{\partial x^2} + \frac{1}{12} h^2 \frac{\partial^3 u(x_n,t)}{\partial x^3} + O(h^4). \tag{38} \]

Let $t = t_{j+1}$; it follows from Eqs. (1) and (37) that

\[ u_x(x_0, t_{j+1}) = \frac{u(x_1,t_{j+1}) - u(x_0,t_{j+1})}{h} - \frac{h}{12a}\,\frac{u(x_1,t_{j+1}) - u(x_1,t_j)}{k} - \frac{5h}{12a}\,\frac{u(x_0,t_{j+1}) - u(x_0,t_j)}{k} - \frac{1}{12} h^2 \frac{\partial^3 u(x_0,t_{j+1})}{\partial x^3} + O(k + h^4). \tag{39} \]

From Eqs. (1), (3) and (5), we can get

\[ \frac{\partial^3 u(0,t)}{\partial x^3} = \frac{1}{a}\,\frac{\partial^2 u(0,t)}{\partial x\,\partial t} = \frac{1}{a}\,g'(t) - \frac{1}{a^2}\,m''(t). \tag{40} \]

Substituting (40) into (39) and neglecting the error term, we get the fourth-order difference scheme at $x = x_0$:

\[ \left(1 - \frac{1}{12ar}\right) U_1^{j+1} - \left(1 + \frac{5}{12ar}\right) U_0^{j+1} + \frac{1}{12ar} U_1^j + \frac{5}{12ar} U_0^j = h\,u_x(0, t_{j+1}) + \frac{1}{12} h^3 \left[ \frac{1}{a}\,g'(t_{j+1}) - \frac{1}{a^2}\,m''(t_{j+1}) \right]. \tag{41} \]

Similarly, the difference scheme at the other end $x = x_n$ is

\[ \left(1 + \frac{5}{12ar}\right) U_n^{j+1} - \left(1 - \frac{1}{12ar}\right) U_{n-1}^{j+1} - \frac{5}{12ar} U_n^j - \frac{1}{12ar} U_{n-1}^j = h\,u_x(1, t_{j+1}) - \frac{1}{12a} h^3\,g'(t_{j+1}). \tag{42} \]

3. Derivation of objective function

The form of a system of linear equations is as follows:

\[ AX = B, \tag{43} \]

or

\[ \sum_{j=1}^{N} a_{ij} x_j = b_i, \qquad i = 1, 2, \ldots, M. \tag{44} \]

In the above formula, $A = (a_{ij})_{M \times N} \in R^{M \times N}$ is a matrix, $X = (x_1, x_2, \ldots, x_N)^T$ is a vector, and $B = (b_1, b_2, \ldots, b_M)^T$ is also a vector.

Let the absolute errors of $X$ and $B$ be $\delta X$ and $\delta B$, respectively. Then the relative error of $X$ can be estimated as follows:

\[ \frac{\|\delta X\|}{\|X\|} \le K(A)\,\frac{\|\delta B\|}{\|B\|}. \tag{45} \]

If we take the L2 norm in the above formula, the condition number of matrix $A$ is

\[ K(A) = \sigma_{\max} / \sigma_{\min}. \tag{46} \]

In the formula, $\sigma_{\max}$ and $\sigma_{\min}$ are the maximum and minimum singular values of matrix $A$, respectively. If $K(A)$ is very large, Eq. (43) is an ill-conditioned system of linear equations. The characteristic of an ill-conditioned system is that, even though $A$ and $B$ change only a little, the solutions may differ greatly.

To use PSO to solve the system of linear equations (43), we transform it into an unconstrained optimization problem. Obviously, the solution of the linear system (43) is the optimal solution of the minimization problem (47) in the L1 norm; the two are completely equivalent. The formula of the problem is as follows:

\[ \min E = \min \sum_{i=1}^{M} \left| \sum_{j=1}^{N} a_{ij} x_j - b_i \right|. \tag{47} \]

Consequently, we set the fitness function as follows:

\[ f(X) = \sum_{i=1}^{M} \left| \sum_{j=1}^{N} a_{ij} x_j - b_i \right|. \tag{48} \]

In formula (48), the solution vector $X = (x_1, x_2, \ldots, x_N)$ is composed of $x_j$ ($j = 1, 2, \ldots, N$); $X$ is a real vector.

4. Serial algorithms

4.1. Particle swarm optimization

Particle swarm optimization (PSO), a swarm intelligence algorithm for solving optimization problems, is initialized randomly with a cluster of particles, after which it searches for the optimal value by means of updating generations [39]. Given that the search space is $D$-dimensional, the population is composed of many particles. Particle $i$ refers to the $i$-th particle in the population, which is denoted by the $D$-dimensional vector $X_i = (x_{i1}, x_{i2}, \ldots, x_{iD})^T$. $V$ denotes the search velocity of particles, and $V_i = (v_{i1}, v_{i2}, \ldots, v_{iD})^T$ is the search velocity of particle $i$. The visited positions of particles are denoted by $P$, and $P_i = (p_{i1}, p_{i2}, \ldots, p_{iD})^T$ is the local best position, i.e., the best previously visited position of particle $i$. The global best position is represented by $G = (g_1, g_2, \ldots, g_D)^T$, which is the best of all local best positions among the population. At each step, the search velocities and positions of all particles are updated according to Eqs. (49) and (50), and the process repeats until a user-defined stopping criterion is met:

\[ v_{id}^{k+1} = \omega v_{id}^k + c_1 r_1 \left( p_{id}^k - x_{id}^k \right) + c_2 r_2 \left( g_d^k - x_{id}^k \right), \tag{49} \]

\[ x_{id}^{k+1} = x_{id}^k + v_{id}^{k+1}. \tag{50} \]

Here, $k$ indicates the current iterative step, $i \in \{1, 2, \ldots, N\}$, and $d \in \{1, 2, \ldots, D\}$; $N$ refers to the size of the population. $r_1$ and $r_2$ represent independent, uniformly distributed random variables in the range $(0,1)$; $c_1$ and $c_2$ denote the acceleration constants, and $\omega$ refers to the inertia weight factor [40].

4.2. Conjugate gradient method

The CGM is one of the most popular iterative methods dedicated to solving linear systems of equations with a sparse, symmetric and positive-definite matrix of coefficients. A version of the CGM algorithm implemented in this study is shown in Algorithm 1.

The main steps of CGM for an unconstrained optimization problem are as follows:

Step 1. Set an initial point $x^{(0)}$ and a precision $e > 0$.
Step 2. If $\|\nabla f(x^{(0)})\| \le e$, stop the computation; the minimum point is $x^{(0)}$. Otherwise, go to Step 3.
Step 3. Let $p^{(0)} = -\nabla f(x^{(0)})$, and set $k = 0$.
Step 4. Use a one-dimensional search to solve for $t_k$ such that $f(x^{(k)} + t_k p^{(k)}) = \min_{t \ge 0} f(x^{(k)} + t p^{(k)})$; let $x^{(k+1)} = x^{(k)} + t_k p^{(k)}$, and go to Step 5.

Step 5. If $\|\nabla f(x^{(k+1)})\| \le e$, stop the computation; the minimum point is $x^{(k+1)}$. Otherwise, go to Step 6.
Step 6. If $k + 1 = n$, let $x^{(0)} = x^{(n)}$ and go to Step 3; otherwise, go to Step 7.
Step 7. Let $p^{(k+1)} = -\nabla f(x^{(k+1)}) + \lambda_k p^{(k)}$, with $\lambda_k = \|\nabla f(x^{(k+1)})\|^2 / \|\nabla f(x^{(k)})\|^2$; let $k = k + 1$ and go to Step 4.

Algorithm 1. Conjugate gradient method

Input: A, b, n, tolerance ε
Output: x_{k+1}
1: r_0 := b − A x_0
2: p_0 := r_0
3: k := 0
4: while k <= n do
5:   α_k := (r_k^T r_k) / (p_k^T A p_k)
6:   x_{k+1} := x_k + α_k p_k
7:   r_{k+1} := r_k − α_k A p_k
8:   if ||r_{k+1}|| < ε then
9:     break
10:  end if
11:  β_k := (r_{k+1}^T r_{k+1}) / (r_k^T r_k)
12:  p_{k+1} := r_{k+1} + β_k p_k
13:  k := k + 1
14: end while
15: Output the solution x_{k+1}

Algorithm 2. Particle swarm optimization

Input: x_min, x_max, G, D, N, p, lbp, gbp, f, lbf, gbf, v
Output: gbp', gbf'
1: Initialize parameters p, lbp, and v
2: Compute fitness f
3: Find lbp and gbp
4: for g = 1 : G do
5:   Update v and p
6:   Evaluate fitness f
7:   Update lbp and gbp
8: end for
9: Output gbp and gbf

4.3. Hybrid particle swarm optimization

The global search ability of the basic PSO is extremely strong while its local search ability is extremely weak, which is exactly the opposite of the CGM. To overcome the shortcomings of the two algorithms and utilize only their advantages, we present a hybrid particle swarm optimization (HPSO) algorithm, in which the CGM is incorporated into the basic PSO. In the HPSO, because the CGM is a local algorithm, it can obtain the optimal value with no need to calculate the derivative of the fitness function; therefore, based on the original PSO, only a modest computation cost is added. In addition, each particle in the population is regarded as a point of the CGM, and each iteration of HPSO has two parts: the CGM part and the PSO part. In each iteration, all particles are updated, and the elite individual generated by the PSO algorithm is replaced by the best individual generated by the CGM. In this way, the accuracy of searching the position in each HPSO iteration is improved. Therefore, the HPSO not only has higher precision but also achieves a faster convergence speed compared with the original PSO and CGM.

The main steps of HPSO for an unconstrained optimization problem are as follows:

Step 1. Initialize the parameters of the population, such as population size, position, velocity, acceleration factors, and inertia weight factor.
Step 2. Compute the fitness value of each particle.
Step 3. Find the local optimum and the global optimum.
Step 4. Update the velocity and position of each particle.
Step 5. Compute the fitness value of the population.
Step 6. Update the local optimum and the global optimum.
Step 7. Execute a precise local search by CGM for the individual of global optimum in the population of HPSO.
Step 8. If the termination criteria are satisfied, output the results and terminate the HPSO algorithm; otherwise, go to Step 4.

Algorithm 3. Hybrid particle swarm optimization

Input: x_min, x_max, G, D, N, p, lbp, gbp, f, lbf, gbf, v
Output: gbp', gbf'
1: Initialize parameters p, lbp, and v
2: Compute fitness f
3: Find lbp and gbp
4: for g = 1 : G do
5:   Update v and p
6:   Evaluate fitness f
7:   Update lbp and gbp
8:   Execute local search of CGM
9: end for
10: Output gbp and gbf

5. Parallel algorithms

5.1. The architecture of CUDA

We describe the CUDA architecture briefly in this subsection; by this means, we hope to make it easier to comprehend the design of our presented GPU implementation. A GPU is a multi-core, multi-threaded, highly parallel processor with enormous computing power. The NVIDIA Corporation released the CUDA framework to help users develop software for their GPUs. The architecture of CUDA is designed around a scalable array of multiprocessors as shown in Fig. 1, and is based on a SIMT programming model.

Fig. 1. The architecture of CUDA.

A parallel program on a GPU (also termed the device) is interleaved with a serial one executed on the CPU (also termed the host). A thread block is a stream of threads executed concurrently on a multiprocessor, which has synchronized access to a shared memory. However, threads originating from different blocks cannot cooperate in this way. Blocks are combined to form a grid. Threads in a grid are enumerated and distributed to multiprocessors when a CUDA code on the CPU calls a kernel function. All the threads in a grid can access the same global memory. Two additional read-only memory spaces are accessible by all threads: the constant memory and the texture memory (see Table 1).

Table 1
Specifications of GPUs.

Parameter                             | GTX 465 | C2050
Number of stream processors           | 352     | 448
Global memory size (GB)               | 1       | 2.62
Shared memory size (KB)               | 48      | 48
Constant memory size (KB)             | 64      | 64
Memory bus width (bits)               | 256     | 384
Memory bandwidth (GB/s)               | 102.6   | 144
Memory clock rate (MHz)               | 802     | 1536
Estimated peak performance (Gflops)   | 855.4   | 1030
GPU clock rate (GHz)                  | 0.6     | 1.15
Warp size                             | 32      | 32
Maximum number of threads per MP      | 1024    | 1024
CUDA compute capability               | 2.0     | 2.0
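To illustrate the host/device split and the grid-block-thread hierarchy just described, here is a minimal, self-contained CUDA example of our own (not from the paper):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// A grid of blocks runs on the device; each thread handles one element.
__global__ void scale(float *x, float s, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= s;                       // guard redundant threads
}

int main()
{
    const int n = 1 << 20;
    float *h = new float[n], *d;
    for (int i = 0; i < n; i++) h[i] = 1.0f;
    cudaMalloc(&d, n * sizeof(float));
    cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice);
    int tpb = 256;                              // threads per block
    int blocks = (n + tpb - 1) / tpb;           // blocks per grid
    scale<<<blocks, tpb>>>(d, 2.0f, n);         // kernel launch from host
    cudaMemcpy(h, d, n * sizeof(float), cudaMemcpyDeviceToHost);
    printf("%f\n", h[0]);                       // prints 2.000000
    cudaFree(d);
    delete[] h;
    return 0;
}
```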

5.2. Parallel particle swarm optimization

In this section, we explain the meaning of each kernel function in detail, with its pseudo code attached. There are five kernel functions in the PPSO algorithm, and they are implemented on the GPU. The flow chart of PPSO is shown in Fig. 2.

Fig. 2. Flowchart of PPSO.

5.2.1. Kernel function dev_rnd

The purpose of dev_rnd is to generate the initial states of the random number sequences. Each thread handles one variable; the kernel includes the curand library and curand kernel header files and calls the library function curand_init to generate high-quality pseudo-random numbers. Given different seeds, curand_init produces different initial states and sequences. Each thread is assigned a unique serial number.

Sequences generated with different seeds usually do not have statistically correlated values, but some choices of seeds may result in statistically correlated sequences. Sequences generated with the same seed and different sequence numbers will not have statistically correlated values. For the highest quality parallel pseudo-random number generation, each experiment should be assigned a unique seed. Within an experiment, each thread of computation should be assigned a unique sequence number. If an experiment spans multiple kernel launches, it is recommended that threads between kernel launches be given the same seed, and sequence numbers be assigned in a monotonically increasing way. If the same configuration of threads is launched, the random state can be preserved in global memory between launches to avoid state setup time [41]. Experiments have shown that the random function in the curand library has better quality and higher efficiency than the rand function of the CPU.

Algorithm 5. Random numbers of CUDA

Kernel function: dev_rnd⟨Grid, Block⟩(ds, seed)
Define executive parameters:
  dim3 Block(block_xsize, block_ysize, 1)
  dim3 Grid(grid_xsize, grid_ysize)
Input: seed
Output: ds
1: Define the parameters: dimension index xid, population index yid, thread position posID
2: if xid < d then
3:   Invoke curand_init(seed, posID, 0, &ds[posID]) to generate the random number states in CUDA
4: end if
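A minimal sketch of how dev_rnd, and the kernels that later draw random numbers, might use the curand device API is given below. The calls curand_init and curand_uniform are the documented curand device functions; the one-state-per-position layout and kernel signatures are our assumptions:

```cuda
#include <curand_kernel.h>

// One state per position variable: seed once, then draw uniforms later.
__global__ void dev_rnd(curandState *ds, unsigned long long seed, int d)
{
    int xid = blockIdx.x * blockDim.x + threadIdx.x;  // dimension index
    int yid = blockIdx.y;                             // particle index
    int posID = yid * d + xid;                        // position in grid
    if (xid < d)
        curand_init(seed, posID, 0, &ds[posID]);      // unique sequence per thread
}

__global__ void draw_two(curandState *ds, float *r1, float *r2, int d)
{
    int xid = blockIdx.x * blockDim.x + threadIdx.x;
    int posID = blockIdx.y * d + xid;
    if (xid < d) {
        curandState ls = ds[posID];   // load state into a register
        r1[posID] = curand_uniform(&ls);
        r2[posID] = curand_uniform(&ls);
        ds[posID] = ls;               // store state back to global memory
    }
}
```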

5.2.2. Kernel function init

init is used to initialize the particle positions, velocities, and individual best positions in the population according to the formulas of PSO. Every thread in a thread block represents one variable. Each row in the grid represents a particle. After defining the dimension index xid, the population index yid, and the position of the thread in the whole grid posID, we call curand_uniform() to generate two random numbers uniformly distributed between 0.0 and 1.0, and then use them to initialize the position and velocity, respectively.

Algorithm 6. Initialize population with CUDA

Kernel function: init⟨Grid, Block⟩(p, lbp, v, ds)
Input: ds
Output: p, lbp, v
1: Define the dimension index xid, group index yid, and the thread position in the grid posID
2: if xid < d then
3:   Load the initialization state sequence of the random numbers ds from global memory and assign it to the register variable ls
4:   Invoke curand_uniform(&ls) to generate the random numbers r1 and r2 between 0.0 and 1.0
5:   Store ls back into the global memory variable ds
6:   Generate register variables x and vel based on the velocity and position equations
7:   Copy x back into p and lbp; copy vel back into v
8: end if

5.2.3. Kernel function Update

Update is used to modify the position, individual optimal position, and velocity of each particle in the population, one thread per variable; its parameter setting is the same as that of init. It first updates the individual optimal position according to the modifying mark, then calls curand_uniform twice to generate two random numbers between 0.0 and 1.0. They are used to generate a new particle velocity, further update the particle position, and reinitialize any components exceeding the limits at the same time.

Algorithm 7. Update of velocity and position

Kernel function: Update⟨Grid, Block⟩(p, lbp, gbp, v, g_u, ds)
Input: p, lbp, gbp, v, g_u, ds
Output: p', lbp', v'
1: Define the dimension index xid, group index yid, and the thread position in the grid posID
2: if xid < d then
3:   Load p, lbp, gbp, v, g_u, and ds from global memory and assign them to the register variables pos, lb, gb, vel, u, and ls
4:   Invoke curand_uniform(&ls) to generate the random numbers r1 and r2 between 0.0 and 1.0
5:   Store ls back into the global memory variable ds
6:   if u == 1 then
7:     Assign pos to lb and lbp
8:   end if
9:   Update the velocity based on Eq. (49) and copy it back to v
10:  Update the particle position based on Eq. (50) and copy it back to p
11: end if

5.2.4. Kernel function Findgbf

Findgbf is used to update the individual optimal positions, and to find the global optimal index and the global optimal fitness. One particle is one thread. In this paper, the population sizes are set to less than 512, so one thread block holds a whole population, and we make full use of the shared memory and registers. First, we define the thread index tid, the shared memory variable s, and the register variable t, and then initialize s and t in accordance with the different requirements. If the initial mark flag = 0, the fitness value f is copied directly to the individual extremum lbf; otherwise, lbf and the modifying mark g_u are modified correspondingly after comparison. Then we call the device function reduceToMax(s, tid) or reduceToMin(s, tid) to perform a parallel reduction according to the maximum or minimum value, in order to find the group extremum gbf and ultimately the global optimal index ID. The trick is to set t to the maximum thread number for all threads that do not hold the group extremum gbf, and otherwise to the thread number holding gbf. Then we reuse the shared memory and call reduceToMin(s, tid) to execute the reduction and assign the result of the first thread to ID. If the best value of this generation is not superior to the group optimal value gbf, ID is set to N, which means there is no need to update the best position gbp of the group.

The parallel reductions reduceToMax and reduceToMin first perform a first-level reduction (reading the data in parallel) and record the values in shared memory, and then perform a second-level reduction within the shared memory. Since a naive loop executes inefficiently, this optimization is very important: full loop unrolling, code optimization by a template parameter, and removing the unnecessary synchronization within a warp greatly improve the reduction efficiency.

5.2.5. Kernel function Findgbp

Findgbp is required to find the best position gbp of the group according to the global optimal index. One thread handles one variable, and the size of dimension y of the grid is 1. If the global optimal index ID is not equal to N, the ID-th row of the particle positions is copied to the best position gbp of the group. These functions are accompanied by detailed pseudo code.

Experiments have proved that it is very effective to combine the particle swarm with the conjugate gradient algorithm in a GPU-parallel implementation, which greatly improves the search quality and computing speed, as well as the performance of the algorithm. In the kernel executive parameters, we define block_xsize (not more than 512; 128 or 256 is more appropriate, so that an SM usually has enough resources to execute at least two active blocks), block_ysize = 1, grid_ysize = p (namely, the number of populations), and grid_xsize = (d + block_xsize − 1)/block_xsize. Representing the block dimensions in this form ensures that the number of blocks is an integer and that the number of threads in the whole grid is no less than the actual number of elements to process. Accordingly, in a kernel, the test if (xid < d), which skips the redundant threads, is used in the calculation. If the minimum value is needed, we define "better" as "<", and otherwise as ">".

Algorithm 4. Parallel PSO algorithm on GPU

Input: G, D, g_D, flag, N, p, lbp, gbp, h_gbp, f, lbf, gbf, h_gbf, v, ds, g_u, ID, Grid, Block, gbfGrid, gbfBlock, gbpGrid, gbpBlock
Output: gbp, gbf
1: Initialize the parameters of the device
2: Initialize the random numbers in CUDA: dev_rnd⟨Grid, Block⟩(ds, time(NULL))
3: Initialize the population parameters and compute fitness f: init⟨Grid, Block⟩(p, lbp, v, ds), h_cf(p, f, N, D)
4: Update optimum lbp and find optimum gbp: Findgbf⟨gbfGrid, gbfBlock⟩(f, lbf, gbf, g_u, ID, flag), Findgbp⟨gbpGrid, gbpBlock⟩(gbp, p, ID)
5: for g = 1 to G do
6:   flag = 1
7:   Update the velocity and position of each individual in accordance with Eqs. (49) and (50): Update⟨Grid, Block⟩(p, lbp, gbp, v, g_u, ds)
8:   Compute the fitness f: h_cf(p, f, N, D)
9:   Update the individual best position lbp, and find the global best individual gbp
10: end for
11: Return the global optimal position gbp and fitness gbf: cudaMemcpy(h_gbp, gbp, D*sizeof(float), cudaMemcpyDeviceToHost), cudaMemcpy(h_gbf, gbf, sizeof(float), cudaMemcpyDeviceToHost)
12: printResults()
13: writedata()
14: H_Free()
15: cudaThreadExit()
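Before the detailed pseudo code of Findgbf (Algorithm 8 below), the two-level, template-unrolled reduction it relies on (Section 5.2.4) can be sketched as follows. This is our illustration of the reduceToMin idea, not the paper's exact code; the warp-synchronous pattern shown was valid on the Fermi-class (compute capability 2.0) GPUs used here:

```cuda
// Sketch of reduceToMin: data already loaded into shared memory s;
// BS is the block size as a template parameter (a power of two),
// so the compiler fully unrolls the reduction.
template <int BS>
__device__ void reduceToMin(volatile float *s, int tid)
{
    if (BS >= 512) { if (tid < 256) s[tid] = fminf(s[tid], s[tid + 256]); __syncthreads(); }
    if (BS >= 256) { if (tid < 128) s[tid] = fminf(s[tid], s[tid + 128]); __syncthreads(); }
    if (BS >= 128) { if (tid <  64) s[tid] = fminf(s[tid], s[tid +  64]); __syncthreads(); }
    if (tid < 32) {  // last warp: no __syncthreads() needed on these GPUs
        if (BS >= 64) s[tid] = fminf(s[tid], s[tid + 32]);
        if (BS >= 32) s[tid] = fminf(s[tid], s[tid + 16]);
        if (BS >= 16) s[tid] = fminf(s[tid], s[tid +  8]);
        if (BS >=  8) s[tid] = fminf(s[tid], s[tid +  4]);
        if (BS >=  4) s[tid] = fminf(s[tid], s[tid +  2]);
        if (BS >=  2) s[tid] = fminf(s[tid], s[tid +  1]);
    }
}
```

After the call, s[0] holds the block minimum; reduceToMax is identical with fmaxf in place of fminf.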

Algorithm 8. Update of local and global optimum

Kernel function: Findgbf⟨gbfGrid, gbfBlock⟩(f, lbf, gbf, g_u, ID, flag)
Define executive parameters:
  dim3 gbfGrid(1, 1)
  dim3 gbfBlock(dev_P, 1, 1)
Input: f, lbf, gbp, v, g_u, ds, flag
Output: f', lbf', gbf, g_u', ID
1: Define the thread (particle) index tid
2: Initialize the shared memory vector s and register variable bf. To get the maximum value, it is initialized as 0; to get the minimum value, it is initialized as 1E20
3: if tid < N then
4:   Load f from global memory and assign it to s and bf
5:   Synchronize
6:   if flag == 0 then
7:     Copy bf to the individual best fitness lbf
8:   else
9:     Load lbf from global memory and assign it to the register variable bf1
10:    Assign the result of "bf better bf1" to the register variable update and copy it back to g_u
11:    if update == 1 then
12:      Replace the individual optimum using the new fitness
13:    end if
14:  end if
15:  if MAXIMIZE then
16:    reduceToMax<dev_N>(s, tid)
17:  else
18:    reduceToMin<dev_N>(s, tid)
19:  end if
20:  if flag == 0 || (flag == 1 && s[0] better gbf) then
21:    if tid == 0 then
22:      Assign the best fitness to the global best value
23:    end if
24:    Assign the thread index of the best value to bf, or set bf to dev_N; assign bf to the shared memory variable s
25:    reduceToMin<dev_N>(s, tid)
26:    Copy the best value index s[0] back into ID
27:  else
28:    ID = N
29:  end if
30: end if

Algorithm 9. Search global optimal position

Kernel function: Findgbp⟨gbpGrid, gbpBlock⟩(gbp, p, ID)
Define executive parameters:
  dim3 gbpBlock(block_xsize, 1, 1)
  dim3 gbpGrid(grid_xsize, 1)
Input: p, ID, d
Output: gbp
1: Define the dimension index xid
2: if xid < D && ID != N then
3:   Assign the solution of the No. ID particle in p to the global best position gbp
4: end if

5.3. Parallel conjugate gradient method

The conjugate gradient method lies between the steepest descent method and Newton's method. It needs only first-derivative information and overcomes the slow convergence of the steepest descent method. Moreover, it does not need to store and calculate the Hesse matrix and its inverse, as Newton's method does. Its fast convergence speed and quadratic termination property make it one of the most efficient algorithms for solving large linear systems. It needs only a small storage capacity, has finite-step convergence and high stability, and does not require any external parameters. Realizing the conjugate gradient method on a GPU requires the CUBLAS library; that is, the header file needs to include cublas.h. Assume that we use it to obtain the solution of the linear system Ax = b; the pseudo code is shown as Algorithm 10. All operations are completed on the GPU. In the first place, the allocation of the temporary variables is conducted in the video memory of the GPU, so cublasAlloc() and cublasFree() are used in place of malloc() and free() from the standard C library, and all the vector and matrix operations adopt the functions in the CUBLAS library. The role of each function is explained as follows: cublasAlloc allocates video memory space; cublasFree releases video memory space; cublasScopy copies the vector x into the vector y; cublasSdot computes the dot product of vectors x and y; cublasSgemv performs a matrix-vector multiplication; cublasSaxpy multiplies the vector x by a scalar and adds it to the vector y, overwriting the latter with the result; cublasSscal scales the vector x by a scalar and overwrites it with the result (see Fig. 3).

Fig. 3. Flowchart of PCGM.

Algorithm 10. Parallel conjugate gradient method

Input: n, m, A, b, x
Output: x
1: Define variables a, b, e, r2 and temporary vectors p, r, Ap; invoke cublasAlloc to allocate their storage
2: Achieve the first step: r := b − Ax
3: Invoke cublasScopy(n, r, 1, p, 1) and cublasSdot(n, r, 1, r, 1) to achieve the second step: p := r
4: for k = 1 to n do
5:   Assign the value of e to r2
6:   Invoke cublasSgemv('n', n, n, 1.0f, A, n, p, 1, 0.0f, Ap, 1) to form Ap
7:   Solve the value of a: a = r2 / cublasSdot(n, p, 1, Ap, 1)
8:   Invoke cublasSaxpy(n, a, p, 1, x, 1) to update x
9:   Invoke cublasSaxpy(n, −a, Ap, 1, r, 1) to update r
10:  if e = cublasSdot(n, r, 1, r, 1) < m then
11:    break
12:  end if
13:  Obtain b: b = e / r2
14:  Invoke cublasSscal(n, b, p, 1) and cublasSaxpy(n, 1.0f, r, 1, p, 1) to update p
15: end for
16: Invoke cublasFree to free the memory of p, r and Ap
17: Obtain the solution x
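For concreteness, a compact host-side version of Algorithm 10 using the legacy CUBLAS API named in the text might look as follows. This is our sketch, not the paper's code: error checking is omitted, cublasInit() is assumed to have been called, and dA, db, dx are assumed to be preallocated device pointers with A symmetric positive definite:

```c
#include <cublas.h>

/* Sketch of Algorithm 10: CG on the GPU via legacy CUBLAS.
   n: order of the system, eps: tolerance on r'r,
   dA, db, dx: device pointers (dx holds the initial guess and result). */
void pcgm(int n, float eps, const float *dA, const float *db, float *dx)
{
    float *dp, *dr, *dAp;
    cublasAlloc(n, sizeof(float), (void **)&dp);
    cublasAlloc(n, sizeof(float), (void **)&dr);
    cublasAlloc(n, sizeof(float), (void **)&dAp);

    cublasScopy(n, db, 1, dr, 1);                              /* r := b      */
    cublasSgemv('n', n, n, -1.0f, dA, n, dx, 1, 1.0f, dr, 1);  /* r := b - Ax */
    cublasScopy(n, dr, 1, dp, 1);                              /* p := r      */
    float e = cublasSdot(n, dr, 1, dr, 1);

    for (int k = 1; k <= n && e >= eps; k++) {
        float r2 = e;
        cublasSgemv('n', n, n, 1.0f, dA, n, dp, 1, 0.0f, dAp, 1); /* Ap */
        float alpha = r2 / cublasSdot(n, dp, 1, dAp, 1);
        cublasSaxpy(n,  alpha, dp,  1, dx, 1);   /* x := x + alpha p  */
        cublasSaxpy(n, -alpha, dAp, 1, dr, 1);   /* r := r - alpha Ap */
        e = cublasSdot(n, dr, 1, dr, 1);
        float beta = e / r2;
        cublasSscal(n, beta, dp, 1);             /* p := r + beta p   */
        cublasSaxpy(n, 1.0f, dr, 1, dp, 1);
    }
    cublasFree(dp); cublasFree(dr); cublasFree(dAp);
}
```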

5.4. Parallel hybrid particle swarm optimization

Both PSO and CGM have strong parallelism. Running their combination on a GPU can improve the search quality and efficiency. A GPU program needs to take into consideration the algorithm, the parallel partitioning, the instruction throughput, the memory bandwidth, and many other factors. First, we need to confirm the serial and parallel parts of the task. In the PHPSO algorithm, all the steps in the process can be implemented on the GPU. Then we need to initialize the device and prepare the data. We store all vectors in global memory and constants in the constant cache to optimize the memory accesses and to meet the requirements of coalesced access. We have to make sure that the first visited address starts from a multiple of 16, and that the word length of the data read each time by every thread is 32 bits. Finally, each step is mapped to a kernel function suiting CUDA's two-level parallel model, and the instruction stream is optimized. If only a small number of threads is needed, an approach similar to "if thread < N" can be used to avoid the long time cost or incorrect results caused by redundant threads running simultaneously. To speed up the program, fast CUDA arithmetic instructions can be adopted, and using #pragma unroll lets the compiler unroll loops effectively. We should try to avoid bank conflicts, balance the resources, and adjust the usage of shared memory and registers. We can regard particles, that is, individuals, as threads, or regard the individual variables as threads. Therefore, we use one variable as one thread to initialize and update the population, which improves parallelism. The CUDA runtime API or the CUDA driver API can be used to manage GPU resources, allocate memory, and start the kernel functions on the GPU. In this paper, we use the runtime API to implement operations such as device management, context management, and execution control. The detailed processing flow is shown in Fig. 4.

Fig. 4. Flowchart of PHPSO.

Algorithm 11. Parallel hybrid PSO algorithm on GPU

Input: x_min, x_max, G, D, g_D, N, flag, ds, p, lbp, v, f, lbf, g_u, gbp, gbf, ID, A, b, x, n, m, Grid, Block, gbfGrid, gbfBlock, gbpGrid, gbpBlock
Output: gbp, gbf
1: Initialize the parameters of the device
2: Initialize the random numbers in CUDA: dev_rnd⟨Grid, Block⟩(ds, seed)
3: Initialize the population parameters and compute fitness f: init⟨Grid, Block⟩(p, lbp, v, ds), h_cf(p, f, N, D)
4: Update optimum lbp and find optimum gbp: Findgbf⟨gbfGrid, gbfBlock⟩(f, lbf, gbf, g_u, ID, flag), Findgbp⟨gbpGrid, gbpBlock⟩(gbp, p, ID)
5: for g = 1 to G do
6:   flag = 1
7:   Update the velocity and position of each individual in accordance with Eqs. (49) and (50): Update⟨Grid, Block⟩(p, lbp, gbp, v, g_u, ds)
8:   Compute the fitness f: h_cf(p, f, N, D)
9:   Findgbf⟨gbfGrid, gbfBlock⟩(f, lbf, gbf, g_u, ID, flag)
10: end for
11: Return the global optimal position gbp and fitness gbf: cudaMemcpy(h_gbp, gbp, D*sizeof(float), cudaMemcpyDeviceToHost), cudaMemcpy(h_gbf, gbf, sizeof(float), cudaMemcpyDeviceToHost)
12: printResults()
13: writedata()
14: H_Free()
15: cudaThreadExit()

6. Experimental results and discussion

6.1. Example of application

In this section, our new method is tested on the following problem selected from the literature [32]. Absolute errors of the numerical solutions are calculated and compared with those obtained by using the third degree B-spline method [33].

In this example, we consider the heat conduction equation (1) with

\[ f(x) = \cos\left(\frac{\pi}{2}x\right), \qquad 0 < x < 1, \tag{51} \]

\[ g(t) = -\frac{\pi}{2}\exp\left(-\frac{\pi^2}{4}t\right), \qquad 0 < t \le T, \tag{52} \]

\[ m(t) = \frac{2}{\pi}\exp\left(-\frac{\pi^2}{4}t\right), \qquad 0 < t \le T, \tag{53} \]

with

\[ u(x,t) = \exp\left(-\frac{\pi^2}{4}t\right)\cos\left(\frac{\pi}{2}x\right) \tag{54} \]

as its analytical solution.

6.2. Parameter setting

All the experiments are carried out on an eight-core 2.3 GHz AMD Opteron 6134 machine with 8 GB main memory, using the GCC compiler under Linux. The parameter setting of the PHPSO in terms of population size and stopping criteria is the same as for PSO, HPSO and PPSO to ensure fairness of comparison. The population size of the evolutionary algorithms is set to 16, and each of the evolutionary algorithms is stopped after 1000 generations. The other parameters of PSO, HPSO, PPSO and PHPSO are set as follows: the maximum velocity $V_{\max} = 1.5$, the acceleration factors $c_1 = c_2 = 1.2$, and the inertia weight factor $\omega = 0.9$. Therefore, in these experiments, all the methods are run on the same computer configuration for the one-dimensional heat conduction equation. Twenty independent runs with different methods are performed, and the average values are recorded in the corresponding tables, with the best solution marked in bold.

6.3. Performance of HPSO

At first, the results with $h = k = 0.05$, $h = k = 0.025$, $h = k = 0.01$, $h = k = 0.005$, $h = k = 0.0025$, and $h = k = 0.001$, using the HPSO discussed in Sections 3 and 4, are shown in Table 2. We present the relative error $|U_i^j - u(x_i,t_j)| / |u(x_i,t_j)|$ for $u(0.5, 0.1)$ using HPSO. The numerical results are compared with those obtained by the methods in [42-45]; see Table 2. It can be seen from Table 2 that the numerical results of the proposed HPSO are much better than those in [42-45]. In [32], Dehghan solved the example by using Saulyev's I technique, whose accuracy is first order with respect to the space and time variables, and the numerical integration trapezoidal rule is used to approximate the integral condition in (4). However, in this paper, we present a new method to deal with the nonlocal boundary conditions by using quartic splines [38], and the method is much easier than the numerical integration trapezoidal rule [32].

Table 2
Relative errors at various steps.

Method | Step 0.0500 | 0.0250 | 0.0100 | 0.0050 | 0.0025 | 0.0010
[32]   | 9.6E−3 | 2.5E−3 | 3.9E−4 | 9.6E−5 | 2.5E−5 | 4.3E−6
[42]   | 9.1E−3 | 2.3E−3 | 3.8E−4 | 9.4E−5 | 2.3E−5 | 4.1E−6
[43]   | 9.9E−2 | 3.0E−2 | 4.9E−3 | 1.2E−3 | 3.1E−4 | 5.0E−5
[44]   | 9.4E−2 | 2.4E−2 | 4.1E−3 | 1.0E−3 | 2.5E−4 | 4.0E−5
[45]   | 9.8E−2 | 3.7E−2 | 6.1E−3 | 1.5E−3 | 3.5E−4 | 6.0E−5
CGM    | 1.2E−4 | 3.0E−5 | 4.4E−6 | 1.1E−6 | 2.8E−7 | 4.5E−8
PSO    | 4.3E−2 | 8.9E−3 | 5.1E−3 | 2.1E−3 | 3.3E−4 | 2.6E−5
HPSO   | 1.0E−4 | 1.3E−5 | 2.1E−6 | 1.0E−6 | 1.4E−7 | 2.0E−8

Secondly, for the different steps $h = k = 1/20$, $h = k = 1/40$ and $h = 1/40$, $k = 1/60$, the example is solved by using HPSO. The maximum absolute errors of the numerical solutions are calculated and compared with those obtained by the method in [33]; see Table 3. In [33], Caglar et al. developed a numerical method by using third degree B-spline functions, but the accuracy of the third degree B-spline approximation method is only second order, and the approximation order at the end points is third order.

Table 3
The maximum absolute errors at various steps h and k.

Steps              | [33]   | CGM    | PSO    | HPSO
h = k = 1/20       | 6.2E−3 | 2.8E−4 | 5.3E−2 | 1.5E−4
h = k = 1/40       | 1.6E−3 | 6.2E−5 | 1.4E−2 | 3.4E−5
h = 1/40, k = 1/60 | 9.6E−4 | 3.6E−5 | 3.3E−3 | 1.3E−5
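The analytical solution (54) makes accuracy checks straightforward; the relative error reported in Table 2 can be computed at a grid point as in the following C sketch (ours, assuming the diffusion coefficient a = 1 as the reconstructed data imply):

```c
#include <math.h>

#ifndef M_PI
#define M_PI 3.14159265358979323846
#endif

/* Exact solution (54) of the test problem, with a = 1 assumed. */
double u_exact(double x, double t)
{
    return exp(-M_PI * M_PI * t / 4.0) * cos(M_PI * x / 2.0);
}

/* Relative error |U - u(x,t)| / |u(x,t)| as reported in Table 2. */
double relative_error(double U, double x, double t)
{
    double u = u_exact(x, t);
    return fabs(U - u) / fabs(u);
}
```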

6.4. Performance of PHPSO

Parallelization is a very important strategy for reducing the computing time of the corresponding sequential algorithms. In this experiment, we evaluate our algorithm on multi-core CPU and GPU systems to accelerate the computing speed of HPSO.

6.4.1. Experimental platform setup

In this paper, apart from the proposed GPU implementation, we also implement the algorithm on a multi-core CPU system with a shared memory architecture (using OpenMP as the programming model), so as to compare the performance of the two GPU versions with one multi-core CPU version.

A series of experiments are carried out on a dual-processor, eight-core 2.3 GHz AMD Opteron 6134 machine with 8 GB main memory. Two different NVIDIA graphics cards, GTX 465 and Tesla C2050, are used to check the scalability of our approach, as follows:

- The GTX 465 has 11 multiprocessors, 352 CUDA cores, 48 kB of shared memory per block, 607 MHz processor clock, 1 GB GDDR5 RAM, 102.6 GB/s memory bandwidth, and 802 MHz memory clock.
- The Tesla C2050 has 14 multiprocessors, 448 CUDA cores, 48 kB of shared memory per block, 1.15 GHz processor clock, 3 GB GDDR5 RAM, 144 GB/s memory bandwidth, and 1.5 GHz memory clock.

For these two cards, the maximum number of resident blocks per multiprocessor is 8, the maximum number of threads per block is 1024, the maximum number of resident threads per multiprocessor is 1536, and the total number of registers available per block is 32,768.

In software, the testing system is built on top of the Linux (Ubuntu 10.10) operating system, the NVIDIA CUDA Driver version 4.2, and GCC version 4.4.5.

6.4.2. Result analyses of PHPSO

For each test, we first run the sequential algorithm on a single core of the CPU and obtain the computational time. Secondly, we run the parallel algorithm on 16 cores of the CPU with 16 concurrent threads. Finally, we run the parallel algorithm on the GTX 465 GPU and the Tesla C2050 GPU, respectively.

To avoid system jitter, each experiment is repeated 30 times and the average value is considered. In terms of execution time, we only report the running time for the three stages of the algorithm, and do not report the time for other operations. Remarkably, the time associated with the data transfer between CPU and GPU is included in the execution time of the GPU implementation. Traditionally, the speedup is defined as sequential execution time over parallel execution time.

Moreover, to achieve the most effective and robust performance of the GPU implementation, we should select the optimum number of threads per block (TPB) to keep the GPU multiprocessors as active as possible. Therefore, we first conduct our preliminary speedup performance tests on the Tesla C2050 GPU using six different sizes of TPB (respectively 32, 64, 128, 256, 512, and 1024). The NVIDIA GPUs used here only allow a CUDA kernel to launch a maximum of 1024 TPB.

The computational experiments focus on two aspects of performance: the speedup comparison and the robustness comparison. The results are reported below. The aim of the first set of experiments is to evaluate how much the proposed PHPSO algorithm improves over its corresponding CPU implementation. A real case of the heat conduction equation is selected to process this investigation. Tables 6 and 9 present the comparison of computation times between the PHPSO and the CPU implementation of the same algorithm. Currently, both implementations use single precision floating point numbers, as this meets the precision requirements of the objective function and provides much better performance in speed than double precision floating point numbers on a current GPU [46].

We can see that the average speedup of PHPSO over HPSO is 12.12 with the GPU GTX 465 in Table 6, and the average speedup of PHPSO over HPSO is 13.06 with the GPU Tesla C2050 in Table 9. These data indicate that the speedup of PHPSO is ideal for the one-dimensional heat conduction equation. In all the tables, "std" denotes the standard deviation of the statistical data. The standard deviation of PHPSO on the GPU is much smaller than that of HPSO, which indicates that the robustness of PHPSO is stronger than that of HPSO. The random numbers generated by the GPU are of better quality than those of the CPU; consequently, the quality of PHPSO is better than that of HPSO. We can conclude that the parallel implementation of HPSO is very significant.

After comparing the speedups of the three parallel methods in Tables 4-9, we can find that the speedups of the three methods on the GTX 465 are a little smaller than those on the Tesla C2050. The average speedup of PCGM at various steps (spatial step equal to time step) with the GTX 465 and the Tesla C2050 is 7.08 and 7.27, respectively; the average speedup of PPSO at various steps with the GTX 465 and the Tesla C2050 is 4.17 and 4.72, respectively.

Table 4
The speedup of PCGM with GTX 465.

Step   | CGM (ms)  | PCGM (ms) | Speedup | CGM (std) | PCGM (std)
0.0500 | 3701.24   | 1075.94   | 3.44    | 8.2E−01   | 7.8E−01
0.0250 | 7809.62   | 1743.22   | 4.48    | 3.8E−01   | 3.4E−01
0.0100 | 18899.27  | 3405.27   | 5.55    | 4.8E−02   | 5.9E−02
0.0050 | 36853.58  | 4600.95   | 8.01    | 2.5E−02   | 5.4E−02
0.0025 | 76655.45  | 8172.22   | 9.38    | 3.1E−02   | 2.3E−02
0.0010 | 200070.71 | 17262.36  | 11.59   | 1.2E−02   | 1.9E−02

Table 5
The speedup of PPSO with GTX 465.

Step   | PSO (ms)  | PPSO (ms) | Speedup | PSO (std) | PPSO (std)
0.0500 | 2635.14   | 1365.36   | 1.93    | 8.6E−02   | 4.5E−04
0.0250 | 6547.86   | 2762.81   | 2.37    | 8.5E−03   | 7.8E−05
0.0100 | 15484.62  | 4931.41   | 3.14    | 9.2E−03   | 8.6E−05
0.0050 | 30018.98  | 6791.62   | 4.42    | 5.7E−03   | 3.2E−05
0.0025 | 68975.27  | 11515.07  | 5.99    | 6.3E−04   | 2.9E−05
0.0010 | 162562.02 | 22640.95  | 7.18    | 4.0E−04   | 1.7E−05

Table 6
The speedup of PHPSO with GTX 465.

Step   | HPSO (ms) | PHPSO (ms) | Speedup | HPSO (std) | PHPSO (std)
0.0500 | 3845.23   | 852.60     | 4.51    | 8.9E−03    | 5.5E−08
0.0250 | 7925.46   | 1096.19    | 7.23    | 5.3E−04    | 3.2E−08
0.0100 | 19965.25  | 2112.72    | 9.45    | 7.7E−04    | 5.1E−09
0.0050 | 38014.85  | 2830.59    | 13.43   | 7.1E−05    | 4.5E−09
0.0025 | 78987.26  | 4766.88    | 16.57   | 5.9E−05    | 3.2E−09
0.0010 | 230415.22 | 10692.12   | 21.55   | 4.4E−06    | 2.2E−10

Table 7
The speedup of PCGM with Tesla C2050.

Step   | CGM (ms)  | PCGM (ms) | Speedup | CGM (std) | PCGM (std)
0.0500 | 3701.24   | 1025.27   | 3.61    | 6.7E−01   | 5.6E−01
0.0250 | 7809.62   | 1701.44   | 4.59    | 2.2E−01   | 2.7E−01
0.0100 | 18899.27  | 3247.30   | 5.82    | 7.1E−02   | 6.3E−02
0.0050 | 36853.58  | 4477.96   | 8.23    | 5.5E−02   | 3.9E−02
0.0025 | 76655.45  | 8035.16   | 9.54    | 3.9E−02   | 2.4E−02
0.0010 | 200070.71 | 16955.15  | 11.80   | 2.8E−02   | 1.4E−02

Table 8
The speedup of PPSO with Tesla C2050.

Step   | PSO (ms)  | PPSO (ms) | Speedup | PSO (std) | PPSO (std)
0.0500 | 2635.14   | 1273.01   | 2.07    | 5.0E−02   | 2.2E−04
0.0250 | 6547.86   | 2305.58   | 2.84    | 8.6E−02   | 8.4E−05
0.0100 | 15484.62  | 3785.97   | 4.09    | 7.2E−03   | 6.7E−05
0.0050 | 30018.98  | 5528.36   | 5.43    | 5.4E−03   | 4.6E−05
0.0025 | 68975.27  | 10828.14  | 6.37    | 3.7E−03   | 3.6E−05
0.0010 | 162562.02 | 21617.29  | 7.52    | 2.4E−04   | 2.0E−05

Table 9
The speedup of PHPSO with Tesla C2050.

Step   | HPSO (ms) | PHPSO (ms) | Speedup | HPSO (std) | PHPSO (std)
0.0500 | 3845.23   | 767.51     | 5.01    | 6.8E−04    | 4.4E−08
0.0250 | 7925.46   | 1045.58    | 7.58    | 5.4E−05    | 2.6E−08
0.0100 | 19965.25  | 1959.30    | 10.19   | 3.4E−05    | 3.4E−08
0.0050 | 38014.85  | 2621.71    | 14.50   | 5.9E−05    | 5.9E−09
0.0025 | 78987.26  | 4400.40    | 17.95   | 2.8E−06    | 2.8E−09
0.0010 | 230415.22 | 9970.37    | 23.11   | 2.0E−07    | 2.0E−10

Fig. 5. The speedups of three methods on GTX 465.
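The paper does not state how the execution times in Tables 4-9 were collected; one standard way to time a GPU stage, with host-device transfers included as described above, is with CUDA events, as in this sketch of ours:

```cuda
#include <cuda_runtime.h>

// Sketch: time a GPU stage (transfers included) with CUDA events.
// The speedup is then formed as sequential time over this parallel time.
float time_gpu_stage(void (*stage)(void))
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start, 0);
    stage();                          // H2D copies + kernels + D2H copies
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;                        // milliseconds, as in Tables 4-9
}
```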

Fig. 6. The speedups of three methods on Tesla C2050.

However, the average speedup of PHPSO at various steps (spatial step equal to time step) with the GTX 465 and the Tesla C2050 is 12.12 and 13.06, respectively. Obviously, Figs. 5 and 6 show that the speedup of PHPSO is much higher than that of PCGM and PPSO. Consequently, the parallel implementation of CGM-only or PSO-only is not so ideal, and it is necessary for the parallel implementation to hybridize PSO and CGM.

We can see that the standard deviation of PHPSO at various steps is smaller than that of PPSO and much smaller than that of PCGM. Consequently, the robustness of PHPSO is stronger than that of PCGM and PPSO. It is completely obvious that the hybrid implementation of CGM and PSO (PHPSO) is very significant.

7. Conclusion

The one-dimensional heat conduction equation is highly common in scientific research fields and engineering applications. We adopt a quartic spline and a difference scheme to transform the heat conduction equation into a system of linear equations. In order to employ PSO and HPSO to solve it, the system of linear equations is successfully transformed into an unconstrained optimization problem. A parallel hybrid PSO (PHPSO) algorithm is designed to solve a real case of the one-dimensional heat conduction equation. We therefore implement the PHPSO in CUDA for use on GPUs and verify its performance. The relative error and absolute error of the calculation results for the PHPSO are the same as for the HPSO, and the standard deviation of the PHPSO is much smaller than that of the HPSO. The speedups of PHPSO on two different GPUs are up to an average of 12.12 and 13.06 times, respectively, for different time and spatial steps.

The programs of PPSO, PCGM and PHPSO are implemented on two different GPUs, and the speedups and standard deviations of the three algorithms are displayed in this study. Comparison of the three parallel algorithms shows that the PHPSO is competitive in terms of speedup and standard deviation. The results also show that using PHPSO to solve the one-dimensional heat conduction equation can outperform the two other parallel algorithms as well as the HPSO itself. Thus, the PHPSO is an efficient and effective approach to the heat conduction equation, as it is shown to have strong robustness and high speedup.

Acknowledgements

This paper is partially funded by the National Natural Science Foundation of China (Grant Nos. 61070057, 61133005, 61370095, 61103047, 61202109), the Project Supported by the Scientific Research Fund of Hunan Provincial Education Department (Nos. 13C333, 08D092), and the Project Supported by the Science and Technology Research Foundation of Hunan Province (Grant Nos. 2013GK3082, 2014GK3043).

References

[1] Satake S, Yoshimori H, Suzuki T. Optimizations of a GPU accelerated heat conduction equation by a programming of CUDA Fortran from an analysis of a PTX file. Comput Phys Commun 2012;183(11):2376–85.
[2] Shi L, Chen H, Sun J, Li K. vCUDA: GPU-accelerated high-performance computing in virtual machines. IEEE Trans Comput 2012;61(6):804–16.
[3] Chen H, Shi L, Sun J, Li K, He L. A fast RPC system for virtual machines. IEEE Trans Parallel Distrib Syst 2013;24(7):1267–76.
[4] Liu X, Li M, Li S, Peng S, Liao X, et al. IMGPU: GPU accelerated influence maximization in large-scale social networks. IEEE Trans Parallel Distrib Syst 2014;25(1):136–45.
[5] Rustico E, Bilotta G, Herault A, Del Negro C, Gallo G. Advances in multi-GPU smoothed particle hydrodynamics simulations. IEEE Trans Parallel Distrib Syst 2014;25(1):43–52.
[6] Nelson B, Kirby RM, Haimes R. GPU-based volume visualization from high-order finite element fields. IEEE Trans Visual Comput Graph 2014;20(1):70–83.
[7] S, He M, Shen Z. Optimizing Hopfield neural network for spectral mixture unmixing on GPU platform. IEEE Geosci Remote Sens Lett 2014;11(4):818–22.
[8] Chitchian M, Simonetto A, van Amesfoort A, Keviczky T. Distributed computation particle filters on GPU architectures for real-time control applications. IEEE Trans Control Syst Technol 2013;21(6):2224–38.
[9] Rodriguez-Losada D, San Segundo P, Hernando M, de la Puente P, Valero-Gomez A. GPU-mapping: robotic map building with graphical multiprocessors. IEEE Robot Autom Mag 2013;20(2):40–51.
[10] Richter C, Schops S, Clemens M. GPU acceleration of finite difference schemes used in coupled electromagnetic/thermal field simulations. IEEE Trans Magn 2013;49(5):1649–52.
[11] Fu Z, Lewis TJ, Kirby RM, Whitaker RT. Architecting the finite element method pipeline for the GPU. J Comput Appl Math 2014;257:195–211.
[12] Orchard G, Martin J, Vogelstein R, Etienne-Cummings R. Fast neuromimetic object recognition using FPGA outperforms GPU implementations. IEEE Trans Neural Networks Learn Syst 2013;24(8):1239–52.
[13] Feng Z, Li P. Fast thermal analysis on GPU for 3D ICs with integrated microchannel cooling. IEEE Trans Very Large Scale Integr (VLSI) Syst 2013;21(8):1526–39.
[14] Westphal E, Singh S, Huang C-C, Gompper G, Winkler R. Multiparticle collision dynamics: GPU accelerated particle-based mesoscale hydrodynamic simulations. Comput Phys Commun 2014;185(2):495–503.
[15] Tuttafesta M, D'Angola A, Laricchiuta A, Minelli P, Capitelli M, Colonna G. GPU and multi-core based reaction ensemble Monte Carlo method for non-ideal thermodynamic systems. Comput Phys Commun 2014;185(2):540–9.
[16] Jie L, Li K, Shi L, Liu R, Mei J. Accelerating solidification process simulation for large-sized system of liquid metal atoms using GPU with CUDA. J Comput Phys 2014;257, Part A:521–35.
[17] Ye Y, Li K. Entropic lattice Boltzmann method based high Reynolds number flow simulation using CUDA on GPU. Comput Fluids 2013;88:241–9.
[18] Leskinen J, Periaux J. Distributed evolutionary optimization using Nash games and GPUs – applications to CFD design problems. Comput Fluids 2013;80:190–201.
[19] Hashimoto T, Tanno I, Tanaka Y, Morinishi K, Satofuka N. Simulation of doubly periodic shear layers using kinetically reduced local Navier–Stokes equations on a GPU. Comput Fluids 2013;88:715–8.
[20] Hu Q, Gumerov NA, Duraiswami R. GPU accelerated fast multipole methods for vortex particle simulation. Comput Fluids 2013;88:857–65.
[21] Qi H-T, Xu H-Y, Guo X-W. The Cattaneo-type time fractional heat conduction equation for laser heating. Comput Math Appl 2013;66(5):824–31.
[22] Hosseini S, Shahmorad S, Masoumi H. Extension of the operational tau method for solving 1-d nonlinear transient heat conduction equations. J King Saud Univ – Sci 2013;25(4):283–8.
[23] Yılmazer A, Kocar C. Exact solution of the heat conduction equation in eccentric spherical annuli. Int J Therm Sci 2013;68:158–72.
[24] Liu J, Zheng Z. A dimension by dimension splitting immersed interface method for heat conduction equation with interfaces. J Comput Appl Math 2014;261:221–31.
[25] Liu L-B, Liu H-W, Chen Y. Polynomial spline approach for solving second-order boundary-value problems with Neumann conditions. Appl Math Comput 2011;217(16):6872–82.
[26] Liu L-B, Liu H-W. A new fourth-order difference scheme for solving an n-carrier system with Neumann boundary conditions. Int J Comput Math 2011;88(16):3553–64.
[27] Cao H-H, Liu L-B, Y, Fu S-m. A fourth-order method of the convection–diffusion equations with Neumann boundary conditions. Appl Math Comput 2011;217(22):9133–41.
[28] Bougoffa L, Rach RC, Wazwaz A-M. Solving nonlocal initial-boundary value problems for the Lotka–von Foerster model. Appl Math Comput 2013;225:7–15.
[29] Bougoffa L, Rach RC. Solving nonlocal initial-boundary value problems for linear and nonlinear parabolic and hyperbolic partial differential equations by the Adomian decomposition method. Appl Math Comput 2013;225:50–61.
[30] Jankowski T. Nonnegative solutions to nonlocal boundary value problems for systems of second-order differential equations dependent on the first-order derivatives. Nonlinear Anal: Theory Methods Appl 2013;87:83–101.
[31] Asanova AT, Dzhumabaev DS. Well-posedness of nonlocal boundary value problems with integral condition for the system of hyperbolic equations. J Math Anal Appl 2013;402(1):167–78.
[32] Dehghan M. The one-dimensional heat equation subject to a boundary integral specification. Chaos Solitons Fract 2007;32(2):661–75.
[33] Caglar H, Ozer M, Caglar N. The numerical solution of the one-dimensional heat equation by using third degree B-spline functions. Chaos Solitons Fract 2008;38(4):1197–201.
[34] Liu L-B, Liu H-W. Quartic spline methods for solving one-dimensional telegraphic equations. Appl Math Comput 2010;216(3):951–8.
[35] Tsourkas P, Rubinsky B. A parallel genetic algorithm for heat conduction problems. Numer Heat Transfer Part B 2005;47(2):97–110.
[36] Gosselin L, Tye-Gingras M, Mathieu-Potvin F. Review of utilization of genetic algorithms in heat transfer problems. Int J Heat Mass Transfer 2009;52(9-10):2169–88.
[37] He L, Zou D, Zhang Z, Chen C, Jin H, Jarvis SA. Developing resource consolidation frameworks for moldable virtual machines in clouds. Future Gener Comput Syst 2014;32:69–81.
[38] Liu H-W, Liu L-B, Chen Y. A semi-discretization method based on quartic splines for solving one-space-dimensional hyperbolic equations. Appl Math Comput 2009;210(2):508–14.
[39] Kennedy J, Eberhart R. Particle swarm optimization. In: Proceedings of IEEE international conference on neural networks, vol. 4. Piscataway, NJ, USA: IEEE Press; 1995. p. 1942–8.
[40] Ouyang A, Tang Z, Li K, Sallam A, Sha E. Estimating parameters of Muskingum model using an adaptive hybrid PSO algorithm. Int J Pattern Recognit Artif Intell 2014;28(1):1–29.
[41] Mussi L, Daolio F, Cagnoni S. Evaluation of parallel particle swarm optimization algorithms within the CUDA architecture. Inform Sci 2011;181(20):4642–57.
[42] Cannon JR, Esteva SP, Hoek JVD. A Galerkin procedure for the diffusion equation subject to the specification of mass. SIAM J Numer Anal 1987;24(3):499–515.
[43] Cannon JR, Matheson AL. A numerical procedure for diffusion subject to the specification of mass. Int J Eng Sci 1993;31(3):347–55.
[44] Ewing RE, Lin T. A class of parameter estimation techniques for fluid flow in porous media. Adv Water Resour 1991;14(2):89–97.
[45] Wang Y, Liu L, Wu Y. Positive solutions for singular semipositone boundary value problems on infinite intervals. Appl Math Comput 2014;227:256–73.
[46] Zhu W. Nonlinear optimization with a massively parallel evolution strategy–pattern search algorithm on graphics hardware. Appl Soft Comput 2011;11(2):1770–81.