Computational Mechanics manuscript No. (will be inserted by the editor)

Scalable parallel implementation of CISAMR: A non-iterative mesh generation algorithm

Bowen Liang · Anand Nagarajan · Soheil Soghrati

Received: date / Accepted: date

Abstract We present the parallel implementation of a non-iterative mesh generation algorithm, named Conforming to Interface Structured Adaptive Mesh Refinement (CISAMR). The partitioning phase is tightly integrated with a microstructure reconstruction algorithm to determine the optimized arrangement of partitions based on the shapes/sizes of particles. Each processor then creates a structured sub-mesh with one layer of ghost elements for its designated partition. The h-adaptivity and r-adaptivity phases of the CISAMR algorithm are also carried out independently in each sub-mesh. Processors then communicate to merge mesh/hanging nodes along the faces shared between neighboring partitions. The final mesh is constructed by performing the face-swapping and element subdivision phases, after which a minimal communication phase is required in 3D CISAMR to merge new nodes created on partition boundaries. Several example problems, together with scalability tests demonstrating a super-linear speedup, are provided to show the application of parallel CISAMR for generating massive conforming meshes.

Keywords Parallel mesh generation, Finite element, Scalability, CISAMR, Heterogeneous materials

Bowen Liang
Department of Mechanical and Aerospace Engineering
The Ohio State University
201 West 19th Avenue, Columbus, OH

Anand Nagarajan
Department of Mechanical and Aerospace Engineering
The Ohio State University
201 West 19th Avenue, Columbus, OH

Soheil Soghrati (corresponding author)
Department of Mechanical and Aerospace Engineering
Department of Materials Science and Engineering
The Ohio State University
201 W. 19th Avenue, Columbus, OH
E-mail: [email protected]

1 Introduction

The availability of high-performance parallel computing resources has enabled scientists and engineers to simulate a variety of physical phenomena using large-scale models with unprecedented geometrical details [1,2]. To perform such computationally expensive simulations using the finite element method (FEM), the typical workflow begins with constructing the geometrical model and discretizing it into an appropriate conforming mesh on a workstation [2]. Next, mesh files are transferred to a supercomputer to perform mesh partitioning [3] and subsequently approximate the governing partial differential equations (PDEs) using a parallel solver [4]. However, as such simulations grow beyond tens of millions of degrees of freedom (DOFs), the initial phase of constructing a massive unstructured mesh could alone be a highly computationally demanding task that involves solving billions of nonlinear geometrical equations. Besides the high computational cost associated with this process, it may not even be feasible to generate such meshes sequentially due to memory limitations [5]. Therefore, significant research effort [6,7,8] has been dedicated to developing parallel mesh generation algorithms that satisfy four major objectives [8,9,10]: (i) stability, to enable constructing high-quality meshes for a variety of geometrical models; (ii) efficient domain decomposition, to achieve load balancing and reduce the pre-processing overhead; (iii) code re-use, to allow utilizing a pre-existing optimized sequential mesh generation code; and (iv) scalability, to achieve a linear speedup for large-scale problems.

In order to create massive meshes while satisfying the criteria enumerated above, parallel mesh generation algorithms often have a sequential pre-processing phase for partitioning the entire domain into optimized sub-regions [2]. Two objectives of this partitioning phase are to achieve balanced load measures for all processors and to minimize the total area of shared interfaces between them [11,12]. Several domain decomposition techniques have been introduced, which can be categorized into discrete and continuous methods [10]. The former begins with generating an initial coarse background mesh that conforms to domain boundaries using a sequential mesh generator. Graph partitioning algorithms [13,14,15] are then employed to decompose vertices of the corresponding graph structure into a number of similarly sized partitions while simultaneously minimizing the number of connecting edges. Popular open-source graph partitioning libraries such as Metis/Parallel Metis [3] and Chaco [16] can be used for this purpose. In continuous domain decomposition methods, on the other hand, the original domain is directly partitioned using quadtree/octree [11,12] or medial axis [17] techniques. In addition to avoiding the overhead associated with creating an initial background mesh and the corresponding graph data structure, continuous methods enable better code re-use by applying the sequential meshing code to each partition independently [18]. However, these advantages come at the price of generating polyhedral surfaces, which can deteriorate the convergence rate of the meshing algorithm, as well as the quality of the resulting mesh [10].

After partitioning the domain, several robust algorithms can be utilized to build a massive conforming mesh in parallel, among which we can mention Delaunay based techniques [17], advancing front methods [19,20], and edge subdivision methods [21]. In the Delaunay triangulation algorithm, a nonlinear system of equations is iteratively solved and new mesh nodes are created/relocated until a set of constraints on the mesh quality and element size is satisfied [22,23]. A parallel 3D implementation of this algorithm is introduced in [17], which achieves a linear speedup using data-parallel architectures and by expanding open faces via a bucketing technique. By combining boundary merging with an incremental construction algorithm, the independent parallel near (IPNDT) method [24] allows efficient mesh generation with a low overhead. Two fully decoupled parallel Delaunay methods are introduced in [18,25], which utilize a continuous domain decomposition approach, together with a pre-processing interface zone for mesh generation. The parallel sparse Voronoi refinement (SVR) algorithm [7] yields a near-linear speedup and high performance on shared memory machines.

In the advancing front method, new elements are progressively inserted to discretize the entire domain by applying certain geometrical constraints on active fronts [26]. A parallel version of this algorithm is introduced in [19], where all interior sub-regions are first meshed independently and then a buffer zone is utilized to synchronize corresponding meshes along shared interfaces of neighboring partitions. Using a discrete domain decomposition approach and surface mesh generation as pre-processing phases, a parallel advancing front algorithm is presented in [5] that generates the mesh for each sub-region independently. In the last step of this method, an iterative smoothing algorithm is applied to reconstruct elements near shared interfaces to improve the mesh quality. In edge subdivision based algorithms [27,28], a triangle (2D) or a tetrahedron (3D) is bisected by its longest-edge midpoint and vertices to generate a conforming mesh. In their parallel versions introduced in [29,30], a number of subdivision templates are employed to decouple the refinement process on different processors and minimize the communication cost. The terminal-edge bisection method [8] is an inherently decoupled algorithm and thus scalable in the parallel implementation. A critical review of advantages and limitations of different parallel mesh generation algorithms is provided in [10].

The majority of existing algorithms for parallel mesh generation [5,20,31], including those discussed above, require an iterative smoothing/optimization phase involving edge/face swapping [32], removal of bad elements, or Laplacian smoothing [33] to improve the quality of elements. This iterative phase could be time-consuming and in some cases difficult to converge due to the geometrical complexity, especially for the construction of massive meshes [10]. This often leads to a tightly coupled problem near partition interfaces, which could cause slow convergence, excessive communication cost, and even inability to build the final mesh. Recently, Soghrati et al. [34,35] have introduced the Conforming to Interface Structured Adaptive Mesh Refinement (CISAMR) algorithm, which enables the non-iterative construction of high-quality conforming meshes. In this approach, an initial structured mesh is transformed into a conforming mesh with proper element shapes and aspect ratios in four consecutive steps: (i) h-adaptive refinement in the vicinity of material interfaces; (ii) r-adaptivity to relocate selected nodes; (iii) non-iterative face-swapping to eliminate highly distorted tetrahedrons; and (iv) subdivision of remaining nonconforming elements. A more comprehensive overview of the sequential CISAMR algorithm is provided in Section 2.
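The four consecutive steps above can be summarized in a short driver sketch. The phase functions below are hypothetical placeholders for the operations described in [34,35], not the authors' implementation; the sketch only emphasizes that each phase runs exactly once, in a fixed order, with no convergence loop:

```python
# Hedged sketch of the non-iterative CISAMR pipeline: every phase is a
# hypothetical stand-in that tags the mesh, to show there is no iteration.

def h_adaptive_refinement(mesh, interfaces):
    # (i) refine background elements near material interfaces (SAMR)
    return mesh + ["samr"]

def r_adaptivity(mesh, interfaces):
    # (ii) snap selected nodes onto the material interfaces
    return mesh + ["r-adapt"]

def face_swapping(mesh):
    # (iii) 3D only: eliminate sliver/cap tetrahedrons by swapping faces
    return mesh + ["face-swap"]

def subdivide_nonconforming(mesh, interfaces):
    # (iv) cut remaining nonconforming elements / elements with hanging nodes
    return mesh + ["subdivide"]

def cisamr(background_mesh, interfaces, dim=3):
    mesh = h_adaptive_refinement(background_mesh, interfaces)
    mesh = r_adaptivity(mesh, interfaces)
    if dim == 3:
        mesh = face_swapping(mesh)
    return subdivide_nonconforming(mesh, interfaces)

print(cisamr([], interfaces=None, dim=3))
# -> ['samr', 'r-adapt', 'face-swap', 'subdivide']
```

Because no phase feeds back into an earlier one, the cost of each phase can be estimated independently, which is what makes the decoupled parallel implementation described later possible.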

In this manuscript, we present a scalable parallel implementation of CISAMR and demonstrate its application for the construction of massive conforming meshes for materials with intricate microstructures. In order to eliminate the overhead associated with building the background mesh and a graph structure, a new quadtree/octree-based continuous domain partitioning algorithm is introduced, which yields load-balanced sub-domains. Each processor independently creates an initial structured mesh for its designated sub-region with one layer of ghost elements, which significantly reduces the coupling (and thus communication cost) between neighboring partitions. The h-adaptive and r-adaptive refinement phases of CISAMR are then carried out for each partition independently, i.e., without any communication between processors. Communication only happens after the completion of these phases, to merge resulting mesh nodes and hanging nodes along partition interfaces. The subsequent face-swapping (only needed in 3D) and subdividing phases, as well as the removal of ghost elements, also occur independently for each partition to construct the final conforming mesh. In the 3D implementation, an additional minor communication phase is required between processors after subdividing elements to merge new nodes generated at intersection points of element edges and material interfaces.

The remainder of this manuscript is structured as follows: In Section 2 we provide a brief overview of the sequential CISAMR algorithm. We then introduce a new partitioning algorithm in Section 3, which is tightly integrated with a virtual microstructure reconstruction algorithm for the parallelization of this method. The parallel CISAMR algorithm is described in Section 4, where we also discuss implementation details such as imposing the periodic boundary condition (PBC) and provide proper pseudo-codes. Several 2D and 3D example problems, each of which is accompanied by scalability tests, are presented in Section 5. Final concluding remarks are summarized in Section 6.

2 Sequential CISAMR algorithm

For completeness, in this section we provide a brief overview of the (sequential) CISAMR algorithm [34,35], which is the basis of the parallel mesh generation algorithm presented in this manuscript. This non-iterative algorithm transforms a structured background mesh composed of either quadrilateral elements in 2D or tetrahedral elements in 3D into an adaptively-refined conforming mesh. As schematically shown in Figure 1, this process involves the use of customized versions of h-adaptivity, r-adaptivity, face-swapping, and element subdividing algorithms. Each step is described in more detail next.

◦ h-adaptive refinement: In order to reduce the geometric discretization error and, more importantly, to obtain a more accurate approximation of gradients (e.g., stress concentrations), a Structured Adaptive Mesh Refinement (SAMR) algorithm is applied to background elements in the vicinity of material interfaces. Given the fact that in 3D CISAMR each cubic block of the background mesh is composed of four orthocentric and one regular tetrahedrons, different subdivision patterns are applied to each type of tetrahedron in this recursive SAMR algorithm to maintain the same arrangement of elements after refinement (see [35] for details). As shown in Figure 1b, in addition to elements cut by material interfaces, a selected number of their neighboring elements are also subjected to refinement to satisfy a set of constraints associated with this phase. For example, no background element can be cut by more than one interface or hold more than one hanging node on any of its edges [35]. Besides satisfying these constraints, an a priori number of refinement levels is assigned to each interface based on its curvature or the accuracy needed in the recovery of gradients. Note that during the SAMR phase, each background tetrahedron is subjected to refinement when either it is cut by an interface or requested by its neighboring element (which intersects the interface) to satisfy the aforementioned constraints. We use this feature in the parallel implementation of CISAMR to minimize the coupling between neighboring partitions, as background elements cut by the interface only need to communicate with their neighboring tetrahedrons during the SAMR phase.

◦ r-adaptivity: Next, an r-adaptivity algorithm is applied to nodes of orthocentric tetrahedrons intersecting material interfaces, during which they either maintain their location or snap to the interface along one of the edges of such elements. A unique feature of this non-iterative relocation pattern is that the new position of a node only depends on its distance to the interface. During the parallel implementation, this means that the r-adaptivity phase can be applied to each node independently in each partition, i.e., without any communication between processors. As illustrated in Figure 1c, after the completion of this phase, ≈ 50% of the elements originally cut by the interface are transformed into conforming tetrahedrons (remaining nonconforming tetrahedrons are shown in black).

◦ Face-swapping (3D): Certain relocation patterns during the r-adaptivity phase of 3D CISAMR could lead to the formation of sliver or cap shaped tetrahedrons with exceedingly high aspect ratios. Clearly, the presence of such poor-quality elements in the final con-


Fig. 1: Process of constructing a conforming mesh using CISAMR: (a) initial structured background mesh and morphologies of two embedded particles; (b) adaptively refined mesh after performing SAMR along material interfaces; (c) deformed mesh after performing r-adaptivity and face-swapping; (d) final mesh after the completion of the sub-tetrahedralization phase

forming mesh could be detrimental to the accuracy of the corresponding FE approximation, in particular during the recovery of gradients along material interfaces. Given the fixed arrangement of elements in the initial structured mesh, sliver/cap-shaped elements can easily be eliminated via a non-iterative face-swapping algorithm. For this purpose, one must only visit each cubic block (composed of five tetrahedral elements) that includes a sliver or cap shaped tetrahedron, perform the face-swapping, and introduce an internal cut to create seven new tetrahedrons, all with proper aspect ratios (see [35] for more details). This algorithm is designed such that it does not disturb the conformity of element edges in the face-swapped block with those of neighboring blocks. It must be noted that sliver-shaped tetrahedrons always appear as a pair of two elements in two neighboring blocks. Thereby, element connectivities are only affected in pairs of neighboring blocks originally containing such poor-quality elements after face-swapping, which is another key feature of CISAMR that minimizes the communication cost in the parallel implementation.
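A standard way to flag such sliver/cap elements is a tetrahedron aspect-ratio measure. The sketch below is an illustrative quality metric, not necessarily the exact criterion used in CISAMR: it normalizes the longest edge by the inradius so that a regular tetrahedron scores exactly 1 and degenerate shapes score much higher.

```python
import math

def tet_aspect_ratio(p0, p1, p2, p3):
    """Longest edge / (2*sqrt(6) * inradius); equals 1 for a regular
    tetrahedron and grows without bound for sliver/cap shaped elements."""
    pts = [p0, p1, p2, p3]
    def sub(a, b):   return [a[i] - b[i] for i in range(3)]
    def cross(a, b): return [a[1]*b[2] - a[2]*b[1],
                             a[2]*b[0] - a[0]*b[2],
                             a[0]*b[1] - a[1]*b[0]]
    def dot(a, b):   return sum(x * y for x, y in zip(a, b))
    def norm(a):     return math.sqrt(dot(a, a))
    l_max = max(norm(sub(pts[i], pts[j]))
                for i in range(4) for j in range(i + 1, 4))
    # volume from the scalar triple product, total area from the 4 faces
    volume = abs(dot(sub(p1, p0), cross(sub(p2, p0), sub(p3, p0)))) / 6.0
    faces = [(0, 1, 2), (0, 1, 3), (0, 2, 3), (1, 2, 3)]
    area = sum(0.5 * norm(cross(sub(pts[b], pts[a]), sub(pts[c], pts[a])))
               for a, b, c in faces)
    r_in = 3.0 * volume / area          # inradius of the tetrahedron
    return l_max / (2.0 * math.sqrt(6.0) * r_in)

def is_sliver_or_cap(p0, p1, p2, p3, limit=5.0):
    # CISAMR guarantees final aspect ratios <= 5 in 3D [34,35]; elements
    # exceeding such a limit are candidates for face-swapping.
    return tet_aspect_ratio(p0, p1, p2, p3) > limit

print(round(tet_aspect_ratio((1, 1, 1), (1, -1, -1),
                             (-1, 1, -1), (-1, -1, 1)), 6))  # -> 1.0
print(is_sliver_or_cap((0, 0, 0), (1, 0, 0),
                       (0, 1, 0), (0.5, 0.5, 0.01)))         # -> True
```

The threshold of 5 mirrors the 3D aspect-ratio bound quoted for the final CISAMR mesh; any comparable shape-quality measure would serve the same screening purpose.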

◦ Subdividing nonconforming elements: Finally, as shown in Figure 1d, all remaining nonconforming elements (cf. black tetrahedrons in Figure 1c) and elements with hanging nodes (generated during the SAMR phase) are subdivided to create the final conforming mesh. As discussed in [34,35], specific patterns are used for cutting elements during this phase to ensure the lowest possible aspect ratios for the resulting sub-elements after subdivision. In 3D, edge angles of tetrahedron faces and a scalar invariant function based on nodal coordinates are utilized for selecting a cut on such faces to automatically guarantee the conformity of cuts between neighboring elements without additional bookkeeping. This allows performing the subdivision phase by visiting each element independently, which is crucial to minimizing its communication cost in the parallel implementation. It must be noted that CISAMR guarantees that element aspect ratios of the final conforming mesh do not exceed 3 in 2D and 5 in 3D.

3 Partitioning of heterogeneous domains

In this section, we present a new partitioning scheme for parallel CISAMR, which is specifically tuned for meshing materials with heterogeneous microstructures. This algorithm is tightly integrated with a virtual microstructure reconstruction technique recently introduced by the authors in [36]. In this method, a set of hierarchical bounding boxes (BBoxes) is utilized to virtually pack arbitrary-shaped particles in the domain, followed by an optimization phase involving the selective relocation and/or elimination of particles to simulate target statistical microstructure descriptors (e.g., volume fraction and spatial arrangement of inclusions).

In the proposed partitioning scheme, we reutilize the BBox representation of particles (already used in the virtual packing phase of the reconstruction algorithm) to determine the optimized arrangement/size of partitions by balancing their load measures. As schematically shown in Figure 2, three levels of BBoxes are assigned to each particle during the packing process to approximate its morphology. Intersections between BBoxes of a new particle and those of existing particles determine whether it can be added to the microstructure such that none of the particles overlap with one another. To minimize the computational cost, this process begins with checking for intersections using enlarged BBoxes (cf. Figure 2a) to quickly identify the relative locations of existing particles surrounding a new particle. Checking overlaps between primary BBoxes (cf. Figure 2b) and then secondary BBoxes (cf. Figure 2c) is done in the next steps to more accurately determine if the new and existing particles overlap. For more details regarding this virtual packing algorithm refer to [36].

After the virtual reconstruction of the microstructure, a quadtree/octree-based continuous domain over-decomposition algorithm is employed to initiate the domain partitioning for parallel CISAMR mesh generation. The objective is to achieve similar computational costs (load balancing) for each partition during the mesh generation process while minimizing the areas of shared interfaces between them. This can be achieved by solving a multi-objective optimization problem for the domain decomposition, which is defined as [10]: Given a continuous domain Ω = (Ω, ∂Ω), partition Ω into n subsets Ω_1, Ω_2, ..., Ω_n such that

    minimize  { max(l(Ω_i)),  Σ_i ‖∂Ω_i‖ }
    subject to  ∪_i Ω_i = Ω  and  Ω_i ∩ Ω_j = ∅ for i ≠ j.    (1)

In the equation above, the maximum load measure function, max(l(Ω_i)), and the total area of shared partition faces, Σ_i ‖∂Ω_i‖, for all partitions Ω_i are simultaneously minimized. While (1) is a rather straightforward optimization problem, it requires an a priori assessment of the load measure function l(Ω_i) associated with each partition, which could be a challenging task. In CISAMR, conforming elements can be generated for each embedded particle in a heterogeneous microstructure rather independently. Note that refined background elements and hanging nodes caused by a particle could affect the mesh structure in the vicinity of its neighboring particles. Therefore, we must check for such pre-existing refinement effects when performing SAMR for each particle. Although this does not considerably affect the computational time associated with the mesh generation for each individual particle, the search in the background mesh to identify pre-refined elements and existing hanging nodes would be a computationally demanding process.

As discussed in Section 2, the 3D CISAMR algorithm relies on four major phases for transforming background elements into conforming elements along particle interfaces: SAMR, r-adaptivity, face-swapping, and element subdivision. After studying the computational cost associated with each step, we determined that the number of SAMR levels has the most prominent impact on the computational cost of CISAMR in each partition. Note that face-swapping is only needed in a small fraction of elements along material interfaces during the execution of CISAMR; it can thus be overlooked while estimating the load measure function. Perform-


Fig. 2: Schematic process of the BBox-based packing algorithm: (a) creating the hierarchical BBox representation of a particle; (b) checking intersections between the enlarged BBox of a new particle and primary BBoxes of existing particles; (c) checking intersections between secondary BBoxes of new and existing particles to more accurately determine if they overlap

ing the r-adaptivity and element subdivision phases requires visiting all elements cut by the interface; thus their computational cost is linearly proportional to the surface area of each particle. The cost of performing SAMR, on the other hand, does not change linearly with increasing the number of refinement levels, as not only a larger number of elements are affected during this process, but also a number of data structures must be used to keep track of elements being refined, as well as hanging nodes formed on edges of their neighboring elements.

Without going further into implementation details for the SAMR phase (see [35] for a more in-depth discussion), we define the load measure function l(Ω_i) in sub-region Ω_i that includes N particles as

    l(Ω_i) = Σ_{k=1}^{N} R_k^M ‖A_k‖,    (2)

where R_k and ‖A_k‖ are the number of refinement levels and the surface area of particle k, respectively. Also, M determines how increasing the number of refinement levels affects the load measure, which is a function of the number of elements visited for each level of refinement and the numerical cost associated with insertion/search in the data containers used during the SAMR phase (details will be discussed in Section 4.2). Our numerical study revealed that while M = 1 is an appropriate value in 2D, M = 1.5 better reflects the increase in the load measure caused by increasing the number of SAMR levels in 3D.

The proposed quadtree/octree-based continuum domain over-decomposition algorithm relies on solving the multi-objective optimization problem given in (1) to determine the final size and arrangement of partitions. As schematically shown in Figure 3a, as the first step, we reutilize the primary and secondary BBoxes of each particle to, respectively, determine its location with respect to the underlying sub-regions and approximate its surface area. Note that secondary BBoxes can be considered a pixelated (2D) or voxelated (3D) representation of the actual morphology of an inclusion; thus the sum of their diagonal distances is a proper approximation of the surface area. A continuous domain over-decomposition is then carried out by subdividing the domain into m equally sized small rectangular/cuboid shaped blocks Ω_j such that m > n, where n is the targeted number of partitions. In this process, a quadtree (2D) or an octree (3D) data structure is used to iteratively refine the domain into smaller sub-regions and store embedded particles in leaves of the corresponding tree.

The final n partitions are created by solving (1), which first requires evaluating the load measure function l(Ω_j) for all m rectangular/cuboid sub-regions generated after over-decomposing the domain. The process begins by estimating l(Ω_j) based on refinement levels R_k and surface areas ‖A_k‖ of particles using (2). Solving (1) is then a computationally inexpensive task, which reduces the number of sub-regions from m to n by merging adjacent pairs with the least load measure. To better elucidate this process, an example is shown in Figure 3b, where four initial equally-sized rectangular sub-regions (m = 4) are reduced to three (n = 3) by merging the two sub-regions on the right. This optimized arrangement of partitions is achieved assuming the same number of refinement levels across all material interfaces, meaning that the load measures estimated for the initial sub-regions merely depend on the sum of surface areas of particles confined in them.
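Under these definitions, the load estimate (2) and the greedy pairwise merging that reduces the m sub-regions to n partitions can be sketched as follows. The particle tuples and the 1D strip adjacency model are illustrative simplifications of the quadtree/octree neighbor structure, not the authors' data layout:

```python
def load_measure(particles, M=1.5):
    """Eq. (2): l = sum_k R_k^M * ||A_k|| over the particles of a
    sub-region, where R_k is the number of SAMR levels and ||A_k|| the
    surface area of particle k. M = 1 in 2D and M = 1.5 in 3D per the
    numerical study described in the text."""
    return sum((R ** M) * A for R, A in particles)

def merge_subregions(subregions, n, M=1.5):
    """Greedily merge the adjacent pair with the least combined load
    until only n partitions remain. For illustration, sub-regions are
    modeled as a 1D strip, so only index-adjacent pairs may merge."""
    regions = [list(p) for p in subregions]
    while len(regions) > n:
        loads = [load_measure(r, M) for r in regions]
        i = min(range(len(regions) - 1),
                key=lambda k: loads[k] + loads[k + 1])
        regions[i:i + 2] = [regions[i] + regions[i + 1]]
    return regions

# Figure 3b style example: four equal sub-regions with uniform refinement
# (R = 1, so M has no effect); the two lightest neighbors merge to n = 3.
subs = [[(1, 8.0)], [(1, 6.0)], [(1, 2.0)], [(1, 3.0)]]
parts = merge_subregions(subs, n=3, M=1.0)
print([load_measure(p, 1.0) for p in parts])   # -> [8.0, 6.0, 5.0]
```

Raising the refinement level of the particles in one sub-region (larger R_k, amplified by M) shifts which pair has the least combined load, which is exactly the effect illustrated by Figures 3b versus 3c.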


Fig. 3: (a) Using primary and secondary BBoxes of particles to assign particles to each sub-region of an over-decomposed domain and evaluating corresponding load measure functions; (b) optimized arrangement of partitions after merging the two sub-regions on the right, considering the same level of SAMR for all particles; (c) optimized arrangement of partitions after merging the two sub-regions on the left, considering 3 and 1 levels of SAMR for the blue and gray particles, respectively

Algorithm 1 (Quadtree/octree based domain partitioning)

function partition(Ω, n, Sp, Sr)
    m, Ωj ← over_decompose(Ω, n)              ▷ quadtree/octree over-decomposition of domain Ω into m sub-regions Ωj (m > n)
    SBBox ← get_BBox(Sp)                      ▷ create a set of BBoxes representing morphologies of particles in set Sp
    Spj ← map_particle(SBBox, Sp, Ωj)         ▷ map each particle in Sp to each sub-region Ωj overlapping with it
    Lj ← load_measure(Ωj, Spj, Sr)            ▷ use secondary BBoxes of particles Spj overlapping with Ωj and their
                                              ▷ corresponding SAMR levels from Sr to estimate the load measure
    while (m > n) do                          ▷ while the number of sub-regions m exceeds the target number of partitions n
        Ωa, Ωb ← subregion_locator(Lj, Ωj)    ▷ find two neighboring sub-regions with the minimum combined load measure
        Ωj, m ← update_subregions(Ωa, Ωb, Ωj) ▷ update the arrangement and number of sub-regions by merging Ωa and Ωb
    end while
    rankj ← assign_rank(Ωj)                   ▷ assign a rank (processor) to each of the final partitions Ωj
    neighborj ← neighbor_locator(Ωj)          ▷ locate neighboring partitions of each partition and their shared interfaces
    output(Ωj, rankj, neighborj, Spj)         ▷ output size, location, and rank of each partition, its neighbors, and confined particles
end function
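The map_particle step of Algorithm 1 reduces to axis-aligned bounding-box intersection tests between particle BBoxes and the rectangular sub-regions. The following minimal 2D sketch uses our own helper names and box representation, not the authors' implementation:

```python
def boxes_overlap(a, b):
    """Axis-aligned overlap test; a box is ((xmin, ymin), (xmax, ymax))."""
    (ax0, ay0), (ax1, ay1) = a
    (bx0, by0), (bx1, by1) = b
    return ax0 <= bx1 and bx0 <= ax1 and ay0 <= by1 and by0 <= ay1

def map_particles(particle_bboxes, subregions):
    """Assign each particle to every sub-region its BBox overlaps; a
    particle straddling a shared interface maps to several sub-regions,
    which is why its refinement cost is charged to each of them."""
    mapping = {j: [] for j in range(len(subregions))}
    for p, bbox in enumerate(particle_bboxes):
        for j, sub in enumerate(subregions):
            if boxes_overlap(bbox, sub):
                mapping[j].append(p)
    return mapping

# Two unit sub-regions side by side; particle 0 straddles their interface.
subs = [((0.0, 0.0), (1.0, 1.0)), ((1.0, 0.0), (2.0, 1.0))]
parts = [((0.8, 0.2), (1.2, 0.6)), ((1.5, 0.5), (1.9, 0.9))]
print(map_particles(parts, subs))   # -> {0: [0], 1: [0, 1]}
```

The same test, applied first to the cheap enlarged BBoxes and only then to the primary and secondary BBoxes, gives the coarse-to-fine screening described for the packing phase.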

A different case of determining the optimized partitions based on the same 2 × 2 arrangement of initial sub-regions is depicted in Figure 3c, where this time the two sub-regions on the left are merged. While the underlying microstructures in Figures 3b and 3c are identical, the difference is that in the latter, the number of refinement levels assigned to the smaller particles (shown in blue) is three times that assigned to the two larger particles. According to (2), the number of refinement levels affects the load measure in each block, which in turn affects the final arrangement of partitions after performing the optimization. The pseudo-code for the partitioning algorithm presented in this section is provided in Algorithm 1.

Figure 4 shows the virtually reconstructed microstructural model of a particulate composite partitioned into three different partition arrangements (n = 20, 40, and 100) using the algorithm presented above. Note that while the optimization phase of the partitioning algorithm allows the formation of non-rectangular (non-cuboid in 3D) partitions by merging multiple neighboring sub-regions of the over-decomposed domain, here all final partitions are rectangular shaped. This constraint does not allow a perfect load balance between all partitions, but has deliberately been imposed due to three key advantages: (i) reducing the computational cost associated with the partitioning phase, which must be carried out sequentially; (ii) facilitating the parallel construction of the initial structured background mesh for each rectangular/cuboid shaped partition by all processors; and (iii) minimizing the communication cost between neighboring processors during the parallel CISAMR mesh generation process. The last item, which is the key benefit of maintaining rectangular/cuboid shaped parti-

(a) 20 partitions (b) 40 partitions (c) 100 partitions

Fig. 4: Using the proposed partitioning algorithm to subdivide the 2D microstructural model of a particulate composite into various numbers of partitions

tions, is partially achieved by minimizing the lengths/areas of shared interfaces between two neighboring partitions. However, the principal reason becomes evident in Section 4, where we discuss the computational cost associated with merging mesh nodes and mapping hanging nodes along shared interfaces. These advantages outweigh the tradeoff of less balanced partitions, resulting in a lower overall computational time for the parallel construction of massive conforming meshes using CISAMR.

4 Parallel CISAMR algorithm

In this section, we introduce the parallel CISAMR algorithm for generating massive 2D and 3D conforming meshes. To facilitate the discussion, the meshing process is schematically shown in Figure 5 for a simplified 2D microstructure. A detailed explanation of each step of the proposed parallel algorithm is presented next, which allows nearly 100% sequential code re-use with a minimized communication cost between processors.

4.1 Construction of background sub-meshes

The parallel implementation of CISAMR begins with assigning each sub-region of the partitioned domain to a processor, as shown in Figure 5a. For meshing a virtually reconstructed heterogeneous microstructure, each processor separately reads an input file characterizing the morphologies of inclusions confined within or cutting the boundaries of its corresponding partition. Each processor is then responsible for independently creating a structured background mesh with the designated element size for its partition. The rectangular/cuboid constraint imposed on the shape of partitions allows using a simple algorithm for constructing a structured mesh composed of rectangular (2D) or tetrahedral (3D) elements as the starting point in the CISAMR algorithm (cf. Figure 5b). This scalable process is carried out by each processor independently, meaning no communication is needed between them at this stage. It must be noted that, unlike the majority of existing parallel meshing algorithms, both tasks described here do not involve sequentially reading any input file (geometrical model or background mesh), which considerably reduces the overall computational time.

As shown in Figure 5b, the structured sub-mesh constructed for each partition includes a layer of ghost elements along its shared interfaces with adjacent partitions. At the end of the mesh generation process, this additional layer of ghost elements, which extends beyond the dimensions of each partition and overlaps with neighboring partitions, must be eliminated. Although ghost elements may seem redundant and indeed slightly increase the computational cost associated with mesh generation in each processor, they significantly reduce the communication time (≈ 50%) between processors in parallel CISAMR. These advantages will become more evident as we describe subsequent steps of this parallel mesh generation algorithm, including the SAMR and r-adaptivity phases discussed in the following Section 4.2.

4.2 SAMR and r-adaptivity phases

Owing to the presence of ghost elements in background sub-meshes, both the SAMR and r-adaptivity phases of the CISAMR algorithm can be executed independently in each partition (cf. Figure 5c). First, the sequen-

[Figure 5, panels (a)-(d): Partitions #1-#3, ghost layers, and a one-sided hanging node]
Fig. 5: Parallel CISAMR mesh generation process: (a) subdividing the domain into 3 partitions; (b) parallel construction of initial structured sub-meshes with ghost elements for each partition; (c) performing SAMR and r-adaptivity phases, followed by merging/mapping mesh and hanging nodes on shared interfaces of partitions through communication between processors; (d) eliminating ghost elements and subdividing remaining nonconforming elements and elements with hanging nodes to build the final conforming mesh

tial SAMR code is reused by each processor to adaptively refine the background elements of the corresponding partition in the vicinity of material interfaces. The r-adaptivity phase is then carried out by visiting the nodes of all refined elements cut by material interfaces to determine whether they maintain their location or must be snapped to the interface (100% sequential code reuse). Note that during this process, each node is only allowed to relocate along one of the vertical/horizontal element edges connected to it, meaning 4 or 6 directions for 2D quadrilateral or 3D tetrahedral background meshes, respectively. The benefit of using ghost layers in the initial structured sub-meshes is realized at this stage, as we have access to the interface morphology and the mesh structure to correctly determine the relocation direction of each node across the actual boundaries of partitions (cf. yellow nodes in Figure 5c). In other words, all shared nodes along physical interfaces between neighboring partitions experience consistent relocation directions without any communication. The reason is the emergence of identical relocation patterns for boundary nodes due to the fixed arrangement of elements in structured background sub-meshes and the use of ghost elements as a probe across partition boundaries. This consistency also facilitates merging duplicate nodes along the partition boundaries when processors later communicate with one another, which will be discussed in the following section.

It must be noted that the ability to independently execute the SAMR phase in each partition highly accelerates this process and leads to a considerable reduction in the overall mesh generation time.
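As an illustration of the direction-constrained node relocation described above, the following Python sketch snaps a 2D node toward a signed-distance interface along one of the 4 axis-aligned element edges attached to it. All names (`snap_node`, `phi`, the snapping threshold `tol`) and the bisection-based intersection search are our own illustrative assumptions, not the actual CISAMR implementation.

```python
import math

def snap_node(node, phi, h, tol=0.6):
    """Snap a background-mesh node onto the material interface (2D sketch).

    phi is a signed distance function describing the interface. A node may
    only relocate along one of the 4 axis-aligned element edges attached to
    it (edge length h), mirroring the direction constraint described above.
    The edge-fraction threshold tol is an assumption of this sketch.
    """
    x, y = node
    best_t, best_p = None, None
    for dx, dy in ((h, 0.0), (-h, 0.0), (0.0, h), (0.0, -h)):
        f0, f1 = phi(x, y), phi(x + dx, y + dy)
        if f0 == 0.0:
            return node                      # already on the interface
        if f0 * f1 > 0.0:
            continue                         # this edge is not cut
        lo, hi = 0.0, 1.0                    # bisect for the cut point
        for _ in range(60):
            mid = 0.5 * (lo + hi)
            if f0 * phi(x + mid * dx, y + mid * dy) <= 0.0:
                hi = mid
            else:
                lo = mid
        t = 0.5 * (lo + hi)
        if t <= tol and (best_t is None or t < best_t):
            best_t, best_p = t, (x + t * dx, y + t * dy)
    return best_p if best_p is not None else node

# Toy interface: circle of radius 1 centered at the origin
circle = lambda px, py: math.hypot(px, py) - 1.0
```

Because the rule is deterministic, applying it to identical ghost-padded element patterns on both sides of a partition boundary is what can guarantee consistent relocations without communication.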
In the numerical examples presented in Section 5, we show that parallelizing this phase in fact leads to a super-linear (more than ideal) speedup with respect to the number of partitions/processors. In order to better understand the reason behind this significant speedup, which is much more considerable than for other steps of parallel CISAMR (i.e., r-adaptivity, face-swap, and element subdivision), it is worth taking a closer look at the implementation details of the SAMR phase.

As the first step of the CISAMR algorithm, performing SAMR requires a pre-processing phase during which one must determine the relative locations of background mesh nodes and elements with respect to each material interface. As noted previously, we use NURBS parametric functions to characterize the morphology of each particle. Therefore, determining the relative location (inside, outside, or on the surface) of each node with respect to its interface, or identifying whether an element edge intersects this interface, requires solving a nonlinear geometrical problem. For a massive problem such as a heterogeneous microstructural model composed of thousands of particles, which would also require a huge initial background mesh for discretizing the domain, this process could be highly computationally demanding. Further, in order to avoid redoing this process, one must store the relative locations of nodes/elements in an appropriate data structure (e.g., a red-black tree) to later access them in subsequent steps of the CISAMR algorithm. Although we employ efficient techniques such as a quadtree/octree search to quickly identify background nodes/elements in the vicinity of a particle and discard those far away, the insertion/search within this data container becomes computationally expensive as its size grows. Parallelization of the SAMR phase significantly reduces the cost of the insertion/search process and the size of the data containers, as a limited number of particles are assigned to each partition and no communication is needed with neighboring partitions.

An equally important aspect that slows down the SAMR phase when implemented sequentially is the need to constantly check whether underlying elements were previously refined by neighboring particles when interacting a new particle with the background mesh. In order to facilitate the discussion, we refer back to the adaptively refined background mesh in the vicinity of the two particles shown in Figure 1b. Assume that the red particle is first interacted with the mesh to perform SAMR and then the blue particle is added. In this situation, we must know whether the background elements intersecting the latter and their neighboring elements require SAMR, or if they are already refined because of the existing neighboring particle. Moreover, as noted in Section 2, one of the constraints of the SAMR phase is that no element edge can have more than one hanging node. In other words, an element must be further refined if two hanging nodes are generated on one of its edges by two particles that are in close proximity. Thus, for the simple case scenario shown in Figure 1b, we must have access to the element edges with hanging nodes generated after applying SAMR to the red particle when interacting the blue particle with the background mesh. This gives rise to a bookkeeping issue similar to that discussed in the paragraph above when dealing with a massive microstructural model with thousands of particles: repeated insertion/search in an excessively large data container to identify elements that are already refined or hold hanging nodes on their edges. By executing the SAMR phase in each partition independently in the parallel framework, this insertion/search process is limited to a considerably smaller container associated with each processor, resulting in a significant reduction in the computational cost.

4.3 Communicating duplicate/hanging nodes

After performing r-adaptivity, we must merge the duplicate mesh nodes along shared partition interfaces, which are shown in yellow in Figure 5c. It is also possible that a number of hanging nodes are generated only on one side of the shared interface between two neighboring partitions (cf. green nodes in Figure 5c). Such hanging nodes must be mapped to the shared edge/face belonging to the neighboring partition to ensure the creation of a conforming mesh after the completion of the element subdivision phase. In 2D parallel CISAMR, merging duplicate nodes and mapping one-sided hanging nodes along shared interfaces of neighboring partitions is the only communication required between processors. In 3D, another marginal communication phase would be necessary after subdividing tetrahedral elements cut by the interface, which will be discussed in Section 4.4.

The rectangular/cuboid shape of each partition, together with the fact that a structured mesh is used as the starting point for constructing the conforming mesh in each partition, considerably reduces the computational cost associated with the communication phase. Although this structured pattern is slightly disturbed after performing r-adaptivity, with the aid of ghost elements, duplicate nodes on the shared interface between two partitions undergo identical relocations on either side. Therefore, when visiting a node on a partition's boundary, no search algorithm is needed to verify the existence of a duplicate node on the mirroring face of the adjacent partition or to classify it as a hanging node that is only created on one side. Instead, we first sort all nodes on the actual boundaries (and not ghost boundaries) of each partition based on their coordinates, in the order of x, y, and z coordinate

[Figure 6 labels: material interface; mesh node; interface node; conforming sub-tetrahedrons]

Fig. 6: One of the sub-tetrahedralization case scenarios in 3D CISAMR for a regular tetrahedron, requiring the creation of new nodes at intersection points of the material interface with non-conforming element edges

values. Assume S1 and S2 refer to the sorted node sets belonging to adjacent faces shared between two partitions. Visiting each entry of both node sets simultaneously, if S1(i) = S2(i), then the visited nodes are duplicates and must be merged. Otherwise, we check whether S1(i) is identical to the next member of the second node set, i.e., whether S1(i) = S2(i + 1). If this is the case, then S2(i) is a hanging node generated in only one of the partitions but does not belong to the neighboring partition; thus it must be added to the corresponding edge. Otherwise, S1(i + 1) = S2(i) indicates that S1(i) is a hanging node that does not exist on the other side. Note that this low-cost communication algorithm considerably reduces the overall parallel meshing time and is one of the main reasons for imposing the constraint of a rectangular/cuboid shape for partitions, which compensates for the lack of a perfect load balance between processors.

4.4 Face swapping and subdividing

As explained in Section 2, a small subset of the tetrahedral elements subjected to r-adaptivity might be highly deformed into cap or sliver shaped elements with high aspect ratios in 3D CISAMR. A non-iterative face-swapping algorithm is applied to the (originally cubic) blocks containing such elements, which replaces the initial 5 tetrahedrons (one being highly deformed) with 7 new tetrahedrons, all with good aspect ratios. The fact that this phase is limited to each block containing cap/sliver shaped tetrahedrons and one of its neighboring blocks obviates the need for passing information beyond the ghost layer of elements, which in turn eliminates the need for any communication during this process. After completing the face-swap phase (only needed in 3D), we discard the ghost elements and proceed to subdividing the remaining elements that are either still intersecting material interfaces or have hanging nodes on their edges. This process is rather straightforward and is carried out by visiting such elements in each partition independently (cf. Figure 5d). Specific rules on how to cut elements are followed during the subdivision phase to ensure that the resulting conforming sub-elements have the lowest possible aspect ratios (see [34,35] for details).

In 2D CISAMR, all remaining non-conforming quadrilateral elements of the background mesh after the completion of r-adaptivity are diagonally cut by material interfaces. Thus, although interior nodes might be generated when using the double-diagonal sub-triangulation rule in such elements (see [34] for details), no new node is generated on their edges during the subdivision phase. This means no further communication is needed between processors after completing this phase and the final conforming mesh is already constructed. The situation is completely different in 3D, as after performing r-adaptivity and face-swap, on average 50% of the originally nonconforming background tetrahedrons still intersect with material interfaces. Sub-tetrahedralizing such elements requires creating new nodes at the intersections of material interfaces with non-conforming edges (cf. Figure 6). Thus, a second communication phase is required after subdividing elements in 3D parallel CISAMR, during which interface nodes across shared faces of neighboring partitions must be merged. The communication process is similar to that explained in Section 4.3, meaning we first sort the interface nodes based on their coordinates on each side of the shared faces and then use a one-to-one mapping to merge them. Note that the computational cost associated with this second communication phase is much smaller than that of the major communication performed after r-adaptivity, as the number of nodes that must be merged is much smaller and we no longer need to identify hanging nodes that could only exist on one side.

4.5 PBC and cohesive elements

Analyzing the micromechanical behavior of materials often involves using the periodic boundary condition (PBC) to avoid unrealistic stress concentrations along domain boundaries [37]. Also, approximating the failure response of composite materials requires implementing cohesive elements along the interface between embedded inclusions and the surrounding matrix to simulate the debonding process [38]. Imposing PBC in CISAMR is a rather straightforward task, where as discussed in [35], applying similar levels of SAMR to blocks of elements adjacent to the boundary for a periodic microstructural model automatically leads to the construction of a periodic mesh. In parallel CISAMR, PBC can easily be imposed by establishing an extra set of com-

Algorithm 2 (Parallel CISAMR algorithm for processor rank j)

function PARALLEL_CISAMR(Ω_j, neighbor_j, S_p^j, S_r, h)
    M_j ← background_mesh(Ω_j, h)                   ▷ generate background mesh of size h with a ghost layer of elements for Ω_j
    for i := 1 to size(S_p^j) do                    ▷ loop over all particles overlapping with Ω_j (identified during partitioning phase)
        R_i ← S_r(P_i)                              ▷ determine the number of h-adaptive refinement levels R_i corresponding to particle P_i
        S_node^i, S_el^i ← mesh_interactor(M_j, P_i)  ▷ identify locations of background nodes/elements relative to P_i interface
        for k := 1 to R_i do
            M_j ← SAMR(M_j, S_el^i, P_i)            ▷ recursively apply SAMR to background elements near P_i interface
            S_node^i, S_el^i ← mesh_updator(M_j, P_i) ▷ update locations of nodes/elements relative to P_i interface after SAMR
        end for
    end for
    for i := 1 to size(S_p^j) do
        M_j ← r_adaptivity(M_j, S_node^i, S_el^i, P_i) ▷ apply r-adaptivity to nodes of background elements intersecting P_i interface
    end for
    S_bnode^j ← node_sorter_1(M_j)                  ▷ sort all nodes on boundary of mesh M_j associated with processor rank j
    M_j ← communicator_1(S_bnode^j, neighbor_j)     ▷ merge/map boundary nodes with those of neighboring partitions
    if (3D CISAMR) then
        for i := 1 to size(S_p^j) do
            M_j ← face_swap(M_j, S_el^i)            ▷ perform face-swap to eliminate cap/sliver tetrahedrons with high aspect ratios
        end for
    end if
    M_j ← ghost_eliminator(M_j)                     ▷ eliminate ghost layer of elements from mesh M_j
    for i := 1 to size(S_p^j) do
        M_j ← subdivide(M_j, S_el^i, S_node^i)      ▷ subdivide elements cut by material interfaces or holding hanging nodes
    end for
    if (3D CISAMR) then
        S_new_bnode^j ← node_sorter_2(M_j)          ▷ sort new (interface) boundary nodes created during sub-tetrahedralization
        M_j ← communicator_2(S_new_bnode^j, neighbor_j) ▷ merge new boundary nodes with those of neighboring partitions
    end if
end function
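The boundary-node merge/classification of Section 4.3 (the communicator step in Algorithm 2) can be sketched as follows. This is a hedged illustration: it uses a two-pointer variant of the lookahead comparison described in the text, and bare coordinate tuples stand in for the richer node records of the actual code; all names are our own.

```python
def classify_shared_nodes(s1, s2, tol=1e-12):
    """Classify nodes on the face shared between two partitions.

    s1, s2: node coordinates on either side of the shared interface, each
    sorted lexicographically by (x, y, z) as described in Section 4.3.
    Returns (merged, hang_in_1, hang_in_2): coordinates present on both
    sides (duplicates to merge), and one-sided hanging nodes that must be
    mapped onto the neighboring partition's edge/face.
    """
    same = lambda a, b: all(abs(u - v) <= tol for u, v in zip(a, b))
    i = j = 0
    merged, hang_in_1, hang_in_2 = [], [], []
    while i < len(s1) and j < len(s2):
        if same(s1[i], s2[j]):            # duplicate node: merge
            merged.append(s1[i]); i += 1; j += 1
        elif s1[i] < s2[j]:               # node exists only on side 1
            hang_in_1.append(s1[i]); i += 1
        else:                             # node exists only on side 2
            hang_in_2.append(s2[j]); j += 1
    hang_in_1 += s1[i:]
    hang_in_2 += s2[j:]
    return merged, hang_in_1, hang_in_2

s1 = sorted([(0.0, 0.0, 0.0), (0.0, 0.5, 0.0), (0.0, 1.0, 0.0)])
s2 = sorted([(0.0, 0.0, 0.0), (0.0, 1.0, 0.0)])
merged, h1, h2 = classify_shared_nodes(s1, s2)
```

Because both sides sort by the same coordinate order, a single linear pass suffices and no per-node search is required, which is the point made in Section 4.3.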

munications between partition faces located on parallel domain boundaries, during which similar equation numbers are assigned to periodic nodes. The process is identical to merging mesh nodes and mapping hanging nodes across shared interfaces of adjacent partitions by sorting them based on their coordinates.

Also, cohesive elements/nodes are simply inserted along material interfaces after completing r-adaptivity and before starting the major (first) communication phase in parallel CISAMR. In order to insert such elements in a conforming mesh generated using CISAMR, at each point two nodes with the same coordinates must be generated along material interfaces, each corresponding to one of the material phases. The only special treatment needed in the parallel implementation is when merging such nodes (i.e., those located on material interfaces) shared between two adjacent partitions during the communication phase. We can still implement the same sorting and communication algorithm to merge mesh nodes and map one-sided hanging nodes. However, we must consider the fact that two nodes co-exist at the same coordinates along material interfaces, one belonging to the matrix phase and the other to an inclusion. After sorting these node pairs, we first communicate (merge) the nodes belonging to the matrix and then to the inclusion phase. Afterwards, two different nodes with the same coordinates are created and used to construct cohesive elements. The parallel CISAMR algorithm described in this section is summarized in the pseudo code provided in Algorithm 2.

5 Numerical examples

Four example problems are presented in this section to demonstrate the performance and scalability of parallel CISAMR for constructing massive 2D and 3D conforming meshes for composite materials (in some cases, > 10^8 elements). For each example, the scalability test is broken down into the speedup associated with different phases of the CISAMR algorithm to show their performance in the parallel implementation. In addition to the parallel construction of meshes, we present the FE approximation of the micromechanical behavior in each example problem, including linear elasticity, thermo-elasticity, and continuum ductile damage simulations. It must be noted that, by being implemented in an existing parallel FE code, the same partitions used by parallel CISAMR for mesh generation in this work can be reused for approximating the field.
This is a key advantage while performing a parallel simulation, as


Fig. 7: First example problem: 800 µm × 800 µm virtually reconstructed RVE of an SiC particle reinforced aluminum composite, where insets show portions of the conforming mesh generated using parallel CISAMR

the sequential partitioning is carried out only once for the entire modeling process (meshing and FE analysis). Further, throughout this process, no input file is read sequentially (particle morphologies are read in parallel by designated processors) and the initial structured sub-meshes are generated for each partition independently by its corresponding processor.

5.1 2D particulate composite

In this example, we use parallel CISAMR to generate conforming meshes for the representative volume element (RVE) of an SiC reinforced aluminum composite with more than 1.12 × 10^4 embedded particles. Similarly to all other examples presented hereafter, this 800 µm × 800 µm RVE (cf. Figure 7) is virtually reconstructed using the BBox-based packing/optimization algorithm briefly described in Section 3 (refer to [36] for details). Note that the non-uniform spatial arrangement of the embedded SiC particles and the variations in their size distribution lead to a non-uniform arrangement of load-balanced partitions for the parallel mesh generation, which was previously shown in Figure 4 for a quadrant of this RVE.

In order to perform the scalability study, in addition to the sequential CISAMR (1 processor), parallel meshes are constructed using 4, 16, 64, 256, and 1,000 processors. A minimum of 2 levels of SAMR is applied along each material interface, with some particles in close proximity subjected to additional refinement if intersecting the same background element. The resulting conforming mesh has more than 21 million elements and 28 million DOFs, two small portions of which are depicted in the insets of Figure 7. Note that the different colors of element edges in this figure correspond to the different partitions used for the mesh generation.

The FE model generated using parallel CISAMR is used to approximate the linear elastic micromechanical response of the particulate composite RVE subjected to a prescribed displacement of uy = 1 µm in the vertical direction. The elastic modulus and Poisson's ratio of the aluminum matrix are EAl = 72.4 GPa and νAl = 0.33, respectively, while those of the embedded SiC particles are ESiC = 415 GPa and νSiC = 0.16. The FE approximation of the von Mises stress field in this particulate composite RVE is depicted in Figure 8. This simulation is performed using the same 64 partitions used for the parallel mesh generation (no repartitioning). As seen in the insets of this figure, the stress concentrations along the particle-matrix interfaces are accurately captured by the adaptively refined conforming elements in the mesh constructed using CISAMR.

Figure 9 presents the scalability test results for this example, broken down into the different phases of the CISAMR algorithm to better show the impact of parallelization on their performance. The speedup curves in these plots yield the ratio of the computational time associated

[Figure 8 color bar: von Mises stress, 11-404 MPa]

Fig. 8: First example problem: FE approximation of the von Mises stress field in the particulate composite RVE shown in Figure 7, where insets show sites of stress concentrations along particle-matrix interfaces

with the sequential CISAMR (or its constituent phases) to that of the parallel algorithm for various numbers of processors. Each curve is also compared with the ideal speedup curve, a linear curve with unity slope representing very good scalability of any parallel algorithm. As shown in Figure 9a, parallel CISAMR yields a super-linear (more than optimal) speedup as the number of processors used for mesh generation in this example increases. In other words, the scalability of this algorithm in 2D is even better than the ideal case scenario expected for a parallel algorithm. Thus, for example, using 4 partitions to perform the simulation leads to more than a 4-fold reduction in the overall computational time. Also, note the second speedup curve provided in Figure 9a, in which the computational cost of the sequential mesh partitioning is added to the parallel mesh generation time. This cumulative curve shows minimal difference compared to the isolated parallel CISAMR curve, indicating the negligible time needed for the domain partitioning compared to the overall time spent on the parallel mesh generation.

It is worth mentioning that the entire time (including partitioning) needed to generate the massive mesh shown in Figure 7 on 1,000 processors (Intel® Xeon® x5650, Ohio Supercomputer Center) was only 18.9 seconds. On the other hand, using one processor on the same machine to generate this mesh sequentially led to approximately 30 days of computing time. Note that by using only 16 processors for the parallel mesh generation in this example, the huge computational cost associated with the sequential algorithm was reduced to less than 6 hours.

The reason for the super-linear speedup of parallel CISAMR in this example can be explained by comparing the speedup of the SAMR phase versus that of the remaining phases of this algorithm (i.e., r-adaptivity, sub-triangulation, and communication). As shown in Figure 9b, while the speedup associated with the SAMR phase is super-linear with a slope of approximately 2, the remaining three phases still yield a nearly ideal speedup. The latter is not exactly a linear speedup, in part due to the communication cost, although the fact that the slope of this curve is almost unity indicates the negligible time spent on the communication phase. Note that another reason for a small deviation from a linear speedup is the presence of ghost elements in each partition, which imposes an additional computational cost per processor during the mesh generation process; a seemingly disadvantageous feature at first glance that proves to be an overall net advantage by significantly reducing the communication cost. Note that the impact of the computational burden imposed by ghost elements on reducing the speedup becomes slightly more pronounced when using 1,000 partitions, as the ratio of the total area spanned by the ghost layer to the actual area of each partition becomes sufficiently large. As shown in Figure 9a, this effect visibly lowers the speedup rate of parallel CISAMR, although a super-linear speedup is still maintained.

As discussed in Section 4.2, the expected super-linear speedup of the SAMR phase is attributed to two key factors: (i) a significant reduction in the size of data containers and therefore a reduced computational cost associated with insertion/search operations while determining background node/element locations with respect to material interfaces; (ii) a similar effect when tracking previously refined elements and hanging nodes while performing h-adaptivity for each particle. A comparison between Figures 9a and 9b shows that the overall scalability curve of CISAMR in this example is similar to the scalability curve of the SAMR phase (super-linear), demonstrating once again that the collective computational cost of r-adaptivity, sub-triangulation, and inter-processor communication is less than that of SAMR. This is due to using at least two levels of SAMR for each particle in this example, which affects a large number of background elements and thereby accrues a huge computational cost that is significantly reduced in the parallel implementation.

Fig. 9: First example problem: (a) speedup of parallel CISAMR for meshing the 2D particulate composite RVE shown in Figure 7, which also considers the impact of the sequential partitioning phase; (b) speedups of the SAMR phase and subsequent phases (r-adaptivity, sub-triangulation, and communication) of the parallel CISAMR algorithm

5.2 Corroded steel sheet

This second example problem demonstrates the application of parallel CISAMR for constructing conforming meshes for a St14 steel sheet with several microscopic holes caused by severe localized corrosion. The microstructure of this 800 µm × 800 µm corroded sheet is illustrated in Figure 10, which consists of more than 1.2 × 10^4 corrosion pits. Note that some of the NURBS curves representing each pit are deliberately overlapped with one another to mimic merged corrosion pits, resulting in C^0-continuous material interfaces (sharp edges) in the macrostructural model. Figure 10 also shows the optimized arrangement of 80 partitions used for the parallel mesh generation, together with portions of the conforming mesh generated using at least two levels of SAMR for all material interfaces. This mesh, which consists of approximately 24 million elements and more than 31 million DOFs, also shows the ability of parallel CISAMR to non-iteratively handle sharp geometric features of material interfaces.

The FE simulation of the ductile damage response of the corroded steel sheet subject to a tensile traction in the vertical direction applied on the top edge is illustrated in Figure 11. Using the phenomenological continuum ductile damage model presented in [39] for this simulation, the scalar damage parameter Ω distinguishes undamaged (Ω = 0) and fully damaged (Ω = 1) regions of the material. Using the von Mises yield condition with an isotropic hardening law, this model considers both the elasto-plastic behavior and the damage response of the steel. Defining the yield surface f(σ) as

f(σ) = q − σ_Y(ε̄_eq^pl) = 0,   (3)

where q refers to the von Mises stress, ε̄_eq^pl is the equivalent plastic strain, and σ_Y(ε̄_eq^pl) is the yield function. The microscopic stress tensor σ_m is then related to the damage parameter Ω as

σ_m = (1 − Ω) σ_eff,   (4)

where σ_eff is the effective stress tensor. The damage initiation criterion is characterized using the strain rate ε̄̇_eq^pl and the stress triaxiality η, after


Fig. 10: Second example problem: 800 µm × 800 µm RVE of a corroded steel sheet, together with the arrangement of 80 partitions used by parallel CISAMR to generate a conforming mesh. Portions of the resulting conforming mesh (generated using 40 partitions) are depicted in insets of the figure, showing the ability of parallel CISAMR to handle sharp features of material interfaces

which an effective plastic displacement ū^pl is employed to quantify the damage evolution as

ū̇^pl = L ε̄̇_eq^pl.   (5)

In the equation above, L is a characteristic length parameter that minimizes mesh dependency effects. Based on the experimental data provided in [40], the following parameters are calibrated and used in the simulation shown in Figure 11: Est = 210 GPa, νst = 0.3, ε̄_eq^pl = 0.33, ū^pl = 0.4 µm. The yield function σ_Y(ε̄_eq^pl) is given in Table 1.

σ_Y (MPa)    580      650      750      770
ε̄_eq^pl      0.0      1.5e-4   4.0e-3   1.01e-2

Table 1: Second example problem: calibrated yield function used in the continuum ductile damage model to simulate the failure response of the corroded steel sheet.
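For reference, the tabulated hardening data in Table 1 can be evaluated as a piecewise-linear yield function. Holding the stress constant beyond the last tabulated strain is our own assumption here, as the paper does not state the extrapolation rule.

```python
import bisect

# Calibrated hardening data from Table 1
# (equivalent plastic strain -> yield stress in MPa)
EPS  = [0.0, 1.5e-4, 4.0e-3, 1.01e-2]
SIGY = [580.0, 650.0, 750.0, 770.0]

def yield_stress(eps_pl):
    """Piecewise-linear yield function sigma_Y(eps_pl) from Table 1.

    Values beyond the tabulated range are clamped to the end values
    (an assumption of this sketch).
    """
    if eps_pl <= EPS[0]:
        return SIGY[0]
    if eps_pl >= EPS[-1]:
        return SIGY[-1]
    k = bisect.bisect_right(EPS, eps_pl) - 1
    t = (eps_pl - EPS[k]) / (EPS[k + 1] - EPS[k])
    return SIGY[k] + t * (SIGY[k + 1] - SIGY[k])
```

A ductile damage driver would call such a lookup at every integration point to evaluate the yield surface in Eq. (3).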

Fig. 11: Second example problem: FE simulation of the damage pattern at the failure point in the corroded steel sheet, where the dashed yellow curve highlights the major crack leading to failure

Parallel CISAMR scalability test results for this example, together with the impact of the sequential partitioning on the speedup of this algorithm, are presented in Figure 12a. Similar to the previous example problem, a super-linear speedup is achieved and the partitioning effect on the overall computational cost is shown to be minimal. Studying the speedups of each phase of

parallel CISAMR also reveals a similar trend as the example presented in Section 5.1: a super-linear speedup for the SAMR phase and a nearly linear speedup for the r-adaptivity and sub-triangulation phases, while also taking into account the communication cost. This observation once again shows that, by applying two levels of refinement, the time spent on the SAMR phase dominates the overall computational cost. It is worth mentioning that for the case of 1,000 partitions, the entire modeling process (partitioning and parallel mesh generation) in this example was finished in 87.8 seconds.

Fig. 12: Second example problem: (a) speedup of parallel CISAMR for meshing the 2D corroded steel sheet shown in Figure 10, which also considers the impact of the sequential partitioning phase; (b) speedups of the SAMR phase and subsequent phases (r-adaptivity, sub-triangulation, and communication) of the parallel CISAMR algorithm

Fig. 13: Third example problem: virtually reconstructed RVE of a cross-ply CFRP with more than 200 embedded carbon fibers

5.3 3D carbon fiber reinforced polymer

In this example, we implement the 3D parallel CISAMR algorithm for modeling and simulating the micromechanical behavior of the virtually reconstructed RVE of a cross-ply carbon fiber reinforced polymer (CFRP). As shown in Figure 13, the cubic macrostructural model (l = 100 µm) is composed of two plies with more than 200 embedded fibers. Portions of the conforming mesh generated for this problem using parallel CISAMR are shown in Figure 14, where only one level of SAMR is applied along the interface of each fiber. This mesh consists of more than 34 million tetrahedral elements, which corresponds to > 21 million DOFs in the FE model.

The conforming mesh generated using CISAMR is utilized to simulate the thermo-mechanical response of the CFRP RVE subject to an increase of ∆u = 100 °C in the temperature. The material properties used for the epoxy matrix in this simulation are Em = 4.35 GPa, νm = 0.36, and αm = 43.92 × 10^−6/°C (coefficient of thermal expansion). For the transversely isotropic carbon fibers [41], the material properties are E1 = 233.1 GPa (longitudinal fiber direction), E2 = 23.1 GPa (transverse fiber direction), G1 = 9.0 GPa, G2 = 8.3 GPa, ν1 = 0.2, ν2 = 0.4, α1 = −0.54 × 10^−6/°C, and α2 = 10.08 × 10^−6/°C. All faces of the domain are subjected to free displacement boundary conditions, while six DOFs are constrained along the bottom face to eliminate rigid body motions. The approximated thermally induced displacement field, as well as the distribution of the von Mises stress in the fiber and matrix phases, are

illustrated in Figure 15. Note that the high-quality conforming mesh generated using CISAMR properly captures sites of stress concentration within the CFRP microstructure (stress contours are presented without any smoothing).

Fig. 14: Third example problem: portions of the conforming mesh generated using parallel CISAMR corresponding to inboxes with similar colors in the RVE shown in Figure 13

Variations of the parallel CISAMR speedup versus the number of processors for this example, with and without taking into account the partitioning cost, are depicted in Figure 16a. Note that the overall computational time associated with the latter on 512 processors was less than 13 minutes, while one processor required more than 4 days to generate the same mesh sequentially. The key difference between the speedup curves in this case and those of the 2D examples presented previously is that we no longer achieve a super-linear speedup, although 3D parallel CISAMR still shows an ideal scalability (in fact, slightly better than a linear speedup). The reason for this behavior can be understood by comparing the speedup of the SAMR phase with that of the remaining four phases of the algorithm, i.e., the r-adaptivity, face-swap, sub-tetrahedralization, and communication (major and minor) phases. As shown in Figure 16b, similar to the 2D examples, the SAMR phase still yields a super-linear speedup and the remaining phases yield a nearly linear speedup. The key difference is the considerably lower slope of the super-linear speedup of SAMR in this 3D example (1.21 vs. ≈ 2 in the 2D examples), which has two main reasons:

◦ Reason 1: In the current example, we have used only one level of SAMR along material interfaces, versus two levels in the previous examples. This requires a much smaller data container to keep track of refined elements and hanging nodes throughout the h-adaptivity process. Thus, the speedup achieved by parallelization is less pronounced, due to the lower cost of the insertion/search operations in this data container for the sequential case, which is used as the reference point.

◦ Reason 2: Compared to 2D CISAMR, the ratio of the computational cost of the numerical calculations for locating material interfaces within the background mesh and subdividing tetrahedrons to that of the insertion/search operations in the corresponding data containers during SAMR is much higher in 3D. Intersecting a 3D NURBS surface characterizing the morphology of a material interface with the background mesh requires solving more computationally demanding nonlinear problems to determine the relative locations of nodes/elements with respect to that surface, compared to similar operations in 2D. Further, depending on whether it is an orthocentric or a regular tetrahedron, a background element is subdivided into 6 or 8 sub-tetrahedrons, respectively, during the SAMR process. Compared to subdividing a rectangular background element into 4 sub-elements in 2D SAMR, a larger portion of the overall computational cost of the SAMR phase must be dedicated to this process in 3D. Since the super-linear speedup observed in the SAMR phase is primarily a result of the significant time reduction in insertion/search operations, the reduction in the ratio of the computational cost of data handling to that of the entire SAMR process in 3D leads to a lower speedup, although it is still super-linear.

Given the lower super-linear speedup rate of the SAMR phase in this 3D example, when combined with the nearly linear speedup of the subsequent phases, parallel CISAMR yields an ideal but not super-linear scalability (cf. Figure 16a). This is in contrast to the previous two examples, where the cumulative speedup was closer to that of the SAMR phase (super-linear). This behavior

0.43 2.61

0.0 0.03 ( ) (MPa) z z

x y x y

(a) (b)

2.01

0.03 (MPa) z

x y

(c) (d)

in the current example is not only attributed to the reduced speedup rate of SAMR (Reason 2), but also to the fact that, by applying only one SAMR level, the corresponding computational cost is a much smaller portion of the total time spent on the parallel mesh generation process (Reason 1).

Fig. 15: Third example problem: (a) displacement field and (b–d) von Mises stress fields in the fibers and matrix phases subject to a uniform temperature jump of ∆u = 100 °C

5.4 3D particulate composite

In this final example problem, we demonstrate the application of parallel CISAMR for meshing the 3D particulate composite RVE shown in Figure 17. This virtually reconstructed cubic microstructural model with a length of l = 800 µm is composed of an epoxy matrix (Em = 3.9 GPa, νm = 0.39) and 1,418 embedded silica particles (ESiC = 71.7 GPa, νSiC = 0.17). Figure 17 also illustrates the partitioned domain for the parallel mesh generation using 40 processors. Note that, in order to minimize the difference between estimated load measures, the resulting partitions have larger sizes in regions with lower particle densities.

Portions of the massive conforming mesh generated using parallel CISAMR for discretizing the particulate composite RVE shown in Figure 17 are illustrated in Figure 18. Using a fine background mesh and applying at least two levels of SAMR along particle-matrix interfaces, this mesh consists of approximately 172 million

tetrahedral elements and 98 million DOFs. The mesh is then used to simulate the linear elastic micromechanical behavior of this composite material subject to a macroscopic displacement jump [[uM]] = 0.02 µm in the z-direction along the top boundary. The resulting stress fields in the direction of the z-axis (σzz) in the matrix and particles are shown in Figure 19 for a simulation performed using the 64 partitions that were already utilized for the parallel mesh generation. Note how the adaptively refined conforming elements generated using CISAMR along silica particle interfaces, some of which have very high curvatures, enable capturing sites of stress concentration in this complex microstructural model.

Fig. 16: Third example problem: (a) speedup of parallel CISAMR for meshing the cross-ply CFRP RVE shown in Figure 13, which also considers the impact of the sequential partitioning phase; (b) speedups of the SAMR phase and subsequent phases (r-adaptivity, face-swapping, sub-tetrahedralization, and communication) of the parallel CISAMR algorithm

Fig. 17: Fourth example problem: virtually reconstructed 3D particulate composite (epoxy matrix, silica particles) RVE, together with the arrangement of optimized partitions for the parallel CISAMR simulation using 40 processors

Fig. 18: Fourth example problem: (a) small portion of the conforming mesh generated using parallel CISAMR for the 3D particulate composite microstructure shown in Figure 17; (b) conforming elements for discretizing silica particles corresponding to the inbox in figure (a)

Similar to what was observed previously, the speedup curves presented in Figure 20a for this example show the negligible impact of the sequential partitioning phase on the overall computational cost of the mesh generation using parallel CISAMR. However, unlike the 3D example presented in Section 5.3, which demonstrated a linear speedup, here once again we achieve a super-linear speedup for parallel CISAMR. The reasons for this behavior are similar to those provided for that example to justify its linear scalability. By using two levels of refinement along particle-matrix interfaces, in the current example a larger portion of the computational cost is dedicated to the SAMR phase, and therefore to the computationally demanding insertion/search process in the corresponding data containers, which more significantly affects the overall scalability curve. Therefore, as confirmed by Figure 20b, the super-linear speedup of the SAMR phase has a larger slope compared to that of the previous example, indicating a better scalability for this phase. However, as discussed in Section 5.3, due to the more computationally intensive processes associated with the mesh-interface interaction and sub-tetrahedralization during 3D SAMR, this slope is lower than those of the 2D examples.

Adding the computational cost of SAMR to that of the subsequent phases of parallel CISAMR (including the inter-processor communication), we can still achieve a super-linear speedup for constructing the massive conforming mesh in this example. Note that, given the exceedingly large size of the problem and the memory limitation, it was not feasible to generate the mesh using CISAMR sequentially. Thus, the reference point for evaluating speedup values in Figure 20 is the parallel simulation carried out on 8 processors, i.e., the smallest feasible number of processors to build this mesh. It is worth noting that using 512 processors led to a computational time of 18.5 minutes for the entire mesh generation process in this example.

6 Conclusion

The parallel implementation of the CISAMR mesh generation algorithm was presented for 2D and 3D problems, which enables nearly 100% code reuse from its sequential non-iterative algorithm. As a pre-processing phase, a computationally inexpensive domain partitioning algorithm was introduced, which is specifically designed for modeling heterogeneous material microstructures. After over-decomposing the domain, a set of hierarchical bounding box representations of inclusion morphologies, together with the number of refinement levels along each interface, are employed to determine the load measure in each sub-region. A multi-objective optimization problem is then solved to determine the final arrangement of partitions by balancing their load measures under the constraint of rectangular/cuboid partition shapes.

A structured sub-mesh is then independently constructed by the processor assigned to each partition, which includes a ghost layer of elements to minimize the communication cost. The parallel mesh generation proceeds by performing the SAMR and r-adaptivity phases of CISAMR in each partition independently. Before moving to the final phase(s) of CISAMR (sub-triangulation in 2D and a combination of face-swap and sub-tetrahedralization in 3D), a computationally inexpensive communication phase is carried out to merge/map mesh nodes and hanging nodes along shared interfaces of neighboring partitions. A second marginal communication phase is required in 3D CISAMR to merge new nodes generated after the sub-tetrahedralization of elements. Several example problems were provided to demonstrate the application of parallel CISAMR for generating massive conforming meshes for various material systems with complex microstructures, together with the FE simulation of their micromechanical behavior. The scalability tests conducted for these examples showed that, at worst, parallel CISAMR yields a linear speedup (ideal scalability); for higher SAMR levels, this non-iterative mesh generation algorithm can even achieve a super-linear speedup.
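As a concrete illustration of the load-based partitioning idea summarized above, the following is a minimal 1D sketch, not the authors' implementation: the domain is over-decomposed into thin strips, a load measure is estimated for each strip from the bounding boxes of the particles overlapping it (weighted by the number of refinement levels along their interfaces), and consecutive strips are then greedily merged into a prescribed number of box-shaped partitions with roughly balanced loads. All function names and the cost model (cost roughly doubling per refinement level) are hypothetical.

```python
# Minimal 1D sketch of load-based partitioning (hypothetical names;
# the actual CISAMR partitioner solves a multi-objective optimization
# over rectangular/cuboid partitions using hierarchical bounding boxes).

def strip_loads(particles, domain_len, n_strips, base_load=1.0):
    """Estimate a load measure for each thin strip of the over-decomposed
    domain: a background-mesh cost plus a refinement-level-weighted term
    for every particle bounding box overlapping the strip."""
    width = domain_len / n_strips
    loads = [base_load] * n_strips
    for x_min, x_max, ref_levels in particles:  # 1D bounding "boxes"
        first = max(0, int(x_min // width))
        last = min(n_strips - 1, int(x_max // width))
        for s in range(first, last + 1):
            loads[s] += 2.0 ** ref_levels  # assumed: cost doubles per level
    return loads

def merge_strips(loads, n_parts):
    """Greedily merge consecutive strips into n_parts partitions whose
    loads are as balanced as possible (partitions stay box-shaped)."""
    target = sum(loads) / n_parts
    cuts, acc = [], 0.0
    for i, w in enumerate(loads):
        acc += w
        if acc >= target and len(cuts) < n_parts - 1:
            cuts.append(i + 1)  # partition boundary after strip i
            acc = 0.0
    return cuts

# Three particles; the particle-dense region [10, 30] ends up covered
# by narrower partitions, mirroring the behavior noted for Figure 17.
particles = [(10, 20, 2), (15, 30, 2), (70, 80, 1)]
loads = strip_loads(particles, domain_len=100, n_strips=20)
print(merge_strips(loads, n_parts=4))
```

The same principle extends to 2D/3D by replacing strips with slabs along each coordinate direction and the greedy merge with the multi-objective optimization described above.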


Fig. 19: Fourth example problem: FE approximation of the stress field in the z-direction, σzz, in different phases/regions of the 3D particulate composite RVE shown in Figure 17 when the microstructure is subject to a displacement jump of [[uM]] = 0.02 µm. The subfigures show σzz in (a) the silica particles; (b) three slices of the domain; (c) a portion of the matrix phase; (d) a smaller portion of the matrix corresponding to the inbox in figure (c)
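For reference, the speedup values plotted in the scalability figures, and the slopes quoted in the text, can be reproduced from raw wall-clock timings along the following lines. This is a generic post-processing sketch with illustrative (not measured) timings; as noted for the fourth example, when no sequential run fits in memory the curve is normalized so that the smallest feasible run (e.g., 8 processors) lies on the ideal line, via the `ref_procs` argument.

```python
import math

def speedups(times, ref_procs=1):
    """Speedup S(p) = ref_procs * T(ref_procs) / T(p), so that the
    reference run sits exactly on the ideal (linear) curve."""
    t_ref = times[ref_procs]
    return {p: ref_procs * t_ref / t for p, t in times.items()}

def loglog_slope(speedup):
    """Least-squares slope of log S versus log p; slope = 1 indicates an
    ideal (linear) speedup, slope > 1 a super-linear one."""
    xs = [math.log(p) for p in speedup]
    ys = [math.log(s) for s in speedup.values()]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den

# Illustrative timings (seconds): cost drops slightly faster than 1/p,
# giving a mildly super-linear slope.
times = {1: 1000.0, 10: 90.0, 100: 8.0}
print(round(loglog_slope(speedups(times)), 2))
```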

Fig. 20: Fourth example problem: (a) speedup of parallel CISAMR for meshing the particulate composite RVE shown in Figure 17, which also considers the impact of the sequential partitioning phase; (b) speedups of the SAMR phase and subsequent phases (r-adaptivity, face-swapping, sub-tetrahedralization, and communication) of the parallel CISAMR algorithm

Acknowledgement

This work has been supported by the Air Force Office of Scientific Research (AFOSR) under grant number FA9550-17-1-0350. The authors also acknowledge partial support from the Ohio State University Simulation Innovation and Modeling Center (SIMCenter), as well as the allocation of computing time from the Ohio Supercomputer Center (OSC).

References

1. R. Espinha, K. Park, G.H. Paulino, and W. Celes. Scalable parallel dynamic fracture simulation using an extrinsic cohesive zone model. Computer Methods in Applied Mechanics and Engineering, 266:144–161, 2013.
2. T. Tu, H. Yu, L. Ramirez-Guzman, J. Bielak, O. Ghattas, K.L. Ma, and D.R. O'Hallaron. From mesh generation to scientific visualization: An end-to-end approach to parallel supercomputing. In Proceedings of the 2006 ACM/IEEE Conference on Supercomputing, page 91. ACM, 2006.
3. G. Karypis and V. Kumar. METIS – unstructured graph partitioning and sparse matrix ordering system, version 2.0. 1995.
4. S. Balay, S. Abhyankar, M.F. Adams, J. Brown, P. Brune, K. Buschelman, L. Dalcin, A. Dener, V. Eijkhout, W.D. Gropp, D. Kaushik, M.G. Knepley, D.A. May, L.C. McInnes, R.T. Mills, T. Munson, K. Rupp, P. Sanan, B.F. Smith, S. Zampini, and H. Zhang. PETSc Web page, 2018.
5. Y. Ito, A.M. Shih, A.K. Erukala, B.K. Soni, A. Chernikov, N.P. Chrisochoides, and K. Nakahashi. Parallel unstructured mesh generation by an advancing front method. Mathematics and Computers in Simulation, 75(5-6):200–209, 2007.
6. T. Tu, D.R. O'Hallaron, and O. Ghattas. Scalable parallel octree meshing for terascale applications. In Proceedings of the 2005 ACM/IEEE Conference on Supercomputing, page 4. IEEE Computer Society, 2005.
7. B. Hudson, G.L. Miller, and T. Phillips. Sparse parallel Delaunay mesh refinement. In Proceedings of the Nineteenth Annual ACM Symposium on Parallel Algorithms and Architectures, pages 339–347. ACM, 2007.
8. M.C. Rivara, C. Calderon, A. Fedorov, and N. Chrisochoides. Parallel decoupled terminal-edge bisection method for 3D mesh generation. Engineering with Computers, 22(2):111–119, 2006.
9. M.S. Shephard, J.E. Flaherty, H.L. de Cougny, C. Ozturan, C.L. Bottasso, and M.W. Beall. Parallel automated adaptive procedures for unstructured meshes. Parallel Computing in CFD, 807:6–1, 1995.
10. N. Chrisochoides. Parallel mesh generation. In Numerical Solution of Partial Differential Equations on Parallel Computers, pages 237–264. Springer, 2006.
11. R. Löhner and J.R. Cebral. Parallel advancing front grid generation. In International Meshing Roundtable, Sandia National Labs. Citeseer, 1999.
12. M. Saxena and R. Perucchio. Parallel FEM algorithms based on recursive spatial decomposition I. Automatic mesh generation. Computers & Structures, 45(5-6):817–831, 1992.
13. G. Karypis and V. Kumar. Multilevel k-way partitioning scheme for irregular graphs. Journal of Parallel and Distributed Computing, 48(1):96–129, 1998.
14. B. Hendrickson and T.G. Kolda. Graph partitioning models for parallel computing. Parallel Computing, 26(12):1519–1534, 2000.
15. K. Andreev and H. Racke. Balanced graph partitioning. Theory of Computing Systems, 39(6):929–939, 2006.
16. B. Hendrickson and R. Leland. The Chaco users guide, version 1.0. Technical report, Sandia National Labs., Albuquerque, NM, 1993.
17. Y.A. Teng, F. Sullivan, I. Beichl, and E. Puppo. A data-parallel algorithm for three-dimensional Delaunay triangulation and its implementation. In Proceedings of the 1993 ACM/IEEE Conference on Supercomputing, pages 112–121. ACM, 1993.
18. J. Galtier and P.L. George. Prepartitioning as a way to mesh subdomains in parallel. In 5th International Meshing Roundtable. Citeseer, 1996.
19. R. Löhner, J. Camberos, and M. Merriam. Parallel unstructured grid generation. Computer Methods in Applied Mechanics and Engineering, 95(3):343–357, 1992.
20. R. Löhner. A parallel advancing front grid generation scheme. International Journal for Numerical Methods in Engineering, 51(6):663–678, 2001.
21. S.N. Muthukrishnan, P.S. Shiakolas, R.V. Nambiar, and K.L. Lawrence. Simple algorithm for adaptive refinement of three-dimensional finite element tetrahedral meshes. AIAA Journal, 33(5):928–932, 1995.
22. L.P. Chew. Guaranteed-quality triangular meshes. Technical report, Cornell University, 1989.
23. J.R. Shewchuk. Delaunay refinement algorithms for triangular mesh generation. Computational Geometry, 22(1-3):21–74, 2002.
24. M.B. Chen, T.R. Chuang, and J.J. Wu. Efficient parallel implementations of near Delaunay triangulation with High Performance Fortran. Concurrency and Computation: Practice and Experience, 16(12):1143–1159, 2004.
25. G.E. Blelloch, G.L. Miller, and D. Talmor. Developing a practical projection-based parallel Delaunay algorithm. In Proceedings of the Twelfth Annual Symposium on Computational Geometry, pages 186–195. ACM, 1996.
26. S.H. Lo. A new mesh generation scheme for arbitrary planar domains. International Journal for Numerical Methods in Engineering, 21(8):1403–1426, 1985.

27. M.C. Rivara. Algorithms for refining triangular grids suitable for adaptive and multigrid techniques. International Journal for Numerical Methods in Engineering, 20(4):745–756, 1984.
28. M.C. Rivara, N. Hitschfeld, and B. Simpson. Terminal-edges Delaunay (small-angle based) algorithm for the quality triangulation problem. Computer-Aided Design, 33(3):263–277, 2001.
29. M.T. Jones and P.E. Plassmann. Parallel algorithms for the adaptive refinement and partitioning of unstructured meshes. In Proceedings of the Scalable High-Performance Computing Conference, pages 478–485. IEEE, 1994.
30. H.L. De Cougny and M.S. Shephard. Parallel refinement and coarsening of tetrahedral meshes. International Journal for Numerical Methods in Engineering, 46(7):1101–1125, 1999.
31. T. Coupez, H. Digonnet, and R. Ducloux. Parallel meshing and remeshing. Applied Mathematical Modelling, 25(2):153–175, 2000.
32. P.L. George. Tet meshing: construction, optimization and adaptation. In 8th International Meshing Roundtable, pages 133–141. Citeseer, 1999.
33. D.A. Field. Laplacian smoothing and Delaunay triangulations. Communications in Applied Numerical Methods, 4(6):709–712, 1988.
34. S. Soghrati, A. Nagarajan, and B. Liang. Conforming to interface structured adaptive mesh refinement: new technique for the automated modeling of materials with complex microstructures. Finite Elements in Analysis and Design, 125:24–40, 2017.
35. A. Nagarajan and S. Soghrati. Conforming to interface structured adaptive mesh refinement: 3D algorithm and implementation. Computational Mechanics, pages 1–26, 2018.
36. M. Yang, A. Nagarajan, B. Liang, and S. Soghrati. New algorithms for virtual reconstruction of heterogeneous microstructures. Computer Methods in Applied Mechanics and Engineering, 338:275–298, 2018.
37. V.P. Nguyen, M. Stroeven, and L.J. Sluys. Multiscale continuous and discontinuous modeling of heterogeneous materials: a review on recent developments. Journal of Multiscale Modelling, 3(04):229–270, 2011.
38. K. Park and G.H. Paulino. Cohesive zone models: A critical review of traction-separation relationships across fracture surfaces. Applied Mechanics Reviews, 64(6):060802, 2011.
39. H. Hooputra, H. Gese, H. Dell, and H. Werner. A comprehensive failure model for crashworthiness simulation of aluminium extrusions. International Journal of Crashworthiness, 9(5):449–464, 2004.
40. V.N. Van Do. The behavior of ductile damage model on steel structure failure. Procedia Engineering, 142:26–33, 2016.
41. Z.H. Karadeniz and D. Kumlutas. A numerical study on the coefficients of thermal expansion of fiber reinforced composite materials. Composite Structures, 78(1):1–10, 2007.