Parallel Tetrahedral Meshing

Célestin Marot

October 2, 2020

Thesis submitted in partial fulfillment of the requirements for the Ph.D. degree in Engineering Sciences

Thesis Committee:
Advisor: Jean-François Remacle (UCLouvain)
Jury: Christophe Geuzaine (ULiège), Philippe Chatelain (UCLouvain), Jonathan Lambrechts (UCLouvain), Hang Si (WIAS), Jeanne Pellerin (Total)
Chairperson: Thomas Pardoen (UCLouvain)

Institute of Mechanics, Materials and Civil Engineering


Acknowledgments

First of all, I would like to thank Jean-François Remacle for the trust he placed in me during these four years. Most professors would not have bet on me: I only had truly excellent grades for the projects that involved writing computer programs. Jean-François recognized this developer's quality, without paying attention to my rather average average. He then offered me the chance to do a Ph.D. under his wing. The bird in question is a bon vivant with a passion for rock music who, even though he never looks entirely serious, manages to supervise a dozen researchers effectively. It must be said that the ERC (European Research Council) advanced grant he had just landed allowed him to fill a nest of storks occupying almost the whole Euler building¹. I thank J-F for all these moments, all these conferences, all these discoveries, all this knowledge I was able to benefit from, and for his enthusiasm and interest in my research.

¹Our research lab, named after the famous mathematician Leonhard Euler.

I warmly thank Pierre-Alexandre, Matthieu and David. Let us start with the two characters in the office across the hall: P-A and Matthieu. P-A is a sociable fellow, always welcoming and fond of chatting. I thank him for listening so attentively when I described the meshing problems I encountered during this Ph.D. He shares his office with Matthieu, and the two of them can talk for a good half-day while I listen discreetly from my office on the other side of the corridor. Since we leave our doors open, I can stare at Matthieu and give him a questioning look as soon as I hear my first name come up in their conversation. I sometimes have to step through their doorway to give the same questioning look to P-A. To keep my two companions in good spirits, I never fail to drop a good off-color joke. Even if he will not deign to concede it, Matthieu is probably the one who laughs the most at my jokes, after which he often reproaches me for their inappropriateness. I therefore thank him for his relative hilarity, which allows me to keep an inflated opinion of my own humor. P-A is no slouch when it comes to laughing either; his laugh can be heard from kilometers away. With a bit of luck, this alarm signal summons David, the indomitable researcher from the office next door. Make no mistake: even though David is already married, and even though he will tell you that "in his day, researchers brought a pie for their arrival, their departure and their birthday" (which I fully approve of), David only started his thesis two years before me. His beardless face confirms that he is not that old. If I allow myself to be hard on these fine fellows, it is because they know that I am rarely serious, and that they are real friends with whom I have never had a falling-out.

I would also like to thank my office mates: Ange and his infectious good mood, and Ruiyang and his mysterious character that makes me smile. I also thank Alexandre and Arthur, with whom I shared very good times at the conferences in Stuttgart and Buffalo. Then come my other (former) colleagues: Kilian, Valentin, Philippe, Jovana, Maxence, Christos, Nathan, Ruili, Alexis, Amaury. I thank you all for the games of whist, rikiki and belote.

Of course, I thank the members of the jury for their attention, their comments and their questions, which were always very relevant and fair. To name them: Christophe Geuzaine, Jonathan Lambrechts, Jeanne Pellerin, Philippe Chatelain, Hang Si, my advisor Jean-François Remacle, and the chairperson of the jury, Thomas Pardoen. I thank Jeanne in particular, with whom I wrote my first real paper. I also wish to thank Christophe for the time he spent and for his help with the integration of my work into his software.

I also thank my buddies, Martin and Baptiste, always there to chat and have fun even if we go several months without any contact. We can always drop in on each other on a whim and, as always, have a good time. Bonus mention for Lucas and David, who would do well to show their faces after this pandemic, and Mr Mond on the other side of the world.

Finally, I thank my parents for raising me and accompanying me throughout my life, and my little sister for her affection. I also thank my two godsons, Briac and Eliot, as well as my nephew Renan, for being so cute and endearing. I also thank my in-laws, who always welcome me extremely warmly. Last of all, I thank the best of all, my love, Julie, who has always supported me, shares my life and puts up with me every day. Your smile is worth a thousand doctorates to me. This thesis would be 20 pages longer if I had to list everything I have to thank you for, so I will condense it into three words: I love you.

Célestin Marot

Contents

Contents iii

List of publications vii

1 Introduction 1
1.1 The finite element method ...... 2
1.2 The different types of mesh ...... 3
1.3 Delaunay-based tetrahedral mesh generation ...... 6
1.4 The need for speed ...... 12
1.5 Previous work ...... 13

2 Hilbert curve 21
2.1 Row-major and column-major order ...... 21
2.2 Locality of a curve ...... 23
2.3 Mapping floating-points to k-bit integers ...... 24
2.4 Bending space ...... 27
2.5 Z-order curve ...... 29
2.6 2D Hilbert curve ...... 31
2.7 3D Hilbert curve ...... 41
2.8 Other Implementation of Hilbert ordering ...... 44
2.9 SFC partitioning ...... 46

3 Delaunay Triangulation 51
3.1 Bowyer-Watson algorithm ...... 52
3.2 Algorithm overview ...... 53
3.3 Mesh data structure ...... 56
3.4 Fast spatial sorting ...... 60
3.5 Improving Cavity: spending less time in geometric predicates ...... 62
3.6 Improving DelaunayBall: spending less time computing adjacencies ...... 65
3.7 About a ghost ...... 68


3.8 Serial implementation performances ...... 69

4 Parallel Delaunay 73
4.1 Related work ...... 73
4.2 A parallel strategy based on partitions ...... 74
4.3 Partitioning and re-partitioning with Moore curve ...... 76
4.4 Ensuring termination ...... 77
4.5 Data structures ...... 79
4.6 Critical operations ...... 80
4.7 Parallel implementation performances ...... 81
4.8 Addendum: research journey ...... 85

5 Mesh refinement 89
5.1 Small and medium size test cases on standard laptops ...... 91
5.2 Large mesh generation on many core machine ...... 92

6 Mesh Improvement 99
6.1 Mesh improvement operations ...... 101
6.2 Small Polyhedron Reconnection ...... 105
6.3 Growing SPR cavity ...... 112
6.4 Mesh improvement schedule ...... 114

7 Mesh generator's performance 119
7.1 Benchmarks ...... 123

8 Conclusion 129

Bibliography 131

Preface

This manuscript mostly targets researchers and developers—in fields related to computational geometry and more particularly mesh generation—looking for good ideas to accelerate their programs. Therefore, this manuscript does not intend to teach all the little intricacies of modern computers. It is assumed that the reader already knows the meaning of cache levels, RAM, registers, SIMD, threads, etc., and is familiar with the C language and its compilation process. However, most of this manuscript can still be understood with a good understanding of common data structures and algorithms. The required math level is kept pretty low. Simple images and code examples are preferred over heavy mathematical notations.

This manuscript is mainly based on previously published work, or work to be published. More specifically, Chapter 3, Chapter 4 and Chapter 5, on Delaunay, parallel Delaunay and mesh refinement, are, for the most part, a transcription of our paper entitled "One machine, one minute, three billion tetrahedra"[2][81]. The content of Chapter 6 and Chapter 7 is mainly taken from two papers:

• Reviving the Search for Optimal Tetrahedralizations[4][83]

• Quality tetrahedral mesh generation with HXT[3][82].

The topic tackled in Chapter 2 is the efficient generation of Hilbert-curve-based orderings. It is placed just after the introduction because the Hilbert curve is a cornerstone of this thesis, on which several building blocks will be constructed in the subsequent chapters. Looking back, implementing the 3D Hilbert curve efficiently has been one of the most wonderful programming experiences one can have. We hope to transmit our love for this beautiful beast.


List of publications

[1] Célestin Marot, Jeanne Pellerin, Jonathan Lambrechts, and Jean-François Remacle. Toward one billion tetrahedra per minute. Procedia Engineering, page 5, 2017.

[2] Célestin Marot, Jeanne Pellerin, and Jean-François Remacle. One machine, one minute, three billion tetrahedra. International Journal for Numerical Methods in Engineering, 117(9):967–990, 2019.

[3] Célestin Marot and Jean-François Remacle. Quality tetrahedral mesh generation with HXT, August 2020.

[4] Célestin Marot, Kilian Verhetsel, and Jean-François Remacle. Reviving the Search for Optimal Tetrahedralizations. In Proceedings of the 28th International Meshing Roundtable, Buffalo, New York, USA, February 2020. Zenodo.

[5] Jean-François Remacle and Célestin Marot. Multi-threaded mesh generation. In 13e colloque national en calcul des structures, Giens, Var, France, May 2017. Université Paris-Saclay.


Chapter 1

Introduction

For more and more companies, working in increasingly diverse fields, numerical simulation is now an essential tool that replaces costly, high-risk or simply impossible (a climate simulator, for example) real-life experiments. Millions of simulations of molecular dynamics and chemical reactions, of fluid dynamics, of multibody systems, of material resistance, deformation and rupture, of manufacturing systems, of cosmology and much more are carried out every day throughout the world. If a company wants to create a mechanical part through a casting process, chances are that the design of the object has been assisted by a computer-aided design (CAD) system. The resistance of the modeled object can be tested numerically. The molding process can also be simulated, which allows the optimum position of the necessary air vents to be determined to avoid air bubbles inside the mold[121, 129]. The entire temperature field in the cast material can also be obtained, for both the casting and the cooling phases[18]. The temperature gradient in the material during its cooling process can influence its strength[120], so the insight gained with simulation can be crucial. However, computer simulations are not used to their true value in foundries, mainly due to the lack of qualified engineers[117, 96]. On the other hand, the automotive and aeronautics industries make extensive use of numerical simulation, which has proven its effectiveness in these sectors for many years[119]. Engineers optimize a model through repeated testing and modifications. With recent advances in simulation and manufacturing technologies, this design process cycle can be fully digitally automated, including automated model modifications. This technique that allows numerical simulations to dictate the design of a product is called topology optimization. Topology optimization is still in its infancy and is still mostly used only at the early design stage to help discover new design possibilities. However, it will certainly be an integral part of everyday life for most design offices once the technology has matured.


1.1 The finite element method

Most continuous physical phenomena behave in ways that can be expressed with partial differential equations (PDEs). PDEs describe how the variations of variables, including space and time, are linked together. For example, one of the simplest physical partial differential equations is the heat equation in an isotropic and homogeneous medium, without radiation or any heat source:

$$\frac{\partial T}{\partial t} = \alpha \nabla^2 T = \alpha \left( \frac{\partial^2 T}{\partial x^2} + \frac{\partial^2 T}{\partial y^2} + \frac{\partial^2 T}{\partial z^2} \right)$$

This equation links the temporal variation of temperature at any point with the spatial variation of temperature around that point. If a domain, boundary conditions and initial conditions are specified, it is possible to simulate, at least approximately, the evolution of each variable through time, in all regions of the domain. The most popular method for simulating PDEs is the finite element method. The finite element method (FEM) subdivides the domain into smaller parts called finite elements. In simplified terms, FEM creates a local approximation of the PDE on elementary shapes—triangles or quadrilaterals in 2D, tetrahedra, pyramids, prisms or hexahedra (topological cubes) in 3D—and stitches those smaller problems together into a global system of equations that covers the whole domain. FEM is a versatile tool that can be applied to a variety of different problem settings. For example, Figure 1.1 shows the result of a simulation of a car collision with the finite element method. Studying or analyzing a phenomenon with FEM is often referred to as finite element analysis (FEA).

In contrast with the domain in continuous geometric space, the set of all finite elements forms a numerical domain with a finite number of points. This particular space discretization that FEM uses is called a mesh. One of the great advantages of the finite element method over some simpler methods, like the finite difference method (FDM), is its ability, thanks to the mesh, to describe complex boundaries of the domain accurately, without any additional change to the finite element formulation. Therefore, a robust PDE solver using FEM works on a variety of different geometries: only the mesh, the boundary conditions and the initial conditions need to be modified. However, creating high-quality meshes that respect the boundaries of a domain is an incredibly complicated process. The size, shape and general layout of elements inside a mesh greatly influence the reliability of a finite element simulation. If a dihedral angle on the outer surface of an element is too large, it can induce tremendous errors in a simulation, to the point of making it unusable[111]. A mesh generator must therefore carefully ensure sufficient mesh quality, the quality of a mesh being its propensity not to cause numerical errors, in order to obtain reliable numerical simulations in industrial applications.

Figure 1.1: Visualization of how a car deforms in an asymmetrical crash. The simulation was carried out using the finite element method.

1.2 The different types of mesh

Types of element Elements of a 3D FEM mesh are either tetrahedra, hexahedra, prisms or pyramids. A tetrahedral mesh is a mesh that contains only tetrahedra and a hexahedral mesh is a mesh that contains only hexahedra. In 2D, elements are either quadrilaterals or triangles.

Conformal or non-conformal A mesh is conformal if the pairwise intersection of any two entities is either a lower-dimensional entity with the same points or is empty. Conformal meshes are almost always preferred in FEM, because they allow for a simpler FEM formulation, increasing both the simulation speed and accuracy.

Unstructured, structured, block-structured, semi-structured In contrast to an unstructured mesh, the elements of a structured mesh follow a repetitive pattern: the local topology/connectivity is constant. Structured meshes generally have the topology of a grid, of quadrilaterals in 2D and of hexahedra in 3D. In a multi-block structured mesh or block-structured mesh, the domain is first decomposed into a series of blocks, that have the topology of squares or cubes, and each block is meshed using a structured grid of quadrilateral or hexahedral elements (see Figure 1.2b). As a comparison, an unstructured triangular mesh is shown in Figure 1.2a. Because the connectivity in structured parts of a mesh is regular, it makes data management easier and it allows for more lightweight mesh data structures. Computations on structured meshes can therefore be more memory efficient, simpler and faster. Some 3D meshes contain a structured, hexahedral part, and an unstructured, tetrahedral part. To get a conformal mesh, the connection between these two types of elements is realized with pyramids. A mesh can also be fully hexahedral but have structured and unstructured parts. Finally, a mesh can be structured in only one direction, i.e. semi-structured, as is often the case when meshing the so-called boundary layer¹ in computational fluid dynamics (CFD). Those semi-structured parts are made of prismatic layers. In the direction perpendicular to the layers of prisms, the mesh is actually structured.

¹The boundary layer is a zone near a boundary where there are high gradients of temperature or momentum normal to the boundary surface.

(a) unstructured triangular mesh (b) multi-block structured quadrilateral mesh

Figure 1.2: Two different 2D meshes of the same domain

The advantage of unstructured tetrahedral meshes What makes tetrahedral meshes popular is that they are far simpler to generate. Indeed, not all volumes can be meshed with hexahedra. For example, if there is an odd number of quadrilateral facets on the surface mesh, then the volume cannot be meshed with hexahedra without modifying the surface mesh. In any case, finding a way to fill a volume with high-quality hexahedra is generally extremely difficult. Moreover, to really have an advantage over tetrahedral meshes, hexahedral meshes had better be structured. Generating a structured mesh is even more complicated and generally requires repeated, slow human interventions. Hexahedral meshes without a regular structure are not intrinsically better than tetrahedral meshes[105]. Some simulations even work far better with tetrahedral meshes[126].

Figure 1.3: Tetrahedral meshes created with the tetrahedral mesh generator described in this thesis.

Instead of wasting time and money to create the perfect mesh, engineers have to find a balance between the time savings that a better mesh would provide to a simulation and the additional time it requires to create that mesh. The speed gain that a hexahedral mesh can provide is usually not worth the hassle. High-quality tetrahedral meshes can indeed be created automatically, without requiring any human intervention once a proper CAD model has been created, and are suitable for most simulations. Examples of tetrahedral meshes that were created automatically using our tetrahedral mesh generator are shown in Figure 1.3.

1.3 Delaunay-based tetrahedral mesh generation

The most effective techniques for creating high-quality unstructured tetrahedral meshes are based on the theoretical foundations provided by the Delaunay triangulation. A triangulation in R^d, here, is a d-dimensional simplicial complex. In other words, it is a set of simplices, i.e. a generalization of the notion of a triangle or tetrahedron to arbitrary dimensions, where the intersection of any two simplices is either empty or a lower-dimensional simplex with the same vertices. The Delaunay triangulation is a particular type of simplicial complex which is the dual of the Voronoi diagram[43, 93]. The Voronoi diagram of a set of n points S = {p1, p2, . . . , pn} in R^d is a partition of R^d into a set of n Voronoi cells, shown in different colors in Figure 1.4. Each Voronoi cell Ci, i ∈ {1, 2, . . . , n}, is associated to a point pi called the seed or the generator of that Voronoi cell. The cell Ci is the region of space consisting of all points closer to pi than to any other point of S.
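Written in symbols, the definition of a Voronoi cell given above reads:

$$ C_i = \{\, x \in \mathbb{R}^d \;:\; \|x - p_i\| \le \|x - p_j\| \quad \forall j \in \{1, \dots, n\} \,\} $$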

Figure 1.4: A 2D Voronoi diagram and its dual, the Delaunay triangulation.

As the Delaunay triangulation is the dual of the Voronoi diagram, if neighboring points are defined as points whose d-dimensional Voronoi cells share a (d − 1)-dimensional face, then the Delaunay triangulation of a point set is the triangulation that connects each pair of neighboring points with an edge. The triangulation is shown in dotted lines in Figure 1.4. Properties of the Delaunay triangulation derive directly from this definition. If a point set S in R^d is in general position, which in this case means that no group of d + 2 points lies on the same hypersphere, then the Delaunay tetrahedralization DT(S) is unique. If points are not in general position, it is however possible to apply a symbolic perturbation that uses the indices of the points such that they never appear colinear, coplanar or cospherical[44]. This technique, which copes with degenerate cases in geometric algorithms, in arbitrary dimensions, is called the simulation of simplicity by Edelsbrunner. When the simulation of simplicity is used properly, the following circumsphere property holds true for the Delaunay triangulation of any input point set:

Property 1 Let a point set S and a simplex τ with vertices from S. If the circumsphere of τ contains no points of S other than the vertices of τ, then τ is part of the Delaunay triangulation: τ ∈ DT(S). Inversely, if the circumsphere of τ contains another point of S, then τ ∉ DT(S).
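In practice, Property 1 is evaluated through the sign of a determinant. The following sketch is a naive, non-robust floating-point version of the 3D insphere test (the function name and argument convention are ours, not taken from the thesis code; the robust predicates actually used are discussed in Chapter 3). It returns a positive value when e lies inside the circumsphere of the positively oriented tetrahedron (a, b, c, d):

```c
typedef struct { double x, y, z; } vec3;

/* Returns a positive value when e lies strictly inside the circumsphere of the
 * tetrahedron (a, b, c, d), a negative value when it lies outside, and zero when
 * the five points are cospherical. The tetrahedron is assumed positively oriented.
 * Naive floating-point evaluation: not robust against round-off errors.          */
static double insphere(vec3 a, vec3 b, vec3 c, vec3 d, vec3 e)
{
    double ax = a.x - e.x, ay = a.y - e.y, az = a.z - e.z;
    double bx = b.x - e.x, by = b.y - e.y, bz = b.z - e.z;
    double cx = c.x - e.x, cy = c.y - e.y, cz = c.z - e.z;
    double dx = d.x - e.x, dy = d.y - e.y, dz = d.z - e.z;

    double al = ax * ax + ay * ay + az * az;   /* "lifted" (squared-norm) coordinates */
    double bl = bx * bx + by * by + bz * bz;
    double cl = cx * cx + cy * cy + cz * cz;
    double dl = dx * dx + dy * dy + dz * dz;

    /* 4x4 determinant expanded along the lifted column, via 3x3 minors */
    double abc = ax * (by * cz - bz * cy) - ay * (bx * cz - bz * cx) + az * (bx * cy - by * cx);
    double abd = ax * (by * dz - bz * dy) - ay * (bx * dz - bz * dx) + az * (bx * dy - by * dx);
    double acd = ax * (cy * dz - cz * dy) - ay * (cx * dz - cz * dx) + az * (cx * dy - cy * dx);
    double bcd = bx * (cy * dz - cz * dy) - by * (cx * dz - cz * dx) + bz * (cx * dy - cy * dx);

    return dl * abc - cl * abd + bl * acd - al * bcd;
}
```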

Therefore, when inserting a new point pk+1 into DTk(S), only the simplices whose circumsphere contains pk+1 need to be modified. The set of simplices that need to be modified forms a star-shaped polyhedral cavity C(DTk, pk+1) whose kernel contains pk+1². There is only one way to retriangulate that polyhedron to include the new point pk+1 while maintaining the Delaunay property: each boundary facet of the polyhedron needs to be connected to pk+1, forming new simplices. Those new simplices form the Delaunay ball B(DTk, pk+1). Because pk+1 is in the kernel of the cavity, none of these new simplices is inverted. This process of inserting a point into a Delaunay triangulation is illustrated in 2D in Figure 1.5. The incremental algorithm that builds a Delaunay triangulation by repeated insertions is called the Bowyer-Watson algorithm, as it was discovered independently by Bowyer [25] and Watson [124]. Modern implementations of the Bowyer-Watson algorithm work as follows:

1. the simplex τ that contains the new point pk+1 can be found efficiently using a space-partitioning data-structure like an octree. If the order of the inserted points is cleverly chosen such that two consecutive points are close to each other, there is no need for a coarser space-partitioning data-structure than the mesh itself. One can walk toward τ, going from one simplex to the next (see Chapter 3 for a more detailed description of the WalK).

2Formally, a polygon P is star-shaped if there exists a point z such that for each point p of P the segment zp lies entirely within P. The set of all points z with this property (that is, the set of points from which all of P is visible) is called the kernel of P.

2. the set of simplices whose circumsphere contains pk+1 is found with a breadth-first search starting from τ. The polyhedral hole that they fill is the cavity.

3. the new simplices are created by connecting the boundary facets of the cavity to pk+1.

4. the adjacencies between newly created simplices can be computed thanks to a hash table or a simple array if the number of new simplices is small enough.

Chapter 3 details the Bowyer-Watson algorithm more thoroughly for the 3D case. For example, it explains how to handle the robustness issues that can arise from floating-point computations, as well as how to handle the insertion of points outside of the current convex hull, which are hence not contained in any tetrahedron. The data structures and implementation details are also given.
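To make steps 1–4 concrete, here is a minimal 2D sketch of one Bowyer-Watson insertion. It is not the implementation of Chapter 3: it replaces the walk of step 1 by a brute-force scan of all triangles, stores triangles in a flat array, and uses a naive, non-robust incircle predicate, so it is only meant to illustrate the cavity and Delaunay ball mechanics.

```c
#include <stdlib.h>

typedef struct { double x, y; } Pt;
typedef struct { int a, b, c; } Tri;   /* vertex indices, counterclockwise */

/* > 0 if p lies strictly inside the circumcircle of the CCW triangle (a, b, c).
 * Naive floating-point evaluation of the incircle determinant: not robust.   */
static double incircle(Pt a, Pt b, Pt c, Pt p)
{
    double ax = a.x - p.x, ay = a.y - p.y;
    double bx = b.x - p.x, by = b.y - p.y;
    double cx = c.x - p.x, cy = c.y - p.y;
    return (ax * ax + ay * ay) * (bx * cy - cx * by)
         - (bx * bx + by * by) * (ax * cy - cx * ay)
         + (cx * cx + cy * cy) * (ax * by - bx * ay);
}

/* One Bowyer-Watson insertion: add point pts[v] to the triangulation tri[0..*nt-1].
 * The caller must make sure tri has room for the two extra triangles created by
 * each insertion of a point lying strictly inside the triangulated region.      */
static void insert_point(const Pt *pts, int v, Tri *tri, int *nt)
{
    int ne = 0;
    int (*edge)[2] = malloc(3 * (size_t)*nt * sizeof *edge);

    /* Steps 1-2: delete the triangles whose circumcircle contains the new point
     * (the cavity, found here by brute force instead of the walk) and keep the
     * directed edges that appear only once: they form the cavity boundary.     */
    for (int t = 0; t < *nt; ) {
        Tri T = tri[t];
        if (incircle(pts[T.a], pts[T.b], pts[T.c], pts[v]) > 0.0) {
            int e[3][2] = {{T.a, T.b}, {T.b, T.c}, {T.c, T.a}};
            for (int i = 0; i < 3; i++) {
                int matched = 0;
                for (int j = 0; j < ne; j++)
                    if (edge[j][0] == e[i][1] && edge[j][1] == e[i][0]) {
                        edge[j][0] = edge[--ne][0];      /* internal edge: drop it */
                        edge[j][1] = edge[ne][1];
                        matched = 1;
                        break;
                    }
                if (!matched) { edge[ne][0] = e[i][0]; edge[ne][1] = e[i][1]; ne++; }
            }
            tri[t] = tri[--*nt];                         /* delete the cavity triangle */
        } else {
            t++;
        }
    }

    /* Steps 3-4: connect every boundary edge of the cavity to the new point,
     * forming the Delaunay ball.                                              */
    for (int j = 0; j < ne; j++) {
        Tri T = { edge[j][0], edge[j][1], v };
        tri[(*nt)++] = T;
    }
    free(edge);
}
```

Starting from a single bounding triangle (vertices 0, 1, 2, counterclockwise, chosen large enough that its vertices stay outside all circumcircles of the final triangulation) and calling insert_point for every remaining point yields the Delaunay triangulation of the enclosed points, once the triangles touching the bounding vertices are discarded.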


Figure 1.5: The insertion of a point pk+1 into a Delaunay triangulation. On the left, a cavity C(DTk, pk+1) is formed from the simplices whose circumsphere contains pk+1. On the right, the cavity is modified to include the new point while maintaining the empty circumsphere property for all simplices.

To construct a Delaunay triangulation, one can construct the Voronoi diagram directly and then compute its dual. Let ∂αi be the bisecting hyperplane between a point p and the seed pi of a neighboring cell. The intersection of all the possible half-spaces αi, obtained by removing the side of ∂αi that does not contain p, is the Voronoi cell of p. However, we cannot know in advance which points will be seeds of neighboring cells, except for the point closest to p, which is necessarily in a neighboring cell. An algorithm therefore has to make educated guesses, using the k closest points for constructing the intersection of k half-spaces to compute a current approximation of the cell C̃, which is possibly too large. Then, most other points can usually be filtered out pretty quickly because their associated bisecting hyperplane does not intersect C̃. An algorithm can iteratively clip C̃ using the bisector with the next closest point which has not yet been filtered out. In practice, algorithms that construct the Voronoi diagram using such an iterative half-space clipping algorithm[78, 101] are only used for very small datasets, or when the points are uniformly distributed such that two neighboring cells have seeds that are necessarily close. If there is a guarantee that the seeds of all neighboring cells are among the k-nearest neighbors, the Voronoi diagram can be obtained extremely quickly on the GPU[97]. However, the complexity of those algorithms skyrockets if the seed of a neighboring cell is not close to p. In addition, when used for constructing Delaunay triangulations, the computation of the dual requires a heavy conversion between Voronoi and mesh data structures, which greatly impacts the performance of this strategy.

Another possibility for constructing Delaunay triangulations consists in building the 4D convex hull of the points lifted to the paraboloid, i.e. where a fourth coordinate x² + y² + z² is added to each point[59, 43]. With the help of Figure 1.6, we can visualize the equivalence. Notice that the intersection of any plane with the paraboloid f(x, y) = x² + y² forms an ellipse. If we project that ellipse onto the xy plane, or more simply if we look at it from the top or bottom, the ellipse becomes a circle. Therefore, testing whether a point dl, which is the point d lifted onto the paraboloid, is above or below the plane defined by three other lifted points {al, bl, cl} is equivalent to testing if d is in the circumcircle of the triangle {a, b, c}. The facets that are on the lower part of the convex hull hence correspond to triangles that have an empty circumcircle. This equivalence between the Delaunay triangulation and the convex hull of the points lifted to the paraboloid makes it possible to use any convex hull algorithm for creating Delaunay triangulations. Property 1 totally defines the Delaunay triangulation, meaning that any triangulation of a point set that has this property is the unique Delaunay triangulation of that point set. It is therefore also the dual of a Voronoi diagram and it is topologically equivalent to the lower part of the convex hull of the points lifted to the paraboloid.

It is often interesting to interpret a (d+1)-dimensional convex hull algorithm as a d-dimensional Delaunay triangulation algorithm. For example, the gift wrapping algorithm applied to the parabolic lifting map grows a Delaunay triangulation, starting from an initial simplex, without ever deleting any simplex.


Figure 1.6: The function f : (x, y) → (x, y, x² + y²) lifts any point to the paraboloid: p → pl = f(p). Testing whether a point d is inside or outside the circumcircle of the triangle {a, b, c} is equivalent to testing whether or not the lifted point dl is below the plane on which the lifted triangle {al, bl, cl} lies.
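In determinant form, and with the notation of Figure 1.6, this equivalence is the standard incircle predicate: for a counterclockwise triangle {a, b, c}, the point d lies inside its circumcircle, i.e. dl lies below the plane of {al, bl, cl}, exactly when

$$
\begin{vmatrix}
a_x - d_x & a_y - d_y & (a_x - d_x)^2 + (a_y - d_y)^2\\
b_x - d_x & b_y - d_y & (b_x - d_x)^2 + (b_y - d_y)^2\\
c_x - d_x & c_y - d_y & (c_x - d_x)^2 + (c_y - d_y)^2
\end{vmatrix} > 0.
$$

The 3D insphere test used by the Bowyer-Watson algorithm is the analogous 4 × 4 determinant on the points lifted by x² + y² + z².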

Instead of adding points in a predetermined order, the gift wrapping algorithm adds the point on the paraboloid which forms the minimum turning angle with a selected facet of the last simplex. In other words, if applied directly to the non-lifted points, the point added is the one which, connected to a facet of the last simplex, forms a simplex with the smallest circumscribed sphere[41]. With those simplifications and optimizations that are specific to Delaunay triangulations, the algorithm is not really akin to gift wrapping anymore, but it still creates only Delaunay simplices that will never be deleted. Algorithms that directly construct simplices whose circumsphere is empty, one after the other, are called incremental construction algorithms by Cignoni et al.[36]. However, we find this term confusing, since the Bowyer-Watson algorithm is also incremental in its own way. We hence call these algorithms simplex-preserving algorithms.

A last method for building Delaunay triangulations is to build a non-Delaunay or quasi-Delaunay tetrahedralization and repair it using the star splaying algorithm of Shewchuk in order to recover the empty circumsphere property. In simplified terms, star splaying modifies the adjacency of a point such that adjacent facets form a convex cone when lifted to the paraboloid. We can only strongly recommend Shewchuk's paper[112] to the interested reader who wishes to fully understand this method. Star splaying is very costly if a lot of points are locally non-Delaunay, i.e. inside the circumsphere of another simplex. Hence, it is generally only used for transforming quasi-Delaunay triangulations, where only few points are locally non-Delaunay. Nonetheless, star splaying is, to our knowledge, the only local improvement algorithm that can construct Delaunay triangulations in arbitrary dimensions.

Other methods are available in two dimensions, like Fortune's sweep-line algorithm[46] or the divide and conquer algorithm of Shamos and Hoey[107], but they currently cannot be extended to 3D without complexity issues. Here is a little summary of all the different types of methods that can be used to create n-dimensional Delaunay triangulations relatively efficiently:

• the Bowyer-Watson algorithm inserts points in a predefined order, maintaining a valid convex triangulation through the whole process.

• Any convex hull algorithm can be used to create a Delaunay triangulation. Indeed, if the point set has been lifted to the paraboloid (see the parabolic lifting map in Figure 1.6), the lower part of the convex hull directly corresponds to the Delaunay triangulation.

• Iterative half-space clipping algorithms create the Voronoi diagram by intersecting half-spaces to construct each Voronoi cell. The Delaunay triangulation is simply the dual of the Voronoi diagram and is hence obtained by connecting the Voronoi sites whose Voronoi cells share a facet.

• Simplex-preserving algorithms iteratively find simplices that have empty circumspheres. For example, the gift wrapping algorithm applied to the lifted points, in addition to being a convex hull algorithm, is also a simplex-preserving algorithm.

• Local improvement algorithms start with an arbitrary triangulation of the point set and apply local modifications around simplices that are locally non-Delaunay to recover the Delaunay property. In 2D, simple edge flips work for transforming any triangulation into a Delaunay triangulation, resulting in Lawson's algorithm[72]. In higher dimensions, star splaying is the most effective way of recovering a Delaunay triangulation.

Delaunay tetrahedralizations (3D) are relatively easy to construct and are usually high-quality meshes, except for certain elements called slivers[111]. Because generating Delaunay meshes is fast, it is generally faster to create a Delaunay mesh and then modify it to remove bad elements than it is to create a high-quality mesh from scratch. A Delaunay-based mesh generator therefore inserts points into a Delaunay mesh, while retaining the Delaunay properties as far as possible, thanks to a part of the program called the Delaunay kernel. This thesis, which aims at improving the speed of tetrahedral mesh generation, therefore focuses on several critical parts:

• Chapter 2 introduces a crucial mathematical object, the Hilbert curve, which is a theoretical cornerstone for the performance improvements described in subsequent chapters. Indeed, the Hilbert curve is at the heart of our parallelization strategy, while also being relevant for the performance of point location inside the Delaunay kernel.

• Chapter 3 gives the implementation details of a very efficient sequential Delaunay kernel using the Bowyer-Watson algorithm.

• Chapter 4 addresses the parallelization of the Delaunay kernel using the theoretical foundation provided by the Hilbert curve.

• Chapter 5 tackles the integration of the parallel Delaunay kernel in a parallel mesh generator.

• Chapter 6 details efficient parallel quality optimizations that make it possible to remove slivers and are integrated in a full-fledged mesh generator.

• Chapter 7 compares the performance of our mesh generator with reference open-source implementations.

The rest of this introduction explains why we need to speed up tetrahedral mesh generation and details the current state of the art.

1.4 The need for speed

With the right meshes, a parallel finite element method exhibits almost perfect weak scaling. This means that a simulation that is twice as big, run on two CPUs instead of one, does not last longer. As a matter of fact, popular libraries are able to run large finite element simulations on millions of CPU and GPU cores and obtain million-fold speedups[13, 12, 45]. The communications that happen during a FEM simulation between different mesh entities are totally predictable. The amount of work per element is also pretty much constant. Hence, algorithms are devised that divide the domain into thread partitions in a way that minimizes the overhead induced by the communications and balances the workload between CPU cores.

In contrast, at the beginning of a 3D Delaunay-based mesh generation process, when only the meshes of the surfaces have been created, making more partitions than the number of distinct volumes is almost impossible. When a coarse mesh has been created, and dividing elements between CPU cores becomes possible, multiple issues arise. The first big issue is that we cannot know for sure where a mesh modification will need to be applied. It is therefore impossible to make perfect partitions in advance, and synchronization primitives are always needed to prevent conflicts. Furthermore, the workload on a partition can only be estimated very roughly: refining or optimizing a certain region of the mesh might be far more demanding than for another region. For these reasons, Delaunay-based mesh generation usually does not scale well, even with fewer than a hundred threads.

Because simulations get bigger and bigger with the availability of more and more computing power, the need for faster high-quality mesh generation methods becomes more and more crucial. Mesh generation is now a clear bottleneck in the typical FEA pipeline. Hence, we conducted a four-year research effort whose goal was to parallelize and optimize all stages of tetrahedral mesh generation.

1.5 Previous work

The heart of Delaunay-based tetrahedral mesh generation is the Delaunay kernel. The parallelization of the Delaunay kernel is a problem that has received a lot of attention in the last 30 years. However, very little, if any, attention has really been paid to the parallelization of the other components of a tetrahedral mesh generator. Therefore, in this section dedicated to the state of the art on parallel tetrahedral mesh generation, we only present previous work concerning the parallelization of the 3D Delaunay kernel. Some parallelization strategies can easily be extended to accelerate any local mesh operation like mesh improvement operations, others cannot. Parallel tetrahedral mesh generators that are not based on Delaunay triangulation exist. For example, some generators are based on the advancing front technique[79, 125]. These methods are either slower or generate lower quality elements. It is notably easier to parallelize mesh generators based on the subdivision of octree cells into tetrahedra[38].

Lock based strategies

The Bowyer-Watson algorithm is the fastest for creating Delaunay triangulations. However, it is not the simplest to parallelize. Parallelizing a single point insertion of the Bowyer-Watson (B-W) algorithm is very difficult and generally inefficient because there are only about twenty tetrahedra to be modified and the algorithms to find or to operate on those tetrahedra cannot be parallelized efficiently for such tiny inputs. Therefore, to parallelize the B-W algorithm, several insertions must be performed simultaneously by different threads. The key to parallelizing multiple insertions is to avoid race conditions—which happen when a thread needs to access a tetrahedron which is modified by another thread—with the least overhead possible. The size of the cavities is however not fixed: they can repeatedly cover the whole mesh, whatever the number of points already inserted. Hence no implementation of the B-W algorithm can guarantee a perfect parallelization. However, from our empirical studies, cavities have an average size of 20 tetrahedra if the points are well distributed.

The simplest strategy to parallelize multiple insertions simply avoids race conditions through the use of locks or any other mutual exclusion mechanism. Locks are associated to partitions or parts of the mesh that might be modified. A thread must acquire a lock to modify or read that part of the mesh, ensuring that there is no concurrent access to shared resources. Because a thread might need multiple locks to insert a point, it is important that the threads do not all wait for a lock to be released by another thread, provoking a deadlock. Lock conflicts can be handled with priority locks, where each thread is given a unique priority. If the acquiring thread has a lower priority than the locking thread, it aborts its point insertion and releases all its locks instead of waiting. The retreating thread will try to insert the point at another moment. This strategy is used both in CGAL[16]³ and Geogram[73]. They report speedups of 5 on 8 threads and we measured a speedup between 5 and 10 on 64 threads. Actually, there is a point where adding more threads to either program makes them run slower. Blandford et al.[20] and Foteinos and Chrisochoides[47] do not use priorities: any thread that tries to acquire a lock which is already in use by another thread releases all its locks, and the point to insert is placed at the end of the insertion queue. We cannot test those programs, as they are not open-source, but their performance does not seem to be particularly better than that of CGAL and Geogram. For example, Foteinos and Chrisochoides report a speedup of 13.22 over the sequential version of CGAL on 48 threads. Another implementation was devised by Kohout et al., containing 3 different strategies[69]:

• the one of CGAL and Geogram, using priority locks.

• an optimistic method, where threads wait to acquire locks but there is a deadlock detection mechanism in a critical section which detects loops in the "thread x waiting for thread y" directed graph. A thread has to retreat only if there is a possibility of deadlock.

• a burglary method, which works like the optimistic method, but there is no retreat possible. Threads that were in conflict will all finish their insertion, one after the other. Because the cavity changes in the meantime, the algorithm sometimes abandons the Delaunay properties in favor of a reduced waiting time. The final mesh is therefore quasi-Delaunay.

³Few people know that there is a parallel Delaunay triangulation algorithm in CGAL. It is not to be confused with the sequential algorithm, which is also in CGAL.

Kohout et al. give the speedups obtained with different numbers of threads but not any timings, which makes us question why the clock speeds of the different processors used are given… Again, the code is not provided.

The big advantage of lock-based parallelization strategies is their simplicity: just execute the plain old B-W algorithm. The other big advantage is their versatility. As mentioned earlier, tetrahedral mesh generation does not only refine a mesh through Delaunay insertions. After the refinement, a mesh improvement step is necessary to eliminate certain tetrahedra of bad quality. This improvement of the mesh is achieved by local modifications around the bad tetrahedra. Lock-based strategies can be easily adapted to parallelize these local modifications as well. Because the mesh is shared instead of being split, lock-based strategies are usually only suitable for shared memory architectures. Indeed, as can be seen from the results, the synchronization overhead which is inevitable when using locks already hurts the scalability of these methods on shared-memory systems. On distributed memory systems, the synchronization overhead would be even bigger.

Divide and Conquer strategies

Recall that finding the Delaunay tetrahedralization is equivalent to finding a 4D convex hull where a parabolic lifting map (x, y, z) ∈ R³ → (x, y, z, x² + y² + z²) ∈ R⁴ has been applied to the points in 3D. Because it is possible to merge two convex hulls into one[92, 9], it is also possible to apply a divide and conquer strategy for constructing Delaunay tetrahedralizations. Most divide and conquer algorithms divide a point set into two subsets, by bisecting it with a plane. The goal is to create the Delaunay triangulation of the subsets on the left and right of the plane independently.

Cignoni et al. devised an algorithm, called DeWall, that first builds all the Delaunay simplices that intersect the bisecting plane p, using a simplex-preserving algorithm[36]. These simplices form a simplex wall that divides the domain in two parts. Because it is guaranteed that the circumsphere of those simplices does not contain any point, the Bowyer-Watson or any Delaunay algorithm can be used on either side, and the simplex wall will never be modified. The left and right domains can also be further divided by building another simplex wall. However, the construction of the simplex wall is not efficient compared to B-W insertions. Indeed, finding the next point to add to form a new simplex of the simplex wall theoretically has an O(n) computational complexity with n = |S|. If the DeWall algorithm is used from top to bottom in 3D, i.e. dividing until all the points are in simplex walls, DeWall theoretically runs in O(nm), where m is the number of tetrahedra. In the worst case, it can thus run in O(n³) in 3D.

Blelloch et al. use quite a similar approach[22]. Instead of building an n-dimensional simplex wall, they build an (n−1)-dimensional facet wall. That facet wall is made of all the final Delaunay facets that cross the bisecting plane p. To find those Delaunay facets, the whole point set is lifted to the paraboloid and projected onto the bisecting plane, which has been extended in the new dimension introduced by the parabolic lifting map. The lower convex hull of the projected points gives the edges of the final Delaunay triangulation that cross the bisecting plane. It is possible to adapt the Bowyer-Watson algorithm pretty easily, using ghost triangles (see Section 3.7), such that it is able to start from the facet wall and insert points on one side of the facet wall. Lee et al. use the exact same method, but instead of using Bowyer-Watson, they use a simplex-preserving algorithm on both sides in parallel[102]. Having to solve a (d − 1)-dimensional convex hull problem as a sub-routine to build a facet wall is not ideal, but it is theoretically not worse than building a simplex wall.

The most recent D&C approaches rely on merging, but require more theoretical background. Let a point set S in R³ with a plane p in its middle, that bisects S into two parts S1 and S2, respectively the set of points on the left and on the right of the plane p. If we create the Delaunay tetrahedralization of S1, DT(S1), we know that the circumsphere of any tetrahedron in DT(S1) does not contain any point of S1, by Property 1. It is also obvious that a tetrahedron of DT(S1) is only in the Delaunay tetrahedralization of the whole point set, DT(S), if its circumsphere also does not contain any point of S2, and hence does not contain any point of S. If the circumsphere of a tetrahedron of DT(S1) does not intersect the bisecting plane p, it certainly does not contain any point on the other side of p. Therefore, all the tetrahedra of DT(S1) whose circumsphere does not intersect p are in DT(S). The same can be said for the tetrahedra in the Delaunay tetrahedralization of S2. The complicated part consists in finding the other tetrahedra of DT(S), whose circumscribed sphere intersects the bisecting plane p. Those missing tetrahedra form a simply connected polyhedral cavity. Let's create a third point set SB, which contains every point that is not in a final tetrahedron. Chen et al.[32] proved three properties which allow finding those missing tetrahedra:

• If a tetrahedron τ ∈ DT(SB) is also in DT(S1) or in DT(S2), then it is a tetrahedron of DT(S).

• If a tetrahedron τ ∈ DT(SB) intersects p, then it is also a tetrahedron of DT(S).

• All missing tetrahedra of DT(S) can be found using the two properties above.

This interesting theoretical result is used by Funke and Sanders in order to build a promising Delaunay triangulation program[54]. Their implementation is fully parallel. Yet, for uniformly distributed point sets, they only obtain a speedup of 260 on 2048 cores. Their program only inserts up to one million points per second on the same number of cores. For non-uniform point sets, they attain an even worse speedup of only 18 due to the lack of load-balancing in their current implementation, and hence obtain a throughput that can be reached by fast sequential implementations.

Lo uses an innovative strategy, quite different from other "divide and conquer" strategies because the partitions are not fixed[76]. The space is divided into a grid of small cells. A larger grid determines different zones that will be associated to each thread. A zone is therefore a group of cells. Each thread will first insert the points of its zone, to create a triangulation completely independent from the others. At this point, many tetrahedra of the final triangulation whose circumsphere intersects the zone boundary are missing in every independent triangulation. Only the tetrahedra whose circumscribed sphere does not intersect the zone boundary are actually finalized. The core of Lo's tactic is to enlarge the zone by adding layers of cells if a sphere circumscribed to a tetrahedron of the original zone intersects the border of the new zone. The points contained in the added layer of cells are added, creating new tetrahedra. Some of these tetrahedra contain at least one point from the original zone and do not intersect the border of the new zone. They are therefore part of the final triangulation. The process of enlarging the zone is continued until all tetrahedra containing a point from the original zone are finalized. It is then sufficient to merge the different triangulations made by different threads, by keeping all finalized tetrahedra. Duplicated tetrahedra that cross the boundary of original zones need to be detected. Because the zones can grow and require different parts of the point set, Lo's strategy is more difficult to adapt to distributed memory architectures. Nonetheless, on a shared-memory machine, Lo reports an impressive 11-fold speedup on 12 threads in the perfect scenario. He however reports a throughput of only up to 2 million tetrahedra per second, which is a testimony to the slowness of his sequential implementation. The results must therefore be taken with a grain of salt since, once again, the code has not been made public.

Divide and conquer strategies are very attractive from a theoretical point of view, since they allow computing different parts of the mesh completely independently, and can therefore be used on distributed memory machines. However, from a practical point of view, none of them has really proved to be fast, and none of them seems to scale up to a thousand cores. In addition, a disadvantage of divide and conquer methods is that they rely on theoretical properties of the Delaunay triangulation, and are thus difficult to adapt to other stages of tetrahedral mesh generation, like the mesh improvement step.

Other strategies

GPUs can also be used to construct Delaunay triangulations[29, 87]. The strategy used on GPUs is totally different from the one used on CPUs. Indeed, GPUs have slower cores and, hence, absolutely need a high level of parallelism to perform well. Cao et al.[29] devised an algorithm that inserts points in parallel in multiple iterations. At each iteration, threads are each assigned a tetrahedron that contains (at least) a point to insert. A thread will simply insert the point by splitting the tetrahedron into 4 smaller tetrahedra. At this point the tetrahedralization is non-Delaunay. The algorithm then applies bistellar flips to all locally non-Delaunay faces to obtain a quasi-Delaunay tetrahedralization. A bistellar flip is simply the local operation illustrated in Figure 1.7. Whenever the circumsphere of a tetrahedron contains a point of an adjacent tetrahedron through a facet f, the facet f is said to be locally non-Delaunay. When a bistellar flip modifies f without creating an inverted tetrahedron, it deletes non-Delaunay facets and creates only Delaunay facets. However, because some facets cannot be flipped without creating an invalid tetrahedralization, only a quasi-Delaunay tetrahedralization can be obtained after the flipping step. Afterwards, the algorithm begins a new insertion iteration on the GPU, and so on until all points are inserted. Finally, the mesh is sent back to the CPU, where a star-splaying implementation will transform the quasi-Delaunay mesh into the unique Delaunay tetrahedralization of the point set. Creating an effective meshing strategy on the GPU is a feat in itself. However, the results do not justify a migration of the Delaunay kernel from the CPU to the GPU. Their program is only 9 times faster than the sequential version of CGAL, on their largest input point set. The algorithm could perhaps be more efficient on larger inputs, if only the memory of GPUs was not so limited! Another big disadvantage of GPU implementations is that they usually require a lot of tweaking to perform well on a variety of different GPU architectures. Coding on the GPU is hard; building a full 3D mesh generator that interacts with a CAD system efficiently, is fast on any GPU, can handle huge meshes and is easy to extend and maintain is a utopia at the moment.


Figure 1.7: The 2-3 and 3-2 flips

Simplex-preserving algorithms are probably the simplest Delaunay triangulation algorithms to parallelize. Cignoni et al.[35] devised an algorithm called ParInCoDe where the space is partitioned using a regular grid and a thread simply finds all the tetrahedra that are at least partially contained in its assigned grid cells using a simplex-preserving algorithm. A tetrahedron is only kept by a thread if its upper-left-front-most vertex is contained in the grid cell, to avoid duplicates. Results are not great on CPUs, because simplex-preserving algorithms are slow, but we genuinely do not know why this technique has never been attempted on GPUs. It would probably be our method of choice if we were to try to make a GPU implementation.

Chapter 2

Hilbert curve

The Hilbert curve is a fascinating object, which is at the center of this thesis about parallel tetrahedral mesh generation. It is pretty simple to construct in various ways, most of which are terribly inefficient. This chapter will gradually teach you how to build Hilbert curves, more specifically how to compute a Hilbert index, in a robust and efficient way. It is also an opportunity to recall the basics of floating-point errors and performance optimization. All the code listings shown in this chapter are available as C source files at https://git.immc.ucl.ac.be/hextreme/hilbert.

2.1 Row-major and column-major order

Figure 2.1: row-major ordering on a 2^k × 2^k grid, with k = 1, 2, 3, 4, 5

Figure 2.2: column-major ordering on a 2^k × 2^k grid, with k = 1, 2, 3, 4, 5.


When storing a multi-dimensional array in memory, one must map multi-dimensional indices to unidimensional ones. The two most basic choices of mapping are given by the column-major ordering and the row-major ordering. Given a 2D index (i, j), where i ∈ [0, m[ is the index of the row and j ∈ [0, n[ is the index of the column, a row-major ordering will map (i, j) to R(i, j) = i·n + j, while a column-major ordering will map (i, j) to C(i, j) = j·m + i. In any number of dimensions, the row-major ordering is actually the lexicographic ordering, first incrementing the last dimension, then the previous one and so on, all the way to the first dimension. That is how the C programming language, among many others, stores multidimensional data. The column-major ordering is the co-lexicographic ordering, which is the same as the lexicographic order with a reversed tuple of coordinates (i, j, k, . . .) → (. . . , k, j, i). One key advantage of these two mapping schemes is that locality in one of the dimensions is preserved. Row-major order keeps consecutive elements of the same row consecutive. The same holds true for consecutive elements of the same column with the column-major ordering scheme. Because modern computers can access a block of contiguous data faster than scattered data, and because they can also do Single Instruction on Multiple Data (SIMD), these two mapping schemes are the most efficient for a majority of "number crunching" algorithms. Indeed, most algorithms access elements either one column after the other, in which case the column-major ordering is the most efficient, or row after row, in which case the row-major ordering is the most efficient.

We can plot the row-major or column-major ordering of a 2^k × 2^k grid in a unit square as curves that pass by the center of each grid cell, as done in Figures 2.1 and 2.2. These curves are space-filling curves, meaning that the curves can be refined, by increasing k, to make them pass by any point of the unit square. If k = ∞, a space-filling curve basically covers the whole unit square, hence its name. More specifically, if you have 2D points in the unit square, with floating-point coordinates (x, y), the row-major curve with k = ∞ will visit every point in the lexicographic order. Actually, we do not need to refine the curve that much to obtain the same lexicographic ordering. We just need to increase k enough such that every point is contained in its own separate grid cell. The idea of representing the mapping from 2D floating points to 1D integers as a 2D curve that visits grid cells in a particular order is of big importance for the rest of this chapter.

There exist plenty of other space-filling curves that map each cell of a 2^k × 2^k grid to a single 2k-bit integer. The nice thing about 2^k × 2^k grids is that the row and column indices i and j both use the full range of a k-bit integer. Because m = n = 2^k, computing i·n + j or j·m + i amounts to stitching the k bits of i and j together: R(i, j) = (i << k) | j and C(i, j) = (j << k) | i.

Furthermore, each bit of the binary representation of i and j has a clear meaning. The most significant bit of i indicates if the cell is on the top half of the grid. The second most significant bit of i is set if the cell is on the top half of that half, and so on in a recursive manner. With the k bits of i, the row index is entirely defined. The same goes for the index j and the columns: each bit recursively defines if the cell is in the right or left half. In Sections 2.5 and 2.6, space-filling curves that exploit this recursive property will be detailed.
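As a small illustration (the helper functions below are ours, not part of the thesis code base), the general mappings and their bit-stitching specialization on a 2^k × 2^k grid can be written as:

```c
#include <stdint.h>
#include <stdio.h>

/* General row-major and column-major mappings for an m x n array. */
static uint64_t row_major(uint64_t i, uint64_t j, uint64_t n) { return i * n + j; }
static uint64_t col_major(uint64_t i, uint64_t j, uint64_t m) { return j * m + i; }

/* On a 2^k x 2^k grid, the multiplication degenerates into bit stitching:
 * the k bits of i become the high bits and the k bits of j the low bits. */
static uint64_t row_major_pow2(uint64_t i, uint64_t j, unsigned k) { return (i << k) | j; }
static uint64_t col_major_pow2(uint64_t i, uint64_t j, unsigned k) { return (j << k) | i; }

int main(void)
{
    unsigned k = 3;                       /* 8 x 8 grid */
    uint64_t i = 5, j = 2;                /* i = 0b101, j = 0b010 */
    printf("%llu %llu %llu %llu\n",       /* row-major forms both print 42 = 0b101010 */
           (unsigned long long)row_major(i, j, (uint64_t)1 << k),
           (unsigned long long)row_major_pow2(i, j, k),
           (unsigned long long)col_major(i, j, (uint64_t)1 << k),
           (unsigned long long)col_major_pow2(i, j, k));
    return 0;
}
```

The main function simply checks that the generic formulas and the shift-and-or versions agree on a small example.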

2.2 Locality of a curve

Mesh generation algorithms generally do not work one direction after the other, so using the lexicographic or co-lexicographic ordering does not make much sense. The need for a specific ordering comes from a part of our Delaunay tetrahedralization implementation, the topic of Chapter 3. In short, the Delaunay algorithm that we are using incrementally inserts points inside a mesh, so as to progressively refine it. One part of the algorithm, called the Walk, is significantly faster if points that are inserted consecutively are geometrically close to each other. The full details of the Walk are given in Section 3.2. What matters here is that the Walk motivates the need to find an ordering of points in an n-dimensional space such that consecutive points are also geometrically close to each other. Finding such an ordering is similar to the traveling salesman problem (TSP) in that the sum of distances between consecutive points should somewhat be minimized. Some space-filling curves (SFC) are commonly used as heuristics for the TSP because of their great locality properties. Coincidentally, the order given by those locality-preserving SFCs can also be used to create partitions, as explained at the end of this chapter, that are useful for parallel mesh generation algorithms. The notion of locality of the space-filling curves analyzed in this chapter is best defined on the mapping from a point on a space-filling curve to a point on a 2^k × 2^k × ... × 2^k n-dimensional grid. A mapping f : I → R^n is called α-Hölder continuous on the interval I if there is a constant C > 0 such that, for all parameters c1, c2 ∈ I,

||f(c1) − f(c2)|| ⩽ C |c1 − c2|^α

The Euclidean norm is assumed in the following because other norms are not of interest in the context of the Walk. Clearly, the row-major curve is not Hölder continuous because it “jumps” from (i, 2^k − 1) to (i + 1, 0):

c1 = i·2^k + 2^k − 1,   c2 = (i + 1)·2^k
f(c1) = (i, 2^k − 1),   f(c2) = (i + 1, 0)

The distance in 2D space can be infinite (with k = ∞) for a unit distance on the curve. However, there exist multiple recursively defined space-filling curves in 2D that are Hölder continuous with exponent α = 1/2. The ratio between the squared distance in 2D space and the corresponding distance on these curves is, hence, bounded by a constant:

||f(c1) − f(c2)||² / |c1 − c2| ⩽ C²

For example, the Hilbert curve, which is the main subject of this chapter and is shown in Figure 2.9, is continuous with α = 1/2 and C² = 6. The H-order curve, shown in Figure 2.13, has an even better locality: C² = 4. The measure of the constant C², for a 1/n-Hölder continuous n-dimensional curve, is also called the Worst-case Locality (WL) by Haverkort and van Walderveen[62]. If a curve is not 1/n-Hölder continuous, then C² = WL = ∞. The worst-case locality is a good indicator of the locality-preserving properties of a curve: points that are close to each other on a curve with a good (low) worst-case locality are necessarily also geometrically close to each other in 2D. However, there is no guarantee about the inverse mapping f⁻¹ : R^n → I. Points that are geometrically close in 2D space are not necessarily close to each other on the curve. In fact, it is impossible to find a mapping f⁻¹ which is Hölder continuous: moving in 2D space can always cause arbitrarily large jumps on a 1D curve[15]. For some other measure of locality on f⁻¹, the optimal mapping corresponds to a spiral[127]. Points that are geometrically close to each other in 2D are only close on the spiral curve near its center.

2.3 Mapping floating-points to k-bit integers

Figure 2.3: Splitting [vmin, vmax] equally into 2^k intervals, indexed from 0 to 2^k − 1.

Consider a function F(v, vmin, vmax, k) which, given a floating-point number v in an interval [vmin, vmax], splits the interval equally into 2^k (k ∈ N) smaller intervals and gives the index n ∈ [0 . . . 2^k − 1] of the interval within which v falls. Then j = F(x, xmin, xmax, k) and i = F(y, ymin, ymax, k) give the two k-bit integer indices (i, j) of the cell of a 2^k × 2^k grid within which the point (x, y) falls. This function can easily be constructed as a composition of a linear map from v ∈ [vmin, vmax] to w ∈ [0, 2^k[ and the floor function ⌊w⌋, which rounds w to the largest integer n less than or equal to w. However, the linear map between the closed interval [vmin, vmax] and the half-closed interval [0, 2^k[ is anything but obvious in practice. Using a linear map between two closed intervals [vmin, vmax] and [0, 2^k], like in Listing 2.1, is incorrect. If v = vmax, then this function will output 2^k, which does not correspond to any interval on our 1D grid.

Listing 2.1: An incorrect function. It should return the index of the interval within which v falls when dividing [vmin, vmax] equally into 2^k smaller intervals. It fails for v = vmax.

int getCell1D(double v, double vmin, double vmax, int k) {
    return (v - vmin) / (vmax - vmin) * (1ULL << k);
}

Handling the case v = vmax differently will not solve the problem in practice. Indeed, let

v̂ = ieee754((v − vmin) / (vmax − vmin)),

then v̂ can equal one even if v is less than vmax, due to roundoff errors. However, v̂ will never be greater than 1 because 1 has an exact floating-point representation. Therefore, what we need to do is to find the largest floating-point number K less than 2^k, and multiply v̂ by K rather than by 2^k. The largest double precision floating-point number below a certain positive number α can be found either by using the nextafter function, available in the C99 math library, or by interpreting the binary representation of α as an integer and decrementing it, as is shown in Listing 2.2. Using this technique, Listing 2.3 shows a correct function to convert any real number v in a 1D interval [vmin, vmax] to an integer n ∈ [0 . . . 2^k − 1]. This function is quite compute intensive, especially the division operation, which consumes between 10 and 40 CPU cycles. To save some time when converting a lot of coordinates to integer values, we would like to precompute the constant factor

f = K / (vmax − vmin)

Listing 2.2: Printing the largest double precision number below 2^8 in three different ways.

#include <stdio.h>
#include <math.h>
#include <stdint.h>

int main(void) {
    printf("%.30f\n", 255.99999999999998);
    printf("%.30f\n", nextafter(256, 0));

    union { double f; int64_t d; } alpha = {256.0};
    alpha.d--;
    printf("%.30f\n", alpha.f);
}
/* commands to compile and launch the program + output:

thesis/03-Hilbert/code$ make -s nextbefore256 && ./nextbefore256
255.999999999999971578290569595993
255.999999999999971578290569595993
255.999999999999971578290569595993 */

Listing 2.3: Correct function that returns the index of the interval within which v falls when dividing [vmin, vmax] equally into 2^k smaller intervals.

int getCell1D(double v, double vmin, double vmax, int k) {
    unsigned n = 1U << k;
    double K = nextafter((double) n, 0.0);   /* largest double below 2^k */
    return (v - vmin) / (vmax - vmin) * K;   /* truncated result is in [0, n-1] and always >= 0 */
}

that includes the division. But this will again lead to incorrect results in practice because arithmetic with floating-point numbers is not associative:

(v − vmin)·f ≠ ((v − vmin) / (vmax − vmin))·K   due to roundoff errors.

The solution is simple: decrease the value of f until (vmax − vmin)·f < 2^k. This leads to the code in Listing 2.4. Now, we can correctly and efficiently assign each n-dimensional point to a corresponding set of n k-bit integers.

Listing 2.4: An efficient iterative version of Listing 2.3. The division is notably done outside of the loop.

void mapCell1D(double* v, int* map, int vLength,
               double vmin, double vmax, int k) {
    double vwidth = vmax - vmin;
    unsigned n = 1U << k;
    double f = n / vwidth;

    /* With IEEE754 and n=2^k, the while can be changed into an if, because
     * n=2^k does not induce any more errors compared to the case where n=1 and:
     * f = 1/vwidth has a max. relative error of .5e (e: machine epsilon)
     * => f * vwidth <= roundIEEE754(1 + .5e) = 1 */
    while (vwidth * f >= (double) n)
        f = nextafter(f, 0.0);

    #pragma omp parallel for simd
    for (int i = 0; i < vLength; i++) {
        map[i] = (v[i] - vmin) * f;
    }
}

2.4 Bending space

Figure 2.4: Intervals on the segment [vmin, vmax], numbered 0 to 7, but with a center vcenter that does not correspond to (vmin + vmax)/2.

Actually, the algorithms explained in this chapter do not require a uniform grid. It is often advantageous to have grids that are more refined in some parts. As an example, consider a strategy to put the center of the intervals, i.e. the point where the intervals 2^(k−1) − 1 and 2^(k−1) meet, at any coordinate

vcenter = vmin + mid01 · (vmax − vmin)

with mid01 ∈ [0, 1], as shown in Figure 2.4. In fact, this is just equivalent to the equal-sized intervals of Figure 2.3, but with a mapping f(v) applied to the coordinates beforehand. Coordinates below vcenter are scaled down and shifted such that f(vmin) = vmin and f(vcenter) = 0.5; coordinates above vcenter are scaled up and shifted such that f(vcenter) = 0.5 and f(vmax) = vmax. An example of the extended function Fe(v, vmin, vmax, k, mid01) that allows us to specify the center of the 1D grid is given as C code in Listing 2.5. Note that if mid01 is set to 0.5, it becomes strictly equivalent to Listing 2.4. Another interesting way to map floating-point coordinates to integers is simply to interpret their binary representation as the integer index itself. This

Listing 2.5: A more versatile version of Listing 2.4 that allows us to specify a middle point. Intervals on both sides of the middle are scaled up or down to remain between vmin and vmax.

void mapCell1D(double* v, int* map, int vLength,
               double vmin, double vmax, int k, double mid01) {
    unsigned n = 1U << k;
    double vcenter = vmin + mid01 * (vmax - vmin);

    double f0 = 0.5*n / (vcenter - vmin);
    double sub0 = vmin;
    double f1 = 0.5*n / (vmax - vcenter);
    double sub1 = 2*vcenter - vmax;
    while ((vmax - sub1) * f1 >= (double) n)
        f1 = nextafter(f1, 0.0);

    #pragma omp parallel for simd
    for (int i = 0; i < vLength; i++) {
        if (v[i] < vcenter)
            map[i] = (v[i] - sub0) * f0;
        else
            map[i] = (v[i] - sub1) * f1;
    }
}

leads to an irregular grid with exactly one cell for each possible pair of floating-point numbers. Fuetterling et al. call this structure the linear floating-point quadtree[53]. Coincidentally, they use it both for constructing space-filling curves and for accelerating Delaunay triangulations, which are two cornerstones of this thesis.
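A minimal sketch of the underlying idea (our illustration, not code from [53]): for non-negative IEEE-754 doubles, the bit pattern itself is an order-preserving integer, which is what makes one cell per floating-point value usable as an index.

#include <stdint.h>
#include <string.h>

/* Reinterpret the bits of a non-negative double as a 64-bit integer.
 * For x, y >= 0, x < y implies floatKey(x) < floatKey(y). */
static inline uint64_t floatKey(double x) {
    uint64_t bits;
    memcpy(&bits, &x, sizeof bits);
    return bits;
}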

Finally, a tuple of indices (i, j, k) can be associated to a leaf of most tree structures based on a recursive subdivision of space. For example, in a perfect binary space partitioning tree with k levels of recursion, consecutive integers i and i + 1 (i ∈ [0 . . . 2^k − 2]) can easily be assigned to neighboring cells. Each subdivision corresponds to a certain bit of i and i + 1, just like in regular grids. Similarly, a pair of integers can be assigned to the cells of a quadtree and a triplet of integers to an octree. What is important for the purpose of this chapter is that two tuples of coordinates which differ only by one must correspond to neighboring cells. Two tuples differ only by one if only one of their coordinates is different and differs only by one. Here, grid-like structures that have this property are called pseudo-grids.

Figure 2.5: 5 iterations of the Z-order curve

2.5 Z-order curve

The Z-order, also called Morton code, is a simple function that maps n-dimensional data with integer coordinates between 0 and 2^k − 1 to a single one-dimensional integer between 0 and 2^(nk) − 1 while preserving most of the locality. The corresponding Z-order curve is also called the Lebesgue curve. As shown in Figure 2.5, it draws a mirrored Z shape. This has to do with how we display the grid. The first axis, heading to the right, corresponds to columns, and the second axis, oriented upwards, corresponds to rows. To draw a correct Z shape, we should either orient our axes differently or use a canonical ordering that differs from the row-major ordering. As a reminder from Section 2.1, the index R(i, j) at which the row-major curve visits a cell (i, j) of a 2^k × 2^k grid is simply given by stitching the bits of i and j together. We define the operator ⊔, which stitches bits together: if there are n bits on the right-hand side of ⊔, the left-hand side is multiplied by 2^n. The rightmost ⊔ operation is applied first. With this notation, let iq and jq be the qth most significant bits of i and j; the row-major ordering is then given by:

R(i, j) = i·2^k + j = i0 ⊔ i1 ⊔ ... ⊔ ik−1 ⊔ j0 ⊔ j1 ⊔ ... ⊔ jk−1.

As we also explained in Section 2.1, the most significant bits of j and i tell us whether we are on the right or the left and at the top or the bottom. The two bits stitched together, i0 ⊔ j0, give us the main quadrant into which (i, j) falls in our 2^k × 2^k grid, indexed in row-major order. The pair of second most significant bits, i1 ⊔ j1, gives us the quadrant within the main quadrant into which (i, j) falls, also in row-major order. Recursively, we can dive down an imaginary quadtree until we reach the leaf cell (i, j), after k iterations. Because each quadrant is ordered first along the first dimension and then the second, the imaginary quadtree that we were thinking about is equivalent to a binary space-partitioning tree, where we recursively divide our grid in half, first along rows, then along columns. That binary space-partitioning (BSP) tree is illustrated in Figure 2.6 for k = 2. The binary path that leads to the leaf (i, j) of this perfect binary tree naturally corresponds to an integer, which is

also given by interlacing all the bits of i and j. The traversal of all the leaves of our binary tree gives the Z-order curve:

Z(i, j) = i0 ⊔ j0 ⊔ i1 ⊔ j1 ⊔ ... ⊔ ik−1 ⊔ jk−1

Note that any BSP tree naturally defines an ordering of cells. Interestingly, for a k-d tree, the ordering corresponds exactly to the Z-order, except that the coordinates i, j of a cell are only truly visible when the subdivisions are perfect halves.

Figure 2.6: A quadtree where quadrants are ordered in row-major order is equivalent to a binary tree where we begin by dividing along the first dimension, then the second, and so on recursively. The ordering of the leaf nodes, (0, 0), (0, 1), (1, 0), (1, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 0), (2, 1), (3, 0), (3, 1), (2, 2), (2, 3), (3, 2), (3, 3), is the Z-order.

Listing 2.6: This function returns the Z-order index of the grid cell (i, j).

#ifndef __BMI2__
#error "need __BMI2__ for _pdep_u64()"
#endif
#include <immintrin.h>
#include <stdint.h>

uint64_t zorder(int i, int j) {
    return _pdep_u64(i, UINT64_C(0x5555555555555555)<<1) |
           _pdep_u64(j, UINT64_C(0x5555555555555555));
}

Implementation   The Z-order of a point (x, y), with a given bounding box and a given recursion level k, is very simple to compute, especially when using the _pdep_u64 intrinsic with the proper masks to space out the bits of i and j, as shown in Listing 2.6. If _pdep_u64 is unavailable, one can use a table with the Moser–de Bruijn sequence and shifts to accomplish the same result almost as efficiently.
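As a sketch of such a portable fallback (ours, not the thesis code), the classic magic-mask bit-spreading trick reproduces _pdep_u64(x, 0x5555555555555555) without BMI2; a lookup table built on the Moser–de Bruijn sequence, as mentioned above, would do the same job byte by byte.

#include <stdint.h>

/* Spread the 32 bits of x so that they occupy the even bit positions of a
 * 64-bit word, i.e. the same result as _pdep_u64(x, 0x5555555555555555). */
static uint64_t spreadBits(uint64_t x) {
    x &= 0xFFFFFFFFu;
    x = (x | x << 16) & UINT64_C(0x0000FFFF0000FFFF);
    x = (x | x << 8)  & UINT64_C(0x00FF00FF00FF00FF);
    x = (x | x << 4)  & UINT64_C(0x0F0F0F0F0F0F0F0F);
    x = (x | x << 2)  & UINT64_C(0x3333333333333333);
    x = (x | x << 1)  & UINT64_C(0x5555555555555555);
    return x;
}

uint64_t zorderPortable(int i, int j) {
    return spreadBits(i) << 1 | spreadBits(j);
}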

2.6 2D Hilbert curve

Now that we are so familiar with the Z-order that we see it simultaneously as an ordering, a curve, a tree and a function that interlaces the bits of two indices, we are ready to tackle the main part of this chapter: Hilbert curves.

Base pattern   The Z-order curve contains jumps and, hence, is not Hölder continuous. The Z-order curve is somewhat based on a recursive application of the row-major ordering, which gives it its Z shape. To try to improve the curve, we can use a base pattern that orders the four quadrants so as to draw another shape. For example, in Figure 2.7, we define a base pattern that draws a C shape. The quadrants with row-major indices {0, 1, 2, 3} have the indices {1, 0, 2, 3} with our new C-shaped base pattern. The different permutations of {0, 1, 2, 3} uniquely define every possible base pattern in 2D. Base patterns are therefore identified by this sequence. To the C-shaped base pattern that we created corresponds an analog of the Z-order curve that we will call the C-order curve. The index of a cell (i, j) along the C-order curve is given by:

C(i, j) = f(i0 ⊔ j0) ⊔ f(i1 ⊔ j1) ⊔ ... ⊔ f(ik−1 ⊔ jk−1)
with f(0) = 1, f(1) = 0, f(2) = 2, f(3) = 3.

The full C-order curve that results from this definition is shown in Figure 2.8. Actually, every possible 2D base pattern is a reflection and/or a rotation of the row-major or C-shaped base patterns shown in Figure 2.7. An even permutation of {0, 1, 2, 3} gives a Z-shaped base pattern, an odd permutation gives a C-shaped pattern. Therefore, when changing only the 2D base pattern, the Z-order and C-order curves are the only two different space-filling curves that we can create (modulo rotation and reflection). Note that the C pattern has a superior locality. Indeed, the base pattern connects only adjacent cells. Therefore, an increment of 1 on the C-shaped base pattern corresponds to a change of 1 bit in the standard row-major pattern. This means that f : {0, 1, 2, 3} → {1, 0, 2, 3} is an inverse Gray code and f⁻¹ : {0, 1, 2, 3} → {1, 0, 2, 3} is a Gray code¹. However, the full C-order curve is still not continuous.
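Applying this definition is straightforward; the sketch below (ours, not a thesis listing) converts a Z-order index to the corresponding C-order index by applying f to every pair of bits.

#include <stdint.h>

/* C-order index of a cell given its 2D Z-order index on a 2^k x 2^k grid:
 * each 2-bit quadrant index is remapped by the C-shaped base pattern
 * f = {1, 0, 2, 3} defined above. */
uint64_t mortonToCorder(uint64_t zorder, int k) {
    static const unsigned f[4] = {1, 0, 2, 3};
    uint64_t corder = 0;
    for (int i = k - 1; i >= 0; i--)
        corder = corder << 2 | f[zorder >> (2 * i) & 3];
    return corder;
}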

Figure 2.7: Two different base patterns: (a) the standard base pattern {0, 1, 2, 3} given by the row-major order; (b) a custom C-shaped base pattern {1, 0, 2, 3}.

Figure 2.8: 5 iterations of our custom-made C-order curve

Transformations   To have even better locality properties than the C-order curve, we would like our whole curve to go from adjacent cell to adjacent cell. The Hilbert curve, shown in Figure 2.9, does exactly that, whatever the level of recursion. It has excellent locality properties: two cells that are close to each other on the curve are also close to each other on the grid[15, 64, 62]. In addition, it keeps the same recursive structure as the Z-order and C-order curves: refining the curve one iteration further does not change the ordering of previous grid quadrants and sub-quadrants.

¹A Gray code is an ordering of the binary numeral system such that two successive values differ in only one bit. The most well-known kind of Gray code is the binary-reflected Gray code. Here, f = f⁻¹, so f is both a Gray code and an inverse Gray code simultaneously.

Figure 2.9: 5 iterations of the 2D Hilbert curve.

With a quadtree representation of the curve, this means that the part of the tree that has been built is not changed by further iterations. The novelty introduced by the Hilbert curve is that a transformation is applied to each quadrant at each level of recursion. The base pattern {0, 1, 3, 2} is usually chosen, but a Hilbert curve can be found from any C-shaped base pattern. The {0, 1, 3, 2} pattern has the particularity of being the inverse of the unique binary-reflected Gray code sequence. Both can be created automatically and efficiently in any dimension. The set of automorphisms of a square, which is the set of possible transformations that can be applied to the quadrants of our grid, is presented in Figure 2.10. To each 2 × 2 transformation matrix M, we associate a pair of numbers {a, b} which is simply given by (a, b)ᵀ = M (1, 2)ᵀ. The different possible transformations of our square correspond to the different symmetries of a square. Hence, the group which expresses the composition of all pairs of transformations is the dihedral group D4[30]. The Cayley table that dictates how pairs of transformations are composed is given in Table 2.1. Following the row-major ordering, the different transformations of each quadrant are given by {2, 1}, {1, 2}, {−2, −1}, {1, 2} for the Hilbert curve, as shown in Figure 2.11a. A remarkable property of these 4 transformations is that they form a subgroup within D4. Indeed, the possible configurations that can be reached by composing these 4 transformations all have an even number of minus signs. The resulting subgroup is the Klein four-group[30].

Implementation   To gradually show how a function that gives this ordering can be implemented, we made three versions of the same mortonToHilbert function, which transforms a Z-order index into a Hilbert index. The first version, given in Listing 2.7, is really straightforward: there are two arrays, base_pattern and transformation, which contain the base pattern and transformations that define our Hilbert curve.

(a) Identity {1, 2}; (b) Vertical reflection {−1, 2}; (c) Horizontal reflection {1, −2}; (d) Axis permutation {2, 1}; (e) 90° rotation {2, −1}; (f) 180° rotation {−1, −2}; (g) 270° rotation {−2, 1}; (h) {−2, −1}.

Figure 2.10: All possible transformations of the {0, 1, 3, 2} base pattern, expressed as axis permutations and reflections. Each transformation is a symmetry of the square.

There are also two auxiliary functions, composeTransform and applyTransform, which respectively compose transformations and apply a transformation to the row-major index of a quadrant. The mortonToHilbert function simply iterates through each pair of bits of the Z-order, applies the current transformation to the pair of bits to get the current quadrant and then composes the transformation of the quadrant with the current transformation. This first version is quite slow: the way pairs of integers are composed and the way transformations are applied to a quadrant index are simply not efficient.

e = {1, 2}, a = {2, −1}, a² = {−1, −2}, a³ = {−2, 1}, b = {−1, 2}, ba = {2, 1}, ba² = {1, −2}, ba³ = {−2, −1}

 ◦   | e    a    a²   a³   b    ba   ba²  ba³
 e   | e    a    a²   a³   b    ba   ba²  ba³
 a   | a    a²   a³   e    ba³  b    ba   ba²
 a²  | a²   a³   e    a    ba²  ba³  b    ba
 a³  | a³   e    a    a²   ba   ba²  ba³  b
 b   | b    ba   ba²  ba³  e    a    a²   a³
 ba  | ba   ba²  ba³  b    a³   e    a    a²
 ba² | ba²  ba³  b    ba   a²   a³   e    a
 ba³ | ba³  b    ba   ba²  a    a²   a³   e

Table 2.1: Cayley table of the dihedral group D4


Figure 2.11: The four transformations that: (a) are recursively applied to each quadrant of the Hilbert and Moore curve (b) are applied at the first iteration only to quadrants of the Moore curve way transformations are applied to a quadrant index are simply not efficient. The second version of the ̀mortonToHilbert ̀ function, given in List- ing 2.8, does not compute any transformation of quadrants. Instead, the index along the transformed base pattern is given in the ̀base_pattern ̀ lookup- table (LUT) for each possible transformation. Transformations are now sim- ply referenced by a number between 0 and 3. Likewise, another lookup-table named ̀configuration ̀ contains all the possible results for the composi- 36 CHAPTER 2. HILBERT CURVE

Listing 2.7: The first version of the mortonToHilbert function, which returns the Hilbert index given the Z-order index.

#include <stdint.h>
#define ABS(x)  ((x) < 0 ? -(x) : (x))
#define SIGN(x) ((x) < 0 ? -1 : 1)

unsigned base_pattern[4] = {0,1,3,2};                       // given in row-major order
int transformation[4][2] = {{2,1}, {1,2}, {-2,-1}, {1,2}};  // also row-major

// given a transformation 'a' and a transformation 'b', applies 'b' to 'a'
static inline void composeTransform(int a[2], int b[2]) {
    int copy[2] = {a[0], a[1]};
    int first = ABS(b[0])-1;
    int second = ABS(b[1])-1;
    a[0] = SIGN(b[0]) * copy[first];
    a[1] = SIGN(b[1]) * copy[second];
}

// applies a transformation to a quadrant row-major index
static inline int applyTransform(int transfo[2], int quadrant) {
    int ij_k[2] = {quadrant>>1, quadrant & 1};
    int first = ABS(transfo[0])-1;
    int second = ABS(transfo[1])-1;
    return (ij_k[first]  ^ (transfo[0] < 0))<<1 |
           (ij_k[second] ^ (transfo[1] < 0));
}

// convert a 1D integer coordinate along the 2D Z-order curve
// into a 1D integer coordinate along the 2D Hilbert curve
uint64_t mortonToHilbert(uint64_t zorder, int k) {
    int config[2] = {1,2};
    uint64_t hilbert = 0;

    for (int i=k-1; i>=0; i--) {
        unsigned quadrant = applyTransform(config, zorder>>(2*i) & 3);
        hilbert = hilbert<<2 | base_pattern[quadrant];
        composeTransform(config, transformation[quadrant]);
    }

    return hilbert;
}

The mortonToHilbert function does not call auxiliary functions anymore; it is simpler and more efficient. The two lookup-tables only contain numbers going from 0 to 3. Therefore, these two 4 × 4 tables could really be encoded in 64 bits. Listing 2.8 is at least 11× faster than Listing 2.7, as shown in Table 2.2. The last version of the code, given in Listing 2.9, is fully optimized. Instead of extracting a pair of bits from the Z-order to get a quadrant, it extracts 12 bits at a time, giving the location of the cell within a 64 × 64 sub-grid in row-major order. Compared to Listing 2.8, there is still a current transformation, which is a number from 0 to 3 stored in config, but instead of the 4 possible quadrants, we have 4096 possible cell positions. It is possible to compute a 4 × 4096 configuration table and a 4 × 4096 base_pattern table. A

Listing 2.8: The second version of the mortonToHilbert function (introducing lookup-tables), which returns the Hilbert index given the Z-order index.

// base_pattern[i][j] is the hilbert index given:
//  * i: the current configuration
//  * j: the quadrant in row-major order
unsigned base_pattern[4][4] = {
    {0,1,3,2},  // config 0 = { 1, 2}  )
    {0,3,1,2},  // config 1 = { 2, 1}  n
    {2,3,1,0},  // config 2 = {-1,-2}  (
    {2,1,3,0}}; // config 3 = {-2,-1}  U

// configuration[i][j] is the next configuration given:
//  * i: the current configuration
//  * j: the quadrant in row-major order
unsigned configuration[4][4] = {
    {1,0,3,0},  // ) => U)
                //      n)
    {0,2,1,1},  // n => nn
                //      )(
    {2,1,2,3},  // ( => (U
                //      (n
    {3,3,0,2}   // U => )(
};              //      UU

// convert a 1D integer coordinate along the 2D Z-order curve
// into a 1D integer coordinate along the 2D Hilbert curve
uint64_t mortonToHilbert(uint64_t zorder, unsigned k) {
    unsigned config = 0;
    uint64_t hilbert = 0;

    for (int i=k-1; i>=0; i--) {
        unsigned quadrant = zorder>>(2*i) & 3;
        hilbert = hilbert<<2 | base_pattern[config][quadrant];
        config = configuration[config][quadrant];
    }

    return hilbert;
}

base pattern, here, is actually a pattern over 6 iterations of the Hilbert curve, and a transformation is the composition of the six transformations that correspond to a 12-bit position. These two lookup-tables are actually used in Listing 2.9, but they are hidden inside a single lookup-table named hilbertLUT, contained in the file LUT_6iter.h. Each value in the hilbertLUT table is a 14-bit number. The two most significant bits represent a transformation, and the 12 least significant bits correspond to the ordering given by the extended base pattern that spans 6 iterations. Merging the two tables into one saves some memory space. The hilbertLUT is only 32 kB big, which fits in the last level of cache of any modern computer. In addition, config is already shifted to the left by 12 dirty bits to save an instruction. If we used a 4-bit configuration like in Listing 2.8, we would unnecessarily shift a value back and forth to the right and to the left,

Listing 2.9: The optimized version of the mortonToHilbert function (6 iterations at a time), which returns the Hilbert index given the Z-order index.

#include "LUT_6iter.h"

static inline uint64_t mortonToHilbert(uint64_t zorder, int k) {
    uint16_t config = 0;
    uint64_t hilbert = 0;

    int shift;
    for (shift=2*k-12; shift>0; shift-=12) {
        config = hilbertLUT[(config & ~0xFFF) | (zorder>>shift & 0xFFF)];
        hilbert = hilbert<<12 | (config & 0xFFF);
    }

    // final step must be shifted the other way
    config = hilbertLUT[(config & ~0xFFF) | (zorder<<(-shift) & 0xFFF)];
    hilbert = hilbert<<12 | (config & 0xFFF);

    return hilbert >>= -shift;
}

which is equivalent to discarding the lower bits:

// same as v = hilbertLUT[config][zorder>>shift & 0xFFF]
uint16_t v = *(hilbertLUT + (config<<12 | (zorder>>shift & 0xFFF)));
hilbert = hilbert<<12 | (v & 0xFFF);
config = v >> 12;

This final optimization was inspired by a blog entry².

            Listing 2.7   Listing 2.8   Listing 2.9
Time [s]       2.00          0.18          0.03
Speedup        1            11.11         66.67

Table 2.2: Timings for computing the 16 777 216 Hilbert indices associated to each possible coordinate in a 2^12 × 2^12 grid. The three versions of the same mortonToHilbert function given in Listings 2.7, 2.8 and 2.9 were tested.

The timings for computing mortonToHilbert(zorder(i, j), 4096), for each of the three versions of the mortonToHilbert function, are shown in Table 2.2. Because of the small additional constant time that the zorder function adds, Listing 2.9 is actually a little more than 6 times faster than Listing 2.8, exactly as expected. Indeed, this last version does 6 iterations at a time, uses one instruction less per iteration and has modest bandwidth requirements, as its lookup-table fits easily in the last level of cache memory.
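To show how the pieces of this chapter chain together in practice, here is a small usage sketch (ours; it assumes the getCell1D, zorder and mortonToHilbert functions of Listings 2.3, 2.6 and 2.9 are in scope, and the wrapper name hilbertIndex is hypothetical):

#include <stdint.h>

/* Hilbert index of a single 2D point (x, y) inside a given bounding box:
 * map each coordinate to a k-bit grid index, interleave them into a
 * Z-order index, then convert the result to a Hilbert index. */
uint64_t hilbertIndex(double x, double y,
                      double xmin, double xmax,
                      double ymin, double ymax, int k) {
    int j = getCell1D(x, xmin, xmax, k);   /* column index */
    int i = getCell1D(y, ymin, ymax, k);   /* row index    */
    return mortonToHilbert(zorder(i, j), k);
}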

²at http://threadlocalmutex.com/

The Moore curve

Figure 2.12: 5 iterations of the 2D Moore curve

In 2D, the Hilbert curve is the only curve that visits only neighboring cells and that can be created by applying the same set of four transformations on each quadrant, one iteration after the other. However, if we allow ourselves to change the set of transformations from one iteration to another, we can create a whole zoo of space-filling curves with the Hilbert property. One particularly interesting choice is to apply the transformations {−1, −2}, {1, 2}, {−1, −2}, {1, 2} (following the row-major ordering of quadrants) at the first iteration only. The following iterations employ the standard Hilbert transformations {2, 1}, {1, 2}, {−2, −1}, {1, 2}. This special modification of Hilbert's rule forms what is called the Moore curve, a closed alternative to the Hilbert curve. Indeed, in contrast with the Hilbert curve, which has a clear beginning and a clear end, the Moore curve forms a loop: the curve ends next to where it started. Let M ∈ [0 . . . 2^(2k)[ be the index of a cell cM along the Moore curve; the indices

(M + 1) mod 2^(2k) and (M − 1) mod 2^(2k) correspond to cells that share an edge with cM, even at the beginning and the end of the curve. This property is used in Section 2.9 to create partitions along the Moore curve that can contain its start and its end without being discontinuous. Notice that the transformations of the quadrants during the first iteration all have an even number of minus signs: {−1, −2}, {1, 2}, {−1, −2}, {1, 2}. The possible transformations that can be reached by composing these transformations therefore also contain an even number of negative signs, forming a subgroup which is the same Klein four-group as for the rest of the (Hilbert) recursions. The transformations {−2, 1}, {−2, 1}, {2, −1}, {2, −1} also form a Moore curve, but these transformations lead to a larger number of possible configurations when composed with the subsequent transformations of the Hilbert curve. Implementation-wise, the Hilbert and Moore curves only differ at the first iteration. Listing 2.10 is a very slight modification of Listing 2.8 that gives the Moore curve.

Listing 2.10: The mortonToMoore function returns the Moore index given the Z-order index.

// hilbert index = base_pattern[config][quadrant]
unsigned base_pattern[5][4] = {
    {0,1,3,2}, {0,3,1,2}, {2,3,1,0}, {2,1,3,0},
    {0,1,3,2}}; // config 4 = {1,2} ) Moore

// next_config = configuration[current_config][quadrant]
unsigned configuration[5][4] = {
    {1,0,3,0}, {0,2,1,1}, {2,1,2,3}, {3,3,0,2},
    {2,0,2,0}  // ) Moore => nn
};             //            uu

// convert a 1D integer coordinate along the 2D Z-order curve
// into a 1D integer coordinate along the 2D Moore curve
uint64_t mortonToMoore(uint64_t zorder, unsigned k) {
    unsigned config = 4;
    uint64_t moore = 0;

    for (int i=k-1; i>=0; i--) {
        unsigned quadrant = zorder>>(2*i) & 3;
        moore = moore<<2 | base_pattern[config][quadrant];
        config = configuration[config][quadrant];
    }

    return moore;
}

Other space filling curves

Figure 2.13: 5 iterations of the H-order curve

There are plenty of other possible space-filling curves, even when constrained to 2^k × 2^k grids. For example, the H-order curve shown in Figure 2.13 has even better locality properties than the Hilbert and Moore curves. However, it is slower to compute, primarily because the recursive rule is not applied to quadrants, but to triangular patches. Looking closely at the curve with k = 5, the diagonals that separate the quadrants into quasi-independent triangles can be seen. Figure 2.14 shows the Sierpiński curve, just for your viewing pleasure. Bader wrote a very good book (in our humble opinion) that gives examples of different space-filling curves together with their applications[15].

Figure 2.14: 5 iterations of the Sierpiński curve

2.7 3D Hilbert curve

Almost every algorithm that has been reviewed so far extends naturally to 3 dimensions. For example, updating Listing 2.6 to obtain a three-dimensional Z-order index gives:

uint64_t zorder(int i, int j, int k) {
    // masks place the bits of each coordinate three positions apart
    return _pdep_u64(i, UINT64_C(0x9249249249249249)<<2) |
           _pdep_u64(j, UINT64_C(0x9249249249249249)<<1) |
           _pdep_u64(k, UINT64_C(0x9249249249249249));
}

The only challenging part is to find a 3D base pattern and a set of transformations of all octants that will give us a 3D version of the Hilbert curve. But first, an important question arises: “How many three-dimensional Hilbert curves are there?”. This question is answered by the fantastic work of Haverkort on 3D Hilbert curves, whose title is literally that question[61]. The 2D Hilbert curve has multiple different combinations of properties that make it unique in 2D. Depending on which set of properties is deemed essential for a 3D Hilbert curve to really be a Hilbert curve, Haverkort distinguishes a total of 10 694 807 possible three-dimensional Hilbert curves (modulo rotations, reflections, translations, scalings and reversals). Haverkort, in addition to applying simple reflections to the octants of a curve, also applies a function χ(t) that either reverses or keeps the direction of the curve in an octant. To keep the same simplicity as in 2D for our implementation of the 3D curve, applying such a function χ(t) is not desired. This limits our search to what Haverkort calls order-preserving curves. Furthermore, it seems only logical to keep only the curves for which the next cell is one of the four closest neighboring cells. Those Hilbert curves have better locality than others in general and are obviously optimal when looking only at the locality of two consecutive cells. Haverkort differentiates those curves as being face-continuous. There are 920 face-continuous, order-preserving curves[61]. Among those, we are particularly interested in curves that restrict the possible transformations to a subgroup of the full octahedral group, also called the hyperoctahedral group B3, which is the group of symmetries of a cube³.

The 48 possible transformations are illustrated for an arbitrary base pattern in Figure 2.15. Transformations with an even number of reflections (the number of negative signs) and an even number of permutations form a subgroup corresponding to the alternating group A4.

(The figure arranges the 48 transformations in a grid: the six rows give the axis permutations 1, 2, 3; 2, 1, 3; 1, 3, 2; 3, 2, 1; 3, 1, 2 and 2, 3, 1, while the eight columns give the sign combinations 1, 2, 3; −1, 2, 3; 1, −2, 3; 1, 2, −3; −1, −2, 3; 1, −2, −3; −1, 2, −3 and −1, −2, −3; the alternating group A4 subgroup is highlighted.)

Figure 2.15: The 48 possible transformations of a 3D base pattern, corresponding to symmetries of the cube. Transformations that form an alternating group A4 are shown in dark yellow.

Haverkort employs the name coordinate-shifting curves for curves that have only an even number of permutations in the transformation of each octant. However, he does not specify any name for curves with an even number of reflections. Interestingly, all transformations of the five face-continuous, order-preserving, coordinate-shifting 3D Hilbert curves have an even number of minus signs. Therefore, using one of these 5 curves, the same strategy as in Listings 2.8 and 2.9 can be applied, limiting the number of possible transformations to 12 instead of 48. The base patterns and transformations that define those five curves are given in Table 2.3.

³https://en.wikipedia.org/wiki/Octahedral_symmetry

In practice, a lookup-table (LUT) can contain at most four iterations at once so as to fit in the last level of cache memory. Indeed, 4 iterations represent 8^4 = 4096 different possible positions, multiplied by the 12 transformations, multiplied by 2 bytes (needed to represent a specific combination of a transformation and a position in the LUT), which already equals 98 304 bytes.

Name                 Transformations                   Base pattern
(Haverkort notation) (in the base pattern order)

Ca00.chI             {2, 3, 1}, {3, 1, 2},             Ca00:
                     {1, 2, 3}, {−3, −1, 2},           {0, 1, 3, 2, 7, 6, 4, 5}
                     {3, −1, −2}, {1, 2, 3},
                     {−3, 1, −2}, {2, −3, −1}

Ca00.cc.h4.I3.I3     {2, 3, 1}, {3, 1, 2},             Ca00:
                     {1, 2, 3}, {−3, −1, 2},           {0, 1, 3, 2, 7, 6, 4, 5}
                     {−1, −2, 3}, {−3, 1, −2},
                     {−3, 1, −2}, {2, −3, −1}

Ca00.cTI             {2, 3, 1}, {3, 1, 2},             Ca00:
                     {1, 2, 3}, {2, −3, −1},           {0, 1, 3, 2, 7, 6, 4, 5}
                     {2, 3, 1}, {1, 2, 3},
                     {−3, 1, −2}, {2, −3, −1}

Ca00.c4I             {2, 3, 1}, {3, 1, 2},             Ca00:
(aka “Butz”)         {3, 1, 2}, {−1, −2, 3},           {0, 1, 3, 2, 7, 6, 4, 5}
                     {−1, −2, 3}, {−3, 1, −2},
                     {−3, 1, −2}, {2, −3, −1}

Si00.cc.LT.I3.II     {3, 1, 2}, {1, 2, 3},             Si00:
                     {2, 3, 1}, {2, 3, 1},             {0, 5, 1, 4, 7, 6, 2, 3}
                     {2, −3, −1}, {−3, 1, −2},
                     {1, 2, 3}, {2, −3, −1}

Table 2.3: Base patterns and transformations that define the five face-continuous, order-preserving, coordinate-shifting 3D Hilbert curves[61]

3D Moore curve   The set of transformations that are applied only at the first iteration of the Hilbert curve to make it loop can also be found in 3D, but they vary depending on the base pattern used. The different transformations that create Moore curves with the base pattern Ca00 or Si00 are given in Table 2.4.

Base pattern                  Transformations (in the base pattern order)

Ca00:                         {2, 3, 1}, {3, 1, 2}, {1, 2, 3}, {−3, −1, 2},
{0, 1, 3, 2, 7, 6, 4, 5}      {3, −1, −2}, {1, 2, 3}, {−3, 1, −2}, {2, −3, −1}

Si00:                         {3, 1, 2}, {1, 2, 3}, {2, 3, 1}, {2, 3, 1},
{0, 5, 1, 4, 7, 6, 2, 3}      {2, −3, −1}, {−3, 1, −2}, {1, 2, 3}, {2, −3, −1}

Table 2.4: The transformations that need to be applied at the first iteration of the Moore curve depending on the base pattern used

2.8 Other implementations of Hilbert ordering

Sorting points following the ordering given by a space-filling curve (SFC) is the most popular usage of SFCs. Computing the Hilbert/Moore indices with the method of Listing 2.9 is lightning fast: a vectorized parallel computation of all indices is only two times slower than a parallel computation of the bounding box of the same points. In comparison, sorting the same points by their 1D indices is approximately 1000× slower using a top-notch parallel sorting algorithm. Some papers claim that they can speed up the 2D point sorting process by computing only the necessary bits of the Hilbert indices, i.e. by increasing k locally until only a few points fall in the same grid cell[118, 28]. In fact, those implementations use an analog of an MSD 2-bit radix sort. For each iteration of the curve, they compute the corresponding quadrant into which each point falls, which gives 2 bits per point. Then, they sort points following the order of the quadrants, as an MSD 2-bit radix sort does. Each bucket is sorted along sub-quadrants, and so on recursively, until there is no need to refine anymore because few points are left in the current bucket. This technique has the advantage of not needing to store Hilbert indices. It is therefore possible to compute extremely refined Hilbert orderings without a big memory requirement. However, when computing the Hilbert indices with k = 32 with our method, yielding 2^32 = 4 294 967 296 cells per dimension, points are rarely grouped together in the same cell and the Hilbert indices only take 64 bits per point. The advantage of adaptive methods therefore becomes rather dubious. Indeed, comparison-based sorting algorithms will not take more time if the Hilbert indices are highly refined. Bucket sort and any MSD radix sort will also stop once the buckets or bins contain less than two elements, leaving unneeded bits unused. In practice, their method has no real advantage, because the time it takes to compute Hilbert indices is negligible compared to the time it takes to sort points. Furthermore, adaptive methods have a big issue: a 2-bit MSD radix sort is very difficult to parallelize efficiently. Our method, however, is totally parallel: the for loop computing Hilbert indices is trivially parallelized and the most efficient parallel sorting algorithm can be freely chosen for sorting points following their associated Hilbert index. In addition, our mesh generation library needs to store the full Hilbert/Moore indices anyway, because our partitioning method uses them, as explained in Section 2.9. Lookup-tables are actually rarely used in the wild when computing Hilbert indices. If you only occasionally want to calculate the Hilbert index of a single point, which can be useful in some cases, using a lookup-table can degrade the performance of the program because the table needs to be loaded into cache memory first and it will invalidate cache entries used by the rest of the program. In addition, most codes available on the net are not limited to 2D or 3D. They have a general function that takes n k-bit integers representing the position of a cell in an n-dimensional 2^k × ... × 2^k grid, and returns a kn-bit integer representing the Hilbert index. Because the number of possible configurations quickly explodes when n gets bigger than 4, using lookup-tables in higher dimensions becomes impractical from a memory bandwidth standpoint. Indeed, there are 2^n · n!
symmetries of the hypercube of order n, corresponding to the hyperoctahedral group Bn. Hence, using lookup-tables is only efficient in practice for n ⩽ 4, where the tables can still fit into fast cache memory and when enough Hilbert indices need to be computed such that cache invalidation is not a problem. Therefore, many algorithms compute what is stored in our lookup tables on the fly. Fortunately, it can be done very efficiently, although not as efficiently as loading a value from the highest level of cache. Butz's algorithm uses the inverse binary-reflected Gray code for computing the base pattern and binary operations that yield the sub-quadrant transformations in any number of dimensions[27, 26]. Most implementations use the same procedurally generated combination of base pattern and transformations as Butz[71, 60, 61]. In addition, most implementations do not store the current transformation. The transformations are directly applied to the k-bit integer coordinates (i, j, . . .). All transformations, as given in Figure 2.10 for the 2D case, can be decomposed into axis permutations and reflections. Therefore, the transformation of a quadrant at iteration q can be done very efficiently:

1. Axis permutation can simply be done by permuting iq→k and jq→k.

2. Reflection of rows or columns is done by negating all bits of iq→k or jq→k (a small sketch is given after this list).
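As a minimal sketch of these two operations (our illustration with hypothetical names, not the thesis code), a 2D transformation can be applied to the low-order, not-yet-consumed bits of the coordinates as follows:

#include <stdint.h>

/* Apply one of the transformations of Figure 2.10 directly to the
 * coordinates (i, j): 'swap' permutes the two axes, while 'negi' and 'negj'
 * reflect the rows or columns by complementing the 'remaining' low-order
 * bits, i.e. the bits that have not been consumed yet by the recursion. */
static void applyTransform2D(uint32_t *i, uint32_t *j, int remaining,
                             int swap, int negi, int negj) {
    uint32_t mask = (remaining >= 32) ? ~0u : ((1u << remaining) - 1);
    if (swap) { uint32_t t = *i; *i = *j; *j = t; }
    if (negi) *i ^= mask;
    if (negj) *j ^= mask;
}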

In a 2004 paper, Skilling is able to save some instructions by combining the generation of the base pattern, as given by Butz, with the transformations[114]. Instead of mapping n bits to their Gray code k times, Skilling only applies the Gray code once to a full nk-bit integer. Skilling figured out how the transformations need to be modified in order to get the same result. It appears that some operations for finding the base pattern and applying transformations cancel each other out, as Skilling's algorithm uses even fewer instructions than the improved version of Butz's algorithm given by Lawder[71]. However, as good as Skilling's code is for the general n-dimensional case, our method based on lookup-tables is a lot faster in 2D and 3D when computing the Hilbert or Moore indices of big point sets. Mesh generators typically use very inefficient methods to get the Hilbert ordering. Most often, they do not map floating-point coordinates to integers. Therefore, they repeatedly compute the coordinates of the center and determine if a point is in a quadrant using comparisons. Afterwards, rotations and reflections are applied to the coordinates in a simple manner, sometimes using transformation matrices. Such an implementation is very inefficient. Later in this thesis, the repeated use of the Moore curve in our mesh generation software will be explained. A bad Moore curve implementation would have completely prevented our strategy from being effective. Indeed, even if Hilbert indices are most commonly used as a sort key, our partitioning method requires that we calculate many more indices than we sort points.

2.9 SFC partitioning

Cutting a space-filling curve (SFC) into p sections gives p partitions of the space. Indeed, given an SFC divided into p distinct parts, a partition ti (i ∈ [0 . . . p−1]) is simply defined as the set of grid cells which are visited by the ith section of the curve. The partition is thus completely defined by its first SFC index and its length on the space-filling curve. Given p partitions, the partition ti begins at index Mi ∈ [0 . . . 2^(kn)[ and ends at index Mj − 1, with j = (i + 1) mod p. When p partitions have been created, knowing if a point q of Moore index Mq is inside a partition is extremely simple. Considering that the C standard specifies that unsigned integers wrap around, uint64_t(x + y) = (x + y) mod 2^64, if Mi and Mj are stored in a single uint64_t (requiring kn < 64), then these two inequalities are totally equivalent:

uint64_t(Mq − Mi) < Li = uint64_t(Mj − Mi)
(Mq − Mi) mod 2^(kn) < (Mj − Mi) mod 2^(kn).

In addition, these inequalities can only be true if the point q is in the partition ti. Knowing whether a point q is inside a partition ti can therefore be done with only two basic operations if both the start Mi and the length Li of the partition are stored: one subtraction and one comparison.
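This membership test is small enough to write down completely; the sketch below (our illustration, the function name is hypothetical) stores a partition as its starting Moore index and its length on the curve, exactly as described above.

#include <stdbool.h>
#include <stdint.h>

/* True when the Moore index Mq falls inside the partition that starts at
 * Moore index Mi and spans Li consecutive indices along the curve. The
 * unsigned wraparound handles partitions containing the start/end of the
 * Moore curve. */
static inline bool inPartition(uint64_t Mq, uint64_t Mi, uint64_t Li) {
    return (uint64_t)(Mq - Mi) < Li;   /* one subtraction, one comparison */
}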

Moore partitions

(Panels (a) and (b) show the Moore indices of the cells containing the points, with the start and end of each curve marked; panels (c) and (d) list, for each of the four threads, the start and end Moore index of its partition and the number of points it contains.)

Figure 2.16: Partitions of the Moore curve with a uniform underlying grid (left) and a non-uniform grid, created from non-uniform intervals as explained in Section 2.4 (right). In both cases, each partition contains 5 points. Indeed, the starting and ending Moore indices of each partition are defined in a way that ensures a fair distribution of points.

Let M ∈ [0 . . . 2^(nk)[ be the index of a cell cM along an n-dimensional Moore curve; the indices (M + 1) mod 2^(nk) and (M − 1) mod 2^(nk) then correspond to cells that share a facet of dimension n − 1 with cM, even at the beginning and the end of the curve. Hence, any section of the Moore curve forms a face-continuous set of cells, even if the section contains the beginning and end of the curve. In addition, the great locality properties of the Hilbert and Moore curves mean that small sections of these curves map to small regions of n-dimensional space. Therefore, the partitions given by this method are usually compact. Furthermore, using SFC partitioning, it is very easy to create partitions that each contain the same number of points from a given point set. The points need to be sorted by their 1D indices on the SFC. At this point, a section of the list of sorted points corresponds to a section of the SFC, which in turn corresponds to a partition. By dividing the list into equal-sized sections, the corresponding partitions contain the same number of points. In Figure 2.16c, 4 Moore partitions were created from a list of 20 points sorted by their Moore indices. Hence, each partition ti begins at the index given by the (5i)th point in the sorted list. A partition ends at the index preceding the index where the next partition t(i+1) mod 4 begins. Figure 2.16d shows partitions based on a non-uniform grid (Figure 2.16b), created from non-uniform intervals, as explained in Section 2.4. In addition, the Moore indices were subjected to a circular shift by f ∈ Z positions. More specifically, an original Moore index M ∈ [0 . . . 2^(kn)[, where k is the level of refinement of the Moore curve and n is the number of dimensions, was shifted to M′ = (M + f) mod 2^(kn). By changing the grid and shifting the Moore indices, it is possible to obtain SFC partitions that are vastly different.
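A minimal sketch of this balanced-partition construction (our illustration with hypothetical names, not the thesis code) takes the sorted Moore indices of the points and derives the start and length of each partition:

#include <stdint.h>
#include <stddef.h>

typedef struct { uint64_t start, length; } Partition;

/* Build p partitions of (roughly) equal point counts from the Moore indices
 * of nPoints points, already sorted in increasing order. Partition i starts
 * at the index of point i*nPoints/p; its length is the (wrapping) distance
 * to the start of the next partition, so the last partition may contain the
 * beginning and end of the Moore curve. */
void makePartitions(const uint64_t *sortedMoore, size_t nPoints,
                    Partition *parts, int p) {
    for (int i = 0; i < p; i++)
        parts[i].start = sortedMoore[(size_t) i * nPoints / p];
    for (int i = 0; i < p; i++)
        parts[i].length = parts[(i + 1) % p].start - parts[i].start;
}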

Use for parallelization

Suppose that different threads in a program need to do local, constant-time modifications around each point of a point set, but different threads cannot work on the same portion of space at the same time. SFC partitioning allows the work to be perfectly divided between multiple threads, and the compactness of the partitions limits the risk of conflicts. SFC partitioning is thus very useful for parallelization and is employed in various applications for this purpose[40, 8, 6, 74, 77]. It is often useful to know if a point with no associated workload is inside a partition or not. This point must not be sorted with the points that have an associated workload in order to keep an optimal load-balancing between threads. Only its SFC index needs to be computed, which is usually extremely quick.

k-d tree SFC partitioning

A k-d tree provides a pseudo-grid which is more refined where more points are present. Indeed, the traversal of a k-d tree, similar to the tree in Figure 2.6, naturally defines a sort of Z-order. Using the mortonToMoore() function, it is thus easy to reorder cells to obtain a Moore curve on this pseudo-grid. The resulting Moore partitions are usually better than those obtained with a standard grid. However, the creation of a k-d tree more or less requires sorting the n-dimensional points n times: once for each dimension⁴. In 3D for example, it means that partitions based on a k-d tree pseudo-grid will be created approximately three times slower than partitions on a standard (not necessarily uniform) grid. A balance must therefore be struck between SFC partitioning quality and speed.

⁴This is not necessarily the case: an approximated median can be computed from a subset of the points to speed up the process. However, the same is true for creating partitions with a standard grid: only a subset of the points can be sorted and partitions will still be balanced on average.

Chapter 3

Delaunay Triangulation

This chapter, together with Chapter 4 and Chapter 5, is mostly a reproduction of our paper entitled “One machine, one minute, three billion tetrahedra”[81]. This chapter details our sequential Delaunay kernel.

The Delaunay triangulation is a fundamental geometrical object that associates a unique triangulation to a given point set in general position. This triangulation and its dual, the Voronoi diagram, have locality properties that make them ubiquitous in various domains[14]: mesh generation[33], surface reconstruction from 3D scanner point clouds[19], astrophysics[31, 94], terrain modeling[55], etc. Delaunay triangulation algorithms now face a major challenge: the availability of more and more massive point sets. LiDAR and other photogrammetry technologies are used to survey the surface of entire cities and even countries, like the Netherlands[84]. The size of finite element meshes is also growing with the increased availability of massively parallel numerical solvers. It is now common to deal with meshes of several hundred million tetrahedra[95, 63, 122]. Parallelizing 3D Delaunay triangulations and Delaunay-based meshing algorithms is however very challenging, and the mesh generation process is nowadays considered a technological bottleneck in computational engineering[115]. Most of today's clusters have two levels of parallelism. Distributed memory systems contain thousands of nodes, each node being itself a shared memory system with multiple cores. In recent years, nodes have seen their number of cores and the size of their memory increase, with some clusters already featuring 256-core processors and up to 12 TB of RAM[88, 52]. As many-core shared memory machines are becoming standard, Delaunay triangulation algorithms designed for shared memory should not only scale well up to 8 cores

but to several hundred cores. In this paper, we complete the approach presented in our previous research note[80] and show that a billion tetrahedra can be computed very efficiently on a single many-core shared memory machine. This chapter details our sequential implementation of the Delaunay triangulation algorithm in 3D, which is able to triangulate a million points in about 2 seconds. In comparison, the fastest 3D open-source sequential programs (to the best of our knowledge), TetGen[113], CGAL[24] and Geogram[73], all triangulate a million points in about 6 seconds on one core of a high-end laptop. Our implementation is also based on the incremental Delaunay insertion algorithm, but the gain in performance is to be ascribed to the details of the specific data structures we have developed, to the optimization of the geometric predicate evaluations, and to specialized adjacency computations. This sequential implementation will later be integrated in a parallel version (Chapter 4) using partitions based on the Hilbert curve (Chapter 2). The parallel version will itself be integrated into a mesh refinement process, forming the basis of our tetrahedral mesh generator (Chapter 5). This mesh generator will later be enhanced by a mesh improvement step (Chapter 6) and its overall performance will be tested against two reference open-source implementations that use similar techniques: TetGen and Gmsh (Chapter 7).

3.1 Bowyer-Watson algorithm

The Delaunay triangulation DT(S) of a point set S has the fundamental geometrical property that the circumsphere of any tetrahedron contains no other point of S than those of the considered tetrahedron. More formally, a triangulation¹ T(S) of the n points S = {p1, . . . , pn} ∈ R³ is a set of non-overlapping tetrahedra that covers exactly the convex hull Ω(S) of the point set and leaves no point pi isolated. If the empty circumsphere condition is verified for all tetrahedra, the triangulation T(S) is said to be a Delaunay triangulation. If, additionally, S contains no group of 4 coplanar points and no group of 5 cospherical points, then it is said to be in general position, and the Delaunay triangulation DT(S) is unique. The fastest serial algorithm to build the 3D Delaunay triangulation DT(S) is probably the Bowyer-Watson algorithm, which works by incremental insertion of points in the triangulation. The Bowyer-Watson algorithm, presented in §3.2, was devised independently by Bowyer and Watson in 1981 [25, 124]. Efficient open-source implementations are available: TetGen [113], CGAL [24] and Geogram [73].

¹This paper is about 3D meshing. Still, we use the generic term triangulation instead of tetrahedralization.

They are designed similarly and therefore offer similar performance (Table 3.1). In the remainder of this section, the incremental insertion algorithm is recalled and we describe the dedicated data structures that we developed, as well as the algorithmic optimizations that make our sequential implementation three times faster than the reference ones.

3.2 Algorithm overview

Let DTk be the Delaunay triangulation of the subset Sk = {p1, . . . , pk} ⊂ S. The Delaunay kernel is the procedure to insert a new point pk+1 ∈ Ω(Sk) into DTk, and construct a new valid Delaunay triangulation DTk+1 of Sk+1 = {p1, . . . , pk, pk+1}. The Delaunay kernel can be written in the following abstract manner:

DTk+1 ← DTk − C(DTk, pk+1) + B(DTk, pk+1),   (3.1)

where the Delaunay cavity C(DTk, pk+1) is the set of all tetrahedra whose circumsphere contains pk+1 (Figure 3.1b), whereas the Delaunay ball B(DTk, pk+1) is a set of tetrahedra filling up the polyhedral hole obtained by removing the Delaunay cavity C(DTk, pk+1) from DTk (Figure 3.1c).
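For illustration only (this is our sketch, not the thesis code), the cavity test of Equation 3.1 boils down to the sign of an exact insphere predicate; the declaration below assumes a Shewchuk-style predicate where a positive value means "inside the circumsphere" for a positively oriented tetrahedron.

double insphere(double *a, double *b, double *c, double *d, double *e);

/* Does point p lie strictly inside the circumsphere of the positively
 * oriented tetrahedron (a, b, c, d)? If so, that tetrahedron belongs to
 * the Delaunay cavity C(DTk, pk+1) when p = pk+1. */
static int inCircumsphere(double *a, double *b, double *c, double *d,
                          double *p) {
    return insphere(a, b, c, d, p) > 0.0;
}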

                                  Ours   Geogram   TetGen   CGAL
Sequential_Delaunay               12.7      34.6     32.9    33.8
  Init + Sort                      0.5       4.2      2.1     1.3
  Incremental insertion           12.2      30.4     30.8    32.5
    Walk                           1.0       2.1      1.6     1.4
      orient3d                     0.7       1.4      1.1   ≈ 0.5
    Cavity                         6.2      11.4     ≈ 10    14.9
      inSphere                     3.2       6.2      5.6    10.5
    DelaunayBall                   4.5      12.4     ≈ 15    15.3
    Computing sub-determinants     1.3         /        /       /
    Other operations               0.5       4.5      ≈ 4     ≈ 1

Table 3.1: Timings for the different steps of the Delaunay incremental insertion (Algorithm 1) for four implementations: Ours, Geogram[73], TetGen[113] and CGAL[24]. Timings in seconds are given for 5 million points (random uniform distribution). The ≈ prefix indicates that no accurate timing is available.

The complete state-of-the-art incremental insertion algorithm (Algorithm 1) that we implemented has five main steps that are described in the following.

Init The triangulation is initialized with the tetrahedron formed by the first four non-coplanar vertices of the point set S. These vertices define a tetrahedron τ with a positive volume.

Sort Before starting point insertion, the points are sorted so that two points that have close indices are close in space. Used alone, this order would result in cavities C(DTk, pk+1) containing a pathologically large number of tetrahedra. Insertions are organized in different randomized stages to avoid this issue[11]. The three kernel functions Walk, Cavity and DelaunayBall thereby have a constant complexity in practice. We have implemented a very fast sorting procedure (Section 3.4).

Walk The goal of this step is to identify the tetrahedron τk+1 enclosing the next point to insert pk+1. The search starts from a tetrahedron τk in the last Delaunay ball B(DTk, pk), and walks through the current triangulation DTk in the direction of pk+1 (Figure 3.1a). We say that a point is visible from a facet when the tetrahedron defined by this facet and this point has negative volume. The Walk function thus iterates on the four facets of τ, selects one from which the point pk+1 is visible, and then walks across this facet to the adjacent tetrahedron.

Algorithm 1 Sequential computation of the Delaunay triangulation DT of a set of vertices S
Input: S
Output: DT(S)                                  ▷ Section 3.3
 1: function Sequential_Delaunay(S)
 2:     τ ← Init(S)                            ▷ τ is the current tetrahedron
 3:     DT ← τ
 4:     S′ ← Sort(S \ τ)                       ▷ Section 3.4
 5:     for all p ∈ S′ do
 6:         τ ← Walk(DT, τ, p)
 7:         C ← Cavity(DT, τ, p)               ▷ Section 3.5
 8:         DT ← DT \ C
 9:         B ← DelaunayBall(C, p)             ▷ Section 3.6
10:         DT ← DT ∪ B
11:         τ ← t ∈ B
12:     return DT


Figure 3.1: Insertion of a vertex pk+1 in the Delaunay triangulation DTk: (a) The triangle containing pk+1 is obtained by walking toward pk+1. The Walk starts from τ ∈ B(DTk, pk). (b) The Cavity function finds all cavity triangles (orange) whose circumcircle contains the vertex pk+1. They are deleted, while cavity adjacent triangles (green) are kept.

(c) The DelaunayBall function creates new triangles (blue) by connecting pk+1 to the edges of the cavity boundary.

This new tetrahedron is called τ and the Walk process is repeated until none of the facets of τ sees pk+1, which is equivalent to saying that pk+1 is inside τ (see Figure 3.1a). The visibility walk algorithm is guaranteed to terminate for Delaunay triangulations [39]. If the points have been sorted, the number of walking steps is essentially constant [98]. Our implementation of this robust and efficient walking algorithm is similar to other available implementations.
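To make the procedure concrete, here is a minimal, non-robust sketch of such a visibility walk written against the mesh_t and point3d_t structures introduced in Section 3.3 (Listing 3.1). The orient3d_fast() helper and the sign comparison against the opposite vertex are illustrative choices made for this example only; they sidestep the facet-orientation bookkeeping of the real code and this is not the actual implementation.

#include <stdint.h>

/* Signed volume (times 6) of tetrahedron (a, b, c, d): (b-a)·((c-a)×(d-a)).
 * Plain floating-point version, for illustration only (not robust).       */
static double orient3d_fast(const double* a, const double* b,
                            const double* c, const double* d)
{
    double u[3], v[3], w[3];
    for (int i = 0; i < 3; i++) {
        u[i] = b[i] - a[i];
        v[i] = c[i] - a[i];
        w[i] = d[i] - a[i];
    }
    return u[0] * (v[1] * w[2] - v[2] * w[1])
         + u[1] * (v[2] * w[0] - v[0] * w[2])
         + u[2] * (v[0] * w[1] - v[1] * w[0]);
}

/* Walk from tetrahedron t toward point p and return the tetrahedron
 * enclosing p.  Facet f of t is opposite vertex f; the walk steps across
 * facet f whenever p lies on the other side of that facet than the
 * opposite vertex.                                                        */
static uint64_t walk(const mesh_t* m, uint64_t t, const double* p)
{
    for (;;) {
        const double* v[4];
        for (int i = 0; i < 4; i++)
            v[i] = m->vertices.vertex[m->tetrahedra.vertex_ID[4 * t + i]].coordinates;

        int stepped = 0;
        for (int f = 0; f < 4 && !stepped; f++) {
            const double* a = v[(f + 1) & 3];
            const double* b = v[(f + 2) & 3];
            const double* c = v[(f + 3) & 3];
            if (orient3d_fast(a, b, c, p) * orient3d_fast(a, b, c, v[f]) < 0.0) {
                t = m->tetrahedra.neighbor_ID[4 * t + f] / 4;  /* step across facet f */
                stepped = 1;
            }
        }
        if (!stepped)
            return t;   /* no facet of t sees p: t encloses p */
    }
}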

Cavity Once the tetrahedron τ ← τk+1 that contains the point to insert pk+1 has been identified, the function Cavity finds all tetrahedra whose circumsphere contains pk+1 and deletes them. The Delaunay cavity C(DTk, pk+1) is simply connected and contains τ [109]; it is therefore built using a breadth-first search algorithm. The core and most expensive operation of the Cavity function is the inSphere predicate, which evaluates whether a point e is inside, on, or outside the circumsphere of a given tetrahedron. The Cavity function is thus an expensive part of the incremental insertion, accounting for about 33% of the total computation time (Table 3.1). To accelerate it, we propose in Section 3.5 to precompute sub-components of the inSphere predicate.

DelaunayBall Once the cavity has been carved, the DelaunayBall function first generates a set of new tetrahedra adjacent to the newly inserted point pk+1 and filling up the cavity, and then updates the mesh structure. In particular, the mesh update consists in the computation of adjacencies between the newly created tetrahedra. This is the most expensive step of the algorithm, with about 40% of the total computation time (Table 3.1). To accelerate this step, we replace general purpose elementary operations like tetrahedron creation/deletion or adjacency computation with batches of optimized operations benefiting from a cavity-specific data structure (Section 3.6).

3.3 Mesh data structure

One key to enhancing the performance of a Delaunay triangulation algorithm resides in the optimization of the data structure used to store the triangulation. Various data structure designs have been proposed that are very flexible and allow representing hybrid meshes and high-order meshes, adding mesh elements of any type, or managing adjacencies around vertices, edges, etc. The versatility of such general purpose data structures has a cost, both in terms of storage and efficiency. Here, our aim is to have a structure as lightweight and fast as possible, dealing exclusively with 3D triangulations. Our implementation is coded in plain C language, with arrays of doubles, floats, and integers to store

mesh topology and geometry. This seemingly old-style coding has important advantages in terms of optimization and parallelization because it compels us to use simple and straightforward algorithms.

typedef struct {
  double coordinates[3];
  uint64_t padding;
} point3d_t;

typedef struct {
  struct {
    uint32_t* vertex_ID;
    uint64_t* neighbor_ID;
    double*   sub_determinant;
    uint64_t  num;            // number of tetrahedra
    uint64_t  allocated_num;  // capacity [in tetrahedra]
  } tetrahedra;

  struct {
    point3d_t* vertex;
    uint32_t   num;           // number of vertices
    uint32_t   allocated_num; // capacity [in vertices]
  } vertices;
} mesh_t;

Listing 3.1: The aligned mesh data structure mesh_t we use to store the vertices and tetrahedra of a Delaunay triangulation in 3D.

The mesh data only contains vertices and tetrahedra explicitly, and all topological and geometrical information can be deduced from it. However, in order to speed up mesh operations, it is beneficial to store additional connectivity information. A careful trade-off must however be made between computational time and memory space. The only connectivity information we have chosen to store is the adjacency between tetrahedra, as this allows walking through the mesh using local queries only.

Vertices Vertices are stored in a single array of structures point3d_t (see Listing 3.1). For each vertex, in addition to the vertex coordinates, a padding variable is used to align the structure to 32 bytes (3 doubles of 8 bytes each, and an additional padding variable of 8 bytes sum up to a structure of 32 bytes) and conveniently store temporarily some auxiliary vertex related values at different stages of the algorithm. Memory alignment ensures that a vertex does not overlap two cache lines during memory transfer. Modern computers usually work with cache lines of 64 bytes. The padding variable in the vertex structure ensures that a vertex is always loaded in one single memory fetch. The biggest advantage here is that we can load an additional padding variable without requiring any additional bandwidth.

Tetrahedra Each tetrahedron knows about its 4 vertices and its 4 neighboring tetrahedra. These two types of adjacencies are stored in separate arrays. The main motivation for this storage is flexibility and, once more, memory alignment. Keeping good memory alignment properties on a tetrahedron structure evolving with the implementation is cumbersome. In addition, it provides little to no performance gain in this case. On the other hand, with parallel arrays, additional information per tetrahedron (e.g. a color for each tetrahedron, sub-determinants, etc.) can be added easily without disrupting the memory layout. Each tetrahedron is identified by the indices of its four vertices in the vertices structure. Vertex indices of tetrahedron t are read between positions 4*t and 4*t+3 in the global array tetrahedra.vertex_ID storing all tetrahedron vertices. Vertices are ordered so that the volume of the tetrahedron is positive. An array of doubles, sub_determinant, is used to store 4 values per tetrahedron. This space is used to speed up geometric predicate evaluation (see Section 3.5).

Adjacencies By convention, the i-th facet of a tetrahedron is the facet opposite the i-th vertex, and the i-th neighbor is the tetrahedron adjacent to that facet. In order to travel efficiently through the triangulation, facet indices are stored together with the indices of the corresponding adjacent tetrahedron, thanks to an integer division and its modulo. The scheme is simple. Each adjacency is represented by the integer obtained by multiplying by four the index of the tetrahedron and adding the internal index of the facet in the tetrahedron. Take, for instance, two tetrahedra t1 and t2 sharing facet f, whose index is respectively i1 in t1 and i2 in t2. The adjacency is then represented as

• tetrahedra.neighbor_ID[4t1+i1]=4t2+i2,

• tetrahedra.neighbor_ID[4t2+i2]=4t1+i1.

This multiplexing avoids costly looping over the facets of a tetrahedron. The fact that it reduces by a factor 4 the maximum number of elements in a mesh is not a real concern since, element indices and adjacencies being stored as 64-bit unsigned integers, the maximal number of elements in a mesh is 2^62 ≃ 4.6 × 10^18, which is huge. Note also that division and modulo by 4 are very cheap bitwise operations for unsigned integers: i/4 = i ≫ 2 and i%4 = i & 3.
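For illustration, the following minimal helpers (hypothetical names, written against the mesh_t of Listing 3.1) show how such a packed adjacency is typically encoded and decoded with the shift and mask mentioned above; the actual code manipulates the packed values directly.

#include <stdint.h>

/* An adjacency is the tetrahedron index times four plus the facet index. */
static inline uint64_t packAdjacency(uint64_t t, unsigned f) { return 4 * t + f; }
static inline uint64_t adjacentTet(uint64_t adjacency)   { return adjacency >> 2; } /* /4 */
static inline unsigned adjacentFacet(uint64_t adjacency) { return adjacency & 3;  } /* %4 */

/* Linking tetrahedra t1 and t2 across their facets i1 and i2 then amounts
 * to two symmetric stores into tetrahedra.neighbor_ID:
 *   neighbor_ID[4*t1 + i1] = packAdjacency(t2, i2);
 *   neighbor_ID[4*t2 + i2] = packAdjacency(t1, i1);                       */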

Memory footprint The key to an efficient data structure is the balance between its memory footprint and the computational cost of its modification. We do not expect to generate meshes of more than UINT32_MAX vertices, i.e. about 4 billion vertices, on one single machine. Each vertex therefore occupies

32 bytes: 24 bytes for its coordinates and 8 bytes for the padding variable. On the other hand, the number of tetrahedra could itself be larger than 4 billion, so that a 64-bit integer is needed for element indexing.

[Figure 3.2 schematic: four adjacent tetrahedra t0, t1, t2 and t3 with vertices a, b, c, d, e, f, g; the table below gives one of their possible memory representations.]

memory index   vertex_ID   neighbor_ID
4t0 + 0        a           4t1 + 3
4t0 + 1        b           4t2 + 3
4t0 + 2        c           4t3 + 3
4t0 + 3        d           −
4t1 + 0        b           −
4t1 + 1        c           −
4t1 + 2        d           −
4t1 + 3        e           4t0 + 0
4t2 + 0        a           −
4t2 + 1        d           −
4t2 + 2        c           −
4t2 + 3        f           4t0 + 1
4t3 + 0        a           −
4t3 + 1        b           −
4t3 + 2        d           −
4t3 + 3        g           4t0 + 2

Figure 3.2: Four adjacent tetrahedra t0, t1, t2, t3 and one of their possible memory representations in the tetrahedra data structure given in Listing 3.1. tetrahedra.neighbor_ID[4ti + j]/4 gives the index of the adjacent tetrahedron opposite to tetrahedra.vertex_ID[4ti + j] in the tetrahedron ti, and tetrahedra.neighbor_ID[4ti + j] gives the index where the inverse adjacency is stored.

Each tetrahedron occupies 80 bytes: 4 × 4 = 16 bytes for the vertices, 4 × 8 = 32 bytes for the neighbors, and 32 bytes again for the sub-determinants. On average, a tetrahedral mesh of n vertices has a little more than 6n tetrahedra. Thus, our mesh data structure requires ≈ 6 × 80 + 32 = 512 bytes per vertex.

3.4 Fast spatial sorting

The complexity of the Delaunay triangulation algorithm depends on the number of tetrahedra traversed during the walking steps, and on the number of tetrahedra in the cavities. An appropriate spatial sorting is required to improve locality, i.e., reduce the walking steps, while retaining enough randomness to have cavity sizes independent of the number of vertices already inserted in the mesh (by cavity size, we mean the number of tetrahedra in the cavity, not the volume of the cavity). An efficient spatial sorting scheme has been proposed by Boissonnat et al.[23] and is used in state-of-the-art Delaunay triangulation implementations. The main idea is to first shuffle the vertices, and to group them in rounds of increasing size (a first round with, e.g., the first 1000 vertices, a second with the next 7000 vertices, etc.) as described by the Biased Randomized Insertion Order (BRIO)[11]. Then, each group is ordered along a space-filling curve. The space-filling curve should have the property that successive points on the curve are geometrically close to each other. With this spatial sorting, the number of walking steps between two successive cavities remains small and essentially constant[98]. Additionally, proceeding by successive rounds according to the BRIO algorithm tends to reduce the average size of cavities.

The Hilbert curve is a continuous self-similar (fractal) curve. It has the interesting property of being space-filling, i.e., of visiting exactly once all the cells of a regular grid with 2^m × 2^m × 2^m cells, m ∈ N. A Moore curve is a closed Hilbert curve. Hilbert and Moore curves have the sought spatial locality properties, in the sense that points close to each other on the curve are also close to each other in R³[64, 62, 7]. Any 3D point set can be ordered by a Hilbert/Moore curve according to the order in which the curve visits the grid cells that contain the points. The Hilbert/Moore index is thus an integer value d ∈ {0, 1, 2, . . . , 2^3m − 1}, and several points might have the same index. The bigger m is for a given point set, the smaller the grid cells are, and hence the lower the probability of having two points with the same Hilbert/Moore index. Given a 3D point set with n points, there are on average n/2^3m points per grid cell. Therefore, choosing m = k log2(n) with k constant ensures that the average number of points in grid cells is independent of n. If known in advance, the minimum mesh size can also be taken into account: m can be chosen such that a grid cell never contains more than a certain number of vertices, thus capturing non-uniform meshes more precisely. For Delaunay triangulation purposes, the objective is both to limit the number of points with the same index (in the same grid cell) and to keep indices within as small a range as possible to accelerate subsequent sorting operations. There is thus a balance to find, so that Hilbert indices are neither too dense nor too sparse.

Computing the Hilbert/Moore index of one cell is a somewhat technical point. The Hilbert/Moore curve is indeed fractal, which means recursively composed of replicas of itself that have been scaled down by a factor two, rotated and optionally reflected. Those reflections and rotations can be efficiently computed with bitwise operations. Various transformations can be applied to point coordinates to emulate non-regular grids, as explained in Section 2.4, a useful functionality we will resort to in Section 4.3.
The full details on how to compute 2D and 3D Hilbert and Moore indices are given in Chapter 2.

Once the Hilbert/Moore indices have been computed, points can be sorted accordingly in a data structure where the Hilbert/Moore index is the key, and the point the associated value. An extremely well suited strategy to sort bounded integer keys is the radix sort[21, 128, 116, 103], a non-comparative sorting algorithm working digit by digit. The base of the digits, the radix, can be freely chosen. Radix sort has a O(wn) computational complexity, where n is the number of {key, value} pairs and w the number of digits (or bits) of the keys. In our case, the radix sort has a complexity in O(mn) = O(n log(n)), where n is the number of points and m = k log2(n) the number of levels of the Hilbert/Moore grid. In general, m is small because a good resolution of the space-filling curve is not needed. Typically, the maximum value among keys is lower than the number of values to be sorted. We say that such keys are short. Radix sort is able to sort them extremely quickly.

The literature on parallel radix sorting is abundant, and impressive performances are obtained on many-core CPUs and GPUs[123, 91, 104, 85, 17, 106]. However, implementations are seldom available and we were not able to find a parallel radix sort implementation properly designed for many-core CPUs. We implemented HXTSort, which is available independently as open source at https://www.hextreme.eu/hxtsort. Figure 3.3 compares the performances of HXTSort with qsort, std::sort and the most efficient implementations that we are aware of for sorting 64-bit key and value pairs. Our implementation is fully multi-threaded and takes advantage of the vectorization possibilities offered by modern computers, such as the AVX512 extensions on the Xeon Phi. It has been developed primarily for sorting Hilbert indices, which are typically short. We use different strategies depending on the key bit size. These are the reasons why HXTSort outperforms the Boost Sort Library, GNU's and Intel's parallel implementations of the standard library, and Intel TBB when sorting Hilbert indices.
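To make the principle concrete, here is a minimal sequential sketch of an LSD radix sort of {key, value} pairs with an 8-bit radix. It only illustrates the idea behind HXTSort, whose actual implementation is multi-threaded, vectorized and adapted to the key bit size; the function name and interface are illustrative.

#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef struct { uint64_t key, value; } pair_t;

/* `bits` is the number of significant key bits (small for Hilbert/Moore
 * indices), so only a few 8-bit passes are needed.                      */
static void radix_sort(pair_t* a, size_t n, unsigned bits)
{
    pair_t* tmp = malloc(n * sizeof(pair_t));
    for (unsigned shift = 0; shift < bits; shift += 8) {
        size_t count[257] = {0};
        for (size_t i = 0; i < n; i++)                /* histogram of this digit */
            count[((a[i].key >> shift) & 0xFF) + 1]++;
        for (int d = 0; d < 256; d++)                 /* exclusive prefix sum    */
            count[d + 1] += count[d];
        for (size_t i = 0; i < n; i++)                /* stable scatter          */
            tmp[count[(a[i].key >> shift) & 0xFF]++] = a[i];
        memcpy(a, tmp, n * sizeof(pair_t));
    }
    free(tmp);
}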

[Figure 3.3 plot: sorting time in seconds versus number of {key, value} pairs sorted (10^3 to 10^9), comparing Ours (HXTSort) with GCC libstdc++ parallel mode std::sort(), Intel PSTL std::sort(par_unseq, ...), Intel TBB parallel_sort(), Boost block_indirect_sort(), sample_sort() and parallel_stable_sort(), GCC libstdc++ std::sort(), and glibc qsort().]

Figure 3.3: Performances of HXTSort for sorting {key, value} pairs (uniform distribu- tion of 64-bit keys, 64-bit values) on an Intel® Xeon PhiTM 7210 CPU and comparison with widely used implementations.


3.5 Improving Cavity: spending less time in geometric predicates

The first main operation of the Delaunay kernel is the construction of the cavity C(DTk, pk+1) formed by all the tetrahedra {a, b, c, d} whose circumscribed sphere encloses pk+1 (Figure 3.1). This step of the incremental insertion represents about one third of the total execution time in available implementations

(Table 3.1). The cavity is initiated with a first tetrahedron τ containing pk+1 determined with the Walk function (Algorithm 1 and Figure 3.1a), and then completed by visiting the neighboring tetrahedra with a breadth-first search algorithm.

Optimization of the inSphere predicate


The most expensive operation of the Cavity function is the fundamental geometrical evaluation of whether a given point e is inside, exactly on, or outside the circumsphere of a given tetrahedron {a, b, c, d} (Table 3.1). This is evaluated using the inSphere predicate that computes the sign of the following determinant:

inSphere(a, b, c, d, e) =
    | ax  ay  az  ∥a∥²  1 |
    | bx  by  bz  ∥b∥²  1 |
    | cx  cy  cz  ∥c∥²  1 |
    | dx  dy  dz  ∥d∥²  1 |
    | ex  ey  ez  ∥e∥²  1 |

  = | bx − ax  by − ay  bz − az  ∥b − a∥² |
    | cx − ax  cy − ay  cz − az  ∥c − a∥² |
    | dx − ax  dy − ay  dz − az  ∥d − a∥² |
    | ex − ax  ey − ay  ez − az  ∥e − a∥² |

This is a very time-consuming computation and, to make it more efficient, we propose to expand the 4 × 4 determinant into a linear combination of four 3 × 3 determinants independent of point e.

inSphere(a, b, c, d, e) =
  − (ex − ax) | by − ay  bz − az  ∥b − a∥² |
              | cy − ay  cz − az  ∥c − a∥² |
              | dy − ay  dz − az  ∥d − a∥² |

  + (ey − ay) | bx − ax  bz − az  ∥b − a∥² |
              | cx − ax  cz − az  ∥c − a∥² |
              | dx − ax  dz − az  ∥d − a∥² |

  − (ez − az) | bx − ax  by − ay  ∥b − a∥² |
              | cx − ax  cy − ay  ∥c − a∥² |
              | dx − ax  dy − ay  ∥d − a∥² |

  + ∥e − a∥²  | bx − ax  by − ay  bz − az |
              | cx − ax  cy − ay  cz − az |
              | dx − ax  dy − ay  dz − az |

Being completely determined by the tetrahedron vertex coordinates, the four 3 × 3 determinants can be pre-computed and stored in the tetrahedron data structure when it is created. The cost of the inSphere predicate then becomes negligible. Notice also that the fourth sub-determinant is minus the tetrahedron volume. We can set it to a positive value to flag deleted tetrahedra during the breadth-first search, thereby saving memory space.

The maximal improvement of the Cavity function obtained with this optimization depends on the number of times the inSphere predicate is invoked per tetrahedron. First, in order to simplify further discussion, we will assume that the number of tetrahedra is about 6 times the number of vertices in the final triangulation[99]. This means that each point insertion results on average in the creation of 6 new tetrahedra. On the other hand, we have seen in Section 3.4 that an appropriate point ordering ensures an approximately constant number of tetrahedra in the cavities. This number is close to 20 in a usual mesh generation context[98]. One point insertion thus results in the deletion of 20 tetrahedra, and the creation of 26 tetrahedra (all figures are approximations). This number is also the number of triangular faces of the cavity, since all tetrahedra created by the DelaunayBall function associate a cavity facet with the inserted vertex pk+1. The inSphere predicate is therefore evaluated positively for the 20 tetrahedra forming the cavity and negatively for the 26 tetrahedra adjacent to the faces of the cavity, a total of 46 calls for each vertex insertion. When n points have been inserted in the mesh, a total of 26n tetrahedra were created and the predicate has been evaluated 46n times. Thus, we may conclude from this analysis that the inSphere predicate is called approximately 46n/26n ≈ 1.77 times per tetrahedron. In consequence, the maximal improvement that can be obtained from our optimization of the inSphere predicate is 1 − 1/1.77 ≈ 43%. Storing the sub-determinants has a memory cost (4 double values per tetrahedron) and a bandwidth cost (loading and storing of sub-determinants). For instance, for n = 4 × 10^6, we observe a speedup of 32% in the inSphere predicate evaluations, taking into account the time spent to compute the stored sub-determinants.

Note that a second geometric predicate is extensively used in Delaunay triangulation. The orient3d predicate evaluates whether a point d is above/on/under the plane defined by three points {a, b, c} ⊂ R³[108]. It computes the triple product (b − a) · ((c − a) × (d − a)), i.e. the signed volume of tetrahedron {a, b, c, d}. This predicate is mostly used in the Walk function.
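A minimal sketch of this optimization is given below, using plain floating-point arithmetic and illustrative helper names; it shows one possible way of storing the four values, not necessarily the exact convention of the actual code. The sub-determinants are computed once per tetrahedron, and the predicate then reduces to a few multiplications involving the e-dependent factors.

static double det3(double a0, double a1, double a2,
                   double b0, double b1, double b2,
                   double c0, double c1, double c2)
{
    return a0 * (b1 * c2 - b2 * c1)
         - a1 * (b0 * c2 - b2 * c0)
         + a2 * (b0 * c1 - b1 * c0);
}

/* Computed once, when tetrahedron {a, b, c, d} is created, and stored in
 * tetrahedra.sub_determinant (Listing 3.1). */
static void computeSubDeterminants(const double* a, const double* b,
                                   const double* c, const double* d,
                                   double sub[4])
{
    double bx = b[0]-a[0], by = b[1]-a[1], bz = b[2]-a[2], bw = bx*bx + by*by + bz*bz;
    double cx = c[0]-a[0], cy = c[1]-a[1], cz = c[2]-a[2], cw = cx*cx + cy*cy + cz*cz;
    double dx = d[0]-a[0], dy = d[1]-a[1], dz = d[2]-a[2], dw = dx*dx + dy*dy + dz*dz;
    sub[0] = det3(by, bz, bw,  cy, cz, cw,  dy, dz, dw);  /* columns y, z, w */
    sub[1] = det3(bx, bz, bw,  cx, cz, cw,  dx, dz, dw);  /* columns x, z, w */
    sub[2] = det3(bx, by, bw,  cx, cy, cw,  dx, dy, dw);  /* columns x, y, w */
    sub[3] = det3(bx, by, bz,  cx, cy, cz,  dx, dy, dz);  /* columns x, y, z */
}

/* At query time, only the e-dependent factors remain to be evaluated. */
static double inSphereFromSubDet(const double* a, const double* e,
                                 const double sub[4])
{
    double ex = e[0]-a[0], ey = e[1]-a[1], ez = e[2]-a[2];
    double ew = ex*ex + ey*ey + ez*ez;
    return -ex * sub[0] + ey * sub[1] - ez * sub[2] + ew * sub[3];
}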

Implementation details Geometrical predicates evaluated with standard floating-point arithmetic may lead to inaccurate or inconsistent results. To have a robust and efficient implementation of the inSphere and orient3d predicates, we have applied the strategy implemented in TetGen[113].

1. Evaluate first the predicate with floating-point precision. This gives a correct value in the vast majority of the cases, and represents only 1% of the code.

2. Use a filter to check whether the obtained result is certain. A static filter is first used, then, if more precision is needed, a dynamic filter is evaluated. If the result obtained by standard arithmetic is not trustworthy, the predicate is computed with exact arithmetic[108].

3. To be uniquely defined, Delaunay triangulations require each point to be inside/outside a sphere, under/above a plane. When a point is exactly on a sphere or a plane, the point is in a non-general position that is slightly perturbed to "escape" the singularity. We implemented the symbolic perturbations proposed by Edelsbrunner[44].
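The structure of this cascade can be sketched as follows. All names and the filter bound are illustrative assumptions: inSphereApprox() stands for the plain floating-point evaluation, maxAbsCoordinate() for a helper returning the largest coordinate magnitude, insphere_exact() for an exact-arithmetic fallback such as Shewchuk's predicates[108], and the dynamic-filter stage is omitted for brevity. This is not the actual implementation.

#include <math.h>

#define STATIC_FILTER_CONSTANT 1e-14   /* indicative only, not the real bound */

static double inSphereRobust(const double* a, const double* b, const double* c,
                             const double* d, const double* e)
{
    /* 1. fast, non-robust floating-point evaluation */
    double det = inSphereApprox(a, b, c, d, e);

    /* 2. static filter: the determinant has degree 5 in the coordinates,
     *    so a bound on its rounding error scales like maxc^5 */
    double maxc  = maxAbsCoordinate(a, b, c, d, e);
    double bound = STATIC_FILTER_CONSTANT * maxc * maxc * maxc * maxc * maxc;
    if (fabs(det) > bound)
        return det;                    /* the sign is certain */

    /* 3. otherwise, fall back to exact arithmetic (and, for points exactly
     *    on the sphere, to symbolic perturbation) */
    return insphere_exact(a, b, c, d, e);
}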

3.6 Improving DelaunayBall: spending less time computing adjacencies

The DelaunayBall function creates the tetrahedra filling the cavity (Figure 3.1c), and updates the tetrahedron structures. In particular, tetrahedron adjacencies are recomputed. This is the most expensive step of the Delaunay triangulation algorithm as it typically takes about 40% of the total time (Table 3.1).

In contrast to existing implementations, we strongly interconnect the cavity building and cavity retriangulation steps. Instead of relying on a set of elegant and independent elementary operations like tetrahedron creation/deletion or adjacency computation, a specific cavity data structure has been developed, which optimizes batches of operations for the specific requirements of the Delaunay triangulation (Listing 3.2).

typedef struct {
  uint32_t new_tetrahedron_vertices[4]; // facet vertices + vertex to insert
  uint64_t adjacent_tetrahedron_ID;
} cavityBoundaryFacet_t;

typedef struct {
  uint64_t adjacency_map[1024]; // optimization purposes, see Section 3.6

  struct {
    cavityBoundaryFacet_t* boundary_facets;
    uint64_t num;            // number of boundary facets
    uint64_t allocated_num;  // capacity [in cavityBoundaryFacet_t]
  } to_create;

  struct {
    uint64_t* tetrahedra_ID;
    uint64_t num;            // number of deleted tetrahedra
    uint64_t allocated_num;  // capacity
  } deleted;
} cavity_t;

Listing 3.2: Cavity specific data structure.

Use cavity building information The tetrahedra reached by the breadth-first search during the Cavity step (Section 3.5) are either inside or adjacent to the cavity. Each triangular facet of the cavity boundary is thus shared by a tetrahedron t1 ∈ A(DTk, pk+1) outside the cavity, by a tetrahedron t2 ∈ C(DTk, pk+1) inside the cavity, and by a newly created tetrahedron inside the cavity t3 ∈ B(DTk, pk+1) (Figure 3.1). The facet of t1 adjacent to the surface of the cavity defines, along with the point pk+1, the tetrahedron t3. We thus know a priori that t3 is adjacent to t1, and store this information in the cavityBoundaryFacet_t structure (Listing 3.2).

Fast adjacencies between new tetrahedra During the cavity construction procedure, the list of vertices on the cavity as well as adjacencies with the tetrahedra outside the cavity are computed. The last step is to compute the adjacencies between the new tetrahedra built inside the cavity, i.e. the tetrahedra of the Delaunay ball.

The first vertex of all created tetrahedra is set to be pk+1, whereas the other three vertices {p1, p2, p3} are on the cavity boundary, and are ordered so that the volume of the tetrahedral element is positive, i.e., orient3d(pk+1, p1, p2, p3) > 0. As explained in the previous section, the adjacency stored at index 4ti + 0, which corresponds to the facet of tetrahedron ti opposite to the vertex pk+1, is already known for every tetrahedron ti in B(DTk, pk+1). Three neighbors are thus still to be determined for each tetrahedron ti, which means practically that an adjacency index has to be attributed to 4ti + 1, 4ti + 2 and 4ti + 3. The internal facets across which these adjacencies have to be identified are made of the common vertex pk+1 and one oriented edge of the cavity boundary. Using an elaborate hash table with complex collision handling would be overkill. We prefer to use a double entry lookup table of dimension n × n, whose rows and columns are associated with the n vertices {pj} of the cavity boundary, to which auxiliary indices 0 ⩽ ij < n are assigned for convenience and stored in the padding variable of the vertex. With this, the unique index of an oriented edge p1p2 is set to be n × i1 + i2, which corresponds to one position in the n × n lookup table. So, for each tetrahedron ti in the Delaunay ball with vertices {pk+1, p1, p2, p3}, the adjacency index 4ti + 1 is stored at position n × i2 + i3 in the lookup table, and similarly 4ti + 2 at position n × i3 + i1 and 4ti + 3 at position n × i1 + i2. A square array with a zero diagonal is built proceeding this way, in which the sought adjacencies are the pairs of symmetric components.

Theoretically, there is no maximal value for the number n of vertices in the cavity but, in practice, we can take advantage of the fact that it remains relatively small. Indeed, the cavity boundary is the triangulation of a topological sphere, i.e. a planar graph in which the number of edges is E = 3F/2, where F is the number of faces, and whose Euler characteristic is n − E + F = 2. Hence, the number of vertices is linked to the number of triangular faces by n = F/2 + 2. As we have shown earlier that F is about 26, we deduce that there are about n = 15 vertices on a cavity boundary. Hence, an nmax × nmax lookup table, with maximum size nmax = 32, is sufficient provided there are at most Fmax = 2(nmax − 2) = 60 tetrahedra created in the cavity. If the cavity exceptionally contains more elements, the algorithm smoothly switches to a simple linear search.
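A minimal sketch of this lookup-table matching is given below. It follows the scheme described above, but function and constant names are illustrative and the bookkeeping of the real cavity_t structure (Listing 3.2) is simplified.

#include <stdint.h>

#define NMAX  32                 /* maximal number of cavity-boundary vertices */
#define EMPTY UINT64_MAX         /* sentinel for unused map entries            */

/* When tetrahedron t = {pk+1, p1, p2, p3} is created, record the packed
 * adjacency indices of its three internal facets, keyed by the oriented
 * edges of the cavity boundary (i1, i2, i3 are the auxiliary indices of
 * p1, p2, p3 stored in their padding variables). */
static void recordNewTet(uint64_t map[NMAX * NMAX], uint64_t t,
                         unsigned i1, unsigned i2, unsigned i3)
{
    map[NMAX * i2 + i3] = 4 * t + 1;   /* facet opposite p1: edge p2 p3 */
    map[NMAX * i3 + i1] = 4 * t + 2;   /* facet opposite p2: edge p3 p1 */
    map[NMAX * i1 + i2] = 4 * t + 3;   /* facet opposite p3: edge p1 p2 */
}

/* Once all new tetrahedra have been recorded, symmetric entries of the
 * table are mutual neighbors. */
static void matchNewAdjacencies(const uint64_t map[NMAX * NMAX],
                                uint64_t* neighbor_ID, unsigned n)
{
    for (unsigned i = 0; i < n; i++)
        for (unsigned j = i + 1; j < n; j++) {
            uint64_t adj_ij = map[NMAX * i + j];
            uint64_t adj_ji = map[NMAX * j + i];
            if (adj_ij != EMPTY && adj_ji != EMPTY) {
                neighbor_ID[adj_ij] = adj_ji;
                neighbor_ID[adj_ji] = adj_ij;
            }
        }
}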

Once all adjacencies of the new tetrahedra have been properly identified, the DelaunayBall function is then in charge of updating the mesh data structure. This ends the insertion of point pk+1. The space freed by deleted tetrahedra is reused; if needed, additional tetrahedra are added. Note that almost all steps of the adjacency recovery are vectorizable.

3.7 About a ghost


Figure 3.4: Triangle t0 surrounded by three ghost triangles t1, t2 and t3. Circumcircles are shown in dashed lines of the respective color.

Figure 3.5: Circumcircles of an edge ab and increasingly distant vertices. For a vertex G infinitely far away, the degenerate circle approaches a line.

The Bowyer–Watson algorithm for Delaunay triangulation assumes that all newly inserted vertices are inside an element of the triangulation at the previous step (Algorithm 1). To insert a point pk+1 outside the current support of the triangulation, one possible strategy is to enclose the input vertices in a sufficiently large bounding box, and to remove the tetrahedra lying outside the convex hull of the point set at the end of the algorithm. A more efficient strategy, adopted in TetGen[113], CGAL [24] and Geogram [73], is to work with so-called ghost tetrahedra connecting the exterior faces of the triangulation with a virtual ghost vertex G. Using this elegant concept, vertices can be inserted outside the current triangulation support with almost no algorithmic change.

The ghost vertex G is the vertex "at infinity" shared by all ghost tetrahedra (Figure 3.4). The ghost tetrahedra cover the whole space outside the regular triangulation. Like regular tetrahedra, ghost tetrahedra are stored in the mesh data structure, and are deleted whenever a vertex is inserted inside their circumsphere. The accurate definition of the circumsphere of a ghost tetrahedron, in particular with respect to the requirements of the Delaunay condition, is however a more delicate question.

From a geometrical point of view, as the ghost vertex G moves away towards infinity from an exterior facet abc of the triangulation, the circumscribed sphere of tetrahedron abcG tends to the half space on the positive side of the plane determined by the points abc, i.e., the set of points x such that orient3d(a, b, c, x) is positive. The question is whether this inequality should be strict or not, and the robust answer is neither one nor the other. This is illustrated in 2D in Figure 3.5. The circumcircle of a ghost triangle abG actually contains not only the open half-plane strictly above the line ab, but also the line segment [ab] itself; the other parts of the line ab being excluded. In 3D, the circumsphere of a ghost tetrahedron abcG contains the half-space L such that for any point d ∈ L, orient3d(a, b, c, d) is strictly positive, plus the disk defined by the circumcircle of the triangle abc. If the point d is in the plane abc, i.e. orient3d(a, b, c, d) = 0, we thus additionally test if d is in the circumsphere of the adjacent regular tetrahedron sharing the triangle abc. This composite definition of the circumscribed sphere makes the approach robust with minimal changes: only the inSphere predicate is modified in the implementation.

Because the 3 × 3 determinant of orient3d is used instead of the 4 × 4 determinant of inSphere for ghost tetrahedra, the approach with a ghost vertex is faster than other strategies [109]. Note that the Walk cannot evaluate orient3d on the faces of ghost tetrahedra that connect edges of the convex hull to the ghost vertex. If the starting tetrahedron is a ghost tetrahedron, the Walk directly steps into the non-ghost adjacent tetrahedron. Then, if it steps in another ghost tetrahedron abcG while walking toward pk+1, we have orient3d(a, b, c, pk+1) > 0 and the ghost tetrahedron is inside the cavity. The walk hence stops there, and the algorithm proceeds as usual.
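The composite test can be sketched as follows; orient3d() and inSphere() stand for the robust predicates of Section 3.5, adj[] is assumed to hold the four vertices, in positive orientation, of the regular tetrahedron adjacent to the exterior facet abc, and a positive inSphere value is assumed to mean "inside". Names are illustrative, not the actual implementation.

/* Is point d inside the "circumsphere" of the ghost tetrahedron abcG? */
static int ghostTetContains(const double* a, const double* b, const double* c,
                            const double* adj[4], const double* d)
{
    double o = orient3d(a, b, c, d);
    if (o > 0.0) return 1;   /* d strictly on the positive side of abc: inside */
    if (o < 0.0) return 0;   /* d strictly on the negative side: outside       */
    /* d lies exactly in the plane abc: fall back to the circumsphere of
     * the adjacent regular tetrahedron sharing the facet abc. */
    return inSphere(adj[0], adj[1], adj[2], adj[3], d) > 0.0;
}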

3.8 Serial implementation performances

Our reference implementation of the serial Delaunay triangulation algorithm has about one thousand lines (without Shewchuk's geometric predicates[108]). It is open-source and available at https://git.immc.ucl.ac.be/hextreme/hxt_seqdel. Overall, it is almost three times faster than other sequential implementations. Figures 3.6 and 3.7 show the performance gap between our implementation and concurrent software on a laptop with a maximum core frequency of 3.5 GHz, and on a slower computer with a maximum core frequency of 1.5 GHz and wider SIMD instructions. Table 3.1 indicates that the main difference in speed comes from the more efficient adjacency computation in the DelaunayBall function. Other software uses a more general function that can also handle cavities with multiple interior vertices. But this situation does not happen with Delaunay insertion, where there is always one unique point in the cavity, pk+1. Since our DelaunayBall function is optimized for Delaunay cavities, it is approximately three times faster, despite the additional computation of sub-determinants. The remaining performance gain is explained by our choice of simple but efficient data structures. In practice, the computational complexity of our implementation grows linearly with the number of tetrahedra. In Figure 3.8, we can see the performance of our program for a distribution of points that yields on the order of n² tetrahedra. Our implementation consistently produces between 2.5 and 6 million tetrahedra per second. Note that the triangulation with 20 000 points has 100 million tetrahedra, which occupies roughly half of the 16 GB of our laptop's RAM.

[Figure 3.6 plot: time in seconds versus number of points (random uniform distribution, 10^4 to 10^7) for Ours, Geogram 1.5.4, CGAL 4.12 and TetGen 1.5.1-beta1.]

# vertices   10^4    10^5   10^6   10^7
Ours         0.027   0.21   2.03   21.66
Geogram      0.060   0.51   5.53   56.02
CGAL         0.062   0.64   6.65   66.24
TetGen       0.054   0.56   5.89   63.99

Figure 3.6: Performances of our sequential Delaunay triangulation implementation (Algorithm 1) on a laptop's Intel® Core™ i7-6700HQ CPU, with a maximum core frequency of 3.5 GHz. Timings are in seconds and exclude the initial spatial sort.

[Figure 3.7 plot: time in seconds versus number of points (random uniform distribution, 10^4 to 10^7) for Ours, Geogram 1.5.4, CGAL 4.12 and TetGen 1.5.1-beta1.]

# vertices   10^4    10^5   10^6    10^7
Ours         0.134   0.98   9.36    97.97
Geogram      0.240   2.47   25.34   259.74
CGAL         0.265   2.81   28.36   286.54
TetGen       0.283   2.97   31.13   336.21

Figure 3.7: Performances of our sequential Delaunay triangulation implementation (Algorithm 1) on a slow Intel® Xeon Phi™ 7210 CPU, which has a maximum core frequency of 1.5 GHz. However, it has AVX-512 vectorized instructions. Timings are in seconds and exclude the initial spatial sort.

(a) n/2 points are placed on each of two lines. The overall set of n points has a Delaunay tetrahedralization with n²/4 − n + 1 tetrahedra.


(b) Number of tetrahedra and generation time versus the number of points on an Intel® Core™ i7-6700HQ CPU.

Figure 3.8: Performance of our program on a point set that generates on the order of n² tetrahedra. (a) The geometry of the point set. (b) The graph displaying the generation time and the number of tetrahedra in the Delaunay triangulation.

Chapter 4

Parallel Delaunay

This chapter, together with Chapter 3 and Chapter 5, is a slightly adapted transcription from our paper entitled "One machine, one minute, three billion tetrahedra"[81]. The results obtained in this chapter can be reproduced using the Delaunay_CLI executable, which can be compiled following the instructions in gmsh/contrib/hxt. A small addendum, not present in any paper, was added in Section 4.8 to explain the research directions that were explored but proved unconvincing.

This chapter details a scalable parallel version of the Delaunay triangulation algorithm devoid of heavy synchronization overheads. The domain is partitioned using the Hilbert curve and conflicts are detected with a simple coloring scheme. The performances and scalability are demonstrated on three different machines: a high-end four-core laptop, a 64-core Intel® Xeon Phi Knight's Landing, and a recent AMD® EPYC 64-core machine (Section 4.7). On the latter computer, we have been able to generate three billion tetrahedra in less than a minute (about 10^7 points per second).

4.1 Related work

To overcome memory and time limitations, Delaunay triangulations should be constructed in parallel, making the most of both distributed and shared memory architectures. A triangulation can be subdivided into multiple parts, each constructed and stored on a node of a distributed memory cluster. Multiple methods have been proposed to merge independently generated Delaunay triangulations[35, 32, 54, 22]. However, those methods feature complicated merge steps, which are often difficult to parallelize. To avoid merge operations, other distributed implementations maintain a single valid Delaunay triangulation and use synchronization

between processors whenever a conflict may occur at inter-processor boundaries[89, 34]. In finite-element mesh generation, merging two triangulations can be simpler because triangulations are not required to be fully Delaunay, allowing algorithms to focus primarily on load balancing[70].

On shared memory machines, divide-and-conquer approaches remain efficient, but other approaches have been proposed since communication costs between different threads are not prohibitive, contrary to distributed memory machines. To insert a point in a Delaunay triangulation, the kernel procedure operates on a cavity that is modified to accommodate the inserted point (Figure 3.1). Two points can therefore be inserted concurrently in a Delaunay triangulation if their respective cavities do not intersect, C(DTk, pk1) ∩ C(DTk, pk2) = ∅; otherwise there is a conflict. In practice, other types of conflicts and data races should possibly be taken into account depending on the chosen data structures and the insertion point strategy. Conflict management strategies relying heavily on locks¹ lead to relatively good speedups on small numbers of cores[69, 20, 16, 47]. Remacle et al. presented an interesting strategy that checks if insertions can be done in parallel by synchronizing threads with barriers[98]. However, synchronization overheads prevent those strategies from scaling to a high number of cores. More promising approaches rely on a partitioning strategy[76, 77]. Contrarily to pure divide-and-conquer strategies for distributed memory machines, partitions can change and move regularly, and they do not need to cover the whole mesh at once.

In this chapter, we propose to parallelize the Delaunay kernel using partitions based on a space-filling curve, similarly to Loseille et al.[77, 6]. The main difference is that we significantly modify the partitions at each iteration level. Our code is designed for shared memory architectures only; we leave its integration into a distributed implementation for future work.

4.2 A parallel strategy based on partitions

There are multiple conditions a program should ensure to avoid data races and conflicts between threads when concurrently inserting points in a Delaunay triangulation. Consider that thread t1 is inserting point pk1 and thread t2 is simultaneously inserting point pk2.

1. Thread t1 cannot access information about any tetrahedron in C(DT, pk2) and vice versa. Hence:

¹ A lock is a synchronization mechanism enforcing that multiple threads do not access a resource at the same time. When a thread cannot acquire a lock, it usually waits.

a) C(DT, pk1) ∩ C(DT, pk2) = ∅

b) A(DT, pk1) ∩ C(DT, pk2) = ∅ and A(DT, pk2) ∩ C(DT, pk1) = ∅

c) Thread t1 cannot walk into C(DT, pk2) and reciprocally, t2 cannot walk into C(DT, pk1)

2. A tetrahedron in B(DT, pk1) and a tetrahedron in B(DT, pk2) cannot be created at the same memory index.

To ensure rule (1), it is sufficient to restrict each thread to work on an independent partition of the mesh (Figure 4.1). This lock-free strategy minimizes synchronization between threads and is very efficient. Each point of the Delaunay triangulation is assigned a partition corresponding to a unique thread. A tetrahedron belongs to a thread if at least three of its vertices are in that thread's partition. To ensure (1a) and (1b), the insertion of a point belonging to a thread is aborted if the thread accessed a tetrahedron that belongs to another thread or that is in the buffer zone. To ensure (1c), we forbid threads to walk in tetrahedra belonging to another thread. Consequently, a thread aborts the insertion of a vertex when (i) the Walk reaches another partition, or when (ii) the Cavity function reaches a tetrahedron in the buffer zone. To insert these points, the vertices are re-partitioned differently (Section 4.3), a procedure repeated until there are no more vertices to insert or until the rate of successful insertions has become too low. In that case, the number of threads is decreased (Section 4.4).
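The ownership rule can be sketched as follows; partitionOf() (mapping a vertex to the partition/thread owning it) and BUFFER_ZONE are illustrative names, not the actual API.

#include <stdint.h>

/* A tetrahedron is owned by a thread if at least three of its vertices lie
 * in that thread's partition; otherwise it belongs to the buffer zone. */
static int tetrahedronOwner(const uint32_t vertex_ID[4])
{
    for (int i = 0; i < 4; i++) {
        int p = partitionOf(vertex_ID[i]);
        int count = 0;
        for (int j = 0; j < 4; j++)
            count += (partitionOf(vertex_ID[j]) == p);
        if (count >= 3)
            return p;            /* owned by partition/thread p */
    }
    return BUFFER_ZONE;          /* no partition owns 3 of the 4 vertices */
}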

[Figure 4.1 schematic: the mesh vertices and the points to insert of two threads t1 and t2, the conflicts between them, and the buffer zone.]

Figure 4.1: Vertices are partitioned such that each vertex belongs to a single thread. A triangle can only be modified by a thread that owns all of its three vertices. Triangles that cannot be modified by any thread form a buffer zone.

When the number of points to insert has become small enough, the insertion runs in sequential mode to insert all remaining vertices. The first BRIO round is also inserted sequentially to generate a first base mesh. Nevertheless, the vast majority of points are inserted in parallel.

Rule (2) is obvious from a parallel memory management point of view. It is however difficult to ensure it without requiring an unreasonable amount of memory. As explained in Section 4.6, synchronization between threads is occasionally required.

4.3 Partitioning and re-partitioning with Moore curve

We subdivide the nv points to insert such that each thread inserts the same number of points. Our partitioning method is based on the Moore curve, i.e. on the point insertion order implemented for the sequential Delaunay triangulation (Section 3.4). Space-filling curves are relatively fast to construct and have already been used successfully for partitioning meshes [40, 8, 6, 74, 77]. Each of the nthreads partitions is a set of grid cells that are consecutive along the Moore curve, as explained previously in Section 2.9. To compute the partitions, i.e. their starting and ending Moore indices, we sort the points to insert according to their Moore indices (see Section 3.4). Then, we assign the first nv/nthreads points to the first partition, the next nv/nthreads to the second, etc. (Figure 4.2a).

The second step is to partition the current Delaunay triangulation in which the points will be inserted. We use once more the Moore indices to assign the mesh vertices to the different partitions. The ghost vertex is assigned a random index. The partition owning a tetrahedron is determined from the partitions of its vertices: if a tetrahedron has at least three vertices in a partition, it belongs to this partition; otherwise the tetrahedron is in the buffer zone. Each thread then owns a subset of the points to insert and a subset of the current triangulation.

Once all threads have attempted to insert all their points, a large majority of vertices is generally inserted. To insert the vertices for which insertion failed because the point cavity spans multiple partitions (Figure 4.1), we significantly modify the partitions by modifying the Moore indices computation. We apply a coordinate transformation and a circular shift to move the zero index around the looping curve (Figure 4.2). Coordinates below a random threshold are linearly compressed, while coordinates above the threshold are linearly expanded.
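A minimal sketch of the first step, the derivation of each partition's Moore-index range from the sorted points, is given below. It is illustrative only (the names are hypothetical and the circular shift of the re-partitioning step is ignored): partition k simply receives the k-th block of nv/nthreads sorted points, and its range is delimited by the Moore indices at the block boundaries.

#include <stdint.h>

static void computePartitionRanges(const uint64_t* sorted_moore, uint64_t nv,
                                   int nthreads, uint64_t* start, uint64_t* end)
{
    for (int k = 0; k < nthreads; k++) {
        uint64_t first = (uint64_t) k      * nv / nthreads;  /* first point of partition k   */
        uint64_t last  = (uint64_t)(k + 1) * nv / nthreads;  /* first point of partition k+1 */
        start[k] = (k == 0)            ? 0          : sorted_moore[first];
        end[k]   = (k == nthreads - 1) ? UINT64_MAX : sorted_moore[last];
    }
}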


Figure 4.2: Partitioning of 20 points in 2D using the Moore indices; on the right, the supporting grid of the Moore curve is transformed and the curve is shifted. In both cases, each partition contains 5 points. Indeed, the starting and ending Moore index of each partition are defined in a way that balances the point insertions between threads.

4.4 Ensuring termination

When the number of points of the current Delaunay triangulation is small, typically at the first steps of the algorithm, the probability that the points of a cavity belong to different partitions is very high. Hence, none or very few point insertions may succeed. To avoid wasting precious milliseconds, the first BRIO round is always inserted sequentially.

Moreover, there is no guarantee that re-partitioning will be sufficient to insert all the points. And even when a large part of the triangulation has already been constructed, parallel insertion may enter an infinite failure loop. After a few rounds, the remaining points to insert are typically located in restricted volumes (intersections of previous buffer zones). Because re-partitioning is done on the not-yet-inserted points, the resulting partitions are also small and thin. This leads to inevitable conflicts as partitions get smaller than cavities. In practice, we observed that the ratio ρ of successful insertions decreases for a constant number of threads, supporting our theory. Typically, if 80 out of 100 vertices are successfully inserted in a mesh (ρk = 0.8), less than 16 of the 20 remaining vertices will be inserted at the next attempt (ρk+1 < 0.8). Note that this effect is more pronounced for small meshes, because the bigger the mesh, the relatively smaller the buffer zone and the higher the insertion success rate. This difference explains the growth of the number of tetrahedra created per second with the number of points in the triangulation (Figure 4.4).

To avoid losing a significant amount of time in re-partitioning and trying to insert the same vertex multiple times, we gradually decrease the number of threads. Choosing the adequate number of threads is a question of balance between the potential parallelization gain and the partitioning cost that comes with each insertion attempt.

[Figure 4.3 plot: time in seconds versus number of threads (1 to 64) for the Intel i7-6700HQ, the Intel Xeon Phi 7210 and the AMD EPYC 7551, with a perfect-scaling reference line.]

Figure 4.3: Strong scaling of our parallel Delaunay for a random uniform distribution of 15 million points, resulting in over 100 million tetrahedra on 3 machines: a quad-core laptop, an Intel Xeon Phi with 64 cores and a dual-socket AMD EPYC 2 × 32 cores.

[Figure 4.4 plot: million tetrahedra created per second versus number of points (random uniform distribution, 10^4 to 10^9) for the Intel i7-6700HQ, the Intel Xeon Phi 7210 and the AMD EPYC 7551.]

Figure 4.4: Number of tetrahedra created per second by our parallel implementation for different numbers of points. Tetrahedra are created more quickly when there are a lot of points because the proportion of conflicts is lower. An average rate of 65 million tetrahedra created per second is obtained on the EPYC.

When decreasing the number of threads, we decrease the number of attempts needed. When the ratio of successful insertions is too low, ρ < 1/5, or when the number of points to insert per thread is under 3000, we divide the number of threads by two. Furthermore, if ρk ⩽ 1/nthreads, the next insertion attempt will not benefit from multi-threading and we insert the points sequentially.

Table 4.1 gives the number of threads used at each step of the Delaunay triangulation of a million vertices, depending on the number of points to insert, the size of the current mesh, and the insertion success ratio ρ. Note that the computation of Moore indices and the sorting of vertices to insert are always computed on the maximal number of threads available (8 threads in this case), even when the insertion runs sequentially.
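A minimal sketch of this heuristic, with the thresholds quoted above and an illustrative function name:

#include <stdint.h>

static int nextThreadCount(int nthreads, double rho, uint64_t points_left)
{
    if (rho <= 1.0 / nthreads)        /* parallel insertion no longer helps   */
        return 1;                     /* insert the remaining points serially */
    if (rho < 0.2 || points_left / nthreads < 3000)
        return nthreads > 1 ? nthreads / 2 : 1;   /* halve the thread count   */
    return nthreads;
}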

4.5 Data structures

The data structure for our parallel Delaunay triangulation algorithm is similar to the one used by our sequential implementation (Section 3.3). There are two small differences. First, the parallel implementation does not compute sub-determinants for each tetrahedron. Actually, the bandwidth usage with the parallel insertions is already near its maximum, and loading and storing sub-determinants becomes detrimental to the overall performance. Instead, we store a 16-bit color flag² to mark deleted tetrahedra. Second, each thread has its own cavity_t structure (Listing 3.2), to which are added two integers identifying the starting and ending Moore indices of the thread's partition.
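Purely as an illustration of that second point (this is not the actual declaration), the per-thread state could look like:

#include <stdint.h>

typedef struct {
    cavity_t cavity;        /* the per-thread cavity workspace of Listing 3.2    */
    uint64_t moore_start;   /* first Moore index owned by this thread            */
    uint64_t moore_end;     /* last Moore index (exclusive) owned by this thread */
} thread_state_t;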

Memory footprint Because we do not store four sub-determinants per tetrahedron anymore but only a 2-byte color, our mesh data structure is lighter. Still assuming that there are approximately 6n tetrahedra for n vertices, it requires a little more than 6 × 50 + 32 = 332 bytes per vertex. Thanks to this low memory footprint, we were able to compute the tetrahedralization of N = 10^9 vertices (about 6 billion tetrahedra) on an AMD EPYC machine that has 512 GB of RAM. The experimental memory usage shown in Table 4.2 differs slightly from the theoretical formula because (i) more memory is allocated than what is used and (ii) measurements represent the maximum memory used by the whole program, including the stack, the text and data segments, etc.

² An 8-bit char would also work (not fewer bits, or it would create data races) but the color flag is also used to distinguish volumes in our mesh generator described in Chapter 5.

                 #points
               to insert   inserted       ρ   #threads   #mesh vertices
Initial mesh                                                          4
BRIO Round 1       2 044      2 044    100%          1            2 048
BRIO Round 2      12 288      6 988     57%          4            9 036
                   5 300      3 544     67%          2           12 580
                   1 756      1 756    100%          1           14 336
BRIO Round 3      86 016     59 907     70%          8           74 243
                  26 109     11 738     45%          8           85 981
                  14 371      7 092     49%          4           93 073
                   7 279      5 332     73%          2           98 405
                   1 947      1 947    100%          1          100 352
BRIO Round 4     602 112    503 730     84%          8          604 082
                  98 382     44 959     46%          8          649 041
                  53 423     31 702     59%          8          680 743
                  21 721      7 903     36%          8          688 646
                  13 818      9 400     68%          4          698 046
                   4 418      3 641     82%          2          701 687
                     777        777    100%          1          702 464
BRIO Round 5     297 536    271 511     91%          8          973 975
                  26 025     16 426     63%          8          990 401
                   9 599      8 092     84%          4          998 493
                   1 507      1 507    100%          1        1 000 000

Table 4.1: Number of threads used to insert points in our parallel Delaunay triangulation implementation according to the number of points to insert, the mesh size and the insertion success at the previous step. 94.5% of the points are inserted using 8 threads and 5% using 4 threads.

4.6 Critical operations

When creating new tetrahedra in the Delaunay ball of a point, a thread first recycles unused memory space by replacing deleted tetrahedra. The indices of deleted tetrahedra are stored in the cavity->deleted.tetrahedra_ID array (see Listing 3.2) of each thread. When the cavity->deleted.tetrahedra_ID array of a thread is empty, additional memory should be reserved by this thread to create new tetrahedra.

This operation is a critical part of the program requiring synchronization between threads to respect rule (2). We need to capture the current number of tetrahedra and increment it by the requested number of new tetrahedra in one single atomic operation.

# vertices   10^4      10^5      10^6       10^7
Ours         6.9 MB    43.8 MB   404.8 MB   3.8 GB
Geogram      6.7 MB    30.5 MB   268.6 MB   2.7 GB
CGAL         14.1 MB   66.7 MB   578.8 MB   5.7 GB

Table 4.2: Comparison of the maximum memory usage of our parallel implementation, CGAL[65] and Geogram[73] when using 8 threads.

OpenMP provides the adequate mechanism, see Listing 4.1. To reduce the number of times this operation is done, the number of tetrahedra is incremented atomically by at least 8192, and the cavity->deleted.tetrahedra_ID array is filled with the new indices of tetrahedra. Those tetrahedra are conceptually deleted although they have never been in the mesh. The number 8192 was found experimentally among multiples of 512.³ Increasing it would reduce the number of times the reservation of new tetrahedra is performed, but it is not necessary. Indeed, since the critical operation then occurs on average every 1000+ insertions, the time wasted is very low. Therefore, increasing the default number of deleted tetrahedra would only waste memory space for little to no gain.

When the space used by tetrahedra exceeds the initially allocated capacity, a reallocation is performed that doubles the capacity of the arrays of mesh tetrahedra. During that operation, memory is potentially moved to another location. No other operation can be performed on the mesh at that time. Therefore, the reallocation code is placed in between two OpenMP barriers (Listing 4.2). This synchronization event is very rare and does not impact performance. In general, a good estimate of the memory needed to reserve is possible and that critical section is never reached.

4.7 Parallel implementation performances

We are able to compute the Delaunay triangulation of over one billion tetrahedra in record-breaking time: 41.6 seconds on the Intel Xeon Phi and 17.8 seconds on the AMD EPYC. These timings do not include I/Os. As for the title of this article, we are able to generate three billion tetrahedra in 53 seconds on the EPYC.

The scaling of our implementation with respect to the number of threads is detailed in Figure 4.3. We obtain a good scaling until the number of threads reaches the number of cores, i.e. 4 cores for the Intel i7-6700HQ, and 64 cores for the Intel Xeon Phi 7210 and the AMD EPYC 7551.

³ It is common to choose a multiple of the page size (usually 4096 bytes) to minimize TLB misses.

if(cavity->to_create.num > cavity->deleted.num) {
  uint64_t nTetNeeded = MAX(8192, cavity->to_create.num) - cavity->deleted.num;

  uint64_t nTet;
  #pragma omp atomic capture
  {
    nTet = mesh->tetrahedra.num;
    mesh->tetrahedra.num += nTetNeeded;
  }

  reallocTetrahedraIfNeeded(mesh);
  reallocDeletedIfNeeded(state, cavity->deleted.num + nTetNeeded);

  for (uint64_t i=0; i<nTetNeeded; i++) {
    cavity->deleted.tetrahedra_ID[cavity->deleted.num+i] = 4*(nTet+i);
    mesh->tetrahedra.color[nTet+i] = DELETED_COLOR;
  }

  cavity->deleted.num += nTetNeeded;
}

Listing 4.1: When there are fewer deleted tetrahedra than there are tetrahedra in the Delaunay ball, 8192 new "deleted tetrahedra" indices are reserved by the thread. As the mesh data structure is shared by all threads, mesh->tetrahedra.num must be increased in one single atomic operation.

void reallocTetrahedraIfNeeded(mesh_t* mesh)
{
  if(mesh->tetrahedra.num > mesh->tetrahedra.allocated_num) {
    #pragma omp barrier

    // all threads are blocked except the one doing the reallocation
    #pragma omp single
    {
      uint64_t nTet = mesh->tetrahedra.num;
      alignedRealloc(&mesh->tetrahedra.neighbor_ID, nTet*8*sizeof(uint64_t));
      alignedRealloc(&mesh->tetrahedra.vertex_ID, nTet*8*sizeof(uint32_t));
      alignedRealloc(&mesh->tetrahedra.color, nTet*2*sizeof(uint16_t));
      mesh->tetrahedra.allocated_num = 2*nTet;
    } // implicit OpenMP barrier here
  }
}

Listing 4.2: Memory allocation for new tetrahedra is synchronized with OpenMP barriers.

To our best knowledge, CGAL[65] and Geogram[73] are the two fastest open-source CPU implementations available for 3D Delaunay triangulation. We compare their performances to ours on a laptop (Figure 4.6) and on the Intel Xeon Phi (Figure 4.7).

[Plot: time in seconds versus number of points (random uniform distribution, 10⁴ to 10⁷) for Ours (HXT), Geogram 1.5.4 and CGAL 4.12.]

# vertices   10⁴     10⁵     10⁶     10⁷
Ours         0.032   0.13    0.85    7.40
Geogram      0.041   0.19    1.73    17.11
CGAL         0.037   0.24    2.20    23.37

Figure 4.5: 4-core Intel® Core™ i7-6700HQ CPU.

Figure 4.6: Comparison of our parallel implementation with the parallel implementation in CGAL[65] and Geogram[73] on the 4-core Intel® Core™ i7-6700HQ CPU of a high-end laptop. All timings are in seconds.

[Plot: time in seconds versus number of points (random uniform distribution, 10⁴ to 10⁸) for Ours (HXT), Geogram 1.5.4 and CGAL 4.12.]

# vertices   10⁴    10⁵    10⁶    10⁷     10⁸
Ours         0.11   0.43   1.17   4.48    28.95
Geogram      0.10   0.54   4.58   43.70   /
CGAL         0.27   0.48   2.44   20.15   /

Figure 4.7: Comparison of our parallel implementation with the parallel implementation in CGAL[65] and Geogram[73] on the 64-core Intel® Xeon Phi™ 7210 CPU of a many-core computer. All timings are in seconds.

4.8 Addendum: research journey

The following is a short description of all the things that were attempted and that either ended up being dead ends, or evolved to lead to the current version of our parallel Delaunay kernel. Some say that the journey is more important than the destination. Unfortunately, it is always difficult to convince reviewers of the legitimacy of "failed" research. This manuscript gives us a good opportunity to reveal what did not work, what worked less well, and the whole thinking process, which is 90% of a thesis.

A two-level Delaunay kernel

Four years ago, in 2016, we tried to optimize the Delaunay kernel of Remacle[98], searching for ways to reduce synchronization overheads to a minimum. Basically, his two-level Delaunay kernel works in a synchronized manner:

1. Each thread takes n consecutive points from a list of points sorted along the Hilbert curve. These n points correspond to a section of the Hilbert curve. Sections on which different threads are working should be as far apart as possible.

2. Each thread collects information about the n associated cavities, by storing the results of multiple calls to the Cavity function in n different structures.

3. Threads are waiting for each other (barrier).

4. If a cavity is not overlapped by any cavity of another thread (no conflict), a thread computes the DelaunayBall function, modifying the cavity such that it contains the new point. If, on the other hand, there is a conflict, the insertion of the point is delayed.

5. Threads are again waiting for each other (barrier).

The clever idea of this method is that computing multiple cavities at once effectively reduces the number of times threads have to wait for each other. Indeed, even when threads have exactly the same amount of work to do, a barrier intrinsically has a constant overhead. By inserting, for example, 100 points at once, the impact of this constant overhead is drastically reduced. However, the method has big load-balancing issues: some threads may have a lot more work to do than others. There are sections of the space where cavities tend to be bigger, or where they tend to create more conflicts, or where they require extended computation for the evaluation of the inSphere predicate. Sometimes the Walk takes more time, sometimes the DelaunayBall falls back to a slower method. All these unpredictable computations, or lack of computations, make it impossible to correctly balance the work between threads ahead of time.

A major improvement to this method consists in assigning a priority p_i to each thread t_i, such that if a cavity of thread t_a overlaps a cavity of a thread t_b and p_a > p_b, then thread t_a can call the DelaunayBall function as if there was no conflict. Still, even with a perfect implementation, the constant overhead of the barrier and the load-balancing issues prevent this method from scaling well, even on very large meshes. We extracted plenty of statistics to see how these issues could be improved but found no real solution. Based on the very good Hilbert curve implementation that we created, we decided to switch to a method relying on SFC partitioning.

SFC partitioning

The first versions of our Delaunay kernel had no repartitioning strategy. In our very first implementation, points were simply inserted in parallel over and over again, with the very same underlying Hilbert curve. When a good proportion of points has been inserted, the next attempts at inserting points whose insertion failed may still have a good success rate, even without modification of the Hilbert indices. Partitions indeed vary depending on the list of points to insert. If points are removed from that list because they have been inserted, then the subsequent partitions will change. However, when very few points are inserted, the changes to the subsequent partitions are negligible and the next insertion attempts are guaranteed to fail. The simplest solution to prevent this sort of deadlock is to switch to sequential insertion when the success rate is too low. This straightforward tactic already gives very good results, better than what we got with the two-level approach of Remacle, especially for big meshes.

Afterwards, our strategy was improved by halving the number of threads when ρ, the ratio of successful insertions, was too low. The idea is simply to apply the concept of parallel reduction to our meshing strategy. To cut a long story short, halving the number of threads is equivalent to merging pairs of partitions together, as done in the second half of most divide and conquer approaches. In addition, we switched from a Hilbert curve to a Moore curve, and a random circular shift was applied to the curve indices before building the partitions. On its favorite input, a big uniform point cloud, this version of our program was already meshing more than 6 million tetrahedra per second on our 4-core laptop.

However, even if the circular shift was effectively making partitions turn around the center of the bounding box, following the Moore curve, applying the circular shift was not making much of a difference to the success ratio ρ. We soon found out that the zones where conflicts happen, the buffer zones, are mostly located on the three main cutting planes of the Moore curve (lines in 2D, as in Figure 4.8a), which intersect at the center of the bounding box. These are the three planes that divide the Moore curve into 8 sub-curves during the first level of construction of the curve (see Chapter 2). If those planes stay still, the conflicts remain. At that point, the program was already going so fast that re-computing the Moore curve differently seemed like a big waste of time. Well, nothing ventured, nothing gained, so we tried it anyway. Between each insertion attempt loop, the bounding box used to create the uniform grid on which the Moore curve relies was randomly enlarged, as shown in Figure 4.8. Recomputing the whole Moore curve with this bounding box resulted in noticeably better success ratios and, hence, a slight performance increase.


Figure 4.8: (a) partitioning with a standard uniform grid that fits the bounding box; (b) repartitioning with a random enlargement of the bounding box applied beforehand.

The problem with such an enlargement of the bounding box is that it increases the chance of having discontinuous partitions. The curve may wander out of the actual bounding box before coming back in from another side. Therefore, we changed to a simple transformation of the underlying grid, without modifying the bounding box. The underlying grid is transformed as explained in Section 2.4. The impact of this small change was actually greater than expected. The performance of our program was once more increased. Now, it is able to reach a top speed of nearly 10 million tetrahedra per second on our 4-core laptop.

Chapter 5

Mesh refinement

This chapter, together with the two previous chapters, Chapter 3 and Chapter 4, forms a reproduction of our paper entitled “One machine, one minute, three billion tetrahedra”[81]. The integration of our parallel Delaunay kernel into a basic mesh generation framework is explained.

The parallel Delaunay triangulation algorithm that we presented in the previous chapter was integrated in a preliminary Delaunay-based mesh generator. A tetrahedral mesh generator is a procedure that takes as input the boundary of a domain to mesh, defined by a set of triangles t delimiting a closed volume, and that returns a finite element mesh, i.e. a set of tetrahedra T of controlled sizes and shapes whose boundary is equal to the input triangulation: ∂T = t. From a coarse mesh that is conformal to t, a mesher progressively inserts vertices until element sizes and shapes follow the prescribed ranges. Generating a mesh is an intrinsically more difficult problem than constructing the Delaunay triangulation of given points because (1) the points are not given, (2) the boundary triangles must be facets of the generated tetrahedra, and (3) the tetrahedra shape and size should be optimized to maximize the mesh quality. Our objective in this chapter is to demonstrate that our parallel Delaunay point insertion may be integrated in a mesh generator. The interested reader is referred to the book by Frey and George [51] for a complete review of finite-element mesh generation.

The mesh generation algorithm proposed in this chapter (Algorithm 2) follows the approach implemented for example in Tetgen[113]. First, all the vertices of the boundary triangulation t are tetrahedralized to form the initial “empty mesh” T0. Then, T0 is transformed into a conformal mesh T whose boundary ∂T is equal to t: ∂T = t. The triangulation T is then progressively refined by (i) creating vertices S at the circumcenters of tetrahedra for which the circumsphere


radius, r_τ, is significantly larger than the desired mesh size, h, i.e. r_τ/h > 1.4 [109], and (ii) inserting the vertices in the mesh using our parallel Delaunay algorithm. The sets of points S to insert are filtered a priori in order not to generate short edges, i.e. edges of size smaller than 0.7h. We first use Hilbert coordinates to discard very close points on the curve, and implemented a slightly modified cavity algorithm that aborts if a point of the cavity is too close to the point to insert (Section 3.5). Points are thus discarded both in the filtering process and in the insertion process. Note that, contrary to existing implementations, we insert point sets S that are as large as possible during the refinement step to take advantage of the efficiency of our parallel point insertion algorithm (Chapter 4).

We parallelized all steps of the mesh generator (Algorithm 2) except the boundary recovery procedure, which is the one of Gmsh [57] and is essentially based on Tetgen [113]. One issue when parallelizing mesh generation is to ensure reproducibility. It is usually admitted that mesh generation should be deterministic, and thus reproducible for a given number of threads on a given machine. Our mesh generator has been made reproducible in that sense, with a loss of 20% in overall performance. This feature has been implemented as an option called the reproducible mode. The reproducible mode basically reorders tetrahedra in the lexicographic order of their nodes before every refinement iteration. The order of tetrahedra before the filtering process is therefore deterministic, which also makes the filtering process deterministic for a given number of threads. Note that if our program is called multiple times with varying numbers of threads as parameter, the output will also vary. Making the code reproducible independently of the number of threads would dramatically harm its performance.

The cavity algorithm has also been modified to accommodate possible constraints on the faces and edges of tetrahedra. With this modification, mesh refinement never breaks the surface mesh, and constrained edges and faces can be included in the mesh interior. As a result, the mesh may not satisfy the Delaunay property anymore. Therefore, we must always ensure that cavities are star-shaped. This is achieved by a simple procedure that checks if boundary facets of the cavity are oriented such that they form a positive volume with the new point. If a boundary facet is not oriented correctly, indicating that the cavity is not star-shaped, the tetrahedron containing it is removed from the cavity and the procedure is repeated. In practice, these modifications do not affect the speed of point insertions significantly.

To generate high-quality meshes, an optimization step should be added to the algorithm to improve the general quality of the mesh elements and remove sliver tetrahedra. Mesh improvements are out of the scope of this paper. However, first experiments have shown that the optimization step would slow down the mesh generation by a factor of two. For example, mesh refinement and improvement take approximately the same time in Gmsh[57].

Algorithm 2 Mesh generation algorithm
Input: A set of triangles t
Output: A tetrahedral mesh T
1: function ParallelMesher(t)
2:   T0 ← EmptyMesh(t)
3:   T ← RecoverBoundary(T0)
4:   while T contains large tetrahedra do
5:     S ← SamplePoints(T)
6:     S ← FilterPoints(S)
7:     T ← InsertPoints(T, S)
8:   return T
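To make the SamplePoints and FilterPoints steps of Algorithm 2 concrete, the following is a minimal C sketch of the two criteria described above: a tetrahedron is refined when its circumradius exceeds 1.4 times the local mesh size h, and candidate points that would create edges shorter than 0.7h are discarded. The data types and helper functions (circumradius, mesh_size_at) are hypothetical stand-ins, not HXT's actual API; in particular, the filter below simply uses the Euclidean distance between consecutive candidates sorted along the space-filling curve as a proxy, while the real code also aborts insertions whose cavity contains a too-close vertex.

#include <math.h>
#include <stddef.h>

typedef struct { double x, y, z; } point_t;

/* hypothetical helpers, assumed to be provided elsewhere */
double circumradius(const point_t v[4], point_t *center); /* radius and circumcenter */
double mesh_size_at(point_t p);                           /* prescribed size field h */

static double dist(point_t a, point_t b) {
  double dx = a.x - b.x, dy = a.y - b.y, dz = a.z - b.z;
  return sqrt(dx*dx + dy*dy + dz*dz);
}

/* SamplePoints: one candidate per tetrahedron whose circumradius exceeds 1.4 h */
size_t sample_points(const point_t (*tets)[4], size_t ntets,
                     point_t *out, size_t max_out)
{
  size_t n = 0;
  for (size_t i = 0; i < ntets && n < max_out; i++) {
    point_t c;
    double r = circumradius(tets[i], &c);
    if (r > 1.4 * mesh_size_at(c))
      out[n++] = c;
  }
  return n;
}

/* FilterPoints: candidates are assumed sorted along the space-filling curve;
 * a candidate closer than 0.7 h to the previously kept one is discarded. */
size_t filter_points(point_t *pts, size_t n)
{
  size_t kept = 0;
  for (size_t i = 0; i < n; i++) {
    if (kept == 0 || dist(pts[kept - 1], pts[i]) > 0.7 * mesh_size_at(pts[i]))
      pts[kept++] = pts[i];
  }
  return kept;
}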

5.1 Small and medium size test cases on standard laptops

In order to verify the scalability of the whole meshing process, meshes of up to one hundred million tetrahedra were computed on a 4-core 3.5 GHz Intel Core i7-6700HQ with 1, 2, 4 and 8 threads. Those meshes easily fit within the 8 GB of RAM of this modern laptop.

Three benchmarks are considered in this section: (i) a cube filled with cylindrical fibers of random radii and lengths that are randomly oriented, (ii) a mechanical part and (iii) a truck tire. Surface meshes are computed with Gmsh[57]. Mesh size on the surfaces is controlled by surface curvatures, and mesh size inside the domain is simply interpolated from the surface mesh.

Illustrations of the meshes, as well as timing statistics, are presented in Figure 5.1. Our mesher is able to generate between 40 and 100 million tetrahedra per minute. Using multiple threads allows some speedup: the mesh refinement process is accelerated by a factor ranging between 2 and 3 on this 4-core machine. The last test case (truck tire) is defined by more than 7000 CAD surfaces. Recovering the 27 892 triangular facets missing from T0 takes more than a third of the total meshing time with the maximal number of threads. Parallelizing the boundary recovery process is clearly a priority of our future developments. On this same example, the surface mesh was done with Gmsh using four threads. The surface mesher of Gmsh is not very fast and it took about the same time to generate the surface mesh of 6 881 921 triangles as to generate the volume mesh that contains over one hundred million tetrahedra using the same number of threads. The overall meshing time for the truck tire test case is thus about 6 minutes.

5.2 Large mesh generation on a many-core machine

We further generated meshes containing over 300,000,000 elements on an AMD® EPYC 64-core machine. Three benchmarks are considered: (i) two cubes filled with many randomly oriented cylindrical fibers of random radii and lengths, and (ii) the exterior of an aircraft. Surface meshes were also generated by Gmsh. Our strategy reaches its maximum efficiency for large meshes. In the 500 thin fibers test case, over 700,000,000 tetrahedra were generated in 135 seconds. This represents a rate of 5.2 million tetrahedra per second. In the 100 thin fibers test case, the boundary recovery cost was lower and a rate of 6.2 million tetrahedra per second was reached.

100 fibers (timings in seconds)
# threads   # tetrahedra   BR     Refine   Total
1           12 608 242     0.74   19.6     20.8
2           12 600 859     0.72   13.6     14.6
4           12 567 576     0.72   8.7      9.8
8           12 586 972     0.71   7.6      8.7

300 fibers (timings in seconds)
# threads   # tetrahedra   BR     Refine   Total
1           52 796 891     6.03   92.4     101.3
2           52 635 891     5.76   61.2     69.0
4           52 768 565     5.71   39.4     46.8
8           52 672 898     5.67   32.5     39.8

Mechanical part (Knuckle) (timings in seconds)
# threads   # tetrahedra   BR    Refine   Total
1           24 275 207     8.6   43.6     56.3
2           24 290 299     8.4   30.4     41.8
4           24 236 112     8.1   24.6     35.3
8           24 230 468     8.1   21.8     32.6

Truck tire (timings in seconds)
# threads   # tetrahedra   BR     Refine   Total
1           123 640 429    75.9   259.7    364.7
2           123 593 913    74.5   166.8    267.1
4           123 625 696    74.2   107.4    203.6
8           123 452 318    74.2   95.5     190.0

Figure 5.1: Performance of our parallel mesh generator on an Intel® Core™ i7-6700HQ 4-core CPU. Wall clock times are given for the whole meshing process for 1 to 8 threads. The total timings include IOs (sequential), initial mesh generation (parallel), as well as sequential boundary recovery (BR) and parallel mesh refinement, for which detailed timings are given.

100 thin fibers (timings in seconds)
# threads   # tetrahedra    BR    Refine   Total
1           325 611 841     3.1   492.1    497.2
2           325 786 170     2.9   329.7    334.3
4           325 691 796     2.8   229.5    233.9
8           325 211 989     2.7   154.6    158.7
16          324 897 471     2.8   96.8     100.9
32          325 221 244     2.7   71.7     75.8
64          324 701 883     2.8   55.8     60.1
127         324 190 447     2.9   47.6     52.0

500 thin fibers (timings in seconds)
# threads   # tetrahedra    BR     Refine   Total
1           723 208 595     18.9   1205.8   1234.4
2           723 098 577     16.0   780.3    804.8
4           722 664 991     86.6   567.1    659.8
8           722 329 174     15.8   349.1    370.1
16          723 093 143     15.6   216.2    236.5
32          722 013 476     15.6   149.7    169.8
64          721 572 235     15.9   119.7    140.4
127         721 591 846     15.9   114.2    135.2

Aircraft (timings in seconds)
# threads   # tetrahedra    BR     Refine   Total
1           672 209 630     45.2   1348.5   1418.3
2           671 432 038     42.1   1148.9   1211.5
8           665 826 109     39.6   714.8    774.8
64          664 587 093     38.7   322.3    380.9
127         663 921 974     38.1   255.0    313.3

Figure 5.2: Performance of our parallel mesh generator on an AMD® EPYC 64-core machine. Wall clock times are given for the whole meshing process for 1 to 127 threads. The total timings include IOs (sequential), initial mesh generation (parallel), as well as sequential boundary recovery (BR) and parallel mesh refinement, for which detailed timings are given.

Chapter 6

Mesh Improvement

This chapter and Chapter 7 are based on concepts already developed in our two papers:

• Reviving the Search for Optimal Tetrahedralizations[83]

• Quality tetrahedral mesh generation with HXT[82].

Tetrahedral meshes are the geometrical support for most finite element discretizations. The size and the shape of the generated tetrahedral elements must however be controlled cautiously to ensure reliable numerical simulations in industrial applications. The majority of popular tetrahedral mesh generators are based on a Delaunay kernel, because Delaunay-based algorithms are fast, especially in 3D. They are also robust and consume relatively little memory. Yet, pure Delaunay meshes are known to contain near-zero volume elements, called slivers, and a mesh improvement stage is mandatory if one wishes to end up with a high-quality computational mesh.

We proposed, in Chapter 3 and Chapter 4, techniques to compute a Delaunay triangulation of three billion tetrahedra in less than one minute using multiple threads. We also explained in Chapter 5 how to extend that algorithm to obtain an efficient parallel tetrahedral mesh generator. This mesh generator was however incomplete in the sense that it did not provide any mesh improvement process. The present chapter proposes a mesh improvement strategy that is both highly parallel and efficient, and generates high quality meshes. Starting with an existing surface mesh, tetrahedral mesh generation is usually broken down into four steps.


1. Empty Mesh: A triangulation (tetrahedralization) of the volume, based on all points of the surface mesh and with no additional interior nodes (hence the name empty mesh), except for some isolated user-specified points in the interior of the volume, is first generated.

2. Boundary Recovery: The recovery step locally modifies the empty mesh to match the triangular facets of the surface mesh as well as some possible embedded user-specified constrained triangles and lines.

3. Refinement: The empty mesh is iteratively refined by adding points in the interior of the volumes while always preserving a valid mesh at each iteration. The refinement step ends when all tetrahedra are smaller in size than a user-prescribed size map.

4. Improvement: Whereas the refinement step controls the size of the gen- erated tetrahedra, the improvement step finalizes the mesh by locally eliminating badly shaped tetrahedra, and optimizes the quality of the mesh by means of specific topological operations and vertex relocations.

Step 1 is simply done using a Delaunay tetrahedralization (see Chapters 3 and 4) of the points of the surface mesh. Step 3 was described in Chapter 5. Step 2 is currently done using TetGen’s implementation[113]. Finally, Step 4, which is the central topic of this chapter, relies on a novel mesh improvement strategy based on an innovative local mesh modification operator called the Growing SPR Cavity.

This chapter is structured as follows. Section 6.1 contains a concise review of existing mesh improvement operations. Whenever our implementation differs from others, the differences are briefly explained. Section 6.2 details a very complex mesh improvement operation called the Small Polyhedron Reconnection (SPR), which basically finds the optimal tetrahedralization of a cavity. We propose several improvements to accelerate this slow but effective operation. Section 6.3 is devoted to the presentation of our new Growing SPR Cavity operation, which cleverly integrates the SPR operation within a mesh improvement process. It is certainly the main contribution of this chapter. In Section 6.4, we detail our mesh improvement schedule, its parallelization and the quality improvement that can be obtained. Finally, in Chapter 7, our complete tetrahedral mesh generator (HXT) will be compared against Gmsh and TetGen, two other reference open-source tetrahedral mesh generation software packages.

6.1 Mesh improvement operations

Mesh improvement operations are modifications of the mesh aiming at increasing its overall quality, the latter being essentially determined by the quality of its worst element[111]. Therefore, when we speak of the quality of a mesh or of a cavity, we refer in practice to the quality of its worst tetrahedron. The value of a quality measure is expected to be inversely proportional to the interpolation error associated with the tetrahedron in some discretization scheme. If the tetrahedron is not valid, i.e., if it is inverted or flat, the quality measure should be non-positive. Two examples of quality measures, the Gamma and SICN measures, are described in Sections 7.1.3 and 7.1.4.

A cavity is a volume corresponding to a set of face-connected tetrahedra. A mesh modification operates on a cavity, which is by definition the part of the mesh that is modified. Among all possible mesh modifications, mesh improvement operations are those that strictly increase the quality of a cavity, from an original tetrahedralization of quality qa to an improved tetrahedralization of quality qb > qa. Any mesh modification can be decomposed into a succession of elementary moves, called bistellar flips or Pachner moves[90]. The 1-4 move adds a point inside a cavity formed by the removal of a single tetrahedron, and fills it with 4 new tetrahedra. The 4-1 move is the opposite: it removes a point. Both operations are illustrated in Figure 6.1a. The two remaining Pachner moves are the 2-3 and 3-2 flips, which neither add nor remove any points in the tetrahedralization (see Figure 6.1b).


(a) The 1-4 move (point insertion) and 4-1 move (point removal) (b) The 2-3 and 3-2 flips

Figure 6.1: Pachner moves

Flipping

The purpose of the mesh refinement step is to obtain a distribution of points all over the volume whose distance to the closest neighbour is prescribed by a given mesh size field. Adding or removing vertices during the subsequent mesh improvement phase is therefore likely to disrupt the refined point distribution and see it deviate from the prescribed size map. The simplest mesh improvement schedule one may think of is thus composed of 2-3 and 3-2 flips only, the simplest operations that do not add or remove points; they are executed whenever they improve the quality of the tetrahedralization in their respective cavity. This strategy is able to eliminate efficiently most ill-shaped tetrahedra. However, this hill-climbing method often reaches a local maximum where 2-3 or 3-2 flips no longer improve the mesh quality although the overall quality is not yet optimal. To overcome this limitation, combinations of multiple flips can be applied at once whenever one can check that they result in a tetrahedralization of better quality. This is equivalent to creating more complex topological operations. For example, the 4-4 flip, which is an operation on a cavity of 6 points and 4 tetrahedra with only one interior edge (see Figure 6.2a), can be obtained by doing a 2-3 flip that creates a new interior edge, followed by a 3-2 flip that removes the initial edge.
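As an illustration of the acceptance rule used for all flips, the following is a minimal C sketch, under assumed data structures, of a 2-3 flip that is only committed when it improves the worst quality in its cavity. The tet4_t type and the quality() function are hypothetical stand-ins for the mesh data structure and the quality measure (e.g. the γ measure of Section 6.2.6, which returns a non-positive value for inverted tetrahedra).

#include <stdbool.h>
#include <stdint.h>

typedef struct { uint32_t v[4]; } tet4_t;   /* hypothetical: 4 vertex indices */

double quality(const tet4_t *t);            /* hypothetical quality measure */

static double min2(double a, double b) { return a < b ? a : b; }

/* Try a 2-3 flip on two tetrahedra (a,b,c,d) and (c,b,a,e) sharing face abc.
 * The flip replaces them by (a,b,e,d), (b,c,e,d) and (c,a,e,d); it is accepted
 * only if the worst quality among the three new tetrahedra is strictly better
 * than the worst quality of the two original ones. */
bool flip23(uint32_t a, uint32_t b, uint32_t c, uint32_t d, uint32_t e,
            const tet4_t *t1, const tet4_t *t2, tet4_t out[3])
{
    double q_before = min2(quality(t1), quality(t2));

    tet4_t n0 = {{a, b, e, d}};
    tet4_t n1 = {{b, c, e, d}};
    tet4_t n2 = {{c, a, e, d}};
    double q_after = min2(quality(&n0), min2(quality(&n1), quality(&n2)));

    if (q_after <= q_before)
        return false;            /* reject: no strict improvement of the cavity */

    out[0] = n0; out[1] = n1; out[2] = n2;
    return true;                 /* caller replaces t1, t2 by out[0..2] in the mesh */
}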

(a) The 4-4 flip. (b) Edge removal, multi-face removal and multi-face retriangulation.

Figure 6.2: Examples of composite topological operations

Edge removal

The most useful topological operation is the edge removal operation. It is a generalization of the 3-2 and 4-4 flips, starting from a cavity with N ⩾ 3 tetrahedra surrounding a unique edge ab (see Figure 6.2b). The operation first removes all tetrahedra adjacent to that edge, and creates instead N−2 upper tetrahedra connected to a and N−2 lower tetrahedra connected to b. The facets between lower and upper tetrahedra form a 2D sandwiched triangulation, and Shewchuk has proposed an algorithm, based on dynamic programming concepts, that finds the 2D triangulation resulting in an optimal tetrahedralization[110]. In contrast, Si presents a more versatile implementation of the edge removal operation that recursively removes other edges that prevent the current edge removal from producing a valid tetrahedralization[113]. This forms a tree of edge removal operations, where the leaves are removed with a sequence of 2-3 flips terminated by a final 3-2 flip. Our implementation of edge removal uses neither dynamic programming nor sequences of flips. It is rather a brute-force approach inspired by the first description of edge removal by Freitag and Ollivier-Gooch[50]. In a nutshell, the principle is as follows. Each triangle of the sandwiched 2D triangulation is assigned a quality, which is the minimum of the corresponding upper and lower tetrahedron qualities. If the quality of a triangle is less than the quality of the original tetrahedralization, the triangle is marked bad. Using precomputed tables giving all possible triangulations up to N = 7 in which this triangle is found, it is then possible to eliminate triangulations involving bad triangles. Whenever all triangulations are eliminated, the edge removal is not performed. If, on the other hand, several tetrahedralizations are possible, their respective overall qualities are computed, and the best candidate is selected. As Freitag and Ollivier-Gooch noted, having more than 7 tetrahedra around an edge is exceptional, and edge removal is also less likely to succeed in those cases. Our approach thus favors simplicity over asymptotic complexity, although all approaches have of course their pros and cons in practice. On average, our edge removal terminates within about 1 microsecond for cavities of about 5 tetrahedra, on modern hardware.

The inverse of the edge removal was coined multi-face removal by Shewchuk [110]. The operations that derive from edge removal are shown in Figure 6.2b. This group of operations includes a third operation, named multi-face retriangulation by Misztal et al.[86]. Multi-face retriangulation can be regarded either as a combination of an edge removal and a multi-face removal, or as a sequence of 4-4 flips modifying the 2D sandwiched triangulation. Although multi-face removal and multi-face retriangulation have proved their effectiveness, their implementation is more involved than edge removal, and they are far less used. We decided not to implement them, because they are covered by the small polyhedron reconnection (SPR) anyway. This mother of all flips is detailed in Section 6.2.

Smoothing

A vertex relocation, or smoothing operation, in the context of tetrahedral mesh optimization, is an operation that changes the position of a point in order to improve the quality of the adjacent tetrahedra. Smoothing methods have been extensively studied in the past 25 years[10, 48, 49, 37], and their objective is twofold: improve the overall quality of the mesh and space out points appropriately so that subsequent topological transformations can further improve the mesh. Smoothing and topological transformations are indeed more effective when combined [50]. The most used smoothing technique, called Laplacian smoothing, simply relocates a point at the centroid of the set of points to which it is connected by a mesh edge. As for flipping, Laplacian smoothing is applied only if it effectively improves the quality of the mesh. In our mesh generation library HXT, Laplacian smoothing is combined with a golden-section search of the optimal relocation on the segment between the original position and the centroid. This approach is effective in practice, although the objective function may have local maxima over the segment, and the golden-section search is not guaranteed to identify the largest one. In their work[48], Freitag et al. studied this optimized Laplacian smoothing technique, among other techniques.
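The following is a minimal C sketch of this optimized Laplacian smoothing, assuming a hypothetical objective function cavity_min_quality(p) that returns the worst quality of the tetrahedra adjacent to the relocated vertex when it is placed at p. The golden-section search looks for the best position on the segment between the original position and the centroid of the neighbouring vertices; the move is kept only if it improves on the original quality.

#include <math.h>
#include <stdbool.h>

typedef struct { double x, y, z; } point_t;

/* hypothetical: worst quality of the tetrahedra around the smoothed vertex
 * when that vertex is placed at position p */
double cavity_min_quality(point_t p);

static point_t lerp(point_t a, point_t b, double t) {
  point_t p = { a.x + t*(b.x - a.x), a.y + t*(b.y - a.y), a.z + t*(b.z - a.z) };
  return p;
}

/* Golden-section search for the best relocation on the segment [orig, centroid].
 * Returns true and writes the new position if the worst adjacent quality improves. */
bool smooth_vertex(point_t orig, point_t centroid, int iterations, point_t *best_pos)
{
  const double invphi = (sqrt(5.0) - 1.0) / 2.0;   /* 1/phi, about 0.618 */
  double a = 0.0, b = 1.0;
  double c = b - invphi * (b - a);
  double d = a + invphi * (b - a);
  double qc = cavity_min_quality(lerp(orig, centroid, c));
  double qd = cavity_min_quality(lerp(orig, centroid, d));

  for (int i = 0; i < iterations; i++) {
    if (qc > qd) {              /* maximum lies in [a, d] */
      b = d; d = c; qd = qc;
      c = b - invphi * (b - a);
      qc = cavity_min_quality(lerp(orig, centroid, c));
    } else {                    /* maximum lies in [c, b] */
      a = c; c = d; qc = qd;
      d = a + invphi * (b - a);
      qd = cavity_min_quality(lerp(orig, centroid, d));
    }
  }

  point_t p = lerp(orig, centroid, 0.5 * (a + b));
  if (cavity_min_quality(p) <= cavity_min_quality(orig))
    return false;               /* reject: no improvement over the original position */
  *best_pos = p;
  return true;
}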

Other operations

A good review of classical mesh improvement operations is found in Klingner’s Ph.D. dissertation[67]. Klingner reports therein that large quality improvements are obtained by using point insertion and edge contraction. We have not implemented those operations so far, but we will consider them for later updates. They are indeed more complex than one might think at first sight. It is not enough to insert or remove points and check whether the quality is improved. The position of adjacent points must also often be modified to obtain tetrahedra of better quality and, more importantly, to respect the mesh size map.


Figure 6.3: The 2-2 flip. Triangles of the surface mesh are shaded in orange.

Valid meshes of higher quality may be obtained in some situations if a slight modification of the surface mesh is allowed. This however implies relying on a

CAD software during the meshing process to evaluate the distance to the parametric definition of the initial surface. A more lightweight approach, considering a modification of the surface mesh itself up to an approximated Hausdorff distance threshold, would still mean keeping in memory an unmodified version of the surface mesh. As we are aiming at a simple and parallelizable mesh generator, we chose not to consider at all operations that modify the surface mesh. In some cases, this is even an asset, as there are situations where one wants the surface mesh to remain strictly unchanged, for instance if the surface is an interface between independent parts of an assembly. Boundary modification is however an option worth considering for future implementations, as it has shown its effectiveness[68]. The most elementary surface mesh modification operation simply flips a pair of adjacent tetrahedra, which are themselves adjacent to two boundary triangles on the surface mesh. This operation is called a 2-2 flip and is illustrated in Figure 6.3.

6.2 Small Polyhedron Reconnection

Instead of adding more and more operations to the existing zoo of topological transformations, it is possible to use an operation that generalizes all of them, called the small polyhedron reconnection (SPR) [75]. This operation considers the problem of finding the optimal triangulation of a cavity. A cavity, in this context, is a set of volumes defined by constrained closed surfaces, with possible interior constrained vertices, edges and triangles. Finding whether a polyhedron can be triangulated or not is already an NP-complete problem[100]. Finding the best triangulation is an even more difficult NP-hard problem. The SPR algorithm has indeed factorial complexity with regard to the number of points n in the cavity. The core idea of the SPR algorithm is to compute the best triangulation by searching the set of all possible triangulations using a branch and bound algorithm. It starts from a selected face, and recursively fills up the cavity with well-shaped tetrahedra, trying out each of the n − 3 remaining points, until the best triangulation is found.

The cavity is then replaced by its highest-quality triangulation, which contains all constrained triangles and edges. This method is highly flexible and independent of the chosen quality measure. However, using the SPR operation in practice presents great technical challenges, and the mesh improvement procedures implemented in Tetgen [113], CGAL [24, 66] or MMG3D [42] do not use this method. Indeed, the performance of the SPR search algorithm varies dramatically based on the parameters and heuristics that it relies upon. A poor choice makes the algorithm completely inadequate for large meshes, which can now be created at a rate of several million tetrahedra per second

[81]. The main contribution of this paper is to present efficient heuristics and implementation strategies that enable the use of SPR to optimize large meshes. We propose improvements to the SPR algorithm by exploring the space of possible triangulations in a different order, aiming at a significant reduction in the number of triangulations that need to be considered during the search (Section 6.2.3). Repeated computations of expensive quality measures are also avoided by storing their results (Section 6.2.5). These choices affect the time taken to optimize cavities by several orders of magnitude (Section 6.2.6).

6.2.1 The branch and bound algorithm

The core of our method is an efficient algorithm to search for the best triangulation of a small polyhedral cavity C. Let T(C) denote the set of all triangulations of this cavity. The best triangulation is defined as the triangulation of C that maximizes the quality of its worst element according to a quality measure q:

T_opt = argmax_{T ∈ T(C)} min_{t ∈ T} q(t)

The set of triangulations considered by the algorithm may be restricted to only those that contain a given set of edges and triangles if certain features must be preserved.

An optimal triangulation is computed using a branch and bound algorithm (Algorithm 3). Starting from the boundary ∂C of the cavity C as input, a triangle F ∈ ∂C is selected. Any mesh of the cavity C must include a tetrahedron t that contains the triangle F. Each possible tetrahedron is considered, by branching on all possible choices for its fourth vertex. After inserting a new tetrahedron, the boundary ∂C is replaced by the boundary of the part of the cavity which has not yet been filled with tetrahedra (see Figure 6.4). The best mesh of this new cavity is computed by recursively applying the algorithm. This corresponds to the exploration of a tree whose nodes correspond to triangulations and whose edges correspond to the insertion of tetrahedra.

Throughout this process, the best triangulation found so far is tracked, as well as its quality q∗. After each branch, an upper bound on the quality of the best solution that could be obtained is computed by finding the minimum quality element that has already been added to the solution. If the upper bound is worse than q∗, this part of the search tree is skipped. By the end of the algorithm, the optimal triangulation T_opt will have been found. The rest of this section discusses important design choices in order to achieve an efficient algorithm:

1. for a given triangle, how to compute the set of tetrahedra that can be built on top of it;

Algorithm 3 Optimize-Cavity: search for the best triangulation with a given boundary.
Input:
• B: the target boundary
• T: a partial triangulation
• T∗: the best triangulation found so far
• q: a quality function
Output: The best triangulation of the cavity

if B = ∅ then return T
F ← some triangle of B
for all vertices v do
  t ← the tetrahedron formed by joining v to each vertex of F
  if q(t) > min_{t′ ∈ T∗} q(t′) then
    T′ ← T ∪ {t}
    B′ ← B − ∂t
    T∗ ← Optimize-Cavity(B′, T′, T∗, q)
return T∗

2. the selection of the triangle F on top of which the next tetrahedron should be built;

3. the order in which the tetrahedra containing F are considered for insertion into the mesh;

4. how to avoid repeated evaluations of expensive geometric predicates.

6.2.2 Computing candidate tetrahedra

At each branching step, the algorithm considers a triangular face F before trying to add each tetrahedron t that contains F to the current triangulation. Because three of the vertices of t must be the vertices of F, only one vertex needs to be chosen. Many of those choices need to be filtered out, because inserting the corresponding tetrahedra would result in an invalid triangulation. Below are the conditions used to eliminate such candidates.

Geometric validity. Each candidate tetrahedron t must have a positive orientation and a positive quality q(t) greater than the quality q∗ of the best triangulation found so far.

Geometric intersections. The solution must not include intersecting tetrahedra. The new faces and edges of each candidate t are tested for intersection

with the edges and triangles of the boundary of the unmeshed region, as well as with constrained features. Tetrahedra that completely enclose a vertex are also rejected. All intersection tests are performed exactly, by relying only on exact computations of the orientations of tetrahedra.

Figure 6.4: Enumerating the triangulations of a polygon. At each step, a boundary edge is picked, and a branch is created for each possible triangle that contains that edge. The untriangulated area remaining after a triangle insertion is filled by applying this procedure recursively.
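To make the recursion of Figure 6.4 and Algorithm 3 concrete, here is a minimal C sketch of the same branch and bound idea in 2D: it enumerates the triangulations of a simple polygon by branching on the triangles that can be built on the first boundary edge, and prunes a branch as soon as its worst triangle cannot beat the best triangulation found so far. This is only a 2D toy under assumed helpers (triangle_quality, valid_triangle); the 3D implementation additionally handles constrained features and exact intersection tests. Calling it with bound set to the quality of the current triangulation mirrors the q(t) > γ_orig setting of Section 6.2.6.

#include <stddef.h>

#define MAXV 32   /* the polygon is assumed to have at most MAXV vertices */

typedef struct { double x, y; } vec2;

/* hypothetical helpers */
double triangle_quality(vec2 a, vec2 b, vec2 c);  /* > 0 iff counterclockwise and non-degenerate */
int    valid_triangle(const vec2 *pts, const int *poly, int n, int k); /* no crossing / enclosed vertex */

static double min2(double a, double b) { return a < b ? a : b; }

/* Returns the best achievable "worst triangle quality" for the polygon poly[0..n-1],
 * or `bound` if no triangulation beats the current best (branch pruned). */
double best_triangulation(const vec2 *pts, const int *poly, int n, double bound)
{
    if (n < 3) return 1e30;                 /* empty region: does not limit the minimum quality */

    double best = bound;                    /* only triangulations strictly better than `bound` matter */
    for (int k = 2; k < n; k++) {           /* branch: apex of the triangle built on edge (poly[0], poly[1]) */
        double q = triangle_quality(pts[poly[0]], pts[poly[1]], pts[poly[k]]);
        if (q <= best || !valid_triangle(pts, poly, n, k))
            continue;                       /* prune: this triangle alone already kills the branch */

        /* two sub-polygons remain on each side of the new triangle */
        int left[MAXV], right[MAXV];
        int nl = 0, nr = 0;
        for (int i = 1; i <= k; i++) left[nl++]  = poly[i];   /* poly[1] .. poly[k] */
        for (int i = k; i < n; i++)  right[nr++] = poly[i];   /* poly[k] .. poly[n-1], poly[0] */
        right[nr++] = poly[0];

        double ql = best_triangulation(pts, left, nl, best);
        if (ql <= best) continue;           /* prune: left part cannot beat the best known solution */
        double qr = best_triangulation(pts, right, nr, best);
        if (qr <= best) continue;           /* prune: right part cannot beat the best known solution */

        double branch = min2(q, min2(ql, qr));   /* worst triangle of this branch */
        if (branch > best) best = branch;
    }
    return best;                            /* equals `bound` if no triangulation improves on it */
}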

6.2.3 Mesh construction order

Changing the order in which tetrahedra are inserted into the mesh drastically affects the number of triangulations that need to be considered by the algorithm. This behavior is common in difficult optimization problems [56, 58]. Heuristics are used to choose a favorable order for most cases.

First, a triangular face F must be selected from the boundary of the unmeshed region. The algorithm then branches on the set of all tetrahedra containing F that can be added to the current triangulation. Faces are selected by attributing a cost to each of them. This cost is computed by summing the number of candidate tetrahedra containing the face and their qualities. A lower cost means either fewer candidates, hence a smaller search tree, or worse candidates, hence a tighter upper bound allowing for more pruning.

6.2.4 Ordering candidate tetrahedra

Once the face F has been selected, a second heuristic defines the order in which the different candidate tetrahedra are inserted into the mesh. If a good solution is found early, subtrees that provably cannot contain a better solution need not be explored. Hence, candidate tetrahedra are evaluated based on criteria used to determine how likely they are to be part of a good solution:

1. the number of faces shared with the boundary, since cavities with fewer boundary faces are generally easier to fill;

2. whether or not the candidate has a higher quality than the tetrahedra that have already been inserted into the mesh;

3. whether or not the candidate contains a new vertex, since any solution must contain all vertices present in the cavity.

Each tetrahedron is given a score based on the number of criteria that it meets. Candidates with a higher score are tested first.
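A minimal C sketch of this scoring follows, under an assumed candidate description; one possible reading of criterion 1 (sharing more than one face with the current boundary) is used here, and candidates are sorted by decreasing score before being tried. The candidate_t structure is a hypothetical stand-in, not HXT's data layout.

#include <stdbool.h>
#include <stdlib.h>

/* hypothetical description of a candidate tetrahedron built on the selected face F */
typedef struct {
  int  shared_boundary_faces;   /* number of its faces already on the cavity boundary */
  bool better_than_inserted;    /* quality higher than the tetrahedra already inserted */
  bool contains_new_vertex;     /* uses a cavity vertex not present in the partial mesh yet */
  int  score;
} candidate_t;

static int score_candidate(const candidate_t *c) {
  return (c->shared_boundary_faces > 1)      /* criterion 1: shrinks the remaining boundary */
       + (c->better_than_inserted ? 1 : 0)   /* criterion 2 */
       + (c->contains_new_vertex  ? 1 : 0);  /* criterion 3 */
}

static int by_decreasing_score(const void *a, const void *b) {
  return ((const candidate_t *)b)->score - ((const candidate_t *)a)->score;
}

void order_candidates(candidate_t *cand, size_t n) {
  for (size_t i = 0; i < n; i++)
    cand[i].score = score_candidate(&cand[i]);
  qsort(cand, n, sizeof(candidate_t), by_decreasing_score);  /* best candidates are tried first */
}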

6.2.5 Reusing results of geometric predicates

During the search, the orientations and qualities of many tetrahedra need to be evaluated. A robust algorithm requires these to be evaluated using adaptive precision in order to obtain consistent results despite numerical errors. As a result, these evaluations are computationally intensive. This effect is compounded by the need to evaluate the same tetrahedra many times, when they are considered as candidates on multiple occasions during the search.

Computing the qualities of all (n choose 4) tetrahedra formed by the n vertices of the cavity ahead of time would avoid repeated computations, but this solution is inadequate: the orientations and qualities of some tetrahedra are never needed. In many cases, the search ends early after only evaluating a small fraction of all possible tetrahedra. Instead, our approach is to memoize the computation of quality values. To evaluate a tetrahedron T, it is first normalized as T′ by sorting the indices of its four vertices. While sorting, the parity of the number of permutations that were performed is tracked. A table is then accessed to test whether or not the quality of T′ is known. If not, it is computed and stored in the table. The quality of T is the same as that of T′ for an even number of permutations, and the opposite otherwise.

In addition, if the tetrahedron intersects a constrained edge or fully encloses a vertex, a null quality is stored in the look-up table. This prevents the insertion of these tetrahedra into the mesh without requiring the reevaluation of the intersection tests.
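The sketch below illustrates this memoization in C, under an assumed dense n⁴ look-up table indexed by the four local vertex indices (n ≤ 32, as in Section 6.3). Vertex indices are sorted with a small sorting network while the permutation parity is tracked, the table is consulted, and the sign of the stored quality is flipped for odd permutations. The compute_quality() routine stands in for the adaptive-precision evaluation; a dense table is used here for clarity, whereas a compact layout over sorted index quadruples would use several times less memory.

#include <stdint.h>
#include <math.h>

#define NMAX 32
#define UNKNOWN_QUALITY NAN

/* hypothetical adaptive-precision quality evaluation of tetrahedron (a,b,c,d) */
double compute_quality(int a, int b, int c, int d);

/* memoization table: quality of the normalized (sorted) tetrahedron, NAN if unknown */
static double table[NMAX][NMAX][NMAX][NMAX];

void reset_table(void) {
  for (int a = 0; a < NMAX; a++)
    for (int b = 0; b < NMAX; b++)
      for (int c = 0; c < NMAX; c++)
        for (int d = 0; d < NMAX; d++)
          table[a][b][c][d] = UNKNOWN_QUALITY;
}

/* conditional swap that also tracks the parity of the permutation */
#define SORT2(x, y, parity) do { if ((x) > (y)) { int _t = (x); (x) = (y); (y) = _t; (parity) ^= 1; } } while (0)

double memoized_quality(int a, int b, int c, int d)
{
  int parity = 0;
  /* 5-comparison sorting network on 4 elements */
  SORT2(a, b, parity); SORT2(c, d, parity);
  SORT2(a, c, parity); SORT2(b, d, parity);
  SORT2(b, c, parity);

  double q = table[a][b][c][d];
  if (isnan(q)) {                    /* not evaluated yet: compute and store */
    q = compute_quality(a, b, c, d);
    table[a][b][c][d] = q;
  }
  /* an odd permutation reverses the orientation, hence the sign of the quality */
  return parity ? -q : q;
}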

6.2.6 SPR performance results

The SPR algorithm can be used with any chosen quality measure. For the purpose of this analysis we use

γ = √24 · 3V / (|e_max| (A₁ + A₂ + A₃ + A₄)) = √24 · r_in / |e_max|,   (6.1)

where V is the volume of the tetrahedron, |e_max| is the length of the longest edge, A_i is the area of the i-th face and r_in is the inradius of the tetrahedron. The factor √24 is added such that the optimal tetrahedron, which is the regular tetrahedron, has a quality γ = 1. This quality measure penalizes all tetrahedra according to their associated interpolation error [111].

For each cavity featured in Figure 6.5, we measured the running time of the SPR operation (Table 6.1). All 5 cavities were extracted from real-world, non-optimized meshes. We measured the running time in two different settings:

1. with no initial lower bounds given to the algorithm, as is the case when SPR is used for boundary recovery and no initial triangulation is known;

2. with the quality of the initial triangulation γorig as a lower bound, as is done for mesh optimization.

The execution time in the second case is always strictly less than in the first, because the lower bound allows the algorithm to prune triangulations containing a tetrahedron with a quality worse than γ_orig. All cavities were optimized within 3.6 milliseconds when the lower bound was used, and within 13.3 milliseconds otherwise. We then measured the speedups offered by the different improvements to the algorithm by disabling each optimization independently:

1. the face selection heuristic (Section 6.2.3);

2. the candidate ordering heuristic (Section 6.2.4);

3. the reuse of previously computed qualities of tetrahedra (Section 6.2.5). 6.2. SMALL POLYHEDRON RECONNECTION 111


Figure 6.5: Different cavities on which the tests were conducted. № 1, 3 and 4 were randomly extracted from the mesh improvement procedure of a torus or sphere. № 2 looks ordinary but is particularly slow without cleverly chosen heuristics, whereas № 5 comes from a mesh with bad elongated triangles on its surface.

The combined effect of these improvements was measured by disabling all of them simultaneously. The improved algorithm is between 10⁴ and 10⁷ times faster than a naive implementation. In the case of the third cavity, the execution with all optimizations disabled was stopped after 20 hours with no improved solution found (Table 6.1).

          Quality              Time (ms)
Cavity   V    |∂C|   γ_orig    γ_after   q(t) > 0   q(t) > γ_orig
1        25   44     0.28      0.52      3.1        2.5
2        31   52     0.31      0.53      13.3       3.0
3        31   52     0.25      0.51      9.6        3.6
4        22   38     0.23      0.55      4.8        1.0
5        29   48     0.29925   0.29938   7.3        /

          Speedup
Cavity   §6.2.3    §6.2.4   §6.2.5   Combined
1        25        1.4      0.68     4×10⁴
2        3.5×10⁴   3.1      5×10⁴    3×10⁶
3        10⁶       10       2072     > 10⁷
4        501       2.6      0.97     2×10⁴
5        5×10⁴     1.1      1.5      8×10⁴

Table 6.1: Summary of the results obtained by optimizing the cavities of Figure 6.5 using SPR. Speedups are given for q(t) > γ_orig.

All experiments were performed on an Intel® Core™ i7-6700HQ CPU. Timings and speedups are the average of 100 runs, or of two runs for entries that required more than one minute of computation time. Our implementation uses only a small amount of memory, although asymptotically proportional to n⁴, where n is the number of points of the cavity. For the cavities tested, the maximum RAM usage did not exceed 3 MB.

6.3 Growing SPR cavity

The choice of the cavity on which to apply the SPR operation is a matter that has received very little attention, although the performance strongly depends on the geometry and topology of the cavity. In this paper, we propose a new algorithm called Growing SPR Cavity (GSC), whose basic idea is to increment the number of points gradually, and to apply the SPR at each step until a better triangulation is found. As the SPR algorithm has factorial complexity, the cost of repeating SPR operations from 4 to n points is not much higher than computing directly the best triangulation with n points. In addition, most of the SPR structure can be reused from one iteration to the next, and a better triangulation of the cavity is usually found within a few iterations. The limit on the maximum number of points has been set to n = 32, and GSC abandons if no better triangulation has been found when the cavity has reached 32 points.

We explain in Section 6.2 how the SPR algorithm can be optimized by storing, in an n⁴ table, the quality (and thus also the validity) of all tetrahedra the algorithm has already encountered. By limiting the number of points to 32, we can allocate an acceptably large 32⁴ table at the beginning of the GSC algorithm and keep it from one iteration of the GSC algorithm to another.

Algorithm 4 The Growing SPR Cavity (GSC) operation, applied on a bad tetrahedron t of a mesh M
1: function GSC(t, M)
2:   C ← t                        ▷ the cavity is the tetrahedron at first
3:   n ← 4                        ▷ the number of points in C
4:   while n < 32 do
5:     C ← ExtendCavity(M, C, n)  ▷ add p_{k+1} and all tetrahedra with 4 vertices in P_{k+1}
6:     n ← n + 1
7:
8:     Cb ← SPR(C)                ▷ find the optimal triangulation of C
9:     if Cb ≠ C then
10:      return (M \ C) ∪ Cb      ▷ mesh is improved
11:
12:  return M                     ▷ failed to improve the mesh


Figure 6.6: Growing SPR Cavity in 2D. Triangles in the cavity C are colored in orange, and triangles adjacent to the cavity A are colored in cyan. (a): at first, the cavity only contains a bad triangle. (b) and (c): every point outside the cavity is a vertex of at most one triangle in A, so we add the triangle in A with the worst quality to C. (d): the SPR algorithm found a better triangulation for C5; the triangulation of the cavity is replaced and the GSC algorithm ends.

The GSC pseudocode given in Algorithm 4 can be explained as follows. As its initial cavity, GSC starts with a tetrahedron of bad quality which needs to be optimized. Let the four points of this initial tetrahedron be denoted {p1, p2, p3, p4}. Now, let C_k be a cavity containing a set of k points P_k = {p1, p2, . . . , pk}. To iteratively obtain C_{k+1}, a point p_{k+1} is added, and every tetrahedron whose four points are in P_{k+1} = P_k ∪ {p_{k+1}} is also added.

The selection of the next point p_{k+1} is a heuristic based on a simple intuition confirmed by experience in testing the SPR algorithm. We observed that it is in general easier to find a better tetrahedralization for a cavity with few points and many tetrahedra than for a cavity with few tetrahedra per point. Therefore, the point p_{k+1} is chosen so as to add as many tetrahedra as possible. Let A_k denote the set of tetrahedra sharing at least one facet with a tetrahedron in C_k. In practice, GSC adds the most connected point, i.e., the point adjacent to the largest number of tetrahedra in A_k. If several points are adjacent to m tetrahedra in A_k, the sum of the qualities of those m tetrahedra is evaluated for each point, and the point with the lowest sum is selected as a tiebreaker rule. Figure 6.6 illustrates the GSC algorithm in 2D, where the tiebreaker rule has been used twice. This rule is however empirical, and different alternative rules were also tested: selecting the point with the highest sum of qualities, with the maximum or minimum quality, or with the quality function replaced by the volume or by the height associated with the boundary facet. The proposed tiebreaker rule consistently gave better results in our full mesh improvement schedule, as detailed in Section 6.4.

In 2D, the triangles to be added to the cavity, i.e. those with all 3 points in P_{k+1} = P_k ∪ {p_{k+1}}, are always in A_k. However, in 3D there might be tetrahedra that are neither in A_k nor in C_k, but still have all their vertices in P_{k+1}. Therefore, in reality, the GSC algorithm does not always add the optimal point that results in the addition of as many tetrahedra as possible. Choosing the most connected point and not the optimal point is however simpler and faster in practice.
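A minimal C sketch of this point selection follows, under an assumed adjacency description of A_k: for every point outside the cavity we know the number of adjacent tetrahedra of A_k and the sum of their qualities, and we pick the most connected point, breaking ties by the lowest quality sum. The adj_t structure and the way A_k is gathered are hypothetical, not HXT's data layout.

#include <stdint.h>
#include <stddef.h>

/* hypothetical per-candidate-point summary of the tetrahedra of A_k it touches */
typedef struct {
  uint32_t point;        /* global index of the candidate point (outside the cavity) */
  int      num_adjacent; /* number of tetrahedra of A_k adjacent to this point */
  double   quality_sum;  /* sum of the qualities of those tetrahedra */
} adj_t;

/* Select the next point p_{k+1}: the most connected point, with the lowest
 * quality sum used as a tiebreaker (worse neighbourhoods are grown first). */
uint32_t select_next_point(const adj_t *candidates, size_t n)
{
  size_t best = 0;
  for (size_t i = 1; i < n; i++) {
    if (candidates[i].num_adjacent > candidates[best].num_adjacent ||
        (candidates[i].num_adjacent == candidates[best].num_adjacent &&
         candidates[i].quality_sum < candidates[best].quality_sum))
      best = i;
  }
  return candidates[best].point;
}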

6.4 Mesh improvement schedule

Our mesh improvement strategy includes a Laplacian smoothing phase, an edge removal phase and the GSC. As Laplacian smoothing and edge removal are approximately 100× faster than GSC, they are used in priority, whereas GSC is used as a last resort technique to unlock processes trapped in local maxima of the mesh quality objective function. The pseudocode for a simplified serial mesh improvement schedule is presented in Algorithm 5. The for loop on line 7 will be called the SER loop in the following (Smoothing, Edge Removal), and the loop where the GSC operation is applied, on line 17, will be called the GSC loop.

(a) Tetrahedra with γ < 0.35 before (above) and after (below) the mesh improvement step on the Rotor mesh. (b) The same for the Piston mesh. (c) The same for the Rim mesh.

Figure 6.7: Effect of HXT’s mesh improvement step on bad tetrahedra. The threshold for being considered a bad tetrahedron was set to γ_threshold = 0.35. Almost all bad tetrahedra of the Rotor, Piston and Rim meshes (specified in Table 7.2) were improved above this threshold.

Both loops iterate over bad tetrahedra, i.e. tetrahedra with a quality below a user-defined threshold. Using the Gamma quality function detailed in Section 7.1.3, our mesh improvement strategy is able to eliminate most tetrahedra with γ < 0.35, as shown in Figure 6.7, where one can see that only a few bad tetrahedra subsist.

Parallelization

The mesh improvement strategy described above can be parallelized pretty much in the same way as the mesh refinement. The parallel shared-memory

Algorithm 5 The proposed serial mesh improvement schedule tries to improve each tetrahedron τ from a list of bad tetrahedra T in a mesh M. Bad tetrahedra are tetrahedra with a quality smaller than a user-defined threshold qmin

1: function MeshImprovement(M, qmin)
2:   do
3:     modifGSC ← 0                      ▷ count modifications of the mesh by Growing SPR Cavity
4:     do
5:       modifSER ← 0                    ▷ count modifications of the mesh by Smoothing or Edge Removal
6:       T ← GetBadTetrahedra(M, qmin)
7:       for τ ∈ T do                    ▷ the SER loop
8:         improved ← False
9:         for point ∈ τ and ¬improved do
10:          improved ← LaplacianSmoothing(M, point)
11:        for edge ∈ τ and ¬improved do
12:          improved ← edgeRemoval(M, edge)
13:        if improved then
14:          modifSER ← modifSER + 1
15:      while modifSER > 0
16:    T ← GetBadTetrahedra(M, qmin)
17:    for τ ∈ T do                      ▷ the GSC loop
18:      if GSC(M, τ) then
19:        modifGSC ← modifGSC + 1
20:  while modifGSC > 0

Delaunay kernel introduced in Chapter 4 partitions the domain on the basis of a 3D Moore curve defined in such a way that all partitions contain the same amount of points to be inserted. A point is in a partition if its Moore index is in the range that defines the partition. A tetrahedron, then, is considered to belong to a partition if at least 3 of its 4 vertices lie in that partition. When the cavity created for the insertion of a new point overlaps the boundary between different partitions, the operation is suspended until all other insertions have been tried. At that moment, a new Moore curve and hence new partitions are created, and the insertion loop resumes with the points whose insertion was suspended. The number of parallel threads, and hence the number of partitions, is determined at the beginning of each sweep by the percentage ρ of suspended point insertions during the previous sweep. If ρ = 1, a single thread is used so that the termination of the algorithm is guaranteed.
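The following is a minimal C sketch of these two membership rules, under an assumed precomputed array of Moore indices and an array of partition boundaries: a vertex belongs to the partition whose index range contains its Moore index, and a tetrahedron is assigned to a partition only when at least three of its four vertices fall in it; otherwise it lies in a buffer zone and its treatment is suspended. The array layout is hypothetical, not HXT's actual data structure.

#include <stdint.h>

#define NO_PARTITION (-1)   /* tetrahedron in a buffer zone: operation suspended */

/* partition p owns Moore indices in [start[p], start[p+1]); there are nparts ranges */
int point_partition(uint64_t moore_index, const uint64_t *start, int nparts)
{
  int lo = 0, hi = nparts;              /* binary search over the partition ranges */
  while (hi - lo > 1) {
    int mid = (lo + hi) / 2;
    if (moore_index < start[mid]) hi = mid;
    else                          lo = mid;
  }
  return lo;
}

/* A tetrahedron belongs to a partition if at least 3 of its 4 vertices do. */
int tet_partition(const uint32_t vertex[4], const uint64_t *moore_of_vertex,
                  const uint64_t *start, int nparts)
{
  int part[4], count[4] = {0, 0, 0, 0};
  for (int i = 0; i < 4; i++)
    part[i] = point_partition(moore_of_vertex[vertex[i]], start, nparts);

  for (int i = 0; i < 4; i++)           /* count, for each vertex, how many vertices share its partition */
    for (int j = 0; j < 4; j++)
      if (part[j] == part[i]) count[i]++;

  for (int i = 0; i < 4; i++)
    if (count[i] >= 3) return part[i];
  return NO_PARTITION;
}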

(a) Rotor mesh partitions (b) Piston mesh partitions (c) Rim mesh partitions

Figure 6.8: Partitions based on the 3D Moore curve ordering. These partitions were created almost instantaneously at the start of the mesh improvement step, running on 8 threads.

Similarly, for mesh improvement, the space-filling curve is partitioned so as to equally distribute bad tetrahedra over the threads. Figure 6.8 shows partitioning examples for 8 threads on 3 different meshes. The number of threads is again decided as a function of the percentage of suspended operations (due to a conflict with another partition) in the previous sweep. The SER loop is thus executed repeatedly with decreasing numbers of threads, until all partition conflicts have been resolved. The same procedure is used for the GSC loop as well. This parallel algorithm scales well only in the case of very big meshes, for essentially two reasons. Firstly, elements crossing partition boundaries represent a larger portion of space in small meshes than in large meshes, thus mechanically yielding more conflicts. This is less of an issue with mesh refinement because cavities are usually smaller for point insertion than for the Growing SPR Cavity. The second reason is that the time spent in one execution of the SER or of the GSC loop is very small, typically in the millisecond range. Therefore, the overhead of launching new threads and computing Moore indices for the whole mesh is significant. Figure 6.9a compares the scaling in the case of two highly-refined models. The mesh generation was done on the 64 physical cores of an Intel Xeon Phi 7210 machine, running at 1.3 GHz. The overall effectiveness and performance of our mesh improvement algorithm is analyzed in Chapter 7.

[Plot: time in seconds versus number of threads (1 to 64), compared to perfect scaling, for the Aircraft and 500 thin fibers meshes.]
(a) Scaling of tetrahedral mesh improvement on 2 very large meshes.

(b) The 2 different meshes used within the scaling graph. Above, an Aircraft with 637 million tetrahedra, the interior being also meshed. Below, a cube with 500 thin fibers that has 351 million tetrahedra.

Figure 6.9: Scaling of our parallel mesh improvement schedule on 2 huge meshes.

Chapter 7

Mesh generator’s performance

This chapter is the second part of our yet unpublished paper “Quality tetrahedral mesh generation with HXT” (the first part is in Chapter 6). Our mesh generator, called HXT, is compared with TetGen and Gmsh, quality-wise and performance-wise.

Before comparing the performance of our tetrahedral mesh generator with TetGen and Gmsh, the similarities and key differences between the different implementations are briefly recalled. Besides all being open-source and free, the three considered software tools are also structured very similarly. As explained in Chapter 5, the mesh generation process can be split into four distinct steps: creation of an empty mesh, boundary recovery, mesh refinement and mesh improvement. Although it could be sensible to work on improving the quality of tetrahedra right at the moment of their creation, none of the examined mesh generators does so. We think the main reason for this, besides the obvious one that it allows programmers to focus on smaller tasks and goals, is that it allows using the programs as building blocks. It is indeed possible to create a mesh with Gmsh, refine it with TetGen, and optimize it with our HXT algorithm, without duplicating any part of the process. Another advantage is that the implementations of the different steps can be analyzed separately, and their respective execution times compared on different models. The results of our benchmark for all 4 steps of the tetrahedral mesh generation, including the specification of relevant hardware characteristics and compilation flags, are given in Section 7.1. More specifically, element numbers, timings and CPU/memory usages are reported in Table 7.2. Comparative performances per step (except for the boundary recovery step) and per mesh generator are shown as bar plots in Figure 7.1. Being committed to providing reproducible research, the code and meshes used in this chapter's test

cases are all made available at https://git.immc.ucl.ac.be/hextreme/tet-mesher-bench. The computer code of our mesh generator, HXT, is available in Gmsh 4.6.0¹ and later versions through the -algo hxt parameter or the Mesh.Algorithm3D=10 option. HXT successfully passes all Gmsh 3D benchmarks and, as discussed below, is faster and generates meshes of higher quality than the other open-source implementations, including Gmsh's native 3D mesher.

Empty Mesh

The first step, i.e., the creation of the empty mesh, consists in computing a Delaunay tetrahedralization through all points of the surface mesh plus some possible user-defined interior points. Since a Delaunay tetrahedralization is unique, provided a simulation of simplicity [44] is used in conjunction with robust adaptive predicates [108], all three programs generate identical empty meshes. The ordering of the tetrahedra, however, can vary largely from one package to another. The ordering is even non-deterministic with HXT, because threads can reserve chunks of memory corresponding to a set of 8192 tetrahedra at different moments from one run to another. To get rid of this non-deterministic behaviour, HXT offers a reproducible mode, which reorders tetrahedra deterministically in the lexicographic order of their nodes. The performance of TetGen, Gmsh and HXT (with and without the reproducible mode) on the different meshes shown in Table 7.2 is reported in Figure 7.1b. HXT is the fastest for the empty mesh step, but not because of parallelization. Most tetrahedra of the empty mesh indeed cross the domain from side to side. With our partitioning method based on a space-filling curve, points that are at different extremities of the domain have very little chance of being in the same partition. Because a tetrahedron is only considered in a partition when at least three of its vertices are in that partition, parallelization is not very effective at this step. The speed difference between HXT and TetGen, or between HXT and Gmsh, is rather explained here by the good serial performance of our Delaunay kernel [81].
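The following C sketch illustrates what such a deterministic reordering could look like. The tet_t type and function names are hypothetical and do not reproduce HXT's actual code; the point is simply that sorting tetrahedra by the lexicographic order of their node indices removes any dependence on which thread created which element.

#include <stdint.h>
#include <stdlib.h>

/* Illustrative sketch of a reproducible reordering: each tetrahedron stores
 * 4 global node indices, and tetrahedra are sorted lexicographically on
 * those indices. */
typedef struct { uint32_t node[4]; } tet_t;

static int compare_tets(const void *a, const void *b) {
    const tet_t *ta = (const tet_t *)a;
    const tet_t *tb = (const tet_t *)b;
    for (int i = 0; i < 4; i++) {
        if (ta->node[i] < tb->node[i]) return -1;
        if (ta->node[i] > tb->node[i]) return  1;
    }
    return 0;
}

static void reorder_tetrahedra(tet_t *tets, size_t ntets) {
    qsort(tets, ntets, sizeof(tet_t), compare_tets);
}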

Boundary Recovery

All 3 software tools rely internally on TetGen's boundary recovery code. Table 7.2 shows, however, that HXT is a bit slower than TetGen for boundary recovery, because of the back-and-forth conversion of the mesh between its own data structure and TetGen's format. Gmsh applies the same, albeit much slower, kind of conversion.

Moreover, the order in which tetrahedra are stored in memory can either slow down or speed up the boundary recovery process, which explains the timing differences observed between HXT in normal and reproducible mode. As the other parts of HXT are parallelized and optimized for heavy workloads, boundary recovery is the bottleneck of our code for large meshes. In contrast, HXT usually performs very well on small meshes. We suspect that TetGen's algorithm for locating missing facets or edges has a superlinear complexity with respect to the size of the mesh.

¹https://gitlab.onelab.info/gmsh/gmsh/-/tree/master/contrib/hxt

Mesh Refinement

Mesh refinement is the step where our parallel HXT algorithm really stands out. The ways TetGen, HXT or Gmsh proceed to refine the mesh are rather different, but Figure 7.1a nonetheless shows that the different software tools generate meshes with only slightly different numbers of tetrahedra. TetGen was given the option -q1.1/14, causing it to refine only tetrahedra with a radius-edge ratio larger than 1.1 or a minimum dihedral angle smaller than 14° [113]. No meshsize constraint was given, although TetGen allows it. TetGen therefore adds points only to optimize the quality of elements. In contrast, HXT and Gmsh add a new point p_{k+1} inside a tetrahedron τ only if the insertion does not create an edge shorter than the prescribed meshsize, which is the value linearly interpolated from the prescribed meshsizes at the vertices of τ. At the beginning of the refinement step, nodal meshsizes are evaluated from the surface mesh. A node of the surface mesh p_{bnd} has a meshsize equal to the average length of the constrained edges that contain p_{bnd}. Contrary to TetGen, HXT and Gmsh thus do not refine with the goal of improving the quality of the elements. All three software tools do, however, end up with meshes with very similar numbers of tetrahedra. Still, the timings for mesh refinement and improvement indicated in Table 7.2, Figure 7.1c and 7.1d are expressed in seconds per million tetrahedra to alleviate the effect of the different refinement strategies. HXT is at least 5 times faster than Gmsh and TetGen for mesh refinement, consistently generating more than one million tetrahedra per second whereas the other mesh generators peak at 300 000 tetrahedra per second. On the 300 fibers model, HXT reaches nearly 3 million tetrahedra per second, which is 132 times faster than Gmsh. HXT is also efficient in terms of memory usage. For the 100 fibers mesh, the peak resident set size does not exceed 77 bytes per tetrahedron. If memory allocations not depending on the size of the mesh are put aside, HXT consumes only about 60 bytes per tetrahedron.
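The refinement filter just described can be sketched as follows. The helper names are hypothetical, and the assumption that the new edges join the candidate point to the vertices on the boundary of its insertion cavity is an illustration of the rule rather than HXT's implementation.

#include <math.h>
#include <stdbool.h>

/* Meshsize at a candidate point, linearly interpolated from the nodal
 * meshsizes at the 4 vertices of the containing tetrahedron tau, using
 * the barycentric coordinates of the point in tau. */
static double interpolated_meshsize(const double bary[4], const double nodal_size[4]) {
    double h = 0.0;
    for (int i = 0; i < 4; i++)
        h += bary[i] * nodal_size[i];
    return h;
}

static double distance3(const double a[3], const double b[3]) {
    double dx = a[0] - b[0], dy = a[1] - b[1], dz = a[2] - b[2];
    return sqrt(dx * dx + dy * dy + dz * dz);
}

/* Accept the candidate point p only if every edge its insertion would
 * create, i.e. every segment joining p to a cavity boundary vertex, is at
 * least as long as the interpolated meshsize h. */
static bool accept_candidate(const double p[3], const double cavity_pts[][3],
                             int n_cavity_pts, double h) {
    for (int i = 0; i < n_cavity_pts; i++)
        if (distance3(p, cavity_pts[i]) < h)
            return false;
    return true;
}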

Mesh Improvement

For the mesh improvement step, TetGen was given the -o/130 option, setting the target maximum dihedral angle to 130°. In contrast, HXT and Gmsh target a minimum inradius-to-longest-edge ratio γ_threshold = 0.35. As Gmsh and HXT avoid modifying the surface mesh, TetGen was also prevented from modifying the surface mesh by enabling the -Y option. In addition to the timings of Table 7.2, mesh quality has also been compared after the mesh improvement step for 3 geometries (Rotor, Piston and Rim) and 3 quality measures:

• the dihedral angles, plotted in Figure 7.2

• the inradius-to-longest-edge ratio (γ), shown on the bar plot of Figure 7.3. The tetrahedra with γ < 0.35 before and after HXT's mesh improvement step are also shown in Figure 6.7.

• the signed inverse condition number (SICN), plotted in Figure 7.4

HXT gives comparatively the best results for each of those three quality measures, even for the maximum dihedral angle, which is the quality measure used by TetGen during the mesh improvement step. Figure 7.2 indeed shows noticeably smaller maximum dihedral angles for HXT compared to Gmsh and TetGen for each of the 3 considered meshes. This difference is explained by the effectiveness of the new Growing SPR Cavity operation described in Chapter 6. The performance of HXT is even more striking when looking at the minimum inradius-to-longest-edge ratio and the minimum SICN:

Rotor

             γ       SICN    ∡min    ∡max
Gmsh         0.014   0.077   4.06    174.76
TetGen       0.037   0.16    5.10    168.28
HXT (ours)   0.13    0.16    5.10    149.2

Piston

             γ       SICN    ∡min    ∡max
Gmsh         0.23    0.23    11.24   161.40
TetGen       0.15    0.30    14.40   156.12
HXT (ours)   0.29    0.39    13.76   146.64

Rim

             γ       SICN    ∡min    ∡max
Gmsh         0.0094  0.034   1.59    177.22
TetGen       0.073   0.15    6.70    168.26
HXT (ours)   0.16    0.17    5.92    149.09

Table 7.1: Minimum γ and SICN measures among all tetrahedra, and minimum and maximum dihedral angle among all dihedral angles of the mesh, for the Rotor, Piston and Rim meshes with the 3 different open-source software tools tested.

HXT's mesh improvement is parallelized and has very fast smoothing and edge removal operations. Table 7.2 and Figure 7.1d indeed indicate that HXT is also the fastest when it comes to mesh improvement, but not by much. The reason for this is simple: TetGen and Gmsh stop optimizing whenever the smoothing and edge removal operations become ineffective, whereas HXT then starts its first GSC sweep.

7.1 Benchmarks

Benchmarks use the average of 5 runs on an Intel® Core™ i7-6700HQ with 4 cores running at 3.5 GHz and 16 GB of RAM. We use Gmsh 4.6.0 with the -optimize_threshold=0.35 option and HXT's tetMesh_CLI executable provided in gmsh/contrib/hxt/tetMesh/ with default options. We use TetGen 1.5.1 with the "-BNEFIVVYp -q1.1/14 -O7 -o/130" options. TetGen was compiled with gcc -O3, Gmsh and HXT with "gcc -O3 -march=native". See the beginning of this chapter for an in-depth discussion of the results. The code for the benchmarks is available at https://git.immc.ucl.ac.be/hextreme/tet-mesher-bench.

7.1.1 Speed

Figure 7.1 shows bar plots of different performance measures, further detailed in Table 7.2.

[Bar plots comparing Gmsh, TetGen, Ours (HXT) and Ours in reproducible mode on the Rotor, Knuckle, Rim, 100 fibers, 300 fibers and Piston models:
(a) Final number of tetrahedra
(b) Tetrahedrizing the empty mesh (time in s)
(c) Mesh refinement (time per tetrahedron, s/tet)
(d) Mesh improvement (time per tetrahedron, s/tet)]

Figure 7.1: Performance benchmark bar plots

models                          Rotor       Knuckle        Rim           100 fibers    300 fibers     Piston
Surface mesh
  number of points              115 052     435 423        397 985       47 759        328 661        27 187
  number of triangles           230 232     870 914        796 030       94 994        656 162        54 374
Gmsh
  final number of points        138 652     1 435 736      964 213       583 860       2 583 840      52 823
  final number of tetrahedra    493 632     7 536 855      4 729 987     3 580 914     15 680 789     241 982
  Empty Mesh [s]                2.31        13.78          9.68          0.95          9.78           0.52
  Boundary Recovery [s]         2.43        33.039         32.93         6.30          0.07           23.92
  Mesh Refinement [µs/tet]      14.61       36.73          28.50         28.72         45.12          20.69
  Mesh Improvement [µs/tet]     4.14        7.16           5.96          5.73          8.24           4.23
  Max. mem. usage [B/tet]       1226.70     558.86         569.25        524.24        493.53         865.92
  CPU %                         99.9        99.9           99.9          99.9          99.9           99.9
TetGen
  final number of points        169 755     2 140 715      1 101 619     614 938       3 183 439      59 876
  final number of tetrahedra    670 594     11 442 634     5 382 975     3 606 324     18 589 149     275 491
  Empty Mesh [s]                1.78        8.13           6.32          0.63          4.81           0.42
  Boundary Recovery [s]         1.23        6.21           4.70          0.54          4.60           0.32
  Mesh Refinement [µs/tet]      3.19        5.23           4.76          3.49          4.14           3.98
  Mesh Improvement [µs/tet]     0.82        1.00           0.94          0.88          0.94           0.73
  Max. mem. usage [B/tet]       903.46      314.12         474.32        204.72        227.18         614.19
  CPU %                         99.9        99.9           99.9          99.9          99.9           99.9
HXT (ours)
  final number of points        146 957     2 121 555.4    1 184 497.6   1 162 217     4 350 880.2    57 452
  final number of tetrahedra    541 929.6   11 714 033.8   6 063 127.6   7 139 662     26 526 744.2   268 116.4
  Empty Mesh [s]                0.36        1.57           0.87          0.16          1.13           0.12
  Boundary Recovery [s]         1.91        8.54           6.64          0.72          6.53           0.47
  Mesh Refinement [µs/tet]      0.64        0.50           0.47          0.35          0.34           0.62
  Mesh Improvement [µs/tet]     0.78        0.65           0.11          0.36          0.31           0.25
  Max. mem. usage [B/tet]       1032.73     221.15         328.88        80.86         77.13          646.72
  CPU %                         292.2       383.0          326.4         523.8         479.4          351.2
HXT (ours) reproducible mode
  final number of points        146 967     2 121 944      1 184 569     1 162 848     4 348 050      57 505
  final number of tetrahedra    542 012     11 716 723     6 064 318     7 141 786     26 506 921     268 493
  Empty Mesh [s]                0.35        1.67           1.35          0.19          1.35           0.12
  Boundary Recovery [s]         1.60        7.60           6.10          0.64          5.74           0.42
  Mesh Refinement [µs/tet]      0.65        0.83           0.78          0.51          0.75           0.75
  Mesh Improvement [µs/tet]     0.61        0.64           0.90          0.36          0.24           0.15
  Max. mem. usage [B/tet]       1031.42     221.40         328.60        123.45        113.98         654.47
  CPU %                         301.4       444.6          391.4         577.4         514.8          379.2

Table 7.2: Performance benchmark table

7.1.2 Dihedral Angles

The dihedral angles formed by each pair of facets of a tetrahedron are customary measures to look at because of their conceptual simplicity. However, this measure does not directly correspond to any type of error that the discretization induces during a finite element simulation [111].
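For reference, the six dihedral angles of a tetrahedron can be computed from the normals of its faces. The short C sketch below is not taken from HXT; it evaluates the dihedral angle along one edge of a tetrahedron (a, b, c, d).

#include <math.h>

static void cross3(const double u[3], const double v[3], double w[3]) {
    w[0] = u[1] * v[2] - u[2] * v[1];
    w[1] = u[2] * v[0] - u[0] * v[2];
    w[2] = u[0] * v[1] - u[1] * v[0];
}

static double dot3(const double u[3], const double v[3]) {
    return u[0] * v[0] + u[1] * v[1] + u[2] * v[2];
}

/* Dihedral angle (in degrees) along the edge (a, b) shared by the faces
 * (a, b, c) and (a, b, d).  The angle between n1 = ab x ac and n2 = ab x ad
 * equals the dihedral angle, since both cross products rotate the in-plane
 * directions toward c and d by 90 degrees around ab. */
static double dihedral_angle(const double a[3], const double b[3],
                             const double c[3], const double d[3]) {
    const double pi = 3.14159265358979323846;
    double ab[3], ac[3], ad[3], n1[3], n2[3];
    for (int i = 0; i < 3; i++) {
        ab[i] = b[i] - a[i];
        ac[i] = c[i] - a[i];
        ad[i] = d[i] - a[i];
    }
    cross3(ab, ac, n1);
    cross3(ab, ad, n2);
    double c01 = dot3(n1, n2) / sqrt(dot3(n1, n1) * dot3(n2, n2));
    if (c01 > 1.0)  c01 = 1.0;    /* clamp against round-off */
    if (c01 < -1.0) c01 = -1.0;
    return acos(c01) * 180.0 / pi;
}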

[3 × 3 histogram panels: Gmsh, Tetgen and Ours (rows), each for the Rotor, Piston and Rim meshes (columns); x axis: dihedral angle [°] from 0 to 180, y axis: number of angles.]

Figure 7.2: Histograms of the dihedral angles. Lower and upper bounds are marked with red vertical lines. The average is marked with a blue line.

7.1.3 Gamma

\[
\gamma \;=\; \frac{\sqrt{24}\; 3V}{|e_{\max}|\,(A_1 + A_2 + A_3 + A_4)} \;=\; \frac{\sqrt{24}\; r_{\mathrm{in}}}{|e_{\max}|},
\]

where V is the volume of the tetrahedron, |e_max| is the length of the longest edge, A_i is the area of the i-th face and r_in is the inradius of the tetrahedron. The factor √24 is added such that the optimal tetrahedron, which is the regular tetrahedron, has a quality γ = 1. This quality measure penalizes all tetrahedra according to their associated interpolation error [111].
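As a concrete illustration of this formula, the following C function (a hypothetical helper, not HXT's implementation) evaluates γ from the four vertex coordinates; it returns 1 for a regular tetrahedron.

#include <math.h>

static double triangle_area(const double a[3], const double b[3], const double c[3]) {
    double u[3], v[3], n[3];
    for (int i = 0; i < 3; i++) { u[i] = b[i] - a[i]; v[i] = c[i] - a[i]; }
    n[0] = u[1] * v[2] - u[2] * v[1];
    n[1] = u[2] * v[0] - u[0] * v[2];
    n[2] = u[0] * v[1] - u[1] * v[0];
    return 0.5 * sqrt(n[0] * n[0] + n[1] * n[1] + n[2] * n[2]);
}

static double edge_length(const double a[3], const double b[3]) {
    double dx = b[0] - a[0], dy = b[1] - a[1], dz = b[2] - a[2];
    return sqrt(dx * dx + dy * dy + dz * dz);
}

/* gamma = sqrt(24) * 3V / (|e_max| * (A1 + A2 + A3 + A4)) */
static double tet_gamma(const double p[4][3]) {
    double u[3], v[3], w[3];
    for (int i = 0; i < 3; i++) {
        u[i] = p[1][i] - p[0][i];
        v[i] = p[2][i] - p[0][i];
        w[i] = p[3][i] - p[0][i];
    }
    /* volume from the scalar triple product, divided by 6 */
    double vol = (u[0] * (v[1] * w[2] - v[2] * w[1])
                - u[1] * (v[0] * w[2] - v[2] * w[0])
                + u[2] * (v[0] * w[1] - v[1] * w[0])) / 6.0;

    double area = triangle_area(p[0], p[1], p[2]) + triangle_area(p[0], p[1], p[3])
                + triangle_area(p[0], p[2], p[3]) + triangle_area(p[1], p[2], p[3]);

    double emax = 0.0;                     /* longest of the 6 edges */
    for (int i = 0; i < 3; i++)
        for (int j = i + 1; j < 4; j++) {
            double l = edge_length(p[i], p[j]);
            if (l > emax) emax = l;
        }

    return sqrt(24.0) * 3.0 * fabs(vol) / (emax * area);
}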

[3 × 3 histogram panels: Gmsh, Tetgen and Ours (rows), each for the Rotor, Piston and Rim meshes (columns); x axis: γ from 0 to 1, y axis: number of tetrahedra.]

Figure 7.3: Histograms of the quality measure γ. The lower bound is marked with a red vertical line. The average is marked with a blue line.

7.1.4 Signed inverse condition number (SICN)

The signed inverse condition number in the Frobenius norm is

\[
\mathrm{SICN} = \frac{3}{\kappa(S)}
\]

for each tetrahedron, where κ(S) is the condition number of the linear transformation matrix S between a tetrahedron of the mesh and a regular tetrahedron. The SICN is directly proportional to the greatest lower bound for the distance of S to the set of singular matrices [49].
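A possible computation of this measure is sketched below, assuming the common definition κ_F(S) = ‖S‖_F ‖S⁻¹‖_F with the sign taken from det S; the choice of reference tetrahedron, the helper names and the code itself are illustrative and are not taken from HXT or Gmsh.

#include <math.h>

static double frobenius2(const double m[3][3]) {
    double s = 0.0;
    for (int i = 0; i < 3; i++)
        for (int j = 0; j < 3; j++)
            s += m[i][j] * m[i][j];
    return s;
}

static double det3(const double m[3][3]) {
    return m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
         - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
         + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]);
}

/* SICN = sign(det S) * 3 / (||S||_F * ||S^-1||_F), where S maps the edges of
 * a regular reference tetrahedron onto the edges of the physical one. */
static double tet_sicn(const double p[4][3]) {
    /* edge vectors of a unit regular reference tetrahedron */
    static const double ref[3][3] = {
        {1.0, 0.0, 0.0},
        {0.5, 0.8660254037844386, 0.0},                 /* (1/2, sqrt(3)/2, 0) */
        {0.5, 0.28867513459481287, 0.816496580927726},  /* (1/2, sqrt(3)/6, sqrt(2/3)) */
    };
    double E[3][3], R[3][3], Rinv[3][3], S[3][3];
    for (int i = 0; i < 3; i++)
        for (int k = 0; k < 3; k++) {
            E[i][k] = p[k + 1][i] - p[0][i];   /* column k: physical edge p_{k+1} - p_0 */
            R[i][k] = ref[k][i];               /* column k: k-th reference edge */
        }
    double detR = det3(R);
    Rinv[0][0] =  (R[1][1]*R[2][2] - R[1][2]*R[2][1]) / detR;
    Rinv[0][1] = -(R[0][1]*R[2][2] - R[0][2]*R[2][1]) / detR;
    Rinv[0][2] =  (R[0][1]*R[1][2] - R[0][2]*R[1][1]) / detR;
    Rinv[1][0] = -(R[1][0]*R[2][2] - R[1][2]*R[2][0]) / detR;
    Rinv[1][1] =  (R[0][0]*R[2][2] - R[0][2]*R[2][0]) / detR;
    Rinv[1][2] = -(R[0][0]*R[1][2] - R[0][2]*R[1][0]) / detR;
    Rinv[2][0] =  (R[1][0]*R[2][1] - R[1][1]*R[2][0]) / detR;
    Rinv[2][1] = -(R[0][0]*R[2][1] - R[0][1]*R[2][0]) / detR;
    Rinv[2][2] =  (R[0][0]*R[1][1] - R[0][1]*R[1][0]) / detR;
    for (int i = 0; i < 3; i++)                /* S = E * R^-1 */
        for (int j = 0; j < 3; j++) {
            S[i][j] = 0.0;
            for (int k = 0; k < 3; k++)
                S[i][j] += E[i][k] * Rinv[k][j];
        }
    double detS = det3(S);
    if (detS == 0.0)
        return 0.0;                            /* degenerate element */
    /* ||S^-1||_F = ||adj(S)||_F / |det S| */
    double adj[3][3] = {
        { S[1][1]*S[2][2]-S[1][2]*S[2][1], S[0][2]*S[2][1]-S[0][1]*S[2][2], S[0][1]*S[1][2]-S[0][2]*S[1][1] },
        { S[1][2]*S[2][0]-S[1][0]*S[2][2], S[0][0]*S[2][2]-S[0][2]*S[2][0], S[0][2]*S[1][0]-S[0][0]*S[1][2] },
        { S[1][0]*S[2][1]-S[1][1]*S[2][0], S[0][1]*S[2][0]-S[0][0]*S[2][1], S[0][0]*S[1][1]-S[0][1]*S[1][0] },
    };
    double kappa = sqrt(frobenius2(S) * frobenius2(adj)) / fabs(detS);
    return (detS > 0.0 ? 1.0 : -1.0) * 3.0 / kappa;
}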

[3 × 3 histogram panels: Gmsh, Tetgen and Ours (rows), each for the Rotor, Piston and Rim meshes (columns); x axis: SICN from 0 to 1, y axis: number of tetrahedra.]

Figure 7.4: Histograms of the SICN quality measure. The lower bound is marked with a red vertical line. The average is marked with a blue line.

Chapter 8

Conclusion

This thesis focused on a very specific strategy, which led to the unique tetrahedral mesh generator that can be found in HXT¹. In this manuscript, the whole parallel tetrahedral mesh generation process was fully described, in a manner that makes it possible for researchers and developers to understand our code, from the ordering of points along a Moore curve to the Growing SPR Cavity operation. The performance of each stage of our mesh generator was also compared to reference open-source implementations. To our knowledge, HXT is the only available open-source Delaunay-based mesh generator that automatically and robustly creates, refines and optimizes 3D meshes in parallel. It is already used by a wide range of users inside Gmsh and has proved its robustness. Indeed, HXT successfully passes all Gmsh 3D benchmarks, and since the release of Gmsh 4.6.0 (the first version in which HXT is properly integrated inside Gmsh), users have not reported any issue concerning the robustness of HXT.

Although the quality and the performance of our mesh generator are very satisfactory, there is room for improvement. The first and easiest improvement will be the implementation of edge contraction, point insertion and 2-2 flips, which would be very useful operations to further improve the quality of our generated meshes. Another particularly challenging task is the parallelization of the boundary recovery step, which is the last non-parallelized step in the whole meshing process. Thanks to the acquired experience, improvement of the boundary recovery procedures of TetGen is now definitely within reach, probably by adding our parallelization strategy based on Moore curves. The novel GSC operation could also be adapted as a last-resort operation to recover constrained facets or edges during the boundary recovery step.

¹https://gitlab.onelab.info/gmsh/gmsh/-/tree/master/contrib/hxt

Concerning our parallelization strategy using the Moore curve, we think it is very well suited to Delaunay tetrahedralization and to the refinement step in general. Compared to other strategies detailed in Section 1.5, which all have their pros and cons, our strategy scales well on any number of cores that a single shared-memory machine can have. To our knowledge, it also outperforms any open-source program on the same number of cores. It could also be combined with one of the divide-and-conquer strategies detailed in the introduction to leverage the full power of distributed-memory systems. However, this strategy is not well adapted to parallelizing multiple Growing SPR Cavity (GSC) operations. Our strategy requires a rollback in the case of a conflict. This rollback is almost free in the case of Delaunay insertions but is computationally costly for GSC operations. Indeed, a single GSC operation can take several hundredths of a second to complete. If a GSC operation needs a point that is in another partition, the rollback causes all the work already done to be lost. Add up thousands of rollbacks and several seconds are lost. Therefore, a lock-based strategy to parallelize the GSC operation would certainly be far more efficient.

Being knowledgeable about all aspects of mesh generation, parallelization included, is impossible for a single mortal man. Being able to produce a complete tetrahedral mesh generator that internally uses only sensible strategies while being almost fully parallel and totally robust is already an achievement of which we are proud. In the future, we hope to be able to maintain HXT, develop the few missing operations and add the clever and efficient innovations that are bound to appear in the field of mesh generation.

Bibliography

[6] Frédéric Alauzet and Adrien Loseille. On the Use of Space Filling Curves for Parallel Anisotropic Mesh Adaptation. In Brett W. Clark, editor, Proceedings of the 18th International Meshing Roundtable, pages 337–357. Springer Berlin Heidelberg, Berlin, Heidelberg, 2009.

[7] J. Alber and R. Niedermeier. On Multidimensional Curves with Hilbert Property. Theory of Computing Systems, 33(4):295–312, August 2000.

[8] S. Aluru and F.E. Sevilgen. Parallel domain decomposition and load bal- ancing using space-filling curves. In Proceedings Fourth International Conference on High-Performance Computing, pages 230–235, Bangalore, India, 1997. IEEE Comput. Soc.

[9] Nancy M. Amato and Franco P. Preparata. An NC parallel 3D convex hull algorithm. In Proceedings of the ninth annual symposium on Compu- tational geometry - SCG ’93, pages 289–297, San Diego, California, United States, 1993. ACM Press.

[10] Nina Amenta, Marshall Bern, and David Eppstein. Optimal Point Place- ment for Mesh Smoothing. Journal of Algorithms, 30(2):302–322, Febru- ary 1999.

[11] Nina Amenta, Sunghee Choi, and Günter Rote. Incremental construc- tions con BRIO. In Steven Fortune, editor, Proceedings of the 19th ACM Symposium on Computational Geometry, San Diego, CA, USA, June 8-10, 2003, pages 211–219. ACM, 2003.

[12] Robert Anderson, Julian Andrej, Andrew Barker, Jamie Bramwell, Jean- Sylvain Camier, Jakub Cerveny, Veselin Dobrev, Yohann Dudouit, Aaron Fisher, Tzanio Kolev, Will Pazner, Mark Stowell, Vladimir Tomov, Johann Dahm, David Medina, and Stefano Zampini. MFEM: a modular finite element methods library. Computers & Mathematics with Applications, page S0898122120302583, July 2020. arXiv: 1911.09220.


[13] Daniel Arndt, Wolfgang Bangerth, Denis Davydov, Timo Heister, Luca Heltai, Martin Kronbichler, Matthias Maier, Jean-Paul Pelteret, Bruno Turcksin, and David Wells. The deal.II finite element library: design, features, and insights. arXiv:1910.13247 [cs], February 2020. arXiv: 1910.13247.

[14] Franz Aurenhammer. Voronoi diagrams—a survey of a fundamental geo- metric data structure. ACM Computing Surveys, 23(3):345–405, sep 1991.

[15] Michael Bader. Space-Filling Curves, volume 9 of Texts in Computational Science and Engineering. Springer Berlin Heidelberg, Berlin, Heidelberg, 2013.

[16] Vicente H.F. Batista, David L. Millman, Sylvain Pion, and Johannes Sin- gler. Parallel geometric algorithms for multi-core computers. Computa- tional Geometry, 43(8):663–677, October 2010.

[17] Nathan Bell and Jared Hoberock. Thrust: A productivity-oriented library for cuda. In GPU computing gems Jade edition, pages 359–371. Elsevier, 2011.

[18] M. Bellet, F. Decultieux, M. Ménaï, F. Bay, C. Levaillant, J. L. Chenot, P. Schmidt, and I. L. Svensson. Thermomechanics of the cooling stage in casting processes: Three-dimensional finite element analysis and exper- imental validation. Metallurgical and Materials Transactions B, 27(1):81– 99, February 1996.

[19] Matthew Berger, Andrea Tagliasacchi, Lee M. Seversky, Pierre Alliez, Gaël Guennebaud, Joshua A. Levine, Andrei Sharf, and Claudio T. Silva. A Survey of Surface Reconstruction from Point Clouds: A Survey of Surface Reconstruction from Point Clouds. Computer Graphics Forum, 36(1):301–329, jan 2017.

[20] Daniel K. Blandford, Guy E. Blelloch, and Clemens Kadow. Engineering a compact parallel delaunay algorithm in 3d. page 292. ACM Press, 2006.

[21] Guy E. Blelloch. Scans as primitive parallel operations. IEEE Trans. Com- puters, 38(11):1526–1538, 1989.

[22] Guy E. Blelloch, G. L. Miller, J. C. Hardwick, and D. Talmor. Design and Implementation of a Practical Parallel Delaunay Algorithm. Algorithmica, 24(3-4):243–269, July 1999.

[23] Jean-Daniel Boissonnat, Olivier Devillers, and Samuel Hornus. Incremental construction of the delaunay triangulation and the delaunay graph in medium dimension. In Proceedings of the Twenty-fifth Annual Symposium on Computational Geometry, SCG '09, pages 208–216, New York, NY, USA, 2009. ACM.

[24] Jean-Daniel Boissonnat, Olivier Devillers, Sylvain Pion, Monique Teillaud, and Mariette Yvinec. Triangulations in CGAL. Computational Geometry, 22(1):5–19, May 2002.

[25] A. Bowyer. Computing Dirichlet tessellations. The Computer Journal, 24(2):162–166, January 1981.

[26] A.R. Butz. Alternative Algorithm for Hilbert's Space-Filling Curve. IEEE Transactions on Computers, C-20(4):424–426, April 1971.

[27] Arthur R. Butz. Convergence with Hilbert's space filling curve. Journal of Computer and System Sciences, 3(2):128–146, May 1969.

[28] P M Campbell, K D Devine, P O Box, J E Flaherty, L G Gervasio, and J D Teresco. Dynamic Octree Load Balancing Using Space-Filling Curves. page 26.

[29] Thanh-Tung Cao, Ashwin Nanjappa, Mingcen Gao, and Tiow-Seng Tan. A GPU accelerated algorithm for 3D Delaunay triangulation. pages 47–54. ACM Press, 2014.

[30] Nathan C. Carter. Visual Group Theory. American Mathematical Soc., 2009.

[31] Marius C. Cautun and Rien van de Weygaert. The DTFE public software - The Delaunay Tessellation Field Estimator code. arXiv:1105.0370 [astro-ph], May 2011.

[32] M. Chen. The Merge Phase of Parallel Divide-and-Conquer Scheme for 3D Delaunay Triangulation. In International Symposium on Parallel and Distributed Processing with Applications, pages 224–230, September 2010.

[33] Siu-Wing Cheng. Delaunay mesh generation. Chapman & Hall/CRC computer and information science series. CRC Press, Boca Raton, 2013.

[34] Nikos Chrisochoides and Damian Nave. Parallel Delaunay mesh generation kernel. International Journal for Numerical Methods in Engineering, 58(2):161–176, September 2003.

[35] P. Cignoni, C. Montani, R. Perego, and R. Scopigno. Parallel 3D Delaunay Triangulation. Computer Graphics Forum, 12(3):129–142, August 1993.

[36] Paolo Cignoni, Claudio Montani, and Roberto Scopigno. DeWall: A fast divide and conquer Delaunay triangulation algorithm in Ed. Computer- Aided Design, 30(5):333–341, 1998.

[37] Franco Dassi, Lennard Kamenski, Patricio Farrell, and Hang Si. Tetrahe- dral mesh improvement using moving mesh smoothing, lazy searching flips, and RBF surface reconstruction. Computer-Aided Design, 103:2–13, oct 2018.

[38] H. L. de Cougny, M. S. Shephard, and C. Özturan. Parallel three- dimensional mesh generation on distributed memory MIMD computers. Engineering with Computers, 12(2):94–106, June 1996.

[39] Leila De Floriani, Bianca Falcidieno, George Nagy, and Caterina Pienovi. On sorting triangles in a delaunay tessellation. Algorithmica, 6(1):522– 532, Jun 1991.

[40] Karen D. Devine, Erik G. Boman, Robert T. Heaphy, Bruce A. Hendrick- son, James D. Teresco, Jamal Faik, Joseph E. Flaherty, and Luis G. Ger- vasio. New challenges in dynamic load balancing. Applied Numerical Mathematics, 52(2-3):133–152, February 2005.

[41] David P Dobkin and Michael J Laszlo. Primitives for the manipulation of three-dimensional subdivisions. page 30.

[42] Cecile Dobrzynski. MMG3d: User Guide. Technical Report RT-0422, INRIA, mar 2012.

[43] Herbert Edelsbrunner. Geometry and Topology for Mesh Generation. Cam- bridge Monographs on Applied and Computational Mathematics. Cam- bridge University Press, 2001.

[44] Herbert Edelsbrunner and Ernst P. Mücke. Simulation of simplicity: a technique to cope with degenerate cases in geometric algorithms. ACM Trans. Graph., 9(1):66–104, 1990.

[45] Paul Fischer, Misun Min, Thilina Rathnayake, Som Dutta, Tzanio Kolev, Veselin Dobrev, Jean-Sylvain Camier, Martin Kronbichler, Tim Warburton, Kasia Swirydowicz, and Jed Brown. Scalability of High-Performance PDE Solvers. arXiv:2004.06722 [cs], April 2020.

[46] Steven Fortune. A sweepline algorithm for Voronoi diagrams. Algorithmica, 2(1):153, November 1987.

[47] Panagiotis Foteinos and Nikos Chrisochoides. Dynamic parallel 3d delaunay triangulation. In William Roshan Quadros, editor, Proceedings of the 20th International Meshing Roundtable, pages 3–20, Berlin, Heidelberg, 2012. Springer Berlin Heidelberg.

[48] Lori Freitag, Mark Jones, and Paul Plassmann. A Parallel Algorithm for Mesh Smoothing. SIAM Journal on Scientific Computing, 20(6):2023–2040, January 1999.

[49] Lori A. Freitag and Patrick M. Knupp. Tetrahedral mesh improvement via optimization of the element condition number. International Journal for Numerical Methods in Engineering, 53(6):1377–1391, February 2002.

[50] Lori A. Freitag and Carl Ollivier-Gooch. Tetrahedral mesh improvement using swapping and smoothing. International Journal for Numerical Methods in Engineering, 40(21):3979–4002, November 1997.

[51] Pascal Jean Frey and Paul-Louis George. Mesh Generation: Application to Finite Elements. ISTE, 2007.

[52] Haohuan Fu, Junfeng Liao, Jinzhe Yang, Lanning Wang, Zhenya Song, Xiaomeng Huang, Chao Yang, Wei Xue, Fangfang Liu, Fangli Qiao, et al. The sunway taihulight supercomputer: system and applications. Science China Information Sciences, 59(7):072001, 2016.

[53] Valentin Fuetterling, Carsten Lojewski, and Franz-Josef Pfreundt. High-Performance Delaunay Triangulation for Many-Core Computers. In Eurographics/ACM SIGGRAPH Symposium on High Performance Graphics, 8 pages. The Eurographics Association, 2014.

[54] Daniel Funke and Peter Sanders. Parallel d-D Delaunay Triangulations in Shared and Distributed Memory. In 2017 Proceedings of the Nineteenth Workshop on Algorithm Engineering and Experiments (ALENEX), pages 207–217. Society for Industrial and Applied Mathematics, January 2017.

[55] Michael Garland and Paul S Heckbert. Fast polygonal approximation of terrains and height fields. Technical report, Report CMU-CS-95-181, Carnegie Mellon University, 1995.

[56] Ian P. Gent and Toby Walsh. Easy problems are sometimes hard. Artif. Intell., 70(1-2):335–345, 1994.

[57] Christophe Geuzaine and Jean-Francois Remacle. Gmsh: a three-dimensional finite element mesh generator with built-in pre- and post-processing facilities. page 24, 2009.

[58] Carla P. Gomes, Bart Selman, and Nuno Crato. Heavy-tailed distributions in combinatorial search. In Principles and Practice of Constraint Program- ming - CP97, Third International Conference, Linz, Austria, October 29- November 1, 1997, Proceedings, pages 121–135, 1997.

[59] Leo J. Guibas and Jorge Stolfi. Primitives for the manipulation of general subdivisions and the computation of Voronoi diagrams. In Proceedings of the fifteenth annual ACM symposium on Theory of computing - STOC’83, pages 221–234, Not Known, 1983. ACM Press.

[60] Chris H. Hamilton and Andrew Rau-Chaplin. Compact Hilbert indices: Space-filling curves for domains with unequal side lengths. Information Processing Letters, 105(5):155–163, February 2008.

[61] Herman Haverkort. How many three-dimensional Hilbert curves are there? Journal of Computational Geometry, 8(1):206–281, September 2017. Number: 1.

[62] Herman Haverkort and Freek van Walderveen. Locality and Bounding- Box Quality of Two-Dimensional Space-Filling Curves. arXiv:0806.4787 [cs], June 2008.

[63] Daniel A. Ibanez, E. Seegyoung Seol, Cameron W. Smith, and Mark S. Shephard. PUMI: Parallel Unstructured Mesh Infrastructure. ACM Trans- actions on Mathematical Software, 42(3):1–28, May 2016.

[64] H.V. Jagadish. Analysis of the Hilbert curve for representing two- dimensional space. Information Processing Letters, 62(1):17–22, April 1997.

[65] Clément Jamin, Sylvain Pion, and Monique Teillaud. 3D triangulations. In CGAL User and Reference Manual. CGAL Editorial Board, 4.12 edition, 2018.

[66] Clément Jamin, Pierre Alliez, Mariette Yvinec, and Jean-Daniel Boisson- nat. CGALmesh: A Generic Framework for Delaunay Mesh Generation. ACM Transactions on Mathematical Software, 41(4):1–24, October 2015.

[67] Bryan Matthew Klingner. Improving Tetrahedral Meshes. PhD thesis, EECS Department, University of California, Berkeley, November 2008.

[68] Bryan Matthew Klingner and Jonathan Richard Shewchuk. Aggressive Tetrahedral Mesh Improvement. In Michael L. Brewer and David Mar- cum, editors, Proceedings of the 16th International Meshing Roundtable, pages 3–23. Springer Berlin Heidelberg, Berlin, Heidelberg, 2008.

[69] Josef Kohout, Ivana Kolingerová, and Jiří Žára. Parallel Delaunay tri- angulation in E2 and E3 for computers with shared memory. Parallel Computing, 31(5):491–522, May 2005.

[70] Cédric Lachat, Cécile Dobrzynski, and François Pellegrini. Parallel mesh adaptation using parallel graph partitioning. page 13, 2014.

[71] J K Lawder. Calculation of Mappings Between One and n-dimensional Values Using the Hilbert Space-filling Curve. page 13.

[72] Charles L. Lawson. Transforming triangulations. Discrete Mathematics, 3(4):365–372, January 1972.

[73] B Levy. Geogram, 2015. http://alice.loria.fr/software/ geogram/doc/html/index.html.

[74] Hui Liu, Kun Wang, Bo Yang, Min Yang, Ruijian He, Lihua Shen, He Zhong, and Zhangxin Chen. Load balancing using hilbert space- filling curves for parallel reservoir simulations. CoRR, abs/1708.01365, 2017.

[75] Jianfei Liu, Y. Q. Chen, and S. L. Sun. Small polyhedron reconnection for mesh improvement and its implementation based on advancing front technique. International Journal for Numerical Methods in Engineering, 79(8):1004–1018, 2009.

[76] S.H. Lo. Parallel Delaunay triangulation in three dimensions. Computer Methods in Applied Mechanics and Engineering, 237-240:88–106, Septem- ber 2012.

[77] A. Loseille, F. Alauzet, and V. Menier. Unique cavity-based opera- tor and hierarchical domain partitioning for fast parallel generation of anisotropic meshes. Computer-Aided Design, 85:53–67, apr 2017.

[78] Bruno Lévy and Nicolas Bonneel. Variational Anisotropic Surface Meshing with Voronoi Parallel Linear Enumeration. In Xiangmin Jiao and Jean-Christophe Weill, editors, Proceedings of the 21st International Meshing Roundtable, pages 349–366. Springer Berlin Heidelberg, Berlin, Heidelberg, 2013.

[79] Rainald Löhner, Jose Camberos, and Marshal Merriam. Parallel unstruc- tured grid generation. Computer Methods in Applied Mechanics and En- gineering, 95(3):343–357, March 1992.

[80] Célestin Marot, Jeanne Pellerin, Jonathan Lambrechts, and Jean-François Remacle. Toward one billion tetrahedra per minute. Procedia Engineering, page 5, 2017.

[81] Célestin Marot, Jeanne Pellerin, and Jean-François Remacle. One ma- chine, one minute, three billion tetrahedra. International Journal for Nu- merical Methods in Engineering, 117(9):967–990, 2019.

[82] Célestin Marot and Jean-François Remacle. Quality tetrahedral mesh generation with HXT, aug 2020.

[83] Célestin Marot, Kilian Verhetsel, and Jean-François Remacle. Reviving the Search for Optimal Tetrahedralizations. In Proceedings of the 28th In- ternational Meshing Roundtable, Buffalo, New York, USA, February 2020. Zenodo.

[84] Oscar Martinez-Rubi, Stefan Verhoeven, Maarten van Meersbergen, and Peter van Oosterom. Taming the beast: Free and open-source massive point cloud web visualization. page 12, 2016.

[85] Duane Merrill and Andrew Grimshaw. High performance and scalable radix sorting: A case study of implementing dynamic parallelism for GPU computing. Parallel Processing Letters, 21(02):245–272, 2011.

[86] Marek Krzysztof Misztal, Jakob Andreas Bærentzen, François Anton, and Kenny Erleben. Tetrahedral Mesh Improvement Using Multi-face Retri- angulation. In Brett W. Clark, editor, Proceedings of the 18th International Meshing Roundtable, pages 539–555. Springer Berlin Heidelberg, Berlin, Heidelberg, 2009.

[87] Ashwin Nanjappa. Delaunay triangulation in R3 on the GPU. page 191.

[88] Nicholas A. Nystrom, Michael J. Levine, Ralph Z. Roskies, and J. Ray Scott. Bridges: A uniquely flexible hpc resource for new communities and data analytics. In Proceedings of the 2015 XSEDE Conference: Scien- tific Advancements Enabled by Enhanced Cyberinfrastructure, XSEDE ’15, pages 30:1–30:8, New York, NY, USA, 2015. ACM.

[89] T Okusanya and J Peraire. 3D parallel unstructured mesh generation, 1996.

[90] Udo Pachner. P.L. Homeomorphic Manifolds are Equivalent by Elemen- tary Shellings. European Journal of Combinatorics, 12(2):129–145, March 1991.

[91] Orestis Polychroniou and Kenneth A. Ross. A comprehensive study of main-memory partitioning and its application to large-scale comparison- and radix-sort. In Curtis E. Dyreson, Feifei Li, and M. Tamer Özsu, ed- itors, International Conference on Management of Data, SIGMOD 2014, Snowbird, UT, USA, June 22-27, 2014, pages 755–766. ACM, 2014.

[92] F. P. Preparata and S. J. Hong. Convex hulls of finite sets of points in two and three dimensions. Communications of the ACM, 20(2):87–93, February 1977.

[93] Franco P. Preparata and Michael Shamos. Computational Geometry: An Introduction. Monographs in Computer Science. Springer-Verlag, New York, 1985.

[94] M. Ramella, W. Boschin, D. Fadda, and M. Nonino. Finding galaxy clus- ters using Voronoi tessellations. Astronomy & Astrophysics, 368(3):776– 786, mar 2001.

[95] Michel Rasquin, Cameron Smith, Kedar Chitale, Seegyoung Seol, Ben- jamin A Matthews, Jeffrey L Martin, Onkar Sahni, Raymond MLoy, Mark S Shephard, and Kenneth E Jansen. Scalable fully implicit finite element flow solver with application to high-fidelity flow control simu- lations on a realistic wing design. Computing in Science and Engineering, 16:7, 2014.

[96] Dr B Ravi. Casting Simulation and Optimisation: Benefits, Bottlenecks, and Best Practices. page 12.

[97] Nicolas Ray, Dmitry Sokolov, Sylvain Lefebvre, and Bruno Lévy. Mesh- less voronoi on the GPU. ACM Transactions on Graphics, 37(6):1–12, Jan- uary 2019.

[98] Jean-François Remacle. A two-level multithreaded delaunay kernel. Computer-Aided Design, 85:2–9, 2017.

[99] Jean-François Remacle and Mark S Shephard. An algorithm oriented mesh database. International Journal for Numerical Methods in Engineering, 58(2):349–374, 2003.

[100] Jim Ruppert and Raimund Seidel. On the difficulty of triangulating three- dimensional Nonconvex Polyhedra. Discrete & Computational Geometry, 7(3):227–253, March 1992.

[101] Chris H Rycroft. Voro++: a three-dimensional Voronoi cell library in C++. page 14.

[102] Lee Sangyoon, Park Chan-Ik, and Park Chan-Mo. An improved paral- lel algorithm for Delaunay triangulation on distributed memory parallel computers. In Proceedings. Advances in Parallel and Distributed Comput- ing, pages 131–138, March 1997. ISSN: null.

[103] Nadathur Satish, Mark J. Harris, and Michael Garland. Designing effi- cient sorting algorithms for manycore gpus. In 23rd IEEE International Symposium on Parallel and Distributed Processing, IPDPS 2009, Rome, Italy, May 23-29, 2009, pages 1–10. IEEE, 2009.

[104] Nadathur Satish, Changkyu Kim, Jatin Chhugani, Anthony D. Nguyen, Victor W. Lee, Daehyun Kim, and Pradeep Dubey. Fast sort on cpus and gpus: a case for bandwidth oblivious SIMD sort. In Ahmed K. El- magarmid and Divyakant Agrawal, editors, Proceedings of the ACM SIG- MOD International Conference on Management of Data, SIGMOD 2010, In- dianapolis, Indiana, USA, June 6-10, 2010, pages 351–362. ACM, 2010.

[105] Teseo Schneider, Yixin Hu, Xifeng Gao, Jeremie Dumas, Denis Zorin, and Daniele Panozzo. A Large Scale Comparison of Tetrahedral and Hexahe- dral Elements for Finite Element Analysis. arXiv:1903.09332 [cs], October 2019. arXiv: 1903.09332.

[106] Shubhabrata Sengupta, Mark J. Harris, Yao Zhang, and John D. Owens. Scan primitives for GPU computing. In Mark Segal and Timo Aila, edi- tors, Proceedings of the ACM SIGGRAPH/EUROGRAPHICS Conference on Graphics Hardware 2007, San Diego, California, USA, August 4-5, 2007, pages 97–106. Eurographics Association, 2007.

[107] Michael Ian Shamos and Dan Hoey. Closest-point problems. In 16th Annual Symposium on Foundations of Computer Science (sfcs 1975), pages 151–162, USA, October 1975. IEEE.

[108] Jonathan Richard Shewchuk. Adaptive precision floating-point arithmetic and fast robust geometric predicates. Discrete & Computational Geometry, 18(3):305–363, 1997.

[109] Jonathan Richard Shewchuk. Delaunay refinement mesh generation. Technical report, Carnegie-Mellon Univ Pittsburgh Pa School of Computer Science, 1997.

[110] Jonathan Richard Shewchuk. Two Discrete Optimization Algorithms for the Topological Improvement of Tetrahedral Meshes. In In Unpublished manuscript, page 11, 2002.

[111] Jonathan Richard Shewchuk. What Is a Good Linear Finite Element? - In- terpolation, Conditioning, Anisotropy, and Quality Measures. Technical report, In Proc. of the 11th International Meshing Roundtable, 2002.

[112] Richard Shewchuk. Star splaying: an algorithm for repairing delaunay triangulations and convex hulls. In Proceedings of the twenty-first annual symposium on Computational geometry - SCG ’05, page 237, Pisa, Italy, 2005. ACM Press.

[113] Hang Si. TetGen, a Delaunay-Based Quality Tetrahedral Mesh Generator. ACM Trans. Math. Softw., 41(2):11:1–11:36, feb 2015.

[114] John Skilling. Programming the Hilbert curve. AIP Conference Pro- ceedings, 707(1):381–387, April 2004. Publisher: American Institute of Physics.

[115] Jeffrey Slotnick, Abdollah Khodadoust, Juan Alonso, David Darmofal, William Gropp, Elizabeth Lurie, and Dimitri Mavriplis. CFD vision 2030 study: a path to revolutionary computational aerosciences. Technical report, 2014. NASA.

[116] Andrew Sohn and Yuetsu Kodama. Load balanced parallel radix sort. In Proceedings of the 12th international conference on Supercomputing, pages 305–312. ACM, 1998.

[117] Petter Solding and Patrik Thollander. Increased Energy Efficiency ina Swedish Iron Foundry Through Use of Discrete Event Simulation. In Proceedings of the 2006 Winter Simulation Conference, pages 1971–1976, December 2006. ISSN: 1558-4305.

[118] Tianyun Su, Wen Wang, Zhihan Lv, Wei Wu, and Xinfang Li. Rapid Delaunay triangulation for randomly distributed point cloud data using adaptive Hilbert curve. Computers & Graphics, 54:65–74, February 2016.

[119] Stefan H Thomke. Simulation, learning and R&D performance: Evidence from automotive development. Research Policy, 27(1):55–74, May 1998.

[120] M. Thompson and J. R. White. The effect of a temperature gradient on residual stresses and distortion in injection mold- ings. Polymer Engineering & Science, 24(4):227–241, 1984. _eprint: https://onlinelibrary.wiley.com/doi/pdf/10.1002/pen.760240402.

[121] A. S. Usmani, J. T. Cross, and R. W. Lewis. A finite ele- ment model for the simulations of mould filling in metal cast- ing and the associated heat transfer. International Journal for Numerical Methods in Engineering, 35(4):787–806, 1992. _eprint: https://onlinelibrary.wiley.com/doi/pdf/10.1002/nme.1620350410.

[122] Mariano Vázquez, Guillaume Houzeaux, Seid Koric, Antoni Artigues, Jazmin Aguado-Sierra, Ruth Arís, Daniel Mira, Hadrien Calmet, Fer- nando Cucchietti, Herbert Owen, Ahmed Taha, Evan Dering Burness, José María Cela, and Mateo Valero. Alya: Multiphysics engineering sim- ulation toward exascale. Journal of Computational Science, 14:15–27, May 2016.

[123] Jan Wassenberg and Peter Sanders. Engineering a multi-core radix sort. In Emmanuel Jeannot, Raymond Namyst, and Jean Roman, editors, Euro- Par 2011 Parallel Processing, pages 160–169, Berlin, Heidelberg, 2011. Springer Berlin Heidelberg.

[124] D. F. Watson. Computing the n-dimensional Delaunay tessellation with application to Voronoi polytopes. The Computer Journal, 24(2):167–172, jan 1981.

[125] J. K. Wilson and B. H. V. Topping. Parallel adaptive tetrahedral mesh generation by the advancing front technique. Computers & Structures, 68(1):57–78, July 1998.

[126] Jo-Yu Wu and R. Lee. The advantages of triangular and tetrahedral edge elements for electromagnetic modeling with the finite-element method. IEEE Transactions on Antennas and Propagation, 45(9):1431–1437, Septem- ber 1997. Conference Name: IEEE Transactions on Antennas and Prop- agation.

[127] Pan Xu, Cuong Nguyen, and Srikanta Tirthapura. Onion Curve: A Space Filling Curve with Near-Optimal Clustering. In 2018 IEEE 34th International Conference on Data Engineering (ICDE), pages 1236–1239, April 2018.

[128] Marco Zagha and Guy E. Blelloch. Radix sort for vector multiprocessors. In Joanne L. Martin, editor, Proceedings Supercomputing ’91, Albuquerque, NM, USA, November 18-22, 1991, pages 712–721. ACM, 1991.

[129] Hai-dong Zhao, Yan-fei Bai, Xiao-xian Ouyang, and Pu-yun Dong. Simu- lation of mold filling and prediction of gas entrapment on practical high pressure die castings. Transactions of Nonferrous Metals Society of China, 20(11):2064–2070, November 2010.


This project has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement HEXTREME, ERC-2015-AdG-694020).