Vis Comput (2011) 27: 507–517 DOI 10.1007/s00371-011-0571-1

ORIGINAL ARTICLE

Parallel and efficient Boolean on polygonal solids

Hanli Zhao · Charlie C.L. Wang · Yong Chen · Xiaogang Jin

Published online: 22 April 2011 © Springer-Verlag 2011

Abstract We present a novel framework which can effi- 1 Introduction ciently evaluate approximate Boolean set operations for B- rep models by highly parallel algorithms. This is achieved Boolean operations, such as union (∪), subtraction (−), by taking axis-aligned surfels of Layered Depth Images and intersection (∩), are useful for combining simple mod- (LDI) as a bridge and performing Boolean operations on els to create complex solid objects on a Constructive Solid the structured points. As compared with prior surfel-based Geometry (CSG) tree, which has a variety of applications approaches, this paper has much improvement. Firstly, we in computer-aided design and manufacturing (CAD/CAM), adopt key-data pairs to store LDI more compactly. Secondly, , and computer graphics (e.g., [4, 17, 18]). robust depth peeling is investigated to overcome the bottle- Polygonal meshes are the most popular boundary repre- neck of layer-complexity. Thirdly, an out-of-core tiling tech- sentation (B-rep) for 3D models. Although many commer- cial CAD systems, such as Rhinoceros [14] and ACIS [13], nique is presented to overcome the limitation of memory. are able to compute Boolean on polygons, they have diffi- Real-time feedback is provided by streaming the proposed culty in modeling very complex models (e.g., the scaffold pipeline on the many-core graphics hardware. of Bone as shown in Fig. 1). Polygons provide the simplest way to approximate freeform solids and the computation of Keywords Boolean operations · Layered depth images · Boolean is not necessary to be exact in many areas, such Depth peeling · Out-of-core · CUDA as biomedical engineering, jewelry industry, and game in- dustry. This gives us an opportunity to overcome the robust- ness problem [11] and to speed up the computation by using an approximate method. Recent researches have shown that H. Zhao College of Physics & Electronic Information Engineering, densely oriented point sets are competent to fulfill such a Wenzhou University, Wenzhou 325035, China task. Several algorithms [1, 23, 32] have been investigated e-mail: [email protected] to perform Boolean on oriented points but they do not op- erate on B-rep models directly. Some methods [3, 31] can C.C.L. Wang () Department of Mechanical and Automation Engineering, process B-rep models using well-structured points in Lay- The Chinese University of Hong Kong, Hong Kong, China ered Depth Images (LDI). However, both the efficiency and e-mail: [email protected] the memory-usage still have much room for improvement. In this paper, we aim to develop a framework that fully Y. Chen Epstein Department of Industrial and Systems Engineering, exploits the advantages of LDI [27] and the power of University of Southern California, Los Angeles, CA, USA Graphics Processing Unit (GPU) for evaluating approxi- e-mail: [email protected] mate Boolean operations on polygonal meshes. Our algo- rithm is efficient, parallel, and scalable compared with prior X. Jin approaches. These good properties are achieved by intro- State Key Lab of CAD & CG, Zhejiang University, Hangzhou 310058, China ducing the following work. First of all, we adopt a compact e-mail: [email protected] representation for LDI which contains axis-aligned surfels 508 H. Zhao et al.

Fig. 1 A high-quality osseous scaffold is modeled with our system: surfel-represented Boolean result, where surfels with different colors (left) input models (Bone, offset of Bone, and cellular-scaffold), (mid- are from different input solids, and (right) the cross-section view of dle) the layout of solids where 3 × 3 × 3 of cellular-scaffold models are the output osseous tissue. The approximation resolution is 512 × 512 tiled together, resulting in a CSG tree with a depth of 28, (mid-bottom) in this example sampled from the boundary of input solids. By utilizing the The solution to these problems is the major contribution of atomic operation in graphics global memory, we develop a our approach. programmable rasterizer to capture all layers of LDI in a sin- The rest of our paper is organized as follows. After briefly gle pass. Moreover, the problem of limited sampling resolu- reviewing some related previous work in the next section, tion is avoided by splitting the whole working envelope into Sect. 3 introduces an overview of our new Boolean evalua- multiple smaller tiles and processing them independently us- tion framework. Section 4 presents the parallelization tech- ing the out-of-core strategy. Note that the computation in the niques on the GPU in detail. Experimental results and re- whole pipeline is highly parallel, thus allowing for a GPU- lated discussions are given in Sect. 5. Finally, Sect. 6 con- based implementation. To the best of our knowledge, this cludes the advantage of our approach and suggests some fu- is the first algorithm that is capable of computing Boolean ture work. combinations of freeform polygonal shapes in real-time (see the examples shown in Fig. 11). Even for a very complex CSG tree such as the one shown in Fig. 1, we are able to ob- 2 Related work tain the result of the high-quality osseous tissue in 1.7 s on an NVIDIA Geforce GTX 480 GPU while consuming only Boolean operations of solids have been investigated in com- 600 MB of the graphics memory. puter graphics for many years. For a comprehensive survey, In summary, this paper addresses the following important we refer readers to [26]. Here, we only review the Boolean questions in geometric modeling: algorithms based on surfels [24] and oriented points. – How to support real-time approximate geometric feed- Many approaches [6–9, 25] employ the image-based back for a CSG tree with polygonal models? depth-layering or depth-peeling technique to achieve the in- – How to support highly accurate approximation that may teractive rendering of CSG models. They gradually peel can- be beyond the memory capacity of graphics hardware? didate points from the boundaries of CSG primitives and Parallel and efficient Boolean on polygonal solids 509 only display the nearest boundary points of the Boolean re- Algorithm 1 Streaming Boolean operations sult. The points can be fast sampled using the graphics hard- 1: Input Boolean tree accompanied with polygonal meshes ware [5] and encoded as pixels in LDI. However, these ap- 2: for each volumetric tile in bounding volume do proaches only solve the fast rendering problem but do not 3: for each axis in {x, y, z} do provide the ability to obtain the geometric shape of resultant 4: for each node n by depth-first tree traversal do solids. 5: if n is a leaf then Surfel is a representation that can approximate the shape 6: Sample axis-aligned surfels by rasterizing n of a solid model on some oriented sample points. Boolean 7: Shell sort per-pixel surfels operations on freeform solids represented by surfels have 8: else been developed in [1, 23]. Since most CAD/CAM applica- 9: Merge sort candidate surfels from children tions, such as CAE analysis, CNC machining and rapid pro- 10: Boolean on the sorted candidate surfels totyping, need the B-rep of solid models, additional surface 11: end if reconstruction from the surfel-represented solid is required. 12: end for Zhou et al. [32] introduced a parallel surface reconstruc- 13: Obtain boundary surfels from the root node tion algorithm by extending the marching cubes algorithm 14: Accumulate the QEF matrices for boundary cells [20] with a GPU-based octree. They also presented an inter- 15: Mark interior/exterior cells along the axis active Boolean tool for oriented point set. However, their 16: end for adaptive marching cubes algorithm tends to smooth the 17: Output vertices by solving the QEF matrices sharp features in the underlying geometry. Moreover, their 18: Output indices by connecting boundary cells in-core reconstruction can only handle octrees with a lim- 19: end for ited number of depths. 20: return merge all tiles of polygons into a closed mani- Chen and Wang [3] presented an approximate Boolean fold approach that can preserve sharp features in the resultant solids. They made use of a variant of LDI, named as Lay- ered Depth-Normal Images (LDNI), for the efficient volu- each leaf node, we convert the B-rep model into a solid metric test [29], and employed the quadratic error function bounded by surfels using axis-aligned ray-casting. For each (QEF) in dual contouring [15] for surface generation. How- inner node, we first collect all candidate surfels from its ever, only the sampling of LDNI is performed using GPU- children nodes and sort the merged samples according to based depth peeling whereas other algorithms are performed their depth values. Next, valid boundary surfels are extracted on the CPU. Although multi-threaded schemes are designed, by performing Boolean on the candidate surfels in an effi- the parallelism is limited. Wang et al. [31] further transferred cient way. After traversing the CSG tree, the resultant surfel- the LDNI-based algorithm to the GPU to improve the ef- represented solid is obtained at the root node. ficiency. Unfortunately, as the workload on each thread is Efficient surface reconstruction is conducted to output B- not well balanced, the primary algorithm still cannot pro- rep models as the result of Boolean. We develop a modified vide real-time feedback. In addition, their algorithm can- version of dual contouring in this paper. The optimal posi- not process models which contain more than 256 LDI lay- tion of each boundary cell is determined by the QEF matrix, ers because the stencil buffer used in their sampling method which can be accumulated by all the boundary surfels in the only has 8 bits in the current graphics hardware architecture. cell. Moreover, redundancy in graphics memory consumption se- Note that the whole pipeline design is highly parallel, riously limits its applications. All these problems are solved which enables a real-time response by GPU-based imple- by our approach below. mentation. We adopt key-data pairs to store the points more compactly as compared with prior LDI-based algorithms. As a result, in addition to memory reduction, the parallelism 3 Framework overview is also improved as no threads are launched for void opera- An expression of Boolean operations is usually described tions. Algorithm 1 provides the pseudo-code of our approx- as a binary tree (i.e., CSG tree), on which leaves repre- imate Boolean framework. sent primitive shapes and inner nodes are Boolean operators. First of all, if the out-of-core computation is enabled, the bounding volume is divided into multiple volumetric tiles 4 Parallelization on the GPU with the same size. We then perform Boolean on each tile along three axes separately. AsshowninAlgorithm1, except for a few control settings, The final Boolean result is obtained by recursively every processing step can be streamed on the many-core traversing the CSG tree using the depth-first scheme. For graphics hardware, including conversion of polygons into 510 H. Zhao et al.

Fig. 2 A comparison between LDI and CLDI sampled from (left) (right) the corresponding CLDI with only 34.7k, 40.1k, and 30.9k el- Bunny with the resolution of 150 × 150 pixels: (middle) three axis- ements, respectively. We can see that only 18% pixels of LDI need to aligned LDI texture arrays with 10, 8, and 8 layers, respectively, and be stored in CLDI for the representation of the same model surfel-represented solids, Boolean on surfels, B-rep recon- discard empty sample points, while the sort primitive effi- struction, and volumetric tiling. We chose NVIDIA’s CUDA ciently clusters elements with a sorting identifier. In addi- [12] computing environment in our GPU implementation. tion to memory reduction, we also explain how CLDI could CUDA exposes new hardware features which are very use- improve the performance below. ful for data-parallel computation. One important feature is that it allows arbitrary gathering and scattering of graphics 4.2 Robust sampling memory in GPU threads. In this section, we introduce our parallel and efficient Boolean algorithm in detail. Rasterizer on the graphics hardware is designed according to rasterization rules about how to map geometric data into 4.1 Compact LDI data structure raster data. The raster data are snapped to integer locations and per-pixel attributes are linearly interpolated from per- In the prior approaches, LDI is usually stored as an array vertex attributes before being passed to a pixel shader. Any of two-dimensional textures. Its size grows linearly with the pixel center which falls inside a triangle is drawn. A pixel layer-complexity of a model. However, as shown in Fig. 2 is also assumed to be inside if it lies on the top edge or the (left), the actual number of sampled points in each texture left edge of the triangle. A top edge is an edge that is exactly decreases significantly with the increase of the depth layer. Such a primary way to store LDI consumes too much graph- horizontal and above the other two edges of the triangle, and ics memory and limits the application of using LDI to rep- a left edge is an edge that is not exactly horizontal and on the resent complex models (e.g., the scaffold models in Fig. 1). left side of the triangle. In this paper, we develop an efficient data structure to store An edge is called silhouette if two triangles that share this such sparsely distributed data. edge are facing opposite directions with respect to the view- Inspired by the solver developed in [28] for large sparse point. There is a special case that a silhouette is the top/left linear systems which only store non-zero values with corre- edge of its two adjacent triangles and there are some pixels sponding indices, we adopt a compact representation here, which just lie on this silhouette. According to rasterization called CLDI, for the sparse data points in LDI. On each pixel rules as well as the depth less comparison function, these in a model represented by LDI, there are multiple surfels pixels are rasterized only once in traditional depth-peeling with different depth values. We store their depth values as techniques, as illustrated in Fig. 3. However, such a case will well as normals sequentially in a long data array. The access break the boundary of solids as described in [3, 29]. To avoid to each surfel in the data array is achieved by using a key this problem, Wang et al. [31] used the stencil buffer instead array where each key stores the number of surfels per pixel of the depth buffer. They peel each layer of surfels by gradu- as well as the index of its first surfel in the long data array. ally increasing the stencil value. Unfortunately, such a sam- As illustrated in Fig. 2 (right), all empty data points (about pling technique introduces a new issue. The stencil buffer 82% pixels in this example) are not stored, thus a significant only has 8 bits in the current graphics hardware architecture memory reduction is achieved. and 8-bit data can only represent maximum 256 values. If In order to improve the performance of CLDI, our algo- a complex B-rep model contains more than 256 LDI lay- rithm uses GPU-based parallel primitives [10] heavily. The ers, missampling will occur because of the stencil buffer. scan primitive is employed to count the required data size Moreover, modern hardware rasterizers can only capture a before allocating a GPU array and to obtain indices of sam- rather limited number of fragment data in one render pass ple points to the array. The compact primitive is used to [2, 22], thus for complex models, multiple rendering passes Parallel and efficient Boolean on polygonal solids 511

nodes are collected as candidate surfels. Before applying Boolean operations, surfels on a ray must be sorted by their depth values. CUDA does not support recursive functions, so we choose the Shell sort to obtain surfels sorted on each pixel from the leaf nodes. For inner nodes, since the surfels in each of their children nodes have been sorted during prior depth-first traversal, one-pass merge sort, which has linear time complexity, can be used to generate the candidate list of sorted surfels. Let {pi|i = 1, 2,...,m} be the sorted surfels on a pixel, Fig. 3 The colored pixels are inside the two adjacent triangles accord- di be the depth of pi , and fi be the entry/exit flag of the solid ing to rasterization rules. The yellow pixels are covered once while the interior represented by children surfels. For each surfel- light blue and dark blue ones are covered twice. The light blue pixels represented intermediate solid, every per-pixel surfel with just lie on the silhouette edge and should be rasterized twice in order to guarantee the closure of solid represented by surfels even index represents entering the solid interior and we set fi = 1, while the one with odd index represents leaving and we set fi =−1. For subtraction, the entry/exit flags of are needed. As a result, sampling B-rep models into cor- surfels from the right child are interchanged to obtain the rect surfel-represented solids becomes the main bottleneck complement. By going through {p |i = 1, 2,...,m} sequen- of Boolean techniques presented in [3, 31]. i tially, we identify a pair of points p and p that satisfy In this paper, we devise a robust and fast sampling i j |d − d |≤ and f · f < 0, where is slightly greater method which is inspired by the main idea of FreePipe in i j i j than 0. If a pair of such points are identified, both will be [19]. First of all, rasterization resolution R , axis-aligned s removed from CLDI to avoid the ambiguity introduced by bounding box (AABB) and appropriate orthogonal model- overlapped surfaces. Then we go through each p sequen- view-projection matrix are set as input parameters. Back- i tially again and accumulate each f iteratively. As a result, face culling is disabled and frustum clipping is enabled. We i the entry of the solid interior increases f by 1 while the then rasterize the B-rep model according to the scan-line i leave decreases f by 1. Now, the interior/exterior configu- scheme and rasterization rules. For each scanned pixel, we i ration on samples can be easily determined by checking f obtain an interpolated depth. In order to avoid read-modify- i sequentially: f = 1 denotes the entry of resultant solid for write (RMW) hazards in graphics global memory, the atom- i union and subtraction whereas f = 2 denotes the entry for icInc operation [12] of CUDA is employed. We allocate a i counter per pixel in global memory and initialize it to 0. intersection. After applying Boolean on CLDI, invalid sur- With the atomicInc operation, the kth incoming fragment of fels are effectively discarded whereas boundary surfels are a certain pixel increases the counter to k + 1 and return the retained. An illustration has been given in Fig. 4. source value k. Finally, the depth and normal are stored into Compared with the sparsely distributed textures for stor- the kth entry of CLDI without RMW hazards. Note that k is ing LDI or LDNI in prior methods [3, 31], although threads a 32-bit unsigned integer and can be used to sample a model for empty pixels stop immediately, the resource prepara- tion and recycling, and thread synchronization waste a lot with up to 232 layers. of time. The parallelism is therefore reduced. On the con- The proposed sampling algorithm performs two passes trary, our method shows better parallelism since we launch of the programmable rasterization. In the first pass, we only threads only for non-zero pixels when using the compact increase the counter array to obtain the per-pixel numbers of data structure, CLDI. surfels. Then the scan primitive is utilized to count the size of CLDI and to obtain per-pixel indices to CLDI. In the sec- ond pass, all surfel data are truly filled in CLDI. As can be 4.4 B-rep reconstruction seen in Fig. 2, all layers of surfels are compactly captured in the second pass. Moreover, two surfels are correctly sam- After the whole CSG tree is traversed using the depth-first pled on silhouette edges, which ensures the correctness of a strategy, the resultant surfel-bounded solid can be obtained solid represented by the sampled surfels. at the root node. An efficient surface reconstruction algo- rithm is then required to output a B-rep model. Kobbelt et 4.3 Boolean on surfels al. [16] proposed a feature-preserved Extended Marching Cubes method and Ju et al. [15] improved their method by To evaluate the result of each inner node on the CSG tree, using dual contouring to avoid the explicit test for sharp fea- Boolean operations are performed on the models repre- tures. In this paper, we modify their dual contouring to fit for sented by the children nodes. The output surfels of children our compact representation and implement it on the GPU. 512 H. Zhao et al.

Fig. 6 2D illustration of (left) Hermite data and (right) the dual con- touring

After the QEF matrices of all boundary cells are estab- lished, we can compute optimal positions following Ju et al. [15]. We can see the vertices are moved to shape fea- tures in Fig. 6. The topology of boundary polygons is recon- Fig. 4 By processing each pixel independently, the elimination of in- structed simply by processing each axis independently. For terior surfels on Sphere and Torus could be highly parallel. See how each boundary surfel p, we connect the four adjacent bound- the entry/exit flags are changed accordingly ary cells to form a quadrangle. The face of the quadrangle must conform to the orientation of p. Fig. 5 A boundary surfel p intersected with the axis-aligned 4.5 Out-of-core Boolean ray e corresponds to four adjacent boundary cells C00, C01, C10,andC11 The above Boolean evaluation algorithms are computed in- core. However, for very large sampling resolutions, the size of CLDI can grow quickly beyond the memory capacity pro- vided by the graphics hardware even if we employ such a compact representation. An effective out-of-core scheme is needed to meet the requirement of high approximation ac- curacies. In order to fit the graphics memory constraints, we split Since orthogonal projection is used in the sampling step, the AABB into smaller volumetric tiles and apply Algo- =  each pixel width is δ AABB /Rs . The AABB is di- rithm 1 to them independently. The critical part then is to vided into multiple equal-sized cubic cells with an edge guarantee that these independently generated meshes can be length δ. As shown in Fig. 5, each cell is positioned to in- merged into a proper two-manifold surface. In this paper, we tersect the axis-aligned sampling rays; thus, the resolution define the tiles to be overlapped with their adjacent tiles by = + is Rg Rs 1. Each boundary surfel corresponds to four one pixel width δ. By taking advantage of frustum culling, adjacent boundary cells and serves as a Hermite sample in we only sample surfels within current volumetric tiles. Dur- these cells. The construction of QEF matrix from these Her- ing dual contouring, for each cell, point scattering and quad- mite samples is actually an accumulation procedure. At the rangle output should be avoided to be performed more than beginning, the elements of QEF matrices are all assigned once (see Fig. 7). Specifically, the surfels in the overlapped as zero. We then scatter each boundary surfel to its four parts are not scattered to adjacent tiles. In addition, for each neighboring cells and append the related QEF parameters quadrangle, if any of its four adjacent cells lies in the right, into an array. We associate each element of the array with a front or top overlapped area, it is discarded. This asymmet- cell identifier and an index. After scattering, the sort primi- ric contouring technique guarantees that point scattering and tive [10] is used to cluster elements with the same identifier. quadrangle output are performed exactly once. Next, we sum up per-cell QEF parameters and eliminate du- The proposed out-of-core technique enables virtually plicated cell elements by using the compact primitive [10]. unlimited rasterization resolutions since we can split the The resultant QEF matrix is obtained after conducting accu- bounding volume into tiles and process them sequentially mulation along all the three axes of sampling. to save memory or in parallel to reduce computation time. Parallel and efficient Boolean on polygonal solids 513

5 Results are counted in x, y and z axes, respectively. Moreover, we record the peaks of graphics memory consumed in our algo- We have tested our novel Boolean system on a PC with a rithm. The output Boolean models are shown in Fig. 9 and 2.80 GHz Intel Core i5 760 CPU (4 GB main memory) and Fig. 11, where the intersection parts are zoomed in for better an NVIDIA GeForce GTX 480 GPU (480 streaming pro- illustration. cessors and 1.5 GB graphics memory). Our algorithm uses With the data compaction and out-of-core techniques, the floating point arithmetic and approximate computation, and memory peaks vary from 7 MB to 609 MB. All the test data involves no explicit intersection calculation. For each cubic are small enough to be fit in the graphics memory entirely. cell with edge√ length δ, the maximum error is the diago- In the prior algorithm of Wang et al. in [31], each LDNI nal length 3δ (Ref. [30]). Figure 8 shows the results of pixel requires 8 bytes of graphics memory, so 476 MB are validation tests using basic Cube and Cylinder primitives. consumed to store all pixels of LDNI of the finer hollowed- Our approach is able to produce Boolean solids with sharp Bunny and their dual contouring without compaction re- features preserved and robustly deal with overlapped cases. quires even more memory. Their in-core algorithm actually The subtraction of Cylinder results in a cracked Cube be- cannot model high-resolution solids. We also test to split the cause of the degeneration of thin shells and the approxima- CGI2011 model by hundreds of Hexagons and Wang et al. tion error√ which is still within our controllable maximum system crashes for this test because the Hexagon model has error 3δ. We can get a better accuracy with a finer grid res- more than 256 LDI layers. To our surprise, even Rhinoceros olution Rg as more polygons will be generated. We use low- [14] and ACIS [13] fail to produce a correct solid for the resolution intermediate models for real-time preview while decent-topology models due to numerical errors, whereas high-resolution models for final output. our approach is able to output a satisfying result (see Fig. 9). We further test our algorithm using complex models and Table 3 shows the performance comparisons among our Boolean expressions listed in Table 1. The corresponding system, Wang et al. system [31], Rhinoceros and ACIS. statistics are given in Table 2. The maximum resolution of each tile is set as 5122, while the numbers of LDI layers

Fig. 8 Validation tests: (top left) input Cube and Cylinder primitives, (top middle) three component models by − and ∩; and for the case of Fig. 7 The overlapped parts between adjacent tiles are processed Cylinder tangentially contacting Cube, results of (top right) ∪,(bottom asymmetrically to guarantee a correct topology left) Cylinder–Cube and (bottom right) Cube–Cylinder

Table 1 Test Boolean expressions with input models

Test Boolean expression Model Triangles Model Triangles

1 CGI2011 ∩ Hexigon CGI2011 5 K Hexigon 33 K 2 Bunny–(offset-Bunny–cubic-Truss)–Sphere Bunny 60 K offset-Bunny 381 K cubic-Truss 701 K Sphere 30 K 3 (Dragon ∪ Dragon ∪ Dragon)–Buddha Dragon 49 K Buddha 378 K 4 Dinosaur–Fandisk–Helix Dinosaur 112 K Fandisk 3 K Helix 74 K 5 Hub–Flower Hub 162 K Flower 15 K 514 H. Zhao et al.

Table 2 Statistics of our tests. The sampling resolutions, peeled LDI layers, graphics memory peaks, and numbers of output triangles are reported

Test Resolution LDI layers for each model Memory Triangles

1 (512 × 2)2 (26, 6, 2) + (828, 2, 2) 223 MB 2 380 K (512 × 4)2 (26, 6, 2) + (828, 2, 2) 609 MB 9 866 K 2 1282 (10, 8, 10) + (10, 8, 8) + (54, 54, 56) + (2, 2, 2) 41 MB 227 K 5122 (12, 10, 10) + (14, 10, 12) + (54, 54, 56) + (2, 2, 2) 489 MB 3 729 K 3 1282 (12, 14, 8) + (12, 14, 8) + (12, 14, 8) + (12, 16, 10) 18 MB 87 K 5122 (14, 16, 10) + (14, 16, 10) + (14, 16, 10) + (16, 22, 10) 274 MB 1 441 K 4 1282 (6, 12, 12) + (4, 6, 4) + (6,8,8) 7MB 25K (512 × 2)2 (8, 12, 12) + (6, 6, 4) + (8, 8, 8) 194 MB 1 676 K 5 1282 (16, 18, 4) + (8, 6, 6) 17 MB 163 K 5122 (16, 18, 4) + (8, 6, 6) 351 MB 2 503 K

Table 3 Performance comparisons. In addition to its high efficiency, our system can deal with various complex Boolean trees

Test Ours Wang et al.’s method Rhinoceros ACIS Coarser mesh Finer mesh Coarser mesh Finer mesh

1 3.68 s 12.59 s FAILED FAILED FAILED FAILED 2 121 ms 944 ms 717 ms 5.32 s FAILED FAILED 3 84 ms 585 ms 328 ms 2.59 s 19 s FAILED 4 51 ms 2.46 s 183 ms FAILED 10 s 633 s 5 53 ms 643 ms 202 ms 2.09 s FAILED FAILED

Fig. 9 An example of CGI2011 ∩ Hexagon: (top left) the input CGI2011 model and hundreds of Hexagons, (top right) the result of Rhinoceros, (mid-right) the result of ACIS, and (bottom) our result

Multi-pass depth peeling and sparse data structure seriously time interaction. On the contrary, more than three times of decrease the parallelism of Wang et al. GPU algorithm, speedup is gained with our new approach. The commercial their performance still cannot meet the requirement of real- softwares Rhinoceros and ACIS can only process simple Parallel and efficient Boolean on polygonal solids 515

Fig. 10 Statistics of ratio of running time in each step

Fig. 11 Example Boolean results with zoomed in (from left to right): Bunny–(offset-Bunny–cubic-Truss)–Sphere, (Dragon ∪ Dragon ∪ Dragon)–Buddha, Dinosaur–Fandisk–Helix, and Hub–Flower models and both CPU-based algorithms are naturally much very high resolutions, we are able to compute the final mod- slower than the GPU-based ones. With our novel system, els within a few seconds. With the increase of capacity of real-time visual feedback is achieved with the resolution of graphics memory, larger tiles can be used to further improve 1282, and thus complex models can be designed on the fly the performance. Since our framework is highly parallel not by interactive editing of CSG trees, as demonstrated in our only inside each tile but also among tiles, we can resort to accompanying screen-captured video. Even for models with a multi-GPU system or GPU cluster [21] for better approx- 516 H. Zhao et al. imation and even higher efficiency. Note that prior systems 5. Everitt, C.: Interactive order-independent transparency. In: Tech- are prone to fail for complex CSG trees or high resolutions nical report. NVIDIA Corporation (2001) in approximation, whereas our system is much more robust. 6. Goldfeather, J., Hultquist, J.P.M., Fuchs, H.: Fast constructive- solid geometry display in the pixel-powers graphics system. In: We give the ratio of running time in each step in Fig. 10. Proceedings of SIGGRAPH ’86, pp. 107–116. ACM, New York Boolean on surfels is rather fast and few data are transferred (1986) between the host and the device. The efficiency of sampling 7. Goldfeather, J., Molnar, S., Turk, G., Fuchs, H.: Near real-time csg B-rep models to sorted surfels is determined by the layer- rendering using tree normalization and geometric pruning. IEEE Comput. Graph. Appl. 9(3), 20–28 (1989) complexity of input solids. Polygon reconstruction is highly 8. Hable, J., Rossignac, J.: Blister: Gpu-based rendering of boolean dependent on the sampling resolution and consumes a great combinations of free-form triangulated shapes. In: Proceedings of portion of time. SIGGRAPH ’05, pp. 1024–1031. ACM, New York (2005) 9. Hable, J., Rossignac, J.: Cst: Constructive solid trimming for ren- dering breps and csg. IEEE Trans. Vis. Comput. Graph. 13(5), 1004–1014 (2007) 6 Conclusions 10. Harris, M., Sengupta, S., Owens, J.D.: Parallel prefix sum (scan) with cuda. In: Nguyen, H. (ed.) GPU Gems 3. Addison–Wesley, In this paper, we have presented an efficient solid model- Reading (2007) ing framework which computes approximate Boolean oper- 11. Hoffmann, C.: Robustness in geometric computations. J. Comput. Inf. Sci. Eng. 1(2), 143–155 (2001) ations on polygonal solids using highly parallel algorithms. 12. Cuda programming guide for cuda toolkit 3.0. In http:// The main idea is to convert B-rep models into models rep- developer.nvidia.com/object/gpucomputing.html. NVIDIA Cor- resented by well-structured surfels, discard invalid interior poration (2010) surfels according to the CSG trees, and contour them back 13. 3d acis modeling. In http://www.spatial.com. Spatial Corporation (2010) into B-rep models as output. Boolean on the axis-aligned 14. Rhinoceros. In http://www.rhino3d.com. Robert McNeel & Asso- surfels is rather simple and fast. By streaming boolean op- ciates (2010) erations, an out-of-core volumetric tiling technique is intro- 15. Ju, T., Losasso, F., Schaefer, S., Warren, J.: Dual contouring of duced to generate highly accurate results with high sampling hermite data. ACM Trans. Graph. 21(3), 339–346 (2002) 16. Kobbelt, L.P., Botsch, M., Schwanecke, U., Seidel, H.-P.: Feature resolution. sensitive surface extraction from volume data. In: Proceedings of The number of reconstructed polygons in current imple- SIGGRAPH ’01, pp. 57–66. ACM, New York (2001) mentation depends on the sampling resolution. In the future, 17. Krishnan, S., Gopi, M., Manocha, D., Mine, M.R.: Interactive we will explore an adaptive technique to further improve the boundary computation of boolean combinations of sculptured solids. Comput. Graph. Forum 16(3), 67–78 (1997) performance in the B-rep reconstruction step. As our CLDI 18. Krishnan, S., Manocha, D., Gopi, M., Culver, T., Keyser, J.: structure is general, we also plan to incorporate CLDI into Boole: A boundary evaluation system for boolean combinations various modeling applications. of sculptured solids. Int. J. Comput. Geom. Appl. 11(1), 105–144 (2001) Acknowledgements The work was partially supported by HKSAR 19. Liu, F., Huang, M.-C., Liu, X.-H., Wu, E.-H.: Freepipe: a pro- Research Grants Council (RGC) General Research Fund (GRF) on grammable parallel rendering architecture for efficient multi- project CUHK/417508. The research of H. Zhao was partially sup- fragment effects. In: Proceedings of the ACM Symposium on In- ported by the Zhejiang Provincial Natural Science Foundation of China teractive 3D Graphics and Games, pp. 75–82. ACM, New York (Grant No. Y1110004), the Scientific Research Fund of Zhejiang (2010) Provincial Education Department, China (Grant No. Y201010070), 20. Lorensen, W.E., Cline, H.E.: Marching cubes: a high resolution the Science and Technology Plan Program of Wenzhou, China (Grant 3d surface construction algorithm. In: Proceedings of SIGGRAPH No. H20100088) and the Open Project Program of State Key Lab of ’87, pp. 163–169. ACM, New York (1987) CAD&CG, Zhejiang University, China (Grant No. A1117). The re- 21. Müller, C., Frey, S., Strengert, M., Dachsbacher, C., Ertl, T.: search of X. Jin was supported by the Zhejiang Provincial Natural A compute unified system architecture for graphics clusters in- Science Foundation of China (Grant No. Z1110154) and the National corporating data locality. IEEE Trans. Vis. Comput. Graph. 15(4), Natural Science Foundation of China (Grant No. 60833007). 605–617 (2009) 22. Myers, K., Bavoil, L.: Stencil routed a-buffer. In: ACM SIG- GRAPH ’07 Technical Sketch Program. ACM, New York (2007) 23. Pauly, M., Keiser, R., Kobbelt, L.P., Gross, M.: Shape modeling References with point-sampled geometry. In: Proceedings of SIGGRAPH ’03, pp. 641–650. ACM, New York (2003) 1. Adams, B., Dutré, P.: Interactive boolean operations on surfel- 24. Pfister, H., Zwicker, M., van Baar, J., Gross, M.: Surfels: Surface bounded solids. ACM Trans. Graph. 22(3), 651–656 (2003) elements as rendering primitives. In: SIGGRAPH ’00: ACM SIG- 2. Bavoil, L., Myers, K.: Order independent transparency with dual GRAPH 2000 Papers, pp. 335–342. ACM, New York (2000) depth peeling. In: Technical report. NVIDIA Corporation (2008) 25. Rappoport, A., Spitz, S.: Interactive boolean operations for con- 3. Chen, Y., Wang, C.L.C.: Layered depth-normal images for com- ceptual design of 3d solids. In: SIGGRAPH ’97: ACM SIG- plex geometries—part one: Accurate modeling and adaptive sam- GRAPH 1997 Papers, pp. 269–278. ACM, New York (1997) pling. In: Proceedings of the ASME International Design Engi- 26. Rossignac, J., Requicha, A.: Solid modeling. In: Encyclopedia of neering Technical Conferences and Computers and Information in Electrical and Electronics Engineering (1999) Engineering Conference, ASME, Brooklyn (2008) 27. Shade, J., Gortler, S., He, L.-w., Szeliski, R.: Layered depth im- 4. de Berg, M., van Kreveld, M., Overmars, M., Schwarzkopf, O.: ages. In: Proceedings of SIGGRAPH ’98, pp. 231–242. ACM, Computational Geometry. Springer, Berlin (1997) New York (1998) Parallel and efficient Boolean on polygonal solids 517

28. Toledo, S., Chen, D., Rotkin, V.: Taucs, a library of sparse lin- of Technical Committee on Computer-Aided Product and Process De- ear solvers (version 2.2). In http://www.tau.ac.il/~stoledo/taucs/ velopment (CAPPD) of ASME. Dr. Wang has received a few awards (2003) including the ASME CIE Young Engineer Award (2009), the CUHK 29. Trapp, M., Döllner, J.: Real-time volumetric tests using layered Young Researcher Award (2009), the CUHK Vice-Chancellor’s Exem- depth images. In: Proceedings of EUROGRAPHICS ’08, pp. 235– plary Teaching Award (2008), and the Best Paper Awards of ASME 238 (2008) CIE Conferences (in 2008 and 2001). His current research interests 30. Wang, C.C.L.: Approximate boolean operations on large polyhe- include geometric modeling in computer-aided design and manufac- dral solids with partial mesh reconstruction. IEEE Transactions on turing, biomedical engineering and computer graphics, as well as com- and Computer Graphics, Published Online (2010) putational physics in virtual reality. 31. Wang, C.C.L., Leung, Y.-S., Chen, Y.: Solid modeling of polyhe- dral objects by layered depth-normal images on the gpu. Comput. Yong Chen is an Assistant Profes- Aided Des. 42(6), 535–544 (2010) sor in the Epstein Department of In- 32. Zhou, K., Gong, M., Huang, X., Guo, B.: Data-parallel octrees for dustrial and Systems Engineering at surface reconstruction. IEEE Transactions on Visualization and the University of Southern Califor- Computer Graphics 16(6) (2010) nia (USC). He received his Ph.D. degree in Mechanical Engineering majoring in Computer-aided Engi- Hanli Zhao is a faculty member neering from the Georgia Institute of Wenzhou University, Wenzhou, of Technology in 2001. Prior to China. He received his B.Sc. de- joining the USC in July 2006, he gree in software engineering from was a senior Research and Devel- Sichuan University in 2004 and his opment (R&D) Engineer at 3D Sys- Ph.D. degree in computer science tems, Inc., in Valencia, California. from Zhejiang University in 2009. His current research interests are in His research interests include non- the areas of computer-aided design photorealistic rendering and general and manufacturing, especially in topics related to direct digital manu- purpose GPU computing. facturing.

Xiaogang Jin is a Professor of the State Key Lab of CAD&CG, Zhejiang University, People’s Re- public of China. He received his Charlie C.L. Wang is currently an B.Sc. degree in Computer Science Associate Professor (with Tenure) in 1989, M.Sc. and Ph.D. degrees in at the Department of Mechani- Applied Mathematics in 1992 and cal and Automation Engineering, 1995, all from Zhejiang University. the Chinese University of Hong His current research interests in- Kong, where he began his aca- clude implicit surface computing, demic career in 2003. He gained special effects simulation, mesh fu- his B.Eng. (1998) in Mechatronics sion, texture synthesis, crowd ani- Engineering from Huazhong Uni- mation, cloth , and non- versity of Science and Technology, photorealistic rendering. M.Phil. (2000) and Ph.D. (2002) in Mechanical Engineering from the Hong Kong University of Science and Technology. He is a member of IEEE and ASME, and Vice-Chair