Eurographics Symposium on Parallel Graphics and Visualization (2012) H. Childs and T. Kuhlen (Editors)

Parallel Rendering on Hybrid Multi-GPU Clusters

S. Eilemann†1, A. Bilgili1, M. Abdellah1, J. Hernando2, M. Makhinya3, R. Pajarola3, F. Schürmann1

1Blue Brain Project, EPFL; 2CeSViMa, UPM; 3Visualization and MultiMedia Lab, University of Zürich

Abstract
Achieving efficient scalable parallel rendering for interactive visualization applications on medium-sized graphics clusters remains a challenging problem. Framerates of up to 60hz require a carefully designed and fine-tuned parallel rendering implementation that fits all required operations into the 16ms time budget available for each rendered frame. Furthermore, modern commodity hardware increasingly embraces a NUMA architecture, where multiple processor sockets each have their locally attached memory and where auxiliary devices such as GPUs and network interfaces are directly attached to one of the processors. Such so-called fat NUMA processing and graphics nodes are increasingly used to build cost-effective hybrid shared/distributed memory visualization clusters. In this paper we present a thorough analysis of the asynchronous parallelization of the rendering stages and we derive and implement important optimizations to achieve highly interactive framerates on such hybrid multi-GPU clusters. We use both a benchmark program and a real-world scientific application used to visualize, navigate and interact with simulations of cortical neuron circuit models.

Categories and Subject Descriptors (according to ACM CCS): I.3.2 [Computer Graphics]: Graphics Systems— Distributed Graphics; I.3.m [Computer Graphics]: Miscellaneous—Parallel Rendering

1. Introduction

The decomposition of parallel rendering systems across multiple resources can be generalized and classified, according to Molnar et al. [MCEF94], based on the sorting stage of the image synthesis pipeline: Sort-first (2D) decomposition divides a single frame in the image domain and assigns the resulting image tiles to different processes; Sort-last database (DB) decomposition performs a data domain decomposition across the participating rendering processes. A sort-middle approach cannot typically be implemented efficiently with current hardware architectures, as one needs to intercept the transformed and projected geometry in scan-space after primitive assembly. In addition to this classification, frame-based approaches distribute entire frames, i.e. by time-multiplexing (DPlex) or stereo decomposition (Eye), to different rendering resources. Variations of sort-first rendering are pixel or subpixel decomposition as well as tile-based decomposition frequently used for interactive raytracing.

Hybrid GPU clusters are often built from so-called fat render nodes using a NUMA architecture with multiple CPUs and GPUs per machine. This configuration poses unique challenges for achieving optimal distributed parallel rendering performance, since the topology of both the individual node and the full cluster has to be taken into account to optimally exploit locality in parallel rendering. In particular, the configuration of using multiple rendering source nodes feeding one or a few display destination nodes for scalable rendering offers room for improvement in the management and synchronization of the rendering stages.

In this paper we study in detail different performance limiting factors for parallel rendering on hybrid GPU clusters, such as thread placement, asynchronous compositing, network transport and configuration variation, in both a straightforward benchmark tool as well as a real-world scientific visualization application.

The contributions presented in this paper not only consist of an experimental analysis on a medium-sized visualization cluster but also introduce and evaluate important optimizations to improve the scalability of interactive applications for large parallel data visualization. Medium-sized visualization clusters are currently among the most practically relevant configurations as these are cost efficient, widely available and reasonably easy to manage. Most notably, we observed that asynchronous readbacks and regions of interest can contribute substantially towards high-performance interactive visualization.

† stefan.eilemann@epfl.ch, ahmet.bilgili@epfl.ch, marwan.abdellah@epfl.ch, jhernando@fi.upm.es, makhinya@ifi.uzh.ch, [email protected], felix.schuermann@epfl.ch

© The Eurographics Association 2012.

2. Related Work

A number of general purpose parallel rendering concepts and optimizations have been introduced before, such as parallel rendering architectures, parallel compositing, load balancing, data distribution, or scalability. However, only a few generic APIs and parallel rendering systems exist, including VR Juggler [BJH∗01] (and its derivatives), Chromium [HHN∗02], OpenSG [VBRR02], OpenGL Multipipe SDK [BRE05], Equalizer [EMP09] and CGLX [DK11], of which Equalizer is used as the basis for the work presented in this paper.

For sort-last rendering, a number of parallel compositing algorithm improvements have been proposed in [MPHK94, LRN96, SML∗03, EP07, PGR∗09, MEP10]. In this work, however, we focus on improving the rendering performance by optimizations and data reduction techniques applicable to the sort-first and the parallel direct-send sort-last compositing methods. For sort-first parallel rendering, the total image data exchange load is fairly simple and approaches O(1) for larger N. Sort-last direct-send compositing exchanges exactly two full images concurrently in total between the N rendering nodes. Message contention in massive parallel rendering [YWM08] is not considered, as it is not a major bottleneck in medium-sized clusters.

To reduce the transmission cost of pixel data, image compression [AP98, YYC01, TIH03, SKN04] and screen-space bounding rectangles [MPHK94, LRN96, YYC01] have been proposed. Benchmarks in [SMV11] show that exploiting the topology of hybrid GPU clusters with a NUMA architecture increases the performance of CUDA applications.

3. Scalable Rendering on Hybrid Architectures

3.1. Parallel Rendering Framework

We chose Equalizer as our parallel rendering framework due to its scalability and configuration flexibility. The architecture and design decisions of Equalizer are described in [EMP09], and the results described in this paper build upon that foundation.

However, the basic principles of parallel rendering are similar for most approaches, and the analysis, experiments and improvements presented in this paper are generally applicable and thus also useful for other parallel rendering systems.

In a cluster-parallel distributed rendering environment the general execution flow is as follows (omitting event handling and other application tasks): clear, draw, readback, transmit and composite. Clear and draw optimizations are largely ignored in this work. Consequently, in the following we focus on the readback, transmission, compositing and load-balancing stages.

3.2. Hybrid GPU Clusters

Hybrid GPU clusters use a set of fat rendering nodes connected with each other using one or more interconnects. In contrast with traditional single-socket, single-GPU cluster nodes, each node in a hybrid cluster has an internal topology, using a non-uniform memory access (NUMA) architecture and a PCI Express bus layout with multiple memory regions, GPUs and network adapters. Figure 1 shows a typical layout, depicting the relative bandwidth through the thickness of the connections between components.

Figure 1: Example hybrid GPU cluster topology.

Future hardware architectures have been announced which will make this topology more complex. The interconnect topology of most visualization clusters typically uses a fully non-blocking, switched backplane.

3.3. Asynchronous Compositing

Compositing in a distributed parallel rendering system is decomposed into readback of the produced pixel data (1), optional compression of this pixel data (2), transmission to the destination node consisting of send (3) and receive (4), optional decompression (5), and composition consisting of upload (6) and assembly (7) in the destination.

In a naive implementation, operations 1 to 3 are executed serially on one processor, and 4 to 7 on another. In our parallel rendering system, operations 2 to 5 are executed asynchronously to the rendering and to operations 1, 6 and 7. Furthermore, we use a latency of one frame, which means that two rendering frames are always in execution, allowing the pipelining of these operations, as shown in Figures 2(a) and 2(c). Furthermore, we have implemented asynchronous readback using OpenGL pixel buffer objects, further increasing the parallelism by pipelining the rendering and pixel transfers, as shown in Figure 2(b). We did not implement asynchronous uploads as shown in Figure 2(d), since it has a minimal impact when using a many-to-one configuration but complicates the actual implementation significantly.

In the asynchronous case, the rendering thread performs only application-specific rendering operations, since the overhead of starting an asynchronous readback becomes negligible.
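As a rough illustration (ours, not from the paper, with hypothetical per-stage costs), the effect of this pipelining on the steady-state frame time can be sketched as follows: with a latency of one frame, the transfer operations of frame n overlap the rendering of frame n+1, so throughput is bounded by the slowest pipeline stage rather than by the sum of all stages.

```python
# Illustrative model (not the paper's implementation) of the steady-state
# frame time of the compositing pipeline from Section 3.3. All times are
# hypothetical, in milliseconds.

def serial(draw, readback, transfer):
    # Naive implementation: every operation completes before the next frame.
    return draw + readback + transfer

def pipelined(draw, readback, transfer):
    # Latency of one frame: compression/transmission (ops 2-5) of frame n
    # overlap the rendering of frame n+1, as in Figure 2(a).
    return max(draw + readback, transfer)

def pipelined_async_readback(draw, readback, transfer):
    # Asynchronous readback, as in Figure 2(b): the render thread only
    # starts the readback; finishing it is pipelined as well.
    return max(draw, readback + transfer)

if __name__ == "__main__":
    draw, readback, transfer = 10.0, 4.0, 6.0
    print(serial(draw, readback, transfer))                    # 20.0
    print(pipelined(draw, readback, transfer))                 # 14.0
    print(pipelined_async_readback(draw, readback, transfer))  # 10.0
```

In this toy example the asynchronous variant is entirely draw-bound, matching the observation above that starting an asynchronous readback becomes negligible for the rendering thread.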


Figure 2: Synchronous (a) and asynchronous (b) readback as well as synchronous (c) and asynchronous (d) upload for two overlapping frames.
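As context for the amount of pixel data these stages move, the per-node image volume of the direct-send sort-last compositing referenced in Section 2 can be sketched as follows. The formula is our illustration under the usual direct-send scheme (each node composites 1/n of the screen), not a result from the paper.

```python
# Illustrative sketch: per-node pixel traffic of direct-send sort-last
# compositing with n rendering nodes. Each node is responsible for 1/n of
# the screen: it sends the other n-1 slices of its own frame to the peers
# responsible for them, and receives n-1 slices for its own region.

def direct_send_traffic(n_nodes, image_bytes):
    slice_bytes = image_bytes / n_nodes
    sent = (n_nodes - 1) * slice_bytes      # scatter own frame, minus own slice
    received = (n_nodes - 1) * slice_bytes  # gather peers' pixels for own slice
    return sent + received                  # approaches 2 * image_bytes

if __name__ == "__main__":
    full_hd = 1920 * 1080 * 3  # roughly the 6MB RGB frame buffer of Section 5
    for n in (2, 4, 16):
        print(n, direct_send_traffic(n, full_hd) / full_hd)
```

The per-node total approaches two full images as n grows, which is why the compositing traffic stays bounded independently of the node count.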

Equalizer uses a plugin system to implement GPU-CPU transfer modules which are runtime loadable. We extended this plugin API to allow the creation of asynchronous transfer plugins, and implemented such a plugin using OpenGL pixel buffer objects (PBO). At runtime, one rendering thread and one download thread are used for each GPU, as well as one transmit thread per process. The download threads are created lazily, when needed.
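The split between a cheap "start" on the render thread and an expensive "finish" on a download thread can be sketched as follows. This is our minimal illustration of the pattern, not the plugin's code: the mocked `finish_transfer` callable stands in for mapping the pixel buffer object and copying the pixels out.

```python
# Sketch of an asynchronous readback worker: the render thread only enqueues
# a started transfer and returns immediately; the expensive completion runs
# on a dedicated download thread (in the real plugin, a PBO map/copy).
import queue
import threading

class AsyncReadback:
    def __init__(self, finish_transfer):
        self._finish = finish_transfer        # expensive GPU-CPU completion
        self._pending = queue.Queue()
        self._done = queue.Queue()
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()                  # one download thread per GPU

    def start(self, frame):
        # Render thread: starting the transfer is cheap, returns immediately.
        self._pending.put(frame)

    def _run(self):
        # Download thread: finish transfers in submission order.
        while True:
            frame = self._pending.get()
            if frame is None:                 # shutdown sentinel
                break
            self._done.put(self._finish(frame))

    def results(self):
        out = []
        while not self._done.empty():
            out.append(self._done.get())
        return out

    def stop(self):
        self._pending.put(None)
        self._worker.join()
```

Usage: construct one `AsyncReadback` per GPU with the actual finish function, call `start` from the render thread each frame, and hand completed images to the transmit thread via `results`.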

3.4. Automatic Thread and Memory Placement

In a parallel rendering system, a number of threads are used to drive a single process in the cluster. In our software we use one main thread (main), one rendering thread for each GPU (draw) and one thread to finish the download operation during asynchronous downloads (read), one thread for receiving network data (recv), one auxiliary command processing thread (cmd) and one thread for image transmission to other nodes (xmit).

Thread placement is critical to achieve optimal performance. Due to internals of the driver implementations and the hardware topology, running data transfers from a 'remote' processor in the system has severe performance penalties. We have implemented automatic thread placement by extending and using the hwloc [MPI12] library in Equalizer, using the following rules:

• Restrict all node threads (main, receive, command, image transmission) to the cores of the processor local to the network card.
• Restrict each GPU rendering and download thread to the cores of the processor close to the respective GPU.

Figure 3 shows the thread placement for the topology example used in Figure 1. The threads are placed on all cores of the respective processor, and the ratio of cores to threads varies with the used hardware and software configuration. Furthermore, many of the threads do not occupy a full core at runtime, especially the node threads, which are mostly idle.

The memory placement follows the thread placement when using the default 'first-touch' allocation scheme, since all GPU-specific memory allocations are done by the render threads.

Figure 3: Mapping threads within a fat cluster node.

3.5. Region of Interest

The region of interest (ROI) is the screen-space 2D bounding box enclosing the geometry rendered by a single resource. We have extended the core parallel rendering framework to transparently use an application-provided ROI to optimize load balancing and image compositing.

Equalizer uses an automatic load-balancing algorithm similar to [ACCC04] and [SZF∗99] for 2D decompositions. The algorithm is based on a load grid constructed from the tiles and timing statistics of the last finished frame. The per-tile load is computed from the draw and readback time, or the compress and transmit time, whichever is higher. The load-balancer integrates over this load grid to predict the optimal tile layout for the next frame. In Equalizer, the screen is decomposed into a kd-tree, whereas in [ACCC04] a dual-level tree is used, first decomposed into horizontal tiles and then each horizontal tile into vertical sub-tiles.

We use the ROI of each rendering source node to automatically refine the load grid, see also Figure 4. In cases where the rendered data projects only to a limited screen area, this ROI refinement provides the load-balancer with a more accurate load indicator leading to a better load prediction.

The load-balancer has to operate on the assumption that the load is uniform within one load grid tile. This leads naturally to estimation errors, since in reality the load is not uniformly distributed, e.g., it tends to increase towards the center of the screen in Figure 4. Refining the load grid tile

size using ROI decreases this inherent error and makes the load prediction more accurate.

Figure 4: Load distribution areas with (top) and without (bottom) using the ROI of the right-hand 2D parallel rendering.

Furthermore, the ROI can also be used to minimize pixel transport for compositing [MEP10]. This is particularly important for spatially compact DB decompositions, since the per-resource ROI becomes more compact with the number of used resources.

3.6. Multi-threaded and Multi-process Rendering

Equalizer [EMP09] supports a flexible configuration system, allowing an application to be used with one process per GPU (referred to as multi-process), or with one process per node and one rendering thread for each GPU (referred to as multi-threaded). This flexibility and the comparison of the two configurations allows us to observe the impact of memory bandwidth bottlenecks in a multi-threaded application versus increased memory usage in a multi-process application, as explained in Section 5.5. Furthermore, a multi-process configuration is representative of typical MPI-based parallel rendering systems, which are relatively common.

4. Visualization of Cortical Circuit Simulations

4.1. Neuroscience Background

RTNeuron [HSMd08] is a scientific application used in the Blue Brain Project [Mar06] to visualize, navigate and interact with cortical neuron circuit models and corresponding simulation results, see also Figure 5. These detailed circuit models accurately model the structure and function of a portion of a young rat's cortex, and the activity is being mapped back to the structure [LHS∗12]. The cells are simulated using cable models [Ral62], which divide neuron branches into electrical compartments. The cell membrane of these compartments is modeled using a set of Hodgkin and Huxley equations [HHS∗11]. Detailed and statistically varying shapes of the neurons are used to lay out a cortical circuit and establish synapses, the connections between neurons. Eventually, the simulation is run on a supercomputer and different variables, such as membrane voltages and currents, spike times etc., are reported.

Figure 5: RTNeuron example rendering of the 1,000 innermost cells of a cortical circuit.

The outcome of these simulations is analyzed in several ways, including direct visualization of the simulator output. Interactive rendering of cortical circuits presents several technical challenges, the geometric complexity being the most relevant for this study. A typical morphological shape representing a neuron consists, on average, of more than 4,200 branch segments, which translates to a mesh of 140,000 polygons. A typical circuit consists of 10,000 neurons chosen from hundreds of unique morphologies, which leads to hundreds of millions of geometric elements to be rendered in each frame.

4.2. RTNeuron Implementation

There are two geometric features of a cortical circuit that make its visualization difficult. Neurons have a large spatial extent but exhibit a very low space occupancy, and they are spatially interleaved in a complex manner, making it practically impossible to visually track a single cell. This also causes aliasing problems and makes visibility and occlusion culling hard.

To address the raw geometric complexity, level-of-detail (LOD) techniques can be used [LRC∗03]. These techniques are applied in a single workstation scenario; however, for the purpose of this paper we have focused on the higher quality and scalability constraints of fully detailed meshes.

For scalability, we exploit parallel rendering using 2D (sort-first) and DB (sort-last) task decomposition, as well as combinations of both. Time-multiplexing can increase the throughput but does not improve the interaction latency, and pixel decomposition does not improve scalability for this type of rendering. Subpixel decomposition for anti-aliasing does increase visual quality but not rendering performance.

For 2D decomposition, the application is required to implement efficient view frustum culling to allow scalability with smaller 2D viewports. For this, our application uses the culling techniques presented in [Her11]. The view frustum culling algorithm used is based on a skeleton of capsules


(spherically-capped cylinders) extracted from the neuronal skeletons. The frustum-capsule intersection test that determines the visibility is performed on the GPU. Then, for each neuron the visibility of the capsules is used to compute which intervals of its polygon list need to be rendered. This culling operation is relatively expensive for large circuits, and its result is cached if the projection-modelview matrix is unchanged from the previous frame. This is typically the case during simulation data playback, as the user observes the time series from a fixed viewpoint. The simulation data, typically voltage information, is applied using a color map on the geometry.

To address both memory constraints and rendering scalability for large data, DB decomposition is the most promising approach. We have implemented and compared two different decomposition schemes, a round-robin division of neurons between rendering resources and a spatially ordered data division using a kd-tree.

The round-robin decomposition is easy to implement and leads to a fairly even geometry workload for large neuron circuits. However, it fails to guarantee a total depth order between the source frames to be assembled. This requires the use of the depth buffer for compositing, and precludes the use of transparency due to the entanglement of the neurons.

The spatial partition DB decomposition approach is based on a balanced kd-tree constructed from the end points of the segments that describe the shape of the neurons (the morphological skeletons). For a given static data set and fixed number of rendering source nodes, a balanced kd-tree is constructed over the segment endpoints such that the number of leaves matches the number of rendering nodes. The split position in each node is chosen such that the points are distributed proportionally to the rendering nodes assigned to the resulting regions.

The kd-tree leaf nodes define the spatial data region that each rendering node has to load and process. Each leaf occupies a rectangular region in space, and clip planes restrict fragments to be generated for the exact spatial region. Therefore, this DB decomposition can be composited using only the RGBA frame buffer, which solves any issues with alpha blending since a coherent compositing order can be established for any viewpoint.

Both DB decomposition modes are difficult to load balance at run-time. In particular, dynamic spatial repartitioning and on-demand loading have been omitted because of their implementation challenges. Furthermore, the readback, transmission and compositing costs are higher for DB than for 2D decomposition.

A possible way to improve load balancing in a static DB decomposition is to combine it with a nested 2D decomposition. This way, a group of nodes which is assigned the same static data partition can use a dynamic 2D decomposition to balance the rendering work locally.

5. Experimental Analysis

Any parallel rendering system is fundamentally limited by two factors: the rendering itself, which includes transformation, shading and illumination; as well as compositing multiple partial rendering results into a final display, i.e. image. While an exceeding task load of the former is the major cause for parallelization in the first place, the latter is often a limiting bottleneck due to constrained image data throughput or synchronization overhead.

In our experiments we investigate the image transmission and compositing throughput of sort-first (2D) and sort-last (DB) parallel rendering. For DB we study direct-send (DS) compositing, which has been shown to be highly scalable for medium-sized clusters [EP07]. For benchmarking we used two applications, eqPly and RTNeuron. The first is a textbook parallel triangle mesh rendering application allowing us to observe and establish a baseline for the various experiments, while the second is a real-world application used by scientists to visualize large-scale HPC simulation results.

All tests were carried out on an 11-node cluster with the following technical node specifications: dual six-core 3.47GHz processors (Intel Xeon X5690), 24GB of RAM, three NVidia GeForce GTX580 GPUs and a Full-HD passive stereo display connected to two GPUs on the head node; 1Gbit/s and 10Gbit/s Ethernet and 40Gbit/s QDR InfiniBand. The GPUs are each attached using a dedicated 16x PCIe 2.0 link, the InfiniBand adapter uses a dedicated 8x PCIe 2.0 link and the 10Gbit/s Ethernet a 4x PCIe 2.0 link.

Using netperf, we evaluated a realistic achievable data transmission rate of n = 1100MB/s for the 10GBit interconnect. Using pipeperf, we evaluated real-world GPU-CPU transfer rates of p = 1000MB/s for synchronous and asynchronous transfers. Note that in the case of asynchronous transfers almost all of the time is spent in the finish operation, which is pipelined with the rendering thread. Thus, unless the rendering time is smaller than the combined readback and transmission time, the readback has no cost for the overall performance. Thus, for a frame buffer size sFB of 1920 × 1080 (6MB) we expect up to 160 fps during rendering, since p/sFB ≈ 1000/6 ≈ 167 readbacks per second.

All 2D tests were conducted using a load balancer. All DB tests used an even static distribution of the data across the resources. For the eqPly tests, we rendered four instances of the David 1mm model consisting of 56 million triangles with a camera path of 400 frames, as shown in Figure 6. For the RTNeuron tests we used a full neocortical column with 10,000 neurons (around 1.4 billion triangles).

5.1. Thread Placement

We tested the influence of thread placement by explicitly placing the threads either on the correct or incorrect processor. We found that this leads to a performance improvement of more than 6% in real-world rendering loads, as shown in Figure 7. Thread placement mostly affects the GPU-CPU


download performance, and therefore has a small influence on the overall performance, with a growing importance at high framerates as the draw time decreases.

Figure 6: Screenshots along the eqPly camera path.

Figure 7: Influence of thread affinity on the rendering performance.

5.2. RTNeuron Scalability

We tested the basic scalability of RTNeuron using a static and a load-balanced 2D compound as well as one direct-send DB compound with the round-robin and spatial DB decomposition, as shown in Figure 8.

In the static 2D and both DB modes RTNeuron can cache the results of the cull operation, as is the case in the typical use case of this application. Nevertheless, the load-balanced 2D decomposition provides better performance than the static decomposition. Due to the culling overhead, this mode does not scale as well as it does in the simpler eqPly benchmark.

For the DB rendering modes, the spatial decomposition delivers better performance due to the reduced compositing cost of not using the depth buffer and the increasingly sparser region of interest. The spatial mode delivers about one frame per second better performance than the round-robin DB decomposition throughout all configurations, resulting in a speedup of 12-60% at the observed framerates.

Figure 8: RTNeuron scalability for 2D (a) and DB (b) decomposition.

5.3. Region of Interest

Applying ROI for 2D compounds has a small, constant influence when a small number of resources is used, as shown in Figure 9(a). With our camera path and model, the improvement is about 5% due to the optimized compositing. As the number of resources increases, the ROI becomes more important since load imbalances have a higher impact on the overall performance. With ROI enabled we observed performance improvements when using all 33 GPUs, reaching 60hz. Without ROI, the framerate peaked at less than 50hz when using 27 GPUs.

For depth-based DB compositing the ROI optimization during readback is especially important, as shown in Figure 9(b). The readback and upload paths for the depth buffer are not optimized and incur a substantial cost when compositing the depth buffer at full HD resolution. Furthermore, the messaging overhead on TCP causes significant contention when using 21 GPUs or more.

For the round-robin DB mode in RTNeuron, adding a region of interest has predictably very little influence, since each resource has only a marginally reduced region with this decomposition, as shown in Figure 10(a). On the other hand, the spatial DB mode can benefit by more than 25% due to the


2D, Thread Affinity, 4xDavid 2D, Region of Interest , 4xDavid Round-Robin DB, Region of Interest, Full Cortical Column 50 7% 60 25% 6 4.0% ROI off incorrect ROI off ROI on 3.3% correct 50 21% ROI on 40 6% speedup 5 speedup speedup 2.7% 40 17% 4 2.0% 30 4% 1.3% 30 13% 3 0.7% 20 3% 20 8% 2 0% Frames per Second Frames per Second Frames per Second 10 1% -0.7% speedup due to optimization

Speedup due to optimization 10 4% Speedup due to optimization 1 -1.3%

0 0% 0 0% 0 -2.0% 3 9 15 21 27 33 3 9 15 21 27 33 3 9 15 21 27 33 Number of GPUs Number of GPUs Number of GPUs (a) (a) DB, Region of Interest , 4xDavid 20 5% 15% DB Direct Send, Region of Interest, 4xDavid Spatial DB, Region of Interest, Full Cortical Column 7 30% 20 500% ROI off 4% 12% 16 ROI off ROI on ROI on 6 25% 16 400% speedup speedup 9% 12 3% 5 20%

12 300% 4 15% 6% 8 2% ROI on 3 10%

Frames per Second 8 200% ROI off 3% 4 1%

Frames per Second 2 5% Frames per Second speedup 4 linear 100% speedup due to optimization speedup due to optimization 1 0% 0 3 9 15 21 27 33 0 0% 0 -5% 3 9 15Number of21 GPUs 27 33 3 9 15 21 27 33 Number of GPUs Number of GPUs

linear synchronous(b) asynchronous correct incorrect DB DB ROI DB AFF DB bad(b) AFF DB async improvement speedup improvement improvement 3 10.1655 10.1655 10.9995 10.2424 10.1572 6.6491 6.64151 4.60805 4.96776 5.35856 4.19 8% -0.11 -7.24 9 30.4965 24.9284 27.815 25.6116 25.0967 14.783 14.9979 5.64881 6.0736 9.78578 10.26 12% 1.45 -6.99 Figure15 9: Influence50.8275 of ROI37.3315 on eqPly40.6761 2D (a) and38.1282 DB (B) ren-36.7865 Figure19.8859 10:20.1192Influence6.16123 of ROI6.52079 on RTNeuron7.56609 round-robin18.24 (a)9% 1.17 -5.51 21 71.1585 45.4641 49.2602 45.3566 44.2131 and12.6358 spatial11.2892 (b) DB7.05779 rendering7.5014 performance7.99894 12.93 8% -10.66 -5.91 dering27 performance91.4895 48.9431 54.5398 48.9431 47.7336 9.3947 9.66107 7.65548 7.7944 12.67 11% 2.84 33 111.8205 47.5766 60.6196 47.5766 44.5481 8.06721 7.19165 6.08177 6.50634 33.99 27% -10.85

increasingly sparse image regions as resources are added, as shown in Figure 10(b).

5.4. Asynchronous Readback

Asynchronous readbacks are, together with ROI, one of the most influential optimizations. When the system is mostly rendering bound, pipelining the readback with the rendering yields a performance gain of about 10%. At higher framerates, when the rendering time of a single resource decreases, asynchronous readback has an even higher impact of over 25% in our setup, as shown in Figure 11.

Figure 11: Difference between synchronous and asynchronous readback
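The benefit of overlapping the readback with the rendering can be seen in a simple two-stage pipeline model. The sketch below is an illustration of the scheduling argument with made-up timings, not measured data; the function names are ours:

```python
def makespan_serial(n_frames, t_render, t_readback):
    # Synchronous readback: each frame must finish its pixel transfer
    # before the next frame starts rendering.
    return n_frames * (t_render + t_readback)

def makespan_async(n_frames, t_render, t_readback):
    # Asynchronous readback: the transfer of frame i overlaps the
    # rendering of frame i+1, so the longer stage dominates the
    # steady state of the two-stage pipeline.
    return t_render + (n_frames - 1) * max(t_render, t_readback) + t_readback

# Rendering-bound example: 14 ms rendering, 2 ms readback per frame.
serial = makespan_serial(100, 14.0, 2.0)    # 1600.0 ms
overlap = makespan_async(100, 14.0, 2.0)    # 1402.0 ms
print(serial / overlap)                     # about 1.14
```

In the rendering-bound case the readback cost is almost entirely hidden behind the rendering; as the render time per resource shrinks at higher framerates, the hidden fraction grows, consistent with the larger measured gains.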

5.5. Multi-threaded versus Multi-process

Using three processes instead of three rendering threads on a single machine yields a slight performance improvement of up to 6% for eqPly, as shown in Figure 12. This is due to decreased memory contention and driver overhead compared to multi-threaded rendering, but comes at the cost of using three times as much memory for 2D compounds.

5.6. Finish after Draw

During benchmarking we discovered that a glFinish after the draw operation dramatically increases the rendering performance in DB decompositions, contrary to common sense. Consequently, we extended our implementation to insert a glFinish after the application's draw callback. The finish might allow the driver to be more explicit about the synchronization, since it enforces the completion of all outstanding rendering commands. We could not investigate this issue further due to the closed nature of the OpenGL drivers used. Figure 13 shows a speedup of up to 200%.
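The multi-process configuration of Section 5.5 can be sketched as follows. This is a hypothetical skeleton, not the actual eqPly code: the workers here do placeholder arithmetic where a real worker would create its own OpenGL context on its assigned GPU, and each process holds a private copy of the scene, which is exactly the threefold memory cost noted above:

```python
import multiprocessing as mp

def render_worker(gpu_id, n_items, out_queue):
    # Stand-in for per-GPU rendering work; a real worker would open a
    # GL context on `gpu_id` and render its tile or range here.
    checksum = sum(i * gpu_id for i in range(n_items))
    out_queue.put((gpu_id, checksum))

def run_processes(n_gpus=3, n_items=1000):
    # One process per GPU: separate address spaces avoid lock and
    # memory contention inside a single driver instance, at the cost
    # of replicating application data in every process.
    queue = mp.Queue()
    procs = [mp.Process(target=render_worker, args=(g, n_items, queue))
             for g in range(n_gpus)]
    for p in procs:
        p.start()
    results = sorted(queue.get() for _ in procs)
    for p in procs:
        p.join()
    return results
```

Swapping `mp.Process` for `threading.Thread` would give the multi-threaded configuration with shared scene memory, which is the trade-off measured in Figure 12.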

© The Eurographics Association 2012.


Figure 12: Difference between multi-threaded and multi-process rendering

Figure 13: Influence of glFinish on DB decomposition performance

6. Conclusions and Future Work

This paper presents, analyses and evaluates a number of different optimizations for interactive parallel rendering on medium-sized, hybrid visualization clusters. We have shown that to reach interactive frame rates with large data sets, not only the application rendering code has to be carefully studied, but also the parallel rendering framework requires careful optimization.

Apart from implementing and evaluating known optimizations such as asynchronous readbacks and automatic thread placement in a generic parallel rendering framework, we presented a novel algorithm for the optimization of 2D load-balancing, re-using region of interest information from the compositing stage for refined load distribution.

The study of the impact of the individual optimizations provides valuable insight into the influence of the various features on real-world rendering performance. Our performance study on a hybrid visualization cluster demonstrates the typical scalability and common roadblocks towards high-performance large data visualization.

We want to benchmark the use of RDMA over InfiniBand, in particular for DB decompositions, which have a significant network transport overhead when using TCP over 10 Gbit Ethernet.

Regarding RTNeuron, our tests showed that there is room for improvement in several areas. The view frustum culling implementation has been identified as one of the major obstacles for scalability in some configurations. We will evaluate whether culling overhead can be reduced by using a different algorithm, such as clustering the cell segments and using an independent capsule skeleton for each cluster. Also, the current visualizations target circuit models where the cell geometries are not unique. The simulation is moving towards completely unique morphologies, which is a complete paradigm shift with new challenges to be addressed.

We want to further analyse the use of subpixel compounds to improve the visual quality for RTNeuron, which requires strong anti-aliasing for good visual results.

Acknowledgments

The authors would like to thank and acknowledge the following institutions and projects for providing 3D test data sets: the Digital Michelangelo Project and the Stanford 3D Scanning Repository. This work was supported in part by the Blue Brain Project, the Swiss National Science Foundation under grant 200020-129525, the Spanish Ministry of Science and Innovation under grant TIN2010-21289-C02-01/02, and the Cajal Blue Brain Project.

We would also like to thank GitHub for providing an excellent infrastructure hosting the Equalizer project at http://github.com/Eyescale/Equalizer/.

References

[ACCC04] Abraham F., Celes W., Cerqueira R., Campos J. L.: A load-balancing strategy for sort-first distributed rendering. In Proceedings of SIBGRAPI 2004 (2004), IEEE Computer Society, pp. 292–299.

[AP98] Ahrens J., Painter J.: Efficient sort-last rendering using compression-based image compositing. In Proceedings Eurographics Workshop on Parallel Graphics and Visualization (1998).

[BJH∗01] Bierbaum A., Just C., Hartling P., Meinert K., Baker A., Cruz-Neira C.: VR Juggler: A virtual platform for virtual reality application development. In Proceedings of IEEE Virtual Reality (2001), pp. 89–96.

[BRE05] Bhaniramka P., Robert P. C. D., Eilemann S.: OpenGL Multipipe SDK: A toolkit for scalable parallel rendering. In Proceedings IEEE Visualization (2005), pp. 119–126.

[DK11] Doerr K.-U., Kuester F.: CGLX: A scalable, high-performance visualization framework for networked display environments. IEEE Transactions on Visualization and Computer Graphics 17, 2 (March 2011), 320–332.

[EMP09] Eilemann S., Makhinya M., Pajarola R.: Equalizer: A scalable parallel rendering framework. IEEE Transactions on Visualization and Computer Graphics (May/June 2009).

[EP07] Eilemann S., Pajarola R.: Direct send compositing for parallel sort-last rendering. In Proceedings Eurographics Symposium on Parallel Graphics and Visualization (2007).


[Her11] Hernando J. B.: Interactive Visualization of Detailed Large Neocortical Circuit Simulations. PhD thesis, Facultad de Informática, Universidad Politécnica de Madrid, 2011.

[HHN∗02] Humphreys G., Houston M., Ng R., Frank R., Ahern S., Kirchner P. D., Klosowski J. T.: Chromium: A stream-processing framework for interactive rendering on clusters. ACM Transactions on Graphics 21, 3 (2002), 693–702.

[HHS∗11] Hay E., Hill S., Schürmann F., Markram H., Segev I.: Models of neocortical layer 5b pyramidal cells capturing a wide range of dendritic and perisomatic active properties. PLoS Computational Biology 7, 7 (2011), e1002107.

[HSMd08] Hernando J. B., Schürmann F., Markram H., de Miguel P.: RTNeuron, an application for interactive visualization of detailed cortical column simulations. XVIII Jornadas de Paralelismo, Spain (2008).

[LHS∗12] Lasserre S., Hernando J., Schürmann F., de Miguel Anasagasti P., Abou-Jaoudé G., Markram H.: A neuron membrane mesh representation for visualization of electrophysiological simulations. IEEE Transactions on Visualization and Computer Graphics 18, 2 (2012), 214–227.

[LRC∗03] Luebke D., Reddy M., Cohen J. D., Varshney A., Watson B., Huebner R.: Level of Detail for 3D Graphics. Morgan Kaufmann Publishers, San Francisco, California, 2003.

[LRN96] Lee T.-Y., Raghavendra C., Nicholas J. B.: Image composition schemes for sort-last polygon rendering on 2D mesh multicomputers. IEEE Transactions on Visualization and Computer Graphics 2, 3 (July–September 1996), 202–217.

[Mar06] Markram H.: The Blue Brain Project. Nature Reviews Neuroscience 7, 2 (2006), 153–160. http://bluebrain.epfl.ch.

[MCEF94] Molnar S., Cox M., Ellsworth D., Fuchs H.: A sorting classification of parallel rendering. IEEE Computer Graphics and Applications 14, 4 (1994), 23–32.

[MEP10] Makhinya M., Eilemann S., Pajarola R.: Fast compositing for cluster-parallel rendering. In Proceedings Eurographics Symposium on Parallel Graphics and Visualization (2010), pp. 111–120.

[MPHK94] Ma K.-L., Painter J. S., Hansen C. D., Krogh M. F.: Parallel volume rendering using binary-swap image composition. IEEE Computer Graphics and Applications 14, 4 (July 1994), 59–68.

[MPI12] Open MPI: Portable Hardware Locality. http://www.open-mpi.org/projects/hwloc/, 2012.

[PGR∗09] Peterka T., Goodell D., Ross R., Shen H.-W., Thakur R.: A configurable algorithm for parallel image-compositing applications. In Proceedings ACM/IEEE Conference on High Performance Networking and Computing (2009), pp. 1–10.

[Ral62] Rall W.: Theory of physiological properties of dendrites. Annals of the New York Academy of Science 96 (1962), 1071–1092.

[SKN04] Sano K., Kobayashi Y., Nakamura T.: Differential coding scheme for efficient parallel image composition on a PC cluster system. Parallel Computing 30, 2 (2004), 285–299.

[SML∗03] Stompel A., Ma K.-L., Lum E. B., Ahrens J., Patchett J.: SLIC: Scheduled linear image compositing for parallel volume rendering. In Proceedings IEEE Symposium on Parallel and Large-Data Visualization and Graphics (2003), pp. 33–40.

[SMV11] Spafford K., Meredith J. S., Vetter J. S.: Quantifying NUMA and contention effects in multi-GPU systems. In Proceedings of the Fourth Workshop on General Purpose Processing on Graphics Processing Units (GPGPU-4) (New York, NY, USA, 2011), ACM, pp. 11:1–11:7.

[SZF∗99] Samanta R., Zheng J., Funkhouser T., Li K., Singh J. P.: Load balancing for multi-projector rendering systems. In Proceedings ACM SIGGRAPH/EUROGRAPHICS Workshop on Graphics Hardware (HWWS '99) (New York, NY, USA, 1999), ACM, pp. 107–116.

[TIH03] Takeuchi A., Ino F., Hagihara K.: An improved binary-swap compositing for sort-last parallel rendering on distributed memory multiprocessors. Parallel Computing 29, 11-12 (2003), 1745–1762.

[VBRR02] Voss G., Behr J., Reiners D., Roth M.: A multi-thread safe foundation for scene graphs and its extension to clusters. In Proceedings of the Fourth Eurographics Workshop on Parallel Graphics and Visualization (EGPGV '02) (Aire-la-Ville, Switzerland, 2002), Eurographics Association, pp. 33–37.

[YWM08] Yu H., Wang C., Ma K.-L.: Massively parallel volume rendering using 2-3 swap image compositing. In Proceedings IEEE/ACM Supercomputing (2008).

[YYC01] Yang D.-L., Yu J.-C., Chung Y.-C.: Efficient compositing methods for the sort-last-sparse parallel volume rendering system on distributed memory multicomputers. Journal of Supercomputing 18, 2 (February 2001), 201–220.
