Parallel Rendering on Hybrid Multi-GPU Clusters
Eurographics Symposium on Parallel Graphics and Visualization (2012)
H. Childs and T. Kuhlen (Editors)

S. Eilemann†1, A. Bilgili1, M. Abdellah1, J. Hernando2, M. Makhinya3, R. Pajarola3, F. Schürmann1
1Blue Brain Project, EPFL; 2CeSViMa, UPM; 3Visualization and MultiMedia Lab, University of Zürich

Abstract

Achieving efficient scalable parallel rendering for interactive visualization applications on medium-sized graphics clusters remains a challenging problem. Framerates of up to 60 Hz require a carefully designed and fine-tuned parallel rendering implementation that fits all required operations into the 16 ms time budget available for each rendered frame. Furthermore, modern commodity hardware increasingly embraces a NUMA architecture, where multiple processor sockets each have their locally attached memory and where auxiliary devices such as GPUs and network interfaces are directly attached to one of the processors. Such so-called fat NUMA processing and graphics nodes are increasingly used to build cost-effective hybrid shared/distributed memory visualization clusters. In this paper we present a thorough analysis of the asynchronous parallelization of the rendering stages, and we derive and implement important optimizations to achieve highly interactive framerates on such hybrid multi-GPU clusters. We use both a benchmark program and a real-world scientific application used to visualize, navigate and interact with simulations of cortical neuron circuit models.

Categories and Subject Descriptors (according to ACM CCS): I.3.2 [Computer Graphics]: Graphics Systems—Distributed Graphics; I.3.m [Computer Graphics]: Miscellaneous—Parallel Rendering

1. Introduction

The decomposition of parallel rendering systems across multiple resources can be generalized and classified, according to Molnar et al. [MCEF94], based on the sorting stage of the image synthesis pipeline: sort-first (2D) decomposition divides a single frame in the image domain and assigns the resulting image tiles to different processes; sort-last database (DB) decomposition performs a data-domain decomposition across the participating rendering processes. A sort-middle approach typically cannot be implemented efficiently with current hardware architectures, as one needs to intercept the transformed and projected geometry in scan space after primitive assembly. In addition to this classification, frame-based approaches distribute entire frames to different rendering resources, e.g. by time-multiplexing (DPlex) or stereo decomposition (Eye). Variations of sort-first rendering are pixel or subpixel decomposition, as well as the tile-based decomposition frequently used for interactive raytracing.

Hybrid GPU clusters are often built from so-called fat render nodes using a NUMA architecture with multiple CPUs and GPUs per machine. This configuration poses unique challenges for achieving optimal distributed parallel rendering performance, since the topology of both the individual node and the full cluster has to be taken into account to optimally exploit locality in parallel rendering. In particular, the configuration of multiple rendering source nodes feeding one or a few display destination nodes for scalable rendering offers room for improvement in the management and synchronization of the rendering stages.

In this paper we study in detail different performance-limiting factors for parallel rendering on hybrid GPU clusters, such as thread placement, asynchronous compositing, network transport and configuration variation, in both a straightforward benchmark tool as well as a real-world scientific visualization application.
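The two main decomposition modes above can be made concrete with a small sketch. The following Python fragment is purely illustrative (it is not Equalizer's API; all function names are hypothetical): it partitions a frame into image tiles for sort-first rendering, and a primitive range for sort-last rendering, across N processes.

```python
def sort_first_tiles(width, height, n):
    """Sort-first (2D): split the frame in the image domain into n
    vertical tiles (x, y, w, h), one per rendering process."""
    xs = [round(i * width / n) for i in range(n + 1)]
    return [(xs[i], 0, xs[i + 1] - xs[i], height) for i in range(n)]

def sort_last_ranges(num_primitives, n):
    """Sort-last (DB): split the data domain into n primitive ranges
    [begin, end), one per rendering process."""
    rs = [round(i * num_primitives / n) for i in range(n + 1)]
    return [(rs[i], rs[i + 1]) for i in range(n)]
```

In sort-first each process renders all data clipped to its tile, so the final images only need to be tiled together; in sort-last each process renders its data range over the full viewport, so the partial images must be depth- or alpha-composited.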
The contributions presented in this paper not only consist of an experimental analysis on a medium-sized visualization cluster, but also introduce and evaluate important optimizations to improve the scalability of interactive applications for large parallel data visualization. Medium-sized visualization clusters are currently among the most practically relevant configurations, as they are cost-efficient, widely available and reasonably easy to manage. Most notably, we observed that asynchronous readbacks and regions of interest can contribute substantially towards high-performance interactive visualization.

† stefan.eilemann@epfl.ch, ahmet.bilgili@epfl.ch, marwan.abdellah@epfl.ch, jhernando@fi.upm.es, makhinya@ifi.uzh.ch, [email protected], felix.schuermann@epfl.ch

c The Eurographics Association 2012.

2. Related Work

A number of general-purpose parallel rendering concepts and optimizations have been introduced before, such as parallel rendering architectures, parallel compositing, load balancing, data distribution, and scalability. However, only a few generic APIs and parallel rendering systems exist, which include VR Juggler [BJH∗01] (and its derivatives), Chromium [HHN∗02], OpenSG [VBRR02], OpenGL Multipipe SDK [BRE05], Equalizer [EMP09] and CGLX [DK11], of which Equalizer is used as the basis for the work presented in this paper.

For sort-last rendering, a number of parallel compositing algorithm improvements have been proposed in [MPHK94, LRN96, SML∗03, EP07, PGR∗09, MEP10]. In this work, however, we focus on improving the rendering performance through optimizations and data reduction techniques applicable to the sort-first and the parallel direct-send sort-last compositing methods. For sort-first parallel rendering, the total image data exchange load is fairly simple and approaches O(1) for larger N. Sort-last direct-send compositing exchanges exactly two full images concurrently in total between the N rendering nodes. Message contention in massively parallel rendering [YWM08] is not considered, as it is not a major bottleneck in medium-sized clusters.

To reduce the transmission cost of pixel data, image compression [AP98, YYC01, TIH03, SKN04] and screen-space bounding rectangles [MPHK94, LRN96, YYC01] have been proposed. Benchmarks done by [SMV11] show that exploiting the topology of hybrid GPU clusters with a NUMA architecture increases the performance of CUDA applications.

3. Scalable Rendering on Hybrid Architectures

3.1. Parallel Rendering Framework

We chose Equalizer as our parallel rendering framework due to its scalability and configuration flexibility. The architecture and design decisions of Equalizer are described in [EMP09], and the results described in this paper build upon that foundation.

However, the basic principles of parallel rendering are similar for most approaches, and the analysis, experiments and improvements presented in this paper are generally applicable and thus also useful for other parallel rendering systems.

In a cluster-parallel distributed rendering environment, the general execution flow is as follows (omitting event handling and other application tasks): clear, draw, readback, transmit and composite. Clear and draw optimizations are largely ignored in this work. Consequently, in the following we focus on the readback, transmission, compositing and load-balancing stages.

3.2. Hybrid GPU Clusters

Hybrid GPU clusters use a set of fat rendering nodes connected with each other using one or more interconnects. In contrast to traditional single-socket, single-GPU cluster nodes, each node in a hybrid cluster has an internal topology, using a non-uniform memory access (NUMA) architecture and a PCI Express bus layout with multiple memory regions, GPUs and network adapters. Figure 1 shows a typical layout, depicting the relative bandwidth through the thickness of the connections between components.

[Figure 1: Example hybrid GPU cluster topology — a node with two six-core processors, local RAM per NUMA region, three GPUs and two network interfaces.]

Future hardware architectures have been announced that will make this topology more complex. The interconnect topology of most visualization clusters typically uses a fully non-blocking, switched backplane.

3.3. Asynchronous Compositing

Compositing in a distributed parallel rendering system is decomposed into readback of the produced pixel data (1), optional compression of this pixel data (2), transmission to the destination node consisting of send (3) and receive (4), optional decompression (5), and composition consisting of upload (6) and assembly (7) in the destination framebuffer.

In a naive implementation, operations 1 to 3 are executed serially on one processor, and 4 to 7 on another. In our parallel rendering system, operations 2 to 5 are executed asynchronously to the rendering and to operations 1, 6 and 7. Furthermore, we use a latency of one frame, which means that two rendering frames are always in execution, allowing the pipelining of these operations, as shown in Figures 2(a) and 2(c). We have also implemented asynchronous readback using OpenGL pixel buffer objects, further increasing the parallelism by pipelining the rendering and pixel transfers, as shown in Figure 2(b). We did not implement asynchronous uploads as shown in Figure 2(d), since they have minimal impact when using a many-to-one configuration but complicate the actual implementation significantly.

In the asynchronous case, the rendering thread performs only application-specific rendering operations, since the overhead of starting an asynchronous readback becomes negligible.
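The benefit of this stage split can be illustrated with a back-of-the-envelope model. The timings below are hypothetical, not measurements from the paper: with a one-frame latency pipeline, the steady-state frame time is bounded by the slower of the two threads, rather than by the sum of all seven operations.

```python
def steady_frame_time(stages, async_ops):
    """Steady-state time per frame with a one-frame latency pipeline:
    the render thread and the asynchronous transfer thread overlap,
    so the slower of the two bounds the achievable framerate."""
    render = sum(t for op, t in stages.items() if op not in async_ops)
    transfer = sum(t for op, t in stages.items() if op in async_ops)
    return max(render, transfer)

# Hypothetical per-stage costs in ms, mirroring operations (1)-(7):
stages = {"draw": 9.0, "readback": 2.0, "compress": 3.0, "send": 4.0,
          "receive": 4.0, "decompress": 3.0, "upload": 1.0, "assemble": 1.0}

serial = sum(stages.values())  # fully serial: 27 ms per frame
# Operations 2-5 run asynchronously; the render thread keeps
# draw plus operations 1, 6 and 7 (9 + 2 + 1 + 1 = 13 ms),
# the transfer thread costs 3 + 4 + 4 + 3 = 14 ms.
pipelined = steady_frame_time(
    stages, {"compress", "send", "receive", "decompress"})  # 14 ms
```

Under these assumed numbers, pipelining cuts the steady-state frame time from 27 ms to 14 ms; asynchronous readback (operation 1 moved off the render thread as well) would further shrink the render thread's share to the draw time alone.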
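The constant per-node traffic of direct-send sort-last compositing, noted in Section 2, can also be sketched briefly (an illustrative model that ignores compression and region-of-interest reductions): each of the N nodes keeps its own 1/N screen tile, sends the other N−1 tiles to their owners, and concurrently receives its own tile from each of the N−1 peers.

```python
def direct_send_traffic(n):
    """Per-node pixel traffic of direct-send compositing, in units of
    one full frame: (n-1)/n sent plus (n-1)/n received per node,
    approaching two full images concurrently regardless of n."""
    sent = (n - 1) / n      # n-1 tiles of size 1/n to the peers
    received = (n - 1) / n  # own tile of size 1/n from each peer
    return sent + received
```

Because this per-node load does not grow with N, direct-send scales well on medium-sized clusters where, as noted above, message contention is not yet a bottleneck.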