
High-Performance Computing

Parallel Rendering Technologies for HPC Clusters

By Li Ou, Ph.D.; Yung-Chin Fang; Onur Celebioglu; and Victor Mashayekhi, Ph.D.

Using parallel rendering technologies with clusters of high-performance computing (HPC) workstations configured with high-end graphics processors helps scale out graphics capabilities by exploiting and coordinating resources. This article discusses parallel rendering architectures and highlights open source utilities that can help meet rendering requirements for large-scale data sets.

Related Categories: Clustering; High-performance computing (HPC); System architecture. Visit DELL.COM/PowerSolutions for the complete category index.

Supercomputers and high-performance computing (HPC) clusters enable demanding software, such as real-time simulation, animation, and scientific applications, to generate high-resolution data sets at sizes that have not typically been feasible in the past. However, efficiently rendering these large, dynamic data sets, especially those with high-resolution display requirements, can be a significant challenge.

Rendering is the process of converting an abstract description of a scene (a data set) to an image. For complex data sets or high-resolution images, the rendering process can be highly compute intensive, and applications with requirements for rapid turnaround time and human perception place additional demands on processing power. State-of-the-art graphics hardware can significantly enhance rendering performance, but a single piece of hardware is often limited by processor performance and amount of memory. If very high resolution is required, the rendering task can simply be too large for one piece of hardware to handle.

Exploiting multiple processing units, a technique known as parallel rendering, can provide the necessary computational power to accelerate these rendering tasks. This article discusses the major architectures and methodologies for parallel rendering with HPC workstation clusters and describes how open source utilities such as Chromium and Distributed Multihead X (DMX) can help meet large-scale rendering requirements.

Understanding parallel rendering techniques

There are two different ways to build parallel architectures for high-performance rendering. The first method is to use a large symmetric multiprocessing computer with extremely high-end graphics capabilities. The downside of this approach is its cost: these systems can be prohibitively expensive.

The second method is to utilize the aggregate performance of commodity graphics accelerators in clusters of HPC workstations. The advantages of this architecture include the following:

• Cost-effectiveness: Commodity graphics hardware and workstations remain far less expensive than high-end parallel rendering computers, and some PC graphics accelerators can provide performance levels comparable to those of high-end graphics hardware.
• Scalability: As long as the network is not saturated, the aggregate hardware capacity of a visualization cluster grows linearly as the number of HPC workstations increases.

Reprinted from Dell Power Solutions, November 2007. Copyright © 2007 Dell Inc. All rights reserved.

• Flexibility: The performance of commodity graphics hardware can increase rapidly, and its development cycles are typically much shorter than those of custom-designed, high-end parallel hardware. In addition, open interfaces for hardware, such as PCI Express (PCIe), and open interfaces for software, such as the Open Graphics Library (OpenGL), allow organizations to easily take advantage of new hardware to help increase cluster performance.

Temporal and data parallelism

Two common approaches to parallel rendering are temporal parallelism and data parallelism. Temporal parallelism divides work into single sequential frames that are assigned to systems and rendered in order; data parallelism divides a large data set into subsets of work that are rendered by multiple systems and then recombined.

In temporal parallelism, the basic unit of work is the rendering of a single complete image or frame, and each processor is assigned a number of frames to render in sequence (see Figure 1). Because this method is not geared toward rendering individual images but can increase performance when rendering an entire sequence, film industries often use it for animation applications and similar software, in which the time required to render individual frames may not be as important as the overall time required to render all frames.

Figure 1. Frames rendered across multiple systems using temporal parallelism

The basic concept of data parallelism, on the other hand, is divide and conquer. Data parallelism decomposes a large data set into many small subsets, then uses multiple workstations to render these subsets simultaneously (see Figure 2). High-performance interconnects route the data subsets between the processing workstations, and one or more controlling units synchronize the distributed rendering tasks. When the rendering process completes, the final image can be compiled from the subsets on each workstation for display. Data parallelism is widely used by research industries and in software such as real-time simulation, virtual reality, virtual environment simulation, and scientific visualization applications.

Figure 2. Data subsets rendered across multiple systems using data parallelism

Object decomposition and image decomposition

A key step in data parallelism is decomposing large data sets, a step that can utilize one of two major approaches: object decomposition and image decomposition. In object decomposition, tasks are formed by partitioning the geometric description of the scene. Individual workstations partition and render subsets of the geometric data in parallel, producing pixels that must be integrated later into a final image. Image decomposition, in contrast, forms tasks by partitioning the image space: each task renders only the geometric objects that contribute to the pixels that physically belong to the space assigned to the task.

Figure 3 illustrates how a data set could be partitioned by these two approaches. In object decomposition, each workstation renders a single object in the data set: one renders the rectangle, and the other renders the circle. In image decomposition, each workstation renders half of the final image: one renders the left side, and the other renders the right side.

Figure 3. Parallel rendering processes using object decomposition and image decomposition

There are no absolute guidelines for choosing between object decomposition and image decomposition. Generally, object decomposition is suitable for applications with large-scale data sets, while image decomposition is suitable for applications requiring high resolution and a large image, such as a tiled display integrating multiple screens into a single display device.

Object decomposition can enhance load balancing and scalability compared with image decomposition by helping ease preprocessing and distribute objects evenly among processors. However, it does require a post-composition process to integrate the image subsets, because objects assigned to different processors may map to the same screen space. For example, in Figure 3, after rendering the circle and the rectangle individually, this post-composition step determines how they overlap to form the final image.
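The temporal parallelism approach described above, in which whole frames are farmed out to rendering nodes as in Figure 1, can be sketched in a few lines of Python. The node count, function names, and file-name scheme below are illustrative assumptions, not part of any real rendering toolkit:

```python
# A minimal sketch of temporal parallelism: whole frames are the unit of work,
# handed out to rendering nodes in round-robin order. The frame "renderer"
# here is a stand-in function; all names are hypothetical.

NUM_NODES = 4

def assign_frames(num_frames, num_nodes):
    """Give each node the list of frame numbers it will render in sequence."""
    schedule = {node: [] for node in range(num_nodes)}
    for frame in range(num_frames):
        schedule[frame % num_nodes].append(frame)
    return schedule

def render_frame(frame):
    """Stand-in for the real per-frame rendering work."""
    return f"frame-{frame:04d}.png"

schedule = assign_frames(10, NUM_NODES)
# Each node renders its assigned frames independently; total wall-clock time
# approaches (time per frame x total frames) / nodes when frames cost similar time.
outputs = {node: [render_frame(f) for f in frames]
           for node, frames in schedule.items()}
print(schedule[0])  # node 0 renders frames 0, 4, 8
```

This mirrors why the technique suits animation workloads: no single frame finishes faster, but the whole sequence completes in roughly 1/N of the time.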
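Data parallelism with image decomposition, the image-space path in Figure 3, can also be illustrated with a toy example: an 8 × 4 pixel image is split between two hypothetical workstations, each rasterizes only the objects touching its half, and a composition step stitches the halves together. The miniature scene and all names are invented for illustration:

```python
# Illustrative sketch of data parallelism with image decomposition:
# the image space is split into left and right halves, each "workstation"
# rasterizes only the objects that overlap its half, and a final
# composition step stitches the disjoint halves into one image.

WIDTH, HEIGHT = 8, 4

# A scene resembling Figure 3: a rectangle on the left, a circle-ish blob on the right.
SCENE = [
    {"name": "rectangle", "pixels": {(x, y) for x in range(1, 3) for y in range(1, 3)}},
    {"name": "circle",    "pixels": {(5, 1), (6, 1), (5, 2), (6, 2)}},
]

def render_region(scene, x_range):
    """Rasterize only the pixels that physically belong to this image region."""
    region = {}
    for obj in scene:
        for (x, y) in obj["pixels"]:
            if x in x_range:
                region[(x, y)] = obj["name"][0]  # mark pixel with object initial
    return region

def compose(regions):
    """Stitch the independently rendered regions into the final image."""
    final = {}
    for region in regions:
        final.update(region)  # regions are disjoint in image space
    return final

# Two "workstations", one per image half.
left  = render_region(SCENE, range(0, WIDTH // 2))
right = render_region(SCENE, range(WIDTH // 2, WIDTH))
image = compose([left, right])

rows = ["".join(image.get((x, y), ".") for x in range(WIDTH)) for y in range(HEIGHT)]
print("\n".join(rows))
```

Because the regions are disjoint, composition is trivial concatenation; this is exactly the property that object decomposition gives up in exchange for better load balancing.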


With large numbers of partitions, the post-composition step required by object decomposition can place heavy demands on communication networks and require a huge amount of computation power from the composition units.

Image decomposition helps eliminate the complexity of image integration by requiring only a final composition step to physically map the image parts together. However, this approach may have a potential side effect: loss of spatial coherence. This loss can occur because in image decomposition, a single geometric object may map to multiple regions in the image space, which requires such objects to be shared by multiple independent processors.

Sort-first and sort-last algorithms

Object space data includes geometric descriptions of the scene, such as polygons in 3D models. One of the challenges in parallel rendering is mapping this data from object space to image space. Because the original data set is partitioned and distributed to multiple processors, the basic parallel rendering algorithms are classified as sort-first or sort-last, depending on where the function mapping the object space to the image space occurs.

Sort-first is an initial preprocessing step for assigning objects to the appropriate processors based on a cross-space mapping policy. Sort-first algorithms perform the space mapping early in the rendering process, and the sort operation happens before the data set is partitioned and distributed. Sort-last algorithms are less sensitive than sort-first algorithms to the distribution of objects within the image, because the payload distribution is based on objects rather than screen regions. The mapping of the object space to the image space in sort-last algorithms occurs during composition, when pixels from individual processors are integrated into a final image. Sort-first algorithms are typically combined with image decomposition, and sort-last algorithms are typically combined with object decomposition (see Figures 4 and 5).

Figure 4. Parallel rendering process for display on tiled screens using sort-first algorithms with image decomposition

Figure 5. Parallel rendering process for display on a single screen using sort-last algorithms with object decomposition

Enhancing HPC cluster graphics with open source utilities

Parallel rendering typically requires a special software layer to exploit and coordinate distributed computational resources. Chromium and DMX, two popular open source utilities, can provide these capabilities.1

Chromium, developed by Lawrence Livermore National Laboratory, Stanford University, the University of Virginia, and Tungsten Graphics, is an open source software stack for parallel rendering on clusters of workstations. It runs on the Microsoft® Windows®, Linux®, IBM® AIX, Solaris, and IRIX operating systems, and is designed to increase three aspects of graphics scalability:

• Data scalability: Chromium can process increasingly larger data sets by distributing workloads to increasingly larger clusters.
• Rendering performance: Chromium can scale out rendering performance by aggregating commodity graphics hardware.

1 For more information, see chromium.sourceforge.net and dmx.sourceforge.net.

• Display performance: Chromium helps the system output large, high-resolution images such as those for a tiled display.

Chromium enhances HPC cluster graphics capabilities by supporting sort-first and sort-last rendering as well as hybrid parallel rendering, which combines both sort-first and sort-last algorithms. Hybrid parallel rendering in Chromium uses Python scripts to help increase system flexibility. For example, a single hardware architecture can support a Python script that configures the system with a sort-first image decomposition policy, a configuration that supports large tiled displays with high image resolution, or a modified script with a sort-last object decomposition algorithm that renders a large data set in parallel.

Flexible interfaces are a key feature of Chromium and allow it to support multiple application behavior requirements. One of these interfaces is a library that supports OpenGL application programming interfaces (APIs) but simply dispatches the OpenGL calls to the following processing chain for parallel rendering. This mechanism is transparent to applications, offering an easy way to deploy traditional OpenGL applications to parallel rendering environments, particularly for applications requiring large tiled displays.

Rendering very large, complex models requires applications with native parallel coding techniques. Chromium provides the necessary data scalability to perform these tasks with a set of parallel APIs that synchronize the rendering processes of multiple entities. These APIs can integrate seamlessly with standard OpenGL interfaces to support truly parallel rendering applications on clusters.

Using Chromium with Linux or UNIX® operating systems to drive large display walls with the X Window System introduces a minor problem: although the large images can span multiple screens, each screen is still managed by an independent X server. Integrating Chromium with DMX can help solve this problem. DMX allows a single X server to run across a cluster of systems such that the X display or desktop can present images across many physical displays. For example, a cluster of 12 workstations running one X server can provide an image to a large tiled display in a 4 × 3 screen configuration. Working with DMX, Chromium can render data sets and output large images to a unified X server, which controls multiple graphics cards connected to the physical displays and allows logical windows to cross the display's physical boundaries.

Open source utilities such as Chromium and DMX help simplify the deployment of parallel rendering on high-performance workstation clusters. Figure 6 shows an example software stack for Linux and UNIX operating systems. Each workstation in a cluster requires the three bottom layers: the graphics card, the driver, and the X Window System and OpenGL. Chromium and DMX create another layer that provides parallel rendering by utilizing the rendering resources of individual workstations. Adding a layer containing toolkits such as the open source Visualization Toolkit (VTK) and the OpenGL Utility Toolkit (GLUT) can help applications utilize the bottom layers. When running on Windows-based systems, DMX and the X Window System are not required; the system can use OS services to provide the necessary functionality.

Figure 6. Example software stack for parallel rendering on Linux- and UNIX-based clusters

Creating scalable HPC architectures for parallel rendering

Deploying parallel rendering technologies can help increase the cost-effectiveness, flexibility, and scalability of HPC architectures. Organizations can take advantage of common approaches such as temporal parallelism and data parallelism to help streamline data set rendering, as well as open source utilities such as Chromium and DMX to present graphics on high-resolution displays and provide the necessary data scalability.

The combination of HPC cluster architectures and parallel rendering can also accelerate the pace of research projects: for example, by allowing aerospace researchers to visualize the heat generated by wind on an airplane shell to help increase airplane safety and efficiency, enabling geologists to see through seismic zones and enhance crude oil yield from existing wells, and letting pharmaceutical researchers see how human genes interact with medicines to help accelerate drug development. By using HPC clusters and parallel rendering in these and other industries, organizations can successfully address large-scale problems on high-resolution data sets.

Li Ou, Ph.D., is a systems engineer in the Scalable Systems Group at Dell. He has a B.S. in Electrical Engineering and an M.S. in Computer Science from the University of Electronic Science and Technology of China, and a Ph.D. in Computer Engineering from Tennessee Technological University.

Yung-Chin Fang is a senior consultant in the Scalable Systems Group at Dell. He has published more than 30 articles on HPC and cyberinfrastructure management, and participates in HPC cluster–related open source groups as a Dell representative.

Onur Celebioglu is an engineering manager in the Scalable Systems Group at Dell. He has an M.S. in Electrical and Computer Engineering from Carnegie Mellon University.

Victor Mashayekhi, Ph.D., is the engineering manager for the Scalable Systems Group at Dell and is responsible for product development for HPC clusters, remote computing, unified communication, virtualization, custom solutions, and solutions advisors. Victor has a B.A., M.S., and Ph.D. in Computer Science from the University of Minnesota.
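As a closing illustration, the sort-first/sort-last distinction the article draws can be reduced to a small Python sketch: sort-first buckets objects by screen tile before any rendering happens, while sort-last lets each processor render whole objects and resolves overlap by depth only at composition time. The object list, tile map, and depth rule below are illustrative assumptions, not code from Chromium or any other library:

```python
# Illustrative contrast of sort-first and sort-last. Objects carry a
# screen-space bounding box and a depth; the "rendering" is just pixel
# filling, not real rasterization, and all names are hypothetical.

OBJECTS = [
    {"name": "A", "xs": range(0, 3), "ys": range(0, 2), "depth": 2},
    {"name": "B", "xs": range(2, 5), "ys": range(0, 2), "depth": 1},  # nearer, overlaps A
]

TILES = {0: range(0, 3), 1: range(3, 6)}  # two processors, image space split by x

def sort_first(objects, tiles):
    """Sort-first: map objects to tiles *before* rendering; each processor
    then rasterizes only what falls inside its own region of image space."""
    buckets = {t: [] for t in tiles}
    for obj in objects:
        for t, xs in tiles.items():
            if set(obj["xs"]) & set(xs):  # object touches this tile
                buckets[t].append(obj["name"])
    return buckets

def sort_last(objects):
    """Sort-last: each processor renders its whole objects; the mapping to
    image space happens at composition, using depth to resolve overlap."""
    framebuffer = {}  # pixel -> (depth, name)
    for obj in objects:  # conceptually, one object per processor here
        for x in obj["xs"]:
            for y in obj["ys"]:
                current = framebuffer.get((x, y))
                if current is None or obj["depth"] < current[0]:
                    framebuffer[(x, y)] = (obj["depth"], obj["name"])
    return {pix: name for pix, (_, name) in framebuffer.items()}

buckets = sort_first(OBJECTS, TILES)
composited = sort_last(OBJECTS)
print(buckets)             # A and B both touch tile 0; only B touches tile 1
print(composited[(2, 0)])  # overlap pixel resolved to the nearer object, B
```

Note how object B appears in both sort-first buckets (the spatial-coherence cost of image decomposition), while in sort-last each object is rendered exactly once and the overlap is paid for in the depth-resolving composition pass.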
