
High-Performance Computing

Parallel Rendering Technologies for HPC Clusters

By Li Ou, Ph.D.; Yung-Chin Fang; Onur Celebioglu; and Victor Mashayekhi, Ph.D.

Using parallel rendering technologies with clusters of high-performance computing (HPC) workstations configured with high-end graphics processors helps scale out graphics capabilities by exploiting and coordinating resources. This article discusses parallel rendering architectures and highlights open source utilities that can help meet rendering requirements for large-scale data sets.

Related Categories: Clustering; High-performance computing (HPC); System architecture. Visit DELL.COM/PowerSolutions for the complete category index.

Supercomputers and high-performance computing (HPC) clusters enable demanding software, such as real-time simulation, animation, and scientific applications, to generate high-resolution data sets at sizes that have not typically been feasible in the past. However, efficiently rendering these large, dynamic data sets, especially those with high-resolution display requirements, can be a significant challenge.

Rendering is the process of converting an abstract description of a scene (a data set) to an image. For complex data sets or high-resolution images, the rendering process can be highly compute intensive, and applications with requirements for rapid turnaround time and human perception place additional demands on processing power. State-of-the-art graphics hardware can significantly enhance rendering performance, but a single piece of hardware is often limited by processor performance and amount of memory. If very high resolution is required, the rendering task can simply be too large for one piece of hardware to handle.

Exploiting multiple processing units, a technique known as parallel rendering, can provide the necessary computational power to accelerate these rendering tasks. This article discusses the major architectures and methodologies for parallel rendering with HPC workstation clusters and describes how open source utilities such as Chromium and Distributed Multihead X (DMX) can help meet large-scale rendering requirements.

Understanding parallel rendering techniques

There are two different ways to build parallel architectures for high-performance rendering. The first method is to use a large symmetric multiprocessing computer with extremely high-end graphics capabilities. The downside of this approach is its cost: these systems can be prohibitively expensive.

The second method is to utilize the aggregate performance of commodity graphics accelerators in clusters of HPC workstations. The advantages of this architecture include the following:

• Cost-effectiveness: Commodity graphics hardware and workstations remain far less expensive than high-end parallel rendering computers, and some PC graphics accelerators can provide performance levels comparable to those of high-end graphics hardware.
• Scalability: As long as the network is not saturated, the aggregate hardware capacity of a visualization cluster grows linearly as the number of HPC workstations increases.

Reprinted from Dell Power Solutions, November 2007. Copyright © 2007 Dell Inc. All rights reserved.

• Flexibility: The performance of commodity graphics hardware can increase rapidly, and its development cycles are typically much shorter than those of custom-designed, high-end parallel hardware. In addition, open interfaces for hardware, such as PCI Express (PCIe), and open interfaces for software, such as the Open Graphics Library (OpenGL), allow organizations to easily take advantage of new hardware to help increase cluster performance.

Temporal and data parallelism

Two common approaches to parallel rendering are temporal parallelism and data parallelism. Temporal parallelism divides work into single sequential frames that are assigned to systems and rendered in order; data parallelism divides a large data set into subsets of work that are rendered by multiple systems and then recombined.

In temporal parallelism, the basic unit of work is the rendering of a single complete image or frame, and each processor is assigned a number of frames to render in sequence (see Figure 1). Because this method is not geared toward rendering individual images but can increase performance when rendering an entire sequence, film industries often use it for animation applications and similar software, in which the time required to render individual frames may not be as important as the overall time required to render all frames.

Figure 1. Frames rendered across multiple systems using temporal parallelism

The basic concept of data parallelism, on the other hand, is divide and conquer. Data parallelism decomposes a large data set into many small subsets, then uses multiple workstations to render these subsets simultaneously (see Figure 2). High-performance interconnects route the data subsets between the processing workstations, and one or more controlling units synchronize the distributed rendering tasks. When the rendering process completes, the final image can be compiled from the subsets on each workstation for display. Data parallelism is widely used by research industries and in software such as real-time simulation, virtual reality, virtual environment simulation, and scientific visualization applications.

Figure 2. Data subsets rendered across multiple systems using data parallelism

Object decomposition and image decomposition

A key step in data parallelism is decomposing large data sets, a step that can utilize one of two major approaches: object decomposition and image decomposition. In object decomposition, tasks are formed by partitioning the geometric description of the scene. Individual workstations partition and render subsets of the geometric data in parallel, producing pixels that must be integrated later into a final image. Image decomposition, in contrast, forms tasks by partitioning the image space: each task renders only the geometric objects that contribute to the pixels that physically belong to the space assigned to the task.

Figure 3 illustrates how a data set could be partitioned by these two approaches. In object decomposition, each workstation renders a single object in the data set: one renders the rectangle, and the other renders the circle. In image decomposition, each workstation renders half of the final image: one renders the left side, and the other renders the right side.

Figure 3. Parallel rendering processes using object decomposition and image decomposition

There are no absolute guidelines for choosing between object decomposition and image decomposition. Generally, object decomposition is suitable for applications with large-scale data sets, while image decomposition is suitable for applications requiring high resolution and a large image, such as a tiled display integrating multiple screens into a single display device.

Object decomposition can enhance load balancing and scalability compared with image decomposition by helping ease preprocessing and distribute objects evenly among processors. However, it does require a post-composition process to integrate the image subsets, because objects assigned to different processors may map to the same screen space. For example, in Figure 3, after rendering the circle and the rectangle individually, this post-composition step determines how they overlap to form the final image.
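The temporal parallelism approach described above, in which whole frames are farmed out to rendering nodes as in Figure 1, can be sketched in a few lines of Python. The node count, function names, and file-name scheme below are illustrative assumptions, not part of any real rendering toolkit:

```python
# A minimal sketch of temporal parallelism: whole frames are the unit of work,
# handed out to rendering nodes in round-robin order. The frame "renderer"
# here is a stand-in function; all names are hypothetical.

NUM_NODES = 4

def assign_frames(num_frames, num_nodes):
    """Give each node the list of frame numbers it will render in sequence."""
    schedule = {node: [] for node in range(num_nodes)}
    for frame in range(num_frames):
        schedule[frame % num_nodes].append(frame)
    return schedule

def render_frame(frame):
    """Stand-in for the real per-frame rendering work."""
    return f"frame-{frame:04d}.png"

schedule = assign_frames(10, NUM_NODES)
# Each node renders its assigned frames independently; total wall-clock time
# approaches (time per frame x total frames) / nodes when frames cost similar time.
outputs = {node: [render_frame(f) for f in frames]
           for node, frames in schedule.items()}
print(schedule[0])  # node 0 renders frames 0, 4, 8
```

This mirrors why the technique suits animation workloads: no single frame finishes faster, but the whole sequence completes in roughly 1/N of the time.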
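Data parallelism with image decomposition, the image-space path in Figure 3, can also be illustrated with a toy example: an 8 × 4 pixel image is split between two hypothetical workstations, each rasterizes only the objects touching its half, and a composition step stitches the halves together. The miniature scene and all names are invented for illustration:

```python
# Illustrative sketch of data parallelism with image decomposition:
# the image space is split into left and right halves, each "workstation"
# rasterizes only the objects that overlap its half, and a final
# composition step stitches the disjoint halves into one image.

WIDTH, HEIGHT = 8, 4

# A scene resembling Figure 3: a rectangle on the left, a circle-ish blob on the right.
SCENE = [
    {"name": "rectangle", "pixels": {(x, y) for x in range(1, 3) for y in range(1, 3)}},
    {"name": "circle",    "pixels": {(5, 1), (6, 1), (5, 2), (6, 2)}},
]

def render_region(scene, x_range):
    """Rasterize only the pixels that physically belong to this image region."""
    region = {}
    for obj in scene:
        for (x, y) in obj["pixels"]:
            if x in x_range:
                region[(x, y)] = obj["name"][0]  # mark pixel with object initial
    return region

def compose(regions):
    """Stitch the independently rendered regions into the final image."""
    final = {}
    for region in regions:
        final.update(region)  # regions are disjoint in image space
    return final

# Two "workstations", one per image half.
left  = render_region(SCENE, range(0, WIDTH // 2))
right = render_region(SCENE, range(WIDTH // 2, WIDTH))
image = compose([left, right])

rows = ["".join(image.get((x, y), ".") for x in range(WIDTH)) for y in range(HEIGHT)]
print("\n".join(rows))
```

Because the regions are disjoint, composition is trivial concatenation; this is exactly the property that object decomposition gives up in exchange for better load balancing.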


With large numbers of partitions, the post-composition step required by object decomposition can place heavy demands on communication networks and require a huge amount of computation power from the composition units.

Image decomposition helps eliminate the complexity of image integration by requiring only a final composition step to physically map the image parts together. However, this approach may have a potential side effect: loss of spatial coherence. This loss can occur because in image decomposition, a single geometric object may map to multiple regions in the image space, which requires such objects to be shared by multiple independent processors.

Sort-first and sort-last algorithms

Object space data includes geometric descriptions of the scene, such as polygons in 3D models. One of the challenges in parallel rendering is mapping this data from object space to image space. Because the original data set is partitioned and distributed to multiple processors, the basic parallel rendering algorithms are classified as sort-first or sort-last, depending on where the function mapping the object space to the image space occurs.

Sort-first is an initial preprocessing step for assigning objects to the appropriate processors based on a cross-space mapping policy. Sort-first algorithms perform the space mapping early in the rendering process, and the sort operation happens before the data set is partitioned and distributed. Sort-last algorithms are less sensitive than sort-first algorithms to the distribution of objects within the image, because the payload distribution is based on objects rather than screen regions. The mapping of the object space to the image space in sort-last algorithms occurs during composition, when pixels from individual processors are integrated into a final image. Sort-first algorithms are typically combined with image decomposition, and sort-last algorithms are typically combined with object decomposition (see Figures 4 and 5).

Figure 4. Parallel rendering process for display on tiled screens using sort-first algorithms with image decomposition

Figure 5. Parallel rendering process for display on a single screen using sort-last algorithms with object decomposition

Enhancing HPC cluster graphics with open source utilities

Parallel rendering typically requires a special software layer to exploit and coordinate distributed computational resources. Chromium and DMX, two popular open source utilities, can provide these capabilities.1

Chromium, developed by Lawrence Livermore National Laboratory, Stanford University, the University of Virginia, and Tungsten Graphics, is an open source software stack for parallel rendering on clusters of workstations. It runs on the Microsoft® Windows®, Linux®, IBM® AIX, Solaris, and IRIX operating systems, and is designed to increase three aspects of graphics scalability:

• Data scalability: Chromium can process increasingly larger data sets by distributing workloads to increasingly larger clusters.
• Rendering performance: Chromium can scale out rendering performance by aggregating commodity graphics hardware.

1 For more information, see chromium.sourceforge.net and dmx.sourceforge.net.

• Display performance: Chromium helps the system output large, high-resolution images such as those for a tiled display.

Chromium enhances HPC cluster graphics capabilities by supporting sort-first and sort-last rendering as well as hybrid parallel rendering, which combines both sort-first and sort-last algorithms. Hybrid parallel rendering in Chromium uses Python scripts to help increase system flexibility. For example, a single hardware architecture can support a Python script that configures the system with a sort-first image decomposition policy, a configuration that supports large tiled displays with high image resolution, or a modified script with a sort-last object decomposition algorithm that renders a large data set in parallel.

Flexible interfaces are a key feature of Chromium and allow it to support multiple application behavior requirements. One of these interfaces is a library that supports OpenGL application programming interfaces (APIs) but simply dispatches the OpenGL calls to the following processing chain for parallel rendering. This mechanism is transparent to applications, offering an easy way to deploy traditional OpenGL applications to parallel rendering environments, particularly for applications requiring large tiled displays.

Rendering very large, complex models requires applications with native parallel coding techniques. Chromium provides the necessary data scalability to perform these tasks with a set of parallel APIs that synchronize the rendering processes of multiple entities. These APIs can integrate seamlessly with standard OpenGL interfaces to support truly parallel rendering applications on clusters.

Using Chromium with Linux or UNIX® operating systems to drive large display walls with the X Window System introduces a minor problem: although the large images can span multiple screens, each screen is still managed by an independent X server. Integrating Chromium with DMX can help solve this problem. DMX allows a single X server to run across a cluster of systems such that the X display or desktop can present images across many physical displays. For example, a cluster of 12 workstations running one X server can provide an image to a large tiled display in a 4 × 3 screen configuration. Working with DMX, Chromium can render data sets and output large images to a unified X server, which controls multiple graphics cards connected to the physical displays and allows logical windows to cross the display's physical boundaries.

Open source utilities such as Chromium and DMX help simplify the deployment of parallel rendering on high-performance workstation clusters. Figure 6 shows an example software stack for Linux and UNIX operating systems. Each workstation in a cluster requires the three bottom layers: the graphics card, the driver, and the X Window System and OpenGL. Chromium and DMX create another layer that provides parallel rendering by utilizing the rendering resources of individual workstations. Adding a layer containing toolkits such as the open source Visualization Toolkit (VTK) and the OpenGL Utility Toolkit (GLUT) can help applications utilize the bottom layers. When running on Windows-based systems, DMX and the X Window System are not required; the system can use OS services to provide the necessary functionality.

Figure 6. Example software stack for parallel rendering on Linux- and UNIX-based clusters

Creating scalable HPC architectures for parallel rendering

Deploying parallel rendering technologies can help increase the cost-effectiveness, flexibility, and scalability of HPC architectures. Organizations can take advantage of common approaches such as temporal parallelism and data parallelism to help streamline data set rendering, as well as open source utilities such as Chromium and DMX to present graphics on high-resolution displays and provide the necessary data scalability.

The combination of HPC cluster architectures and parallel rendering can also accelerate the pace of research projects: for example, by allowing aerospace researchers to visualize the heat generated by wind on an airplane shell to help increase airplane safety and efficiency, enabling geologists to see through seismic zones and enhance crude oil yield from existing wells, and letting pharmaceutical researchers see how human genes interact with medicines to help accelerate drug development. By using HPC clusters and parallel rendering in these and other industries, organizations can successfully address large-scale problems on high-resolution data sets.

Li Ou, Ph.D., is a systems engineer in the Scalable Systems Group at Dell. He has a B.S. in Electrical Engineering and an M.S. in Computer Science from the University of Electronic Science and Technology of China, and a Ph.D. in Computer Engineering from Tennessee Technological University.

Yung-Chin Fang is a senior consultant in the Scalable Systems Group at Dell. He has published more than 30 articles on HPC and cyberinfrastructure management, and participates in HPC cluster–related open source groups as a Dell representative.

Onur Celebioglu is an engineering manager in the Scalable Systems Group at Dell. He has an M.S. in Electrical and Computer Engineering from Carnegie Mellon University.

Victor Mashayekhi, Ph.D., is the engineering manager for the Scalable Systems Group at Dell and is responsible for product development for HPC clusters, remote computing, unified communication, virtualization, custom solutions, and solutions advisors. Victor has a B.A., M.S., and Ph.D. in Computer Science from the University of Minnesota.
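As a closing illustration, the sort-first/sort-last distinction the article draws can be reduced to a small Python sketch: sort-first buckets objects by screen tile before any rendering happens, while sort-last lets each processor render whole objects and resolves overlap by depth only at composition time. The object list, tile map, and depth rule below are illustrative assumptions, not code from Chromium or any other library:

```python
# Illustrative contrast of sort-first and sort-last. Objects carry a
# screen-space bounding box and a depth; the "rendering" is just pixel
# filling, not real rasterization, and all names are hypothetical.

OBJECTS = [
    {"name": "A", "xs": range(0, 3), "ys": range(0, 2), "depth": 2},
    {"name": "B", "xs": range(2, 5), "ys": range(0, 2), "depth": 1},  # nearer, overlaps A
]

TILES = {0: range(0, 3), 1: range(3, 6)}  # two processors, image space split by x

def sort_first(objects, tiles):
    """Sort-first: map objects to tiles *before* rendering; each processor
    then rasterizes only what falls inside its own region of image space."""
    buckets = {t: [] for t in tiles}
    for obj in objects:
        for t, xs in tiles.items():
            if set(obj["xs"]) & set(xs):  # object touches this tile
                buckets[t].append(obj["name"])
    return buckets

def sort_last(objects):
    """Sort-last: each processor renders its whole objects; the mapping to
    image space happens at composition, using depth to resolve overlap."""
    framebuffer = {}  # pixel -> (depth, name)
    for obj in objects:  # conceptually, one object per processor here
        for x in obj["xs"]:
            for y in obj["ys"]:
                current = framebuffer.get((x, y))
                if current is None or obj["depth"] < current[0]:
                    framebuffer[(x, y)] = (obj["depth"], obj["name"])
    return {pix: name for pix, (_, name) in framebuffer.items()}

buckets = sort_first(OBJECTS, TILES)
composited = sort_last(OBJECTS)
print(buckets)             # A and B both touch tile 0; only B touches tile 1
print(composited[(2, 0)])  # overlap pixel resolved to the nearer object, B
```

Note how object B appears in both sort-first buckets (the spatial-coherence cost of image decomposition), while in sort-last each object is rendered exactly once and the overlap is paid for in the depth-resolving composition pass.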
