Volume Rendering on Scalable Shared-Memory MIMD Architectures
Jason Nieh and Marc Levoy
Computer Systems Laboratory
Stanford University

Abstract

Volume rendering is a useful visualization technique for understanding the large amounts of data generated in a variety of scientific disciplines. Routine use of this technique is currently limited by its computational expense. We have designed a parallel volume rendering algorithm for MIMD architectures based on ray tracing and a novel task queue image partitioning technique. The combination of ray tracing and MIMD architectures allows us to employ algorithmic optimizations such as hierarchical opacity enumeration, early ray termination, and adaptive image sampling. The use of task queue image partitioning makes these optimizations efficient in a parallel framework. We have implemented our algorithm on the Stanford DASH Multiprocessor, a scalable shared-memory MIMD machine. Its single address space and coherent caches provide programming ease and good performance for our algorithm. With only a few days of programming effort, we have obtained nearly linear speedups and near real-time frame update rates on a 48-processor machine. Since DASH is constructed from Silicon Graphics multiprocessors, our code runs on any Silicon Graphics workstation without modification.

1 Introduction

Volume visualization techniques are becoming of key importance in the analysis and understanding of multidimensional sampled data. The usefulness of volume rendering for visualizing such data has been demonstrated [6, 12], but the computational expense of this technique limits its routine and interactive use. In most volume rendering algorithms, resampling and compositing of the voxel array to form image space samples constitutes the single greatest computational expense.

A key advantage of ray tracing [12] over other volume resampling techniques [6, 8, 19] is that algorithmic optimizations have been developed which significantly reduce its image generation time. Examples of these optimizations are hierarchical opacity enumeration, early ray termination, and spatially adaptive image sampling [13, 14, 16, 18]. Optimized ray tracing is the fastest known sequential algorithm for volume rendering. Nevertheless, the computational requirements of the fastest sequential algorithm still preclude real-time volume rendering on the fastest computers today.

Parallel machines offer the computational power to achieve real-time volume rendering on usefully large datasets. This paper considers how parallel computing can be brought to bear in reducing the computation time of volume rendering. We have designed and implemented an optimized volume ray tracing algorithm that employs a novel task queue image partitioning technique on the Stanford DASH Multiprocessor [10], a scalable shared-memory MIMD (Multiple Instruction, Multiple Data) machine consisting of up to 64 (currently 48) high-performance RISC microprocessors. Shared-memory architectures permit straightforward implementations of our algorithm. The optimized ray tracer required only a few days to parallelize on DASH, as compared to several weeks to implement a similar algorithm using message passing on the Intel iPSC/860 [4]. Performance results demonstrate that our parallel ray tracer achieves good speedups despite requiring communication of 3D voxel values. When the 64-processor version of DASH is available in a few months, we expect to obtain frame update rates of 4 frames/sec on 256³-voxel datasets and 15 frames/sec on 128³-voxel datasets.

The paper is organized as follows. Section 2 presents the development of our parallel volume rendering algorithm for MIMD machines. Section 3 describes the architecture of DASH and discusses implementation issues for the algorithm on the target machine. Section 4 discusses the performance of the algorithm on DASH. Finally, we present conclusions and possibilities for future work.

Authors' addresses: Jason Nieh, Stanford University, ERL 411, Stanford, CA 94305, e-mail [email protected]; Marc Levoy, Stanford University, CIS 207, Stanford, CA 94305, e-mail [email protected].

2 A Parallel Rendering Algorithm

Optimized Ray Tracing. Our parallel volume rendering algorithm is based on ray tracing as described in [12]. Rays are cast from the viewing position through the volume data. The data is resampled at evenly spaced locations along each ray by trilinearly interpolating the values of surrounding voxels. Finally, the ray samples are composited to produce an image. The algorithms used for hierarchical opacity enumeration, early ray termination, and adaptive image sampling are from [13, 14]. Hierarchical opacity enumeration is performed by using a binary octree to avoid unnecessary resampling in transparent regions of the volume. Early ray termination is performed by terminating resampling along a ray when its accumulated opacity exceeds a user-selected threshold of opaqueness. Adaptive image sampling is performed by dividing the image plane into square sample regions measuring w_max pixels on a side and casting rays only from the four corner pixels of each region. Additional rays are cast only in those sample regions with high image complexity, as measured by the color difference of the corner pixels of the region. Any untraced pixels are then bilinearly interpolated from the traced pixels. Normals, opacities, and the octree are computed as a preprocessing step. A lookup table containing color as a function of the local voxel gradient is used to accelerate shading.
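To make the interaction of these optimizations concrete, the following sketch (ours, not code from the paper) shows a ray integration loop combining octree-based skipping of transparent regions, front-to-back compositing, and early ray termination. The types and helpers (Ray, octree_skip_distance, resample_opacity, shade_sample) are illustrative assumptions; the paper gives no source code.

typedef struct { float r, g, b; } Color;
typedef struct { float org[3], dir[3]; } Ray;

#define SAMPLE_SPACING 1.0f             /* distance between ray samples */

/* Assumed helpers (hypothetical names): binary octree transparency
 * query, trilinear opacity resampling, and shading via the
 * gradient-indexed color lookup table. */
extern float octree_skip_distance(const Ray *ray, float t); /* 0 if region is non-transparent */
extern float resample_opacity(const Ray *ray, float t);
extern Color shade_sample(const Ray *ray, float t);

Color trace_ray(const Ray *ray, float t_max, float opacity_threshold)
{
    Color acc = { 0.0f, 0.0f, 0.0f };   /* accumulated color   */
    float alpha = 0.0f;                 /* accumulated opacity */

    for (float t = 0.0f; t < t_max; t += SAMPLE_SPACING) {
        /* Hierarchical opacity enumeration: skip transparent octree
         * regions without resampling them. */
        float skip = octree_skip_distance(ray, t);
        if (skip > 0.0f) {
            t += skip - SAMPLE_SPACING; /* loop header re-adds the spacing */
            continue;
        }

        /* Trilinearly resample and shade at this sample point. */
        float a = resample_opacity(ray, t);
        Color c = shade_sample(ray, t);

        /* Front-to-back compositing. */
        float w = (1.0f - alpha) * a;
        acc.r += w * c.r;
        acc.g += w * c.g;
        acc.b += w * c.b;
        alpha += w;

        /* Early ray termination at the user-selected threshold. */
        if (alpha >= opacity_threshold)
            break;
    }
    return acc;
}

Because compositing proceeds front to back, the accumulated opacity grows monotonically, which is what makes the early-termination test safe.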
Task Queue Image Partitioning. Using a MIMD architecture, rays can be efficiently traced in parallel. Strategies for partitioning the ray tracing computation among the processors generally fall into two classes: static and dynamic. There are two basic options for static partitioning: contiguous and interleaved. Contiguous partitioning divides the image plane into a few large blocks and statically assigns one block to each processor [5, 20]. Interleaved partitioning divides the image plane into many small image tiles and assigns the tiles to processors in round-robin order such that each processor computes multiple tiles from different regions of the image plane. Contiguous schemes for volume rendering suffer from poor load distribution because different sections of the image differ in image complexity, and the distribution of image complexity changes with viewing position. Interleaved partitioning distributes load better, but image tiles must be so small to achieve even modest load distribution that there will be excessive overlapping voxel accesses.

Dynamic partitioning divides the image plane into image tiles and places the tiles on a single work queue. Processors grab tiles from the work queue to ray trace until the queue is empty. Dynamic partitioning provides better load distribution than static partitioning, but the centralized point of control becomes a bottleneck as more processors are used. Dynamic as well as static interleaved partitioning also incurs excessive costs associated with pixel sharing in adaptive image sampling, as discussed later in this section.

To distribute the ray tracing computation among the processors, we employ a hybrid scheme that provides good load distribution and reduces overlapping voxel accesses, described as follows (a code sketch of the scheme appears after Figure 1):

1. For P processors, the image plane is statically partitioned into P contiguous image blocks, and each processor is initially assigned one of these blocks.

2. Each block is divided into image tiles, which form the task queue of the processor that owns the block; each processor ray traces the tiles of its own block first.

3. If a processor finishes early, it selects another processor whose block still contains unprocessed tiles and grabs tiles in that processor's image block to ray trace. Processors compute until all the tiles are completed.

Figure 1 illustrates a four-processor example of how the image plane is partitioned.

[Figure 1: Task queue image partitioning example. The figure's annotations read: "Each processor is initially assigned one of these image blocks" and "If a processor finishes early, it grabs an unprocessed image tile such as this one from another processor."]
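The sketch below illustrates the task queue scheme under stated assumptions: one lock-protected tile queue per processor, with idle processors grabbing tiles from other processors' queues. POSIX threads and the linear victim scan are our simplifications for illustration; the DASH implementation's synchronization primitives and exact tile-grabbing policy are not shown in this excerpt.

#include <pthread.h>
#include <stdio.h>

#define NPROCS          4     /* P processors, as in the Figure 1 example */
#define TILES_PER_BLOCK 64    /* tiles in each contiguous image block     */

typedef struct {
    pthread_mutex_t lock;
    int next;                 /* index of the next unprocessed tile */
} TileQueue;

static TileQueue queues[NPROCS];

/* Placeholder for ray tracing all pixels of one tile. */
static void trace_tile(int block, int tile)
{
    (void)block; (void)tile;  /* rendering omitted in this sketch */
}

/* Atomically take one tile from block b; returns -1 when b is drained. */
static int grab_tile(int b)
{
    int t = -1;
    pthread_mutex_lock(&queues[b].lock);
    if (queues[b].next < TILES_PER_BLOCK)
        t = queues[b].next++;
    pthread_mutex_unlock(&queues[b].lock);
    return t;
}

static void *worker(void *arg)
{
    int me = (int)(long)arg;
    int t;

    /* Ray trace the spatially adjacent tiles of our own block first. */
    while ((t = grab_tile(me)) >= 0)
        trace_tile(me, t);

    /* Our block finished early: grab unprocessed tiles from the other
     * processors' blocks until every queue is empty. */
    for (int b = 0; b < NPROCS; b++)
        while ((t = grab_tile(b)) >= 0)
            trace_tile(b, t);

    return NULL;
}

int main(void)
{
    pthread_t tid[NPROCS];

    for (long i = 0; i < NPROCS; i++) {
        pthread_mutex_init(&queues[i].lock, NULL);
        pthread_create(&tid[i], NULL, worker, (void *)i);
    }
    for (int i = 0; i < NPROCS; i++)
        pthread_join(tid[i], NULL);

    puts("all tiles completed");
    return 0;
}

Because each processor drains its own spatially contiguous block before grabbing tiles elsewhere, most voxel accesses stay local to one processor, and contention is spread across P queues rather than funneled through one.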
The motivation for using image tiles as the unit of parallelism instead of scan lines [3, 5, 20], and for assigning spatially adjacent tiles to the same processor, is to minimize overlapping voxel accesses and to reduce the costs associated with pixel sharing in adaptive image sampling. The tiles, initially assigned to one processor but available for grabbing by other processors, can be thought of as a queue of tasks. Because it is dynamic, such a task queue image partitioning scheme provides better load distribution than static partitioning schemes, as demonstrated in Section 4. In addition, there are as many task queues as there are processors and the management of the queues is distributed, thereby providing better scalability by avoiding the bottleneck of a centralized point of control in typical dynamic schemes.

Pixel Sharing in Adaptive Image Sampling. While there are no computational dependencies between image tiles in nonadaptive ray tracing, there are such dependencies with adaptive image sampling. This pixel sharing arises because adjacent image sample regions share the corner pixel values required for measuring their image complexity. Note that sample regions are equal to
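To make the sharing pattern concrete, the short program below (an illustration of ours, not from the paper) prints the corner pixels of one sample region: with square regions w pixels on a side, corners lie on a lattice of stride w, so each interior lattice pixel is a corner of four adjacent regions and need only be traced once. The value w = 8 and the region indices are arbitrary examples.

#include <stdio.h>

int main(void)
{
    int w = 8;          /* sample region width (the paper's w_max) */
    int i = 3, j = 5;   /* indices of one sample region            */

    /* The four corner pixels of region (i, j). */
    printf("region (%d,%d) corners: (%d,%d) (%d,%d) (%d,%d) (%d,%d)\n",
           i, j,
           i * w,       j * w,
           (i + 1) * w, j * w,
           i * w,       (j + 1) * w,
           (i + 1) * w, (j + 1) * w);

    /* The corner ((i+1)*w, (j+1)*w) is also a corner of regions
     * (i+1, j), (i, j+1), and (i+1, j+1): one traced ray serves all
     * four complexity measurements. */
    return 0;
}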