MITSUBISHI ELECTRIC RESEARCH LABORATORIES http://www.merl.com
The VolumePro Real-Time Ray-Casting System
Hanspeter Pfister, Jan Hardenbergh, Jim Knittel, Hugh Lauer, Larry Seiler
TR99-19 April 1999
Abstract: This paper describes VolumePro, the world's first single-chip real-time volume rendering system for PC-class computers. VolumePro implements object-space ray-casting with parallel slice-by-slice processing. Our discussion of the architecture focuses on the rendering pipeline, the memory organization, and sectioning, which is a data partitioning scheme to reduce on-chip buffer requirements. We describe several advanced features of VolumePro, such as gradient magnitude modulation of opacity and illumination, supersampling, supervolumes, and crop and cut planes. The system renders more than 500 million interpolated, Phong-illuminated, composited samples per second. This is sufficient to render a 256 x 256 x 256 volume at 30 frames per second.
SIGGRAPH '99
This work may not be copied or reproduced in whole or in part for any commercial purpose. Permission to copy in whole or in part without payment of fee is granted for nonprofit educational and research purposes provided that all such whole or partial copies include the following: a notice that such copying is by permission of Mitsubishi Electric Research Laboratories, Inc.; an acknowledgment of the authors and individual contributions to the work; and all applicable portions of the copyright notice. Copying, reproduction, or republishing for any other purpose shall require a license with payment of fee to Mitsubishi Electric Research Laboratories, Inc. All rights reserved.
Copyright © Mitsubishi Electric Research Laboratories, Inc., 1999. 201 Broadway, Cambridge, Massachusetts 02139
1. First printing, TR99-19, January 1999. 2. Accepted to SIGGRAPH '99, April 1999.
Figure 1: Several volumes rendered at 30 frames per second on VolumePro.
1 Introduction
Over the last decade, direct volume rendering has become an invaluable visualization technique for a wide variety of applications. Examples include visualization of 3D sampled medical data (CT, MRI), seismic data from oil and gas exploration, or computed finite element models. While volume rendering is a very popular visualization technique, the lack of interactive frame rates has limited its widespread use. Rendering one frame typically takes several seconds due to the tremendous storage and processing requirements.

Highly optimized software techniques for volume rendering try to address this problem by using pre-processing to compute, for example, object illumination or regions of the volume that can be skipped during rendering [6, 8, 14]. However, pre-processing prohibits immediate visual feedback during parameter changes. Each time rendering parameters such as voxel transparency or illumination are changed, the lengthy pre-processing must be repeated before the new image is displayed. Furthermore, the pre-computed values typically increase the storage requirements by a factor of three to four.

Another approach to rendering volume data uses texture mapping hardware, a common feature of modern 3D polygon graphics accelerators. It can be used for volume rendering by applying a method called planar texture resampling [1, 4]. The volume is stored in 3D texture memory and resampled during rendering by extracting textured planes parallel to the image plane. However, current texture hardware does not support gradient estimation and per-sample illumination for real-time shading. In addition, although recent PC graphics cards support 3D texture mapping, interactive performance still requires expensive high-end graphics workstations.

To overcome these limitations, we have developed VolumePro, a special-purpose hardware system for real-time volume rendering on PC-class computers. VolumePro produces 500 million interpolated, Phong
illuminated, composited samples per second. That is sufficient to render datasets with up to 16 million voxels (e.g., 256 x 256 x 256) at 30 frames per second.

A VolumePro system consists of the VolumePro PCI card, a companion 3D graphics card, and software. The VolumePro PCI card contains up to 128 MB of volume memory and the VoxelBlaster vg500 rendering chip. The VoxelBlaster Interface (VLI) is a collection of C++ objects that provides the application programming interface to the VolumePro features.

VolumePro is based on the Cube-4 volume rendering architecture developed by SUNY at Stony Brook [10]. It draws liberally from Cube-4 and also from another published system, Enhanced Memory Cube-4 (EM-Cube) [9], but it makes major enhancements to both. This paper is the first detailed description of the VolumePro architecture and system, including a complete description of features and rendering results.

The next section describes the ray-casting algorithm and the slice-by-slice processing order that are at the heart of VolumePro. In Section 3 we take an in-depth look at the VolumePro rendering pipeline. We describe novel features such as per-sample modulation of illumination by the gradient magnitude. Section 4 describes several of the advanced features of VolumePro. In Section 5 we discuss the architecture of the vg500 rendering chip, which contains four rendering pipelines and uses a novel high-bandwidth memory organization for volume data. Section 6 gives an overview of the VolumePro PCI card, and Section 7 briefly discusses the main classes of the VLI application interface.
2 Rendering Algorithm
VolumePro implements ray-casting [7], one of the most commonly used image-order volume rendering algorithms. The current version of VolumePro supports parallel projections of rectilinear isotropic and anisotropic volume data sets. Ray-casting offers high image quality and is easy to parallelize.

To achieve uniform data access and to get a one-to-one mapping of ray samples onto voxels, we use a ray-casting technique with object-order data traversal similar to shear-warp rendering [6]. Instead of casting rays in image space, rays are sent into the dataset from each pixel on the base plane, which is the face of the volume data that is most parallel to the viewing plane. The resulting base plane image is then projected onto the image plane using a 3D transformation and a 2D image re-sampling operation. VolumePro uses 2D texture mapping on a companion graphics card for this final image warp (see Section 6).

The main advantage of the shear-warp algorithm is that voxels can be read and processed in planes of voxels, called slices, that are parallel to the base plane (see Figure 2). Within a slice, voxels are read from
memory a scanline of voxels (called a voxel beam) at a time, in top-to-bottom order. This leads to regular, object-order data access. In addition, it allows parallelism by having multiple rendering pipelines work on several voxels in a voxel beam at the same time. This concept is explained further in Section 5.

Figure 2: Slice-by-slice processing. [Diagram: the x, y, and z axes of the volume; beams within a slice; slices; parallel processing of four voxels.]

The next section describes the pipelined approach to ray-casting in VolumePro. Our explanation deliberately and temporarily omits some important aspects of the VolumePro architecture. For example, we describe only one pipeline, even though VolumePro uses four parallel rendering pipelines. This is rectified later in Section 5 when the vg500 chip architecture is discussed.
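To make this traversal order concrete, the loop below is a minimal C++ sketch of the slice-by-slice, beam-by-beam, voxel-by-voxel ordering described above. The function and parameter names are illustrative only; they are not taken from the vg500 hardware or the VLI.

```cpp
// Object-order traversal: slices parallel to the base plane, beams
// (voxel scanlines) within a slice, voxels within a beam. Each voxel is
// visited exactly once, matching VolumePro's one-read-per-frame property.
template <typename Visit>
void traverseSliceBySlice(int nx, int ny, int nz, Visit visit) {
    for (int z = 0; z < nz; ++z)          // slices, front to back
        for (int y = 0; y < ny; ++y)      // beams, top to bottom
            for (int x = 0; x < nx; ++x)  // voxels along a beam
                visit(x, y, z);
}
```

In the real chip, four pipelines consume four adjacent voxels of a beam per cycle (Section 5); the single loop above corresponds to the one-pipeline view used in Section 3.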
3 The Ray-Casting Pipeline
A characteristic of VolumePro is that each voxel is read from memory exactly once per frame. Therefore, voxel values must be recirculated through the processing stages of the VolumePro pipeline so that they become available for calculations precisely when needed. This recirculation is at the heart of the VolumePro architecture and is described in this section.

A second characteristic of VolumePro is that the reading and processing of voxel data and the calculation of pixel values on rays are highly pipelined. One VolumePro processing pipeline can accept a new voxel every cycle, and can forward intermediate results to subsequent pipeline stages every cycle, without any pipeline delays. The net effect is that VolumePro can render a volume data set at the speed of reading voxels.

Figure 3 illustrates a flow diagram of an idealized version of the VolumePro processing pipeline. In this
[Diagram: The Address Generator sends voxel addresses, in slice-by-slice order, to Voxel Memory, which emits voxel values in slice-by-slice order. The Weight Generator supplies wx, wy, and wz values in slice-by-slice order. Voxel values pass through the Interpolation unit, which recirculates data through a Voxel Slice FIFO, Voxel Beam FIFO, and Voxel Shift Register. The resulting sample values, in slice-by-slice order, feed the Gradient Estimation unit, with its Sample Slice FIFOs, Sample Beam FIFOs, Sample Shift Registers, and a sample delay that emits sample values in delayed slice-by-slice order alongside the gradients at sample points. The Classification & Shading unit applies color and opacity lookup tables and reflectance maps to produce color and opacity assignments for sample points. Finally, the Compositing unit, with its ray selectors, Ray Shift Register, and Ray Slice FIFO, accumulates partially composited rays and outputs the pixel values of rays.]
Figure 3: Conceptual ray-casting pipeline.
figure, the flow and processing of voxel data is shown in black, and control information is shown in red. The figure shows interpolation followed by gradient estimation. In actual practice, estimation of z-gradients happens before interpolation (see Section 3.2); showing this would complicate the flow diagram, so it has been omitted from the figure.
(a) [0, 0, 0] (b) [0, 0, 1] (c) [1, 0, 0] (d) [1, 1, 1]
Figure 4: Images rendered with different values of [Gmom, Gmim_d, Gmim_s]. Notice how the appearance of surfaces changes.
3.1 Interpolating Voxel Values

At the top of Figure 3 is Volume Memory, also called Voxel Memory. The current generation of VolumePro supports 8- and 12-bit scalar voxels. All internal voxel datapaths of the vg500 chip are 12 bits wide. Voxels are read from memory and presented to the Interpolation unit in the slice-by-slice, beam-by-beam order described in Section 2, one voxel per cycle. The purpose of the Interpolation unit is to convert the stream of voxel values into a stream of sample values, also in slice-by-slice, beam-by-beam order, at the rate of one sample value per cycle. Each sample value is derived from its eight nearest neighboring voxels by trilinear interpolation.

Therefore, the Interpolation unit must have on hand not only the current voxel from the input stream, but also all other voxels from the same trilinear neighborhood. It does this by storing input voxels and other intermediate values in the Voxel Slice FIFO, Voxel Beam FIFO, and Voxel Shift Register. Each of these is a data storage element sized so that when a newly calculated value is inserted, it emerges again at precisely the time needed for a subsequent calculation.
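As an illustration of how such storage elements might be sized (this is our inference from the text, not a published specification of the vg500), the sketch below assumes an nx x ny slice and a 2x2x2 trilinear neighborhood: the last voxel of a neighborhood arrives one voxel, one beam, and one slice after the first, so the three buffers hold exactly those delays. All names here are hypothetical.

```cpp
// Hypothetical sizing of the Interpolation unit's recirculation storage.
// A sample's 2x2x2 neighborhood is complete only when its last voxel
// arrives, one voxel + one beam + one slice after the first voxel.
struct RecirculationSizes {
    int voxelShiftRegister;  // holds the previous voxel within a beam
    int voxelBeamFifo;       // holds the previous beam within a slice
    int voxelSliceFifo;      // holds the previous slice within the volume
};

RecirculationSizes interpolationBuffers(int nx, int ny) {
    return {1, nx, nx * ny};
}
```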
Trilinear interpolation also requires a set of weights wx, wy, and wz. These are generated by the Weight Generator in Figure 3. This generator is based on a digital differential analyzer (DDA) algorithm [3] to calculate each new set of weights from the previous weights. Since all rays are parallel to each other and have exactly the same spacing as the pixel positions on the base plane (and hence, as the voxels of each slice), a single set of weights is sufficient for all of the samples of a slice. The Weight Generator, therefore, computes a new wx, wy, and wz prior to the beginning of each slice, based on the angle of the rays to the base plane and the initial offset between rays and voxels on the first slice. The interpolated sample values are rounded to 12-bit precision.
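The two mechanisms just described, one weight triple per slice and a DDA-style update of that triple, can be sketched in C++ as follows. This is a floating-point software approximation: the actual datapath is fixed-point (samples are rounded to 12 bits), and all names are ours, not the hardware's.

```cpp
#include <cmath>

// Trilinear interpolation of a 2x2x2 voxel neighborhood. v is stored
// x-fastest: v[0]=(0,0,0), v[1]=(1,0,0), v[2]=(0,1,0), v[3]=(1,1,0),
// v[4]=(0,0,1), v[5]=(1,0,1), v[6]=(0,1,1), v[7]=(1,1,1).
double trilinear(const double v[8], double wx, double wy, double wz) {
    double x00 = v[0] + wx * (v[1] - v[0]);
    double x10 = v[2] + wx * (v[3] - v[2]);
    double x01 = v[4] + wx * (v[5] - v[4]);
    double x11 = v[6] + wx * (v[7] - v[6]);
    double y0 = x00 + wy * (x10 - x00);   // blend along y, front slice
    double y1 = x01 + wy * (x11 - x01);   // blend along y, back slice
    return y0 + wz * (y1 - y0);           // blend along z
}

// DDA-style per-slice weight update: add the per-slice ray increment
// (derived from the ray angle to the base plane) and keep the fraction.
void stepWeight(double& w, double dw) {
    w += dw;
    w -= std::floor(w);
}
```

With equal voxel and pixel spacing, calling stepWeight once per slice on each of wx, wy, and wz reproduces the per-slice weight schedule the text describes.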
3.2 Gradient Estimation

In our idealized model, the output of the Interpolation unit is a stream of sample values, one per cycle in slice-by-slice, beam-by-beam order. This stream is presented to the Gradient Estimation unit, which estimates the x-, y-, and z-components of the gradient at each sample using a central difference gradient filter [5]. In principle, gradients can be estimated in either of two ways with equal results:

1. Interpolate voxels to produce samples, then take central differences or some other (linear) convolution to estimate gradients on those samples; or
2. Estimate gradients on voxels, then interpolate those gradients to obtain the gradients at the sample points.

VolumePro divides the estimation of gradients between these two approaches. The z-components of gradients (simply called z-gradients in the rest of this document) are estimated from voxel values and interpolated to obtain z-gradients at sample points. The x- and y-gradients are estimated from the interpolated sample values themselves.
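A central difference gradient [5] at a point needs the previous and next neighbor along each axis, which is exactly why the hardware must run a voxel, a beam, and a slice behind the sample stream, as described below. The sketch computes all three components from one sample grid; it does not reproduce VolumePro's split, where z-gradients come from voxels and are then interpolated. The grid layout and names are illustrative.

```cpp
#include <vector>

// Minimal sample grid, x-fastest layout; names are illustrative.
struct Grid {
    int nx, ny, nz;
    std::vector<double> v;
    double at(int x, int y, int z) const { return v[(z * ny + y) * nx + x]; }
};

// Central-difference gradient at an interior point (x, y, z): each
// component reads one neighbor on either side along its axis.
void gradient(const Grid& g, int x, int y, int z, double out[3]) {
    out[0] = 0.5 * (g.at(x + 1, y, z) - g.at(x - 1, y, z));  // x-gradient
    out[1] = 0.5 * (g.at(x, y + 1, z) - g.at(x, y - 1, z));  // y-gradient
    out[2] = 0.5 * (g.at(x, y, z + 1) - g.at(x, y, z - 1));  // z-gradient
}
```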
The gradient at a particular sample point depends not only on data in the current slice but also on data in the previous and next slice. The next slice, of course, has not been read when the current sample value arrives
at the input. Therefore, the Gradient Estimation unit operates one voxel, one beam, and one slice behind the Interpolation unit. That is, when sample S_{x+1,y+1,z+1} arrives, the Gradient Estimation unit finally has enough information to complete the estimation of the gradient at sample point S_{x,y,z}. It does this by maintaining two full slices of buffering, plus two full beams of buffering, plus two extra samples. These are stored in the Sample Slice FIFOs, Sample Beam FIFOs, and Sample Shift Registers, respectively, which serve roughly the same functions as the Voxel Slice FIFO, Voxel Beam FIFO, and Voxel
Shift Register. Each central difference gradient component has signed 8-bit precision, meaning it can repre-