Ray Casting Architectures for Volume Visualization
Harvey Ray, Hanspeter Pfister, Deborah Silver, Todd A. Cook
Harvey Ray is a Ph.D. student at Rutgers State University.
Hanspeter Pfister is with Mitsubishi Electric Research.
Deborah Silver is an associate professor at Rutgers State University.
Todd Cook is a research and development engineer at Improv System Inc.

Abstract: Real-time visualization of large volume datasets demands high performance computation, pushing the storage, processing, and data communication requirements to the limits of current technology. General purpose parallel processors have been used to visualize moderate size datasets at interactive frame rates; however, the cost and size of these supercomputers inhibits the widespread use for real-time visualization. This paper surveys several special purpose architectures that seek to render volumes at interactive rates. These specialized visualization accelerators have cost, performance, and size advantages over parallel processors. All architectures implement ray casting using parallel and pipelined hardware. We introduce a new metric that normalizes performance to compare these architectures. The architectures included in this survey are VOGUE, VIRIM, Array Based Ray Casting, EM-Cube, and VIZARD II. We also discuss future applications of special purpose accelerators.

I. Introduction

Volume visualization is an important tool to view and analyze large amounts of data from various scientific disciplines. It has numerous applications in areas such as biomedicine, geophysics, computational fluid dynamics, finite element models, and computational chemistry. Numerical simulations and sampling devices such as magnetic resonance imaging (MRI), computed tomography (CT), satellite imaging, and sonar are common sources of large 3D datasets. These datasets are generally anywhere from 128^3 to 1024^3 and may be non-symmetric (i.e., 1024 x 1024 x 512).

Volume rendering involves the projection of a volume dataset onto a 2D image plane. From Figure 1 we see that a volume dataset is organized as a 3D array of volume elements, or voxels^1.

Fig. 1. Volume dataset.

Voxels represent various physical characteristics, such as density, temperature, velocity, and pressure. Other measurements, such as area and volume, can be extracted from the volume datasets. Volume data may contain more than a hundred million voxel values requiring a large amount of storage. In Figure 1, the voxels are uniform in size and regularly spaced on a rectilinear grid. Other types of volume data can be classified into curvilinear grids, which can be thought of as resulting from a warping of a regular grid, and unstructured grids, which consist of arbitrary shaped cells. This paper presents a survey of recent custom volume rendering architectures that seek to achieve interactive volume rendering for rectilinear datasets. A survey of other methods used to achieve real time volume rendering is presented in [47]. The motivation for custom volume renderers is discussed in the next section. Several other custom architectures exist [1], [8], [10], [16], [17], [19], [36], [38] but were not presented because they are either related to the architectures presented here or are not considered to be recent. Section III presents three parallel volume rendering algorithms that are implemented by the architectures in this paper. Major components of a volume rendering system are discussed in Section IV. Five specialized volume rendering architectures are surveyed in Section V. A new metric is introduced in Section VI to compare each architecture. A comparison of the surveyed architectures is presented in Section VII and a discussion is presented in Section VIII. Future trends for specialized rendering architectures are presented in Section IX.

^1 Note, the term voxel has been used to refer to point samples and cubic volume elements. The papers surveyed here use both definitions for illustration purposes. Therefore, figures in this paper will use a point sample representation or a unit volume representation of a voxel as necessary.

II. Need for Custom Visualization Architectures

A real-time volume rendering system is important for the following reasons [37]: (1) to visualize rapidly changing 4D (spatial-temporal) datasets, (2) for real-time exploration of 3D datasets (e.g., virtual reality), (3) for interactive manipulation of visualization parameters (e.g., classification), and (4) interactive volume graphics [21]. As the sampling rates of devices become faster, it will be possible to generate several 3D datasets at interactive rates; real-time volume rendering is required to visualize these dynamically changing datasets (e.g., for 3D ultrasound [43], [45]). It is often necessary to view the dataset from continuously changing positions to better understand the data being visualized; real-time volume rendering will enhance visual depth cues through motion and occlusion as the dataset is viewed from varying positions. Classification is important in correctly visualizing the dataset by configuring object properties (opacity, color, etc.) based on voxel values;
classification is an iterative process which will benefit from real-time volume rendering; thus, scientists will be able to interactively manipulate opacity and color mappings. Volume graphics is an emerging area of research that produces synthetic datasets [21]. Volume graphics challenges the way 3D graphics is currently implemented. Traditional 3D graphics uses polygonal meshes to model objects and these meshes are scan-converted into pixels inside the frame buffer. Alternatively, volume graphics models objects as a 3D discrete set of point samples (voxels). These voxels comprise the 3D dataset. The dataset is rendered using standard volume visualization techniques.

Real-time visualization of large 3D datasets places stringent computational demands on modern workstations, especially on the memory system. Table I estimates the memory bandwidth to render different size datasets at 30Hz. It is assumed that the volume rendering algorithm accesses each voxel once per projection. The required memory bandwidth cannot be sustained on most modern workstations and personal computers. The dataset must be partitioned among multiple memory modules to achieve the desired bandwidth and parallel processing must be used.

TABLE I
Estimated memory bandwidth for real-time volume rendering.

Dataset Size   Voxel Size (bits)   Frame Rate (Hz)   Memory Bandwidth
128^3          16                  30                120 MB/s
256^3          16                  30                960 MB/s
512^3          16                  30                7.5 GB/s
1024^3         16                  30                60 GB/s

Massively parallel processors and multiprocessor architectures [2], [4], [13], [26], [42], [50] have achieved image generation rates up to 30Hz on moderate sized datasets using algorithmic optimizations; however, the cost of these machines is prohibitive. In addition, the algorithmic optimizations are usually dataset dependent. Custom architectures have the potential to match or exceed the performance of other interactive visualization solutions at a lower cost and smaller size. Performance, cost, and size benefits are necessary for a desktop interactive visualization system.

III. Parallel Ray Casting

Volume rendering involves the direct projection of the entire 3D dataset onto a 2D display. Volume rendering algorithms can simultaneously reveal multiple surfaces, amorphous structures, and other internal structures of a 3D dataset [18]. These algorithms can be divided into two categories: forward-projection and backward-projection. Forward-projection algorithms iterate over the dataset during the rendering process, projecting voxels onto the image plane. A common forward-projection algorithm is splatting [46]. Backward-projection algorithms iterate over the image plane during the rendering process, re-sampling the dataset at evenly spaced intervals along each viewing ray. In general, ray casting algorithms traverse the dataset in a more random manner. All architectures surveyed in this paper implement ray casting, a common backward-projection algorithm [28]. The ray casting algorithm is capable of producing high-quality images and a large degree of parallelism can be exploited from the algorithm.

In ray casting, rays are cast into the dataset. Each ray originates at the viewing (eye) position, penetrates a pixel in the image plane (screen), and passes through the dataset. At evenly spaced intervals along the ray, sample values are computed using interpolation. The sample values are mapped to display properties such as opacity and color. A local gradient is combined with a local illumination model at each sample point to provide realistic shading of the object. Final pixel values are found by compositing color and opacity values along a ray. Composition models the physical reflection and absorption of light.

Because of the high computational requirements of volume rendering, the data needs to be processed in a pipelined and parallel manner. Parallel ray casting algorithms use one of the following processing strategies: object order, image order, or hybrid order [14]. This division describes the manner in which a dataset is processed. Figure 2 illustrates the three variations.

Fig. 2. Ray casting categories: A) Image order, B) Object order, C) Hybrid order.

A dataset in Figure 2 is organized as a set of parallel slices. Image order algorithms cast rays through the image plane and re-sample at locations along the ray (Figure 2A). They offer flexibility for algorithmic optimizations, but accessing the volume memory in a non-predictable manner significantly slows down memory performance. Object order algorithms require that the dataset be re-sampled so that the slices are aligned with the view direction (Figure 2B). A major advantage of object order algorithms is that accesses to the volume memory are predictable, thereby leading to efficient memory bandwidth utilization. Hybrid order algorithms project the dataset to the face of the dataset most parallel to the image plane. This also allows for predictable memory accesses to the volume data where no more than one sample is taken per voxel. The intermediate 2D image is warped into the final image (Figure 2C). The shear-warp algorithm is an example of a hybrid order algorithm [27]. A summary of implementation tradeoffs for each parallel scheme is shown in Table II.

TABLE II
Tradeoffs for different parallelization methods.

Image order
  Advantages:    - Easy to implement algorithmic optimizations (e.g., early-ray termination)
  Disadvantages: - "Random" memory access patterns
                 - Non-uniform mapping of ray samples to voxels
Object order
  Advantages:    - Regular memory access patterns
  Disadvantages: - Difficult to implement algorithmic optimizations
Hybrid order
  Advantages:    - Merge benefits of image order and object order algorithms
  Disadvantages: - Perspective projections adversely affect performance
                 - Additional 2D image warp required

IV. Components of a Ray Casting System

The following components are needed for any ray casting implementation:

Memory system provides the necessary voxel values at a rate which ultimately dictates the performance of the architecture.
Ray-path calculation determines the voxels that are penetrated by a given ray; it is tightly coupled with the organization of the memory system.
Interpolation estimates the value at a re-sample location using a small neighborhood of voxel values.
Gradient estimation estimates a surface normal using a neighborhood of voxels surrounding the re-sample location.
Classification maps interpolated sample values and the estimated surface normal to a color and opacity.
Shading uses gradient and classification information to compute a color that takes into account the interaction of light on the estimated surfaces in the dataset.
Composition uses shaded color values and opacity to compute a final pixel color for display.
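The components listed above can be sketched as a single-ray software pipeline. This is a minimal illustration under stated assumptions, not a description of any surveyed hardware: the volume is a nested list indexed as vol[z][y][x] with at least two voxels per axis, the ray direction is a unit vector that also serves as the light direction, the transfer function in classify is hypothetical (constant color, opacity equal to the clamped sample value), shading is diffuse-only rather than full Phong, and all function names are invented for this example.

```python
import math


def trilinear(vol, x, y, z):
    """Interpolation: trilinear re-sampling of vol[z][y][x], clamped to the grid."""
    nz, ny, nx = len(vol), len(vol[0]), len(vol[0][0])
    x = min(max(x, 0.0), nx - 1.0)
    y = min(max(y, 0.0), ny - 1.0)
    z = min(max(z, 0.0), nz - 1.0)
    # Base voxel of the 2 x 2 x 2 neighborhood and fractional offsets i, j, k.
    x0, y0, z0 = min(int(x), nx - 2), min(int(y), ny - 2), min(int(z), nz - 2)
    i, j, k = x - x0, y - y0, z - z0
    value = 0.0
    for c in (0, 1):
        for b in (0, 1):
            for a in (0, 1):
                weight = (i if a else 1 - i) * (j if b else 1 - j) * (k if c else 1 - k)
                value += weight * vol[z0 + c][y0 + b][x0 + a]
    return value


def gradient(vol, x, y, z):
    """Gradient estimation: central differences on interpolated samples.

    Uses unit spacing and the usual 1/2 scale; the scale cancels once the
    gradient is normalized for shading.
    """
    gx = (trilinear(vol, x + 1, y, z) - trilinear(vol, x - 1, y, z)) / 2.0
    gy = (trilinear(vol, x, y + 1, z) - trilinear(vol, x, y - 1, z)) / 2.0
    gz = (trilinear(vol, x, y, z + 1) - trilinear(vol, x, y, z - 1)) / 2.0
    return gx, gy, gz


def classify(value):
    """Classification: hypothetical transfer function (color 1.0, opacity = clamped value)."""
    return 1.0, min(max(value, 0.0), 1.0)


def shade(color, grad, light):
    """Shading: diffuse-only stand-in for Phong; flat regions keep their color."""
    magnitude = math.sqrt(sum(g * g for g in grad))
    if magnitude == 0.0:
        return color
    n_dot_l = sum(g * l for g, l in zip(grad, light)) / magnitude
    return color * abs(n_dot_l)


def cast_ray(vol, origin, direction, step, n_samples, threshold=0.95):
    """Ray-path calculation plus front-to-back composition with early ray termination.

    Returns (accumulated color, accumulated opacity, samples actually taken).
    """
    c_acc, a_acc, taken = 0.0, 0.0, 0
    for s in range(n_samples):
        p = [o + s * step * d for o, d in zip(origin, direction)]
        color, alpha = classify(trilinear(vol, *p))
        color = shade(color, gradient(vol, *p), direction)
        # The sample color is weighted by its opacity before accumulation.
        c_acc += (1.0 - a_acc) * alpha * color
        a_acc += (1.0 - a_acc) * alpha
        taken += 1
        if a_acc >= threshold:
            break
    return c_acc, a_acc, taken
```

In this sketch every stage runs sequentially for one sample; the surveyed architectures instead pipeline the stages and replicate them across many rays or voxels.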
Fig. 3. Common memory organization schemes.
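Two of the memory organization schemes named in Fig. 3 can be illustrated with simple address-to-module mappings. This is a hedged sketch: the bit-level layout of the eight-way interleaved bank index and the modulo form of the skewed mapping are assumptions made for illustration, and the function names are hypothetical.

```python
def interleaved_bank(x, y, z):
    """Eight-way interleaving: the low bit of each coordinate selects one of 8 banks."""
    return (x & 1) | ((y & 1) << 1) | ((z & 1) << 2)


def skewed_module(x, y, z, n_modules):
    """Skewed partitioning: voxels on a diagonal slice x + y + z = const share a module."""
    return (x + y + z) % n_modules


def beam_is_conflict_free(module_of, axis, length):
    """True if `length` consecutive voxels along `axis` map to distinct modules."""
    seen = set()
    for t in range(length):
        position = [0, 0, 0]
        position[axis] = t
        seen.add(module_of(*position))
    return len(seen) == length
```

With the interleaved mapping, the eight voxels of any 2 x 2 x 2 neighborhood fall into eight different banks, so one trilinear re-sample can be fetched in parallel. With the skewed mapping, a beam of n voxels along any of the three axes touches n different modules.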
A. Memory System

The memory system is the most important component of a visualization architecture. The memory system contains the dataset and is responsible for supplying the computational units with voxel values at a high bandwidth to support the target frame rate. Since the dataset will be visualized from various view positions, the throughput of the memory system should be as view independent as possible. Regardless of the parallel processing strategy, each ray casting algorithm requires simultaneous access to multiple voxels. Ideally, the memory system provides these voxels in a conflict-free manner; otherwise, the overall system may suffer performance degradation.

The architectures surveyed in this paper use four memory partitioning schemes (shown in Figure 3) to achieve a high memory throughput. Sub-block partitioning (Figure 3A) divides the dataset into smaller volumes. Each sub-block is assigned to a different memory module. Orthogonal slice partitioning (Figure 3B) assigns each slice inside the dataset to a memory module. Each slice is perpendicular to one axis of the dataset. In this partitioning scheme, memory throughput is maximized for two of the three orthogonal viewing directions. The eight-way interleaved memory system (Figure 3C) assigns each voxel in a 2 x 2 x 2 block to separate memory banks. The eight-way interleaved memory partition is limited to eight parallel memory accesses. As a result, it can be combined with sub-block partitioning when additional parallelism is necessary. The skewed (non-orthogonal slice) partitioning scheme (Figure 3D) assigns slices that make a 45° angle with each axis of the dataset to memory modules. In this partitioning scheme, the memory throughput is maximized for the three orthogonal viewing directions.

The maximum performance obtained by a volume rendering architecture is primarily determined by the degree of parallelism and the memory technology used. Recent memory devices use pipelining to accelerate linear accesses. Synchronous memories, such as Synchronous DRAM (SDRAM), can sustain memory accesses at 150MHz. This is a three-fold speed up over previous alternatives. More recently, Rambus defined a high-speed interface that will allow sustainable bandwidths up to 800 MB/s using an 8-bit bus. Using a wider 16-bit bus (Direct Rambus), these devices are able to sustain 1.6 GB/s data throughput [3]. These advanced memories can potentially enhance performance for any given architecture. A metric that measures the ability of a volume rendering architecture to utilize available memory bandwidth is presented in Section VI.

B. Ray-Path Calculation

Calculating ray-voxel intersections is tightly coupled with the memory system design and is related to the type of ray casting algorithm used. The appropriate memory addresses for each voxel that a ray penetrates must be computed. These addresses are calculated by constructing a line (ray) between the viewing position and a pixel on the image plane and extending the line (ray) through the dataset. Based on the processing strategies from the previous section, it may be necessary to calculate a substantial number of memory addresses in parallel. Look-up tables or templates have been used to reduce the computation involved in calculating ray paths through the dataset [49]. For parallel projections and hybrid order architectures, templates only need to be generated once per projection because all rays have the same slope.

C. Interpolation

Estimating the sample value requires evaluation of the trilinear interpolation equation:

\[
\begin{aligned}
S(i,j,k) = {} & P_{000}\,(1-i)(1-j)(1-k) \\
 & + P_{100}\,i\,(1-j)(1-k) + P_{010}\,(1-i)\,j\,(1-k) \\
 & + P_{110}\,i\,j\,(1-k) + P_{001}\,(1-i)(1-j)\,k \\
 & + P_{101}\,i\,(1-j)\,k + P_{011}\,(1-i)\,j\,k + P_{111}\,i\,j\,k
\end{aligned}
\tag{1}
\]

i, j, and k are fractional offsets of the sample position in the x, y, and z directions, respectively. These variables are between 0 and 1. P_{abc} is a voxel whose relative position in a 2 x 2 x 2 neighborhood of voxels is (a, b, c). a, b, and c are the least significant bit of the x, y, and z sample position, respectively. From Equation 1, we see that a total of 24 multiplications are necessary and eight voxel values are required to compute each re-sample location. The number of multiplications can be reduced by approximately one-half if factors are re-used. If we assume that each 1 x 1 x 1 unit volume in a 512 x 512 x 512 dataset contains one re-sample location per projection, then more than 1.5 billion multiplications would be necessary for each projection. Multiple projections are needed per second for interactive projection rates, requiring an enormous amount of computational power. As few as eight multiplications are necessary if the interpolation weights are stored in a look-up table. Higher-order interpolation can be used to improve image-quality but it is typically not done in hardware because of its computational cost.

D. Gradient Estimation

The next step is the determination of gradients to approximate surface normals for classification and shading. x-, y-, and z-gradients may be computed using central differences:

\[
\begin{aligned}
G_x &= \frac{S(i+1,j,k) - S(i-1,j,k)}{\Delta x} \\
G_y &= \frac{S(i,j+1,k) - S(i,j-1,k)}{\Delta y} \\
G_z &= \frac{S(i,j,k+1) - S(i,j,k-1)}{\Delta z}
\end{aligned}
\tag{2}
\]

S(i,j,k) is the interpolated sample at the location (i,j,k) inside the dataset. \Delta x, \Delta y, and \Delta z are the spacings between samples in the x, y, and z directions, respectively. The costly divisions are usually avoided because of the regular spacing between voxels inside the dataset. Two re-sample locations adjacent to the sample location in each direction are required to compute the gradient using central differences. Some algorithms use a larger neighborhood of voxels to generate images that appear smoother and/or to reduce temporal aliasing. In addition to the gradient vector components, the gradient magnitude and the normalized gradient vector may be required. Gradients can also be taken at neighboring voxels and interpolated to yield the gradient at the re-sample location.

In practice, several gradient estimation schemes exist [15], [18], [31], [34], [51]. A comprehensive consideration of these methods is beyond the scope of this survey. In general, high-quality gradient estimation requires additional computation and memory bandwidth or on-chip storage that may affect performance and cost.

E. Classification

Classification maps a color and opacity to sample values. Opacity values range from 0 (transparent) to 1.0 (opaque) [28]. Classification is typically implemented in hardware using look-up tables (LUTs). These LUTs are typically addressed by sample value and/or gradient magnitude, and they output sample opacity and color. It is desirable to be able to modify the information in these LUTs during the visualization process in real-time. If the architecture processes multiple re-sample locations in parallel, these LUTs must be duplicated to avoid contention.

F. Shading

The Phong shading algorithm [40], or a variant, is often used in the shading subsystems of volume rendering architectures. This algorithm requires gradients, light and reflection vectors to calculate the shaded color for each re-sample location. The algorithm involves computationally expensive division, multiplication, and exponentiation that must be implemented in hardware. In practice, the shading algorithm is implemented in either arithmetic units for accuracy or reflectance LUTs for flexibility [44]. For color images, the Phong shading models may be applied to the red, green, and blue components. Also, additional computation may be necessary if multiple light sources are supported.

G. Compositing

The composition system is responsible for summing up color and opacity contributions from re-sample locations along a ray into a final pixel color for display [41]. The front-to-back formulation for compositing is:

\[
\begin{aligned}
C_{Acc} &= (1.0 - \alpha_{Acc}) \cdot C_{sample} + C_{Acc} \\
\alpha_{Acc} &= (1.0 - \alpha_{Acc}) \cdot \alpha_{sample} + \alpha_{Acc}
\end{aligned}
\tag{3}
\]

C_{Acc} is the accumulated color, \alpha_{Acc} is the accumulated opacity, C_{sample} is the sample's color, and \alpha_{sample} is the sample's opacity. Two multiplies are needed to composite each re-sample location. Compositing in a front-to-back order allows for early ray termination if a desired opacity threshold has been reached. Back-to-front composition can be utilized to simplify the calculation; however, early ray termination is not possible. Color information produced from the compositing system is stored into a frame buffer for display.

V. Architecture Survey

This section presents five special purpose volume rendering architectures. A description of each architecture is
given along with its performance. The following architectures are surveyed, in chronological order of their development: VOGUE, VIRIM, Array Based Ray Casting, EM-Cube, and VIZARD II. Each figure in this section was redrawn from its original publication.

A. VOGUE

The VOGUE architecture [22], [24] was developed at the University of Tübingen, Germany. One rendering engine provides high-quality, volume-rendered images with multiple light sources using four custom VLSI chips. A block diagram of the architecture is shown in Figure 4. The main goals of VOGUE are flexibility and compactness. VOGUE is capable of three rendering modes based on the gradient estimation method: a fast 8-voxel gradient, a slower intermediate quality 32-voxel gradient, and a higher quality 56-voxel gradient. VOGUE hardware consists of an Address SeQuencer (ASQ) for memory addressing, a volume memory (VoluMem) for dataset storage, a Reconstructor EXtractor (REX) for interpolation, a COLOSSUS unit for shading, and a COMET unit for composition. VOGUE implements an unrestricted Phong illumination model in addition to depth cueing.

Fig. 4. VOGUE architecture.

A.1 Description

The Volume Memory (VoluMem) is organized as an eight-way interleaved memory system (see Figure 3C) which allows eight voxels surrounding a trilinear re-sample location to be retrieved in parallel.

The ASQ unit provides necessary addresses for the VoluMem. It generates addresses for the voxels involved in re-sampling and gradient estimation. A ray's initial position and incremental values to the next re-sample location are computed by the host computer and passed to the ASQ where they are incremented to compute the address of the eight voxels surrounding the re-sample location.

The REX unit performs trilinear interpolation using the eight voxels from VoluMem to compute the re-sampled value. The REX contains three stages of linear interpolators. Adjacent voxels from the trilinear interpolation neighborhood are used in linear interpolations to compute edge-values, then pairs of edge-values are used in linear interpolations to compute face-values, and the last linear interpolation uses a pair of opposite face-values to compute the final sample value. The REX is a pipelined unit and produces one interpolation value per clock cycle.

In addition to interpolation, the REX unit also performs gradient calculation. Gradient calculation requires 1 memory access for the fastest gradient mode (8-voxel gradient), 4 memory accesses for the intermediate mode (32-voxel gradient), and 7 memory accesses for the highest quality gradient mode (56-voxel gradient). In the fastest gradient mode, opposite face-values computed during trilinear interpolation are used to compute gradients. The higher quality gradient modes require additional voxels and interpolation. The REX unit can produce one gradient vector and magnitude per clock cycle. The REX unit contains three pipelined square units and one square root unit to compute the gradient magnitude.

Classification information is stored in three LUTs: specular coefficient, color, and opacity. These tables are indexed using the sample value, gradient, and gradient magnitude. These values are subsequently used by the shading unit (COLOSSUS) and compositing unit (COMET).

The COLOSSUS shading unit implements the unrestricted Phong illumination model and depth cueing. The specular coefficient from the LUT along with the gradient vector, light vector, and ambient coefficient are passed to the COLOSSUS chip. The COLOSSUS chip internally converts operands to logarithms to reduce multiplication and division to simple addition and subtraction, respectively. The costly exponentiation operation required by the Phong illumination model is reduced to a multiply; however, fast logarithmic converters are necessary. These units are pipelined to achieve the desired system performance.

Shaded samples are composited in the COMET chip. The COMET chip requires an opacity, from a LUT, and color values from the COLOSSUS chip. These values are composited into a final pixel color that is passed to the frame buffer.

A.2 Performance

Estimated performance of one VOGUE module, containing the four VLSI units (ASQ, REX, COLOSSUS, and COMET), is 2.5 frames/second for 256^3 datasets using the fastest rendering mode. For higher performance, several rendering modules are connected to other modules in a ring network. To achieve larger memory throughput, a fully-parallel implementation uses sub-block partitioning to globally partition the dataset. Each sub-block is locally partitioned using the eight-way memory interleaving scheme and is stored into the VoluMem of a given rendering module. Boundary voxels are replicated among adjacent rendering modules to enhance performance.

VOGUE is capable of perspective projection and is able to utilize early ray termination. The estimated performance using the fastest 1-access gradient mode is 20Hz using eight modules for 256^3 datasets and using 64 modules for 512^3 datasets. VOGUE's highest quality gradient mode improves image quality; however, performance is lowered to 2 frames/second.

B. VIRIM

The VIRIM architecture has been developed and assembled at the University of Mannheim [12] to achieve real-time visualization on moderate sized datasets (256 x 256 x 128) with high image quality. VIRIM is an object order ray casting engine that uses the Heidelberg raytracing algorithm [32] discussed below. The VIRIM architecture is shown in Figure 5. It consists of a geometry unit and a ray casting unit. The geometry unit is responsible for interpolation and gradient calculation; the ray casting unit is responsible for implementing the actual ray casting algorithm.

Fig. 5. VIRIM architecture.

B.1 Description

The rotation of the dataset occurs on dedicated rotation hardware called the Rotator Board (geometry unit in Figure 5). The Rotator Board aligns the dataset with the viewing position. The Rotator Board consists of the volume memory, a geometry processor, an interpolation processor, and a gradient processor.

The dataset is stored in an eight-way interleaved memory system. The dataset is rotated using backward mapping from the re-sample position and a weighted interpolation mask on an eight voxel neighborhood. Arbitrary (e.g., Gaussian) interpolation weights can be used in the 8-voxel neighborhood instead of trilinear interpolation. The geometry processor generates the addresses for the eight memory banks using a rotation matrix. Unlike other architectures, VIRIM does an interpolation on classified density values. The mappings are stored in eight LUTs that can be freely modified.

A modified 2D Sobel filter is used to estimate the X and Y components of the gradient vector in the re-sampled coordinate system. Because of this, the gradient is only two-dimensional and view dependent. The outputs of the Rotator Board are the density and gradient values for a sample location. These components are transferred to the Digital Signal Processor (DSP) boards (ray casting unit in Figure 5) using a specialized bus and stored into first-in first-out memories (FIFOs). The geometry units are much faster than the DSPs; therefore, the FIFOs are required to de-couple speed differences between the two units.

The DSP board implements ray casting with the Heidelberg illumination model. In the Heidelberg model, the dataset is rotated such that the viewer looks along a major axis. Two light sources enter the volume. One light source is along the direction of the viewer and the other light source is 45° from the first light source. Light intensity is calculated slice-by-slice, and the final illumination value per sample is generated by the summation of all light intensity emitted in the viewer's direction. The Heidelberg raytracing algorithm can account for reflection, absorption, and emission of light, and is capable of producing shadows. The DSP board contains eight DSP chips and a CPU. Floating point operations for the shading and visualization algorithm are performed by the DSPs. The DSPs are programmable and provide flexibility for the architecture to implement different volume rendering and shading algorithms.

B.2 Performance

VIRIM is capable of producing shadows and supports perspective projections. One VIRIM module with four boards has been assembled and achieves 2.5Hz frame rates for 256 x 256 x 128 datasets. To achieve interactive frame rates, multiple rendering modules have to be used; however, dataset duplication is required. Four modules (16 boards) are estimated to achieve 10Hz for the same dataset size, and eight modules (32 boards) are estimated to achieve 10Hz for 256^3 datasets [14].

C. Array Based Ray Casting

Fig. 6. Array Based Ray Casting architecture.

The Array Based Ray Casting engine developed at the University of New South Wales [6] is an object order ray casting architecture. This architecture consists of two parallel pipelined arrays used to rotate the dataset and to cast rays, as illustrated in Figure 6. These rotation arrays are
connected between $n$ memory modules and $1.5n$ rendering pipelines, where $n$ is the resolution of the dataset. In the second array, intersections with voxels are determined by using nearest neighbor or zero order interpolation. Each rendering pipeline performs shading and composition for a given scanline. In addition, the system is composed of a double-buffered input memory, memory swapping array, and a frame buffer.

C.1 Description

The volume dataset is stored in a double-buffered volume memory that allows the simultaneous loading of one dataset and visualization of another. The memory system uses orthogonal slice partitioning (see Figure 3B). The dataset is stored in memory in a view dependent manner using coordinate swapping. Using a spherical coordinate system, view positions are classified as being in one of eight primary octant regions. As the dataset is loaded, coordinate swapping occurs based on the view position to allow conflict-free access to beams. Note that coordinate swapping performs a partial rotation. Limited rotations about the X- and Y-axis occur in the Warp Array and Ray Array, respectively. These three partial rotations allow general rotation of the dataset.

Vertical beams, indicated by similarly shaded voxels in Figure 6, are loaded into the Warp Array in one clock cycle. The Warp Array rotates the volume by 45 degrees around the X-axis by shearing slices in Y. The shear is accomplished in the Warp Array by shifting beams of voxels based on a comparison of the row coordinate and the beam's rotated Y-coordinate. The first column of the Warp Array computes these Y-coordinates for each voxel, and the remaining columns contain simple processing elements. These elements perform three basic functions: shift-right, shift-right-up, and shift-right-down. The rows in both the Warp Array and Ray Array correspond to a discrete Y-coordinate. However, explicit Y-coordinate information is only stored in the Warp Array.

Voxels in the rightmost column of the Warp Array proceed to adjacent processing elements in the leftmost column of the Ray Array. The Ray Array casts parallel rays into the sheared YZ-slices. As indicated in Figure 6, a row inside the Ray Array corresponds to a scanline of rays. Initializers compute X- and Z-coordinates for voxels during ray casting. Each ray's initial coordinate and increment vector is shifted into place inside the Ray Array before ray casting. The processing elements in the Ray Array implement a Compare-and-Shift-Right function. If a voxel's coordinate matches a ray's current coordinate, a flag is set which proceeds through the remainder of the array with the voxel and coordinate data. The Ray Array implements nearest neighbor or zero order interpolation.

A one dimensional array of rendering pipelines classifies, shades, and composites the voxels along the discrete rays. To estimate gradients, each element in the Ray Array has additional registers to buffer voxel information. Voxels traverse a row inside the Ray Array three times. A $3 \times 3$ box filter is used in the Rendering Pipelines to estimate gradients. The gradient estimation and shading algorithm uses a full 26-voxel neighborhood and creates smoothly shaded images [9].

C.2 Performance

If $n$ is the dimension of the volume data, the sizes of the Warp Array and Ray Array are $1.5n \times n$ and $1.5n \times 1.5n$, respectively. For $256^3$ datasets, this corresponds to approximately 212,992 processing elements. An additional column in each array contains the coordinate initializers. The architecture also contains $1.5n$ rendering pipelines. The Warp Array for this dataset dimension is estimated to fit inside a $5 \times 5$ array of FPGAs. Processing elements in the Ray Array are larger than those in the Warp Array and require more hardware. However, a smaller Ray Array (i.e., with fewer columns) can be used by time-multiplexing the Ray Array and stalling the Warp Array, thereby reducing throughput.

This architecture only supports parallel rendering and is capable of 15Hz frame rates for $256^3$ datasets sharing the Ray Array 10 times. The architecture has undergone several changes since its publication and is now called VIZAR [7].

D. EM-Cube

EM-Cube is a commercial version of the high-performance Cube-4 [36], [37], [39] volume rendering architecture that was originally developed at the State University of New York at Stony Brook. EM-Cube is currently under development at Mitsubishi Electric Research Laboratory [35]. The Cube family of architectures is characterized by memory skewing. EM-Cube is a highly parallel architecture based on the hybrid order ray casting algorithm shown in Figure 2C. Rays are sent into the dataset from each pixel on the base plane, which is co-planar to the face of the dataset that is most parallel to the image plane. Because the image plane is typically at some angle to the base-plane, the resulting base-plane image is 2D warped onto the image plane.

The main advantage of this algorithm is that voxels can be read and processed in planes of voxels (so called slices) that are parallel to the base-plane [37]. Within a slice, voxels are read from memory a beam of voxels at a time, in top to bottom order. This leads to regular, object order data access. The EM-Cube architecture utilizes memory skewing [20] on a block granularity for conflict-free beam access.

D.1 Description

EM-Cube will be implemented as a PCI card for Windows NT computers. The card will contain one volume rendering ASIC, 32 Mbytes of volume memory, and 16 Mbytes of local pixel storage. The warping and display of the final image will be done on an off-the-shelf 3D graphics card with 2D texture mapping. The EM-Cube volume rendering ASIC, shown in Figure 7, contains eight identical render-
ing pipelines, arranged side by side, and interfaces to voxel memory, pixel memory, and the PCI bus. Each pipeline communicates with voxel and pixel memory and the two neighboring pipelines. Pipelines on the far left and right are connected to each other in a wrap-around fashion (indicated by grey arrows in Figure 7). A main characteristic of EM-Cube is that each voxel is read from volume memory exactly once per frame. Voxels and intermediate results are cached in so called slice buffers so that they become available for calculations precisely when needed.

Fig. 7. EM-Cube architecture with eight identical ray casting pipelines. (Diagram labels: EM-Cube ASIC; four SDRAMs on the Voxel Memory Interface; Pipelines 0 to 7 with Interpolation, Slice Buffers, Gradient Estimation, Shading & Classification, and Compositing stages; PCI Interface; four SDRAMs on the Pixel Memory Interface.)

E. VIZARD II

The VIZARD II architecture is being developed at the University of Tübingen to bring interactive ray casting into the realm of desktop computers [33]. This is the second generation of VIZARD systems [23], [25]. These image order architectures are characterized by methods to reduce memory bandwidth requirements for interactive visualization. While VIZARD uses a pre-shaded and pre-classified compressed dataset, VIZARD II only preprocesses gradients that are stored into a quantized gradient table. Using central differences as the underlying gradient filter, preprocessing the gradient filter requires only a few seconds and is only performed once per dataset. Gradient quantization potentially allows VIZARD II to implement global gradients. VIZARD II was designed to interface to a standard PC system using the PCI bus. The dataset is stored in four interleaved memory banks along with a pre-computed gradient index, segmentation index, and gradient magnitude for each voxel. The combination of pre-computed gradients, caching, and early ray termination reduces the bandwidth requirements of the memory system. Added flexibility is obtained by using a DSP and FPGAs as the implementation technology. This allows the VIZARD II card to perform other visualization tasks such as reconstruction, filtering, and segmentation.
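
The idea of storing a gradient index and a gradient magnitude per voxel, as described for VIZARD II, can be illustrated with a short sketch. This is not the VIZARD II implementation: the table layout (a Fibonacci-lattice sampling of unit directions), the function names, and the default table size of 512 entries (what a 9-bit index would address) are assumptions made purely for illustration.

```python
import math


def build_direction_table(size=512):
    """Hypothetical gradient table: `size` unit vectors spread over the sphere."""
    golden_angle = math.pi * (3.0 - math.sqrt(5.0))
    table = []
    for i in range(size):
        z = 1.0 - 2.0 * (i + 0.5) / size
        radius = math.sqrt(max(0.0, 1.0 - z * z))
        phi = i * golden_angle
        table.append((radius * math.cos(phi), radius * math.sin(phi), z))
    return table


def central_difference(volume, x, y, z):
    """Central-difference gradient at an interior voxel of volume[z][y][x]."""
    gx = (volume[z][y][x + 1] - volume[z][y][x - 1]) / 2.0
    gy = (volume[z][y + 1][x] - volume[z][y - 1][x]) / 2.0
    gz = (volume[z + 1][y][x] - volume[z - 1][y][x]) / 2.0
    return (gx, gy, gz)


def quantize_gradient(gradient, table):
    """Return (index of the closest table direction, gradient magnitude)."""
    magnitude = math.sqrt(sum(c * c for c in gradient))
    if magnitude == 0.0:
        return 0, 0.0  # direction is undefined; any index will do
    unit = [c / magnitude for c in gradient]
    # The closest direction is the one with the largest dot product.
    index = max(range(len(table)),
                key=lambda i: sum(u * t for u, t in zip(unit, table[i])))
    return index, magnitude


def angular_error_degrees(gradient, table, index):
    """Angle between the true gradient and its quantized direction."""
    magnitude = math.sqrt(sum(c * c for c in gradient))
    dot = sum(g * t for g, t in zip(gradient, table[index])) / magnitude
    return math.degrees(math.acos(max(-1.0, min(1.0, dot))))


if __name__ == "__main__":
    table = build_direction_table()
    # Toy 3x3x3 volume whose density is the ramp x + 2y + 3z.
    ramp = [[[x + 2 * y + 3 * z for x in range(3)] for y in range(3)] for z in range(3)]
    g = central_difference(ramp, 1, 1, 1)
    idx, mag = quantize_gradient(g, table)
    print(idx, mag, angular_error_degrees(g, table, idx))
```

At render time only the index is needed to fetch a direction from the table, which is what makes a small per-voxel index cheaper to store than three full gradient components. The quantization error depends on the table size and on how the directions are distributed.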
EM-Cube is a parallel projection engine with multiple rendering pipelines. Each pipeline implements the ray casting algorithm. Samples along each ray are calculated using trilinear interpolation. A 3D gradient is computed using central differences between trilinear samples. The gradient is used in the shader stage, which computes the sample illumination according to the unrestricted Phong lighting model using a reflectance LUT [44]. Look-up tables in the classification stage assign color and opacity to each sample point. Finally, the illuminated samples are accumulated into base plane pixels using front-to-back compositing.

Volume memory is organized as four 64-Mbit 16-bit wide SDRAMs for 32 Mbytes of volume storage. The volume data is stored as $2 \times 2 \times 2$ blocks of neighboring voxels. Miniblocks are read and written in bursts of eight voxels using the fast burst mode of SDRAMs. In addition, EM-Cube uses linear skewing of these blocks. Skewing guarantees that the rendering pipelines always have access to four adjacent miniblocks in any of the three slice orientations.

D.2 Performance

EM-Cube is a parallel ray casting engine that implements a hybrid order algorithm; however, EM-Cube does not support perspective projections. Each of the four SDRAMs provides burst-mode access at up to 133MHz, for a sustained memory bandwidth of $4 \times 133 \times 10^6 \approx 533$ million 16-bit voxels per second. Each rendering pipeline operates at 66MHz and can accept a new voxel from its SDRAM memory every cycle. Eight pipelines operating in parallel can process $8 \times 66 \times 10^6$ or approximately 533 million samples per second. This is sufficient to render $256^3$ volumes at 30 frames per second.

Fig. 8. VIZARD II architecture. (Diagram labels: Viewing Parameters; Control Unit (CU); Address Multiplexer; Memory Unit (MU); Volume Memory M1 to M4; Trilinear Interpolation Unit (TIU); Sample and Gradient; Shade and Composit Unit (SCU); r, g, b, α.)

E.1 Description

The VIZARD II architecture is illustrated in Figure 8. VIZARD II consists of four functional blocks: Control Unit, Memory Unit, Trilinear Interpolation Unit, and Shading/Compositing Unit. The Control Unit determines intersections of the rays with the dataset and cut planes. The Memory Unit stores the dataset in four SDRAM modules, each with its own SRAM cache. The Trilinear Interpolation Unit is responsible for re-sampling the
dataset. It also interpolates the gradients at off-grid locations using eight parallel lookups to the quantized gradient table. The Shading and Compositing Unit supports look-up-based shading and multiple classification tables. Final pixel values are transferred through the PCI bus to the host computer.

VIZARD II implements an image order algorithm that utilizes early ray termination. The algorithm first preprocesses the dataset to compute gradient indices for each voxel. The gradient index contains 9 bits, but the architecture is not limited to 9 bits. Alternatively, the full gradient components could be stored with the voxel memory. The 9-bit gradient index generates 512 table entries. Using 512 entries, the average error in the gradient computation is 2.3 degrees and the maximum error is 7.9 degrees [33]. Larger gradient tables can be used for greater accuracy. Four voxels (including gradient index) are simultaneously accessed in parallel. These voxels are four-way interleaved with respect to the YZ-plane. A burst memory access is used to fetch adjacent voxels in the X-direction. To access a $2 \times 2 \times 2$ trilinear neighborhood, four voxels in the YZ-plane are accessed in parallel from each bank, and a sequential burst memory access from each bank provides the remaining four values from an adjacent YZ-plane. These values are cached in separate cache banks to allow parallel access and re-use.

The Interpolation Unit uses the fractional x, y, and z components and the trilinear interpolation neighborhood from the Memory Unit to re-sample the dataset. The gradient index is used to address the gradient look-up table. The resulting x, y, and z gradient components are interpolated in a similar manner. The trilinear interpolation units can sustain samples at a rate up to four times faster than the rate of the memory system if the desired voxels reside inside the cache. The sample throughput is enhanced by supersampling the dataset because of a significant increase in cache hits.

The sample and gradient value are used in the Shading and Compositing Unit to classify and shade the sample. The classification table is chosen using a classification index and the architecture handles segmentation and multiple cut-planes. Phong shading is implemented using a look-up table and compositing uses the standard "over" operator. Early ray termination is utilized to increase the frame rate.

E.2 Performance

VIZARD II supports multiple cut planes, segmentation, parallel, and perspective viewing. It is expected to sustain a frame rate of 10Hz. However, its worst-case performance is approximately 1 frame per second. Worst-case performance occurs for 1:1 mapping of samples to voxels and a transparent classification of the dataset. Using four 100MHz SDRAM devices, this architecture is capable of $14 \times 10^6$ samples per second worst-case performance and $56 \times 10^6$ samples per second best-case, assuming 4:1 mapping of samples to voxels.

VI. Performance Metrics

The performance of a volume rendering architecture is determined by several factors:

Frame Rate is the number of images that can be generated per unit of time and is measured in frames per second or Hz.

Samples Processed Per Second (SPPS) is the number of filtered samples that can be generated per unit of time. Unlike frame rate, SPPS is not sensitive to image and dataset resolution. SPPS is similar to trilinear interpolated samples per second that is commonly used in the specification of 3D texture mapping hardware.

Latency is the time between a change in dataset or viewing parameters and the display of the updated image.

Image quality is mainly a qualitative assessment that is related to the resolution of the generated images, interpolation filter, gradient filter, and illumination models used.

Scalability is the ability of an architecture to extend its performance by increasing the amount of computational and memory throughput. Ideally, a linear increase in the number of rendering modules should linearly increase an architecture's frame rate or maintain its frame rate for a linear increase in dataset size.

Although the primary goal of many architectures is to achieve high frame rates and SPPS, image quality may be more desirable when visualizing static datasets. Frame rate and SPPS are indicators of the amount of acceleration that is provided by the rendering architecture. One drawback to these metrics is that both are proportional to the amount of bandwidth available to the architecture. This is undesirable since the architectures surveyed span several generations of memory technology. To address this problem, we introduce a simple model that measures the ability of an architecture to convert voxel throughput (memory bandwidth) into sample throughput. The model can be derived by accounting for changes in throughput along the processing path as a voxel is converted into a filtered sample. This leads to a relationship between peak memory bandwidth and effective sample throughput. Sample throughput can be given by the following equation:

$$S\left[\frac{\text{sample}}{\text{second}}\right] = B\left[\frac{\text{voxel}}{\text{second}}\right] \times \overbrace{U \times \frac{1}{V\left[\frac{\text{voxel}}{\text{sample}}\right]}}^{P} \qquad (4)$$

where S is performance measured in SPPS, B is the peak bandwidth of the memory system, U is the memory bandwidth utilization in percent, and V is the average number of voxels that need to be fetched per sample. In this survey, B is held constant to compensate for advances in memory technology and for varying degrees of parallelism among the different architectures.

U, memory bandwidth utilization, is the percentage of the peak memory bandwidth that is realized. The product of U and B is the sustained voxel throughput into the rendering components. For maximum performance, U should be 1.0. U accounts for any combination of random cycles, sequential cycles, and idle time on the memory bus. U is
given by:

$$U = \frac{w}{rC + 1 - r} \qquad (5)$$

where r is the percentage of all accesses that are random and w is the percentage of time that voxels are transferred to rendering components (bus utilization). C is the speedup of a memory device obtained by using sequential memory access (or burst access) instead of random access. C is a memory technology dependent term. For example, assume a 100MHz SDRAM memory (10ns synchronous access time) has a 70ns random access time including page faults. This leads to a C of 7.0 (i.e., 700%). In this case, if all memory accesses are random (r = 1.0) and the memory bus is fully utilized (w = 1.0), the memory bandwidth utilization, U, would be as low as 14.3%. To account for the fact that not all random accesses are page faults we assume that the memory bandwidth utilization is 20% when the bus is fully utilized (w = 1.0) with random accesses (r = 1.0). This is a conservative estimate for newer DRAM memories.

V is the average number of voxels that are fetched from memory per sample. V is related to the size of the gradient and interpolation filter. For example, an architecture that uses a 32-voxel gradient filter could have V = 32 using a brute force approach; however, if voxels and intermediate results are buffered efficiently, V = 1 because access to buffered or cached voxels can occur in parallel with new voxel accesses. Some architectures use algorithmic acceleration (e.g., early ray termination) to enhance performance. The reduction in fetched voxels due to algorithmic speedup is view and dataset dependent. In our comparison, we assume a reduction of V by a factor of 3 when early ray termination has been implemented [28].

We call the term $P = \frac{U}{V}$ in Equation 4 Sample Processing Efficiency. It is an architecture specific measure of how peak memory bandwidth (B) translates to SPPS (S). If each voxel is accessed on average once per filtered sample (V = 1.0) and the memory system is fully utilized (U = 1.0), then the sample processing efficiency, P, will be 1.0. Sample processing efficiency may be larger than 1.0 if compression is used or if the volume dataset is supersampled (i.e., multiple re-sample locations per unit volume). Sample processing efficiency, P, is related to frame rate, F, by the following formula:

$$F\left[\frac{\text{frame}}{\text{second}}\right] = \frac{B\left[\frac{\text{voxel}}{\text{second}}\right] \times P\left[\frac{\text{sample}}{\text{voxel}}\right]}{T\left[\frac{\text{sample}}{\text{frame}}\right]} \qquad (6)$$

B is the peak bandwidth of the memory system and T is the number of samples needed to render the frame. Sample processing efficiency only measures the ability to convert memory bandwidth to processed samples; it is relatively independent of memory technology and can therefore be used as an objective measure of architecture efficiency. However, it does not measure other important performance metrics such as image quality, cost, scalability, or latency. Since the architectures presented in this paper span several generations in VLSI technology, it is difficult to augment the comparison with normalized cost. Consequently, we do not explicitly compare cost or use a cost/performance ratio. However, it should be noted that these architectures will reach a higher performance/price ratio than most interactive volume rendering options currently available.

VII. Comparison

Table III presents a comparison of the five architectures. The categories include 1) status - the present stage of development, 2) algorithm - type of ray casting algorithm used, 3) memory partitioning - the organization of the memory system, 4) interpolation hardware - size, type, and/or implementation of the interpolation filter, 5) gradient hardware - size, type, and/or implementation of the gradient filter, 6) shading - shading algorithm supported, 7) perspective support - the ability to handle perspective projections, 8) real-time data input - the potential to support real-time streamed input, 9) target technology - type of implementation, 10) performance limitations - bottlenecks in the system, 11) scalability - the ability to scale performance using multiple rendering pipelines, 12) algorithmic acceleration - early ray termination support, 13) sample processing efficiency - normalized acceleration metric that also accounts for algorithmic speedup, 14) published SPPS - performance numbers obtained from the respective publications, and 15) architectural highlights - features considered important for next generation volume rendering architectures.

The VOGUE architecture supports three rendering modes based on its gradient computation kernel (8, 32, and 56 voxels). Voxels are re-fetched on average 8, 32, and 56 times for mode 1, mode 2, and mode 3, respectively. VOGUE can accommodate multiple point light sources with an unrestricted Phong illumination model. Each module is memory limited and capable of $40 \times 10^6$ SPPS performance in the fastest rendering mode. Because of its random memory access pattern, VOGUE's memory bandwidth utilization is 20%. As a result, VOGUE's sample processing efficiency is between 0.00357 (mode 3) and 0.025 (mode 1) without algorithmic acceleration. Assuming that an algorithmic speedup of 3 is realizable due to early ray termination, the sample processing efficiency in Table III has been multiplied by 3.0. In large multi-module configurations, VOGUE's sample processing efficiency per module may decrease due to network overhead.

Image quality is the primary focus of the VIRIM architecture. This architecture is capable of a flexible illumination model including shadows. VIRIM uses programmable DSPs that support other rendering, shading, and interpolation algorithms. VIRIM is capable of $40 \times 10^6$ SPPS performance. Multiple engines can be used to increase performance; however, each engine must duplicate the entire dataset. The object order algorithm leads to no random memory accesses (r = 0.0). However, sample processing efficiency is limited by a global bus between the re-sampling hardware and the rendering hardware. The sample processing efficiency, P, is the ratio of the sustainable sample throughput of the bus between the Geometry and Ray Casting Units and the peak voxel throughput of the memory system. P = 0.2 for 40 MSample/second bus bandwidth VME
TABLE III
Architecture comparison.

                          VOGUE                         VIRIM                     Array Based Ray Casting  EM-Cube              VIZARD II
Started                   1993                          1994                      1995                     1997                 1998
Status                    Simulated                     System built              Simulated                ASIC in development  Simulated
Algorithm                 Image order                   Object order              Object order             Hybrid order         Image order
Memory Partitioning       Eight-way                     Eight-way                 Orthogonal slice         Skewed block         Four-way
Interpolation Hardware    Trilinear                     Programmable              Nearest neighbor         Trilinear            Trilinear
Gradient Hardware         Three modes (8/32/56 voxels)  2D Sobel filter           26-voxel neighborhood    Central differences  Quantized gradient
Shading                   Phong                         Programmable              Programmable             Phong                Phong
Perspective Projections   Yes                           Yes                       No                       No                   Yes
Real-time Data Input      Moderately difficult          Difficult                 Easy                     Easy                 Difficult
Target Technology         VLSI                          Off-the-shelf components  FPGA                     VLSI                 DSP/FPGA
Performance Limitation    Memory                        Bus                       Memory                   Memory               Memory
Scalability               Moderate                      Hard                      Easy                     Easy                 Moderate
Early Ray Termination     Yes                           No                        No                       No                   Yes
P, Sample Processing      mode 1: 0.075                 0.2                       0.1 (m = 10)             0.95                 0.075 (1:1 sampling)
Efficiency                mode 2: 0.01875               (B = 200 MVoxels/second)  1 (m = 1)                                     0.3 (4:1 supersampling)
                          mode 3: 0.01071
Published SPPS            40 × 10^6 (one module)        40 × 10^6                 240 × 10^6 (m = 10)      533 × 10^6           56 × 10^6 (4:1 supersampling)
Architectural Highlight   High-quality parallel and     Algorithmic flexibility   Double-buffering         High performance     Good performance/cost ratio
                          perspective rendering         and shadows
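
The efficiency row of Table III can be reproduced from the definitions in Section VI (U from Equation 5, P = U/V, S = B·P, and F = B·P/T). The sketch below does this using the U, V, and speedup values quoted in the text. The function names are illustrative, and the frame rate check additionally assumes one sample per voxel for a 256^3 volume, which the survey does not state explicitly.

```python
def bandwidth_utilization(w, r, c):
    """Memory bandwidth utilization, U = w / (r*C + 1 - r)."""
    return w / (r * c + 1.0 - r)


def sample_processing_efficiency(u, v, speedup=1.0):
    """P = U / V, scaled by an assumed algorithmic speedup."""
    return speedup * u / v


def samples_per_second(b, p):
    """S = B * P, in samples per second."""
    return b * p


def frame_rate(b, p, t):
    """F = B * P / T, in frames per second."""
    return b * p / t


# Speedup the survey assumes when early ray termination is implemented.
EARLY_RAY_SPEEDUP = 3.0

# 100MHz SDRAM example: all accesses random, bus fully used, C = 7.
u_random = bandwidth_utilization(w=1.0, r=1.0, c=7.0)

efficiency = {
    # VOGUE: U = 0.2, V = 8/32/56 depending on the gradient kernel.
    "VOGUE mode 1": sample_processing_efficiency(0.2, 8, EARLY_RAY_SPEEDUP),
    "VOGUE mode 2": sample_processing_efficiency(0.2, 32, EARLY_RAY_SPEEDUP),
    "VOGUE mode 3": sample_processing_efficiency(0.2, 56, EARLY_RAY_SPEEDUP),
    # VIRIM: bus throughput (40 MSample/s) over peak memory throughput (200 MVoxel/s).
    "VIRIM": 40e6 / 200e6,
    # Array Based Ray Casting: U = 1/m when the Ray Array is shared m times.
    "Array Based (m = 10)": sample_processing_efficiency(1.0 / 10, 1.0),
    # EM-Cube: U = 1.0, average voxel re-fetch of 1.05.
    "EM-Cube": sample_processing_efficiency(1.0, 1.05),
    # VIZARD II: V = 8, U = 0.2 (1:1 sampling) or 0.8 (4:1 supersampling).
    "VIZARD II 1:1": sample_processing_efficiency(0.2, 8, EARLY_RAY_SPEEDUP),
    "VIZARD II 4:1": sample_processing_efficiency(0.8, 8, EARLY_RAY_SPEEDUP),
}

# EM-Cube frame rate for a 256^3 volume, assuming one sample per voxel.
em_cube_fps = frame_rate(b=533e6, p=efficiency["EM-Cube"], t=256 ** 3)

if __name__ == "__main__":
    print(f"U (all random, C = 7): {u_random:.3f}")
    for name, p in efficiency.items():
        print(f"{name:22s} P = {p:.5f}")
    print(f"EM-Cube frame rate: {em_cube_fps:.1f} Hz")
```

Because B is factored out, the same P values can be compared across architectures built on different memory generations, which is the purpose of the metric.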
bus and assuming the eight memory modules can sustain 200 MVoxel/second. A new memory architecture was introduced in [5] that enhances the memory bandwidth utilization; however, the voxel re-fetch factor V is increased. This new memory system uses buffers and a pre-fetch mechanism to achieve a sample processing efficiency of 0.125.

The Array Based Ray Casting architecture strives for high performance. It uses slice-by-slice processing to render the dataset. The two arrays used in this architecture are 2D arrays of processing elements. In FPGA implementations, feedback paths in the larger Ray Array limit the maximum clock speed. The double-buffered memory offers support for real-time data input. The architecture is fully pipelined and parallel and uses only local communication. However, the hardware required to implement the two arrays scales with $O(n^2)$, where n is the dimension of the volume data. This architecture does not use interpolation, leading to lower image quality, and does not support perspective projections. The performance of this architecture is memory limited. In the full implementation, the Ray Array contains $1.5n \times 1.5n$ processing elements. In this configuration, memory bandwidth utilization, U, is 1.0. However, one drawback is the size and cost to implement the $O(n^2)$ Ray Array. One solution is to shorten the Ray Array (reduced columns) and to re-use it multiple times per projection. In this smaller configuration, the memory bandwidth utilization is inversely proportional to the number of times the Ray Array is shared. The sample processing efficiency, P, is $\frac{1.0}{m}$, where m is the number of times the Ray Array is re-used. Theoretically, m can be as large as 1.5n, where n is the dataset resolution, greatly reducing the size of the Ray Array and severely limiting the performance of the architecture. m can be assumed to lie between 1 and 10 for practical implementations. Using this assumption, this architecture has a sample processing efficiency of 1 (best-case) and 0.1 (worst-case). m is a design parameter for this architecture based on target cost, size, implementation technology, and performance.

EM-Cube is a highly optimized parallel rendering architecture and it was designed to render high-resolution datasets at real-time frame rates on a PC or workstation. The skewed memory system and slice-parallel processing allow EM-Cube to sustain a large memory bandwidth. EM-Cube uses memory skewing on a block granularity. This reduces chip pin-out and communication costs. This architecture can sustain $533 \times 10^6$ SPPS using four 133MHz SDRAMs. During perspective projections certain view points may adversely affect performance for hybrid order algorithms. As a result, EM-Cube does not support perspective projections. EM-Cube uses central differences for gradient estimation which requires 32 voxels; however, no performance penalty is incurred since EM-Cube uses on-chip storage to buffer sample values. Only local communication between processing pipelines is used; therefore, EM-Cube is highly scalable. EM-Cube's memory utilization is 1.0 because it can sustain synchronous memory accesses for each of its voxels. By processing the dataset in sections, EM-Cube is able to significantly reduce on-
chip storage [35]. Consequently, EM-Cube's average voxel re-fetch increases slightly because some voxels on the boundary of a section must be re-fetched from the memory system. EM-Cube's average voxel re-fetch is 1.05 for eight sections on a $256 \times 256 \times 256$ dataset. As a result, EM-Cube's sample processing efficiency is 0.95.

VIZARD II uses three methods to reduce memory bandwidth requirements for interactive visualization: 9-bit quantized gradient, caching, and early ray termination. VIZARD II supports perspective viewing, multiple cut planes, and segmentation. The quantized central difference gradient requires preprocessing and may degrade image quality when compared to traditional central difference gradients. In general, gradient quantization can be used to support a wide range of gradient filters and the size of the gradient table can be increased to enhance accuracy (i.e., image quality). The VIZARD II architecture has been simulated and is still at the research stage. A maximum performance of $56 \times 10^6$ SPPS can be achieved assuming 4:1 supersampling of the dataset. Maximum performance is limited by the memory system. A worst-case performance of $14 \times 10^6$ SPPS occurs for 1:1 sampling of the dataset. The performance of this architecture is memory limited. Since VIZARD II uses early ray termination, we assume that its algorithmic speedup is 3.0. VIZARD II's average voxel re-fetch is 8. The quantized central difference gradient prevents the voxel re-fetch from being larger than the size of the trilinear interpolation neighborhood. Since VIZARD II's memory system is clocked at 20% of its maximum rate, its memory efficiency factor is 0.2 (assuming worst-case 1:1 sampling). The cache memory and Trilinear Interpolation Units are clocked four times faster than the rate of the memory system. When the dataset is 4:1 super-sampled or higher, a substantial increase in cache hits increases the memory bandwidth utilization to 0.8 (best-case). This architecture has an overall sample processing efficiency of 0.075 (1:1 sampling) and 0.3 (4:1 sampling), taking into consideration a threefold speedup due to early ray termination.

VIII. Discussion

In each architecture, maximum performance is limited by either the memory system or global communication bottlenecks, such as buses or networks. Memory limitations are inherent to each architecture. Global buses and networks arise because of the need to communicate voxel data or intermediate values over a common medium. The volume rendering algorithm and memory organization determine whether these potential bottlenecks affect performance. Architectures that are only memory limited tend to be more scalable.

There is a tradeoff in most of the architectures between performance, quality, and hardware cost. VOGUE's different rendering modes present performance-quality tradeoffs. Practical implementations of the Array Based Ray Casting and EM-Cube architectures use slightly modified configurations from a fully parallel design, leading to performance-cost tradeoffs. On the other hand, VIZARD II trades performance versus image quality by using a 9-bit quantized central difference gradient. VIRIM's dataset duplication in the fully parallel implementation illustrates performance-cost tradeoffs. Furthermore, we see general tradeoffs between the different types of parallel rendering algorithms. Image order algorithms exhibit greater flexibility, object order algorithms typically have higher sample processing efficiencies, and hybrid order algorithms can have characteristics of both. The primary objective of the volume rendering architect is to balance these tradeoffs based on the target applications.

Of the surveyed architectures, EM-Cube and VIZARD II are the only architectures still actively being developed. The primary goal of these architectures is an interactive visualization system that will augment a low-cost workstation or desktop computer. EM-Cube and VIZARD II will fit on a PCI card for a standard PC. As such, both architectures are also designed to be low-cost.

Note that two of the three architectures (Array Based Ray Casting and EM-Cube) do not support perspective projections. Although perspective projections may have less importance in medical and scientific visualization, perspective projections are necessary in virtual reality, volume graphics, and stereoscopic rendering applications. A general purpose volume rendering solution must support both types of projections.

Volume rendering is useful in both scientific visualization and volume graphics. These two paradigms do not have the exact same requirements. Scientific visualization is primarily parallel rendering with simple illumination models, whereas volume graphics requires greater rendering flexibility. EM-Cube addresses many issues related to interactive scientific visualization. Although VIZARD II has lower performance, it is a more general rendering architecture with support for perspective projections, multiple cut planes, and segmentation.

IX. Future of Special-Purpose Accelerators

Recently, interactive image generation rates have been achieved on medium resolution (i.e., $512 \times 512 \times 64$) datasets using 3D texture mapping (a forward-projection algorithm) [2]. One advantage of texture hardware is that volumetric primitives can be mixed with geometric primitives. However, texture mapping hardware does not readily support all aspects of volume visualization, such as per-sample gradient computation and per-sample illumination. In addition, texture memory is usually limited. Volume rendering hardware encompasses texture mapping by including 3D filtering (trilinear interpolation and gradient filters) and very high-bandwidth memory; thus, future volume visualization accelerators are likely to be used for high-performance texture mapping. Texture memory is typically four-way (eight-way) interleaved (similar to VIZARD II's and VOGUE's memory system) to allow conflict-free 2D (3D) re-sampling. A potential area of research is an efficient memory organization that readily supports both
volume rendering, texture mapping, and next generation memory technology. In addition, the ability to render both voxel and polygonal primitives in a single projection for virtual reality should be considered.

The architectures surveyed represent the first generation of custom architectures that implement the ray casting algorithm. These architectures primarily focus on high output frame rates. We believe the second generation architectures will address dynamic datasets or large input rates. Fast sampling devices and interactive volume graphics will help promote this trend. Future interactive visualization systems may soon consist of a special purpose accelerator card connected to a real-time data acquisition subsystem. In volume graphics applications, it is expected that volume updates, volume animation, and voxelization will require input frame rates comparable to output frame rates. Consequently, it may be necessary to parallelize voxel input into the volume memory and the memory partitioning scheme in a given architecture may not necessarily be the optimal partitioning scheme for both parallel voxel input

Acknowledgments

Special thanks to Meena Bhashyam for help reviewing the manuscript, and Michael Doggett, Christof Reinhart, Michael Meißner, and Günter Knittel for their useful insight and correspondence regarding their architectures. This work was supported by the Office of Naval Research under grant number N00014-92-J-1252 and CAIP research center at Rutgers State University.

References

[1] I. Bitter and A. Kaufman. A Ray-Slice-Sweep Volume Rendering Engine. In Proceedings of the Siggraph/Eurographics Workshop on Graphics Hardware, pages 121-138, August 1997.
[2] B. Cabral, N. Cam, and J. Foran. Accelerated Volume Rendering and Tomographic Reconstruction using Texture Mapping Hardware. In 1994 Workshop on Volume Visualization, pages 91-98, Washington, DC, October 1994.
[3] R. Crisp. Direct Rambus Technology: The Next Main Memory Standard. IEEE Micro, 17(6), November 1997.
[4] T. J. Cullip and U. Neumann. Accelerating Volume Reconstruction with 3D Texture Mapping Hardware. Technical Report TR93-027, Department of Computer Science at the University of North Carolina, Chapel Hill, 1993.
and output. Also, double bu ered volume memory see Ar-
[5] M. de Bo er, A. Gropl, T. Gun ther, C. Poliwo da, C. Reinhart J.
Hesser, and R. Manner. Latency-Free and Hazard-Free Volume
ray Based Ray Casting will b e necessary to eliminate ren-
Memory Architecture for Direct Volume Rendering. In Proceed-
dering artifacts and p erformance loss when simultaneously
ings of the 11th Eurographics Hardware Workshop, pages 109{
loading and visualizing dynamic datasets. Furthermore, 118, Poitiers, France, August 1996.
[6] M. C. Doggett. An Array Based Design for Real-Time Volume
3D double-bu ering follows the development of the tradi-
Rendering. In 10th Eurographics Workshop on Graphics Hard-
tional graphics pip eline i.e., double-bu ered frame bu er.
ware, pages 93{101, August 1995.
These are areas for additional research.
[7] M. C. Doggett. Vizar : A VideoRate System for Volume Visu-
alization. PhD thesis, University of New South Wales, 1996.
If volume graphics is to o set traditional p olygonal
[8] M. C. Doggett and G. R. Hellestrand. Video Rate Shading for
Volume Data. In Australian Pattern Recognition Society Digital
graphics, volumetric raytracing [48] and p ersp ective pro-
Image Computing : Techniques and Applications, pages 398{
jections must b e considered. Raytracing is capable of pro-
405, Decemb er 1993.
ducing photo-realistic images using shadows, re ection, re-
[9] M. C. Doggett and G. R. Hellestrand. A Hardware Architecture
for Video Rate Smo oth Shading of Volume Data. In Eurographics
fraction, etc. For instance, VIRIM is capable of shadows.
Hardware Workshop, pages 95{102, Septemb er 1994.
Secondary rays during raytracing have less coherence than
[10] S. M. Goldwasser, R. A. Reynolds, and T. Bapty. Physician's
primary rays, therefore, volumetric raytracing may b e b et-
Workstation with Real-Time Performance. IEEE Computer
Graphics and Applications, 512:44{57, Decemb er 1985.
ter suited for image order architectures. Ray casting archi-
[11] R. Grzeszczuk, C. Henn, and R. Yagel. Advanced Geometric
tectures that rely on lo ck-step coherence b etween cast rays
Techniques for Ray Casting Volumes. In SIGGRAPH 98 Course
i.e., Array Based Ray Casting and EM-Cub e mayhave
Nbr. 4, Orlando, FL, 1998.
[12] T. Gun ther, C. Poliwo da, C. Reinhart, J. Hesser, R. Manner,
diculty incorp orating raytracing.
H.-P. Meinzer, and H.-J Baur. VIRIM: A Massively Parallel
Memory throughput increased an order of magnitude
Pro cessor for Real-Time Volume Visualization in Medicine. In
Proceedings of the 9th Eurographics Hardware Workshop, vol-
e.g., Direct Rambus since most of the architectures in
ume 19, No. 5, pages 705{710, 1995.
this pap er were rst prop osed. Advances in semiconductor
[13] B. M. Hemminger, T. J. Cullip, and M. J. North. Interactive
technologies towards deep-submicron pro cesses will con-
Visualization of 3D Medical Image Data. In Technical Report
TR94-027, Department of Radiology and Radiation Oncology
tinue to promote higher logic sp eeds, higher memory den-
at the University of North Carolina, Chap el Hill, 1994.
sity,aswell as lower memory access times. In addition, the
[14] J. Hesser, R. Manner, G. Knittel, W. Straer, H. P ster, and
ability to embed large amounts of memory on-chip with
A. Kaufman. Three Architectures for Volume Rendering. In
Proceedings of Eurographics '95,volume 14, No. 3, Maastricht,
computational units will further enhance memory through-
The Netherlands, Septemb er 1995. Europ ean Computer Graph-
put. All of these trends will simplify future volume render-
ics Asso ciation.
ing architectures, increase their sp eed, and lower their cost.
[15] K. Hohne, M. Bomans, A. Pommert, M. Riemmer, C. Schiers,
U. Tiede, and G. Wieb ecke. 3D Visualization of Tomographic
Furthermore, as these sp ecial-purp ose accelerators evolve,
Volume Data Using The Generalized Voxel Mo del. The Visual
software and application-program interfaces APIs will b e
Computer, 61:28{36, February 1990.
de ned and develop ed [11], [29], [30]. They will provide [16] D. Jackel. The graphics PARCUM system: A 3D Memory Based
Computer Architecture for Pro cessing and Display of Solid Mo d-
the user with more exibility and additional features, such
els. Computer Graphics Forum, 44:21{32, 1985.
as stereoscopic views. The availability of low-cost real-
[17] S. Juskiw, N. Durdle, V. Raso, and D. Hill. Interactive Rendering
of Volumetric Data Sets. Computer and Graphics, 195:685{
time hardware and industry strength APIs will increase
693, 1995.
the acceptance of volume visualization and volume graph-
[18] A. Kaufman. Volume Visualization. IEEE Computer So ciety
ics. This will certainly lead to the development of new and
Press, 1991.
exciting applications. [19] A. Kaufman and R. Bakalash. CUBE - An Architecture Based on
[43] S. W. Smith, H. G. Pavy, and O. T. von Ramm. High-Sp eed Ul- a3DVoxel Map. Theoretical Foundations of Computer Graphics
trasound Volumetric Imaging System { Part I: Transducer De- and CAD, pages 689{700, 1988.
sign and Beam Steering. IEEE Transactions on Ultrasonics,
[20] A. Kaufman and R. Bakalash. Memory and Pro cessing Archi-
Ferroelectrics, and Frequency Control, 382:100{108, 1991.
tecture for 3D Voxel-based Imagery. IEEE Computer Graphics
[44] J. van Scheltinga, J. Smit, and M. Bosma. Design of an On-
and Applications, 86:10{23, Novemb er 1988.
Chip Re ectance Map. In Proceedings of the 10th Eurographics
[21] A. Kaufman, D. Cohen, and R. Yagel. Volume Graphics. IEEE
Workshop on Graphics Hardware, pages 51{55, Maastricht, The
Computer, 267:51{64, July 1993.
Netherlands, August 1995.
[22] G. Knittel. VERVE: Voxel Engine for Real-Time Visualiza-
[45] O. T. von Ramm, S. W. Smith, and H. G. Pavy. High-Sp eed
tion and Examination. Computer Graphics Forum, 193:37{48,
Ultrasound Volumetric Imaging System { Part I I: Parallel Pro-
Septemb er 1993.
cessing and Image Display. IEEE Transactions on Ultrasonics,
[23] G. Knittel. A PCI-based Volume Rendering Accelerator. In
Ferroelectrics, and Frequency Control, 382:109{115, 1991.
In Proceedings of the 10th Eurographics Workshop on Graphics
[46] L. A. Westover. Splatting: A Paral lel, Feed-Forward Volume
Hardware, pages 73{82, August 1995.
Rendering Algorithm. PhD thesis, The University of North Car-
[24] G. Knittel. A Scalable Architecture for Volume Rendering. Com-
olina at Chap el Hill, Department of Computer Science, jul 1991.
puter and graphics, 195:653{665, 1995.
[47] R. Yagel. Towards Real Time Volume Rendering. In Proceed-
[25] G. Knittel and W. Straer. VIZARD-Visualization Accel-
ings of GRAPHICON 1996 Saint-Petersburg, Russia,volume 1,
erator for Realtime Display. In Proceedings of the Sig-
pages 230{241, July 1996.
graph/Eurographics Workshop on Graphics Hardware, pages
[48] R. Yagel, D. Cohen, and A. Kaufman. Discrete Ray Trac-
139{146, August 1997.
ing. IEEE Computer Graphics and Applications, 129:19{28,
[26] P. Lacroute. Analysis of a Parallel Volume Rendering System
Septemb er 1992.
Based on the Shear-Warp Factorization. IEEE Transactions on
[49] R. Yagel and A. Kaufman. Template-Based Volume Viewing.
Visualization and Computer Graphics, 23:218{231, September
Computer Graphics Forum, 113:153{167, Septemb er 1992.
1996.
[50] T. S. Yo o, U. Neumann, H. Fuchs, S. M. Pizer, T. Cullipand J.
[27] P. Lacroute and M. Levoy. Fast Volume Rendering Using a
Rhoades, and R. Whitaker. Direct Visualization of Volume Data.
Shear-Warp Factorization of the Viewing Transformation. In
IEEE Computer Graphics and Applications, pages 63{71, July
Andrew Glassner, editor, Proceedings of SIGGRAPH '94 Or-
1992.
lando, Florida, July 24-29, 1994, Computer Graphics Pro ceed-
[51] K. J. Zuiderveld. Visualization of Multimodality Medical Vol-
ings, Annual Conference Series, pages 451{458. ACM Press, July
ume Data using Object-Oriented Methods. PhD thesis, Utrecht
1994.
University, March 1995.
[28] M. Levoy. Display of Surfaces From Volume Data. IEEE Com-
puter Graphics and Applications, 58:29{37, May 1988.
[29] B. Lichtenb elt. Design of A High Performance Volume Visual-
ization System. In Siggraph/Eurographics Hardware Workshop,
pages 111{119, August 1997.
[30] B. Lichtenb elt, R. Crane, and S. Naqvi. Introduction to Volume
Rendering. Hewlett-Packard Professional Bo oks. Prentice Hall
PTR, 1998.
[31] B. Lichtenb elt M. Bentum and T. Malzb ender. Frequency Anal-
ysis of Gradient Estimators in Volume Rendering. IEEE Trans-
actions on Visualization and Computer Graphics, 23:242{254,
Septemb er 1996.
[32] H.-P. Meinzer, K. Meetz, D. Schepp elmann, U. Engelmann, and
H. J. Baur. The Heidelb erg RayTracing Mo del. IEEE Computer
Graphics and Applications, pages 34{43, Novemb er 1991.
[33] M. Meiner, U. Kanus, and W. Straer. VIZARD I I: A PCI-
Card for Real-Time Volume Rendering. In Proceedings of the
Siggraph/Eurographics Workshop on Graphics Hardware, pages
61{67, Lisb on, Portugal, August 1998.
[34] T. Moller, R. Machira ju, K. Mueller, and R. Yagel. A Com-
parison of Normal Estimation Schemes. In IEEE Visualization
Proceedings 1997, pages 19{26, Octob er 1997.
[35] R. Osb orne, H. P ster, H. lauer, N. McKenzie, S. Gibson,
W. Hiatt, and T. Ohkami. EM-Cub e: An Architecture for
Low-Cost Real-Time Volume Rendering. In Proceedings of the
Siggraph/Eurographics Workshop on Graphics Hardware, pages
131{138, Los Angeles, CA, August 1997.
[36] H. P ster. Architectures for Real-Time Volume Rendering. PhD
thesis, State University of New York at Stony Bro ok, Computer
Science Department, Stony Bro ok, NY 11794-4400, 1996. MERL
Rep ort No. TR-97-04.
[37] H. P ster and A. Kaufman. Cub e-4 - A Scalable Architecture for
Real-Time Volume Rendering. In Volume Visualization Sympo-
sium Proceedings, pages 47{54, Octob er 1996.
[38] H. P ster, A. Kaufman, and T. Chiueh. Cub e-3: A Real-Time
Architecture for High-Resolution Volume Visualization. In In
1994 Workshop on Volume Visualization, pages 75{83, Wash-
ington, DC, Octob er 1994.
[39] H. P ster, A. Kaufman, and F. Wessels. Towards a Scalable
Architecture for Real-Time Volume Rendering. In Proceedings of
the 10th Eurographics Workshop on Graphics Hardware, pages
123{130, Maastricht, The Netherlands, August 1995.
[40] B. Phong. Illumination for Computer Generated Pictures. Com-
munications of the ACM, 186:311{317, 1975.
[41] T. Porter and T. Du . Comp ositing Digital Images. Computer
Graphics, 183, July 1984.
[42] P.Schroder and G. Stoll. Data Parallel Volume Rendering as
Line Drawing. In 1992 Workshop on Volume Visualization, pages 25{31, Boston, MA, Octob er 1992.