<<

Ray Casting Architectures for Volume Visualization

The Harvard community has made this article openly available. Please share how this access benefits you. Your story matters

Citation Ray, Harvey, Hanspeter Pfister, Deborah Silver, and Todd A. Cook. 1999. Ray casting architectures for volume visualization. IEEE Transactions on Visualization and 5(3): 210-223.

Published Version doi:10.1109/2945.795213

Citable link http://nrs.harvard.edu/urn-3:HUL.InstRepos:4138553

Terms of Use This article was downloaded from Harvard University’s DASH repository, and is made available under the terms and conditions applicable to Other Posted Material, as set forth at http:// nrs.harvard.edu/urn-3:HUL.InstRepos:dash.current.terms-of- use#LAA

Ray Casting Architectures for Volume Visualization

Harvey Ray, Hansp eter P ster , Deb orah Silver , Todd A. Co ok

Abstract | Real-time visualization of large volume datasets

demands high p erformance computation, pushing the stor-

age, pro cessing, and data communication requirements to

the limits of current technology. General purp ose paral-

lel pro cessors have b een used to visualize mo derate size

datasets at interactive frame rates; however, the cost and

size of these sup ercomputers inhibits the widespread use

for real-time visualization. This pap er surveys several sp e-

cial purp ose architectures that seek to render volumes at

interactive rates. These sp ecialized visualization accelera-

tors have cost, p erformance, and size advantages over par-

allel pro cessors. All architectures implement ray casting

using parallel and pip elined hardware. Weintro duce a new

metric that normalizes p erformance to compare these ar-

Fig. 1. Volume dataset.

chitectures. The architectures included in this survey are

VOGUE, VIRIM, Array Based Ray Casting, EM-Cub e, and

VIZARD II. We also discuss future applications of sp ecial

ume rendering architectures that seek to achieveinteractive

purp ose accelerators.

volume rendering for rectilinear datasets. A survey of other

metho ds used to achieve real time volume rendering is pre-

I. Introduction

sented in [47]. The motivation for custom volume renderers

OLUME visualization is an imp ortant to ol to view

is discussed in the next section. Several other custom ar-

and analyze large amounts of data from various sci-

V

chitectures exist [1], [8], [10], [16], [17], [19], [36], [38] but

enti c disciplines. It has numerous applications in areas

were not presented b ecause they are either related to the

such as biomedicine, geophysics, computational uid dy-

architectures presented here or are not considered to be

namics, nite element mo dels, and computational chem-

recent. Section III presents three parallel volume render-

istry. Numerical simulations and sampling devices suchas

ing algorithms that are implemented by the architectures

magnetic resonance imaging MRI, computed tomography

in this pap er. Ma jor comp onents of a volume rendering

CT, satellite imaging, and sonar are common sources of

system are discussed in Section IV. Five sp ecialized vol-

large 3D datasets. These datasets are generally anywhere

ume rendering architectures are surveyed in Section V. A

3 3

from 128 to 1024 and may b e non-symmetric i.e., 1024

new metric is intro duced in Section VI to compare each

 1024  512.

architecture. A comparison of the surveyed architectures

Volume rendering involves the pro jection of a volume

is presented in Section VI I and a discussion is presented in

dataset onto a 2D image plane. From Figure 1 we see

section VI I I. Future trends for sp ecialized rendering archi-

that a volume dataset is organized as a 3D arrayofvolume

tectures are presented in Section IX.

1

elements, or .

II. Need for Custom Visualization

Voxels representvarious physical characteristics, suchas

Architectures

density, temp erature, velo city, and pressure. Other mea-

surements, such as area and volume, can b e extracted from

A real-time volume rendering system is imp ortant for the

the volume datasets. Volume data may contain more than

following reasons [37]: 1 to visualize rapidly changing 4D

ahundred million values requiring a large amountof

spatial-temp oral datasets, 2 for real-time exploration of

storage. In Figure 1, the voxels are uniform in size and reg-

3D datasets e.g., virtual reality, 3 for interactive ma-

ularly spaced on a rectilinear grid. Other typ es of volume

nipulation of visualization parameters e.g., classi cation,

data can b e classi ed into curvilinear grids, which can b e

and 4 interactivevolume graphics [21]. As the sampling

thought of as resulting from a warping of a regular grid,

rates of devices b ecome faster, it will b e p ossible to gener-

and unstructured grids, which consist of arbitrary shap ed

ate several 3D datasets at interactive rates; real-time vol-

cells. This pap er presents a survey of recent custom vol-

ume rendering is required to visualize these dynamically

changing datasets e.g., for 3D ultrasound [43], [45]. It

Harvey Ray is a Ph.D. student at Rutgers State University, Email:

is often necessary to view the dataset from continuously

[email protected]

changing p ositions to b etter understand the data b eing vi-

Hansp eter P ster is with Mitsubishi Electric Research, Email: p s-

[email protected]

sualized; real-time volume rendering will enhance visual

Deb orah Silver is an asso ciate professor at Rutgers State University,

depth cues through motion and o cclusion as the dataset

Email: [email protected]

is viewed from varying p ositions. Classi cation is imp or-

To dd Co ok is a research and development engineer at Improv Sys-

tem Inc., Email: to [email protected]

tant in correctly visualizing the dataset by con guring ob-

1

Note, the term voxel has b een used to refer to p oint samples and

ject prop erties opacity, color, etc. based on voxel values;

cubic volume elements. The pap ers surveyed here use b oth de nitions

for illustration purp oses. Therefore, gures in this pap er will use a

as necessary. p oint sample representation or a unit volume representation of a voxel

classi cation is an iterative pro cess which will b ene t from this pap er implement ray casting, a common backward-

real-time volume rendering; thus, scientists will be able pro jection algorithm [28]. The ray casting algorithm is ca-

to interactively manipulate opacity and color mappings. pable of pro ducing high-quality images and a large degree

Volume graphics is an emerging area of research that pro- of parallelism can b e exploited from the algorithm.

duces synthetic datasets [21]. Volume graphics challenges

In ray casting, rays are cast into the dataset. Eachray

the way 3D graphics is currently implemented. Traditional

originates at the viewing eye p osition, p enetrates a pixel

3D graphics use p olygonal meshes to mo del ob jects and

in the image plane screen, and passes through the dataset.

these meshes are scan-converted into pixels inside the frame

At evenly spaced intervals along the ray, sample values

bu er. Alternatively,volume graphics mo dels ob jects as a

are computed using interp olation. The sample values are

3D discrete set of p oint samples voxels. These voxels

mapp ed to display prop erties such as opacity and color. A

comprise the 3D dataset. The dataset is rendered using

lo cal gradient is combined with a lo cal illumination mo del

standard volume visualization techniques.

at each sample p oint to provide realistic of the

Real-time visualization of large 3D datasets places strin- ob ject. Final pixel values are found by comp ositing color

and opacity values along a ray. Comp osition mo dels the

gent computational demands on mo dern workstations, es-

physical re ection and absorption of light.

p ecially on the memory system. Table I estimates the mem-

ory bandwidth to render di erent size datasets at 30Hz. It

Because of the high computational requirements of vol-

is assumed that the volume rendering algorithm accesses

ume rendering, the data needs to be pro cessed in a

each voxel once per pro jection. The required memory

pip elined and parallel manner. Parallel ray casting algo-

bandwidth can not b e sustained on most mo dern worksta-

rithms use one of the following pro cessing strategies: ob-

tions and p ersonal computers. The dataset must be par-

ject order, image order, or hybrid order [14]. This division

titioned among multiple memory mo dules to achieve the

describ es the manner in which a dataset is pro cessed. Fig-

desired bandwidth and parallel pro cessing must b e used.

ure 2 illustrates the three variations.

TABLE I

Estimated memory bandwidth for real-time volume rendering. Intermediate Plane

Image Plane Image Plane

Frame Rate Hz Memory Bandwidth

Dataset Size Image Plane

3

128  16 30 120 MB/s

3

256  16 30 960 MB/s

3

512  16 30 7.5 GB/s

3

 16 30 60 GB/s

1024 A) Image order B) Object order C) Hybrid order

Fig. 2. Ray casting categories.

Massively parallel pro cessors and multipro cessors archi-

tectures [2], [4], [13], [26], [42], [50] have achieved image

A dataset in Figure 2 is organized as a set of parallel

generation rates up to 30Hz on mo derate sized datasets

slices. Image order algorithms cast rays through the image

using algorithmic optimizations; however, the cost of these

plane and re-sample at lo cations along the ray Figure 2A.

machines is prohibitive. In addition, the algorithmic op-

They o er exibility for algorithmic optimizations, but ac-

timizations are usually dataset dep endent. Custom archi-

cessing the volume memory in a non-predictable manner

tectures have the p otential to match or exceed the p erfor-

signi cantly slows down memory p erformance. Ob ject or-

mance of other interactive visualization solutions at a lower

der algorithms require that the dataset be re-sampled so

cost and smaller size. Performance, cost, and size b ene ts

that the slices are aligned with the view direction Fig-

are necessary for a desktop interactive visualization system.

ure 2B. A ma jor advantage of ob ject order algorithms

III. Parallel Ray Casting

is that accesses to the volume memory are predictable,

thereby, leading to ecient memory bandwidth utilization.

Volume rendering involves the direct pro jection of the en-

Hybrid order algorithms pro ject the dataset to the face of

tire 3D dataset onto a 2D display. Volume rendering algo-

the dataset most parallel to the image plane. This also

rithms can simultaneously reveal multiple surfaces, amor-

allows for predictable memory accesses to the volume data

phous structures, and other internal structures of a 3D

where no more than one sample is taken per voxel. The

dataset [18]. These algorithms can be divided into two

intermediate 2D image is warp ed into the nal image Fig-

categories: forward-pro jection and backward-pro jection.

ure 2C. The shear-warp algorithm is an example of a hy-

Forward-pro jection algorithms iterate over the dataset dur-

brid order algorithm [27]. A summary of implementation

ing the rendering pro cess pro jecting voxels onto the image

tradeo s for each parallel scheme is shown in Table I I.

plane. A common forward-pro jection algorithm is splat-

ting [46]. Backward-pro jection algorithms iterate over the

IV. Components of a Ray Casting System

image plane during the rendering pro cess re-sampling the

The following comp onents are needed for anyray casting

dataset at evenly spaced intervals along each viewing ray.

implementation:

In general, ray casting algorithms traverse the dataset in

Memory system provides the necessary voxel values at a a more random manner. All architectures surveyed in

TABLE I I

Tradeoffs for different parallelization methods.

Image order Ob ject order Hybrid order

Advantages - Easy to implement - Regular memory access - Merge b ene ts of

algorithmic optimizations patterns image order and ob ject

e.g., early-ray termination order algorithms

Disadvantages - "Random" memory access - Dicult to implement -Persp ective pro jections

patterns algorithmic optimizations adversely a ect p erformance

- Non-uniform mapping of - Additional 2D image

ray samples to voxels warp required

rate which ultimately dictates the p erformance of the ar-

chitecture.

Ray-path calculation determines the voxels that are p ene-

trated by a given ray; it is tightly coupled with the organi-

zation of the memory system.

Interpolation estimates the value at a re-sample lo cation

using a small neighb orho o d of voxel values.

Gradient estimation estimates a surface normal using a

neighb orho o d of voxels surrounding the re-sample lo cation.

Classi cation maps interp olated sample values and the es-

timated surface normal to a color and opacity.

Shading uses gradient and classi cation information to

compute a color that takes into account the interaction

of light on the estimated surfaces in the dataset.

Composition uses shaded color values and opacity to com-

pute a nal pixel color for display.

Fig. 3. Common memory organization schemes.

A. Memory System

partitioning scheme, the memory throughput is maximized

The memory system is the most imp ortant comp onentof

for the three orthogonal viewing directions.

a visualization architecture. The memory system contains

The maximum p erformance obtained by a volume ren-

the dataset and is resp onsible for supplying the compu-

dering architecture is primarily determined by the de-

tational units with voxel values at a high bandwidth to

gree of parallelism and the memory technology used.

supp ort the target frame rate. Since the dataset will be

Recent memory devices use pip elining to accelerate lin-

visualized from various view p ositions, the throughput of

ear accesses. Synchronous memories, such as Syn-

the memory system should b e as view indep endent as p ossi-

chronous DRAM SDRAM, can sustain memory accesses

ble. Regardless of the parallel pro cessing strategy, eachray

at 150MHz. This is a three-fold sp eed up over previous

casting algorithm requires simultaneous access to multiple

alternatives. More recently, Rambus de ned a high-sp eed

voxels. Ideally, the memory system provides these voxels in

interface that will allow sustainable bandwidths up to 800

a con ict-free manner; otherwise, the overall system may

MB/s using an 8-bit bus. Using a wider 16-bit bus Direct

su er p erformance degradation.

Rambus, these devices are able to sustain 1.6 GB/s data

The architectures surveyed in this pap er use four mem-

throughput [3]. These advanced memories can p otentially

ory partitioning schemes shown in Figure 3 to achieve a

enhance p erformance for any given architecture. A metric

high memory throughput. Sub-blo ck partitioning Fig-

that measures the ability of a volume rendering architec-

ure 3A divides the dataset into smaller volumes. Each

ture to utilize available memory bandwidth is presented in

sub-blo ck is assigned to a di erent memory mo dule. Or-

Section VI.

thogonal slice partitioning Figure 3B assigns each slice

inside the dataset to a memory mo dule. Each slice is p er-

B. Ray-Path Calculation

p endicular to one axis of the dataset. In this partitioning

scheme, memory throughput is maximized for two of the Calculating ray-voxel intersections is tightly coupled

three orthogonal viewing directions. The eight-wayinter- with the memory system design and is related to the typ e

leaved memory system Figure 3C assigns each voxel in of ray casting algorithm used. The appropriate memory

a 2  2  2 blo ck to separate memory banks. The eight- addresses for each voxel that a ray p enetrates must be

wayinterleaved memory partition is limited to eight par- computed. These addresses are calculated by construct-

allel memory accesses. As a result, it can be combined ing a line ray b etween the viewing p osition and a pixel

with sub-blo ck partitioning when additional parallelism is on the image plane and extending the line ray through

necessary. The skewed non-orthogonal slice partitioning the dataset. Based on the pro cessing strategies from the



scheme Figure 3D assigns slices that make a 45 angle previous section, it may be necessary to calculate a sub-

with each axis of the dataset to memory mo dules. In this stantial number of memory addresses in parallel. Lo ok-

neighb oring voxels and interp olated to yield the gradient up tables or templates have b een used to reduce the

at the re-sample lo cation. computation involved in calculating ray paths through the

dataset [49]. For parallel pro jections and hybrid order ar-

In practice, several gradient estimation schemes ex-

chitectures, templates only need to b e generated once p er

ist [15], [18], [31], [34], [51]. A comprehensive considera-

pro jection b ecause all rays have the same slop e.

tion of these metho ds is beyond the scop e of this survey.

In general, high-quality gradient estimation requires ad-

C. Interpolation

ditional computation and memory bandwidth or on-chip

storage that may a ect p erformance and cost.

Estimating the sample value requires evaluation of the

trilinear interp olation equation:

E. Classi cation

S i; j; k  = P 1 i1 j 1 k 

000

Classi cation maps a color and opacity to sample val-

+ P i1 j 1 k +P 1 ij 1 k 

100 010

ues. Opacity values range from 0 transparent to 1:0

+ P ij 1 k +P 1 i1 j k

110 001

opaque [28]. Classi cation is typically implemented in

+ P i1 j k + P 1 ijk + P ij k

101 011 111

hardware using lo ok-up tables LUTs. These LUTs are

1

typically addressed by sample value and/or gradient mag-

i, j , and k are fractional o sets of the sample p osition in

nitude, and they output sample opacity and color. It is de-

the x, y , and z directions, resp ectively. These variables are

sirable to b e able to mo dify the information in these LUTs

between 0 and 1. P is a voxel whose relative p osition in

abc

during the visualization pro cess in real-time. If the archi-

a2 2  2 neighb orho o d of voxels is a; b; c. a, b, and c are

tecture pro cesses multiple re-sample lo cations in parallel,

the least signi cant bit of the x, y , and z sample p osition,

these LUTs must b e duplicated to avoid contention.

resp ectively. From Equation 1, we see that a total of 24

multiplications are necessary and eightvoxel values are re-

F. Shading

quired to compute each re-sample lo cation. The number of

The Phong shading algorithm [40], or variants, are of-

multiplications can b e reduced by approximately one-half

ten used in the shading subsystems of volume rendering

if factors are re-used. If we assume that each1 1  1 unit

architectures. This algorithm requires gradients, light and

volume in a 512  512  512 dataset contains one re-sample

re ection vectors to calculate the shaded color for each re-

lo cation p er pro jection, then more than 1.5 billion multipli-

sample lo cation. The algorithm involves computationally

cations would be necessary for each pro jection. Multiple

exp ensive division, multiplication, and exp onentiation that

pro jections are needed per second for interactive pro jec-

must b e implemented in hardware. In practice, the shading

tion rates, requiring an enormous amount of computational

algorithm is implemented in either arithmetic units for ac-

power. As few as eightmultiplications are necessary if the

curacy or re ectance LUTs for exibility [44]. For color im-

interp olation weights are stored in a lo ok-up table. Higher-

ages, the Phong shading mo dels may b e applied to the red,

order interp olation can be used to improve image-quality

green, and blue comp onents. Also, additional computation

but it is typically not done in hardware b ecause of its com-

may b e necessary if multiple light sources are supp orted.

putational cost.

G. Compositing

D. Gradient Estimation

The comp osition system is resp onsible for summing up

The next step is the determination of gradients to ap-

color and opacity contributions from re-sample lo cations

proximate surface normals for classi cation and shading.

along a ray into a nal pixel color for display [41]. The

x-, y -, and z -gradients may b e computed using central dif-

front-to-back formulation for comp ositing is:

ferences:

C = 1:0   C + C

S i+1;j;k S i1;j;k 

Acc Acc sample Acc

G =

x

x

3

= 1:0   +

Acc Acc sample Acc

S i;j +1;k S i;j 1;k 

2

G =

y

y

C is the accumulated color, is the accumulated

Acc Acc

S i;j;k +1S i;j;k 1

opacity, C is the samples color, and is the

sample sample

G =

z

z

samples opacity. Twomultiplies are needed to comp osite

S i; j; k  is the interp olated sample at the lo cation i; j; k 

each re-sample lo cation. Comp ositing in a front-to-back

inside the dataset. x , y , and z is the spacing be-

order allows for early ray termination if a desired opacity

tween samples in x, y , and z directions, resp ectively. The

threshold has b een reached. Back-to-front comp osition can

costly divisions are usually avoided b ecause of the regular

be utilized to simplify the calculation; however, early ray

spacing b etween voxels inside the dataset. Two re-sample

termination is not p ossible. Color information pro duced

lo cations adjacent to the sample lo cation in each direction

from the comp ositing system is stored into a frame bu er

are required to compute the gradient using central di er-

for display.

ences. Some algorithms use a larger neighb orho o d of voxels

to generate images that app ear smo other and/or to reduce

V. Architecture Survey

temp oral aliasing. In addition to the gradientvector com-

p onents, the gradient magnitude and the normalized gradi- This section presents ve sp ecial purp ose volume ren-

entvector may b e required. Gradients can also b e taken at dering architectures. A description of each architecture is

interp olation uses a pair of opp osite face-values to compute given along with its p erformance. The following architec-

the nal sample value. The REX is a pip elined unit and tures are surveyed, in chronological order of their develop-

pro duces one interp olation value p er clo ck cycle. ment: VOGUE, VIRIM, Array Based Ray Casting, EM-

Cub e, and VIZARD II. Each gure in this section were

In addition to interp olation, the REX unit also p er-

redrawn from their original publication.

forms gradient calculation. Gradient calculation requires

1 memory accesses for the fastest gradientmode 8-voxel

A. VOGUE

gradient, 4 memory accesses for the intermediate mo de

The VOGUE architecture [22], [24] was develop ed at the

32-voxel gradient, and 7 memory accesses for the highest

University of Tubingen, Germany. One rendering engine

quality gradient mo de 56-voxel gradient. In the fastest

provides high-quality, volume-rendered images with mul-

gradient mo de, opp osite face-values computed during tri-

tiple light sources using four custom VLSI chips. A blo ck

linear interp olation are used to compute gradients. The

diagram of the architecture is shown in Figure 4. The main

higher quality gradient mo des require additional voxels and

interp olation. The REX unit can pro duce one gradientvec-

tor and magnitude p er clo ck cycle. The REX unit contains

three pip elined square units and one square ro ot unit to

compute the gradient magnitude.

Classi cation information is stored in three LUTs: sp ec-

ular co ecient, color, and opacity. These tables are in-

dexed using the sample value, gradient, and gradient mag-

nitude. These values are subsequently used by the shading

unit COLOSSUS and comp ositing unit COMET.

The COLOSSUS shading unit implements the unre-

stricted Phong illumination mo del and depth cueing. The

sp ecular co ecient from the LUT along with the gradi-

entvector, lightvector, and ambient co ecient are passed

Fig. 4. VOGUE architecture.

to the COLOSSUS chip. The COLOSSUS chip internally

converts op erands to logarithms to reduce multiplication

goals of VOGUE are exibility and compactness. VOGUE

and division to simple addition and subtraction, resp ec-

is capable of three rendering mo des based on the gradient

tively. The costly exp onentiation op eration required by the

estimation metho d: a fast 8-voxel gradient, a slower in-

Phong illumination mo del is reduced to a multiply; how-

termediate quality 32-voxel gradient, and a higher quality

ever, fast logarithmic converters are necessary. These units

56-voxel gradient. VOGUE hardware consist of an Ad-

are pip elined to achieve the desired system p erformance.

dress SeQuencer ASQ for memory addressing, a volume

Shaded samples are comp osited in the COMET chip.

memory VoluMem for dataset storage, a Reconstructor

The COMET chip requires an opacity, from a LUT, and

EXtractor REX for interp olation, a COLOSSUS unit for

color values from the COLOSSUS chip. These values are

shading, and a COMET unit for comp osition. VOGUE

comp osited into a nal pixel color that is passed to the

implements an unrestricted Phong illumination mo del in

frame bu er.

addition to depth cueing.

A.2 Performance

A.1 Description

Estimated p erformance of one VOGUE mo dule, contain-

The Volume Memory VoluMem is organized as an

ing the four VLSI units ASQ, REX, COLOSSUS, and

eight-way interleaved memory system see Figure 3C

3

COMET, is 2:5 frames/second for 256 datasets using the

which allows eightvoxels surrounding a trilinear re-sample

fastest rendering mo de. For higher p erformance, several

lo cation to b e retrieved in parallel.

rendering mo dules are connected to other mo dules in a

The ASQ unit provides necessary addresses for the Vol-

ring network. To achieve larger memory throughput, a

uMem. It generates addresses for the voxels involved in

fully-parallel implementation uses sub-blo ck partitioning

re-sampling and gradient estimation. Aray's initial p osi-

to globally partition the dataset. Each sub-blo ck is lo-

tion and incremental values to the next re-sample lo cation

cally partitioned using the eight-way memory interleaving

are computed by the host computer and passed to the ASQ

scheme and is stored into the VoluMem of a given rendering

where they are incremented to compute the address of the

mo dule. Boundary voxels are replicated among adjacent

eightvoxels surrounding the re-sample lo cation.

rendering mo dules to enhance p erformance.

The REX unit p erforms trilinear interp olation using the

eight voxels from VoluMem to compute the re-sampled VOGUE is capable of p ersp ective pro jection and is able

value. The REX contains three stages of linear interp o- to utilize early ray termination. The estimated p erfor-

lators. Adjacent voxels from the trilinear interp olation mance using the fastest 1-access gradient mo de is 20Hz

3

neighb orho o d are used in linear interp olations to compute using eight mo dules for 256 datasets and using 64 mo d-

3

edge-values, then pairs of edge-values are used in linear ules for 512 datasets. VOGUE's highest quality gradient

interp olations to compute face-values, and the last linear mo de improves image-quality,however, p erformance is low-

Digital Signal Pro cessor DSP b oards ray casting unit in ered to 2 frames/second.

Figure 5 using a sp ecialized bus and stored into rst-in

B. VIRIM

rst-out memories FIFOs. The geometry units are much

faster than the DSPs; therefore, the FIFOs are required to

The VIRIM architecture has b een develop ed and assem-

de-couple sp eed di erences b etween the two units.

bled at the University of Mannheim [12] to achieve real-

The DSP b oard implements ray casting with the Hei-

time visualization on mo derate sized datasets 256  256 

delb erg illumination mo del. In the Heidelb erg mo del, the

128 with high image quality. VIRIM is an ob ject order

dataset is rotated such that viewer lo oks along a ma jor

ray casting engine that uses the Heidelb erg raytracing al-

axis. Two light sources enter the volume. One light source

gorithm [32] discussed b elow. The VIRIM architecture is

is along the direction of the viewer and the other light

shown in Figure 5. It consist of a geometry unit and a ray



source is 45 from the rst light source. Light intensity

y-slice, and the nal illumination value

Address Generator is calculated slice-b

p er sample is generated by the summation of all light in-

y emitted in the viewers direction. The Heidelb erg 2 Independent Interpolation tensit

Banks of 8 Units Each Weight Memory

raytracing algorithm can account for re ection, absorption,

Density LUT

emission of light, and is capable of pro ducing shadows.

DSP b oard contains eight DSP chips and a CPU. Interpolation Tree The

Geometry

Floating p oint op erations for the shading and visualiza-

Unit X-, Y- Gradient Processor

tion algorithm are p erformed by the DSPs. The DSPs are

programmable and provide exibility for the architecture

to implement di erentvolume rendering and shading algo-

DSPs DSPs rithms.

Ray-casting

Performance

Board Master Unit Board Master B.2

VIRIM is capable of pro ducing shadows and supp orts

Host Bus

p ersp ective pro jections. One VIRIM mo dule with four

b oards has b een assembled and achieves 2:5Hz frame rates

Fig. 5. VIRIM architecture.

for 256  256  128 datasets. Toachieveinteractive frame

rates, multiple rendering mo dules have to b e used; however,

casting unit. The geometry unit is resp onsible for inter-

dataset duplication is required. Four mo dules 16 b oards

p olation and gradient calculation; the ray casting unit is

are estimated to achieve 10Hz for the same dataset size,

resp onsible for implementing the actual ray casting algo-

and eight mo dules 32 b oards are estimated to achieve

rithm.

3

10Hz for 256 datasets [14].

B.1 Description

C. Array BasedRay Casting

The rotation of the dataset o ccurs on dedicated rota-

The Array Based Ray Casting engine develop ed at the

tion hardware called the Rotator Board geometry unit in

Universityof New South Wales [6] is an ob ject order ray

Figure 5. The Rotator Board aligns the dataset with the

casting architecture. This architecture consists of two par-

viewing p osition. The Rotator Board consists of the vol-

allel pip elined arrays used to rotate the dataset and to cast

ume memory, a geometry pro cessor, an interp olation pro-

rays, as illustrated in Figure 6. These rotation arrays are

cessor, and a gradient pro cessor.

The dataset is stored in an eight-wayinterleaved mem-

ory system. The dataset is rotated using backward map-

Double 1.5n

eighted interp ola- ping from the re-sample p osition and a w Warp Array Ray Array Frame

Input Buffered Rendering

tvoxel neighborhood. Arbitrary e.g.,

tion mask on an eigh Buffer

Stream Input Pipelines

terp olation weights can b e used in the 8-voxel

Gaussian in Memory

neighb orho o d instead of trilinear interp olation. The geom-

etry pro cessor generates the addresses for the eight memory

Unlike other architectures,

banks using a rotation matrix. Scanline 1.5n

VIRIM do es an interp olation on classi ed density values.

The mappings are stored in eight LUTs that can b e freely

mo di ed.

A mo di ed 2D Sob el lter is used to estimate the X

and Y comp onents of the gradientvector in the re-sampled

Scanline 0

system. Because of this, the gradient is only co ordinate y

Dataset z Slice Shear Ray Casting

wo-dimensional and view dep endent. The output of the

t x

rotator b oard are the density and gradient values for a

Fig. 6. Array BasedRay Casting architecture.

sample lo cation. These comp onents are transferred to the

connected between n memory mo dules and 1:5n render- lter is used in the Rendering Pip elines to estimated gradi-

ing pip elines, n is the resolution of the dataset. In the ents. The gradient estimation and shading algorithm uses

second array,intersections with voxels are determined by a full 26-voxel neighb orho o d and creates smo othly shaded

using nearest neighbor or zero order interp olation. Each images [9].

rendering pip eline p erforms shading and comp osition for

C.2 Performance

a given scanline. In addition, the system is comp osed of

a double-bu ered input memory, memory swapping array,

If n is the dimension of the volume data, the size of the

and a frame bu er.

Warp Array and Ray Array are 1:5n  n and 1:5n  1:5n,

3

resp ectively. For 256 datasets, this corresp onds to approx-

C.1 Description

imately 212,992 pro cessing elements. An additional col-

umn in each array contains the co ordinate initializers. The

The volume dataset is stored in a double-bu ered vol-

architecture also contains 1:5n rendering pip elines. The

ume memory that allows the simultaneous loading of one

Warp Array for this dataset dimension is estimated to t

dataset and visualization of another. The memory sys-

inside a 5  5 array of FPGAs. Pro cessing elements in the

tem uses orthogonal slice partitioning see Figure 3B. The

Ray Array are larger than those in the Warp Array and re-

dataset is stored in memory in a view dep endent manner

quire more hardware. However, a smaller Ray Array i.e.,

using co ordinate swapping. Using a spherical co ordinate

with fewer columns can b e used by time-multiplexing the

system, view p ositions are classi ed as b eing in one of eight

Ray Array and stalling the Warp Array, thereby reducing

primary o ctant regions. As the dataset is loaded, co ordi-

throughput.

nate swapping o ccurs based on the view p osition to allow

This architecture only supp orts parallel rendering and

con ict-free access to b eams. Note that co ordinate swap-

3

is capable of 15Hz frame rates for 256 datasets shar-

ping p erforms a partial rotation. Limited rotations ab out

ing the Ray Array 10 times. The architecture has under-

the X- and Y- axis o ccurs in the Warp Array and Ray Ar-

gone several changes since its publication and is now called

ray, resp ectively. These three partial rotations allow gen-

VIZAR [7].

eral rotation of the dataset.

Vertical b eams, indicated by similar shaded voxels in

D. EM-Cube

Figure 6, are loaded into the Warp Array in one clo ck cy-



EM-Cub e is a commercial version of the high-

cle. The Warp Array rotates the volume by 45 around

p erformance Cub e-4 [36], [37], [39] volume rendering ar-

the X-axis by shearing slices in Y. The shear is accom-

chitecture that was originally develop ed at the State Uni-

plished in the Warp Array by shifting b eams of voxels

versity of New York at Stony Bro ok. EM-Cub e is currently

based on a comparison of the row co ordinate and the

under development at Mitsubishi Electric Research Lab o-

b eam's rotated Y-co ordinate. The rst column of the Warp

ratory [35]. The Cub e family of architectures are charac-

Array computes these Y-co ordinates for each voxel, and

terized by memory skewing. EM-Cub e is a highly parallel

the remaining columns contain simple pro cessing elements.

architecture based on the hybrid order ray casting algo-

These elements p erform three basic functions: shift-right,

rithm shown in Figure 2C. Rays are sentinto the dataset

shift-right-up, and shift-right-down. The rows in b oth the

from each pixel on the base plane, which is co-planar to the

Warp Array and Ray Array corresp ond to a discrete Y-

face of the dataset that is most parallel to the image plane.

co ordinate. However, explicit Y-co ordinate information is

Because the image plane is typically at some angle to the

only stored in the Warp Array.

base-plane, the resulting base-plane image is 2D warp ed

Voxels in the rightmost column of the Warp Array pro-

onto the image plane.

ceed to adjacent pro cessing elements in the leftmost column

The main advantage of this algorithm is that voxels can

of the Ray Array. The Ray Array casts parallel rays into

b e read and pro cessed in planes of voxels so called slices

the sheared YZ-slices. As indicated in Figure 6, a row in-

that are parallel to the base-plane [37]. Within a slice,

side the Ray Array corresp onds to a scanline of rays. Ini-

voxels are read from memory a b eam of voxels at a time,

tializers compute X- and Z-co ordinates for voxels during

in top to b ottom order. This leads to regular, ob ject order

ray casting. Each ray's initial co ordinate and increment

data access. The EM-Cub e architecture utilizes memory

vector is shifted into place inside the Ray Array b efore ray

skewing [20] on a blo ck granularity for con ict-free b eam

casting. The pro cessing elements in the Ray Array im-

access.

plement a Compare-and-Shift-Right function. If a voxels

co ordinate matchesaray's current co ordinate, a ag is set

D.1 Description

which pro ceeds through the remainder of the array with

the voxel and co ordinate data. The Ray Array implements

EM-Cub e will be implemented as a PCI card for Win-

nearest neighb or or zero order interp olation.

dows NT computers. The card will contain one volume ren-

A one dimensional array of rendering pip elines classi es, dering ASIC, 32 Mbytes of volume memory, and 16 Mbytes

shades, and comp osites the voxels along the discrete rays. of lo cal pixel storage. The warping and displayof the -

To estimate gradients, each element in the Ray Arrays has nal image will b e done on an o -the-shelf 3D graphics card

additional registers to bu er voxel information. Voxels tra- with 2D . The EM-Cub e volume rendering

verse a row inside the Ray Array three times. A3 3box ASIC, shown in Figure 7, contains eight identical render-

E. VIZARD II ing pip elines, arranged side by side, and interfaces to voxel

memory, pixel memory, and the PCI bus. Each pip eline

The VIZARD II architecture is b eing develop ed at the

UniversityofTubingen to bring interactiveray casting into

SDRAM SDRAM SDRAM SDRAM

the realm of desktop computers [33]. This is the second

EM-Cube These image or- Voxel Memory Interface generation of VIZARD systems [23], [25].

ASIC Interpolation

der architectures are characterized by metho ds to reduce Slice Buffers

Pipeline 0 Pipeline 1 Pipeline 2 Pipeline 3 Pipeline 4 Pipeline 5 Pipeline 6 Pipeline 7 Gradient

memory bandwidth requirements for interactive visualiza-

Estimation

tion. While VIZARD uses a pre-shaded and pre-classi ed

PCI Interface

dataset, VIZARD II only prepro cesses gradi-

Shading & compressed

Classification

ents that are stored into a quantized gradient table. Using

tral di erences as the underlying gradient lter, prepro-

Pixel Memory Interface Compositing cen

cessing the gradient lter requires only a few seconds and

SDRAM SDRAM SDRAM SDRAM

is only p erformed once p er dataset. Gradient quantization

p otentially allows VIZARD II to implement global gradi-

Fig. 7. EM-Cub e architecture with eight identical ray casting

ents. VIZARD I I was designed to interface to a standard

pip elines.

PC system using the PCI bus. The dataset is stored in four

interleaved memory banks along with a pre-computed gra-

communicates with voxel and pixel memory and the two

dient index, segmentation index, and gradient magnitude

neighb oring pip elines. Pip elines on the far left and right

for each voxel. The combination of pre-computed gradi-

are connected to each other in a wrap-around fashion indi-

ents, caching, and early ray termination reduces the band-

cated by grey arrows in Figure 7. A main characteristic of

width requirements of the memory system. Added exi-

EM-Cub e is that eachvoxel is read from volume memory

bilityis obtained by using a DSP and FPGAs as the im-

exactly once per frame. Voxels and intermediate results

plementation technology. This allows the VIZARD I I card

are cached in so called slice bu ers so that they b ecome

to p erform other visualization task such as reconstruction,

available for calculations precisely when needed.

ltering, and segmentation.

EM-Cub e is a parallel pro jection engine with multiple

rendering pip elines. Each pip eline implements the ray cast-

E.1 Description

ing algorithm. Samples along eachray are calculated using

The VIZARD II architecture is illustrated in Figure 8.

trilinear interp olation. A 3D gradient is computed using

VIZARD II consists of four functional blo cks: Control

central di erences b etween trilinear samples. The gradient

is used in the shader stage, which computes the sample

Viewing Parameters Control

according to the unrestricted Phong lighting

illumination Unit Lo ok-up tables in the

mo del using a re ectance LUT [44]. (CU)

classi cation stage assign color and opacity to each sample

p oint. Finally, the illuminated samples are accumulated Address

Memory

to base plane pixels using front-to-back comp ositing. in Multiplexer

Unit (MU)

Volume memory is organized as four 64-Mbit 16-bit wide

SDRAMs for 32 Mbytes of volume storage. The volume

data is stored as a 2  2  2 blo cks of neighb oring voxels.

Volume Memory

ks are read and written in bursts of eight voxels

Miniblo c M1 M2 M3 M4

using the fast burst mo de of SDRAMs. In addition, EM-

Cub e uses linear skewing of these blo cks. Skewing guaran-

tees that the rendering pip elines always have access to four

Trilinear Interpolation Unit (TIU)

t miniblo cks in any of the three slice orientations.

adjacen Sample and Gradient

D.2 Performance

Shade and

r, g, b, α

is a parallel ray casting engine that imple- EM-Cub e Composit Unit

(SCU)

ments ahybrid order algorithm; however, EM-Cub e do es

not supp ort p ersp ective pro jections. Each of the four

Fig. 8. VIZARD II architecture.

SDRAMs provides burst-mo de access at up to 133MHz,

6

for a sustained memory bandwidth of 4  133  10 = 533

million 16-bit voxels p er second. Each rendering pip eline Unit, Memory Unit, Trilinear Interp olation Unit, and

op erates at 66MHz and can accept a new voxel from its Shading/Comp ositing Unit. The Control Unit is deter-

SDRAM memory every cycle. Eight pip elines op erating in mines intersections of the rays with the dataset and cut

6

parallel can pro cess 8  66  10 or approximately 533 mil- planes. The Memory Unit stores the dataset in four

3

lion samples p er second. This is sucient to render 256 SDRAM mo dules, each with its own SRAM cache. The Tri-

volumes at 30 frames p er second. linear Interp olation Unit is resp onsible for re-sampling the

dataset. It also interp olates the gradients at o -grid lo ca- VI. Performance Metrics

tions using eight parallel lo okups to the quantized gradient

The p erformance of a volume rendering architecture is

table. The Shading and Comp ositing Unit supp orts lo ok-

determined by several factors:

up-based shading and multiple classi cation tables. Final

Frame Rate is the numb er of images that can b e generated

pixels values are transferred through the PCI bus to the

p er unit of time and is measured in frames p er second or

host computer.

Hz.

VIZARD II implements an image order algorithm that

Samples Processed Per Second SPPS is the number of

utilizes early ray termination. The algorithm rst pre-

ltered samples that can b e generated p er unit of time. Un-

pro cesses the dataset to compute gradient indices for each

like frame rate, SPPS is not sensitive to image and dataset

voxel. The gradient index contains 9 bits but is not lim-

resolution. SPPS is similar to trilinear interp olated sam-

ited to 9 bits. Alternatively, the full gradient comp onent

ples p er second that is commonly used in the sp eci cation

could b e stored with the voxel memory. The 9-bit gradient

of 3D texture mapping hardware.

index generates 512 table entries. Using 512 entries, the

Latency is the time b etween a change in dataset or viewing

average error in the gradient computation is 2:3 degrees

parameters and the display of the up dated image.

and the maximum error is 7:9 degrees [33]. Larger gradi-

Image quality is mainly a qualitative assessment that is re-

ent tables can be used for greater accuracy. Four voxels

lated to the resolution of the generated images, interp ola-

including gradient index are simultaneously accessed in

tion lter, gradient lter, and illumination mo dels used.

parallel. These voxels are four-wayinterleaved with resp ect

Scalability is the ability of an architecture to extend its

to the YZ-plane. A burst memory access is used to fetch

p erformance by increasing the amount of computational

adjacentvoxels in the X-direction. To access a 2  2  2

and memory throughput. Ideally, a linear increase in the

trilinear neighborhood, four voxels in the YZ-plane are

number of rendering mo dules should linearly increase an

accessed in parallel from each bank, and a sequential burst

architecture's frame rate or maintain its frame rate for a

memory access from each bank provides the remaining four

linear increase in dataset size.

values from an adjacent YZ-plane. These values are cached

Although many architecture's primary goal is to achieve

in separate cache banks to allow parallel access and re-use.

high frame rates and SPPS, image quality may be more

desirable when visualizing static datasets. Frame rate and

The Interp olation Unit uses the fractional x; y ; and z

SPPS are indicators of the amount of acceleration that is

comp onents and the trilinear interp olation neighborhood

provided by the rendering architecture. One drawbackto

from the Memory Unit to re-sample the dataset. The gra-

these metrics is that b oth are prop ortional to the amount

dient index is used to address the gradient lo ok-up table.

of bandwidth available to the architecture. This is unde-

The resulting x; y ; and z gradient comp onents are interp o-

sirable since the architectures surveyed span several gener-

lated in a similar manner. The trilinear interp olation units

ations of memory technology. To address this problem, we

can sustain samples at a rate up to four times faster than

intro duce a simple mo del that measures the abilityof an

the rate of the memory system if the desired voxels reside

architecture to convert voxel throughput memory band-

inside the cache. The sample throughput is enhanced by

width into sample throughput. The mo del can b e derived

sup ersampling the dataset b ecause of a signi cant increase

by accounting for changes in throughput along the process-

in cache hits.

ing path as a voxel is converted into a ltered sample. This

The sample and gradientvalue are used in the Shading

leads to a relationship between p eak memory bandwidth

and Comp ositing Unit to classify and shade the sample.

and e ective sample throughput. Sample throughput can

The classi cation table is chosen using a classi cation in-

b e given by the following equation:

dex and the architecture handles segmentation and mul-

tiple cut-planes. Phong shading is implemented using a

P

lo ok-up table and comp ositing uses the standard "over"

z }| {

4

op erator. Early ray termination is utilized to increase the

1

S sample = B v oxel  U 



frame rate.

second

second

V v oxel

sample

where S is p erformance measured in SPPS, B is the p eak

bandwidth of the memory system, U is the memory band-

E.2 Performance

width utilization in p ercent, and V is the average number

of voxels that need to b e fetched p er sample. In this survey, VIZARD I I supp orts multiple cut planes, segmentation,

B is held constant to comp ensate for advances in memory parallel, and p ersp ective viewing. It is exp ected to sus-

technology and for varying degrees of parallelism among tain a frame rate of 10Hz. However, its worst-case p er-

the di erent architectures. formance is approximately 1 frame p er second. Worst-case

U , memory bandwidth utilization, is the p ercentage of p erformance o ccurs for 1:1 mapping of samples to voxels

the p eak memory bandwidth that is realized. The pro duct and a transparent classi cation of the dataset. Using four

of U and B is the sustained voxel throughput into the ren- 100MHz SDRAM devices, this architecture is capable of

6

dering comp onents. For maximum p erformance, U should 14  10 samples per second worst-case p erformance and

6

be 1:0. U accounts for any combination of random cycles, 56  10 samples p er second b est-case, assuming 4:1 map-

sequential cycles, and idle time on the memory bus. U is ping of samples to voxels.

given by: will reach a higher p erformance/price ratio than most in-

w

U =

5 teractivevolume rendering options currently available.

r C +1r 

where r is the p ercentage of all accesses that are random

and w is the p ercentage of time that voxels are trans-

VI I. Comparison

ferred to rendering comp onents bus utilization. C is the

Table I I I presents a comparison of the ve architectures.

sp eedup of a memory device obtained by using sequential

The categories include 1 status - the present stage of de-

memory access or burst access instead of random access.

velopment, 2 algorithm - typ e of ray casting algorithm

C is a memory technology dep endent term. For example,

used, 3 memory partitioning - the organization of the

assume a 100MHz SDRAM memory 10ns synchronous

memory system, 4 interp olation hardware - size, typ e,

access time has a 70ns random access time including page

and/or implementation of the interp olation lter, 5 gra-

faults. This leads to a C of 7:0 i.e., 700. In this case,

dient hardware - size, typ e, and/or implementation of the

if all memory accesses are random r =1:0 and the mem-

gradient lter, 6 shading - shading algorithm supp orted,

ory bus is fully utilized w =1:0, the memory bandwidth

7 p ersp ective supp ort - the ability to handle p ersp ective

utilization, U , would b e as lowas14:3. To account for

pro jections, 8 real-time data input - the p otential to sup-

the fact that not all random accesses are page faults we as-

p ort real-time streamed input, 9 target technology - typ e

sume that the memory bandwidth utilization is 20 when

of implementation, 10 p erformance limitations - b ottle-

the bus is fully utilized w = 1:0 with random accesses

necks in the system, 11 scalability- the ability to scale

r =1:0. This is a conservative estimate for newer DRAM

p erformance using multiple rendering pip elines, 12 algo-

memories.

rithmic acceleration - early ray termination supp ort, 13

V is the average number of voxels that are fetched from

sample pro cessing eciency - normalized acceleration met-

memory p er sample. V is related to the size of the gradi-

ric that also accounts for algorithmic sp eedup, 14 pub-

ent and interp olation lter. For example, an architecture

lished SPPS - p erformance numb ers obtained from the re-

that uses a 32-voxel gradient lter could have V = 32 using

sp ective publications, and 15 architectural highlights -

a brute force approach; however, if voxels and intermedi-

features considered imp ortant for next generation volume

ate results are bu ered eciently, V = 1 b ecause access to

rendering architectures.

bu ered or cached voxels can o ccur in parallel with new

The VOGUE architecture supp orts three rendering

voxel accesses. Some architectures use algorithmic acceler-

mo des based on its gradient computation kernel 8, 32,

ation e.g., early ray termination to enhance p erformance.

and 56 voxels. Voxels are re-fetched on average 8, 32, and

The reduction in fetched voxels due to algorithmic sp eedup

56 times for mo de 1, mo de 2, and mo de 3, resp ectively.

is view and dataset dep endent. In our comparison, we as-

VOGUE can accommo date multiple p oint light sources

sume a reduction of V by a factor of 3 when early ray

with an unrestricted Phong illumination mo del. Eachmod-

termination has b een implemented [28].

6

U

ule is memory limited and capable of 40  10 SPPS p er-

We call the term, P = in Equation 4, Sample Pro-

V

formance in the fastest rendering mo de. Because of its

cessing Eciency. It is an architecture sp eci c measure

random memory access pattern, VOGUE's memory band-

of how p eak memory bandwidth B  translates to SPPS

width utilization is 20. As a result, VOGUE's sample

S . If eachvoxel is accessed on average once p er ltered

pro cessing eciency is b etween 0:00357 mo de 3 and 0:025

sample V =1:0 and the memory system is fully utilized

mo de 1 without algorithmic acceleration. Assuming that

U =1:0, then the sample pro cessing eciency, P , will b e

an algorithmic sp eedup of 3 is realizable due to early ray

1:0. Sample pro cessing eciency may be larger than 1:0

termination, the sample pro cessing eciency in Table I I I

if compression is used or if the volume dataset is sup er-

has b een multiplied by3:0. In large multi-mo dule con gu-

sampled i.e., multiple re-sample lo cations per unit vol-

rations, VOGUE's sample pro cessing eciency p er mo dule

ume. Sample pro cessing eciency, P , is related to frame

may decrease due to network overhead.

rate, F ,by the following formula:

B v oxel  P sample

Image quality is the primary fo cus of the VIRIM archi-

second

v oxel

F f r ame = 6

second

T sample

tecture. This architecture is capable a exible illumina-

f r ame

B is the p eak bandwidth of the memory system and T is tion mo del including shadows. VIRIM uses programmable

the numb er of samples needed to render the frame. Sam- DSPs that supp ort other rendering, shading, and interp o-

6

ple pro cessing eciency only measures the ability to con- lation algorithms. VIRIM is capable of 40  10 SPPS

vert memory bandwidth to pro cessed samples; it is rela- p erformance. Multiple engines can b e used to increase p er-

tively indep endent of memory technology and can therefore formance; however, each engine must duplicate the entire

b e used as an ob jective measure of architecture eciency. dataset. The ob ject order algorithm leads to no random

However, it do es not measure other imp ortant p erformance memory accesses r =0:0. However, sample pro cessing ef-

metrics such as image-quality, cost, scalability, or latency. ciency is limited by a global bus b etween the re-sampling

Since the architectures presented in this pap er span several hardware and the rendering hardware. The sample pro-

generations in VLSI technology, it is dicult to augment cessing eciency, P , is the ratio of the sustainable sam-

the comparison with normalized cost. Consequently,wedo ple throughput of the bus b etween the Geometry and Ray

not explicitly compare cost or use a cost/p erformance ra- Casting Units and the p eak voxel throughput of the mem-

40M S ample

bus bandwidth VME tio. However, it should b e noted that these architectures ory system. P =0:2 for second

TABLE I I I

Architecture comparison.

VOGUE VIRIM Array Based EM-Cub e VIZARD I I

Ray Casting

Started 1993 1994 1995 1997 1998

Status Simulated System built Simulated ASIC in Simulated

development

Algorithm Image order Ob ject order Ob ject order Hybrid order Image order

Memory Eight-way Eight-way Orthogonal Skewed Four-way

Partitioning slice blo ck

Interp olation Trilinear Programmable Nearest Trilinear Trilinear

Hardware neighbor

Gradient Three Mo des 2D Sob el 26-voxel Central Quantized

Hardware 8/32/56 voxels lter neighborhood di erences gradient

Shading Phong Programmable Programmable Phong Phong

Persp ective Yes Yes No No Yes

Pro jections

Real-time Mo derately Dicult Easy Easy Dicult

Data Input Dicult

Target VLSI O -the-shelf FPGA VLSI DSP/FPGA

Technology comp onents

Performance Memory Bus Memory Memory Memory

Limitation

Scalability Mo derate Hard Easy Easy Mo derate

Early Ray Yes No No No Yes

Termination

P , Sample mo de 1: 0.075 0.2 0.1 m = 10 0.95 0.075 1:1 sampling

M V oxels

Pro cessing mo de 2: 0.01875 B = 200  1m =1 0.3 4:1 sup ersampling

second

Eciency mo de 3: 0.01071

6 6 6 6 6

Published SPPS 40  10 40  10 240  10 533  10 56  10

one mo dule m = 10 4:1 sup ersampling

Architectural High-quality Algorithmic Double- High Good

Highlight parallel and p ersp ective exibility bu ering p erformance p erformance

rendering and shadows cost ratio

bus and assuming the eight memory mo dules can sustain times the Ray Array is re-used. Theoretically, m can be

M V oxel

200 . A new memory architecture was intro duced as large as 1:5n, where n is the dataset resolution, greatly

second

in [5] that enhances the memory bandwidth utilization; reducing the size of the ray array and severely limiting the

however, the voxel re-fetch factor V is increased. This new p erformance of the architecture. m can b e assumed to lie

memory system uses bu ers and a pre-fetch mechanism to be between 1 and 10 for practical implementations. Using

achieve a sample pro cessing eciency of 0:125. this assumption, this architecture has a sample pro cessing

eciency of 1 b est-case and 0:1worst-case. m is a de-

The Array Based Ray Casting architecture strives for

sign parameter for this architecture based on target cost,

high p erformance. It uses slice-by-slice pro cessing to ren-

size, implementation technology, and p erformance.

der the dataset. The two arrays used in this architecture

are a 2D array of pro cessing elements. In FPGA imple- EM-Cub e is a highly optimized parallel rendering ar-

mentations, feedback paths in the larger Ray Array limit chitecture and it was designed to render high-resolution

the maximum clo ck sp eed. The double-bu ered memory datasets at real-time frame rates on a PC or workstation.

o ers supp ort for real-time data input. The architecture The skewed memory system and slice-parallel pro cessing al-

is fully pip elined and parallel and uses only lo cal commu- lows EM-Cub e to sustain a large memory bandwidth. EM-

nication. However, the hardware required to implement Cub e uses memory skewing on a blo ck granularity. This

2

the two arrays scales with O n , where n is the dimen- reduces chip pin-out and communication costs. This archi-

6

sion of the volume data. This architecture do es not use tecture can sustain 533  10 SPPS using four 133MHz

interp olation leading to lower image quality and do es not SDRAMs. During p ersp ective pro jections certain view

supp ort p ersp ective pro jections. The p erformance of this p oints may adversely a ect p erformance for hybrid order

architecture is memory limited. In the full implementation, algorithms. As a result, EM-Cub e do es not supp ort p er-

the Ray Array contains 1:5n  1:5n pro cessing elements. sp ective pro jections. EM-Cub e uses central di erences for

In this con guration, memory bandwidth utilization, U ,is gradient estimation which requires 32 voxels; however, no

1:0. However, one drawback is the size and cost to imple- p erformance p enalty is incurred since EM-Cub e uses on-

2

ment the O n Ray Array. One solution is to shorten the chip storage to bu er sample values. Only lo cal commu-

Ray Array reduced columns and to re-use them multi- nication between pro cessing pip elines are used, therefore,

ple times p er pro jection. In this smaller con guration, the EM-Cub e is highly scalable. EM-Cub e's memory utiliza-

memory bandwidth utilization is inversely prop ortional to tion is 1:0 b ecause it can sustain synchronous memory ac-

the numb er of times the Ray Array is shared. The sample cesses for each of its voxels. By pro cessing the dataset

1:0

pro cessing eciency, P , is , where m is the number of in sections, EM-Cub e is able to signi cantly reduce on- m

chip storage [35]. Consequently, EM-Cub e's average voxel cost tradeo s. On the other hand, VIZARD I I trades p er-

re-fetch is increases slightly b ecause some voxels on the formance versus image qualityby using a 9-bit quantized

b oundary of a section must be re-fetched from the mem- central di erence gradient. VIRIM's dataset duplication in

ory system. EM-Cub e's average voxel re-fetch is 1:05 for the fully parallel implementation illustrates p erformance-

eight sections on a 256  256  256 dataset. As a result, cost tradeo s. Furthermore, we see general trade-o s b e-

EM-Cub e's sample pro cessing eciency is 0:95. tween the di erenttyp es of parallel rendering algorithms.

Image order algorithms exhibit greater exibility, ob ject

VIZARD I I uses three metho ds to reduce memory band-

order algorithms typically have higher sample pro cessing

width requirements for interactive visualization: 9-bit

eciencies, and hybrid order algorithms can havecharac-

quantized gradient, caching, and early ray termination.

teristics of b oth. The primary ob jectives of the volume

VIZARD II supp orts p ersp ective viewing, multiple cut

rendering architect is to balance these tradeo s based on

planes, and segmentation. The quantized central di er-

the target applications.

ence gradient requires prepro cessing and may degrade im-

Of the surveyed architectures, EM-CUBE and VIZARD

age quality when compared to traditional central di erence

I I are the only architectures still actively b eing develop ed.

gradients. In general, gradient quantization can b e used to

The primary goal of these architectures is an interactive

supp ort a wide range of gradient lters and the size of the

visualization system that will augment a low-cost work-

gradient table can b e increased to enhance accuracy i.e.,

station or desktop computer. EM-Cub e and VIZARD II

image quality. The VIZARD I I architecture has b een sim-

will t on a PCI card for a standard PC. As such, b oth

ulated and is still at the research stage. A maximum p er-

6

architectures are also designed to b e low-cost.

formance of 56  10 SPPS p erformance can be achieved

assuming 4:1 sup ersampling of the dataset. Maximum p er-

Note that two of the three architectures Array Based

formance is limited by the memory system. Aworst-case

Ray Casting and EM-Cub e do not supp ort p ersp ective

6

p erformance of 1410 SPPS o ccurs for 1:1 sampling of the

pro jections. Although p ersp ective pro jections may have

dataset. The p erformance of this architecture is memory

less imp ortance in medical and scienti c visualization, p er-

limited. Since VIZARD I I uses early ray termination, we

sp ective pro jections are necessary in virtual reality,volume

assume that its algorithmic sp eedup is 3:0. VIZARD I I's

graphics, and stereoscopic rendering applications. A gen-

average voxel re-fetch is 8. The quantized central di er-

eral purp ose volume rendering solution must supp ort b oth

ence gradient prevents the voxel re-fetch from b eing larger

typ es of pro jections.

than the size of the trilinear interp olation neighborhood.

Volume rendering is useful in b oth scienti c visualization

Since VIZARD I I's memory system is clo cked at 20 of

and volume graphics. These two paradigms do not have

its maximum rate, its memory eciency factor is 0:2 as-

the exact same requirements. Scienti c visualization is pri-

suming worst-case 1:1 sampling. The cache memory and

marily parallel rendering with simple illumination mo dels;

Trilinear Interp olation Units are clo cked four times faster

whereas, volume graphics requires greater rendering ex-

than the rate of the memory system. When the dataset

ibility. EM-Cub e addresses many issues related to inter-

is 4:1 sup er-sampled or higher, a substantial increase in

active scienti c visualization. Although VIZARD II has

cache hits increases the memory bandwidth utilization to

lower p erformance, it is a more general rendering architec-

0:8 b est-case. This architecture has an overall sample pro-

ture with supp ort for p ersp ective pro jections, multiple cut

cessing eciency of 0:075 1:1 sampling and 0:3 4:1 sam-

planes, and segmentation.

pling taking into consideration a threefold sp eedup due to

early ray termination.

IX. Future of Special-Purpose Accelerators

Recently, interactive image generation rates have b een

VIII. Discussion

achieved on medium resolution i.e., 512  512  64

In each architecture, maximum p erformance is limited

datasets using 3D texture mapping a forward-pro jection

by either the memory system or global communication

algorithm [2]. One advantage of texture hardware is that

b ottlenecks, such as buses or networks. Memory limita-

volumetric primitives can b e mixed with geometric primi-

tions are inherenttoeach architecture. Global buses and

tives. However, texture mapping hardware do es not readily

networks arise b ecause of the need to communicate voxel

supp ort all asp ects of volume visualization, such as p er-

data or intermediate values over a common medium. The

sample gradient computation and p er-sample illumination.

volume rendering algorithm and memory organization de-

In addition, texture memory is usually limited. Volume

termine whether these p otential b ottlenecks a ect p erfor-

rendering hardware encompasses texture mapping by in-

mance. Architectures that are only memory limited tend

cluding 3D ltering trilinear interp olation and gradient

to b e more scalable.

lters and very high-bandwidth memory; thus, future vol-

There is a tradeo in most of the architectures b etween ume visualization accelerators are likely to b e used for high-

p erformance, quality, and hardware cost. VOGUE's di er- p erformance texture mapping. Texture memory is typi-

ent rendering mo des present p erformance-quality tradeo s. cally four-way eight-way interleaved similar to VIZARD

Practical implementations of the Array Based Ray Casting I I's and VOGUE's memory system to allow con ict-free

and EM-CUBE architectures use slightly mo di ed con gu- 2D 3D re-sampling. A p otential area of research is an

rations from a fully parallel design leading to p erformance- ecient memory organization that readily supp orts b oth

volume rendering, texture mapping, and next generation Acknowledgments

memory technology. In addition, the ability to render b oth

Sp ecial thanks to Meena Bhashyam for help review-

voxel and p olygonal primitives in a single pro jection for

ing the manuscript, and Michael Doggett, Christof Rein-

virtual reality should b e considered.

hart, Michael Meiner, and Gunter Knittel for their useful

The architectures surveyed represent the rst generation

insight and corresp ondence regarding their architectures.

of custom architectures that implement the ray casting al-

This work was supp orted by the Oce of Naval Research

gorithm. These architectures primarily fo cus on high out-

under grantnumb er N00014-92-J-1252 and CAIP research

put frame rates. We b elieve the second generation archi-

center at Rutgers State university.

tectures will address dynamic datasets or large input rates.

Fast sampling devices and interactivevolume graphics will

References

help promote this trend. Future interactive visualization

[1] I. Bitter and A. Kaufman. A Ray-Slice-Sweep Volume Rendering

systems may so on consist of a sp ecial purp ose accelerator

Engine. In Proceedings of the Siggraph/Eurographics Workshop

on Graphics Hardware, pages 121{138, August 1997.

card connected to a real-time data acquisition subsystem.

[2] B. Cabral, N. Cam, and J. Foran. Accelerated Volume Rendering

In volume graphics applications, it is exp ected that vol-

and Tomographic Reconstruction using Texture Mapping Hard-

ume up dates, volume animation, and voxelization will re-

ware. In 1994 Workshop on Volume Visualization, pages 91{98,

Washington, DC, Octob er 1994.

quire input frame rates comparable to output frame rates.

[3] R. Crisp. Direct Rambus Technology: The Next Main Memory

Consequently,it may b e necessary to parallelize voxel in-

Standard. IEEE Micro, 176, Novemb er 1997.

put into the volume memory and the memory partitioning

[4] T. J. Cullip and U. Neumann. Accelerating Volume Reconstruc-

tion with 3D Texture Mapping Hardware. In Technical Report

scheme in a given architecture may not necessarily b e the

TR93-027, Department of Computer Science at the University

optimal partitioning scheme for b oth parallel voxel input

of North Carolina, Chap el Hill, 1993.

and output. Also, double bu ered volume memory see Ar-

[5] M. de Bo er, A. Gropl, T. Gun  ther, C. Poliwo da, C. Reinhart J.

Hesser, and R. Manner. Latency-Free and Hazard-Free Volume

ray Based Ray Casting will b e necessary to eliminate ren-

Memory Architecture for Direct Volume Rendering. In Proceed-

dering artifacts and p erformance loss when simultaneously

ings of the 11th Eurographics Hardware Workshop, pages 109{

loading and visualizing dynamic datasets. Furthermore, 118, Poitiers, France, August 1996.

[6] M. C. Doggett. An Array Based Design for Real-Time Volume

3D double-bu ering follows the development of the tradi-

Rendering. In 10th Eurographics Workshop on Graphics Hard-

tional graphics pip eline i.e., double-bu ered frame bu er.

ware, pages 93{101, August 1995.

These are areas for additional research.

[7] M. C. Doggett. Vizar : A VideoRate System for Volume Visu-

alization. PhD thesis, University of New South Wales, 1996.

If volume graphics is to o set traditional p olygonal

[8] M. C. Doggett and G. R. Hellestrand. Video Rate Shading for

Volume Data. In Australian Pattern Recognition Society Digital

graphics, volumetric raytracing [48] and p ersp ective pro-

Image Computing : Techniques and Applications, pages 398{

jections must b e considered. Raytracing is capable of pro-

405, Decemb er 1993.

ducing photo-realistic images using shadows, re ection, re-

[9] M. C. Doggett and G. R. Hellestrand. A Hardware Architecture

for Video Rate Smo oth Shading of Volume Data. In Eurographics

fraction, etc. For instance, VIRIM is capable of shadows.

Hardware Workshop, pages 95{102, Septemb er 1994.

Secondary rays during raytracing have less coherence than

[10] S. M. Goldwasser, R. A. Reynolds, and T. Bapty. Physician's

primary rays, therefore, volumetric raytracing may b e b et-

Workstation with Real-Time Performance. IEEE Computer

Graphics and Applications, 512:44{57, Decemb er 1985.

ter suited for image order architectures. Ray casting archi-

[11] R. Grzeszczuk, C. Henn, and R. Yagel. Advanced Geometric

tectures that rely on lo ck-step coherence b etween cast rays

Techniques for Ray Casting Volumes. In SIGGRAPH 98 Course

i.e., Array Based Ray Casting and EM-Cub e mayhave

Nbr. 4, Orlando, FL, 1998.

[12] T. Gun  ther, C. Poliwo da, C. Reinhart, J. Hesser, R. Manner,

diculty incorp orating raytracing.

H.-P. Meinzer, and H.-J Baur. VIRIM: A Massively Parallel

Memory throughput increased an order of magnitude

Pro cessor for Real-Time Volume Visualization in Medicine. In

Proceedings of the 9th Eurographics Hardware Workshop, vol-

e.g., Direct Rambus since most of the architectures in

ume 19, No. 5, pages 705{710, 1995.

this pap er were rst prop osed. Advances in semiconductor

[13] B. M. Hemminger, T. J. Cullip, and M. J. North. Interactive

technologies towards deep-submicron pro cesses will con-

Visualization of 3D Medical Image Data. In Technical Report

TR94-027, Department of Radiology and Radiation Oncology

tinue to promote higher logic sp eeds, higher memory den-

at the University of North Carolina, Chap el Hill, 1994.

sity,aswell as lower memory access times. In addition, the

[14] J. Hesser, R. Manner, G. Knittel, W. Straer, H. P ster, and

ability to embed large amounts of memory on-chip with

A. Kaufman. Three Architectures for Volume Rendering. In

Proceedings of Eurographics '95,volume 14, No. 3, Maastricht,

computational units will further enhance memory through-

The Netherlands, Septemb er 1995. Europ ean Computer Graph-

put. All of these trends will simplify future volume render-

ics Asso ciation.

ing architectures, increase their sp eed, and lower their cost.

[15] K. Hohne, M. Bomans, A. Pommert, M. Riemmer, C. Schiers,

U. Tiede, and G. Wieb ecke. 3D Visualization of Tomographic

Furthermore, as these sp ecial-purp ose accelerators evolve,

Volume Data Using The Generalized Voxel Mo del. The Visual

software and application-program interfaces APIs will b e

Computer, 61:28{36, February 1990.

de ned and develop ed [11], [29], [30]. They will provide [16] D. Jackel. The graphics PARCUM system: A 3D Memory Based

Computer Architecture for Pro cessing and Display of Solid Mo d-

the user with more exibility and additional features, such

els. Computer Graphics Forum, 44:21{32, 1985.

as stereoscopic views. The availability of low-cost real-

[17] S. Juskiw, N. Durdle, V. Raso, and D. Hill. Interactive Rendering

of Volumetric Data Sets. Computer and Graphics, 195:685{

time hardware and industry strength APIs will increase

693, 1995.

the acceptance of volume visualization and volume graph-

[18] A. Kaufman. Volume Visualization. IEEE Computer So ciety

ics. This will certainly lead to the development of new and

Press, 1991.

exciting applications. [19] A. Kaufman and R. Bakalash. CUBE - An Architecture Based on

[43] S. W. Smith, H. G. Pavy, and O. T. von Ramm. High-Sp eed Ul- a3DVoxel Map. Theoretical Foundations of Computer Graphics

trasound Volumetric Imaging System { Part I: Transducer De- and CAD, pages 689{700, 1988.

sign and Beam Steering. IEEE Transactions on Ultrasonics,

[20] A. Kaufman and R. Bakalash. Memory and Pro cessing Archi-

Ferroelectrics, and Frequency Control, 382:100{108, 1991.

tecture for 3D Voxel-based Imagery. IEEE Computer Graphics

[44] J. van Scheltinga, J. Smit, and M. Bosma. Design of an On-

and Applications, 86:10{23, Novemb er 1988.

Chip Re ectance Map. In Proceedings of the 10th Eurographics

[21] A. Kaufman, D. Cohen, and R. Yagel. Volume Graphics. IEEE

Workshop on Graphics Hardware, pages 51{55, Maastricht, The

Computer, 267:51{64, July 1993.

Netherlands, August 1995.

[22] G. Knittel. VERVE: Voxel Engine for Real-Time Visualiza-

[45] O. T. von Ramm, S. W. Smith, and H. G. Pavy. High-Sp eed

tion and Examination. Computer Graphics Forum, 193:37{48,

Ultrasound Volumetric Imaging System { Part I I: Parallel Pro-

Septemb er 1993.

cessing and Image Display. IEEE Transactions on Ultrasonics,

[23] G. Knittel. A PCI-based Volume Rendering Accelerator. In

Ferroelectrics, and Frequency Control, 382:109{115, 1991.

In Proceedings of the 10th Eurographics Workshop on Graphics

[46] L. A. Westover. Splatting: A Paral lel, Feed-Forward Volume

Hardware, pages 73{82, August 1995.

Rendering Algorithm. PhD thesis, The University of North Car-

[24] G. Knittel. A Scalable Architecture for Volume Rendering. Com-

olina at Chap el Hill, Department of Computer Science, jul 1991.

puter and graphics, 195:653{665, 1995.

[47] R. Yagel. Towards Real Time Volume Rendering. In Proceed-

[25] G. Knittel and W. Straer. VIZARD-Visualization Accel-

ings of GRAPHICON 1996 Saint-Petersburg, Russia,volume 1,

erator for Realtime Display. In Proceedings of the Sig-

pages 230{241, July 1996.

graph/Eurographics Workshop on Graphics Hardware, pages

[48] R. Yagel, D. Cohen, and A. Kaufman. Discrete Ray Trac-

139{146, August 1997.

ing. IEEE Computer Graphics and Applications, 129:19{28,

[26] P. Lacroute. Analysis of a Parallel Volume Rendering System

Septemb er 1992.

Based on the Shear-Warp Factorization. IEEE Transactions on

[49] R. Yagel and A. Kaufman. Template-Based Volume Viewing.

Visualization and Computer Graphics, 23:218{231, September

Computer Graphics Forum, 113:153{167, Septemb er 1992.

1996.

[50] T. S. Yo o, U. Neumann, H. Fuchs, S. M. Pizer, T. Cullipand J.

[27] P. Lacroute and M. Levoy. Fast Volume Rendering Using a

Rhoades, and R. Whitaker. Direct Visualization of Volume Data.

Shear-Warp Factorization of the Viewing Transformation. In

IEEE Computer Graphics and Applications, pages 63{71, July

Andrew Glassner, editor, Proceedings of SIGGRAPH '94 Or-

1992.

lando, Florida, July 24-29, 1994, Computer Graphics Pro ceed-

[51] K. J. Zuiderveld. Visualization of Multimodality Medical Vol-

ings, Annual Conference Series, pages 451{458. ACM Press, July

ume Data using Object-Oriented Methods. PhD thesis, Utrecht

1994.

University, March 1995.

[28] M. Levoy. Display of Surfaces From Volume Data. IEEE Com-

puter Graphics and Applications, 58:29{37, May 1988.

[29] B. Lichtenb elt. Design of A High Performance Volume Visual-

ization System. In Siggraph/Eurographics Hardware Workshop,

pages 111{119, August 1997.

[30] B. Lichtenb elt, R. Crane, and S. Naqvi. Introduction to Volume

Rendering. Hewlett-Packard Professional Bo oks. Prentice Hall

PTR, 1998.

[31] B. Lichtenb elt M. Bentum and T. Malzb ender. Frequency Anal-

ysis of Gradient Estimators in Volume Rendering. IEEE Trans-

actions on Visualization and Computer Graphics, 23:242{254,

Septemb er 1996.

[32] H.-P. Meinzer, K. Meetz, D. Schepp elmann, U. Engelmann, and

H. J. Baur. The Heidelb erg RayTracing Mo del. IEEE Computer

Graphics and Applications, pages 34{43, Novemb er 1991.

[33] M. Meiner, U. Kanus, and W. Straer. VIZARD I I: A PCI-

Card for Real-Time Volume Rendering. In Proceedings of the

Siggraph/Eurographics Workshop on Graphics Hardware, pages

61{67, Lisb on, Portugal, August 1998.

[34] T. Moller, R. Machira ju, K. Mueller, and R. Yagel. A Com-

parison of Normal Estimation Schemes. In IEEE Visualization

Proceedings 1997, pages 19{26, Octob er 1997.

[35] R. Osb orne, H. P ster, H. lauer, N. McKenzie, S. Gibson,

W. Hiatt, and T. Ohkami. EM-Cub e: An Architecture for

Low-Cost Real-Time Volume Rendering. In Proceedings of the

Siggraph/Eurographics Workshop on Graphics Hardware, pages

131{138, Los Angeles, CA, August 1997.

[36] H. P ster. Architectures for Real-Time Volume Rendering. PhD

thesis, State University of New York at Stony Bro ok, Computer

Science Department, Stony Bro ok, NY 11794-4400, 1996. MERL

Rep ort No. TR-97-04.

[37] H. P ster and A. Kaufman. Cub e-4 - A Scalable Architecture for

Real-Time Volume Rendering. In Volume Visualization Sympo-

sium Proceedings, pages 47{54, Octob er 1996.

[38] H. P ster, A. Kaufman, and T. Chiueh. Cub e-3: A Real-Time

Architecture for High-Resolution Volume Visualization. In In

1994 Workshop on Volume Visualization, pages 75{83, Wash-

ington, DC, Octob er 1994.

[39] H. P ster, A. Kaufman, and F. Wessels. Towards a Scalable

Architecture for Real-Time Volume Rendering. In Proceedings of

the 10th Eurographics Workshop on Graphics Hardware, pages

123{130, Maastricht, The Netherlands, August 1995.

[40] B. Phong. Illumination for Computer Generated Pictures. Com-

munications of the ACM, 186:311{317, 1975.

[41] T. Porter and T. Du . Comp ositing Digital Images. Computer

Graphics, 183, July 1984.

[42] P.Schroder and G. Stoll. Data Parallel Volume Rendering as

Line Drawing. In 1992 Workshop on Volume Visualization, pages 25{31, Boston, MA, Octob er 1992.