Scalable Data Transformations for Low-Latency Large-Scale Data Analysis

Dissertation

Presented in Partial Fulfillment of the Requirements for the Degree Doctor of Philosophy in the Graduate School of The Ohio State University

By

Steven Martin, B.S., M.S.

Graduate Program in Computer Science and Engineering

The Ohio State University

2013

Dissertation Committee:

Han-Wei Shen, Advisor
Roger Crawfis
Raghu Machiraju

Copyright by

Steven Martin

2013

Abstract

Interactive analysis of simulation results has become a mainstay of science and engineering. With continually increasing compute power, the size of simulation results continues to grow. However, network and mass storage device throughput are not increasing as quickly. This introduces difficulties in scaling analysis workflows to take advantage of these new compute resources.

This dissertation describes a body of work that increases the scalability of interactive volume analysis workflows by moving elements strongly dependent on the input data size from the interactive phase of the workflow to the data preparation phase. This reduces the overall computational complexity of the interactive phase, enabling reduced interaction latency. Two related groups of approaches are explored: salience-aware techniques, and techniques for scalable salience discovery.

Salience-aware techniques leverage the tendency for different parts of volume data to be of differing importance. In this dissertation, salience-aware techniques are proposed for load balancing of isosurfacing on clusters and for salience-aware level of detail selection.

In many cases, the salience of data may not be known a priori. Salience discovery techniques seek to facilitate the discovery of salience of different interval volumes. In this dissertation, a technique for iterative salience discovery, in the context of interactive transfer function design on large-scale volumes, is discussed. Supporting that, a technique is described for evaluating distribution range queries.

For both groups of techniques, scalable data transformations are described and target applications are explored. This work streamlines workflows for visualization of large-scale volume data.

Acknowledgments

I would like to thank Professor Han-Wei Shen for the advice and support he has given me as my advisor. I would also like to thank Professors Barb Cutler and W Randolph Franklin at Rensselaer Polytechnic Institute for giving me the chance to learn about academic research. Finally, I would like to thank Pat McCormick at Los Alamos National Laboratory and Thomas Ruge at NVIDIA Corporation for their advice and support during my internships. The work in this dissertation would not have been possible without the feedback, advice, and support of these and many other individuals. It is in that context that "we" is used to refer to the author in this dissertation.

Vita

2007: B.S. Electrical Engineering, Rensselaer Polytechnic Institute
2007: B.S. Computer and Systems Engineering, Rensselaer Polytechnic Institute
2012: M.S. Computer Science, The Ohio State University

Publications

Research Publications

B. Cutler, Y. Sheng, S. Martin, D. Glaser, “Interactive Selection of Optimal Fenestration Materials for Schematic Architectural Daylighting Design”. Automation in Construction, 17, 2008.

S. Martin, H. Shen, R. Samtaney, "Efficient Rendering of Extrudable Curvilinear Volumes". Proceedings of IEEE Pacific Visualization Symposium, 2008.

P. McCormick, E. Anderson, S. Martin, C. Brownlee, J. Inman, M. Maltrud, M. Kim, J. Ahrens, L. Nau, "Quantitatively Driven Visualization and Analysis on Emerging Architectures". SciDAC Journal of Physics, 2008.

S. Martin, H. Shen, “Load-Balanced Isosurfacing on Multi-GPU Clusters”. Proceedings of Eurographics Symposium on Parallel Graphics and Visualization, 2010.

S. Martin, H. Shen, "Histogram Spectra for Multivariate Time-Varying Volume LOD Selection". Proceedings of IEEE Symposium on Large-Scale Data Analysis and Visualization, 2011.

S. Martin, H. Shen, "Interactive Transfer Function Design on Large Multiresolution Volumes". Proceedings of IEEE Symposium on Large-Scale Data Analysis and Visualization, 2012.

S. Martin, H. Shen, "Stereo Frame Decomposition for Error-Constrained Remote Visualization". SPIE Visualization and Data Analysis, 2013.

S. Martin, H. Shen, "Transformations for Volumetric Range Distribution Queries". Proceedings of IEEE Pacific Visualization Symposium, 2013.

Fields of Study

Major Field: Computer Science and Engineering

Table of Contents

Abstract
Acknowledgments
Vita
List of Tables
List of Figures

1. Introduction
   1.1 Salience-Aware Techniques
       1.1.1 Load-Balanced Parallel Isosurfacing
       1.1.2 Level of Detail Selection
   1.2 Techniques for Scalable Salience Discovery
       1.2.1 Transformations for Volumetric Range Distribution Queries
       1.2.2 Interactive Transfer Function Design
   1.3 Contributions

2. Load-Balanced Isosurfacing on Multi-GPU Clusters
   2.1 Related Work
   2.2 Block Distribution Algorithm
       2.2.1 Isosurfacing Cost Heuristic
       2.2.2 Preprocessing
       2.2.3 Profiling
       2.2.4 Assignment
   2.3 Block Isosurfacing Algorithm
       2.3.1 Triangle Counting
       2.3.2 Triangle Creation
       2.3.3 Optimizations
   2.4 Results
       2.4.1 Triangle Counts versus Isosurfacing Time
       2.4.2 Effects of salient isovalue ranges on speedup
       2.4.3 Volume size scalability
       2.4.4 Strong scalability
   2.5 Conclusion

3. Stereo Frame Decomposition for Error-Constrained Remote Visualization
   3.1 Related Work
   3.2 Technique
       3.2.1 Reprojection
       3.2.2 Residual Decimation
       3.2.3 Decimated Residual Codec
       3.2.4 Remapping
       3.2.5 Rate Balancing
   3.3 Error Constraints
       3.3.1 Color Difference
       3.3.2 Transfer Function Distance
       3.3.3 Integrated Transfer Function Contrast
   3.4 Results
       3.4.1 Datasets
       3.4.2 Lossy Codecs
       3.4.3 Lossless Codecs
       3.4.4 Compression Performance
   3.5 Conclusion

4. Histogram Spectra for Multivariate Time-Varying Volume LOD Selection
   4.1 Related Work
   4.2 Level of Detail Selection
       4.2.1 Histogram Spectra
       4.2.2 Weighted Histogram Spectra
       4.2.3 Predicted Error Using Histogram Spectra
       4.2.4 Discretization of Histogram Spectra
       4.2.5 Integer Programming Formulations for LOD Selection
       4.2.6 Greedy Algorithm for Nonlinear Integer Programming Formulation
       4.2.7 Multivariate Considerations
   4.3 Results
       4.3.1 Test datasets
       4.3.2 Running time comparisons
       4.3.3 Visual and statistical comparisons
   4.4 Conclusion

5. Efficient Rendering of Extrudable Curvilinear Volumes
   5.1 Related Work
   5.2 Applications
   5.3 Computational Space Representation
       5.3.1 Data and spaces
       5.3.2 Positional transformations
       5.3.3 Computation of p̄(k), n̂(k), s̄(i, j), and ŷ
       5.3.4 Jacobian matrices
       5.3.5 AMR integration
   5.4 Rendering
       5.4.1 Ray Casting
       5.4.2 Correction loop
       5.4.3 Step length determination
       5.4.4 GPU implementation
   5.5 Results
   5.6 Conclusion

6. Transformations for Volumetric Range Distribution Queries
   6.1 Related Work
   6.2 Technique
       6.2.1 High level overview
       6.2.2 Integral Distribution Function
       6.2.3 Discretization
       6.2.4 Span Distributions
       6.2.5 Storage of Span Distributions
       6.2.6 Approximate Queries with Span Distributions
       6.2.7 Comparing Span Distributions
   6.3 Working Sets in Applications
       6.3.1 Application: Hovmöller diagrams
       6.3.2 Application: Transfer function design
   6.4 Extensions and Conclusion

7. Interactive Transfer Function Design on Large Multiresolution Volumes
   7.1 Related Work
   7.2 Technique
       7.2.1 Cursor Histograms
       7.2.2 Histogram Expressions
       7.2.3 Level of Detail Selection
       7.2.4 Transfer Function Construction
       7.2.5 Interaction
   7.3 Results
   7.4 Conclusion and Extensions

8. Extensions and Conclusion

Bibliography

List of Tables

3.1 Cross correlations were computed between the bitrates for many observed trials.

5.1 Set 1 blocks

5.2 Set 2 blocks

5.3 Dataset memory requirements

5.4 Set 2 rendering times for different minimum step lengths and viewport resolutions

5.5 Set 1 rendering times for different minimum step lengths and viewport resolutions

List of Figures

1.1 Salience-aware techniques can facilitate salience discovery techniques. Similarly, salience discovery techniques can facilitate salience-aware techniques.

2.1 Our approach preprocesses the volume data for a range of salient isovalues to estimate the amount of work required to perform isosurfacing for blocks of the input volume. The blocks are subsequently assigned to GPUs such that the isosurfacing work is more evenly distributed.

2.2 The time required for isosurfacing a single block of a volume varies approximately linearly with the triangle count in the isosurface. The constant factor in the fit line is reduced by applying the optimizations discussed in section 2.3.3.

2.3 The triangle counting and creation process computes vertex buffer offsets for rows of the block of cells being isosurfaced, then applies marching cubes to fill the vertex buffer.

2.4 The blue (dark) surface is isovalue -1.0 within the test volume used for the subsequent graphs and the yellow (light) surface is isovalue +3.0 within the same volume. At a volume resolution of 384x256x256 the yellow surface contains 298858 triangles and the blue surface contains 916337 triangles.

2.5 The salient isovalue ranges substantially affect the performance. In this figure it can be seen that the speedup is improved over ranges of isovalues that are specified as salient. When no cost heuristic is used, the distribution of performance over the isovalue range is not well defined because the effective cost value of the work for each block is equal. Each line has 1100 sample isovalues, computed over a 1536x1024x1024 test volume on 24 GPUs.

2.6 The performance advantage of using our cost heuristic over using no cost heuristic is maintained over the range of loadable volume sizes on a cluster of 24 GPUs. The salient isovalue range used for the cost heuristic is 2.75 to 3.25, resulting in a mean isosurfacing performance on the order of 250 million triangles per second over that range of isovalues. Using no cost heuristic over that same range yields performance on the order of 175 million triangles per second.

2.7 Using our proposed cost heuristic improves scalability, especially when the isovalues for which isosurfaces are being computed are within the salient range. In this figure, the salient range of isovalues used for the computation of the cost heuristic is 2.75 to 3.25 and the volume is 768x512x512 samples.

3.1 The difference between the ground truth and the error-constrained degraded image is wasted information that would need to be transmitted, if lossless encoding were used.

3.2 The framework decomposes the left and right frames into one depth stream, one color stream, and two residual streams in the encoder (§3.2), which are then reconstructed into left and right frames in the decoder. Because the depth stream generally takes much less space than the color stream, and the error introduced by reprojection is small, this yields better performance than encoding the streams separately. Additionally, transmission of partial residuals subject to user-defined error constraints enables fidelity guarantees for visualization applications.

3.3 The per-pixel color magnitudes of the residuals are shown for both eyes, before and after decimation subject to an error constraint, with darker colors meaning greater magnitude.

3.4 Increasing the LCE (left color encoded) bitrate decreases the LRE (left residual encoded) bitrate. Increasing the LDE (left depth encoded) bitrate decreases the RRE (right residual encoded) bitrate. The curves, from top to bottom, have ITFC error constraints of 0, 6, 12, 18, 24, and 32. More-restrictive constraints tend to require higher LCE and LDE bitrates for optimal performance.

3.5 Different types of transfer functions are appropriate for different types of error constraints.

3.6 Stereo rendering of the combustion dataset (§3.4.1) using isosurfacing with a 2D transfer function (figure 3.5b).

3.7 The benefit of using our reprojection technique or a joint coding technique over discrete coding techniques increases as the eye separation is reduced, as explained in §3.4.4. Additionally, the benefit of using the reprojection technique increases as the error constraints are loosened.

4.1 The histogram spectra generator takes a multiresolution bricked volume and generates a histogram spectrum for each subvolume ("brick") of the volume. This is done as a precomputation step in the data preparation phase. The LOD selector then uses that, with a set of user-defined parameters such as intervals of interest, to produce a LOD selection set. The LOD selection can be performed interactively.

4.2 This histogram spectrum of a single plane of a single timestep of the QVAPOR variable of the climate test data set (defined in §7.3) is typical of histogram spectra. Moving up on the vertical axis corresponds to downsampling, and each column corresponds to the potential change in the area of an isosurface as a function of sampling frequency. Columns with brighter colors in this plot correspond to values that are more sensitive to sampling. Rows with brighter colors correspond to sampling frequencies with greater overall, unweighted, error.

4.3 The weighting function is used to control the width of the interval volumes of interest in the context of the level of detail selection. In this example a weighting function was chosen to place importance on the interval of values from 0.0070 to 0.0105. The weighting function is applied over the columns of the histogram spectrum, facilitating the computation of histogram spectrum predicted error as in equation (4.4).

4.4 The RMS error is proportional to the histogram spectrum predicted error. This figure exhibits a test case on the QVAPOR variable of the climate data set (defined in §7.3), and is typical of what we have observed on other data sets. The exact scaling factor to determine the RMS error depends on the units of the data in the field and the norm of the weighting function. However, this does not need to be computed because only the relative differences between errors need to be used in the algorithm discussed in §4.2.6. Because the RMS error is linearly proportional to the histogram spectrum predicted error, the ratio between two RMS errors is the same as the ratio between their corresponding histogram spectrum predicted errors.

4.5 Directly solving the integer programming problem with a general integer programming package is impractical due to the high computational complexity involved in solving the NP-hard problem. Our greedy algorithm as described in §4.2.6 yields nearly identical results with O(N lg N) complexity, where N is linearly proportional to the number of subvolumes.

4.6 Several variables from the climate data set are rendered for a single timestep. The white, opaque parts are clouds defined by the QCLOUD variable. The magenta regions are clouds with high vertical velocities, as determined by the W variable. The yellow exhibits water vapor density as determined by the QVAPOR variable. The volume is a curvilinear volume, with the Z variable of its mesh determined by the MESHZ variable. All of these variables have their levels of detail determined by the level of detail algorithm. Figures 4.6a and 4.6b are generated from the ground truth resolution, while figures 4.6c and 4.6d have levels of detail selected for a 4GiB working set size constraint. Figure 4.6c was generated with narrow intervals of interest, while figure 4.6d was generated with wide intervals of interest. Like in figure 4.9, selecting narrow intervals of interest yields results closer to the ground truth than selecting wide intervals of interest.

4.7 In some cases, with multivariate fields, a user is interested in seeing a variable A where variable B is between B0 and B1. This interval [B0 : B1] is expressed as a weighting function for the histogram spectra of B. The choice of the best weighting function for A depends on the statistical dependence between A and B. If A is not independent of B then we can use the conditional probability density function of A given the case that B lies within [B0 : B1] as a starting point for constructing a weighting function for A. In this example it can be seen that the PDF of the vertical velocity (W) in the climate data set is different for different intervals of the cloud density (QCLOUD).

4.8 This figure shows the error for different working set size constraints, using different error estimators in the LOD selection algorithm. The E_j function in the optimization problem as referenced by equation (4.5) can be approximated using equation (4.4) instead of directly computing the RMS error (RMSE). The prediction of error using the histogram spectra predicted error (HSPE) yields results close to the direct RMS error. By using equation (4.4) with histogram spectra it is possible to avoid loading samples from the source volume when performing LOD selection, substantially improving performance.

4.9 Values of MIXFRAC from the combustion data set within the range [0.45:0.55] are rendered for a single timestep, where values less than 0.5 are blue and those greater than or equal to 0.5 are orange. Figure 4.9a is a crop of an image generated using the ground truth resolution, while figures 4.9b and 4.9c have levels of detail selected for a 250MiB working set size constraint. Figure 4.9b has a weighting function that is 1 for values in the range [0.45:0.55] and 0 elsewhere. Figure 4.9c has a weighting function that is uniformly 1. The narrower interval of interest used for figure 4.9b clearly yields a result closer to the ground truth than the wide interval of interest that was used for figure 4.9c.

4.10 For a fixed working set size constraint, increasing the width of the range of values defining the interval volumes of interest results in increased error. This figure, which was generated using the QVAPOR variable of the climate test data set defined in §7.3, is typical of what we have observed. This is to be expected because a larger interval volume will encompass more samples yet the information density is likely to remain similar. Thus, the narrower the interval volume of interest, the fewer samples are needed to reconstruct the volume with a given level of error.

5.1 Sample volume renderings of data set 1. The left column shows two views of one data component. The right column shows two different AMR level ranges for a different component, with the top image showing levels 0 through 1, the bottom image showing just level 1.

5.2 Sample renderings from data set 2. In clockwise direction from the top left corner are AMR levels 0 through 4, 2 through 4, 3 through 4, and 4.

5.3 Data set 2 volume block bounding wireframes. Each vertex corresponds to a grid-centered position on the boundary. The wireframes demonstrate the curvature and non-uniform cell sizes of the curvilinear space. Level 0 has 8 distinct blocks, level 1 has 24 distinct blocks.

5.4 Data set 1 volume block bounding wireframes. Each vertex corresponds to a grid-centered position on the boundary. The left column shows AMR levels 0 and 1, while the right column shows AMR level 1. The wireframes demonstrate the curvature and non-uniform cell sizes of the curvilinear space. Level 0 has 8 distinct blocks, level 1 has 24 distinct blocks.

5.5 Volume renderings for different minimum step lengths. Each row from left to right shows step lengths 0.001, 0.005, 0.010, 0.050, and 0.100. The top row shows data set 2 and the bottom row shows data set 1. A larger minimum step length decreases required computational time while increasing error.

5.6 Data set 1 running times

5.7 Data set 2 running times

5.8 The positional error (the difference between the original mesh position and the mesh point found with equation 5.1) is proportional to the point darkness in these images. From left to right, the images are of set 1 levels 0 to 1, set 1 level 1, set 2 levels 0 to 4, set 2 levels 3 to 4.

6.1 The preprocessing phase transforms the volume data into metadata using the transformation pipeline in equation (6.2). This requires O(N) working set complexity, for a volume with N elements. In the interactive phase, distribution range queries are evaluated by reading parts of the metadata on demand into the transformation pipeline in equation (6.3). The working set complexity for this phase depends primarily on the query result size rather than the size of the input volume.

6.2 In this example, a 1D integral distribution volume (X_i(s)) is discretized into 8 span distributions (Y_{k,i}) as described in equation (6.9). The span distribution at index 6, for example, is computed by subtracting X_i(5) from X_i(7).

6.3 Distribution range queries are executed by evaluating the integral distribution of each corner of the range using equation (6.10), then combining them using equation (6.8). In this example, the range query is evaluated using 4 span distributions, subtracting the span distributions (Y_{2,i} and Y_{3,i}) that contribute to the X_i(4) integral distribution, and adding the span distributions (Y_{4,i} and Y_{6,i}) that contribute to the X_i(7) integral distribution.

6.4 The Z-order space-filling curve maps a d-dimensional integer coordinate to a 1-dimensional integer coordinate. In this example, a 3D coordinate with 4 bits per component is mapped to a single 1D coordinate with 12 bits.

6.5 Because span distributions take advantage of the similarity between neighboring integral distributions for storage, they take considerably less space, even for lossless reconstruction. Additionally, by dropping some of the span distribution levels, the size can be further reduced at the cost of being lossy. In this case the distributions were represented by 64 bin histograms on 3D computational fluid dynamics volume data.

6.6 By dropping some levels, which results in queries being approximate, the size of the span distributions necessary can be reduced. This can reduce the working set size of an application. In this case the distributions were represented by 64 bin histograms on 3D computational fluid dynamics volume data.

6.7 Both the size of the levels, and the number of span distributions in the levels, exponentially decrease as the level number increases. The ratio between the size of the span distributions and the number of span distributions enables modeling of the entropy per span distribution.

6.8 The relationship between the error bound and the stored size for varying numbers of levels skipped.

6.9 Out-of-core data, query time, randomly positioned and sized queries, 2016MiB source data. The majority of the time spent in this test was I/O. Reducing the working set reduces the demands on storage devices, improving performance.

6.10 Storing the integral distributions directly, sampled on a uniform grid, can take considerably more space than storing compressed span distributions. Span distributions also permit the dropping of levels, which reduces the data size, at the cost of accuracy.

6.11 Out-of-core data, query time transient response, randomly positioned and sized queries, 64MiB source data. The majority of the time spent in this test was I/O for the top two lines of the graph. For the bottom two lines I/O has a substantial impact at the left end of the graph, but this effect is quickly reduced as the file cache warms. Using span distributions reduces the working set size required over performing direct queries. Reducing the number of levels used for span distributions reduces the working set as well. Reducing the working set reduces the demands on storage devices and reduces file cache rates, improving performance.

6.12 Approximate sum aggregation of 3D volumes for Hovmöller diagrams as discussed in §6.3.1. The horizontal axis is longitude and the vertical axis is time. The tolerance provides a bound on how far the approximate sums may be from the true sums, in terms of the value of the sum. The dataset is from a simulation produced by the Pacific Northwest National Laboratory to examine the Madden-Julian Oscillation [37].

6.13 Interactive transfer function design for large-scale time-varying volume data, using interactive 4D distribution range queries, as discussed in §6.3.2. The user moves a region of interest in the left pane on a projection of the volume. The distribution of the region of interest is then used to generate transfer functions in the right pane, using the technique discussed in chapter 7.

7.1 Level of detail selection and transfer function design both depend on interval salience.

7.2 An example of the technique being applied to the Flame test volume, discussed in section 7.2.5.

7.3 The performance as a function of volume size and working set size is largely a function of the working set size, rather than the volume size, facilitating scalability for large-scale data.

7.4 An example of the per-frame performance, as a function of running time, for a test run using the 62GiB Nek dataset with a 600MiB working set limit. In this case cursors are being moved around and expressions edited, yielding incremental updates to the target histogram.

Chapter 1: Introduction

Interactive analysis of simulation results has become a mainstay of science and engineering. With continually increasing compute power, the size of simulation results continues to grow. However, network and mass storage device throughput are not increasing as quickly. This introduces difficulties in scaling analysis workflows to take advantage of these new compute resources. This dissertation describes a body of work that increases the scalability of interactive volume analysis workflows by moving elements strongly dependent on the input data size from the interactive phase of the workflow to the data preparation phase.

This work addresses this problem by breaking it into two aspects: salience-aware techniques for scalability enhancement, and techniques for scalable salience discovery. For both aspects, data transformations are applied in the preprocessing phase of workflows to decrease the work that needs to be done in the interactive phase.

Chapters 2 and 4 discuss salience-aware techniques, while chapters 6 and 7 propose techniques for salience discovery. Chapter 3 proposes a technique that can be used to enable interactive remote use of the technique discussed in chapter 2. Chapter 5 describes a technique that can utilize the techniques proposed in chapters 7 and 4. Potential paths for extensions are discussed in chapter 8.

The following sections provide a brief summary of the proposed salience-aware techniques, techniques for salience discovery, and a discussion of contributions.

Figure 1.1: Salience-aware techniques can facilitate salience discovery techniques. Similarly, salience discovery techniques can facilitate salience-aware techniques.

1.1 Salience-Aware Techniques

Salience-aware techniques leverage the tendency for different parts of volume data to be of differing importance. In this dissertation, salience-aware techniques are proposed for load balancing of isosurfacing on clusters and for salience-aware level of detail selection.

1.1.1 Load-Balanced Parallel Isosurfacing

Isosurface extraction is a common technique applied in scientific visualization. Isosurfaces are often rendered to show structures indicated by surfaces over which a particular value is uniform. In many cases, a user has an idea of what isovalue ranges may be reasonable for the extraction of features of interest, but may not know exactly what isovalues should be used. As the user interacts with the visualization platform, they may change the isovalue of interest. This effectively changes the salient interval of interest.

In the case of parallel isosurfacing algorithms such as marching cubes with empty space skipping, the amount of work required to compute the isosurface for a block of cells depends on the number of triangles in the isosurface. Thus, if the surfaces are not evenly distributed through the volume and a distributed-data strategy is used for parallelization, the load may be unevenly balanced. However, direct estimation of triangle counts would require access to the data, if no metadata is available, making adaptive load balancing impractical in an interactive workflow. In chapter 2, a solution is proposed that generates metadata in a preprocessing phase to facilitate fast load balancing for a given isosurface.

The input volume is broken into subvolumes, all of which are available on all nodes of a compute cluster. During the preprocessing phase, for each subvolume, metadata is generated. This metadata stores the number of triangles in the isosurface for each of a set of isovalues.

During the interactive phase, the user is able to dynamically change which interval volumes are salient. With this metadata, it is possible to estimate the mean number of triangles within these interval volumes, without needing to directly compute the isosurfaces during the interactive phase. Because the work required to compute an isosurface is proportional to the number of triangles in the isosurface, this can be used to estimate the amount of computational work required to compute isosurfaces for a given subvolume, assuming that the isosurfaces are chosen from the salient interval volumes. This estimation is used to enable balanced assignment of subvolumes to cluster nodes to maximize performance.
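To make this estimation concrete, the following is a minimal Python sketch of how the expected triangle count of one subvolume might be derived from its precomputed metadata over user-selected salient intervals. The function and array names are illustrative assumptions, not the exact implementation described in chapter 2.

```python
import numpy as np

def expected_triangles(probe_isovalues, probe_counts, salient_ranges, samples_per_range=32):
    """Estimate the mean isosurface triangle count of one block over the
    user-selected salient isovalue ranges, using only precomputed metadata.

    probe_isovalues, probe_counts: per-block metadata from preprocessing
      (triangle counts sampled at a sorted set of probe isovalues).
    salient_ranges: list of (lo, hi) isovalue intervals marked salient.
    """
    estimates = []
    for lo, hi in salient_ranges:
        # Interpolate the sampled counts at evenly spaced isovalues in the range.
        queries = np.linspace(lo, hi, samples_per_range)
        counts = np.interp(queries, probe_isovalues, probe_counts)
        estimates.append(counts.mean())
    # The mean over all salient ranges approximates the expected work for this block.
    return float(np.mean(estimates))
```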

With these isosurfaces being computed on a cluster, and the number of triangles in the isosurfaces possibly being much larger than the number of pixels in a given rendering, remote visualization solutions can be useful. A remote visualization solution appropriate for this cluster-hosted isosurfacing technique is discussed in chapter 3. This solution applies an error-constrained stereo video compression algorithm to remote visualization.

1.1.2 Level of Detail Selection

Multiresolution volumes are commonly used to enable effective visualization of very large datasets without requiring the entire dataset to be loaded at full detail. Computing the optimal level of detail selection for a given size constraint requires the ability to estimate the effects of downsampling on the fidelity of a given level of detail. Additionally, it is common that the salience of different interval volumes may change during the interactive cycle of the visualization workflow.

Histogram Spectra, described in chapter 4, are a type of metadata designed to enable this LOD selection. By precomputing the effect of downsampling on the distributions of regions of the volume, level of detail optimization can be performed quickly, during the interactive phase of the workflow.

The input volume is broken into cubic subvolumes. For each subvolume a histogram spectrum is generated. A histogram spectrum stores the difference between a histogram of the ground truth level of detail and the histogram of each reduced level of detail. By precomputing histogram spectra during the preprocessing phase, fast estimates of the effects of downsampling on distributions within each block can be made in the interactive phase. Because this metadata is relatively small, this enables users to obtain effective level of detail selections while interactively selecting different interval volumes of interest.
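The following is a simplified sketch of this idea for a single subvolume: histogram the full-resolution block, histogram each reduced level, and record the per-bin differences, which a salience weighting then turns into a predicted error. The strided downsampling and weighting here are placeholder assumptions; chapter 4 gives the actual formulation.

```python
import numpy as np

def histogram_spectrum(block, num_levels, bins=64, value_range=None):
    """Sketch of a histogram spectrum for one cubic subvolume: per-bin
    differences between the ground-truth histogram and the histogram of
    each reduced level of detail (here, simple strided downsampling)."""
    if value_range is None:
        value_range = (float(block.min()), float(block.max()))
    ref, _ = np.histogram(block, bins=bins, range=value_range, density=True)
    spectrum = np.empty((num_levels, bins))
    for level in range(1, num_levels + 1):
        stride = 2 ** level
        coarse = block[::stride, ::stride, ::stride]
        h, _ = np.histogram(coarse, bins=bins, range=value_range, density=True)
        spectrum[level - 1] = np.abs(h - ref)
    return spectrum

def predicted_error(spectrum, weights):
    """Weighted error per level: the salience weights over value bins select
    the interval volumes of interest (cf. the weighting function in chapter 4)."""
    return spectrum @ weights
```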

1.2 Techniques for Scalable Salience Discovery

In many cases, the salience of data may not be known a priori. Salience discovery techniques seek to facilitate the discovery of salience of different interval volumes. In this dissertation, a technique for iterative salience discovery, in the context of interactive transfer function design on large-scale volumes, is discussed. Supporting that, a technique is described for evaluating distribution range queries.

1.2.1 Transformations for Volumetric Range Distribution Queries

Distribution range queries are one type of volumetric range query, aggregating the contents of subregions into distributions. Fast evaluation of distribution range queries can be used to facilitate interactive transfer function design, classification, and aggregation. However, direct evaluation of distribution range queries on out-of-core volume data without a priori knowledge of the distribution requires a working set proportional to the size of the range. For large out-of-core data this will prohibit interactivity.

Chapter 6 proposes a transformation framework to facilitate distribution range queries. A technique is proposed within this framework to transform volume data into span distributions, a form of metadata that facilitates distribution range queries. Example applications are then explored using the technique.

During the preprocessing phase of a workflow, the technique generates metadata that enables distribution range queries to be evaluated quickly during the interactive phase of a workflow. The concept of integral distributions is introduced as a way to quickly estimate distributions of ranges of a volume. Integral distributions store the distribution of the subvolume of a volume ranging from the origin to a given point within the volume. Using only a few integral distributions, it is possible to evaluate the distribution of ranges of a volume. However, direct storage of integral distributions is impractical due to their size. Thus, the metadata is stored in the form of span distributions, a decomposition of integral distributions intended to increase storage efficiency and facilitate approximate queries.
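As an illustration of the integral distribution idea in one dimension, the sketch below builds a cumulative histogram array so that the distribution of any range follows from two lookups and a subtraction. This is a deliberately simplified, assumed representation; chapter 6 defines the multidimensional formulation and the span distribution storage that replaces it.

```python
import numpy as np

def integral_distributions_1d(data, bins=64, value_range=(0.0, 1.0)):
    """Sketch: X[i] holds the histogram of data[0:i], so the distribution of
    any range [a, b) is X[b] - X[a]. Span distributions (chapter 6) replace
    this direct storage with a more compact decomposition."""
    n = data.shape[0]
    X = np.zeros((n + 1, bins))
    lo, hi = value_range
    for i in range(n):
        bin_index = min(max(int((data[i] - lo) / (hi - lo) * bins), 0), bins - 1)
        X[i + 1] = X[i]
        X[i + 1, bin_index] += 1
    return X

def range_distribution(X, a, b):
    """Histogram of data[a:b], evaluated from two integral distributions."""
    return X[b] - X[a]
```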

The technique is subsequently applied in two applications, one being the transfer function design technique discussed in chapter 7, and the other being the construction of Hovmöller diagrams, a tool used in meteorology.

1.2.2 Interactive Transfer Function Design

Direct volume rendering is widely used in the visualization of volume data. Key to the creation of high quality visualizations using DVR is the construction of effective transfer functions. Interactive, semi-automatic transfer function design seeks to leverage users' domain-specific knowledge to progressively develop value saliency. Interactive transfer function design techniques rely on iterative refinement of parameters to a transfer function generation algorithm, based upon visual feedback to the user, potentially requiring interactive direct volume rendering.

If interactive transfer function design is to be performed on large-scale multiresolution data on workstations and direct volume rendering is to be applied, then level of detail selection will be necessary during the interactive workflow. However, effective level of detail selection depends on having knowledge of data salience, while interactive transfer function design seeks to incrementally develop saliency to construct transfer functions, introducing a cyclic dependency. Chapter 7 describes an interactive transfer function design technique that enables the combination of histogram range queries, like those supported by the technique described in chapter 6, with the level of detail selection technique described in chapter 4. This enables incremental, interactive transfer function design on large-scale volumes.

This work provides a system where users select regions of interest within volume data. For a given region of interest, a histogram of the source data is computed (optionally using the technique described in chapter 6). The histograms of different regions of interest are then combined into a single target histogram using a user-defined histogram expression. These target histograms are then used to define a transfer function. Values within the volume with large counts in the target histogram receive greater opacity and contrast. Simultaneously, the target histograms are used to define salience in the level of detail selection scheme discussed in chapter 4. By interactively manipulating regions of interest and histogram expressions, a user can progressively identify which regions of a volume may or may not be salient.
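A minimal sketch of the target-histogram idea follows. The expression interface and the opacity mapping here are illustrative assumptions only; chapter 7 defines the actual histogram expression language and transfer function construction.

```python
import numpy as np

def target_histogram(region_histograms, expression):
    """Combine per-region histograms into a target histogram using a
    user-defined expression, e.g. lambda h: h["A"] - 0.5 * h["B"].
    Negative counts are clipped so the result remains a valid histogram."""
    return np.clip(expression(region_histograms), 0.0, None)

def opacity_transfer_function(target, max_opacity=0.8):
    """Map value bins with larger target-histogram counts to greater opacity."""
    peak = target.max()
    if peak == 0:
        return np.zeros_like(target, dtype=float)
    return max_opacity * target / peak
```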

An associated volume rendering technique for multiresolution curvilinear volumes is presented in chapter 5. This technique is an example of a technique that can leverage the proposed interactive transfer function design framework in the context of large data. It describes a novel volume rendering system that enables ray casting of multiresolution curvilinear volumes on GPUs. Transfer function design can be driven by the technique discussed in chapter 7, and level of detail selection can be performed using the technique described in chapter 4.

1.3 Contributions

The following is a summary of the contributions in each chapter:

Chapter 2: Load-Balanced Isosurfacing on Multi-GPU Clusters: The core contribution of this work is the proposal of a technique that enables fast salience-aware load balancing of isosurfacing on clusters.

Chapter 3: Stereo Frame Decomposition for Error-Constrained Remote Visualization: The core contribution of this work is a technique for video-based stereo remote visualization, where visualization-aware error constraints are applied to the frames.

Chapter 4: Histogram Spectra for Multivariate Time-Varying Volume LOD Selection: The primary contribution of this work is the introduction of the concept of histogram spectra, which are a form of metadata that can be used to facilitate salience-aware level of detail selection.

Chapter 5: Efficient Rendering of Extrudable Curvilinear Volumes: The core contribution of this work is a transformation and associated rendering algorithm that can be used to enable efficient rendering of extrudable curvilinear multiresolution volumes on GPUs. The technique can utilize the other level of detail selection and transfer function design techniques in this dissertation.

Chapter 6: Transformations for Volumetric Range Distribution Queries: This work has three main contributions. First, a framework is proposed that generalizes existing work to support distribution range queries. Then, a specific technique within this framework that supports distribution range queries is proposed for volume data defined on regular grids. Finally, proposals are made for how to apply integral and span distributions to reduce working set complexity in different salience discovery applications.

Chapter 7: Interactive Transfer Function Design on Large Multiresolution Volumes: The core contribution of this work is a technique that facilitates salience discovery on large-scale data by enabling interactive, incremental construction of target histograms that can simultaneously be used to support transfer function construction and salience-aware level of detail selection.

All of the work in this dissertation has been published and peer-reviewed, as listed in the vita section, in the course of my doctoral studies. The following chapters present the above contributions in detail.

Chapter 2: Load-Balanced Isosurfacing on Multi-GPU Clusters

Isosurface extraction is a common technique applied in scientific visualization. Isosurfaces are often rendered to show structures indicated by surfaces over which a particular value is uniform. Additionally, it is often of use to have the triangle data of these surfaces available for the computation of quantities such as surface area. In many cases, a user has an idea of what isovalue ranges may be reasonable for the extraction of features of interest, but may not know exactly what isovalues should be used. Thus, providing fast isosurfacing of a particular subset of potential isovalues can be of particular utility.

As scientists have sought to increase simulation accuracy, the quantity of data produced has increased commensurately. Analysis tools, including those that provide isosurfacing, must scale to support this increased volume of data.

Over recent years, a transition has been seen toward hierarchical parallelism, both in terms of memory and processors. Even single PCs often contain multiple CPUs and GPUs, with each GPU containing multiple stream processors. Clusters add an additional level within the hierarchy. Challenges are introduced not only by the hierarchical nature of the compute resources, but also by the diversity of interconnects between them.

Making the most of these compute resources requires deciding the levels within the hierarchy at which subdivision of work and data-dependent distribution of work is appropriate. Several considerations must be made:

• Hardware constraints: Limits are often imposed on the local memory available in different elements of the compute resources, and there are often substantial disparities between processor speed, available local memory, and interconnect speed.

• Required result constraints: Results from an isosurfacing algorithm should be in a format appropriate for how they will be used. For example, triangles from an isosurfacing algorithm should be stored in a buffer with an appropriate format for rendering, if rendering is required.

• Preprocessing cost: The resources consumed, both in time and space, by preprocessing must be warranted by the expected gains in usability.

• Efficient scalability: Algorithms must scale well with increased data size and compute resources, while also having reasonable absolute speeds for the range of expected target data sizes and systems.

We propose an approach, exhibited in figure 2.1, that evenly distributes isosurfacing work to multiple GPUs in a cluster, taking into consideration user-defined salient isovalue ranges. The approach then applies our efficient parallel isosurfacing algorithm on each GPU. A modest amount of preprocessing enables efficient distribution of work.

This chapter is organized as follows. Section 2.1 describes related work. Details of the isosurfacing cost heuristic are discussed in section 2.2.1. Then, details of the work distribution and isosurfacing algorithms are discussed in sections 2.2 and 2.3, respectively. Finally, results and conclusions are discussed in sections 2.4 and 2.5.


Figure 2.1: Our approach preprocesses the volume data for a range of salient isovalues to estimate the amount of work required to perform isosurfacing for blocks of the input volume. The blocks are subsequently assigned to GPUs such that the isosurfacing work is more evenly distributed.

2.1 Related Work

A commonly applied tool in scientific visualization, isosurfacing has been well explored in research literature. Two broad groups of isosurfacing techniques exist: those that explicitly generate geometric primitives for the surfaces, and those that provide for direct rendering of the surfaces without necessarily generating geometric primitives for the entire isosurface. The former has advantages in cases where the geometric primitives are necessary or when the same surface is to be viewed from many different views. The latter has advantages in situations where the surface geometry is not needed or there are a limited number of views of interest and there is significant occlusion exhibited in those views. Our technique is among those in the former category, explicitly generating geometric primitives for isosurfaces.

Among techniques in the former category, the marching cubes technique, introduced by Lorensen, et al. [66], has become the ubiquitous solution. Further improvements on the core technique have been proposed by Nielson, et al. [76]. An in-depth discussion of potential improvements to the marching cubes algorithm is given by Lopes, et al. [65].

In the original marching cubes algorithm, even cells without an isosurface in them are scanned. One approach used in avoidance of this is the use of hierarchical spatial data structures. Wilhelms, et al. [121] propose using octrees, Livnat, et al. [62] propose a kd-tree-based method, and Dyken, et al. [17] extend the concept to a hierarchy of histograms to assist in efficient isosurface extraction. Itoh, et al. [46] propose another method using contour trees to accelerate isosurfacing for unstructured volumes, skipping empty cells. Shen, et al. [100] apply an algorithm utilizing the minimum and maximum values for groups of cells, in the context of unstructured data, to reduce unnecessary empty cell scanning. Another approach is a technique introduced by Gallagher [23] in which values are bucketized to facilitate faster searching.

Several techniques have been developed to explicitly generate isosurface geometry using GPUs. Tatarchuk, et al. [107] describe a technique using GPU geometry shaders to generate triangle geometry for tetrahedral volumes and tetrahedralized hexahedral volumes. Dyken, et al. [17] apply histopyramids [132] to accelerate marching cubes isosurfacing on GPUs. Marching cubes is implemented directly in vertex shaders by Goetz, et al. [31] and further enhanced with span-space acceleration techniques by Johansson, et al. [49]. Pascucci [80] and Klein, et al. [53] propose implementations of the marching tetrahedra algorithm on GPUs.

Many techniques have been developed that do not explicitly generate isosurface geometry. One of the simplest methods is to perform volume rendering with a transfer function that exposes the isovalues. A further refinement is to apply volume ray casting, where the intersections with the surfaces in the interpolated cells are computed, then illuminated using common illumination models such as Phong's illumination model [83]. One such example of a technique using ray tracing to render isosurfaces is proposed by Parker, et al. [79]. Point splatting based techniques such as those proposed by Co, et al. [12] and Livnat, et al. [63] can also be applied. Röttger, et al. [89] describe how cell projection, a technique often used for volume rendering, can be applied to isosurfacing. Another common approach is to generate view-dependent geometry that does not necessarily include the entire isosurface, taking into account occlusion. Gao, et al. [25] propose one such technique where triangular geometry is directly generated in areas that pass a GPU-accelerated occlusion test.

While the fundamental marching cubes algorithm can easily map to data-parallel architectures under limited circumstances, a naive mapping can be very inefficient if the distribution of the isosurfaces throughout the volume is nonuniform. Additionally, many of the above techniques introduce acceleration data structures which add an additional degree of complexity to parallelization of the isosurfacing algorithms. Due to these concerns, and the ever increasing sizes of datasets to be analyzed, parallel isosurface extraction has been widely explored.

Gao, et al. [24] propose a parallel view-dependent isosurfacing algorithm using occlusion culling, combining hierarchical data structures with image space partitioning. Hansen, et al. [38] propose an algorithm that assigns individual cells in the volume to Connection Machine virtual processors, a concept that exists in a similar sense in the context of OpenCL-capable GPUs. Shen, et al. [99] extend the span space isosurfacing acceleration algorithm to a MIMD system by using a lattice-based search structure distributed to different processing elements. Zhang, et al. [130] and Chiang, et al. [10] both seek to provide an infrastructure for out-of-core rendering of isosurfaces on clusters. Gerstner, et al. [30] provide a strategy for distributing work for their hierarchical tetrahedral grid isosurfacing technique to processors in an SMP system.

Zhang, et al. [129] use a similar cost heuristic, in the context of out-of-core isosurfacing, to what is applied in our technique. However, their technique considers active cells rather than triangle counts, and it uses hard-coded coefficients while ours profiles the target system and uses linear regression to estimate the coefficients. Isosurface statistics, as discussed by Scheidegger, et al. [97], could be applied in the computation of a cost heuristic. However, we found a sampling of triangle counts to provide sufficient information for cost determination in our application while being substantially less complex.

Our technique directly generates triangular geometry for isosurfaces using marching cubes. With clusters having multiple nodes, each with multiple GPUs, and each GPU having multiple stream processors, we operate on two levels of parallelism: node-level and GPU-level. A data-parallel model is used for work distribution. To distribute load at the node level we use a cost heuristic based on profiling information to assign large blocks of cells from the volume to GPUs. To distribute load at the GPU level we apply data-parallel algorithms to each block [40], subdividing the block into rows. Our algorithm combines the simplicity of marching cubes with data-parallel algorithms to enable balanced fine-grained parallelism at the GPU level. Simultaneously, coarse-grained parallelism is applied at the node level using heuristics to provide for effective load balancing with minimal overhead.

2.2 Block Distribution Algorithm

The input data is treated as an array of cuboid blocks of cells, and it is assumed that the input data is too large to fit entirely on any one node in the cluster. The goal of a block distribution algorithm is to assign these blocks to different GPUs in the cluster such that the load will be balanced for subsequent isosurfacing operations. The blocks need to be assigned to GPUs without having to load the blocks explicitly on every node.

The block distribution algorithm consists of three phases: preprocessing, profiling, and assignment. The preprocessing phase collects data-centric information needed to compute the cost heuristic, such as the triangle counts for different isovalues in different blocks. The profiling phase collects machine-centric information needed to compute the cost heuristic for the target machine. The assignment phase assigns blocks to GPUs across the cluster, given a user-defined range of salient isovalues and the cost heuristic.

2.2.1 Isosurfacing Cost Heuristic

An isosurfacing cost heuristic is required to estimate the amount of time it will take to compute isosurfaces for a block of cells in the volume. Some critical design requirements for such a heuristic are:

• It must enable estimation of the amount of time a block will take to compute. Even blocks with very few triangles may take substantial time.

• Not all of the data can be loaded every time we want to evaluate the heuristic. Instead, the heuristic must be computable with a value extracted from a simple, compact metadata representation produced by preprocessing.

• The heuristic should reflect the hardware platforms being used. The relationship of the overhead associated with starting the isosurfacing of a block to the actual isosurfacing work for a block may vary from platform to platform.

• It must be well-conditioned. We cannot have a heuristic that produces unreasonably large changes in its estimates for relatively small changes in the input metadata.

In our experiments we found a linear correlation between triangle count and isosurfacing time, as exhibited in figure 2.2. Preprocessing can easily be performed to estimate triangle counts for different isovalues in different blocks of an input volume, which can then be stored as metadata. Requiring only the generated metadata rather than the entire volume, this enables fast and accurate estimation of a cost heuristic for isosurfacing a given block for a given isovalue.

[Figure 2.2 plot: isosurfacing time in milliseconds versus thousands of triangles in the isosurface.]

Figure 2.2: The time required for isosurfacing a single block of a volume varies approximately linearly with the triangle count in the isosurface. The constant factor in the fit line is reduced by applying the optimizations discussed in section 2.3.3.

2.2.2 Preprocessing

Preprocessing is performed once per data set, in a standalone cluster-aware program.

The preprocessing phase determines the triangle count for a range of isovalues for each block of cells. The probe isovalues used for determining the triangle counts should be chosen so that they evenly cover the histogram of data values. This provides a representative sampling of potential isovalues. Our approach is to uniformly distribute the blocks across the cluster, with one process per CPU. For each block, the data values in the block are sorted in ascending order. To find M different probe isovalues for N sample values, we choose an isovalue to be the value at every N/(M − 1)'th value in the sorted list.

For each of these probe isovalues we iterate through the data cells, on the CPU, to find the number of triangles that would be returned from the marching cubes algorithm. This is accomplished by classifying the cells into different marching cubes cases, then using those classifications, per cell, to look up triangle counts from a table. It is not necessary to explicitly compute the isosurfaces, as the triangle counts are sufficient. The resulting mapping of isovalues to isosurface triangle counts is aggregated and then written to a file, with a set of M entries for each block.
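A minimal sketch of this preprocessing step follows. The count_triangles callable is a hypothetical stand-in for the per-cell marching cubes case classification and table lookup; it is not the implementation described above, only an assumed interface for illustration.

```python
import numpy as np

def probe_isovalues(block, M):
    """Choose M probe isovalues that evenly cover the block's histogram of
    values: sort the samples and take evenly spaced ranks, i.e. every
    (N/(M-1))'th value in the sorted list."""
    values = np.sort(block.ravel())
    indices = np.linspace(0, values.size - 1, M).astype(int)
    return values[indices]

def block_metadata(block, M, count_triangles):
    """Per-block metadata: (isovalue, triangle count) pairs. count_triangles
    stands in for classifying each cell against the marching cubes case table
    and summing the table's per-case triangle counts; no geometry is built."""
    return [(float(v), int(count_triangles(block, v))) for v in probe_isovalues(block, M)]
```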

2.2.3 Profiling

The goal of the profiling phase is to determine the unknowns in the linear function mapping triangle count to the cost. The approach to do this needs to be reasonably inexpensive, but at the same time able to come up with reasonably confident estimates for the heuristic. Additionally, the approach must be appropriate for the block sizes used when subdividing the data for distribution to GPUs.

Our system generates a test volume of a size similar to that of a block. In our test cases a block size of 128³ was used, but others could be used subject to the compromise discussed in §2.4.1. This synthetic test volume is sufficient so long as it provides for a diversity of triangle counts for different isovalues. The volume samples are generated by superimposing sinusoidal waves with random frequencies, directions, and amplitudes. This results in a reasonably complex volume for isosurfacing. We then compute the isosurfaces for isovalues ranging from the minimum to the maximum value in this generated field, estimating the time it takes for each. A linear least squares fitting is used to fit a linear function to these results, mapping triangle counts to expected times.
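The following sketch illustrates this profiling step: generate a synthetic block from superimposed random sinusoids, time an isosurfacing routine over a sweep of isovalues, and fit the linear cost model by least squares. The isosurface callable is an assumed stand-in for the GPU isosurfacer; the sizes and wave parameters are illustrative defaults, not the values used in the dissertation.

```python
import time
import numpy as np

def synthetic_block(size=128, num_waves=16, seed=0):
    """Superimpose sinusoidal waves with random frequencies, directions, and
    amplitudes to produce a reasonably complex test volume."""
    rng = np.random.default_rng(seed)
    axis = np.linspace(0.0, 1.0, size)
    z, y, x = np.meshgrid(axis, axis, axis, indexing="ij")
    field = np.zeros_like(x)
    for _ in range(num_waves):
        direction = rng.normal(size=3)
        direction /= np.linalg.norm(direction)
        freq = rng.uniform(1.0, 16.0) * 2.0 * np.pi
        amp = rng.uniform(0.2, 1.0)
        phase = rng.uniform(0.0, 2.0 * np.pi)
        field += amp * np.sin(freq * (direction[0] * x + direction[1] * y + direction[2] * z) + phase)
    return field

def fit_cost_model(block, isosurface, num_probes=50):
    """Time isosurfacing over a sweep of isovalues and fit time = a*triangles + b.
    isosurface(block, isovalue) is assumed to return the triangle count."""
    counts, times = [], []
    for v in np.linspace(block.min(), block.max(), num_probes):
        start = time.perf_counter()
        tris = isosurface(block, v)
        times.append(time.perf_counter() - start)
        counts.append(tris)
    a, b = np.polyfit(counts, times, 1)  # linear least squares fit
    return a, b  # cost(triangles) = a * triangles + b
```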

The resulting expected time from this equation, when evaluated for a particular triangle count, is the cost heuristic value for that triangle count. For clusters with more than one kind of GPU, the cost heuristic can be computed independently on the different kinds of GPUs. This provides a consistent basis for comparison of potential costs for isosurfacing across the different GPUs.

2.2.4 Assignment

Blocks are assigned to GPUs when the isosurfacing program is started, or when the user changes the set of salient isovalue ranges. From the preprocessing stage we have a table, one for each block, mapping a set of sample isovalues to triangle counts. From the profiling stage we have an equation mapping triangle counts to a cost heuristic. Using these tables, blocks need to be assigned to GPUs such that the variance is minimized between the sums of the cost heuristics of the blocks assigned to each GPU.

For every block, the cost heuristic is estimated using the set of salient isovalue ranges

defined by the user. Because these ranges will not, in general, match the exact sample

isovalues from the preprocessing stage, linear interpolation is applied between sample iso-

values as necessary. The mean of the triangle count within the ranges specified by the user

is computed to find the expected triangle count for a given block. With this triangle count,

the cost heuristic can be evaluated, resulting in a single cost heuristic value for each block.
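A sketch of this per-block evaluation is shown below. The ProbeEntry table and the fitted (a, b) coefficients come from the previous steps; the Range type, the per-range sampling density, and all names are illustrative assumptions rather than the system's exact interface.

```cpp
#include <vector>

struct ProbeEntry { float isovalue; double triangleCount; };
struct Range { float lo, hi; };   // one user-specified salient isovalue range

// Linearly interpolate the preprocessed isovalue -> triangle count table.
double trianglesAt(const std::vector<ProbeEntry>& table, float iso)
{
    if (iso <= table.front().isovalue) return table.front().triangleCount;
    for (std::size_t i = 1; i < table.size(); ++i)
        if (iso <= table[i].isovalue) {
            const ProbeEntry& p = table[i - 1];
            const ProbeEntry& q = table[i];
            double t = (iso - p.isovalue) / (q.isovalue - p.isovalue);
            return p.triangleCount + t * (q.triangleCount - p.triangleCount);
        }
    return table.back().triangleCount;
}

// Cost heuristic for a block: mean expected triangle count over the salient
// ranges, pushed through the fitted linear time model (a * t + b).
double blockCost(const std::vector<ProbeEntry>& table,
                 const std::vector<Range>& salient,
                 double a, double b, int samplesPerRange = 16)
{
    double sum = 0; int n = 0;
    for (const Range& r : salient)
        for (int i = 0; i < samplesPerRange; ++i, ++n)
            sum += trianglesAt(table, r.lo + (r.hi - r.lo) * i / (samplesPerRange - 1));
    return a * (sum / n) + b;
}
```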

The blocks are then sorted in order of descending cost heuristic value. With this list, the

blocks are then assigned to GPUs in a round-robin fashion. This results in an assignment

of blocks to GPUs that is not necessarily optimal, but still is a good starting point.

To further refine the block assignments, they are randomly exchanged between GPUs,

subject to the constraint that all exchanges must decrease the variance of the sums of the

cost heuristic values assigned to each GPU. This is accomplished in a three step iterative

process:

1. Pick a random pair of block assignments, with each element of the pair on a different

GPU. This pair defines a potential exchange of block assignments.

2. If the variance is decreased by performing this exchange, the exchange is said to be

successful. If the exchange is successful then we apply the exchange and return to

step 1. Otherwise, we continue through this process.

3. If the number of unsuccessful exchanges since the last successful exchange exceeds a

limit or the variance decreases below a threshold, break from this process, else return

to step 1.
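A minimal sketch of the full assignment procedure (round-robin by descending cost, then the variance-reducing random exchanges) is given below; the function name, the failure limit, the variance target, and the RNG seed are illustrative choices, not values taken from the system.

```cpp
#include <algorithm>
#include <numeric>
#include <random>
#include <vector>

// Assign blocks (by their cost heuristic values) to GPUs, then refine with
// random pairwise exchanges that are kept only when they reduce the variance
// of the per-GPU cost sums.
std::vector<std::vector<int>> assignBlocks(const std::vector<double>& cost,
                                           int numGpus,
                                           int failureLimit = 10000,
                                           double varianceTarget = 0.0)
{
    // Sort block ids by descending cost heuristic value.
    std::vector<int> order(cost.size());
    std::iota(order.begin(), order.end(), 0);
    std::sort(order.begin(), order.end(),
              [&](int a, int b) { return cost[a] > cost[b]; });

    // Deal the sorted blocks out to the GPUs round-robin.
    std::vector<std::vector<int>> assign(numGpus);
    std::vector<double> sum(numGpus, 0.0);
    for (std::size_t i = 0; i < order.size(); ++i) {
        assign[i % numGpus].push_back(order[i]);
        sum[i % numGpus] += cost[order[i]];
    }

    auto variance = [&] {
        double mean = std::accumulate(sum.begin(), sum.end(), 0.0) / numGpus;
        double v = 0.0;
        for (double s : sum) v += (s - mean) * (s - mean);
        return v / numGpus;
    };

    // Refinement: propose exchanges, keep only those that decrease the variance.
    std::mt19937 rng(12345);
    std::uniform_int_distribution<int> pickGpu(0, numGpus - 1);
    int failures = 0;
    while (failures < failureLimit && variance() > varianceTarget) {
        int g0 = pickGpu(rng), g1 = pickGpu(rng);
        if (g0 == g1 || assign[g0].empty() || assign[g1].empty()) { ++failures; continue; }
        int i0 = std::uniform_int_distribution<int>(0, int(assign[g0].size()) - 1)(rng);
        int i1 = std::uniform_int_distribution<int>(0, int(assign[g1].size()) - 1)(rng);
        double before = variance();
        double delta = cost[assign[g1][i1]] - cost[assign[g0][i0]];
        sum[g0] += delta; sum[g1] -= delta;              // tentatively apply the exchange
        if (variance() < before) {
            std::swap(assign[g0][i0], assign[g1][i1]);   // successful: keep it
            failures = 0;
        } else {
            sum[g0] -= delta; sum[g1] += delta;          // unsuccessful: revert it
            ++failures;
        }
    }
    return assign;
}
```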

After this process is complete, each GPU has a list of blocks assigned to it. The block isosurfacing algorithm can then be applied independently on each GPU, where each GPU is responsible for processing the blocks assigned to it.

2.3 Block Isosurfacing Algorithm

When the block isosurfacing algorithm is applied, each GPU will have been assigned a set of blocks of the volume and the user will have selected a particular isovalue that they would like to visualize. The block isosurfacing algorithm needs to generate triangles for the isosurfaces, populating vertex buffers on the GPU. An algorithm for this needs to produce packed triangle buffers without wasted space in a format amenable to GPU rendering. Because the number of triangles produced for isosurfaces will vary substantially within and between blocks, pre-allocating buffers to store triangles may be unacceptably wasteful in terms of memory consumption. Additionally, because GPUs are fundamentally parallel, such an algorithm needs to map well to the GPU parallel programming model.

One CPU thread controls each GPU, keeping each GPU busy processing the blocks assigned to it, resulting in one triangle buffer per block. We perform a marching cubes algorithm in two passes. The first pass counts the number of triangles and the offsets of the triangles into the vertex buffers. It does not directly compute the spatial positions of the triangles. The second pass creates the triangles, writing their spatial positions and normals into the vertex buffers according to the vertex buffer offsets found in the first pass. Figure

2.3 exhibits this process.

2.3.1 Triangle Counting

The triangle counting phase takes a block of cells as input, and generates two outputs: a count of the total number of triangles in the isosurface in the block, and the offset of triangles within the vertex buffer for each X row of cells in the input volume. The triangle counting algorithm is local to each GPU, with one GPU operating on one block at a time.

Our approach applies exclusive prefix sums to compute the exact indices within the output vertex buffer for output triangles associated with each row of cells, resulting in a packed vertex buffer. The prefix sums could be implemented in parallel using techniques similar to those introduced by Harris [39]. However, because we are computing prefix sums over many small distinct lists of numbers rather than one large list of numbers, it is more efficient to simply perform the many independent serial prefix sums in parallel. This maps well to GPUs because the individual sums are of nearly uniform length.
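One possible arrangement of these scans, following the two-level structure shown in figure 2.3, is sketched below in plain C++; in the actual OpenCL implementation each small per-plane row scan would be executed by its own work-item, and the outer loops here stand in for that parallelism. Names and the exact layering are illustrative.

```cpp
#include <vector>

// Two-level exclusive prefix sums over the per-row triangle counts of a block:
// first the per-X-Y-plane totals are scanned, then each plane's rows are
// scanned independently starting from that plane's offset. The result gives
// each X row's starting offset into the packed vertex buffer.
std::vector<unsigned> rowOffsets(const std::vector<unsigned>& perRowTris, // ny*nz entries
                                 int ny, int nz, unsigned& totalTris)
{
    // Exclusive scan over per-plane triangle totals.
    std::vector<unsigned> planeOffset(nz);
    unsigned running = 0;
    for (int z = 0; z < nz; ++z) {
        planeOffset[z] = running;
        for (int y = 0; y < ny; ++y) running += perRowTris[z * ny + y];
    }
    totalTris = running; // size of the packed vertex buffer, in triangles

    // Independent exclusive scans over the rows of each plane (parallel on the GPU).
    std::vector<unsigned> rowOffset(perRowTris.size());
    for (int z = 0; z < nz; ++z) {
        unsigned off = planeOffset[z];
        for (int y = 0; y < ny; ++y) {
            rowOffset[z * ny + y] = off;
            off += perRowTris[z * ny + y];
        }
    }
    return rowOffset;
}
```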

2.3.2 Triangle Creation

With the buffers resulting from the triangle counting pass, we now have the information needed to know where to store the triangles created by the marching cubes algorithm. The triangle creation phase computes these triangles and their normals.

Each X row of cells is assigned to a GPU thread. Each GPU thread then computes the isosurface triangles for its assigned row of cells. The resulting triangles for each row are placed into the target vertex buffer using the offsets computed in the triangle counting phase. This results in a packed vertex buffer on each GPU.

The packed vertex buffer contains positions of the vertices of the triangles. With these positions, the normals for each vertex of the triangles can be computed by using finite differences to compute the gradient at each vertex. To obtain consistent normals, ghost cells are required around blocks. We found that using texture hardware and finite differences was substantially more efficient than attempting to compute normals directly using triangle connectivity and triangle geometry.

2.3.3 Optimizations

Some elements of the computation within the triangle counting and triangle creation phases are redundant. With a naive implementation, the blocks of cells will be sampled twice. Optimizations can be made to reduce the amount of redundant work. We apply two such optimizations: a minimum-maximum table for empty space skipping, and an isosurface crossing table to cache results from the triangle counting phase for use in the triangle creation phase.

Minimum-Maximum Table

Minimum and maximum values of the set of values within X rows of cells are computed at load time. Each row is subdivided into contiguous spans, with the minimum and maximum values being computed and stored for each span. This data lets the triangle counting and triangle creation phases skip spans of cells that do not contain the isovalue, thus potentially reducing the number of required memory reads.
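A sketch of building the span table at load time and of the skip test is given below; the span length and names are hypothetical, and for brevity the sketch operates on spans of samples (a production version would need to account for cells straddling span boundaries, e.g. by overlapping spans by one sample).

```cpp
#include <algorithm>
#include <vector>

struct Span { float lo, hi; };   // min and max of the samples covered by the span

// Build the minimum-maximum table for one X row of samples at load time.
std::vector<Span> buildSpans(const float* row, int nx, int spanLen /* e.g. 16 */)
{
    std::vector<Span> spans;
    for (int x0 = 0; x0 < nx; x0 += spanLen) {
        int x1 = std::min(nx, x0 + spanLen);
        Span s{ row[x0], row[x0] };
        for (int x = x0 + 1; x < x1; ++x) {
            s.lo = std::min(s.lo, row[x]);
            s.hi = std::max(s.hi, row[x]);
        }
        spans.push_back(s);
    }
    return spans;
}

// Empty-space skip test used by both the counting and creation passes: a span
// whose [lo, hi] interval excludes the isovalue is never scanned cell by cell.
inline bool spanActive(const Span& s, float isovalue)
{
    return s.lo <= isovalue && isovalue <= s.hi;
}
```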

Trade-offs are present in terms of how large the spans in the minimum-maximum ta-

ble should be. If the spans are too large, then it may be that fewer opportunities will be

encountered to skip spans that do not contain isovalues. If spans are too small, then too

much memory may be required to store the tables. Additionally, the minimum-maximum

table needs to be read once per span to determine if the span contains the isovalue. This

implies that, in addition to high memory consumption, span lengths that are too small may

also result in excessive memory reads. We found span lengths in the range of 10 to 20 cells

to be reasonable for the test datasets.

Isosurface Crossing Table

When we perform triangle counting, we are identifying the active cells. Rather than scanning all cells a second time in the triangle creation phase, we can record the indices of active cells within each X row in an isosurface crossing table. Then, when we perform the triangle creation we can iterate through this table instead of data values to apply marching cubes only to the cells that are active.

As with the minimum-maximum table, a compromise is present between performance and memory consumption. Large tables supporting a large number of isosurface crossings per X row can permit greater performance in cases where a large number of isosurface crossings per X row occur. Also, because only one of these tables needs to be stored per-

GPU, rather than per-block with the minimum-maximum table, memory limitations are less restrictive. We implement this table as a byte per cell, at the full resolution of a block, because our block sizes are reasonably small.

2.4 Results

The test platform was a cluster of 12 nodes. Each node had 16GiB of memory, two NVIDIA Quadro FX5600s each with 1.5GiB of memory, two quad core AMD Opteron

2350 CPUs at 2GHz, and an Infiniband interface. The algorithm was implemented using

OpenCL for the GPU elements, MPI for inter-node communication, and Intel Threading

Building Blocks [87] for CPU multithreading. An important aspect of this configuration is the hierarchical nature of the parallelism – load must be balanced between nodes, amongst

CPUs, and amongst GPUs.

We conducted four experiments on our test platform to explore:

• the relationship between isosurface triangle counts and isosurfacing time

• strong scalability: speedup in terms of a varying number of GPUs for a fixed data

size

• volume size scalability: performance in terms of varying data size for a fixed number

of GPUs

• the relationship between salient isovalue ranges, isovalues, and speedup

It was found that our cost heuristic yielded substantial performance improvements over a naive round robin distribution of blocks without a cost heuristic.

The test dataset was constructed from a sum of sine waves with random amplitude, frequency, and phase. The isosurfaces for isovalues -1.00 and 3.00 are exhibited in figure 2.4. This dataset was chosen because it offers sufficient complexity and variation to be interesting, and is easy to reproduce at any resolution. The base dataset size we use is

1536x1024x1024, resulting in 6 gigabytes of IEEE754 single precision floating point samples. To maintain consistency, the dataset was downscaled from this base size as necessary for the different experiments.

2.4.1 Triangle Counts versus Isosurfacing Time

The time required to isosurface each block of cells within a volume was recorded, along with the number of triangles in the blocks, resulting in a mapping of triangle counts to times as in figure 2.2. This experiment directly examines the performance of the block isosurfacing algorithm from section 2.3. For a fixed block size, a linear relationship was found between the isosurfacing time for a single block, and the number of triangles within the isosurface in the block. Different elements contribute to the constant and linear factors.

Constant factor

Several elements contribute to the constant term in the linear relationship. Fundamentally, they are of two types: those that are related to the size of the block being isosurfaced and those that are not. In our algorithm, GPU kernel execution startup times and OpenCL

API overhead are independent of the size of the volume blocks being considered. Additionally, the GPU to CPU and CPU to GPU transfer times from within the triangle counting algorithm are dominated by the startup cost of the transfers rather than the size of the transfers, because the transfer sizes are intentionally small, on the order of 128 bytes for a 64³ block and 512 bytes for a 128³ block.

However, other elements of the triangle counting algorithm that contribute to the constant factor do exhibit dependence on the size of the block. Time is required to perform the exclusive prefix sums on the tables for the block as seen in section 2.3.1. Additionally, time is required to perform marching cubes table lookups and volume lookups to count the number of triangles in each cell. The minimum-maximum table optimization from section

2.3.3 seeks to reduce these contributors to the constant factor by reducing the number of cells whose triangle counts must be checked, at the cost of requiring some additional table lookups.

Typical constant time factors seen on our test platform for a 128³ block on a single GPU were around 1.2ms. This time is dominated by the API and kernel startup time overhead.

Further exploration may be worth consideration when NVIDIA Fermi-class GPUs become available, which may reduce kernel context switching time. Additionally, the drivers for

OpenCL are still relatively new, so additional optimizations should be expected in the future to reduce API-related overhead. Such hardware and software improvements would further increase the benefits seen from our algorithm by reducing this constant factor.

Linear factor

The linear term in the triangle count to time relationship also has multiple contributing factors. Marching cubes triangle construction requires interpolations and table lookups, per triangle, with up to five triangles per cell. For the vertices of each triangle, finite differencing using the GPU texturing hardware is used to compute gradients that are normalized to form triangle vertex normals. This requires 18 texture lookups per triangle for central differencing, hence its contribution to the linear factor. While the isosurface crossing table from section 2.3.3 can reduce the constant factor substantially by eliminating the need for a second scan of the volume cells for triangles in the triangle creation phase, it does introduce a linear factor because it requires a write for each non-empty cell in the triangle counting phase (§2.3.1) and a read for each non-empty cell in the triangle creation phase

(§2.3.2). The write is of lesser consequence from a performance standpoint because there are no read-after-write hazards associated with it in the triangle counting phase.

Typical times seen on our test platform for the linear factor were around 40ns per triangle, on a single GPU. Faster GPU memory and better GPU caches would reduce this factor substantially, so it is expected that with new NVIDIA GPUs such as Fermi this linear factor may see a substantial improvement, though not to the same extent that would be expected of the constant factor.

Block size compromise

With some of the constant factor contributors depending on the number of cells in the block, and some not depending on the number of cells in the block, it is clear that choosing an appropriate block size is a trade-off. As the block size is made smaller, the overall performance for isosurfacing in terms of single blocks will decrease because the net

overhead for isosurfacing will be higher, but the load balancing between different GPUs

may be more accurate because of decreased load balancing data granularity. We found

block sizes of around 128³ to be a good compromise, with larger blocks offering insufficient

flexibility for load balancing thus reducing multi-GPU speedup, and smaller blocks having too much overhead.

In our algorithm the triangle counting phase (§2.3.1) contributes primarily to the constant factor while the triangle creation phase (§2.3.2) contributes primarily to the linear factor. As architectures change, the algorithm can be adapted by moving complexity from one phase to the other.

The linear relationship between single block isosurfacing time and the number of triangles enables the transformation of predicted triangle counts into a cost heuristic as discussed in section 2.2.1.

2.4.2 Effects of salient isovalue ranges on speedup

This experiment was conducted to explore how the user selected salient isovalue range

and the isovalue being isosurfaced affect the speedup. Fixed size (1536x1024x1024) data

was broken into roughly uniformly sized 128³ cell blocks, with some variation at the edges

of the volume. Blocks were distributed to the different GPUs using the algorithm discussed

in section 2.2. Isosurfaces were then computed using the algorithm in section 2.3.

Three different runs were performed, each sweeping 2000 isovalues from -3.00 to 7.00,

with the 1100 isovalues ranging from -1.50 to 4.00 exhibited in figure 2.5 with a different

line for each run:

• red line (line L1): uses our cost heuristic in distributing the blocks, with a salient

isovalue range of -1.25 to -0.75 selected.

• green line (line L2): uses our cost heuristic in distributing the blocks, with a salient

isovalue range of 2.75 to 3.25 selected.

• blue line (line L3): uses no cost heuristic, distributing the blocks in an arbitrary order.

Speedup varies based on how evenly isosurfacing work is distributed across the nodes.

The uniformity of isosurfacing work distribution at the node level is a function of both the isovalue and the dataset, because the work is linearly proportional to the number of triangles in the isosurface. Our algorithm assigns blocks to nodes to minimize the variance between the sums of the work assigned to each node, analogous to the variance of the sums of the cost heuristic values of the blocks assigned to each node. The salient isovalue range determines the range of isovalues that are considered when computing the cost heuristic.

The green line (line L2) in figure 2.5 exhibits a strong peak in the range of 2.75 to

3.25 because that is the salient isovalue range that was selected for that run, which implies that the work was assigned to nodes to maximize work uniformity only for those ranges of isovalues. However, other peaks are visible in locations like 1.5 because there is likely a similar spatial distribution of the isosurfaces for values around 1.5 as there is for isovalues around 3.0, in this data. Similarly to the green line, the red line (line L1) exhibits the same phenomenon for a different salient range, -1.25 to -0.75, with different similar regions for the same reason.

The blue line (line L3) is drawn for the naive no-cost-heuristic method. It shows varied performance because the arbitrary block assignments can create different work distribution uniformities, and thus different speedups, for different isovalues. Both the red line and the green line demonstrate substantially improved speedup over the naive method in their salient ranges.

These results show that selecting salient isovalue ranges does offer the potential for improved speedup within those regions. At the same time, isovalues outside of those ranges do not suffer unacceptable penalties in speedup, sometimes even receiving improved speedup.

2.4.3 Volume size scalability

This experiment examined volume size scalability of our algorithm; that is, it ran trials for varying data sizes, with a fixed number of processing elements. The data was scaled from 1536x1024x1024 down to the appropriate sizes. Data was divided into roughly uniformly sized 128³ cell blocks. 32 blocks were assigned per GPU for the largest resolution and 2 blocks were assigned per GPU for the smallest resolution.

Two different runs were performed, one with our proposed cost heuristic, with a salient isovalue range of 2.75 to 3.25, the other with no cost heuristic and arbitrary block assignments. From each of those runs, data was collected for two different ranges of isovalues:

• 100 isovalues in 2.75 to 3.25, the salient range

• 2000 isovalues in -3.00 to 7.00, the entire range

Mean and maximum times were recorded per isovalue, yielding the four lines in figure 2.6.

Small scale variation occurs within the lines of figure 2.6 primarily because there is a small degree of noise present in the isosurfacing times and in the cost heuristic accuracy, and that noise can manifest itself in the results. In the case of the run done with no cost heuristic there is a second source of variation. With no cost heuristic, the block assignments are arbitrary, thus changing the data size completely rearranges the block assignments, which results in substantial variation in the resulting times. In the case of our

cost heuristic based algorithms, such variation does not occur to the same extent because

block assignment is done based upon the cost heuristic.

The red line (the top line in the key) shows the performance of our method when the salient range matches the range being isosurfaced. It substantially outperforms the other test cases, with triangle rates on the order of 250 million per second. Comparing the other lines it can be seen that the performance can still be better than using no heuristic at all even when the isosurfacing is done in ranges outside of the salient isovalue range.

All four lines exhibit good volume size scalability, with the performance not substantially decreasing for an increasing data size on the same set of processing elements. In the next section it will be shown that the algorithm delivers strong scalability as well.

2.4.4 Strong scalability

This experiment was conducted on a fixed size 768x512x512 test data set, breaking the

data into blocks of approximately 128³ cells. The experiment was run on 4, 6, 8, 12, 16,

and 24 GPUs to examine strong scalability; that is performance scaling for a fixed data

size and varying numbers of processing elements. The time to isosurface every block was

recorded, and the results for every block were stored, including triangles with normal data.

Two different runs were performed for each GPU count:

• using our cost heuristic, with a salient isovalue range of 2.75 to 3.25 selected.

• using no cost heuristic

From each of those runs, we took the mean and maximum times for isosurfacing two ranges

of isovalues, 2.75 to 3.25 (the salient range), and -3.00 to 7.00 (the entire range.) For the

former range 100 isovalues were sampled and for the latter range 2000 isovalues were

sampled. This resulted in the 4 lines in figure 2.7.

The dependence on triangle counts for isosurfacing on the GPUs means that load may

be distributed unevenly between nodes depending on the isovalue and the data. Our block

distribution technique seeks to reduce this disparity, and the results in figure 2.7 exhibit its

success in achieving this.

Over the salient isovalue range of 2.75 to 3.25, our technique has substantially better speedup (21x on 24 GPUs) versus the naive technique with no cost heuristic over the same range (13x on 24 GPUs). Even over the entire range of values, -3.00 to 7.00, a modest benefit in speedup was seen versus the naive technique. When scaled to a larger number of GPUs, with appropriately larger data, we expect that scalability would continue similar trends.

Making the block size smaller could, in principle, further improve the load balance. However, this would be unlikely to increase overall performance, because decreasing the block size would decrease the absolute performance per GPU. If interconnect, bus, and/or disk speeds were higher relative to the speed of the GPUs, an attempt could be made to dynamically load blocks on demand. However, we already attain 86% efficiency over the salient range of isovalues with 24 GPUs, so it is unlikely that such an approach could yield further performance improvement.

2.5 Conclusion

We have presented an efficient, load-balanced multi-node, multi-CPU, multi-GPU method for computing triangular isosurfaces on volume data. A preprocessing stage computes metadata, permitting the efficient computation of a cost heuristic. The cost heuristic is computed for blocks using the preprocessed data and user-specified hints on isosurface saliency, then blocks are distributed to GPUs to maximize work uniformity. An efficient

parallel isosurfacing algorithm is then applied on each GPU with the assistance of the CPUs

to produce triangles in packed arrays that may subsequently be used for rendering or other

computations.

Our implementation is able to deliver isosurfacing performance in excess of 250 million triangles per second on 24 GPUs. Strong scalability is exhibited with 90% utilization with 8 GPUs and 86% utilization with 24 GPUs. Our algorithm enables the leveraging of contemporary hybrid-architecture clusters with CPU and GPU resources for more efficient exploration of large scale volume data.

[Figure 2.3 flow diagram: volume data → per-cell triangle counts via marching cubes case lookups → per-row and per-X-Y-plane triangle count sums → exclusive prefix sums (per plane, and per row with plane offsets) → per-row triangle offsets → triangle creation with marching cubes → packed triangle data]

Figure 2.3: The triangle counting and creation process computes vertex buffer offsets for rows of the block of cells being isosurfaced then applies marching cubes to fill the vertex buffer.

Figure 2.4: The blue (dark) surface is isovalue -1.0 within the test volume used for the subsequent graphs and the yellow (light) surface is isovalue +3.0 within the same volume. At a volume resolution of 384x256x256 the yellow surface contains 298858 triangles and the blue surface contains 916337 triangles.

[Figure 2.5 plot, "Effects of Salient Isovalue Ranges on Speedup": 24 GPU speedup versus isovalue; lines: proposed cost heuristic with salient isovalue range -1.25 to -0.75 (L1), proposed cost heuristic with salient isovalue range 2.75 to 3.25 (L2), no cost heuristic (L3)]

Figure 2.5: The salient isovalue ranges substantially affect the performance. In this figure it can be seen that the speedup is improved over ranges of isovalues that are specified as salient. When no cost heuristic is used, the distribution of performance over the isovalue range is not well defined because the effective cost value of the work for each block is equal. Each line has 1100 sample isovalues, computed over a 1536x1024x1024 test volume on 24 GPUs.

[Figure 2.6 plot, "Scaling for Varying Volume Sizes on 24 GPUs": millions of isosurface triangles per second (with normals) versus gigabytes of single precision floating point volume data; lines: our proposed cost heuristic (mean of isovalues 2.75 to 3.25), our proposed cost heuristic (mean of isovalues -3.00 to 7.00), no cost heuristic (mean of isovalues 2.75 to 3.25), no cost heuristic (mean of isovalues -3.00 to 7.00)]

Figure 2.6: The performance advantage of using our cost heuristic over using no cost heuristic is maintained over the range of loadable volume sizes on a cluster of 24 GPUs. The salient isovalue range used for the cost heuristic is 2.75 to 3.25, resulting in a mean isosurfacing performance on the order of 250 million triangles per second over that range of isovalues. Using no cost heuristic over that same range yields performance on the order of 175 million triangles per second.

[Figure 2.7 plot, "Scaling for Varying Numbers of GPUs on a Fixed-Size Volume": speedup versus number of GPUs (4 to 24); lines: our proposed cost heuristic (mean of isovalues 2.75 to 3.25), our proposed cost heuristic (mean of isovalues -3.00 to 7.00), no cost heuristic (mean of isovalues 2.75 to 3.25), no cost heuristic (mean of isovalues -3.00 to 7.00)]

Figure 2.7: Using our proposed cost heuristic improves scalability, especially when the isovalues for which isosurfaces are being computed are within the salient range. In this figure, the salient range of isovalues used for the computation of the cost heuristic is 2.75 to 3.25 and the volume is 768x512x512 samples.

Chapter 3: Stereo Frame Decomposition for Error-Constrained Remote Visualization

Continued growth of dataset sizes relative to bandwidth availability continues to pro-

vide challenges for visualization systems. Simultaneously, trends in computing continue to

move applications into the cloud, with workstations being replaced by thin clients with lim-

ited bandwidth for cloud access. Additionally, stereo video solutions have become lower-

cost and more common, with devices such as NVIDIA 3D Vision increasing the potential

for their use by a wider range of visualization users.

Consider the case of an engineer seeking to interactively visualize the results of a simulation using a thin client with very limited memory and compute resources, while the simulation compute resources and their associated storage solutions are remotely located. Even for modestly sized datasets, the volume of data to analyze will be considerably larger than the number of pixels in one frame. Additionally, the size of the data will likely be larger than the memory available on the client. In this case, video-based remote visualization techniques are likely to be more effective than techniques that seek to move the simulation result data directly to the client.

Interactive visualization requires reasonably high framerates of at least 10 FPS [133].

This means that even for resolutions as low as 720p, with 24 bits per pixel per eye, greater than 52MiB/second is required for uncompressed transmission. Compression is clearly

needed, but currently available lossy video codecs do not support visualization-specific

error constraints that consider aspects such as what transfer functions are being used. Loss-

less compression could be an option, except that it wastes space by transmitting the infor-

mation needed for lossless reconstruction rather than the minimal information needed to

satisfy less restrictive error constraints. This can be observed in figures 3.1 and 3.3.

We propose a solution for video-based remote visualization that enables lossy coding

of stereo video streams subject to user-provided error constraints. The stereo color and

depth frame streams are decomposed into one depth, one color, and two residual streams.

A novel video+depth coding algorithm is used to take advantage of coherence between

the eyes and a novel residual coding technique is used to enable the use of off-the-shelf

lossy codecs for color and depth stream transmission while adhering to error constraints. A

novel framework is proposed that enables integration with existing remoting solutions and

the utilization of multiple CPUs and GPUs. Visualization techniques that can benefit from

the enhanced depth perception permitted by stereo, such as maximum intensity projection

and shaded isosurfaces, are then used in experiments to demonstrate the efficacy of the

technique.

This chapter is organized as follows. Related work is reviewed in §3.1, the technique itself is described in detail in §3.2, and some suggested error constraints that may be used with it are described in §3.3. Finally, the results are discussed in §3.4.

3.1 Related Work

Fundamentally, our technique seeks to adapt remoting techniques used for tasks other than visualization, developed in much larger markets that have funded considerable innovation, to work well for visualization. Of close relation to our technique are stereo

[Figure 3.1 panels: (a) Degraded, (b) Ground truth, (c) Color difference]

Figure 3.1: The difference between the ground truth and the error-constrained degraded image is wasted information that would need to be transmitted, if lossless encoding were used.

reprojection and view synthesis, residual coding, stereo video coding, and other remote visualization schemes.

We consider two different categories of remote visualization tools: those specifically designed for visualization, and those designed for more general use. While the visualization-specific tools may offer better performance for a limited set of applications, the general purpose tools may be more accessible on a wider variety of platforms and be lower cost to implement. Our technique seeks to bridge the gap between more general remoting and stereo visualization.

TightVNC is a commonly used general remoting solution that offers a lossy JPEG codec, not supported by standard VNC, as an encoding option in addition to the standard lossless and simple lossy VNC codecs. A more advanced solution, demonstrated at

SIGGRAPH 2011, is the NVIDIA Monterey Reference Platform, which leverages GPUs to

enable low-latency streaming of H.264 [106] video while supporting Android and Windows clients.

Within the context of visualization, MPEG has been used [18] to stream images to workstations. It was found to improve the temporal resolution of the models visible for the bitrates tested, so it is reasonable to think that H.264 may also perform well in that regard.

ParaView offers a range of options for remote visualization, though none are similar to our technique [7]. More similar to our technique is the work using Chromium by Lamberti et al. [59] except that they do not offer error constraints on the results, stereo support, or the ability to apply remapping (§3.2.4) to improve compression performance.

Stereo and multiview video coding has been explored in contexts outside of visualization. The current H.264/MPEG-4 AVC Standard offers standard extensions to support stereo and multiview video coding [111]. These extensions define ways of packing multiple views of a scene into a single encoded image stream. However, they do not directly consider depth information, though it is mentioned as a possibility for future research.

Smolic et al. [103] provide a good overview of techniques for stereo video coding that are relevant to the context in which our technique operates. Three general categories of approaches are discussed: conventional stereo, video + depth stereo, and layered depth video.

Conventional stereo methods transmit color information separately for each eye. Özbek et al. [78] apply this using lossy codecs, adaptively choosing different bitrates for the two views. Rate balancing is also important in the context of our technique, and is addressed in §3.2.5. We compare our technique to a conventional (discrete) stereo technique, using off-the-shelf codecs, in section 3.4.4.

Layered depth video methods bear some similarity to the conventional stereo methods, except that they separate foreground and background objects, encoding them separately, possibly using video + depth encoding. Other techniques proposed by Moellenhoff et al.

[73], Jiang et al. [48], Yan et al. [126], and Yang et al. [128] are similar to the joint coding technique we use for comparison. They encode one or more primary views (in the case of multiview) then encode other views differentially with respect to the primary views.

Video + depth stereo techniques, such as those proposed by Smolic et al. [103], Smolic

et al. [104], and Merkle et al. [72] transmit the color and depth information for one view,

then use that information to reconstruct the image for both views. In contexts such as live

action broadcast video, the depth buffer is not known. Thus, it must be estimated. This has

been a long-standing image processing problem, and has been addressed by many works.

Some techniques of particular relevance to video + depth encoding are proposed by Roy et

al. [90], Saxena et al. [93], Yang et al. [127], and Saxena et al. [94]. However, it is still a

fundamental source of error in depth frames.

However, for many visualization applications, depth information is available from the rendering process. For example, for isosurface rendering techniques, the depth information for the surface is known. Given this, we take a video + depth approach in our system, synthesizing the right frame from the color and depth buffers of the left frame.

For a video + depth solution to work, we must be able to synthesize one view from the other view. The depth information combined with the camera transformation associates each pixel in the source view with a world space position. These pixels can then be projected to the destination view using the camera transformation of the destination view.

Conceptually, this is simple, but there are some challenges. Firstly, what may be a dense sampling in one view may be a sparse sampling in another, leaving gaps. Secondly, with the

typical asymmetric frustum parallel axis projection used in stereo rendering, occlusion will

create gaps because all points visible from the right eye are not necessarily visible from the

left eye. Pixels that have multiple world space positions contributing to their colors due to

transparency introduce additional complexity which is handled via residual coding (§3.2.3)

in our system.

Sample reprojection for view synthesis has been looked at in previous works in multiview video compression. Martinian et al. [70] apply a technique considering camera information to reproject point samples. While the technique does have some similarity to ours in that they transform the point samples with matrix multiplications and use H.264, we use a different technique for filling the gaps (§3.2.1), apply a different camera transformation

because we know the exact camera in our case (§3.2.1), propose a GPU implementation,

and correct the results using a residual (§3.2.2) to enable its use for visualization.

In our system, residuals are required both because we use lossy codecs for the color

streams and because view reprojection cannot, in general, reconstruct views completely.

Residual coding has been applied previously in many widely used predictor-corrector based

techniques. Examples of mainstream codecs include PNG (Portable Network Graphics),

FFV1 [75], and LJPEG [95]. In techniques like these, a predictor operates to reconstruct

samples from previously reconstructed samples, then a corrector stores the resulting resid-

ual from a comparison versus the ground truth. Conceptually, this is similar to our tech-

nique except that our residuals are lossy (controlled by the error constraint) and the predic-

tor is a lossy video compression technique like H.264.

Another way of looking at residual coding in the context of image compression is to

look at lossy wavelet-based methods such as lossy JPEG 2000 as predictor-corrector based

methods. Effectively, the low pass filter of each filter stage acts as a predictor and the high

pass filter acts as a corrector [98]. The choice of what wavelet coefficients to zero for the

purposes of compression is analogous to the choice of residual entries to zero in the context

of our residual decimation technique (§3.2.2.)

We are not aware of any techniques like ours that apply lossy residual coding, with visualization-centric error constraints (§3.3), as a corrector on top of mainstream video codecs.

3.2 Technique

The goal of the technique is to reduce the bandwidth needed for remoting while adhering to user-provided error constraints. The approach we take accomplishes this, in addition to enabling support for the use of existing off-the-shelf lossy codecs that can take advantage of temporal coherence.

The stereo input frame, containing left color (LC), left depth (LD), and right color streams (RC), is decomposed into an encoded left depth primary stream (LDE), an encoded left color primary stream (LCE), an encoded left residual stream (LRE), and an encoded right residual stream (RRE) as in figure 3.2a. Different components are encoded using different techniques to improve performance. By reconstructing (§3.2.1) the output right color stream using the LCE, LDE, and camera transformation (CX), coherence between the left and right images can be utilized.

The LCE and LDE can be encoded with off-the-shelf codecs, such as H.264, that are hardware accelerated on mobile devices. This allows for the technique to take advantage of motion compensation and other features offered by contemporary video codecs to further improve performance. Additionally, this enables piggybacking of our technique onto existing remoting solutions such as the NVIDIA Monterey Reference Platform.

Because lossy video codecs, in general, will be incapable of meeting the user-provided

error constraints proposed in §3.3, the primary streams are augmented with residual streams

as in figure 3.2a. Each of these residual streams contains an approximately minimal amount

of information to correct the lossily encoded frames (generated using the techniques de-

scribed in §3.2.2 and §3.2.3) to adhere to user-defined error constraints. With the algorithm described in section 3.2.4, colors are remapped using known properties of the transfer functions (XF) being used. This further improves compression by reducing the amount of information that needs to be stored in the residual.

Details about the technique are in the following sections.

3.2.1 Reprojection

Substantial redundancy exists between sample values of each eye [103]. For example, consider looking at a surface from both eyes, as in figure 3.6b. If the frames for the two eyes are transmitted separately, then the color information for a point that is visible in both eyes is transmitted twice. The goal of reprojection is to minimize this kind of redundancy by only transmitting the color information for one eye while using camera and depth information to reconstruct the image for the other eye.

The reprojection algorithm uses the color image and depth image for one eye, combined with the camera transformations for both eyes, to synthesize the view for the other eye. In the data flows shown in figures 3.2b and 3.2c, this is used to reconstruct the right eye using information from the left.

We propose an algorithm similar in goal and overall approach to Martinian et al. [70], in that we first synthesize a new view from existing views, then fill any gaps left by the view synthesis. However, the specifics of the approach we take are substantially different.

In contrast to their approach, we are reconstructing a view from only one source view,

with knowledge of the depth and camera projection information. Additionally, to avoid an

unnecessary pass of color remapping (§3.2.4), we do not synthesize any new colors in the

gap filling algorithm. Finally, we are more focused on efficiency as decoding needs to be

lightweight enough for thin clients and needs to be easy to implement with a data-parallel

programming paradigm such as that offered by CUDA.

View synthesis and Filtering

Each pixel in the source eye image has a camera space position defined by its depth buffer value and image position. The inverse camera transformation for the left eye is applied to pixels' camera space positions to produce world space positions. These world space positions are then projected to the destination eye using the camera transformation of the destination eye. The color of the source pixel is written to the single pixel at the projected destination position. Depth testing is applied so that the write closest to the camera will be used if the same pixel is written multiple times. Pixels with the background color are not projected, as both framebuffers have been previously cleared with the background color. This algorithm is easily implemented with CUDA on platforms supporting the CUDA 1.1 atomicMin operation.
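The per-pixel math is essentially two matrix transforms followed by a depth-tested scatter. The sketch below shows the scalar computation for one source pixel and is only an illustration: GLM-style math types are assumed, depths are assumed to be stored in [0, 1] with standard NDC conventions, and the depth-tested (atomicMin) scatter of the color into the destination frame is omitted.

```cpp
#include <glm/glm.hpp>   // assumed: GLM-style mat4/vec4 math types

// Reproject one source (left-eye) pixel into the destination (right-eye) image.
// srcDepth is the left depth buffer value in [0, 1]; invLeftVP is the inverse of
// the left eye's view-projection matrix; rightVP is the right eye's
// view-projection matrix. Returns false if the pixel lands outside the frame.
bool reproject(int x, int y, float srcDepth, int width, int height,
               const glm::mat4& invLeftVP, const glm::mat4& rightVP,
               int& dstX, int& dstY, float& dstDepth)
{
    // Pixel + depth -> left-eye normalized device coordinates -> world space.
    glm::vec4 ndc(2.0f * (x + 0.5f) / width  - 1.0f,
                  2.0f * (y + 0.5f) / height - 1.0f,
                  2.0f * srcDepth - 1.0f,
                  1.0f);
    glm::vec4 world = invLeftVP * ndc;
    world /= world.w;

    // World space -> right-eye clip space -> right-eye pixel coordinates.
    glm::vec4 clip = rightVP * world;
    if (clip.w <= 0.0f) return false;
    glm::vec4 r = clip / clip.w;
    dstX = static_cast<int>((r.x * 0.5f + 0.5f) * width);
    dstY = static_cast<int>((r.y * 0.5f + 0.5f) * height);
    dstDepth = r.z * 0.5f + 0.5f;
    return dstX >= 0 && dstX < width && dstY >= 0 && dstY < height;
}
```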

While this algorithm was found to produce images of reasonable quality, some gaps are present in the results due to occlusion and varying sampling rates of the world space from the slightly different perspective views. The residual codec could be applied to correct any artifacts, but this would increase the bitrate required. Instead, a gap filling filter is applied.

Each pass of the gap filling filter iterates over all of the pixels in the destination. Each pixel in the destination view that is still set to the background color, has an uninitialized depth value, and has at least one initialized neighbor will have its color and depth set from

the neighbor with the minimum depth value. Passes are applied until the gaps have been

sufficiently reduced. Like the reprojection algorithm, this gap filling algorithm is also easily

implemented with CUDA.

The resulting destination eye image from this process will be an approximation of the view from the destination eye. Any artifacts remaining that violate the error constraint will be corrected by residual coding.

Depth encoding

Depth buffers are typically dominated by low spatial frequency regions with a few high spatial frequency regions at boundaries, similarly to many natural images, as in figure

3.6a. Because of this, lossy video codecs such as H.264 can be used to encode depth

[103]. We observed that the bitrate required to encode typical depth buffers while still providing good quality reprojections, as shown in graph 3.4b, was substantially lower than the color bitrate required to provide color reconstructions, as in graph 3.4a. Additionally, storing the depth buffer at 8 bits per pixel was still found to be very effective for our test cases, given that small variations in depth value have very small effects on the spatial position for reprojection.

3.2.2 Residual Decimation

The lossy codec (such as H.264) used for the primary streams (LCE and LDE) produces frames different from the ground truth. The difference between the ground truth and the decoded lossy frame is the residual.

Because the goal is to perform lossy coding subject to an error constraint rather than to perform lossless coding, the entire residual is typically not needed. The residual can be

simplified, producing a decimated residual, to reduce the amount of wasted information, as seen in figure 3.1.

Residuals are decimated by replacing every nonzero value with a zero, when the replacement will not result in a user-specified error constraint (discussed in §3.3) being violated. This is similar in motivation to how JPEG applies quantization to produce repeated zero coefficients [114]. Figure 3.3 shows decimated residuals in comparison to undecimated residuals. Because the decimated residuals are less complex, they are easier to compress.
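A sketch of the decimation loop for one frame follows; the RGB type, the residual representation (per-channel difference), and the names are illustrative assumptions, and the error constraint is supplied as a predicate T(c, c′) in the sense of §3.3.

```cpp
#include <cstdint>
#include <functional>
#include <vector>

struct RGB { std::uint8_t r, g, b; };

// Decimate a per-pixel residual: wherever the degraded pixel is already an
// acceptable replacement for the ground truth under the error constraint
// T(groundTruth, degraded), the residual entry is set to zero. The decoder is
// oblivious to this choice: it always just adds whatever residual it receives.
void decimateResidual(const std::vector<RGB>& groundTruth,
                      const std::vector<RGB>& degraded,
                      std::vector<RGB>& residual,   // per-channel difference image
                      const std::function<bool(const RGB&, const RGB&)>& T)
{
    for (std::size_t i = 0; i < residual.size(); ++i)
        if (T(groundTruth[i], degraded[i]))
            residual[i] = RGB{0, 0, 0};   // no correction needed for this pixel
}
```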

The decoder does not need to be aware of the decimation strategy, as it simply adds the decimated residual to the degraded image to correct the image. This means that other strategies could easily be applied for residual decimation. For example, a decimation strategy could be tuned to improve performance for the specific codecs being used, taking advantage of hardware support.

3.2.3 Decimated Residual Codec

Once a decimated residual has been computed as in §3.2.2, it must be losslessly compressed using a technique that is well-suited to the characteristics of decimated residuals.

Typical decimated residuals look like figures 3.3c and 3.3d. They have sparse arrangements of single pixel differences, with narrow contiguous regions of pixels near boundaries.

The eye whose image is reconstructed using the reprojection algorithm (§3.2.1) generally has more boundary artifacts. These are due to differences in visibility between the two eyes, as determined by eye separation.

In general, the codecs used for encoding the decimated residuals must be good at encoding sparse single pixels and curves of pixels, rather than continuous regions with gradients.

This means that DCT-based codecs such as JPEG, which were designed for photographic

image encoding [114], are not well suited to the task.

While many different codecs could be used for encoding the residual, we found the combination of Zero-Run-Length Encoding (ZRE) with Lempel-Ziv-Oberhumer (LZO [77]) to offer good performance in terms of encoding speed, compression ratio, and simplicity of implementation. For encoding, ZRE is first applied then LZO is applied. For decoding, the reverse is done.

Other configurations such as ZRE alone, LZO alone, applying a Hilbert Curve reordering of samples, and applying multi-frame temporal-differential coding were tested, but the

ZRE+LZO scheme was found to be the most effective. Temporal-differential coding can work well on images that are mostly static, but these are not encountered in our test suite.

Automatically switching between temporal-differential and non-differential based on scene dynamicity would be trivial.

ZRE collapses sequences of zeroes into a single value. This is similar in approach to run length encoding (RLE), except that ZRE does not need to store the symbol type for a repeating sequence, by assuming that it is zero. ZRE is well-suited to encoding residuals because our decimation technique produces long runs of zeros with very little continuous repetition of nonzero values.
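As an illustration, a byte-oriented ZRE pass could look like the sketch below; the exact on-wire format used by the system is not specified here, so this layout (a zero marker byte followed by a run length) is an assumption chosen purely for the example. The output would then be handed to LZO, and decoding applies the steps in reverse.

```cpp
#include <cstdint>
#include <vector>

// Zero-run-length encode a residual byte stream: a run of zeros is emitted as
// the pair (0, runLength) with runLength in 1..255 (longer runs are split), and
// all nonzero bytes are emitted literally. Unambiguous because a literal zero
// byte never appears outside a (0, runLength) pair.
std::vector<std::uint8_t> zreEncode(const std::vector<std::uint8_t>& in)
{
    std::vector<std::uint8_t> out;
    for (std::size_t i = 0; i < in.size(); ) {
        if (in[i] != 0) { out.push_back(in[i++]); continue; }
        std::size_t run = 0;
        while (i < in.size() && in[i] == 0 && run < 255) { ++i; ++run; }
        out.push_back(0);
        out.push_back(static_cast<std::uint8_t>(run));
    }
    return out;
}

std::vector<std::uint8_t> zreDecode(const std::vector<std::uint8_t>& in)
{
    std::vector<std::uint8_t> out;
    for (std::size_t i = 0; i < in.size(); ) {
        if (in[i] != 0) { out.push_back(in[i++]); continue; }
        std::uint8_t run = in[i + 1];      // expand the (0, runLength) pair
        out.insert(out.end(), run, 0);
        i += 2;
    }
    return out;
}
```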

LZO is a lossless sliding dictionary-based compression algorithm designed to offer fast encoding and very fast decoding while still offering competitive compression rates

[77]. This makes it ideal for our circumstances, where we need to encode and decode data in real-time on thin clients with limited compute resources. It has also been used previously in the context of remote visualization [19], though they probably could have improved performance by applying some data preparation (similar to ZRE) before the LZO

coding. Because ZRE tends to produce strings of words that have some repetition, LZO is

well-suited for encoding.

Residual encoding and decoding can both be implemented in parallel, for use on mobile devices with multiple relatively-slow cores, with an acceptable compression performance penalty. This can be done by breaking the image into multiple blocks of pixels and applying the residual codec on a block-wise basis, with one thread per block.

3.2.4 Remapping

When an image is encoded with a lossy primary codec such as H.264, the set of output colors is generally not a subset of the set of input colors. By definition, any color in the output that is not in the input is a color in error. Knowing the set of colors in the input, we can remap the set of output colors back to that set of input colors. This offers two key benefits.

First, with remapping being applied to the decoded image before the residual is com-

puted, much of the information that would otherwise be necessary to transmit as part of the

residual to map samples back to the ground truth does not need to be sent. This is because

a remapped sample is more likely to be correct in terms of the error constraints than an

unremapped one, thus improving compressibility of the residual.

Secondly, some error constraints, such as ITFC (§3.3.3) and TFD (§3.3.2), require that

there is a corresponding transfer function space position for every image space sample.

Because the samples produced by a lossy primary codec are not necessarily in the transfer

function, there must be a mapping between possible result colors from the lossy primary

codec and the transfer function space. The remapping operation maps a set of samples

whose values belong to a set B to values within a set A. We experimented with a couple of different operators:

Simple inverse transfer function Given a sample color value, the color value with the

smallest color difference (as defined in CIE 1976 [47]) to it within the transfer func-

tion is used.

Frame history For each frame, the mapping of observed samples to the ground truth sam-

ples is recorded, permitting estimation of the conditional probability that a sample

will have some remapped value y ∈ A given an observed value of x ∈ B. For remap-

ping a frame, the probabilities of the previous frame are used.

In practice, the simple inverse transfer function method was found to outperform the frame history method and was substantially simpler to implement. The frame history method was found to yield conditional probabilities that tend to produce incorrect results, because they depend on lossily encoded samples.

The inverse transfer function can be easily computed in parallel on a GPU using a map-reduce algorithm to find the nearest color in the transfer function for each color that may occur in a decoded image. This results in a 3D lookup table mapping decoded image colors to transfer function colors. In cases where some colors are not invertible due to ambiguities, errors in color remapping will be corrected by the residual if they violate the user-defined error constraint.
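A host-side sketch of building such a lookup table is shown below; the grid resolution is a hypothetical choice, and a plain Euclidean RGB distance stands in for the CIE 1976 color difference purely to keep the sketch short (the system uses the CIE 1976 metric and computes the table on the GPU).

```cpp
#include <cstdint>
#include <vector>

struct RGB { float r, g, b; };

// Build a quantized 3D lookup table mapping any decodable color to the index of
// the nearest color present in the transfer function. Euclidean RGB distance is
// used here for brevity; the system uses the CIE 1976 color difference.
std::vector<std::uint16_t> buildRemapLUT(const std::vector<RGB>& tfColors,
                                         int bins /* e.g. 32 per channel */)
{
    std::vector<std::uint16_t> lut(static_cast<std::size_t>(bins) * bins * bins);
    for (int r = 0; r < bins; ++r)
        for (int g = 0; g < bins; ++g)
            for (int b = 0; b < bins; ++b) {
                RGB c{ (r + 0.5f) / bins, (g + 0.5f) / bins, (b + 0.5f) / bins };
                float best = 1e30f; std::uint16_t bestIdx = 0;
                for (std::size_t i = 0; i < tfColors.size(); ++i) {
                    float dr = c.r - tfColors[i].r, dg = c.g - tfColors[i].g,
                          db = c.b - tfColors[i].b;
                    float d = dr * dr + dg * dg + db * db;
                    if (d < best) { best = d; bestIdx = static_cast<std::uint16_t>(i); }
                }
                lut[(static_cast<std::size_t>(r) * bins + g) * bins + b] = bestIdx;
            }
    return lut;   // indexed by the quantized decoded color; remapping is one lookup
}
```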

3.2.5 Rate Balancing

As can be seen in figure 3.2a, the technique decomposes the input into two primary streams and two residual streams. The bitrate used for the codec (such as H.264) for each primary stream is configurable and may vary over time.

The bitrates of the residuals are strongly related to the artifacts introduced by the pri-

mary codec streams, which implies that they are strongly related to the bitrates of the

primary codec streams. As shown in table 3.1 and figure 3.4a, an increase in the bitrate

of the left color primary stream (LCE) yields a decrease in the bitrate of the left resid-

ual stream (LRE). This is reasonable because the only two things that can vary in bitrate

that contribute to the left color frame are the LCE and the LRE. The left depth primary

stream (LDE) bitrate does not affect the LRE bitrate because the LDE is not an input to any

functional blocks producing the left color frame (LCD.)

Similarly, as can be seen in figure 3.4b, an increase in the bitrate of the LDE yields

a decrease in the bitrate of the right residual stream (RRE.) However, the bitrate of the

LCE is largely decoupled from the RRE because the reprojection process is done using

the post-residual-application left color frame (LCD), not the raw decoded frame from the

LCE. A circumstance under which the coupling may become substantial is if very loose

error constraints are used, but in practice we did not find any reasonable error constraints

that produce substantial coupling.

This decoupling allows for the optimal bitrate to be chosen independently for the LDE

and the LCE, which substantially simplifies the optimization problem. Graphs 3.4c and

3.4d exhibit overall bitrate as a function of LDE and LCE bitrate. As would be expected,

the optimal bitrate increases as the error constraint is tightened. The curves vary in data,

transfer function, error constraint, and primary codec-dependent ways, so a closed-form

formula to find the optimal bitrate is not practical.

Applying a simple direct optimization approach was found to work well over a range of bitrates. First, a function of the form y = a/x + b is fit to the LRE (y) and RRE (y) bitrates as a function of the LCE (x) and LDE (x) bitrates, respectively. Independent optimization problems of the form argmin_x (a/x + x + b) can then be defined for the LCE bitrate and the LDE bitrate. These can be solved directly by finding the positive zero of 1 − a/x², that is, x = √a. One challenge with this is finding the values of a and b. One approach is for the system to periodically sweep the bitrate x to find a set of (x, y) tuples to fit for the values a and b. Different curves can be applied for the fit, as needed. In general, the overall bitrate is not strongly sensitive to small changes in the LCE or LDE bitrate.
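Under this fitted model the optimum has a closed form, as the tiny sketch below illustrates; the function name and the example numbers are illustrative only.

```cpp
#include <cmath>

// With the residual bitrate modeled as y(x) = a/x + b, the total bitrate for a
// primary/residual pair is f(x) = a/x + x + b; setting f'(x) = 1 - a/x^2 to zero
// gives the optimal primary bitrate x* = sqrt(a).
double optimalPrimaryBitrate(double a)
{
    return std::sqrt(a);
}

// Example: if the sweep-based fit for one pair yielded a = 4.0e12 (bits^2/s^2),
// the optimal primary bitrate would be sqrt(4.0e12) = 2.0e6 bits/s; b shifts the
// total bitrate but does not change the location of the minimum.
```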

An alternative to direct optimization is to apply PID (proportional-integral-derivative) control for the bitrate for the left eye (LCE + LRE), and for the right eye (LDE + RRE.)

Because the system is not ill-conditioned, and has fairly limited frame latency, standard techniques like Ziegler-Nichols can be used to find the PID coefficients for the use cases of interest. If codecs are used that have substantial frame latency, such as H.264 with

B-frames, the proportional gain possible will be limited due to the potential of oscillations.

Finally, manual control of the bitrates can be a reasonable alternative, if the implementation cost of the other techniques is prohibitive, because the curves in graphs 3.4c and 3.4d have easy to observe global minima.

                  Left Color   Left Residual   Left Depth   Right Residual
Left Color           1.00         -0.96           0.00          -0.02
Left Residual       -0.96          1.00           0.00           0.02
Left Depth           0.00          0.00           1.00          -0.83
Right Residual      -0.02          0.02          -0.83           1.00

Table 3.1: Cross correlations were computed between the bitrates for many observed trials.

3.3 Error Constraints

Error constraints control how different a frame can be from the ground truth. Within the context of our system, an error constraint is defined by implementing a boolean function

T (c,c′) where c is the ground truth color and c′ is a potential replacement color for that ground truth color. If and only if this function returns true may c be allowed to be replaced

by c′.

We suggest three different error constraints that may be used in different circumstances depending on user intent and transfer function properties: color difference (CD), transfer function distance (TFD), and integrated transfer function contrast (ITFC). Other error constraints can be applied, if needed for a particular application.

The TFD and ITFC error constraints both share one requirement: there must be a mapping from remapped (§3.2.4) image space color to data-domain value. Examples of applications where this may be appropriate are transfer function-shaded isosurface rendering, maximum intensity projection volume rendering, and volume rendering where there is no semi-transparency. A related limitation is that the error constraints are most useful when there is a one-to-one mapping from image domain positions to data domain positions.

3.3.1 Color Difference

Simply considering the color difference, in terms of CIE 1976 [47], can be a viable option in cases in which the important measure of difference within a transfer function is a difference in color. It is defined as:

T_{CD}(c, c') = \begin{cases} \text{true} & |c - c'| < D_{max} \\ \text{false} & \text{otherwise} \end{cases}    (3.1)

where |c − c′| is the CIE 1976 color difference between c and c′, and D_max is the error constraint value.

This constraint may be a good choice when the color space distances within the transfer

function are substantially different from the corresponding transfer function space distances

(§3.3.3). This commonly occurs when users want to emphasize some ranges of values more

than other ranges of values. Figure 3.5a exemplifies this kind of transfer function.
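In code, the CD constraint reduces to a single distance comparison; the sketch below assumes both colors are already expressed as CIE 1976 (L*, a*, b*) triples, so the color difference is a Euclidean distance, and the function name is illustrative.

    import math

    def t_cd(c, c_prime, d_max):
        # CD error constraint (equation 3.1): allow replacement only if the
        # CIE 1976 color difference is below D_max. Inputs are assumed to be
        # (L*, a*, b*) triples; the RGB-to-Lab conversion is omitted here.
        delta_e = math.dist(c, c_prime)
        return delta_e < d_max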

3.3.2 Transfer Function Distance

Considering the distance within the transfer function space itself is another alternative

to color distance. It requires that the input colors c and c′ each correspond to a single position within the transfer function, which can be accomplished with remapping (§3.2.4).

It is defined as:

T_{TFD}(c, c') = \begin{cases} \text{true} & |M(c) - M(c')| < D_{max} \\ \text{false} & \text{otherwise} \end{cases}    (3.2)

where |M(c) − M(c′)| is the Euclidean distance between M(c) and M(c′). M(c) and M(c′) are the transfer function positions of the colors c (ground truth) and c′ (degraded), respectively, within the transfer function. M(c) and M(c′) are not defined for colors that are not within the transfer function, but this does not matter because remapping is used to map colors to the set of colors in the transfer function. D_max is the error constraint value.

This constraint may be a good choice when color space distances within the transfer function are similar to a linear scaling of the transfer function space distances in the transfer function. Figure 3.5c exemplifies this kind of transfer function. This stands in contrast to

figure 3.5a, which is likely to be inappropriate for TFD because color distances in figure

3.5a are not linearly proportional to transfer function space distances.

3.3.3 Integrated Transfer Function Contrast

In some transfer functions, the distance between two values depends not only on the values themselves, but also on the path between the instances of the values within the transfer function. We call these transfer functions context-sensitive.

For example, consider the transfer function in figure 3.5b, where A, B, C, D, and E refer to specific samples within it. If the color space distance metric (CD – §3.3.1) were used, then |A − B| would be similar to the distance |A − C|. If this were the intent of the user then this may be acceptable. However, if the intent of the user was for A to be more dissimilar from C than B is from C, then CD is not an appropriate metric. Additionally, if the intent was for A to be more dissimilar from D than D is from E, then the transfer function distance (TFD – §3.3.2) is not appropriate. Integrated transfer function contrast (§3.3.3) can resolve both of these problems inherent in context-sensitive transfer functions, resulting in the distance |D − A| being greater than |E − D| and |A − C| being greater than |B − C|.

The ITFC error constraint is defined as follows:

v = M(c') - M(c)    (3.3)

T_{ITFC}(c, c') = \begin{cases} \text{true} & \int_{0}^{1} \nabla C(M(c) + uv) \cdot \frac{v}{|v|} \, du < D_{max} \\ \text{false} & \text{otherwise} \end{cases}    (3.4)

where ∇C(M(c) + uv) · v/|v| is the directional derivative of color, in terms of CIE 1976 [47] color differences, which is effectively contrast per unit of distance in transfer function space. M(c) is the mapping of colors to transfer function positions.
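The integral in equation (3.4) can be evaluated numerically by sampling the contrast along the straight path between M(c) and M(c′) in transfer function space. The sketch below uses a midpoint rule and assumes a caller-supplied contrast(position, direction) function returning the directional derivative of color (in CIE 1976 terms) at a transfer function position; the names and sample count are illustrative.

    import numpy as np

    def t_itfc(m_c, m_c_prime, contrast, d_max, num_samples=32):
        # ITFC error constraint (equations 3.3 and 3.4), midpoint-rule approximation.
        v = np.asarray(m_c_prime, dtype=float) - np.asarray(m_c, dtype=float)
        length = np.linalg.norm(v)
        if length == 0.0:
            return True  # identical transfer function positions: zero accumulated contrast
        direction = v / length
        u = (np.arange(num_samples) + 0.5) / num_samples  # midpoints of [0, 1]
        samples = [contrast(np.asarray(m_c, dtype=float) + ui * v, direction) for ui in u]
        integral = float(np.mean(samples))  # approximates the integral over u in [0, 1]
        return integral < d_max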

3.4 Results

Experiments were conducted to examine how the technique performs with different error constraints, primary codecs, eye separations, transfer functions, and datasets.

3.4.1 Data sets

Two datasets were used for experimentation: the combustion dataset used by Akiba et al. [2] and the plume dataset used by Akiba et al. [1]. The combustion dataset was rendered using shaded isosurfacing. The plume dataset was rendered using maximum value projection.

3.4.2 Lossy Codecs

Multiple lossy primary codecs were tried for encoding the color and depth primary streams (LCE, LDE). Two different versions of H.264 [106] were tried, one being the main profile, and one being the constrained baseline profile. The constrained baseline profile (BCP) is more commonly available on mobile devices, which would be a common client target platform for our system. Additionally, the BCP does not use B-frames, which introduce more frame latency [20]. MJPEG and MPEG-4 were both tried to check whether simpler primary codecs could still yield good performance.

The H.264 main profile outperformed all of the other codecs in all of the tests. This is reasonable, considering that it offers more features for motion estimation and bidirectionally-aware temporal coding than the other codecs. However, the substantially simpler H.264

BCP offered performance very similar (within 5%, in terms of bitrate) to the H.264 main profile, so it may be a better choice for many applications, especially when latency matters.

Thus, for our technique, we apply the H.264 BCP as the codec for the color and depth primary streams (LCE, LDE) for all subsequent comparisons.

3.4.3 Lossless Codecs

Lossless codecs are required for comparison for two primary reasons:

1. Lossy codecs will tend to cause a color shift in the results. While this color shift

may be acceptable for some error metrics, such as CD (§3.3.1,) it will not enable

computation of the TFD (§3.3.2) or ITFC (§3.3.3) error constraints because there is

not a clear mapping from colorspace to transfer function space, without a remapping

operator such as that used by our technique (§3.2.4.)

2. It is not, in general, possible to directly compare the output of the lossy codecs men-

tioned in §3.4.2 versus our system using the same lossy codecs because these lossy

codecs do not offer compatible error constraints and may not be able to adhere to the

error constraints even at the highest quality available. This can be observed in figure

3.4a for the H.264 codec, where the residual bitrate required to attain lossless or even

near-lossless performance does not converge to zero, even as the maximum possible

bitrate produced by the H.264 codec is approached.

We experimented with two lossless codecs: LJPEG [95] and FFV1 [75]. Both of these are prediction-correction based methods that apply a predictive pass, then encode the resid- ual between the prediction and the actual image.

LJPEG was chosen because it is a well known codec, and FFV1 was chosen because it is a widely available [108] codec that often outperforms LJPEG. In fact, we found that

FFV1 outperformed LJPEG by such a substantial margin in our test cases that we only present results of FFV1 versus our technique.

3.4.4 Compression Performance

Experiments were performed to examine the sensitivity of the compression perfor- mance to error constraints, eye separation, and alternate datasets. Four different techniques were compared: our reprojection technique, discrete coding of frames with a lossless codec, discrete coding of frames with a lossy codec combined with residual coding, and joint cod- ing of frames.

Discrete coding is simply the case where each eye is treated as an entirely separate video stream, with no coupling between the two eyes. Effectively this means transmitting the LCE, LRE, RCE, and RRE, but not the LDE. It is implemented as two monocular codecs running in parallel. Other techniques (such as those proposed by Özbek et al. [78] and Smolic et al. [103]) also use discrete coding, though they may use different bitrates for each eye.

The joint technique is similar to the discrete technique, except that the LCE is also used as the RCE. Basically, this means that only one primary stream, LCE, is sent over the network alongside the LRE and RRE, with the difference between the two eyes being encoded entirely into the RRE. Conceptually this is similar to many other techniques that have been used for stereo video transmissions, especially when depth information is unknown or difficult to compute. Some examples of these techniques are those proposed by Moellenhoff et al. [73], Jiang et al. [48], Yan et al. [126], and Yang et al. [128].

In all of our test cases, including lossless ones, our technique substantially outper- formed discrete lossless and discrete lossy video compression. It also outperformed simple joint coding except for very loose error constraints.

Sensitivity to Error Constraint Values

As the error constraint is loosened (increased), the compressibility increases. This is because the number of entries that can be decimated (as in §3.2.2) in the residual increases, yielding greater compressibility.

Figure 3.7a exhibits this. In this case the combustion dataset was rendered with an eye separation of 0.03, producing images like figure 3.6b, for different ITFC error constraint values.

It is reasonable that the technique would outperform the discrete lossless technique because, even for an error constraint of 0, there is still wasted information sent due to the fact that the colors in the image are constrained to those in the transfer function by the remapping operation (figure 3.5b). Additionally, with the reprojection technique, we avoid duplication of interocularly-coherent color information that occurs with all discrete techniques.

Similarly, it can be seen that the reprojection technique outperforms the joint coding technique until the error constraint becomes very loose. This is reasonable because the joint technique has to encode the disparity due to the difference in the camera projection between the left and right eye, while the reprojection technique only needs to encode corrections for regions where the depth is either inaccurate, or where occlusion resulted in a lack of samples. However, with very loose error constraints, there is sufficient freedom within the constraint to ignore much of the data in the residual due to the camera projection difference, thus there is little benefit to sending the depth information over the communication channel for reprojection in this case.

Sensitivity to Eye Separation

Intuitively, one can expect that the compression performance of a technique that reprojects the pixels from one eye's camera space to the other eye's camera space should decrease as the separation between the eyes increases, because the disparity between the two eyes will increase [48]. This effect is indeed seen in graph 3.7b. For the camera configuration of our test scenes, an eye separation of 0.05 is about the maximum that can be used without causing eye strain or completely preventing stereopsis. Even for these extreme eye separations the reprojection technique outperforms the joint and discrete codecs, and for less extreme eye separations its performance is further improved.

Experiments were also run for different datasets, using different rendering techniques, different transfer functions, and different error constraints. In all cases we found the tech- nique to continue to exhibit roughly the same compression performance behavior with respect to changes in error constraints.

This is expected because most datasets and transfer functions will permit our system to improve compression performance over lossless techniques by utilizing coherence be- tween the eyes and the color information limits imposed by the transfer functions using reprojection (§3.2.1) and remapping (§3.2.4.)

Monocular Viewing

Experiments were also performed to verify that the concept of transmitting residuals in addition to a primary codec stream was also useful for monocular streams. The results were approximately the same as the discrete techniques in figure 3.7a, though with one half the bitrate. This is because the frames for both eyes are very similar, and we only need to transmit the frame for one of the two eyes in the monocular case. Even though there

is no potential for taking advantage of coherence between eyes in the monocular case, there is still the potential for utilizing the flexibility permitted by error constraints, motion estimation, and color information limits imposed by the transfer functions.

3.5 Conclusion

We have proposed a video-based remote visualization solution enabling transmission of stereo video streams using efficient lossy codecs while adhering to user-defined error constraints. A novel video+depth coding algorithm is used to take advantage of coherence between eyes and a novel residual coding technique is used to enable the use of arbitrary lossy codecs for transmitting primary streams. The novel framework proposed enables integration with existing remoting solutions such as VNC and NVIDIA Monterey, as well as the utilization of multiple CPUs and GPUs. The system enables transmission of remote stereo visualization video streams at lower bitrates than would be possible with traditional lossless techniques, while providing support for visualization-specific error constraints.

(a) Framework

(b) Reprojecting encoder

(c) Reprojecting decoder

Figure 3.2: The framework decomposes the left and right frames into one depth stream, one color stream, and two residual streams in the encoder (§3.2), which are then reconstructed into left and right frames in the decoder. Because the depth stream generally takes much less space than the color stream, and the error introduced by reprojection is small, this yields better performance than encoding the streams separately. Additionally, transmission of partial residuals subject to user-defined error constraints enables fidelity guarantees for visualization applications.

(a) Original left (b) Original right

(c) Decimated left (d) Decimated right

Figure 3.3: The per-pixel color magnitudes of the residuals are shown for both eyes, before and after decimation subject to an error constraint, with darker colors meaning greater magnitude.

(a) LCE vs. LRE (left residual bits/pixel vs. left color H.264 bits/pixel)
(b) LDE vs. RRE (right residual bits/pixel vs. left depth H.264 bits/pixel)
(c) LCE vs. LCE+LRE (left total bits/pixel vs. left color H.264 bits/pixel)
(d) LDE vs. LDE+RRE (right total bits/pixel vs. left depth H.264 bits/pixel)

Figure 3.4: Increasing the LCE (left color encoded) bitrate decreases the LRE (left residual encoded) bitrate. Increasing the LDE (left depth encoded) bitrate decreases the RRE (right residual encoded) bitrate. The curves, from top to bottom, have ITFC error constraints of 0, 6, 12, 18, 24, and 32. More-restrictive constraints tend to require higher LCE and LDE bitrates for optimal performance.

(a) Nonlinear context-free (b) Nonlinear context-sensitive (c) Linear context-free

Figure 3.5: Different types of transfer functions are appropriate for different types of error constraints

(a) Left Depth (b) Left and right color

Figure 3.6: Stereo rendering of the combustion dataset (§3.4.1) using isosurfacing with a 2D transfer function (figure 3.5b)

(a) Effects of the error constraint on bitrate (bits per pixel vs. ITFC error constraint; curves: reproj (H.264), joint (H.264), discrete (H.264), discrete (FFV1))
(b) Effects of eye separation on bitrate (bits per pixel vs. eye separation; curves: reproj (H.264), joint (H.264), discrete (FFV1))

Figure 3.7: The benefit of using our reprojection technique or a joint coding technique over discrete coding techniques increases as the eye separation is reduced, as explained in §3.4.4. Additionally, the benefit of using the reprojection technique increases as the error constraints are loosened.

Chapter 4: Histogram Spectra for Multivariate Time-Varying Volume LOD Selection

Large, time-varying, multivariate volumes are commonly encountered in scientific vi-

sualization. As the available compute power for simulation has increased, the quantity of

data produced has increased commensurately. However, storage system throughput and

latency have not improved at the same rate. Analysis tools such as volume renderers that

seek to enable interactive visual analysis must scale to support interactivity on these larger

data sets.

The ability to interactively rotate, translate, and focus in on time-varying volume sim- ulation data can increase the understanding of the data. If the data is too large to fit into memory at its highest resolution, techniques must be applied to choose the subset, or level of detail, of the data that maximizes quality subject to the working set size constraints as determined by hardware resource availability.

Not all subsets of the value domain of the data are necessarily of the same importance, and it is often possible for users to make informed guesses at which subsets are important.

However, these informed guesses often need to be part of the interactive workflow. This means that level of detail selection must be done interactively.

For example, in the context of a weather simulation, the scientist may be interested in the vertical velocity of clouds. It is clear that, if we have a wind field defined over the entire

volume and the clouds do not cover the entire volume, only a portion of the wind field is

important. Similarly, a large portion of the volume variable that determines cloud density

will also be unimportant where it is below the threshold for clouds. Thus, we can refine

the level of detail selection based on intervals of interest, for both the cloud variable and

the velocity variable, to maximize information density within the data loaded. By focusing

only on the quality of a limited interval volume of the volume, we can attain higher quality

than if levels of detail were selected in an interval-agnostic manner.

The best level of detail for a multiresolution data set is the one that minimizes the error over the intervals of interest subject to a size constraint. This introduces two challenges:

• Selection of the level of detail: General binary integer programming, which can be used for LOD selection, is NP-hard. We need to offer a more efficient alternative.

• Error estimation for intervals of interest: For level of detail selection, given intervals of interest, it is necessary to estimate the error introduced by downsampling. If no metadata is stored, the entire data set must be loaded every time the error is to be estimated for a new interval of interest. We need to generate metadata to facilitate faster error estimation.

This work provides two core contributions, one addressing each of these challenges.

First, we introduce the novel concept of histogram spectra, which are used to estimate the statistical sensitivity of time-varying volumes to sampling. Histogram spectra are stored as metadata, enabling the estimation of error without having to access the volume data di- rectly. Secondly, we introduce an efficient level of detail selection algorithm utilizing the linear relationship between histogram spectra predicted error and RMS error. Our tech- nique enables fast, interactive LOD selection with reasonable preprocessing times and low implementation complexity.

This chapter is organized as follows. §4.1 discusses previous work as related to our work.

§4.2.1 through §4.2.4 introduce the concept of histogram spectra. Our solutions to the level of detail selection problem are discussed in §4.2.5 and §4.2.6. Considerations for applying the algorithm to multivariate data are discussed in §4.2.7. Finally, the results are examined in §4.3.

4.1 Related Work

The challenge of interactively visualizing large data, both in scientific visualization and in general graphics, has inspired much previous work. Common to many of the methods is the concept of multiresolution data availability. Two critical aspects in dealing with multiresolution data are designing a multiresolution representation, and deciding which portions of the multiresolution data to load in an application.

Many approaches to multiresolution volume representation have been explored. Wavelets are a widely used method, offering multiple levels of detail with little or no space overhead.

Westermann, et al. [120] developed a method for directly rendering wavelet transformed volume data. Wang, et al. [117] propose a method for rendering very large wavelet- transformed volume data and subsequently extended [116] the method to time varying volume data using a wavelet time-space partitioning tree. While wavelets can provide for efficient storage of large multiresolution volumes, they do have a disadvantage in that to access a given level of detail of a volume multiple levels of the wavelet hierarchy must be accessed.

An alternative to directly storing a multiresolution volume is to generate metadata that can be used to skip large portions of the high resolution volume that are not needed. For example, Gregorski, et al. [32] developed a method for preprocessing tetrahedral volumes

such that diamonds of min-max values are identified. This enables fast reconstruction of isosurfaces subject to a user-specified error tolerance without having to visit the entire volume. However, this approach does not directly apply to our problem because we are performing general level of detail selection rather than computing isosurfaces. Instead, we are using interval volumes to weight the importance of different portions of a volume for the purposes of error estimation in level of detail selection.

More similar to our technique are techniques that downsample the volume into a mul- tiresolution hierarchy, resulting in some data duplication, but at the same time enabling

flexible reconstruction with minimal computational overhead and fewer reads. Gao, et al.

[26] developed a distributed architecture for volume rendering of distributed data using multiresolution hierarchies while considering visibility to reduce data movement and ap- plying prefetching in a load on need context. LaMar, et al. [58] use a multiresolution texture hierarchy to accelerate volume rendering on graphics hardware.

Rendering a multiresolution hierarchy requires the selection of a subset of blocks (level of detail) from the multiresolution hierarchy that need to be rendered and/or loaded. Con- ceptually, this can be thought of as the construction of a cut through the multiresolution hierarchy. In our technique we explicitly generate multiresolution cuts (or level of detail selections) through a 4D multiresolution volume. Boada, et al. [5] also directly generate cuts through a multiresolution hierarchy in the context of volume visualization. Gyulassy, et al. [36] also directly generate cuts through a multiresolution hierarchy but combine it with view-dependent error calculations. However, there are substantial differences between these methods and ours. First, their multiresolution volumes are 3D rather than 4D, which has implications for the complexity of the algorithms and the severity of errors introduced

by the use of interpolation with lower levels of detail. Secondly, their cut construction algo-

rithms construct the cuts in a top-down manner from the lowest level of detail. Construction

of the cut from the lowest level of detail may sometimes be less computationally expensive

than our optimization-based approach. However, especially in data sets with wide ranges of

levels of detail like those we tested, it poses the potential of missing features at the higher

levels of detail if the lowest level of detail is sufficiently undersampled and other measures

are not taken.

Constructing the level of detail selection, regardless of whether it is constructed in an optimization-oriented or bottom-up manner, requires deciding when to decimate or refine the level of detail selection. Two general optimization approaches that can be taken are error-constrained and size-constrained. In the case of error-constrained approaches the ob- jective is to minimize the size of the working set subject to the error constraint. In the case of size-constrained approaches the optimization function may seek to maximize importance or minimize error subject to a working set size constraint. Size-constrained approaches are more appropriate when hardware limits on the maximum interactive working set size are important in interaction.

Error-constrained approaches have been widely applied in visualization. Wang, et al.

[116] consider both spatial and temporal error constraints in the rendering of wavelet data.

Danskin, et al. [15] consider image-space error constraints on rendering error in volume ray tracing. Gregorski, et al. [32] consider error tolerances in the extraction of time-varying isosurfaces. These techniques contrast with ours in that ours is size-constrained rather than error-constrained. However, it is important to consider that even though our technique is not error-constrained the error can still be quantified in the results.

Size-constrained (and by extension, load time-constrained) approaches have also been

widely applied in visualization, as well as general graphics applications. Saito [91] de-

veloped a time-constrained point rendering approach for previewing volumes and argues

that constant frame rates are beneficial for interaction. Shin, et al. [101] developed a

quadtree-based approach for fixed frame rate continuous LOD terrain visualization. Lind-

strom, et al. [61] applies a height field simplification algorithm to keep a constant frame

rate. Funkhouser, et al. [21] proposed an adaptive rendering algorithm that seeks to main-

tain a constant frame rate for virtual environment visualization. Certain, et al. [8] applies

wavelets within a time-constrained context to maintain constant framerates for multiresolu-

tion surface viewing. While all of these works are from different contexts, they all consider

consistency of frame rate and working set size to be important enough for interaction to

make it a constraint on their level of detail choices.

The concept of importance (expressed via interval selection using the weighting function, in our technique) can help facilitate higher quality by allowing some subsets of the data to have higher priority over other subsets of the data. The concept of importance sampling has been widely used in visualization and rendering. Danskin, et al. [15] and

Viola, et al. [112] both apply importance sampling in the context of volume rendering.

The technique described in chapter 2 considers user-specified isovalues when computing a

fixed distribution of work in the context of distributed-data isosurface computation, though that work does not consider multivariate data and does not pose the problem directly as an optimization problem.

While level of detail selection has been widely used and explored, we are not aware of a proposal for error prediction similar to the concept of histogram spectra, nor have we

found a similar greedy solution to the level of detail selection problem that can efficiently

utilize the histogram spectra for 4D, multivariate, multiresolution volumes.

4.2 Level of Detail Selection

Assume that a time-varying volume is divided into a set of 4D subvolumes (or “bricks”.)

Each subvolume is sampled into a set of levels of detail, each with a different sampling frequency. The goal of the level of detail selection algorithm is to select the level of detail for each subvolume that maximizes quality subject to a working set size constraint. The data flow of the level of detail selection process in our technique is exhibited in fig. 4.1.

Figure 4.1: The histogram spectra generator takes a multiresolution bricked volume and generates a histogram spectrum for each subvolume (“brick”) of the volume. This will be done as a precomputation step in the data preparation phase. The LOD selector then uses that, with a set of user-defined parameters such as intervals of interest, to produce a LOD selection set. The LOD selection can be performed interactively.

4.2.1 Histogram Spectra

Let f_a(x) be the probability density function (PDF) of a subvolume sampled at sampling frequency a. The histogram spectrum of the subvolume is then a mapping R^2 → R:

h(x,a) = |f_b(x) − f_a(x)|    (4.1)

(Plot axes: Value, horizontal; Sampling Frequency, vertical; color scale from low to high.)

Figure 4.2: This histogram spectrum of a single plane of a single timestep of the QVAPOR variable of the climate test data set (defined in §7.3) is typical of histogram spectra. Moving up on the vertical axis corresponds to downsampling, and each column corresponds to the potential change in the area of an isosurface as a function of sampling frequency. Columns with brighter colors in this plot correspond to values that are more sensitive to sampling. Rows with brighter colors correspond to sampling frequencies with greater overall, un- weighted, error.

where b is the sampling frequency of the highest level of detail of the subvolume and x is the value parameter to the PDF.

For a volume comprised of multiple subvolumes, the set of histogram spectra is com- prised of the histogram spectrum of each subvolume. Each subvolume is processed inde- pendently of every other subvolume.

Evaluating h(x,a) for a given x and a yields a value proportional to the absolute differ- ence between the surface area of an isosurface with value x in the subvolume sampled with sampling frequency a and the surface area of an isosurface with value x in the subvolume sampled with frequency b. The relationship between isosurface area and histograms has been examined in depth by Scheidegger, et al. [96] and Carr, et al. [6]. If no information has been lost in value x by sampling with a frequency a versus sampling with a frequency b then h(x,a) = 0.

4.2.2 Weighted Histogram Spectra

(Plot axes: Value, horizontal; Sampling Frequency, vertical; color scale from low to high.)

Figure 4.3: The weighting function is used to control the width of the interval volumes of interest in the context of the level of detail selection. In this example a weighting function was chosen to place importance on the interval of values from 0.0070 to 0.0105. The weighting function is applied over the columns of the histogram spectrum, facilitating the computation of histogram spectrum predicted error as in equation (4.4).

A weighting function, w(x), is defined as an R → R mapping from the volume value domain to weights. Conceptually w(x) should reflect the important interval volumes (inter- vals of interest) for the current visualization task, having a higher value within the interval volumes than outside the interval volumes.

The weighted histogram spectrum, defined for a subvolume as in equation (4.1), is then:

h_w(x,a) = w(x) h(x,a)    (4.2)

Evaluating h_w(x,a) for a given x and a yields a value proportional to the weighting function w(x) and the difference in isosurface surface areas as in equation (4.1). This is significant because it enables the estimation of the error in intervals of interest from the histogram spectrum using equation (4.4).

Typically, when w(x) is defined directly by the user, it will be of the form:

w(x) = \begin{cases} 1 & x \in Y \\ 0 & \text{otherwise} \end{cases}    (4.3)

where Y is a user-defined set of important values. For example, choosing Y = {0.3} would mean that error is only considered to be important if it affects an isosurface with isovalue

0.3. However, it is not required that w(x) be in this form and indeed for multivariate data it can be useful for it to be in a different form, as can be seen in §4.2.7.

4.2.3 Predicted Error Using Histogram Spectra

The error of a scalar subvolume at a given sampling frequency a can be estimated using the histogram spectra via

E(a) = \int_{-\infty}^{+\infty} h_w(x,a) \, dx    (4.4)

Effectively this sums the difference in surface area for every isosurface in the subvolume, weighted by the user-specified weighting function w(x).
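With the discretization described in §4.2.4, the integral in equation (4.4) becomes a weighted sum over the histogram bins; a minimal sketch is shown below, with the interval [0.0070, 0.0105] from figure 4.3 used purely as an illustrative weighting, and the names being placeholders.

    import numpy as np

    def predicted_error(spectrum_row, weights, bin_width):
        # E(a) ~= sum over bins of w(x) * h(x, a) * dx for one row (one sampling
        # frequency) of a discretized, weighted histogram spectrum (equations 4.2, 4.4).
        return float(np.sum(np.asarray(weights) * np.asarray(spectrum_row)) * bin_width)

    # Illustrative weighting: 128 bins over [0, 0.014], interval of interest [0.0070, 0.0105].
    edges = np.linspace(0.0, 0.014, 129)
    centers = 0.5 * (edges[:-1] + edges[1:])
    w = ((centers >= 0.0070) & (centers <= 0.0105)).astype(float)
    # error_estimate = predicted_error(spectrum[lod_row], w, edges[1] - edges[0])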

The RMS error of a subvolume is proportional to E(a). A linear relationship was

observed, as in fig. 4.4, on our test cases for 0.1b < a ≤ b – when the sampling frequency is greater than about one tenth the ground truth sampling frequency. In a 4D volume, one tenth the ground truth sampling frequency would equate to roughly a 10,000x reduction in size.

This is data-dependent, but does demonstrate that histogram spectra can be used to predict

RMS error resulting from a range of downsampling operations on real-world data sets.

Most importantly for the purposes of the greedy algorithm discussed in §4.2.6, it means that a ratio between two RMS errors is the same as the ratio between the corresponding two histogram spectrum predicted errors.

(Plot axes: Histogram Spectrum Predicted Error, horizontal; RMS Error, vertical.)

Figure 4.4: The RMS error is proportional to the histogram spectrum predicted error. This fig. exhibits a test case on the QVAPOR variable of the climate data set (defined in §7.3), and is typical of what we have observed on other data sets. The exact scaling factor to determine the RMS error depends on the units of the data in the field and the norm of the weighting function. However, this does not need to be computed because only the relative differences between errors need to be used in the algorithm discussed in §4.2.6. Because the RMS error is linearly proportional to the histogram spectrum predicted error, the ratio between two RMS errors is the same as the ratio between their corresponding histogram spectrum predicted errors.

4.2.4 Discretization of Histogram Spectra

In practice, for multiresolution data, only a finite number of levels of detail can be considered. Similarly, the resolution required for discrete forms of the probability density functions used in the histogram spectra is also limited.

In our implementation we store a uniformly sampled histogram spectrum for each sub- volume as a 2D array of floats. The histogram resolution (the number of columns) deter- mines the narrowest interval volume that can be considered for level of detail selection.

Too few columns will reduce the effectiveness of the algorithm, while too many will waste space. It should be chosen to reflect the minimum width of intervals that the user is likely

to be interested in for level of detail selection. Future extensions may consider alternative sampling strategies for the histograms, if we can find an application that requires them.

When there are M levels of detail, M frequencies (rows) are stored for the histogram spectra. However, if the highest sampling frequency of a level of detail is the same as the ground truth sampling frequency, it is not necessary to store the rows of the histogram spectra corresponding to that level because they can be assumed to be zero.

The resulting floating point data can be compressed losslessly with floating point image compression techniques, but in our test cases the space consumed by the histogram spectra was not found to be large enough to warrant this.

4.2.5 Integer Programming Formulations for LOD Selection

The goal of the level of detail selection problem is to compute the level of detail index

L_i for every subvolume i of N subvolumes such that the error is minimized and the size of the subvolumes to be loaded for the level of detail is below a threshold, S_max. This can be structured as a nonlinear integer programming problem

\arg\min_{L} \sum_{i=1}^{N} E_i(a_{L_i})    (4.5)

with the constraints

\sum_{i=1}^{N} S_{i,L_i} \le S_{max}; \quad 1 \le L_i \le M; \quad L_i \in \mathbb{Z}    (4.6)

where a_k is the sampling frequency for level of detail index k, S_{i,j} is the load size for LOD j of block i, and M is the number of levels of detail. This optimization problem is nonlinear because the optimization arguments are used as arguments to the nonlinear E_i(a) function within the objective function.

An alternate, equivalent, binary, linear integer programming formulation can be con-

structed by recognizing that there are a finite number of levels of detail:

\arg\min_{H} \sum_{i=1}^{N} \sum_{j=1}^{M} E_i(a_j) H_{i,j}    (4.7)

with the constraints

\sum_{i=1}^{N} \sum_{j=1}^{M} H_{i,j} S_{i,j} \le S_{max}    (4.8)

\sum_{j=1}^{M} H_{i,j} = 1; \quad 1 \le i \le N; \ i \in \mathbb{Z}    (4.9)

It follows from equation (4.9) that the solution to the binary, linear integer programming

problem is related to the nonlinear integer programming problem by:

H_{i,j} = \begin{cases} 1 & L_i = j \\ 0 & \text{otherwise} \end{cases}    (4.10)

This is linear because the only potentially nonlinear part of the objective function, Ei(a),

is now dependent on a set of constants, the possible level of detail sampling frequencies,

rather than the argument being optimized as in equation (4.5).

With equation (4.10) it can be seen that the constraint equations (4.8) and (4.6) as well

as the objective functions (4.5) and (4.7) are respectively equivalent.

General linear programming packages such as the GNU Linear Programming Kit (GLPK)

can be applied to solve the binary integer programming problem described in equation

(4.7). However, general binary integer programming is NP-hard. Binary integer program-

ming packages, such as that offered by GLPK, often integrate acceleration strategies to

more efficiently solve special cases of general binary integer programming problems. How-

ever, we found (as in the results in §4.3.2) that these strategies were generally insufficient

for attaining reasonable running times for interactive level of detail selection.
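For reference, the binary formulation in equations (4.7)-(4.9) maps directly onto a generic mixed-integer solver interface. The sketch below uses SciPy's milp function (available in SciPy 1.9 and later) as one such solver; the function name, dense constraint construction, and 1-based LOD indexing are illustrative, and this direct approach would not scale to the problem sizes discussed above.

    import numpy as np
    from scipy.optimize import Bounds, LinearConstraint, milp

    def solve_lod_ilp(E, S, s_max):
        # E[i, j]: predicted error of LOD j for subvolume i; S[i, j]: its load size.
        # Unknowns are the binary H_{i,j} of equation (4.7), flattened row-major.
        n, m = E.shape
        c = E.ravel()                                    # objective: sum E_ij * H_ij
        size_row = S.ravel()[np.newaxis, :]              # sum S_ij * H_ij <= S_max
        one_hot = np.zeros((n, n * m))
        for i in range(n):                               # sum_j H_ij = 1 for each subvolume
            one_hot[i, i * m:(i + 1) * m] = 1.0
        constraints = [LinearConstraint(size_row, -np.inf, s_max),
                       LinearConstraint(one_hot, 1.0, 1.0)]
        result = milp(c=c, constraints=constraints,
                      integrality=np.ones(n * m), bounds=Bounds(0, 1))
        return result.x.reshape(n, m).argmax(axis=1) + 1  # 1-based LOD index per subvolume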

Instead, we propose a greedy algorithm for solving the nonlinear integer programming

problem described in equation (4.5) that yields approximate solutions very close to the

optimal solution, with considerably lower computational complexity. We still present the

binary integer programming form because it provides a way to easily apply existing integer

programming optimization packages such as GLPK to the problem for performance testing,

and it provides for extensibility.

4.2.6 Greedy Algorithm for Nonlinear Integer Programming Formulation

Because the number of subvolumes multiplied by the number of levels of detail, N, in a data set may be very large, a greedy approximation to the integer programming problem is more practical. For example, with 1 MiB subvolumes (which equates to roughly 23^4 samples per univariate 4D block with 4 bytes per variable) and 8 levels of detail, a 1 TiB volume would have approximately 8 million unknowns in equation (4.7) and 1 million in equation

(4.5). Even if a nonlinear but polynomial time direct solution was possible for the integer programming problem, the performance would still be insufficient for performing LOD selection during the interactive portion of the workflow.

Our approach is to consider the set of potential levels of detail for all subvolumes, then apply them to the subvolumes in order of increasing error density until the size constraint is satisfied. We propose a three step greedy algorithm for accomplishing this:

1. Estimate the result error for every LOD for every subvolume, as described in §4.2.6. This requires O(N) time using histogram spectra.

2. Compute the error density values and sort the potential LOD assignments by them, as described in §4.2.6. This requires O(N lg N) time.

3. Assign the best LOD to every block using the sorted list, as described in §4.2.6. This requires O(N) time.

Error estimation

Every possible level of detail selection for a subvolume has an associated estimate of the sampling error that would be present due to the choice of the level of detail. Whenever the intervals are changed via the weighting function, equation (4.3), the error estimate will need to be recomputed. If RMS error is computed directly, instead of using the precom- puted histogram spectra with equation (4.4), the entire volume will need to be revisited to compute the error, which is impractical within an interactive workflow. However, if the histogram spectra are used to estimate error using equation (4.4) then only the histogram spectra need to be visited. The histogram spectra are much smaller than the entire vol- ume and have already been computed during either the data preparation or data generation phases.

Sorting and the heuristic

A list is constructed containing an entry for every potential LOD assignment, for every subvolume. Each entry has a heuristic value (error density), an LOD index, and a sub- volume index. The heuristic value used for subvolume i with LOD j, Ai, j, is defined as follows:

A_{i,j} = \frac{E_i(a_j)}{S_{i,j}}    (4.11)

where a_j is the sampling frequency of LOD j, and S_{i,j} is the size in bytes loaded for LOD j of subvolume i. E_i(a_j) is equation (4.4) evaluated for subvolume i or, alternatively, the directly computed RMS error, which was used for the performance tests in §4.3.2.

This list is then sorted in ascending order of the heuristic, A_{i,j}. This results in a list of

potential LOD assignments sorted by ascending error density.
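A sketch of this construction and sort is shown below; it assumes per-subvolume predicted errors and per-LOD load sizes have already been computed, and the entry layout and names are illustrative rather than taken from our implementation.

    def build_sorted_heuristic_list(predicted_errors, lod_sizes):
        # predicted_errors[i][j]: E_i(a_j) for subvolume i, LOD index j (0-based).
        # lod_sizes[j]: bytes loaded for LOD j (assumed uniform across subvolumes here).
        # Returns entries sorted by ascending error density A_ij = E_i(a_j) / S_ij
        # (equation 4.11).
        entries = []
        for block, errors in enumerate(predicted_errors):
            for lod, error in enumerate(errors):
                entries.append({"block": block,
                                "lod": lod,
                                "density": error / lod_sizes[lod]})
        entries.sort(key=lambda entry: entry["density"])
        return entries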

LOD assignment

Conceptually the goal is to choose levels of detail that minimize error density, subject

to a size constraint, Smax. The following algorithm is applied to assign levels of detail using

the sorted list:

Listing 4.1: LOD assignment algorithm

    L         := (list produced by sorting, in ascending order of the heuristic)
    B         := (LOD assignment for each subvolume, initialized to LOD 1)
    N_subvols := (number of subvolumes)
    S_max     := (maximum working set size)
    S_total   := N_subvols * getSubVolSizeForLOD(1)
    for (i in 1..L.length) AND S_total > S_max
        # Apply this entry only if it is coarser than the block's current assignment
        # (a higher LOD index means lower detail); this decimation step is
        # reconstructed from the description in this section.
        if B[L[i].block] < L[i].lod
            S_total := S_total - getSubVolSizeForLOD(B[L[i].block])
                               + getSubVolSizeForLOD(L[i].lod)
            B[L[i].block] := L[i].lod

When the solution is feasible, we have found this algorithm produces results close to the

optimal solution produced by directly applying general binary integer programming algo-

rithms, as can be seen in fig. 4.5. When the Smax constraint is too low for a feasible solution,

it gracefully results in the lowest detail level being specified for all blocks.

4.2.7 Multivariate Considerations

Multiple variables with different units are commonly used simultaneously in visual- izations. For example, in a weather simulation we may be interested in volume rendering clouds in the context of a water vapor field. In this case, this implies that we want high levels of detail where lower levels of detail would introduce too much error in either the

(Plot: RMS error of result vs. working set sample count; curves: Greedy, Direct.)

Figure 4.5: Directly solving the integer programming problem with a general integer programming package is impractical due to the high computational complexity involved in solving the NP-hard problem. Our greedy algorithm, as described in §4.2.6, yields nearly identical results in O(N lg N) time, where N is linearly proportional to the number of subvolumes.

water vapor field, or the cloud field. These variables that are used to guide the selection of levels of detail are called guiding variables.

Optimization

Histogram spectra are generated for each guiding variable. Separate weighting func- tions are applied for each variable to produce weighted histogram spectra for each, enabling the estimation of error for each guiding variable independently using equation (4.4). Ef- fectively the nonlinear integer programming problem in equations (4.5) and (4.6) can be extended to include C variables:

\arg\min_{L} \sum_{k=1}^{C} \sum_{i=1}^{N_k} E_{k,i}(a_{k,L_{k,i}})    (4.12)

with the constraints

\sum_{k=1}^{C} \sum_{i=1}^{N_k} S_{k,i,L_{k,i}} \le S_{max}    (4.13)

1 \le L_{k,i} \le M; \quad L_{k,i} \in \mathbb{Z}

where a_{k,j} is the sampling frequency for level of detail index j of variable k, S_{k,i,j} is the load size of LOD j of block i of variable k, N_k is the number of subvolumes in variable k of the volume, M is the number of levels of detail, and C is the number of variables. The binary linear integer programming formulation in equations (4.7), (4.8), and (4.9) can be similarly extended:

\arg\min_{H} \sum_{k=1}^{C} \sum_{i=1}^{N_k} \sum_{j=1}^{M} E_{k,i}(a_{k,j}) H_{k,i,j}    (4.14)

with the constraints

\sum_{k=1}^{C} \sum_{i=1}^{N_k} \sum_{j=1}^{M} H_{k,i,j} S_{k,i,j} \le S_{max}    (4.15)

\sum_{j=1}^{M} H_{k,i,j} = 1; \quad 1 \le i \le N_k; \ i \in \mathbb{Z}; \quad 1 \le k \le C; \ k \in \mathbb{Z}

The optimization solutions presented in §4.2.6 apply identically to this multivariate case as they do to the univariate case. The multivariate forms’ objective functions are defined as the sum of multiple univariate objective functions. Similarly, the multivariate forms’ constraints are the logical conjunction of multiple univariate constraint sets. When C is 1

the multivariate optimization equations reduce to the univariate optimization equations.

Conditional importance

Sometimes there are variables that are only important where the guiding variables are within a particular interval. We refer to these variables as following variables. For example, consider the case where a user wants to see the vertical velocity of clouds in the context of a volume rendering where the cloud density defines the opacity and the cloud color is defined by the vertical velocity. In this case the guiding variable is the cloud density and the vertical velocity is the following variable.

From the standpoint of optimization, the guiding variables and following variables are treated identically. However, the weighting functions for following variables should take into account conditioning by the guiding variables. In many circumstances, such as in the above cloud velocity example, the probability distribution of the following variable within the interval of interest of the guiding variable is different from the probability distribution of the following variable over the entire volume. This has been observed in our test data sets, as can be seen in fig. 4.7.

While it is not absolutely required, the weighting function for following variables should be chosen to assign increased weight where the conditional probability density of the following variable is high. This will reduce the relative importance, with respect to level of detail selection, assigned to values that fall outside of the guiding variable interval volumes of interest. We have found that this works effectively, as can be seen in the climate data test case in the results.
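One simple way to construct such a weighting function is to estimate the conditional PDF of the following variable from the samples where the guiding variable lies in its interval of interest, then normalize it to [0, 1]; the sketch below does this with a histogram estimate, and the variable names (following the cloud example) are illustrative.

    import numpy as np

    def conditional_weighting(following, guiding, interval, bins=128, value_range=None):
        # Weight for a following variable (e.g., W) proportional to its conditional
        # PDF given that the guiding variable (e.g., QCLOUD) lies in interval = (lo, hi).
        following = np.asarray(following)
        guiding = np.asarray(guiding)
        lo, hi = interval
        selected = following[(guiding >= lo) & (guiding <= hi)]
        if value_range is None:
            value_range = (float(following.min()), float(following.max()))
        pdf, edges = np.histogram(selected, bins=bins, range=value_range, density=True)
        weights = pdf / pdf.max() if pdf.max() > 0 else pdf
        centers = 0.5 * (edges[:-1] + edges[1:])
        return centers, weights  # bin centers and weights in [0, 1]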

4.3 Results

Fundamentally, the goal of our technique is to permit interactive selection of levels of detail on data sets much larger than can be fit in-core or loaded interactively. This enables users to interactively select levels of detail that focus on the intervals of interest within time-varying volumes, which offer increased quality for a given sample size constraint. We performed experiments to look at three aspects of this: running times, visual quality, and statistical quality.

4.3.1 Test data sets

Two data sets were used for experiments, one being from a climate simulation at Pacific

Northwest National Laboratory and the other being from a turbulent combustion simulation at Sandia National Laboratory. Both were time-varying volume data sets.

Climate

The climate data set is a set of multivariate volume timestep snapshots from a long- term weather simulation of the region around Indonesia in the context of climate change research. The 4D multiresolution data set used for the purposes of the experiments was

117GiB with 8 levels of detail, sampled on a time-varying geopotentially-defined curvilinear mesh with 41 timesteps broken into 18,944 4D subvolumes. The data set contained variables for geopotentially-defined elevation (MESHZ), cloud density (QCLOUD), water vapor density (QVAPOR), and vertical velocity (W).

The MESHZ variable was comprised of 1,184 4D subvolumes with 8 levels of detail ranging from 5.8MiB per subvolume to 64 bytes per subvolume. The QCLOUD, QVA-

POR, and W variables were each comprised of 5,920 4D subvolumes with 8 levels of detail ranging from 1.3MiB per subvolume to 64 bytes per subvolume. The total size of the his- togram spectra, discretized into 128 histogram bins and 7 frequencies per subvolume, for the entire data set, was 67MiB, 0.056% of the size of the multiresolution volume. Addi- tional static 2D variables included for the purposes of producing renderings were the land elevation, vegetation fraction, and surface normals.

Combustion

The combustion data set is a set of volume timestep snapshots from a simulation of the injection of fuel into two countercurrent air streams in which combustion occurs. A

single variable, the mixing fraction (referred to as MIXFRAC), was used for the purposes

of testing. The 4D multiresolution data set, defined on a regular grid, was 69GiB with 8

levels of detail and 121 timesteps broken into 17,010 subvolumes. Levels of detail ranged

from 1.1MiB per block to 64 bytes per block. The total size of the histogram spectra,

discretized into 128 histogram bins and 7 frequencies per subvolume, for the entire data set

was 60MiB, 0.085% of the size of the multiresolution volume.

4.3.2 Running time comparisons

The two major bottlenecks to interactive LOD selection for varying intervals of interest are the load time for error estimation, and the computation time for solving the optimization problem to minimize error for a given size constraint.

The test platform was a Linux PC with an Intel Core 2 6600 dual core CPU, 4GiB of main memory, and a hard drive with the IBM JFS filesystem capable of approximately

30MiB/s with 5-10ms latency for data storage.

Error estimation

For the error estimation aspect, we compare using the histogram spectra versus di- rectly estimating the RMS error from the data. Estimating the error with the histogram spectra predicted error (HSPE) only requires loading the substantially smaller discretized histogram spectrum for each subvolume. Estimating the error directly with RMS error

(RMSE) requires loading the entire data set every time the LOD changes.

In the following tests LOD selection was performed for a size constraint and a target interval on the climate QVAPOR and combustion MIXFRAC variables, in conjunction with our greedy algorithm for the LOD selection. The size constraint choice and target interval choice do not affect the timing results.

  Heuristic    Data set   LOD Selection Time
  RMSE/size    QVAPOR     1782.0s
  HSPE/size    QVAPOR     0.1s
  RMSE/size    MIXFRAC    3900.3s
  HSPE/size    MIXFRAC    0.2s

Using HSPE to compute the error in the heuristic, A_{i,j}, clearly outperforms using RMSE on both data sets. This is because using HSPE only depends on the histogram spectra, which have already been precomputed in the non-interactive portion of the workflow. In contrast, RMSE requires reading the entire volume data set every time the list described in

§4.2.6 is constructed. Figure 4.8 shows the typical relationship between the error perfor-

mance of the HSPE and RMSE error estimators.

Optimization

For the optimization aspect, we compare our greedy approximation to a direct integer programming approach. While general binary integer programming is an NP-hard problem, packages like GLPK apply some techniques to improve performance. Further information about the techniques GLPK applies can be found in the GLPK source code.

For GLPK with the QVAPOR data set, binary linear integer programming was per- formed with 47,360 variables. The greedy algorithm required only 5,920 entries to be sorted because only one is needed per block, rather than one per block per LOD. For the

MIXFRAC data set, binary linear integer programming was performed with 136,080 vari- ables, and the greedy algorithm required only 17,010 entries to be sorted.

The running time for GLPK is sensitive to the target interval of interest, while the run- ning time for the greedy algorithm is unaffected by the target interval of interest. This is because the techniques GLPK can apply depend on the coefficients in the linear program- ming problem, while the greedy approximation simply sorts by the heuristic then assigns the levels of detail. In all cases the GLPK performance was slower.

  Solver   Data set   Solving time
  GLPK     QVAPOR     7625.6s
  Greedy   QVAPOR     0.1s
  GLPK     MIXFRAC    142.1s
  Greedy   MIXFRAC    0.2s

The greedy algorithm substantially outperforms GLPK. The greedy algorithm is much faster because its computational complexity is O(N lg N), as discussed

in §4.2.6. This stands in contrast to the general integer programming methods applied by

GLPK which, for nontrivial inputs, are much worse than polynomial time and exponential in the worst case. The result of the greedy algorithm was also found to be consistently very close to that of GLPK, as can be seen in figure 4.5.

Histogram Spectra Computation

Computation of the discrete histogram spectrum for a single subvolume requires the computation of a histogram for different sampling frequencies for the subvolume. Each subvolume can be processed independently, yielding an embarrassingly parallel streaming algorithm that can be easily implemented on GPUs, multi-core, and/or multi-node plat- forms. Because of this, the computation of histogram spectra is likely to be read-bound, rather than compute-bound, on most system configurations. However, placement within the visualization workflow will determine the true cost of this operation.

If the histogram spectra computation is done in-situ, during the data generation phase, no additional reads are required because the histogram spectra can be computed as the data is written out to disk. If the histogram spectra computation is done as a separate pass during the data preparation phase then the volume needs to be streamed in from disk storage once, in its entirety. It is likely that software engineering considerations specific to each application will dictate which approach is appropriate. In either case, the computation process scales linearly with the number of data samples.
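Because each brick is independent, the precomputation is a straightforward parallel map over the subvolumes; the sketch below uses Python's multiprocessing as one possible realization, with compute_spectrum standing in for a user-supplied routine (for example, one that loads a brick and applies the histogram_spectrum sketch above).

    from multiprocessing import Pool

    def precompute_spectra(brick_ids, compute_spectrum, workers=8):
        # Map an independent per-brick histogram spectrum computation across
        # worker processes; compute_spectrum must be a picklable, top-level
        # function taking a brick identifier and returning its spectrum.
        with Pool(workers) as pool:
            spectra = pool.map(compute_spectrum, brick_ids)
        return dict(zip(brick_ids, spectra))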

4.3.3 Visual and statistical comparisons

Both from a statistical and visual standpoint, choosing narrow, salient intervals to focus on for error reduction yields improved quality. The univariate case was tested using the combustion data set, and the multivariate case was tested using the climate data set. Narrow interval widths are the minimal interval widths needed to cover the non-transparent portions of the color-opacity transfer functions used for producing the figures, while wide interval widths cover the entire value domain.

Combustion

Level of detail selection operations were performed for different interval widths and centers on the MIXFRAC variable of the combustion data set. Figure 4.10 exhibits the typ- ical dependence observed of the RMS error on the width of the interval volume of interest as defined by a user via the weighting function. For a fixed working set size constraint, increasing the width tends to result in increased error. This is reasonable, because a larger interval volume will encompass more samples yet the information density is likely to re- main similar.

The implications of this increased error can be observed in figure 4.9. Artifacts are typ- ical of block-wise downsampling, with a smoothing effect on the data and discontinuities at block (or subvolume) boundaries. Animations of the time series exhibit the improvement of the narrow interval over the wide interval more dramatically than the images. Choosing narrower intervals clearly yields images closer to the ground truth.

Climate

Similarly to the combustion data set, level of detail selection operations were performed for different interval widths. However, multiple variables were considered simultaneously.

The QCLOUD, QVAPOR, and MESHZ were guiding variables while the W was a following variable, as described in §4.2.7. Using the guidelines in §4.2.7, the weighting function

for the following variable, W, was conditioned by QCLOUD using the conditional PDF in

figure 4.7.

Figure 4.6 exhibits the results. Like the combustion data set in figure 4.9, the quality was higher for narrower intervals. The figures produced using narrower intervals of interest are more similar to the ground truth than those produced using wider intervals of interest.

4.4 Conclusion

We have introduced the concept of histogram spectra as a new approach for efficiently estimating error due to downsampling for interval volumes of time-varying, multivariate, multiresolution volumes. A new optimization approach for level of detail selection was then introduced taking advantage of the linear relationship between the histogram spectra predicted error and RMS error. Both the optimization approach and the histogram spectra are easy to implement in software, increasing the practical applicability of our approach.

These contributions enable interactive level of detail selection on large, multivariate, multiresolution volumes for user-specified intervals of interest. By enabling the interactive selection of intervals of interest for the purposes of level of detail selection, increased visual and statistical quality can be obtained.

(a) Ground truth (b) Ground truth, zoom

(c) Narrow intervals, zoom (d) Wide intervals, zoom

Figure 4.6: Several variables from the climate data set are rendered for a single timestep. The white, opaque parts are clouds defined by the QCLOUD variable. The magenta regions are clouds with high vertical velocities, as determined by the W variable. The yellow exhibits water vapor density as determined by the QVAPOR variable. The volume is a curvilinear volume, with the Z variable of its mesh determined by the MESHZ variable. All of these variables have their levels of detail determined by the level of detail algorithm. Figures 4.6a and 4.6b are generated from the ground truth resolution, while figures 4.6c and 4.6d have levels of detail selected for a 4GiB working set size constraint. Figure 4.6c was generated with narrow intervals of interest, while fig. 4.6d was generated with wide intervals of interest. Like in fig. 4.9, selecting narrow intervals of interest yields results closer to the ground truth than selecting wide intervals of interest.


Figure 4.7: In some cases, with multivariate fields, a user is interested in seeing a variable A where variable B is between B0 and B1. This interval [B0 : B1] is expressed as a weighting function for the histogram spectra of B. The choice of the best weighting function for A depends on the statistical dependence between A and B. If A is not independent of B, then we can use the conditional probability density function of A given that B lies within [B0 : B1] as a starting point for constructing a weighting function for A. In this example, it can be seen that the PDF of the vertical velocity (W) in the climate data set is different for different intervals of the cloud density (QCLOUD).
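As an illustration of this conditioning step, the sketch below estimates the conditional PDF of a following variable given that a guiding variable lies in a user-selected interval, and rescales it for use as a weighting function. The synthetic stand-ins for W and QCLOUD and the simple histogram estimator are assumptions for illustration only, not the dissertation's implementation.

```python
import numpy as np

def conditional_weighting(a, b, b_lo, b_hi, bins=64):
    """Estimate p(A | B in [b_lo, b_hi]) from co-located samples of A and B, and rescale
    its peak to 1 so it can serve as a weighting function over the value range of A."""
    mask = (b >= b_lo) & (b <= b_hi)          # samples where the guiding variable is in range
    hist, edges = np.histogram(a[mask], bins=bins, density=True)
    peak = hist.max()
    return edges, (hist / peak if peak > 0 else hist)

# Synthetic stand-ins for W (following) and QCLOUD (guiding).
rng = np.random.default_rng(0)
qcloud = rng.random(100000)
w = rng.normal(loc=qcloud, scale=0.2)          # W loosely dependent on QCLOUD
edges, weights = conditional_weighting(w, qcloud, 0.8, 1.0)
```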

(Figure 4.8 plot: RMS error of result versus working set sample count, for the HSPE and RMSE error estimators.)

Figure 4.8: This figure shows the error for different working set size constraints, using different error estimators in the LOD selection algorithm. The E_j function in the optimization problem as referenced by equation (4.5) can be approximated using equation (4.4) instead of directly computing the RMS error (RMSE). The prediction of error using the histogram spectra predicted error (HSPE) yields results close to the direct RMS error. By using equation (4.4) with histogram spectra it is possible to avoid loading samples from the source volume when performing LOD selection, substantially improving performance.

(a) Ground truth. (b) Narrow intervals. (c) Wide intervals.

Figure 4.9: Values of MIXFRAC from the combustion data set within the range [0.45:0.55] are rendered for a single timestep, where values less than 0.5 are blue and those greater than or equal to 0.5 are orange. Figure 4.9a is a crop of an image generated using the ground truth resolution, while figures 4.9b and 4.9c have levels of detail selected for a 250MiB working set size constraint. Figure 4.9b has a weighting function that is 1 for values in the range [0.45:0.55] and 0 elsewhere. Figure 4.9c has a weighting function that is uniformly 1. The narrower interval of interest used for figure 4.9b clearly yields a result closer to the ground truth than the wide interval of interest that was used for figure 4.9c.

(Figure 4.10 plot: RMS error versus interval width.)

Figure 4.10: For a fixed working set size constraint, increasing the width of the range of values defining the interval volumes of interest results in increased error. This figure, which was generated using the QVAPOR variable of the climate test data set defined in §7.3, is typical of what we have observed. This is to be expected because a larger interval volume will encompass more samples yet the information density is likely to remain similar. Thus, the narrower the interval volume of interest, the fewer samples are needed to reconstruct the volume with a given level of error.

Chapter 5: Efficient Rendering of Extrudable Curvilinear Volumes

Ray casting through curvilinear adaptive mesh refinement volumes requires a large number of transformations between computational and physical space. While this is trivial for rectilinear computational spaces, curvilinear computational spaces present additional challenges. By exploiting the characteristics of a particular class of curvilinear spaces, we enable volume rendering at interactive frame rates with minimal preprocessing and memory overhead using commodity graphics hardware.

The core contribution of our technique is its representation of a curvilinear space as an extrusion of a profile surface along a curve, permitting memory- and time-efficient transformations between physical space and computational space. Our technique renders the data in blocks, where each block is a curvilinear grid of cells (for example, a block may be 64x64x64 cells). Blocks can have different sampling resolutions, and information is passed to the renderer to permit block hierarchies. The borders of each block are rendered to the framebuffer using triangles with vertices specified in computational space. A vertex program is utilized to transform the vertices from computational space to physical space such that, for a given computational space position in any given level, the physical space position produced is consistent. Ray casting is performed within the fragment program for every fragment rendered, stepping the ray simultaneously through both computational space and physical space. The step size is found by computing intersections with cell boundaries in computational space and then applying a minimum step length constraint. The physical space step vector is trivially determined by the direction from the camera origin to the fragment position, and the computational space step vector is found by applying a Jacobian matrix, which is easy to derive using our specialized representation, to the physical space step vector.

Several constraints were considered in the design of this method. Memory efficiency is of great importance to performance, both because of limited memory capacity and limited memory bandwidth. Additionally, any representation of computational space must be easily traversable without a significant loss in accuracy. Finally, the inexpensive computational power available on GPUs today must be exploited, despite its limitations, to be competitive with other techniques.

As data sets from simulations become larger, memory efficiency of rendering techniques becomes more important. Additionally, as computational power becomes more densely packaged in devices such as GPUs, the disparity between computation and memory performance becomes greater, necessitating the use of techniques that facilitate efficient cache utilization. Given these constraints, direct rendering of curvilinear data makes more practical sense than resampling the data into rectilinear meshes or decomposing it into tetrahedral meshes.

A potential application of curvilinear adaptive mesh refinement volume rendering is exhibited in section 5.2. Section 5.3 describes the proposed specialized computational space representation. Section 5.4 then provides details about the rendering process and volume data structure, and the ray casting algorithm is described in section 5.4.1. Finally, results are examined in section 5.5.

Figure 5.1: Sample volume renderings of data set 1. The left column shows two views of one data component. The right column shows two different AMR level ranges for a different component, with the top image showing levels 0 through 1, the bottom image showing just level 1.

5.1 Related Work

Many methods exist for volume rendering of curvilinear data. Many of them can be adapted to support adaptive mesh refinement data sets, and some can be easily implemented such that they take advantage of GPU capabilities. Four potential methods are resampling to rectilinear space, decomposition of curvilinear data into unstructured tetrahedral data, direct cell projection, and direct ray casting. A good survey of techniques for volume rendering is provided in [69].

Techniques for volume rendering of rectilinear data are the most well-developed and tend to be the most straightforward due to the simplicity of the mesh. [105] presents a GPU ray casting implementation for rectilinear data. [113] proposes GPU data structures for efficient volume rendering of rectilinear AMR data. [57] presents acceleration structures for supporting empty space skipping and early ray termination for GPU rendering of rectilinear data. Resampling of the curvilinear data into a rectilinear mesh offers the obvious advantage of enabling these well-developed techniques at the cost of introducing extra sampling error, increasing memory consumption, and reducing potential performance due to memory bandwidth requirements. Additionally, the level of preprocessing required may be unacceptable for large time-varying datasets.

Decomposition of the curvilinear grid into an unstructured tetrahedral mesh is straightforward and enables the usage of the large body of work on unstructured tetrahedral mesh rendering. [131] proposes a point-based approach for rendering unstructured meshes. [29] and [102] propose rendering techniques for tetrahedral elements. [86] presents a technique for rendering unstructured grids using graphics hardware-assisted incremental slicing. However, in this process of decomposition, potentially useful data for accelerating rendering may be lost and, if the decomposition is done as a preprocessing step, excessive memory consumption may result. Additionally, unstructured tetrahedral meshes introduce additional challenges for the evaluation of depth-order-dependent transfer functions.

Curvilinear direct cell projection offers the potential of avoiding preprocessing and utilizing the curvilinear structure of the data to accelerate sorting and rendering. [28] presents a technique utilizing hardware-accelerated polygon rendering and supporting depth-order-dependent transfer functions. However, direct cell projection may require a significant amount of non-localized overdraw which, when implemented on modern GPUs, can reduce performance due to memory bandwidth limitations. Additionally, implementation of direct cell projection requires a significant amount of vertex data to be manipulated, further increasing the required memory bandwidth and vertex processing requirements for rendering.

Figure 5.2: Sample renderings from data set 2. In clockwise direction from the top left corner are AMR levels 0 through 4, 2 through 4, 3 through 4, and 4.

Rendering of AMR data has been well explored in the context of rectilinear data sets. [119] presents a technique for rendering AMR data with cell projection using stitch meshes between the levels. [50] converts a sparse rectilinear data set into an AMR hierarchy which is then rendered using texture based rendering. Curvilinear AMR presents additional challenges in that the cells in general have non-planar faces and blocks of data might not be convex.

Curvilinear direct ray casting offers the same potential as curvilinear direct cell projection for reduced preprocessing and data loss, while being more adaptable for implementation on modern GPUs. [122] presents a ray casting technique for direct curvilinear volume rendering and compares it to resampling to a rectilinear mesh. [42] and [41] present additional methods utilizing ray casting. [9] proposes using textures for a transformation from physical space to computational space. Our technique provides a compact representation of the mesh for transformations from physical space to computational space as well as from computational space to physical space while greatly reducing the memory required for mesh specification. Sorting, for single block non-AMR data sets, is implied, and image quality can be smoothly changed to permit a user-driven compromise between speed and quality.

5.2 Applications

The primary target application for the curvilinear volume rendering technique presented in this chapter is analysis of magnetic confinement fusion data. Thus, in developing this technique, consultations were made with a domain expert in that field, Ravi Samtaney of the Princeton Plasma Physics Laboratory. The remainder of the text of this applications section is his description of the application, and why volume rendering is important to it.

Therefore, it is quoted in italics:

ITER (“The Way” in Latin), a joint international research and development project that aims to demonstrate the scientific and technical feasibility of fusion power, is now under construction at Cadarache, France. Refueling of ITER is a practical necessity due to the burning plasma nature of the experiment, and longer pulse durations (100 - 1000 seconds). An experimentally proven method of refueling tokamaks is by pellet injection. Pellet injection is currently seen as the most likely refueling technique for ITER. Thus it is imperative that pellet injection phenomena be understood via simulations before very expensive experiments are undertaken in ITER. The emphasis of the present work is to understand the large-scale macroscopic processes involved in the redistribution of mass into a tokamak during pellet injection. In particular, it was experimentally established that high-field-side (HFS, or inside) pellet launches are more effective than low-field-side (LFS, or outside) pellet launches. Arguably, such large scale processes are best understood using magnetohydrodynamics (MHD) as the mathematical model.

There is a large disparity between the pellet size and device size. Naive estimates indicate that the number of space-time points required to resolve the region around the pellet for simulation of ITER-size parameters can exceed 10^19. The large range of spatial scales and the need to resolve the region around the pellet is somewhat mitigated by the use of adaptive mesh refinement (AMR). Our approach is to employ block structured hierarchical meshes using the Chombo library for AMR developed by the APDEC SciDAC Center at LBNL.

We use data from simulations performed with an adaptive upwind conservative mesh MHD code in generalized curvilinear coordinates. A critical component is the modeling of the highly anisotropic energy transfer from the background hot plasma to the pellet ablation cloud via long mean-free-path electrons along magnetic field lines. Further details on the approach can be found in [92].

A primary scientific question is establishing the MHD mechanisms responsible for the differences in HFS and LFS pellet launches. Visualizations of the density field help identify the extent of the migration of the ablated pellet mass along the magnetic field lines and, more importantly, the transport across magnetic flux surfaces in the direction of increasing major radius. In particular, volume rendering of the density field is an effective method to visualize the global mass distribution in the tokamak during pellet injection.

5.3 Computational Space Representation

Our technique supports rendering of volumes that can be represented via an extrusion of a planar profile surface along a curve to create the physical space grid as a function of computational space. For example, a torus can be represented as a planar radially sampled circle extruded around another circle. A cylinder can be represented by a radially sampled circle extruded along a line. The tokamak shape used in the MHD simulation data presented in section 5.2 is another potential application.

5.3.1 Data and spaces

The input data is an AMR hierarchy of blocks. Each block consists of a 3D array of dimensionality N_i × N_j × N_k. The function d_phypos(i, j, k) maps a given i, j, k computational space point within a block to a physical space point. Additionally, each i, j, k computational space point has one or more scalar or vector field data components that may be rendered.

Physical space is defined as the world space through which linear camera rays are cast to produce images for the user. For curvilinear data sets, a linear component in physical space will generally correspond to a curved component in computational space. Thus, for ray casting to be performed for a camera in physical space, transformations between computational and physical space are required for positions and vectors.

5.3.2 Positional transformations

Equation 5.1 transforms a point from computational space to physical space. This transformation is needed to derive the Jacobian, as well as to compute the distance between a given point in physical space and a point in physical space corresponding to a given point in computational space.

\bar{q}(i,j,k) = \bar{p}(k) + \bar{s}(i,j)_u\,[\hat{n}(k) \times \hat{y}] + \bar{s}(i,j)_v\,[\hat{y}] \quad (5.1)

\bar{s}(i,j) = \begin{bmatrix} [\hat{n}(k) \times \hat{y}] \cdot [d_{\mathrm{phypos}}(i,j,k) - \bar{p}(k)] \\ \hat{y} \cdot [d_{\mathrm{phypos}}(i,j,k) - \bar{p}(k)] \end{bmatrix} \quad (5.2)

where

q̄(i, j, k) is the physical space 3D position of a point i, j, k in computational space.

p̄(k) is a physical space position as a function of k in computational space that is the origin of each slice plane of the extruded volume. Effectively this defines an extrusion path.

n̂(k) is a physical space unit vector as a function of k in computational space that represents the normal to each “slice” of the extruded volume.

s̄(i, j) is the planar profile surface, defined relative to n̂(k) and ŷ as in equation 5.2.

ŷ is a physical space unit vector against which all slice normals n̂(k) must be orthogonal.

Effectively, equation 5.1 is a more compact representation of d_phypos(i, j, k) that takes advantage of the characteristics of d_phypos(i, j, k) that permit it to be represented as an extrusion of a profile surface along a curve rather than being represented as a simple 3D array.
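As a concrete illustration, the sketch below evaluates equation 5.1 for sampled p̄(k), n̂(k), and s̄(i, j) arrays on the CPU. The array-based sampling and the cylinder example are assumptions for illustration; the renderer itself evaluates this transformation in GLSL vertex and fragment programs as described in section 5.4.4.

```python
import numpy as np

def comp_to_phys(i, j, k, p_bar, n_hat, s_bar, y_hat):
    """Evaluate equation 5.1: map computational-space indices (i, j, k) to physical space.

    p_bar : (Nk, 3) sampled extrusion path p(k)
    n_hat : (Nk, 3) sampled slice normals n(k), each orthogonal to y_hat
    s_bar : (Ni, Nj, 2) sampled profile surface s(i, j), (u, v) components
    y_hat : (3,) fixed physical-space unit vector
    """
    u, v = s_bar[i, j]
    return p_bar[k] + u * np.cross(n_hat[k], y_hat) + v * y_hat

# Example: a cylinder, i.e. a radially sampled circle extruded along the z-axis.
Ni, Nj, Nk = 8, 16, 32
radius, angle = np.meshgrid(np.linspace(0.1, 1.0, Ni),
                            np.linspace(0.0, 2.0 * np.pi, Nj, endpoint=False),
                            indexing="ij")
s_bar = np.stack([radius * np.cos(angle), radius * np.sin(angle)], axis=-1)
p_bar = np.stack([np.zeros(Nk), np.zeros(Nk), np.linspace(0.0, 4.0, Nk)], axis=1)
n_hat = np.tile([0.0, 0.0, 1.0], (Nk, 1))
y_hat = np.array([0.0, 1.0, 0.0])
q = comp_to_phys(3, 5, 10, p_bar, n_hat, s_bar, y_hat)   # physical-space position
```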

5.3.3 Computation of p̄(k), n̂(k), s̄(i, j), and ŷ

The p̄(k), n̂(k), and s̄(i, j) functions can be either derived from a given mesh or user-specified separately. The p̄(k) and n̂(k) functions are represented as one-dimensional sampled data, while s̄(i, j) is represented as two-dimensional sampled data. If it is necessary to derive these functions from a given mesh, the following process is used (a code sketch of the first two steps appears below):

1. Find p̄(k) as in equation 5.3. This is the mean of all given sample positions in physical space in slice k. It does not matter if the point is in the exact center of the slice, as it is just a reference point that must lie in the plane of the slice.

2. Find n̂(k) via numerical methods. This is the normal to the planar slice. Random triplets of sample points are chosen in each slice, and two vectors (sharing one of the three points as a common origin) are formed for each triplet. A cross product is applied on those vectors, and the sign of the resulting vector is adjusted such that it points forward from the slice (using a simple forward or backward difference reference vector). For each slice, many of these triplets are found and added into a per-slice accumulator vector until the change in the direction of the accumulator vector falls below a threshold. The resulting per-slice accumulator vectors are normalized to produce accurate slice normals.

3. Pick ŷ. All n̂(k) must be orthogonal to ŷ. In our test data, ŷ was simply the y-axis. If ŷ is not known initially, it can be found in a manner similar to n̂(k), but using the n̂(k) values instead of slice sample positions.

4. Find s̄(i, j). Any k slice within the volume can be picked to form this function, and in practice k = 0 is used. A planar coordinate system defined by p̄(k), ŷ, and n̂(k) is defined at each slice, and the sample points are projected onto the axes of that coordinate system to find s̄(i, j) positions, as in equation 5.2.

\bar{p}(k) = \frac{1}{N_i N_j} \sum_{i=0}^{N_i} \sum_{j=0}^{N_j} d_{\mathrm{phypos}}(i, j, k) \quad (5.3)

Note that while p̄(k) does not need to be planar, n̂(k) · ŷ must be zero. Thus, highly irregular p̄(k) functions may not yield useful results because the profile surface can only be rotated about the ŷ axis. Additionally, the curvature of the extrusion must be small enough relative to the size of the slices such that no two slices intersect. The algorithm could be easily extended to support varying ŷ as a function of k, which would permit more than one degree of rotational freedom for slices, but many data sets will not require this and it increases the expense of Jacobian computation.
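A minimal sketch of steps 1 and 2 of the derivation above, assuming the mesh is given as a dense (Ni, Nj, Nk, 3) array of physical space positions; the fixed randomized triplet count is a simplification of the convergence test described in step 2.

```python
import numpy as np

def derive_path_and_normals(phypos, n_triplets=64, rng=None):
    """Derive p_bar(k) and n_hat(k) from a (Ni, Nj, Nk, 3) array of physical positions."""
    rng = np.random.default_rng() if rng is None else rng
    Ni, Nj, Nk, _ = phypos.shape
    p_bar = phypos.reshape(Ni * Nj, Nk, 3).mean(axis=0)       # step 1: per-slice mean position
    n_hat = np.zeros((Nk, 3))
    for k in range(Nk):
        slice_pts = phypos[:, :, k, :].reshape(-1, 3)
        ref = p_bar[min(k + 1, Nk - 1)] - p_bar[max(k - 1, 0)]  # forward/backward difference reference
        acc = np.zeros(3)
        for _ in range(n_triplets):                            # step 2: accumulate triplet normals
            a, b, c = slice_pts[rng.choice(len(slice_pts), 3, replace=False)]
            n = np.cross(b - a, c - a)
            if np.dot(n, ref) < 0.0:                           # flip so the normal points forward
                n = -n
            acc += n
        n_hat[k] = acc / np.linalg.norm(acc)
    return p_bar, n_hat
```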

While the computational to physical space transformation is very straightforward to implement with this specialized representation, the physical space to computational space transformation is not. An analytical inverse to the above function is often impractical, and producing a 3D sampled volume in physical space mapping the positions to computational space would impart excessive memory requirements while potentially introducing inconsistencies at the edges of the valid data. However, [9] did implement the physical space to computational space transformation using 3D textures.

The proposed rendering algorithm provides a high-accuracy starting computational space position for a ray, while requiring only incremental updates to the computational space ray position as the physical space ray position is stepped during ray casting. To avoid the accumulation of error in an incrementally updated computational space ray position, small corrections are required at each step. Equation 5.1 can be applied to find a distance between a given physical space point and a physical space point corresponding to a given computational space point, which can then be used as a convergence test for a gradient descent algorithm, obviating the need for a full physical space to computational space positional transformation.

Figure 5.3: Data set 2 volume block bounding wireframes. Each vertex corresponds to a grid-centered position on the boundary. The wireframes demonstrate the curvature and non-uniform cell sizes of the curvilinear space. Level 0 has 8 distinct blocks, level 1 has 24 distinct blocks.

5.3.4 Jacobian matrices

As a ray is stepped through physical space, a corresponding ray must be stepped through computational space. Because computational space is curvilinear, a straight ray in physical space will, in general, correspond to a curved ray in computational space. While gradient descent could be applied with the computational space to physical space transformation to compute corresponding computational space points, our representation of computational space positions permits the easy computation of Jacobian matrices to transform physical space vectors to computational space vectors, which can be used to directly transform steps.

Though we need the inverse Jacobian matrix J⁻¹ to transform a vector from physical space to computational space, that matrix is hard to directly compute given our representation of computational space. However, it is easy to compute J and then invert that 3x3 matrix. Equations 5.4, 5.5, and 5.6 form the i, j, and k columns (respectively) of the Jacobian matrix J, as in equation 5.8. The i and j columns effectively deal with changes within a single slice, whereas the k column deals with changes between multiple slices. Because ŷ is constant in equation 5.6, equation 5.6 can be simplified to 5.7.

J_{pci}(i,j,k) = [\hat{n}(k) \times \hat{y}]\,\frac{\partial \bar{s}(i,j)_u}{\partial i} + \hat{y}\,\frac{\partial \bar{s}(i,j)_v}{\partial i} \quad (5.4)

J_{pcj}(i,j,k) = [\hat{n}(k) \times \hat{y}]\,\frac{\partial \bar{s}(i,j)_u}{\partial j} + \hat{y}\,\frac{\partial \bar{s}(i,j)_v}{\partial j} \quad (5.5)

J_{pck}(i,j,k) = \frac{\partial \bar{p}(k)}{\partial k} + \bar{s}(i,j)_u\,\frac{\partial [\hat{n}(k) \times \hat{y}]}{\partial k} + \bar{s}(i,j)_v\,\frac{\partial \hat{y}}{\partial k} \quad (5.6)

J_{pck}(i,j,k) = \frac{\partial \bar{p}(k)}{\partial k} + \bar{s}(i,j)_u\,\frac{\partial [\hat{n}(k) \times \hat{y}]}{\partial k} \quad (5.7)

J(i,j,k) = \begin{bmatrix} J_{pci} & J_{pcj} & J_{pck} \end{bmatrix} \quad (5.8)

J^{-1}(i,j,k) = J(i,j,k)^{-1} \quad (5.9)

Because the matrix is a full 3x3 matrix, a general determinant-based matrix inversion is used to find J⁻¹. Note that the input data should have well-formed positions, with no two sample points in computational space lying at the same physical space position, to guarantee that this matrix is always invertible.
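A sketch of the Jacobian assembly under the same array-based sampling assumptions as the earlier sketch; finite differences stand in here for the precomputed derivative textures described in section 5.4.4.

```python
import numpy as np

def jacobian(i, j, k, p_bar, n_hat, s_bar, y_hat):
    """Assemble J(i, j, k) column by column (equations 5.4, 5.5, 5.7, 5.8) and invert it
    (equation 5.9). Finite differences approximate the partial derivatives."""
    n_cross_y = np.cross(n_hat, y_hat)                 # (Nk, 3) in-slice axes
    ds_di = np.gradient(s_bar, axis=0)[i, j]           # (d s_u/di, d s_v/di)
    ds_dj = np.gradient(s_bar, axis=1)[i, j]           # (d s_u/dj, d s_v/dj)
    dp_dk = np.gradient(p_bar, axis=0)[k]              # d p / dk
    dncy_dk = np.gradient(n_cross_y, axis=0)[k]        # d [n x y] / dk
    u = s_bar[i, j, 0]
    J_i = n_cross_y[k] * ds_di[0] + y_hat * ds_di[1]   # equation 5.4
    J_j = n_cross_y[k] * ds_dj[0] + y_hat * ds_dj[1]   # equation 5.5
    J_k = dp_dk + u * dncy_dk                          # equation 5.7 (y_hat is constant)
    J = np.column_stack([J_i, J_j, J_k])               # equation 5.8
    return J, np.linalg.inv(J)                         # equation 5.9
```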

5.3.5 AMR integration

Adaptive mesh refinement is supported by defining, for a given level of detail, axis-aligned cuboid regions in computational space that are said to be owned by a lower level. This enables the ray caster to easily determine whether a particular sample cell should be accumulated or not. Because consistency is required in the positional transformations between levels, the p̄(k), n̂(k), and s̄(i, j) functions must be defined in a way such that their domain contains the union of all levels of detail. Because the resolution needs to be uniform for each of the functions, p̄(k) and n̂(k) will need to be specified at the lowest level of detail in the k direction, and s̄(i, j) needs to be defined at the highest available level of detail.
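The ownership test this implies is simple; a sketch assuming each block carries a list of child regions given as (min, max) corner pairs in computational space:

```python
def owned_by_child(pos, child_regions):
    """True if a computational-space position lies inside any child-owned cuboid, in which
    case the sample at this level should not be accumulated."""
    return any(all(lo <= p <= hi for p, lo, hi in zip(pos, rmin, rmax))
               for rmin, rmax in child_regions)

# Example: a single child region covering the middle of a block.
print(owned_by_child((10.0, 10.0, 10.0), [((8, 8, 8), (16, 16, 16))]))   # True
```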

5.4 Rendering

The data is stored as a hierarchy of blocks, as can be seen in figures 5.4 and 5.3. Each block is an axis-aligned cuboid in computational space with a 3D uniform grid of sample points. In physical space, the blocks will tend to be curved as shown in the figures. Associated with each block is a set of child blocks, which define their bounds in computational space. Each one of these blocks can have the currently selected data component of interest stored into a 3D texture, with single-component voxels.

Rendering is performed recursively on a per-block basis. The faces constituting the borders of the block are rendered in computational space via two triangles for each grid-centered boundary cell, using a vertex program to evaluate the computational space to physical space transformation for the vertices, and a fragment program to perform the ray casting. Back face culling eliminates the rendering of faces not facing the viewer on a per-triangle rather than per-block-side basis, because the blocks may have significant curvature in physical space. The physical space and approximate computational space positions are passed to the fragment program, providing starting points for the ray to be evaluated for every fragment. Additionally, the bounds of child blocks are passed to the fragment program as well to permit AMR rendering. The depth buffer is used for tracking evaluated field values for maximum intensity projection.

Figure 5.4: Data set 1 volume block bounding wireframes. Each vertex corresponds to a grid-centered position on the boundary. The left column shows AMR levels 0 and 1, while the right column shows AMR level 1. The wireframes demonstrate the curvature and non-uniform cell sizes of the curvilinear space. Level 0 has 8 distinct blocks, level 1 has 24 distinct blocks.

With this technique, only minimal preprocessing is required. The following are some use cases and the required processing for each:

Initial load: The textures for the p̄(k), n̂(k), s̄(i, j), ∂p̄(k)/∂k, ∂n̂(k)/∂k, ∂s̄(i, j)/∂i, and ∂s̄(i, j)/∂j functions need to be created. If these functions are not already specified as part of the source data, they will require one full iteration through all data points at the lowest level of detail, and one full iteration through a single slice of the data points at the highest level of detail. Additionally, the texture that contains the field data for the currently selected component needs to be created.

Different data component selected: The single-component 3D texture that contains the field data for each block needs to be reloaded with the new component.

Data modified, positions left intact: The single-component 3D texture that contains the field data for each block needs to be reloaded with the new component.

Camera/view change: No re-preprocessing needs to be performed – the scene can simply be re-rendered.

Changing mesh: If the volume data positions are changed in a way that cannot be represented with a simple affine transformation, but the volume data itself is not, the p̄(k), n̂(k), s̄(i, j), ∂p̄(k)/∂k, ∂n̂(k)/∂k, ∂s̄(i, j)/∂i, and ∂s̄(i, j)/∂j functions need to be rebuilt. However, the 3D volume textures storing the field data do not need to be modified if that field data is not modified.

The majority of time during rendering is spent executing the fragment programs that perform ray casting. This creates potential for effective image-space parallelization.

Figure 5.5: Volume renderings for different minimum step lengths. Each row, from left to right, shows step lengths 0.001, 0.005, 0.010, 0.050, and 0.100. The top row shows data set 2 and the bottom row shows data set 1. A larger minimum step length decreases required computational time while increasing error.

5.4.1 Ray Casting

Ray casting is performed for every fragment generated by the border triangle rasterization. Each of those fragments has associated physical space and approximate computational space positions, interpolated between the vertices by the rasterizer, that form the starting point for a ray within a given block. Each fragment program execution performs ray casting for a single ray through a block, as outlined in the following steps (a CPU reference sketch appears after the list).

1. Compute the block-local computational space position.

   \bar{p}_{loccom} = \frac{\bar{p}_{com} - \bar{p}_{blkmin}}{\bar{p}_{blkmax} - \bar{p}_{blkmin}}

2. Compute the global unscaled computational space step using the Jacobian.

   \bar{v}_{comstep} = J^{-1}\,\bar{v}_{phystep}

3. Compute the block-local computational space unscaled step.

   \bar{v}_{loccomstep} = \frac{\bar{v}_{comstep}}{\bar{p}_{blkmax} - \bar{p}_{blkmin}}

4. Compute the scaled computational space and physical space steps (section 5.4.3).

5. Check whether the computational space position of the ray lies within a child volume. If it does not, sample the field texture and accumulate the sampled value using a maximum intensity projection rule. A simple list of volumes that are axis-aligned cuboids in computational space, each defined by a minimum and maximum point, defines the child volumes of a given block.

6. Increment the computational space and physical space position vectors by the scaled steps.

7. Compute the block-local computational space position.

8. Apply the correction loop (section 5.4.2) to the computational space position.

9. Check if that position is within the bounds of the block. If it is not, terminate the ray loop and write the resulting color value and maximum field value to the fragment program color and depth results, respectively.

10. Go to step 2.
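For clarity, a minimal CPU reference of this per-ray loop is sketched below. The helper callables (inverse Jacobian evaluation, position correction, field sampling) and the block record format are assumptions standing in for fragment-program state, and the scale placeholder marks where the step length of section 5.4.3 would be applied.

```python
import numpy as np

def cast_ray(p_phys, p_com, v_phystep, block, jacobian_inv, correct, sample_field,
             max_steps=512):
    """CPU reference of the per-fragment ray loop (steps 2-10 above), accumulating a
    maximum intensity projection through a single block.

    jacobian_inv(p_com) -> 3x3 inverse Jacobian at a computational space position
    correct(p_com, p_phys) -> corrected computational space position (section 5.4.2)
    sample_field(p_loc) -> scalar sample at a block-local computational position
    block: dict with 'cmin', 'cmax' (computational bounds) and 'children' (child cuboids)
    """
    max_value = -np.inf
    for _ in range(max_steps):
        v_comstep = jacobian_inv(p_com) @ v_phystep                        # step 2
        p_loc = (p_com - block["cmin"]) / (block["cmax"] - block["cmin"])  # steps 1 and 3
        scale = 1.0                                                        # step 4: section 5.4.3
        in_child = any(np.all(p_com >= lo) and np.all(p_com <= hi)
                       for lo, hi in block["children"])
        if not in_child:                                                   # step 5: MIP accumulate
            max_value = max(max_value, sample_field(p_loc))
        p_com = p_com + scale * v_comstep                                  # step 6
        p_phys = p_phys + scale * v_phystep
        p_com = correct(p_com, p_phys)                                     # steps 7-8
        if np.any(p_com < block["cmin"]) or np.any(p_com > block["cmax"]):
            break                                                          # step 9: left the block
    return max_value
```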

5.4.2 Correction loop

Because the current computational space position was computed using a linear approximation (the Jacobian) of the step in computational space for the step in physical space, an error is inherent in it. The correction loop corrects this error by iteratively transforming the reverse error vectors in physical space to computational space then accumulating them with the current computational space position. This is very similar to applying gradient descent for minimization of the distance between a given physical space point and a physical space point that corresponds to a varying computational space position, except the gradient is not being computed numerically in computational space, reducing the number of texture fetches required.

1. Transform the current computational space position (p̄_com) into a physical space position (p̄_truephy, the current “true” physical space position), using equation 5.1.

2. Compute the physical space correction vector using the current physical space position and the current “true” physical space position.

   \bar{v}_{phycor} = \bar{p}_{truephy} - \bar{p}_{phy}

3. If the magnitude of v̄_phycor is below a specified tolerance, break from the correction loop.

4. Transform the physical space correction vector into a computational space correction vector using the Jacobian.

   \bar{v}_{comcor} = J^{-1}\,\bar{v}_{phycor}

5. Accumulate the computational space correction vector v̄_comcor with the current computational space position p̄_com. A scaling factor (γ) is applied to the correction vector to improve the rate of convergence.

To improve performance for previewing of volumes, a hard iteration limit can be applied to this correction loop. This will have some implications in accuracy, but for the purposes of previewing may be acceptable.
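A sketch of this loop, assuming callables for equation 5.1 and for the inverse Jacobian; the tolerance, the damping factor standing in for γ, and the hard iteration limit are illustrative values.

```python
import numpy as np

def correct_position(p_com, p_phys, to_phys, jacobian_inv,
                     tol=1e-5, gamma=0.8, max_iters=8):
    """Correction loop of section 5.4.2: nudge a computational-space position until its
    physical-space image (equation 5.1) matches p_phys.

    to_phys(p_com) -> physical-space position via equation 5.1
    jacobian_inv(p_com) -> 3x3 inverse Jacobian at p_com
    """
    for _ in range(max_iters):                      # hard iteration limit (previewing)
        v_phycor = p_phys - to_phys(p_com)          # steps 1-2: reverse error in physical space
        if np.linalg.norm(v_phycor) < tol:          # step 3: converged
            break
        v_comcor = jacobian_inv(p_com) @ v_phycor   # step 4: map the correction
        p_com = p_com + gamma * v_comcor            # step 5: damped accumulation
    return p_com
```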

5.4.3 Step length determination

While a uniform step length can be used successfully, variable step lengths can provide for greater performance. With curvilinear data, cells may vary in size greatly, so the proper step size through the data should also vary. Our technique computes the approximate intersection points in computational space with the borders of a given cell to find the necessary step to the next cell. Performing the intersection in computational space greatly simplifies the operation, because the cells are unit cubes in computational space rather than six-faced curved volumes in physical space.

1. Compute the cell ceiling and floor for that step using the following equations. These are needed to find the intersection with neighboring cells.

   \bar{p}_{comcellceil} = \begin{bmatrix} \lceil N_i\,\bar{p}_{loccom_i} \rceil \\ \lceil N_j\,\bar{p}_{loccom_j} \rceil \\ \lceil N_k\,\bar{p}_{loccom_k} \rceil \end{bmatrix}
   \qquad
   \bar{p}_{comcellfloor} = \begin{bmatrix} \lfloor N_i\,\bar{p}_{loccom_i} \rfloor \\ \lfloor N_j\,\bar{p}_{loccom_j} \rfloor \\ \lfloor N_k\,\bar{p}_{loccom_k} \rfloor \end{bmatrix}

   These two functions, p̄_comcellfloor and p̄_comcellceil, define the bounds of the cell in computational space that contains a given computational space position p̄_loccom.

2. Compute the intersection with the neighboring cells in computational space to find the proper step size, using the computational space step vector, p̄_comcellceil, and p̄_comcellfloor. This is done by computing the intersection of the v̄_comstep vector with each face plane, then using the intersection with the lowest positive intersection parameter. Some special cases must be handled with the intersections. If a ray is traveling parallel to or even within a given face, intersections should not be computed against that face. Also, if a ray intersection with a face would result in an intersection parameter of zero, the ray should not be intersected with that face. Additionally, a scaling factor greater than but close to 1 needs to be applied to the resulting intersection parameter to reduce the likelihood that an intersection point will lie exactly on a face.

3. Apply the minimum step length constraint to the intersection parameter. If the intersection parameter is less than the minimum step length, then it is set to the minimum step length. This is to permit user configurability of quality. Larger minimum step lengths result in larger steps.

4. Using that intersection parameter, scale the computational space and physical space step vectors (v̄_comstep and v̄_phystep) by the resulting intersection parameter. This will result in the ray being stepped out of the current cell into the next cell, as determined by the intersection and minimum step length constraint.

In practice, it was found that the linear approximation to a ray within a given cell implied by computing the intersection using a linear component in computational space did not introduce noticeable error. An extension to this step length determination method could use a maximum step length that is a function of the curvature of the space in that cell to reduce the amount of error. Additionally, because the step length is chosen based on cell boundary intersections, only a test for whether the origin of a given ray step is within a child volume is required to support adaptive mesh refinement.
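A sketch of this cell-boundary step computation in block-local computational space; the face-epsilon nudge and the default minimum step length are illustrative values rather than ones taken from the dissertation.

```python
import numpy as np

def step_parameter(p_loc, v_loc, resolution, min_step=0.005, eps=1e-4):
    """Scale factor that advances the ray from p_loc to the next cell boundary
    (section 5.4.3). p_loc and v_loc are block-local; resolution is (Ni, Nj, Nk)."""
    n = np.asarray(resolution, dtype=float)
    boundaries = (np.floor(n * p_loc) / n, np.ceil(n * p_loc) / n)   # cell floor and ceiling
    t_best = np.inf
    for bound in boundaries:
        with np.errstate(divide="ignore", invalid="ignore"):
            t = (bound - p_loc) / v_loc              # parameter of each face-plane intersection
        t = t[np.isfinite(t) & (t > 0.0)]            # skip parallel faces and zero-parameter hits
        if t.size:
            t_best = min(t_best, t.min())
    if np.isfinite(t_best):
        t_best *= 1.0 + eps                          # nudge past the face to avoid landing on it
    else:
        t_best = min_step
    return max(t_best, min_step)                     # minimum step length constraint
```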

5.4.4 GPU implementation

Block boundaries are rasterized with OpenGL using GLSL vertex and fragment programs. The depth buffer is used for compositing the different blocks with maximum intensity projection. Equations 5.1, 5.8, and 5.9 are implemented within fragment and vertex programs. In total, 6 small textures are used to represent the parameters for defining the transformations between computational and physical space, and the volume samples for each block are stored in a 3D volume texture. Each block has an associated list of child block bounding regions in computational space which is passed to the vertex and fragment programs.

The positional functions p̄(k), n̂(k), and s̄(i, j) can each be defined by a texture. p̄(k) is a one-dimensional texture with resolution N_k and 3 components per texel. n̂(k) is a one-dimensional texture with resolution N_k and 3 components per texel. It is possible to reduce n̂(k) to two components per texel given that n̂(k) is a unit vector, but on current graphics hardware this would not yield any performance improvement because the memory requirements wouldn't be significantly changed, yet additional computation would be required to renormalize the values. s̄(i, j) is a two-dimensional texture with resolution N_i x N_j, with two components per texel.

While the derivatives for the Jacobian matrices can be derived within the fragment program, a significant performance penalty was found to be incurred by the required number of texture fetches and conditionals required for handling boundary cases. Instead, derivative samples are computed on the CPU then stored in textures. The derivatives ∂p̄(k)/∂k and ∂n̂(k)/∂k are each stored in their own one-dimensional, three-component textures. ∂s̄(i, j)/∂i and ∂s̄(i, j)/∂j are combined into a single two-dimensional, four-component texture. These textures can be built directly from the p̄(k), n̂(k), and s̄(i, j) textures.
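A sketch of how these derivative samples might be precomputed on the CPU with central differences; the packing of the two s̄ derivatives into one four-component array mirrors the texture layout described above, but the helper itself is illustrative.

```python
import numpy as np

def build_derivative_tables(p_bar, n_hat, s_bar):
    """Precompute derivative samples for the Jacobian textures: dp/dk and dn/dk as
    (Nk, 3) arrays, plus ds/di and ds/dj packed into one (Ni, Nj, 4) array."""
    dp_dk = np.gradient(p_bar, axis=0)                   # central differences along k
    dn_dk = np.gradient(n_hat, axis=0)                   # d[n x y]/dk then follows as dn_dk x y
    ds_di = np.gradient(s_bar, axis=0)                   # (Ni, Nj, 2)
    ds_dj = np.gradient(s_bar, axis=1)
    ds_packed = np.concatenate([ds_di, ds_dj], axis=-1)  # four components per (i, j) texel
    return dp_dk, dn_dk, ds_packed
```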

Space dimensions   128x128x128
Total samples      819200
Level 0 Blocks     8x(32x32x32)
Level 1 Blocks     8x(32x32x32), 8x(16x32x32), 8x(20x32x32)

Table 5.1: Set 1 blocks

Space dimensions   512x512x512
Total samples      189440
Level 0 Blocks     1x(32x32x32)
Level 1 Blocks     1x(24x24x24)
Level 2 Blocks     1x(24x24x24)
Level 3 Blocks     1x(36x32x32)
Level 4 Blocks     1x(48x40x48)

Table 5.2: Set 2 blocks

                            Set 1     Set 2
Data Component (L) Voxels   906048    205921
Profile (UV) Texels         4225      1089
Curve/Normal (XYZ) Texels   130       66

Table 5.3: Data set memory requirements

5.5 Results

Two data sets were used for testing, both from the context of MHD simulations as discussed in section 5.2. Both data sets are curvilinear, utilize adaptive mesh refinement, and contain cell-centered samples.

Set 1 has two levels with several blocks within each level. Table 5.1 lists the blocks and their dimensions, and figure 5.4 exhibits the block boundaries. This set is a good test case for the usage of several blocks within a single level, as well as shallow AMR.

Set 2 has one block per level, and five levels. Table 5.2 lists the blocks and their dimensions, and figure 5.3 exhibits the block boundaries. This set is a good test case for deep AMR data sets.

Due to the curvilinear AMR representation used as well as the representation for the computational space to physical space transformation, memory requirements are very reasonable. As can be seen from table 5.3, very little overhead is needed to represent the data, thus permitting increased scalability.

Tables 5.5 and 5.4 show volume rendering times for data sets 1 and 2, respectively.

Set 1 requires more rendering time than set 2 due to the increased number of samples and blocks in set 1.

As can be seen in figures 5.6 and 5.7, the rendering time does not vary linearly with the resolution. It was found that the majority of time within the fragment programs, where the ray casting occurs, is being spent on texture fetches. This indicates that at lower resolutions, the caches on the GPU are not as effective at caching the texture data as they are at higher resolutions.

Also, as expected, as the minimum step length increases, the required computational time decreases due to the decreased number of steps. Figure 5.5 exhibits a range of step lengths for both of the test data sets.

The proposed compact representation of position greatly reduces the amount of memory bandwidth required between GPU caches and the external texture memory. Additionally, it reduces the number of texture fetches required to compute Jacobians for each ray step.

MSL     1024x741   768x549   512x357
0.1     0.032s     0.025s    0.020s
0.05    0.049s     0.038s    0.029s
0.01    0.155s     0.112s    0.076s
0.005   0.229s     0.161s    0.106s
0.001   0.396s     0.275s    0.177s

Table 5.4: Set 2 rendering times for different minimum step lengths and viewport resolutions.

MSL     1024x741   768x549   512x357
0.1     0.095s     0.083s    0.079s
0.05    0.151s     0.127s    0.110s
0.01    0.60s      0.47s     0.338s
0.005   0.90s      0.70s     0.528s
0.001   1.85s      1.46s     1.01s

Table 5.5: Set 1 rendering times for different minimum step lengths and viewport resolutions.

Some positional error results from the profile extrusion on AMR data. On the test data sets, this error remained much smaller than the size of a given cell within the data. Figure 5.8 exhibits the positional error in the data sets.

For a 1024x1024x1024 volume that fits the above constraints to be fully defined for transformations between computational space and physical space, only 4096 three-component, 2^20 two-component, and 2^20 four-component texels would be required – a significant improvement over a direct 3D representation, which would require thousands of times more memory.

Figure 5.6: Data set 1 running times

5.6 Conclusion

We have presented a technique for memory-efficient and time-efficient volume rendering of curvilinear adaptive mesh refinement data within extrudable computational spaces. The volume is represented as a planar two-dimensional profile surface that is extruded along a curve. The Jacobian for points within the volume can also be easily computed using partial derivatives of these functions. This provides significant memory savings over using a uniformly sampled volume texture to represent the transformation, in addition to reduced memory bandwidth requirements due to more localized texture lookups. With this technique, curvilinear adaptive mesh refinement data sets defined in extrudable meshes can be more efficiently visualized and manipulated.

Figure 5.7: Data set 2 running times

Figure 5.8: The positional error (the difference between the original mesh position and the mesh point found with equation 5.1) is proportional to the point darkness in these images. From left to right, the images are of set 1 levels 0 to 1, set 1 level 1, set 2 levels 0 to 4, set 2 levels 3 to 4.

Chapter 6: Transformations for Volumetric Range Distribution Queries

Volumetric datasets continue to grow in size, and there is continued demand for interactive analysis on these datasets. Storage capacities and compute capabilities have also increased in workstation environments, but the storage throughputs and core memory sizes available have not increased at a similar rate. This means that an increasing number of analysis applications are becoming limited by the size of the data required by the algorithm, rather than by the computation speed or out-of-core storage device capacities available.

Many analysis applications perform data reduction – reducing a subset of data from a large-scale dataset to a much smaller dataset. For example, in volume rendering, a 3D volume is reduced to a 2D image, where the size of the image is typically considerably smaller than the size of the volume. An ideally scalable algorithm, for large-scale data, would have an asymptotic working set complexity in terms of the image size, rather than in terms of the volume size.

The working set of an algorithm is the set of data elements required for its execution during a time interval [16]. Assuming that all of the data (with N elements) has a contribution to the solution of an analysis application, it is unrealistic to expect that, in general, we can change the asymptotic working set complexity, for the entire time span of the workflow, to be less than O(N). However, it may be possible to change the working set complexity of the interactive analysis portion of the workflow by applying data transformations in the preprocessing phase. This can facilitate scalable interactivity by making the working set complexity of the interactive portion primarily depend on the result size, rather than the size of the input data.

One approach to tackling this challenge is to perform a preprocessing pass that reduces the complexity for traditional analysis methods applied in the interactive phase. This is the single-ended transformation approach. For example, a volume could be downsampled to contain only as many samples as contained by the images being rendered. It could then be directly rendered using traditional volume rendering algorithms. The advantage to this approach is that the interactive portion of the workflow will not require any changes to have reduced working set complexity. However, the major downside is the amount of sampling error that will be introduced for most volumes.

Another approach is to introduce specialized analysis algorithms for the interactive phase of the workflow, in addition to a preprocessing phase. This is the dual-ended transformation approach. In keeping with the volume rendering example, one example of this approach is Fourier Volume Rendering (FVR) [68]. Using FVR, the working set complexity of the direct volume rendering algorithm is reduced¹ from O(N^3) to O(N^2) during the interactive phase, with an O(N^3) preprocessing pass. In this example, several points can be illustrated.

Firstly, the overall computational complexity is greater than the working set complexity in this case (O(N^3 log N^3) vs. O(N^3)). But, for the interactive portion of the workflow, the working set complexity and computational complexity both decrease from O(N^3) to O(N^2) and O(N^3) to O(N^2 log N^2), respectively. Secondly, some flexibility (such as it being reduced to a simple summation with a parallel projection) in volume rendering has been sacrificed for the sake of interactivity, but this may be acceptable for many users. Thirdly, the overall working set complexity remains the same, as can be expected, but the working set complexity of the interactive portion of the workflow has decreased considerably. Finally, this data transformation facilitates only a fairly limited set of data analysis approaches. In summary, some flexibility and preprocessing time has been sacrificed for increased interactivity by adapting the application algorithm as well as introducing preprocessing. This exemplifies the approach we take.

¹ For an NxNxN volume to be rendered into an NxN image.

An increasing number of analysis applications are considering histograms, or other summary statistics, as their input, rather than just the input volume. For example, instead of computing isosurfaces for every cell in the volume, a lower resolution volume characterizing the density of isosurfaces (“fuzzy isosurfaces”) can be computed using histograms [109]. Another example is Histogram Spectra, where the differences between histograms are used to characterize error in level of detail selections (described in chapter 4). These applications depend on distribution range queries. A distribution range query evaluates an estimate of the probability density function (PDF) of the values contained within a rectangular cuboid region of a volume.

The primary goal of this technique is to enable efficient evaluation of distribution range queries on volume data by reducing the working set required. The design is motivated by two observations:

• Approximate queries can be used to reduce metadata storage sizes and reduce the size of the working set required for a set of query requests (discussed in section 6.2.6).

• Similarity between overlapping integral distributions can be utilized to reduce metadata storage sizes and the working set required for a set of query requests (discussed in sections 6.2.4 and 6.2.5).

We propose an approach that enables efficient evaluation of distribution range queries. This is accomplished by generating metadata during the preprocessing phase, then loading it on-demand for queries in the interactive phase. For multiple applications this enables the working set complexity to be primarily a function of the analysis result size, rather than the size of the input data.

This core contribution has three parts. The first, discussed in §6.2.2, is a generalization of integral histograms to the continuous domain and to multivariate volumes, integral distributions. The second, discussed in §6.2.4, is a decomposition of these integral distributions into a hierarchical structure, span distributions, that facilitates effective storage as metadata. The third is a proposal, in §6.3, for how to apply the technique for improved working set complexity in a few different applications with accompanying analyses.

Results are presented to validate two claims:

• Span distributions reduce the size of data sets, enabling reduced working set sizes, which improves performance over directly storing integral distributions. This is shown in table 6.5 and graphs 6.10, 6.11, and 6.9. Discussions specific to the storage of span distributions are in section 6.2.5.

• Approximate span distributions can further improve performance, at the cost of accuracy. This is shown in table 6.6 and graph 6.10. Approximate span distributions are also discussed in section 6.2.6.

Algorithms are provided both for the construction of metadata in the preprocessing phase, and for the servicing of queries using this metadata in the interactive analysis phase. To show the generality of the benefits of the approach, a working set complexity analysis is provided for two applications using this metadata. We believe that this work provides a good foundation on which to build scalable analysis applications.

6.1 Related Work

Range queries have been widely explored in the field of Online Analytical Processing (OLAP) and are beginning to be explored in more detail in the field of Visualization. This section will overview some of the higher level techniques that have been applied in OLAP to answer range queries in general, as well as techniques that focus more on computing distributions of ranges. Additionally, techniques in visualization and graphics that facilitate distribution range queries will also be discussed.

In OLAP, typical range queries are simple, such as a summation or maximum. However, these can be viewed merely as specific types of aggregation operators. Similarly, distribution range queries are also a type of aggregation operator. Thus, while most of these techniques seek to solve range sum queries, some of them can be adapted to distribution range queries. One group of techniques [33] [124] applies wavelet decomposition to the space to approximate sums. Hou et al. [43] proposes cosine transforms as an alternative. These techniques work well for scalar data. The authors also introduce the concept of approximate reconstructions that sacrifice spatial accuracy.

Another group of techniques [22] [34] generate histograms to approximate scalar data for the purposes of range queries. These techniques greatly depend on the bounds they choose for the regions they approximate with histograms. Koudas et al. [56], Karras [51], and Poosala et al. [84] focus more on the aspect of the problem involving the choice of bounding volumes. While these techniques can be motivating applications for the use of histogram range queries, as discussed in §6.3.1, they would be difficult to directly apply to answering distribution queries and may consume a considerable amount of space. Hixels [109] are a simple case, where the volume is broken into blocks over which histograms are computed. This has a disadvantage in that it must have a high resolution (and consequently a large working set) to support range queries with varying spatial positions and scales.

Prefix sum-based techniques in OLAP [27] [3] motivate the approach we have taken to answering distribution range queries. Fundamentally, these techniques compute a prefix sum then perform a series of additions and subtractions between prefix sum values to compute a range sum. Summed-area tables, in computer graphics, apply the same basic concept to texture mapping [14]. Integral histograms [85] extend this summed area table approach to supporting histogram range queries in the context of images. Much of the work in OLAP in this area focuses on facilitating fast updating of the prefix sum data, rather than just fast queries, which introduces a design compromise that we do not necessarily need to make in the visualization context. However, those same works [11] [27] [60] do introduce concepts that support subdivision of a volume into subdomains for the purposes of improving space and time complexity, as well as increasing parallelism.

Integral distributions generalize existing techniques to multivariate volumes to support reductions in working set complexity for the interactive portion of multiple applications. Span distributions leverage spatial coherence in integral distributions to further reduce working set sizes, as well as supporting hierarchical, multiresolution approximate queries. The next section explains the details of both of these techniques.

Figure 6.1: The preprocessing phase transforms the volume data into metadata using the transformation pipeline in equation (6.2). This requires O(N) working set complexity, for a volume with N elements. In the interactive phase, distribution range queries are evaluated by reading parts of the metadata on demand into the transformation pipeline in equation (6.3). The working set complexity for this phase depends primarily on the query result size rather than the size of the input volume.

6.2 Technique

The goal of the technique is to facilitate working set-efficient distribution range queries of volumetric data. Q(\vec{s}_0, \vec{s}_1, \vec{t}) : \mathbb{R}^{2d+m} \rightarrow \mathbb{R} is the probability density for the vector value \vec{t} within the region of the vector field V : \mathbb{R}^d \rightarrow \mathbb{R}^m, with m-component vectors, bounded by the points \vec{s}_0 and \vec{s}_1:

Q(\vec{s}_0, \vec{s}_1, \vec{t}) = \frac{\int_{\vec{s}_0}^{\vec{s}_1} h(V(\vec{s}), \vec{t})\, d\vec{s}}{\int_{\vec{s}_0}^{\vec{s}_1} 1\, d\vec{s}},
\qquad
h(\vec{u}, \vec{t}) = \begin{cases} 0 & \vec{u} \neq \vec{t} \\ 1 & \vec{u} = \vec{t} \end{cases}
\quad (6.1)

where the integrals are volume integrals and d is the dimensionality of the volume. In the context of scientific volume data, the vector field V typically contains the contents of the m dependent variables.

Direct evaluation of equation (6.1) for a discretized volume requires a working set of size O(N), assuming only the number of samples can change. For interactive queries on large-scale data this is impractical. Our technique reduces this working set for each query by transforming V into span distributions, which are a hierarchical representation (defined in section 6.2.4) of integral distributions, in the preprocessing phase. This enables efficient evaluation of an integral distribution field W(\vec{s}, \vec{t}) : \mathbb{R}^{d+m} \rightarrow \mathbb{R}. Then, W is used, instead of V, in an alternative formulation of equation (6.1), to evaluate Q. Because evaluating Q using W instead of V requires far fewer values, the working set size is reduced. The high level process is shown in figure 6.1.

The integral distribution field is a mapping from \mathbb{R}^{d+m} to \mathbb{R}, rather than being defined in terms of a discrete domain. Additionally, the above equations are formulated in terms of general probability density functions rather than probability mass functions (which can be represented by histograms). Discretization and storage strategies must be considered for both.

6.2.1 High level overview

The two phases of the proposed framework, the preprocessing phase and the interactive phase, are shown in figure 6.1. Metadata is generated in the preprocessing phase using a series of transformations:

V(\vec{s}) \xrightarrow{\;I\;} W(\vec{s},\vec{t}) \xrightarrow{\;D\;} X_i(\vec{s}) \xrightarrow{\;S\;} Y_{k,i} \quad (6.2)

where D is a distribution value discretization function, introduced in §6.2.3. S is a spatial discretization function, such as span distributions, introduced in §6.2.3. X is a value-discrete, spatially-continuous representation of W and Y is a value-discrete, spatially-discrete representation of X. The integral distribution function, W, is introduced in section 6.2.2.

This metadata is loaded on-demand to evaluate queries. A sequence of transformations is applied for each query:

\dot{Y}_{k,i} \xrightarrow{\;S^{-1}\;} \dot{X}_i(\vec{s}) \xrightarrow{\;D^{-1}\;} \dot{W}(\vec{s},\vec{t}) \longrightarrow \dot{Q}(\vec{s}_0,\vec{s}_1,\vec{t}) \quad (6.3)

where the dotted functions depend on only a subset of the metadata, rather than the original Q function in equation (6.1). The inverse discretization functions needed for evaluating queries are introduced in sections 6.2.3 and 6.2.3.

6.2.2 Integral Distribution Function

The integral distribution function maps a point in a multivariate volume to the distribution of the volume between that point and the origin of the vector field. This is an extension of integral histograms [85], which themselves are an extension of the use of 2D prefix sums in graphics (summed area tables [14]) and multidimensional prefix sums in OLAP [27] [3].

The integral distribution function I is defined as:

I(\vec{s},\vec{t}) = Q(\vec{0}, \vec{s}, \vec{t}) \quad (6.4)

where Q is from equation (6.1). This can be used to transform a vector field V : \mathbb{R}^d \rightarrow \mathbb{R}^m into an integral distribution field, W:

W(\vec{s},\vec{t}) = I(\vec{s},\vec{t}) \quad (\forall \vec{t} \in \mathbb{R}^m) \wedge (\forall \vec{s} \in U) \quad (6.5)

where U is the set of positions in the domain of V, the input volume.

This is the intermediate representation that our metadata seeks to represent, though without directly storing it. For the sake of clarity, let Ẇ be equivalent to W, but computed using the metadata, rather than by directly evaluating equation (6.5). In other words, any evaluation of W produces the same value as Ẇ, but Ẇ depends only on the metadata, while W depends on evaluating Q.

With the integral distribution field, an alternative to equation (6.1) can be constructed

to produce Q. For a one dimensional volume, where the domain of V is R, this is simply:

Q˙(s0,s1,~t) = W˙ (s1,~t) −W˙ (s0,~t) (6.6)

Because W˙ is known a priori from the metadata, Q can effectively be evaluated simply by

evaluating W˙ at two ~s positions, rather than by evaluating an integral over V. This can

trivially be extended to the 2D case, where the domain of V is R2:

\dot{Q}(\vec{s}_{00}, \vec{s}_{11}, \vec{t}) = \dot{W}(\vec{s}_{00}, \vec{t}) + \dot{W}(\vec{s}_{11}, \vec{t}) - \dot{W}(\vec{s}_{10}, \vec{t}) - \dot{W}(\vec{s}_{01}, \vec{t}) \quad (6.7)

Generalizing to vector fields with Rd domains, this becomes:

\dot{Q}(\vec{s}_{\{0\}^d}, \vec{s}_{\{1\}^d}, \vec{t}) = \sum_{i \in \{0,1\}^d} (-1)^{d - \|i\|_1}\, \dot{W}(\vec{s}_i, \vec{t}) \quad (6.8)

where {S}^d raises the set {S} to the dth Cartesian power. For example, {0,1}^2 is

{{0,0},{0,1},{1,0},{1,1}}. For an evaluation of Q˙ in d spatial dimensions, 2^d lookups

into the W field are required. This formulation of Q˙ is similar to the formulation used for

integral histograms [85], but generalized to a continuous domain.
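To make the corner-lookup formulation concrete, the following C++ sketch evaluates a range query by combining the 2^d corner evaluations of W with alternating signs, as in equation (6.8). It is an illustration only: the callable W stands in for whatever mechanism reconstructs the integral distribution value (for example, a single histogram bin) at a corner position.

// Illustrative sketch: evaluate Q over an axis-aligned box [lo, hi] in d
// dimensions by combining the 2^d corner evaluations of W with alternating
// signs, as in equation (6.8).
#include <functional>
#include <vector>

using Point = std::vector<double>;

double rangeQuery(const std::function<double(const Point&)>& W,
                  const Point& lo, const Point& hi) {
    const int d = static_cast<int>(lo.size());
    double q = 0.0;
    for (int mask = 0; mask < (1 << d); ++mask) {        // one iteration per box corner
        Point corner(d);
        int ones = 0;                                    // ||i||_1 in equation (6.8)
        for (int axis = 0; axis < d; ++axis) {
            const bool useHi = (mask >> axis) & 1;
            corner[axis] = useHi ? hi[axis] : lo[axis];
            if (useHi) ++ones;
        }
        const double sign = ((d - ones) % 2 == 0) ? 1.0 : -1.0;   // (-1)^(d - ||i||_1)
        q += sign * W(corner);
    }
    return q;
}

For d = 1 this reduces to equation (6.6), and for d = 2 to equation (6.7).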

In this work we do not seek to store the integral distribution field directly. Rather, we

discretize it then decompose it into a more appropriate format for large-scale data. This in-

volves two related parts: discretization of the field in terms of the spatial component(~s) and

a discretization of the field in terms of the value component(~t). The next section addresses

both aspects.

Figure 6.2: In this example, a 1D integral distribution volume (Xi(s)) is discretized into 8 span distributions (Yk,i) as described in equation (6.9). The span distribution at index 6, for example, is computed by subtracting Xi(5) from Xi(7).

6.2.3 Discretization

The integral distribution field is an R^{d+m} → R mapping with two parameters: an R^d

vector field spatial position parameter, and an R^m vector field value parameter. Different

discretization strategies can be taken for these two different components. The former is

discussed under the Spatial Discretization heading and the latter under the Distribution Discretization heading, both within section 6.2.3.

Distribution Discretization

The distribution discretization problem seeks to provide two functions: a discretization function and a reconstruction function. The discretization function, D, maps an input m-dimensional probability density function (PDF) to a finite number of real values. The reconstruction function, D^{-1}, enables the reconstruction of a PDF from the values produced by the discretization function.

For example, a simple discretization function could be a mean combined with a vari- ance. The associated reconstruction function would then be a normal distribution. This is not appropriate for many real cases, but it serves as a simple example.

In effect, the goal is to provide compact models of probability density functions that are appropriate for the underlying distributions. Many approaches exist for solving this problem [4]. The following are a few approaches that are appropriate for use with span histograms, and are of relatively low implementation complexity.

For cases where m is small (where there are few variables in the multivariate volume), histograms can be an effective tool for discretization. When m is greater than 1, this dis- cretization takes the form of joint histograms. Due to the curse of dimensionality, joint histograms may not be effective in handling cases where m is large [4]. Gaussian mixture

models or polynomial fits may be acceptable alternatives to histograms in cases where m is large.

In our end goal of evaluating queries in the form of Q in equation (6.1), the~t value may actually take multiple values for a single query position. This is because most applications will be interested in the PDF for more than one value. Thus, any discretization model applied should consider this. Unless only very few values of~t are needed, it will generally make more sense to provide the user application the entire distribution model as the result of Q, rather than a single value at a time, due to the overhead associated with repeatedly evaluating Q at a point.

When histograms are used, choices must be made on what binning strategy may be used. When two histograms are to be added or subtracted, the operation reduces to a simple vector addition or subtraction, when the bin bounds are the same for the operands.

If they do not line up, error will be introduced during the addition because in the cases of partial bin overlap, it is not clear how to distribute the values between overlapping bins.

This implies that the same histogram binning should be used for all histograms stored in the metadata. However, it does not apply any restrictions on what the specific binning strategy used should be, other than that it should be suitable for the entire dataset. This is a well-explored problem [115] [55], but there is still substantial room for work in exploring it in the context of large-scale data.

For the purposes of exploring the applications in this paper, a histogram discretization is used and globally-uniform equal-width binning is assumed. However, for the purposes of the definition of span distributions, no assumptions are made about the discretization other than that they can be added and subtracted.
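As a small illustration of why a shared binning matters, the following C++ sketch adds and subtracts histograms as plain per-bin vectors and maps a value to a bin under a globally-uniform equal-width binning; the bin count and value range are placeholders rather than values prescribed by this work.

// Illustrative sketch: histograms with identical bin edges can be added and
// subtracted bin-wise, which is the property the span distribution transform
// relies on.
#include <cstddef>
#include <vector>

using Histogram = std::vector<double>;

Histogram add(const Histogram& a, const Histogram& b) {
    Histogram r(a.size());
    for (std::size_t k = 0; k < a.size(); ++k) r[k] = a[k] + b[k];
    return r;
}

Histogram subtract(const Histogram& a, const Histogram& b) {
    Histogram r(a.size());
    for (std::size_t k = 0; k < a.size(); ++k) r[k] = a[k] - b[k];
    return r;
}

// Globally-uniform equal-width binning shared by every histogram in the
// metadata: all histograms use the same [minValue, maxValue) range and bin count.
std::size_t binIndex(double v, double minValue, double maxValue, std::size_t bins) {
    double t = (v - minValue) / (maxValue - minValue);
    if (t < 0.0) t = 0.0;
    if (t > 1.0) t = 1.0;
    std::size_t b = static_cast<std::size_t>(t * bins);
    return (b < bins) ? b : bins - 1;                    // clamp the top edge into the last bin
}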

Spatial Discretization

Similarly to the distribution discretization problem, the spatial discretization problem seeks to provide two functions: a discretization function, and a reconstruction function.

The discretization function, S, maps the discretized distribution at every point in the input spatial domain, as produced by the distribution discretization function (D), to a finite num- ber of real values. The reconstruction function, S−1, maps the resulting values from the discretization function back to the input domain.

For example, a simple discretization function would be nearest neighbor sampling onto a uniform grid. The associated reconstruction function for a spatial position would re- turn the value at the nearest sample. Combining this spatial discretization function with a histogram distribution discretization function is the approach taken by integral histograms

[85].

Span distributions are an alternative spatial discretization function, designed to be more working set-efficient by supporting multiresolution approximate reconstruction of the W

field (which is assumed to be defined on a uniform grid) and by taking advantage of spatial coherence between nearby regions.

6.2.4 Span Distributions

The volumes affecting each integral distribution tend to have considerable overlap, as shown in figure 6.2. This implies that their respective distributions will be similar. For example, consider the case in a 3D volume where an integral distribution is defined for the point < 1,1,1 >. If an integral distribution is defined for another point, < 1,1,1.01 >,

99.01% of the contributing volume will overlap, placing an upper bound on the possible difference between the distributions.

Two key observations follow from this. First, a hierarchical spatial discretization can

be used effectively for facilitating approximate, lossy, integral distribution reconstruction

(as discussed in section 6.2.6). Secondly, the information entropy of the difference between

two nearby integral distributions will tend to be considerably smaller than the information

entropy of each of those two integral distributions, individually. Span distributions are

designed to take advantage of both of these observations.

Span distributions are a spatial discretization strategy, mapping Xi(~s) to Yk,i, taking the place of S in equation (6.2). In the case of d = 1, where V has only one spatial dimension, the span distribution discretization function is defined as:

Y_{G(\vec{s}),i} = X_i(s) - X_i\!\left(s - G^{-1}(B_{G(s)})\right), \qquad B_k = 2^{L_k}, \qquad L_k = (\text{least significant nonzero bit index in } k) \quad (6.9)

where G(s) and G^{-1}_k are nearest neighbor mappings from R to Z and from Z to R, respectively. X_i(s) : R → R is value i of the integral distribution at the point s.

The inverse transform (S^{-1} in equation (6.3)) maps Y_{k,i} to X_i(\vec{s}). In the case of d = 1, where V has only one spatial dimension, the span distribution reconstruction function is defined as:

X_i(s) = \begin{cases} Y_{G(s),i} + X_i\!\left(G^{-1}_{G(s) - B_{G(s)}}\right) & : G(s) \neq 0 \\ 0 & : G(s) = 0 \end{cases} \quad (6.10)

Intuitively, the forward transformation discretizes the spatial positions to a uniform grid,

then stores a distribution for each nonzero bit in the discretized grid coordinate index. This

decomposes the input integral distributions into one or more span distributions. The inverse

transform performs the reverse of this, fetching one span distribution for each nonzero bit

in the discretized grid coordinate index. Figure 6.3 shows an example of this being used

for range queries.

Figure 6.3: Distribution range queries are executed by evaluating the integral distribution of each corner of the range using equation (6.10), then combining them using equation (6.8). In this example, the range query is evaluated using 4 span distributions, subtracting the span distributions (Y2,i and Y3,i) that contribute to the Xi(4) integral distribution, and adding the span distributions (Y4,i and Y6,i) that contribute to the Xi(7) integral distribution.

The following table is a concrete example of this being applied to a one-dimensional

input function, V(s) = s, where the domain of V is [0,1]:

s      k  X(s)        Lk   Bk   Yk          X(s) from Yk
0.000  0  (0,0,0,0)   n/a  n/a  (0,0,0,0)   Y0
0.125  1  (1,0,0,0)   0    1    (1,0,0,0)   Y1
0.250  2  (2,0,0,0)   1    2    (2,0,0,0)   Y2
0.375  3  (2,1,0,0)   0    1    (0,1,0,0)   Y3 + Y2
0.500  4  (2,2,0,0)   2    4    (2,2,0,0)   Y4
0.625  5  (2,2,1,0)   0    1    (0,0,1,0)   Y5 + Y4
0.750  6  (2,2,2,0)   1    2    (0,0,2,0)   Y6 + Y4
0.875  7  (2,2,2,1)   0    1    (0,0,0,1)   Y7 + Y6 + Y4
1.000  8  (2,2,2,2)   3    8    (2,2,2,2)   Y8

In this example, the Xi(s) function is the histogram of the values within the range of

0 to s in V(s), with 4 evenly-spaced bins. The G(s) function is defined with 9 uniformly- spaced grid-centered sample points from s = 0 to s = 1.

The s column is the spatial continuous-domain position corresponding to the spatial discrete domain position. The X(s) values are the integral distributions (histograms, in this example) at the positions. The Lk is the level for the span distributions and the Bk is the length of the span in terms of indices. The Yk is the span distribution for index k. The “X(s) from Yk” column shows how X(s), for each row, can be reconstructed from Yk values.
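The following C++ sketch illustrates the 1D forward and inverse transforms of equations (6.9) and (6.10); it is a simplified illustration rather than the implementation used in this work. Applied to the integral distributions in the table above, it reproduces the Y_k column and the listed reconstructions (for example, X(7) = Y_7 + Y_6 + Y_4).

// Illustrative sketch of the 1D span-distribution transform. X[k] holds the
// integral distribution (here: a histogram) at grid index k, with X[0] empty.
// The forward pass produces Y[k] = X[k] - X[k - B_k] with B_k = 2^(L_k) and
// L_k the index of the least significant set bit of k; the inverse pass
// reconstructs X[k] by summing one span distribution per set bit of k.
#include <cstddef>
#include <vector>

using Dist = std::vector<double>;   // any distribution model that can be added and subtracted

static Dist sub(const Dist& a, const Dist& b) {
    Dist r(a.size());
    for (std::size_t i = 0; i < a.size(); ++i) r[i] = a[i] - b[i];
    return r;
}

static unsigned spanLength(unsigned k) { return k & (~k + 1u); }   // B_k: lowest set bit of k

std::vector<Dist> forwardTransform(const std::vector<Dist>& X) {
    std::vector<Dist> Y(X.size());
    Y[0] = X[0];                                         // index 0 carries the empty distribution
    for (unsigned k = 1; k < X.size(); ++k)
        Y[k] = sub(X[k], X[k - spanLength(k)]);
    return Y;
}

Dist reconstruct(const std::vector<Dist>& Y, unsigned k) {
    Dist x(Y[0].size(), 0.0);                            // X(0) = 0 in equation (6.10)
    while (k != 0) {                                     // one fetch per set bit of k
        for (std::size_t i = 0; i < x.size(); ++i) x[i] += Y[k][i];
        k -= spanLength(k);
    }
    return x;
}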

Extending the above equations from one dimension to d dimensions, to enable evaluation of Xi(~s) in equation (6.2), only requires a modification to the G(~s) and G^{-1}_k nearest neighbor mappings. The real-valued spatial positions are mapped to integer positions on a uniform grid. Then, a single integer is produced from these coordinates' integer positions by using the Z-order space-filling curve [74]:

G(\vec{s}) = Z\!\left(\left\lfloor \operatorname{diag}(\vec{N})\,\vec{s} + \tfrac{1}{2} \right\rfloor\right) \quad (6.11)

G^{-1}_k = \operatorname{diag}\!\left(\tfrac{1}{\vec{N}}\right) Z^{-1}(k) \quad (6.12)

Figure 6.4: The Z-order space-filling curve maps a d-dimensional integer coordinate to a 1-dimensional integer coordinate. In this example, a 3D coordinate with 4 bits per component is mapped to a single 1D coordinate with 12 bits.

where Z is the Z-order space-filling curve and diag(~v) produces a diagonal matrix from vec- tor~v. Figure 6.4 shows an example of a Z-order curve encoding of a 3 dimensional integer vector. Use of Z-order space filling curves for hierarchical representations has been applied before [81], due to their favorable storage locality properties and simplicity. However, they have not been used in the context of representing data structures similar in purpose or structure to integral distributions.
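For illustration, the following C++ sketch implements the nearest neighbor Z-order mapping G of equation (6.11) for the 3D case, using standard Morton-code bit interleaving. The grid resolutions and the 21-bit-per-axis limit are assumptions of the sketch, not of the technique.

// Illustrative sketch: map a continuous 3D position s in [0,1]^3 to a single
// Z-order (Morton) index on a grid with N[a]+1 samples along axis a, as in
// equation (6.11). Each axis is limited to 21 bits so the result fits in 64 bits.
#include <array>
#include <cmath>
#include <cstdint>

// Spread the lower 21 bits of v so there are two zero bits between consecutive
// input bits (standard Morton-code bit interleaving constants).
static std::uint64_t spreadBits3(std::uint64_t v) {
    v &= 0x1FFFFF;
    v = (v | (v << 32)) & 0x1F00000000FFFFULL;
    v = (v | (v << 16)) & 0x1F0000FF0000FFULL;
    v = (v | (v << 8))  & 0x100F00F00F00F00FULL;
    v = (v | (v << 4))  & 0x10C30C30C30C30C3ULL;
    v = (v | (v << 2))  & 0x1249249249249249ULL;
    return v;
}

std::uint64_t zOrderIndex(const std::array<double, 3>& s,
                          const std::array<std::uint64_t, 3>& N) {
    std::array<std::uint64_t, 3> g;
    for (int a = 0; a < 3; ++a)                          // nearest-neighbor rounding, as in (6.11)
        g[a] = static_cast<std::uint64_t>(std::floor(s[a] * static_cast<double>(N[a]) + 0.5));
    return spreadBits3(g[0]) | (spreadBits3(g[1]) << 1) | (spreadBits3(g[2]) << 2);
}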

This section has provided a definition of span distributions, in terms of the transforma- tions (discretization and reconstruction) between Yk,i and Xi(~s). The next section discusses

how Yk,i, the span distributions, are stored.

6.2.5 Storage of Span Distributions

The Yk,i field, produced by the transformation discussed in the previous section and shown in equation (6.2) enables lossless reconstruction of integral distributions. A loss- less, compressed encoding of Yk,i is the metadata that is to be stored on disk. Multiple considerations must be made to facilitate efficient storage and use.

Because the goal is to leverage the hierarchical representation of span distributions to

facilitate approximate queries, it makes sense to store the elements for each level contigu-

ously, rather than interleaving the elements from different levels. This is taken advantage

of in section 6.3.1. Additionally, because user applications are likely to be interested in

the PDF evaluated at ranges of values, rather than a single value, values that contribute to

the same PDF discretization should be stored contiguously. Finally, working sets can be

further reduced by applying entropy coding.

To construct the hierarchical storage model, Yk,i is separated into levels. The level to which a span distribution is assigned is Lk, from equation (6.9), which is simply the index of the rightmost nonzero bit in k. In the case of k = 0, the level index is log2 N where N is the number of span distributions. For multidimensional indices, k is the Z-order index of the grid coordinate, for a uniform grid superimposed on the volume. The result of this is that the number of span distributions in each level decreases as the level index increases.

Additionally, because the width of each span distribution is a function of the level number

(as can be seen in equation (6.9)), the volume of the space contributing to a span distribution will tend to increase as the level index increases.

Each level is stored as a set of chunks, with each chunk storing a sequence of entropy coded span distributions. To further improve entropy coding performance, each span dis- tribution within a chunk (other than the first span distribution) is stored differentially with respect to the previous span distribution in the chunk. Each chunk stores an entropy coding model and a set of entropy codes. Because the information entropy of each span distribu- tion can vary per-level, the number of span distributions that are stored per chunk should also vary per-level. This is necessary to maintain a favorable ratio between entropy coding model sizes and entropy code array sizes.
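The following C++ sketch shows the differential step in isolation: the first span distribution of a chunk is kept verbatim and each subsequent one is replaced by its per-bin difference from its predecessor before entropy coding, with decoding being the corresponding prefix sum. The entropy coding itself is omitted from the sketch.

// Illustrative sketch: differential (delta) encoding of the span distributions
// within one chunk, applied before entropy coding. Encoding runs back to front
// so each element is differenced against the original value of its predecessor.
#include <cstddef>
#include <vector>

using Dist = std::vector<double>;

void deltaEncodeChunk(std::vector<Dist>& chunk) {
    for (std::size_t s = chunk.size(); s-- > 1; )
        for (std::size_t k = 0; k < chunk[s].size(); ++k)
            chunk[s][k] -= chunk[s - 1][k];
}

void deltaDecodeChunk(std::vector<Dist>& chunk) {
    for (std::size_t s = 1; s < chunk.size(); ++s)       // prefix sum restores the originals
        for (std::size_t k = 0; k < chunk[s].size(); ++k)
            chunk[s][k] += chunk[s - 1][k];
}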

In our experiments, we found that the size of levels, in terms of total information entropy, varies exponentially with respect to the level number. This can be seen in figure

6.7. Similarly, the number of span distributions per level also varies exponentially. Thus, because the total information entropy of a level is equal to the number of span distributions times the information entropy of each span distribution, the information entropy of the span distributions can also be modeled as an exponential. Additionally, using compressed span distributions can take considerably less space than storing integral distributions directly, as can be seen in figure 6.10 and table 6.5.

Input Volume Cells   Uncompressed Integral Distributions   Span Distributions (no skipped levels)   Span Distributions (6 skipped levels)
0.26x10^6            128MB                                  6MB                                       1MB
1.05x10^6            512MB                                  28MB                                      2MB
3.15x10^6            1536MB                                 94MB                                      7MB
12.6x10^6            6144MB                                 467MB                                     29MB
39.3x10^6            16000MB                                1383MB                                    94MB
66.1x10^6            32256MB                                3061MB                                    161MB
101x10^6             49152MB                                4953MB                                    246MB

Figure 6.5: Because span distributions take advantage of the similarity between neighbor- ing integral distributions for storage, they take considerably less space, even for lossless reconstruction. Additionally, by dropping some of the span distribution levels, the size can be further reduced at the cost of being lossy. In this case the distributions were represented by 64 bin histograms on 3D computational fluid dynamics volume data.

If the level index is ℓ, then the total size for levels can be modeled as:

H_\ell = \alpha_1 e^{\ell} + \alpha_0 \quad (6.13)

Additionally, the number of span distributions in a level can be modeled as:

N_\ell = 2^{\log_2 N^d - \ell} \quad (6.14)

Levels Dropped   Span Distribution Size
none             4953MB
1                2989MB
2                1459MB
3                1033MB
6                246MB

Figure 6.6: By dropping some levels, which results in queries being approximate, the size of the span distributions necessary can be reduced. This can reduce the working set size of an application. In this case the distributions were represented by 64 bin histograms on 3D computational fluid dynamics volume data.

With this, the entropy per span distribution can then be modeled as:

M_\ell = \frac{H_\ell}{N_\ell} = \beta_1 e^{\ell} + \beta_0 \quad (6.15)

Assuming that the size of the entropy coding model for a given chunk is constant and equal to F, and that the ideal ratio between the size of the entropy code sequence and the size of the model is γ, the ideal number of span distributions, \mathcal{L}, for a given level ℓ is:

\mathcal{L} = \frac{\gamma F}{M_\ell} \quad (6.16)

The optimal value for γ depends on the latency to throughput ratio of the storage devices being used. If latency is high, then γ should be high. Similarly, if latency is low, then γ

should be low. At the extreme, if γ is too large, then the cost to perform a fetch of a span

distribution can be excessively large. In practice, for solid-state drives with static Huffman

coding, a γ value of around 30 was found to be reasonable.
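As a concrete illustration of equation (6.16), the following C++ sketch computes a per-level chunk size from an assumed fixed model size F, the modeled per-span-distribution entropy M_ℓ, and a chosen γ. The concrete numbers in the usage comment are illustrative only.

// Illustrative sketch: number of span distributions per chunk for one level,
// following L = gamma * F / M_l from equation (6.16).
#include <algorithm>
#include <cmath>
#include <cstddef>

std::size_t spanDistributionsPerChunk(double modelSizeBytes,      // F
                                      double entropyPerSpanDist,  // M_l, in bytes
                                      double gamma) {             // latency-dependent ratio
    const double L = gamma * modelSizeBytes / entropyPerSpanDist;
    return std::max<std::size_t>(1, static_cast<std::size_t>(std::floor(L)));
}

// Example: with a 4096-byte Huffman model, 64 bytes of entropy per span
// distribution, and gamma = 30, a chunk would hold about 1920 span distributions.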

Because Lk determines the level number, each successively lower level number has

higher spatial precision. If spatial precision can be sacrificed, then span distributions do

[Plot for figure 6.7: bytes per level and bytes per span distribution as a function of level number.]

Figure 6.7: Both the size of the levels and the number of span distributions in the levels decrease exponentially as the level number increases. The ratio between the size of a level and the number of span distributions in it enables modeling of the entropy per span distribution.

not necessarily need to be loaded for all levels, nor do they need to be stored for all lev-

els. In other words, span distributions can be selectively loaded to facilitate approximate

queries.

6.2.6 Approximate Queries with Span Distributions

Approximate queries can be performed simply by not storing (or not using) some of the lower numbered levels. This can be accomplished by modifying equation (6.10) to not include levels whose index is less than a threshold. For example, in figure 6.2, dropping the highest detail level would be equivalent to not storing the row of span distributions for

Lk = 0.
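The following C++ sketch illustrates one way such an approximate reconstruction can be expressed: the terms of equation (6.10) whose level falls below a threshold are skipped, which is equivalent to snapping the query position to a coarser grid point. It is a simplified illustration rather than the storage-aware implementation used in this work.

// Illustrative sketch: approximate reconstruction of X at grid index k that
// ignores span-distribution levels below minLevel.
#include <cstddef>
#include <vector>

using Dist = std::vector<double>;

static unsigned lowestSetBit(unsigned k) { return k & (~k + 1u); }
static unsigned levelOf(unsigned k) {                    // L_k: index of the lowest set bit
    unsigned l = 0;
    while (((k >> l) & 1u) == 0u) ++l;
    return l;
}

Dist reconstructApprox(const std::vector<Dist>& Y, unsigned k, unsigned minLevel) {
    Dist x(Y[0].size(), 0.0);
    while (k != 0) {
        if (levelOf(k) >= minLevel)                      // keep only the coarse levels
            for (std::size_t i = 0; i < x.size(); ++i) x[i] += Y[k][i];
        k -= lowestSetBit(k);                            // move on to the next set bit
    }
    return x;
}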

Each span distribution has a corresponding region of space from which its distribution is drawn. This region is implied by equation (6.9). The corresponding volume of a span distribution is the intersection of the two contributing integral distributions in the equation

(6.9) subtracted from the union of the same two contributing integral distributions.

[Plot for figure 6.8: mean error bounds per bin and relative size as a function of the number of dropped levels.]

Figure 6.8: The relationship between the error bound and the stored size for varying numbers of levels skipped.

For one dimensional volumes of size N, the mean volume of a corresponding region

for a span distribution within a level ℓ is N^{-1} 2^{\ell}. For d-dimensional N^d volumes, this

generalizes to:

2^{(1 - d + \ell d)}\,(N + 1)^{(d - 1)}\,N^{-d} \quad (6.17)

Intuitively, this means that as the level number is increased, the mean volume increases

exponentially. This implies that the potential error that can be introduced by dropping

a low numbered level will tend to be considerably lower than the potential error that can be

introduced by dropping a high numbered level.

In addition to this, as discussed in the previous section, the total size of the span distributions stored for a level decreases exponentially as the level number increases. Combining these two observations, we can see that low levels of the span distribution data contribute the least amount of potential error, yet cost the most amount of space on disk. This enhances the effectiveness of a technique to reduce query cost and metadata size by dropping low numbered levels.

6.2.7 Comparing Span Distributions

Span distributions provide a spatial discretization that enables approximate queries, and generalize support for distribution range queries to arbitrary distributions, rather than just histograms.

[Plot for figure 6.9: time per query (seconds) versus running time (seconds) for span distributions with 0, 3, and 6 dropped levels and for direct evaluation, 2016MiB source data.]

Figure 6.9: Out-of-core data, query time, randomly positioned and sized queries, 2016MiB source data. The majority of the time spent in this test was I/O. Reducing the working set reduces the demands on storage devices, improving performance.

In supporting approximate queries, span distributions introduce a hierarchical compo-

nent to the metadata structure, as discussed in the previous section. If the working set is

considered for the time interval of a single query (rather than the time interval for an entire

workflow), this will introduce an O(logN) factor for spatial discretizations with N samples.

However, if the number of levels stored is held fixed, regardless of the data size, this re-

duces the working set to O(1). In either case, the method is considerably faster than O(N)

methods. Figure 6.9 shows a typical result on a dataset several times larger than the core

memory available.

The following table summarizes different aspects of some alternative methods for evaluating distribution range queries:

                              Span distrib.      Integ. hist. [85]   Hixels [109]
Working set                   O(log N), O(1)     O(1)                O(N)
Spatial discretization        hierarchical       uniform             uniform
Distribution discretization   general            histograms          histograms
Compression                   entropy coding     raw                 raw

The working set in this table refers to the working set required for a single query of random location and size in a volume, where N is the number of discrete elements stored and the number of dimensions is assumed to be constant. While the asymptotic complexity of the working set for Integral Histograms [85] is less than that of Span Distributions for a single query, the size of the stored metadata is considerably larger. Figure 6.10 exhibits this difference in metadata sizes. Larger metadata sizes will affect cache performance, which can be observed in figure 6.11.

The above considers the working set only in terms of the time intervals associated with individual queries. The next section discusses working sets in the context of the time intervals covering entire application workflows.

6.3 Working Sets in Applications

In the context of visualization workflows, working sets are having an increasing impact.

Working set complexity for a time interval, by definition, also places a lower bound on the compute time during that workflow. In fact, in many cases, when the working set is out- of-core, the time due to storage operations will be much larger than the time spent on computation.

[Plot for figure 6.10: output size (bytes) versus input size (bytes) for uncompressed integral distributions and for span distributions with no dropped levels and with 6 dropped levels.]

Figure 6.10: Storing the integral distributions directly, sampled on a uniform grid, can take considerably more space than storing compressed span distributions. Span distributions also permit the dropping of levels, which reduces the data size, at the cost of accuracy

We concentrate on the case where the size of the entire dataset is considerably larger than the in-core memory limit of the system, but the working set fits in-core. However, we assume that the working set cannot be adequately predicted a priori. Thus, cache warming cannot be used to pre-load the data outside of the interactive portion of the workflow. In this situation, the application will tend to be throughput-bound in the interactive portion of the workflow. Because of this, the working set over the entire time interval of the interactive portion of the workflow can be used to identify the performance characteristics of the workflow.

We take three steps in looking at performing working set analysis for a given applica- tion. First, the application algorithm is characterized. Next, the application query patterns are identified. Then, knowing the query patterns, the working set is analyzed in the context of the workflow.

The following sections apply these steps for a couple different applications. The first application (§6.3.1) exemplifies a class of applications where distribution range queries

can be used for error-bounded data reduction. The second application (§6.3.2) exemplifies a class of applications where distribution range queries can be used for interactive data summarization. The two classes have different working set characteristics.

[Plot for figure 6.11: time per query (seconds) versus running time (seconds) for span distributions with 0, 3, and 6 dropped levels and for direct evaluation, 64MiB source data.]

Figure 6.11: Out-of-core data, query time transient response, randomly positioned and sized queries, 64MiB source data. The majority of the time spent in this test was I/O for the top two lines of the graph. For the bottom two lines I/O has a substantial impact at the left end of the graph, but this effect is quickly reduced as the file cache warms. Using span distributions reduces the working set size required over performing direct queries. Reducing the number of levels used for span distributions reduces the working set as well. Reducing the working set reduces the demands on storage devices and reduces file cache miss rates, improving performance.

6.3.1 Application: Hovmöller diagrams

Hovmöller diagrams are used in meteorology to highlight wave phenomena [44]. They are 2D plots where one axis typically shows longitude or latitude and the other axis shows time. Each point in the plot shows the aggregation, or sum, over the remaining axes. Effectively, these diagrams are sum aggregation queries, aggregating sequences of samples along

one axis into sums on a per-element basis in the other axes. Similar aggregation queries

have also been performed in higher dimensional visualization contexts [110]. We propose

a method using histograms to estimate these diagrams subject to interactively-chosen error

constraints.

Sum aggregation queries of this form can be evaluated using histograms. Suppose we pick a bounding box within the volume, across which we want to evaluate sums down one axis. In 3D this will produce a 2D image. To estimate the range of each of these sums we can take the histogram of the bounding box region. If the axis along which we want to sum values has n entries within the bounding box, then the sum of the top n events in the

histogram is the upper bound of the sum for each entry on the other two axes. Similarly, the

sum of the bottom n events in the histogram is the lower bound of the sum for each entry on the other axes. To compute the complete image, quadtree subdivision can be performed subject to a constraint on error.
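The following C++ sketch illustrates this bounding step for a single query region. The histogram layout (per-bin counts with lower and upper bin edges) is an assumption of the sketch; using bin edges rather than bin centers keeps the bounds conservative.

// Illustrative sketch: bound the per-pixel sum along the aggregation axis from
// a region histogram. counts[b] is the number of samples in bin b for the query
// region, loEdge/hiEdge are the bin value bounds, and n is the number of samples
// summed per output pixel.
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

std::pair<double, double> sumBounds(const std::vector<double>& counts,
                                    const std::vector<double>& loEdge,
                                    const std::vector<double>& hiEdge,
                                    double n) {
    double upper = 0.0;
    double remaining = n;                                // upper bound: take the n largest events
    for (int b = static_cast<int>(counts.size()) - 1; b >= 0 && remaining > 0.0; --b) {
        const double take = std::min(counts[b], remaining);
        upper += take * hiEdge[b];
        remaining -= take;
    }

    double lower = 0.0;
    remaining = n;                                       // lower bound: take the n smallest events
    for (std::size_t b = 0; b < counts.size() && remaining > 0.0; ++b) {
        const double take = std::min(counts[b], remaining);
        lower += take * loEdge[b];
        remaining -= take;
    }
    return {lower, upper};
}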

The algorithm applied to produce this effectively performs a breadth-first search within the quadtree of the 2D projection, looking for squares of the image space whose sums can be approximated with a single distribution range query. Because the quadtree subdivision is in image space, the maximum number of distribution range queries that can be executed is O(M log M), where M is the number of image space pixels. Note that the number of samples in the volume, N, is not present.

Span distributions and integral histograms can both yield O(M) performance in this case. This is substantially better than the O(NM) performance that will result from the

use of uniform hixels or directly computing the histograms off the volume data. However,

there is a substantial difference between the actual access patterns of span distributions

and integral histograms. Because span distributions are hierarchical, and this algorithm

152 is a breadth-first search, the accesses to span distributions will be concentrated into more localized regions of the metadata, potentially yielding improved utilization of read-ahead.

Figure 6.11 shows this behavior, where decreasing the working set size can increase the performance due to reduced file cache misses.

6.3.2 Application: Transfer function design

Performing transfer function design on time-varying data has long been a challenge, with the dynamic range of values of interest being unclear for the entire time series if they are not known a priori [123]. For static datasets, histograms have been used in the context of interactive workflows to generate transfer functions [88] [67] [13]. Using span distributions enables the use of interactively computed transfer functions using regions covering the entire time domain.

Figure 6.13 exhibits an application where this is applied to the 4D NCAR dataset², the result of a fluid dynamics simulation. In the left pane of the application window, an aggregation of the 4D volume onto a 2D plane, computed with the same method discussed in §6.3.1, is shown. The user can drag and resize the query region box in the left pane to select a 4D range of interest for the transfer function. The cumulative distribution function of the histogram is used as a lookup table to warp the color portion of the transfer function such that contrast is maximized for values that have a high frequency of occurrence in the region of interest selected. The opacity of the transfer function is determined by the values of the histogram bins. Figure 6.13 shows different regions of interest selected for the transfer function with the same timestep.

²This is the same dataset used by Akiba et al. [1].

Queries in the left pane are performed, in real-time, on the entire time series. The

volume rendered in the right pane is for a single time step. This enables the user to inter-

actively select a transfer function that is generated in a way consistent with the entire time

series, without needing to have the entire dataset resident in memory.

Additionally, with fast interactive queries, techniques that depend on cursors for his-

togram determination can be supported on large data. In chapter 7, a technique is explored

that uses the interactive manipulation of cursors on slice planes to incrementally construct

transfer functions, which requires fast distribution range queries. For multidimensional

transfer function construction, the technique introduced by Kniss et al. [54] could be ex-

tended to use distribution range queries of subvolumes, in addition to the entire volume.

In this application, the algorithm itself performs single range queries for histograms in a 4D volume, then uses the histograms to compute the transfer function. Every time the user moves the cursor for a region of interest, a new query is performed. If L unique queries are performed, then using integral histograms will require a working set of O(L).

Span distributions, however, can take advantage of other properties of this workflow.

Generally, query regions will overlap and be performed over subvolumes of the input

4D volume large enough such that some levels of input span histograms can be dropped without a large impact on error. Because of this, approximate span distributions can be used effectively and the number of levels used can be chosen to be appropriate for the granularity of the queries that the user wishes to perform. This results in the span distribution method also requiring O(L), but with a considerably smaller actual working set size due to the approximate queries.

6.4 Extensions and Conclusion

Working set reduction, in the context of interactive workflows, will continue to be of great interest for many applications in visual analysis. In this work we proposed a general framework within which this problem can be approached. A transformation is performed in the preprocessing phase to facilitate distribution range queries, then the application is adapted to utilize approximation algorithms that can make use of these range queries. The general pipeline is broken into two transformations and their inverses: a distribution dis- cretization, and a spatial discretization. Both of these transformations are used to facilitate an effective representation of integral distributions, a continuous multivariate generaliza- tion of integral histograms.

Building on this framework, we focus on a spatial discretization strategy: span distri- butions. Span distributions facilitate efficient, potentially-approximate, distribution range queries through a hierarchical decomposition of integral distributions. We show that this approach can be used to construct scalable algorithms for analysis applications – algorithms whose time complexity varies in terms of the analysis result size, not the input data size.

Future work could include extending this approach to more applications. For example, the same range queries applied for transfer function design and Hovmöller diagrams could also be used to enable new volume rendering algorithms whose working sets depend primarily on the target image resolution rather than the volume size. Other applications could include fuzzy isosurfaces, classification, and feature detection. With the proposed framework, a single pass of preprocessing can produce metadata that enables algorithms with scalable working set characteristics. For a range of applications, the working set complexity can be changed to depend primarily on the result size, rather than the input data size.

(a) Tolerance of ±10

(b) Tolerance of ±5

Figure 6.12: Approximate sum aggregation of 3D volumes for Hovmöller diagrams as discussed in §6.3.1. The horizontal axis is longitude and the vertical axis is time. The tolerance provides a bound on how far the approximate sums may be from the true sums, in terms of the value of the sum. The dataset is from a simulation produced by the Pacific Northwest National Laboratory to examine the Madden-Julian Oscillation [37].

Figure 6.13: Interactive transfer function design for large-scale time-varying volume data, using interactive 4D distribution range queries, as discussed in §6.3.2. The user moves a region of interest in the left pane on a projection of the volume. The distribution of the region of interest is then used to generate transfer functions in the right pane, using the technique discussed in chapter 7

Chapter 7: Interactive Transfer Function Design on Large Multiresolution Volumes

Direct volume rendering (DVR) is widely used in the visualization of volume data. Key to the creation of high quality visualizations using DVR is the construction of effective transfer functions [67]. Effective transfer functions emphasize salient information while deemphasizing unimportant information. Interactive, semi-automatic, transfer function de- sign seeks to leverage users’ domain-specific knowledge [88] to progressively develop per- interval volume salience.

Interactive transfer function design techniques rely on iterative refinement, by users considering visual feedback, to guide a transfer function generation algorithm. In the case of DVR using optical models that consider opacity [71], modifications to the transfer func- tion require re-rendering of the volume, placing DVR into the interactive portion of the workflow. Level of detail techniques are commonly applied to enable interactive DVR of large-scale data, seeking to take advantage of the typically nonuniform salience of volumes.

The use of interactive transfer function design with salience-dependent level of detail selection creates a cyclic dependency. Interactive transfer function design techniques seek to enable discovery of interval volume salience, but also depend on interactive volume

rendering. At the same time, interactive volume rendering on large-scale data using workstations depends on level of detail selection, which, to be effective, depends on knowledge of the salience of different parts of the volume. Figure 7.1a depicts this cycle.

The core contribution of this work is a technique that reduces the impact of this cyclic dependency by enabling interactive, incremental construction of target histograms that can simultaneously be used to support transfer function construction and level of detail selection. Target histograms are used to drive the construction of both the transfer functions and the level of detail selection. This is accomplished by using histogram expressions to combine multiple local histograms into a target histogram. The target histogram is then used to generate a transfer function. Using Histogram Spectra (chapter 4), the target histogram is then also used to compute the optimal levels of detail for the multiresolution input volume. This enables interactive transfer function design in the context of DVR on data considerably larger than the available system memory.

7.1 Related Work

Due to its importance in direct volume rendering, transfer function design has been a widely explored topic. Pfister et al. [82] provide a comparison of a selection of trial and error, data-driven, and image-driven techniques. Kindlmann [52] extends this discussion to include feature detection based techniques. Fundamentally, it is unlikely that one type of technique will be appropriate for all applications. This paper concentrates on an interactive, data-driven approach intended to leverage the domain-specific knowledge of users.

Most similar to our transfer function design technique are selection-based techniques, where the user selects regions of interest and the technique generates transfer functions based on the selections. Wu et al. [125] describe an image-space technique to define which

regions are important and which regions are not. They then apply a genetic algorithm to generate transfer functions that expose details in the important regions. This technique bears some similarity to our technique in that it enables salient and non-salient regions to be identified by the user, though ours operates in the data space and avoids a highly iterative technique like genetic programming in the interest of interactivity. Ropinski et al. [88] propose a technique in which users use mouse strokes to identify which regions belong to which material, making use of the ray histograms of those regions. This technique is similar to ours in that it allows for salience to be interactively specified by the user, though it does not provide a similar scheme for providing logical combinations of different regions and it relies on having some amount of pre-segmentation of the data. Similarly to our technique,

Huang et al. [45] use slice planes as a tool to help provide a context in which users can interactively select regions to guide the construction of transfer functions.

Level of detail selection has also been a long-explored problem, with many solutions attempting to maximize the amount of salient data visible (or minimize error) for the ap- plication of interest. Guthe et al. [35] and Wang et al. [118] both propose techniques that optimize LOD selections, in the context of in-core data, using screen space error metrics.

Both of these techniques consider the final image, including visibility, thus they both also take into account the salience implied by the transfer function. Ljung et al. [64] and chapter

4 both propose methods that perform LOD selection by utilizing precomputed metadata to minimize error with respect to a target distribution. The former work concentrates more on potential compression aspects of the problem, while the latter concentrates more on the optimization aspects and extensions to multivariate data. Our work focuses on a less- explored aspect: interactive transfer function design in the context of a workflow using level-of-detail selection on large-scale out-of-core data.

(a) Work flow (b) Data flow

Figure 7.1: Level of detail selection and transfer function design both depend on interval salience.

7.2 Technique

The fundamental goal of our technique is to facilitate the identification of interval salience on large-scale data. Interval salience defines how important a given interval vol- ume is. With interval salience, transfer functions can be constructed and levels of detail can be automatically selected. Figure 7.1a illustrates the basic workflow.

In our system, the user interactively changes four controls to construct a transfer func- tion: a set of cursors, a set of slice planes, an expression for combining the distributions from each cursor, and the camera. The level of detail selection and transfer function are both updated on the fly using the first three controls. The view is re-rendered for changes to any of the controls.

The data flow for the system is shown in figure 7.1b. The user interface provides cursors to the cursor sampler. The cursor sampler then, using the current LOD selection, evaluates

the histogram of the region within each cursor. These cursor histograms are subsequently

combined using histogram expressions to compute a target histogram. The target histogram is then used to generate both an LOD selection and a transfer function. The LOD selection and transfer function are then used by the renderer to generate an image that is then passed to the user interface. The entire process can be interactive, even on a workstation with considerably less memory than the size of the dataset, as exhibited in section 7.3.

The result of this data flow is that level of detail selection and transfer function design

are both directly driven by the target histogram, which is incrementally constructed by the

user using the cursors and expressions. By incrementally constructing a transfer function,

the user is also incrementally constructing a level of detail selection that produces good

quality for a given working set size constraint. This increases usability of the transfer func-

tion design algorithm on large-scale data. Additionally, incremental construction can help

users maintain a mental mapping between color and value during the interaction process.

7.2.1 Cursor Histograms

A cursor histogram is the histogram of the set of sample values for all points within a

cursor. We chose the cursors to be circular discs within slice planes due to their simplicity,

but other shapes (such as a 3D sphere or box) or a sketching interface could be used. Cur-

sors are moved and resized by clicking on the slice planes. Slice plane rotation, translation,

and visibility can be changed with other UI elements.

The method used for generating cursor histograms is important, because a poor sam-

pling pattern with a large number of bins and large gradients may yield aliased histograms,

producing ineffective transfer functions. Two potential approaches for sampling these cur-

sor regions are direct histogram computation assuming a trilinear interpolation function,

and adaptive point sampling. We found an adaptive point sampling algorithm to be ef-

fective. Sampling on a uniform grid is used within each block, with the resolution of the

sampling grid being proportional to the resolution of the block. The resolution of each

block is adaptively chosen by the level of detail selection algorithm to minimize error sub-

ject to a global size constraint.
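As an illustration of the sampling step, the following C++ sketch accumulates a cursor histogram by point-sampling a circular disc cursor lying in a slice plane. The fixed per-cursor grid resolution and the sampleVolume callable are simplifications introduced for the sketch; in the actual system the sampling resolution follows the per-block level of detail.

// Illustrative sketch: sample a disc-shaped cursor on a slice plane into a
// histogram. axisU and axisV are unit vectors spanning the slice plane.
#include <array>
#include <cstddef>
#include <functional>
#include <vector>

using Vec3 = std::array<double, 3>;

static Vec3 offset(const Vec3& o, double a, const Vec3& u, double b, const Vec3& v) {
    return { o[0] + a * u[0] + b * v[0],
             o[1] + a * u[1] + b * v[1],
             o[2] + a * u[2] + b * v[2] };
}

void accumulateCursorHistogram(const std::function<double(const Vec3&)>& sampleVolume,
                               const Vec3& center, const Vec3& axisU, const Vec3& axisV,
                               double radius, int gridRes,
                               double minValue, double maxValue,
                               std::vector<double>& histogram) {
    for (int j = 0; j < gridRes; ++j) {
        for (int i = 0; i < gridRes; ++i) {
            // Map the (i, j) grid cell to offsets in [-radius, radius]^2.
            const double du = (2.0 * (i + 0.5) / gridRes - 1.0) * radius;
            const double dv = (2.0 * (j + 0.5) / gridRes - 1.0) * radius;
            if (du * du + dv * dv > radius * radius) continue;     // outside the disc
            const double value = sampleVolume(offset(center, du, axisU, dv, axisV));
            double t = (value - minValue) / (maxValue - minValue);
            if (t < 0.0) t = 0.0;
            if (t > 1.0) t = 1.0;
            const std::size_t bin = static_cast<std::size_t>(t * (histogram.size() - 1));
            histogram[bin] += 1.0;
        }
    }
}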

7.2.2 Histogram Expressions

Histogram expressions combine multiple cursor histograms into a single target histogram. A target histogram defines the importance of different value ranges. Value ranges with high probability in the target histogram are deemed salient. Conversely, value ranges with low probability in the target histogram are deemed unimportant. In a histogram expression, three operators are defined: disjunction, conjunction, and negation.

Operator      Histogram Expression   Result histogram bin k
Conjunction   A ∧ B                  min(A_k, B_k)
Disjunction   A ∨ B                  max(A_k, B_k)
Negation      ¬A                     max_∀i(A_i) − A_k

The disjunction operator is useful for combining two cursor histograms such that both

of their histograms appear in the target histogram. The conjunction operator is used to

combine two cursor histograms to find the bins that share high values in both histograms,

implying a common importance. The negation operator is useful for expressing that the

frequent values within a cursor histogram are unimportant, but the infrequent ones are

important. An example of this is shown in figure 7.2e.

These operators can also be composed to form expressions, enabling the generation of target histograms using a combination of several cursors. For example, D =(A ∧ B) ∨ (B ∧

C) will combine three cursors into a single histogram, D. Bin values in D will be high only when they are high in B and A, or B and C. This kind of expression could be used to

select two thin shells of values around a boundary region to explore adjacent values to the

boundary.
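A minimal C++ sketch of these bin-wise operators is given below; histogram lengths are assumed equal, and composed expressions such as D = (A ∧ B) ∨ (B ∧ C) follow by nesting the calls.

// Illustrative sketch of the histogram expression operators from the table in
// this section, applied bin-wise to cursor histograms of equal length.
#include <algorithm>
#include <cstddef>
#include <vector>

using Histogram = std::vector<double>;

Histogram conjunction(const Histogram& a, const Histogram& b) {   // A AND B
    Histogram r(a.size());
    for (std::size_t k = 0; k < a.size(); ++k) r[k] = std::min(a[k], b[k]);
    return r;
}

Histogram disjunction(const Histogram& a, const Histogram& b) {   // A OR B
    Histogram r(a.size());
    for (std::size_t k = 0; k < a.size(); ++k) r[k] = std::max(a[k], b[k]);
    return r;
}

Histogram negation(const Histogram& a) {                          // NOT A
    const double m = *std::max_element(a.begin(), a.end());
    Histogram r(a.size());
    for (std::size_t k = 0; k < a.size(); ++k) r[k] = m - a[k];
    return r;
}

// Example composition:  D = disjunction(conjunction(A, B), conjunction(B, C));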

7.2.3 Level of Detail Selection

Because the volume data is too large to fit in-core, level of detail selection is of critical importance. Our technique assumes the data is bricked into blocks of grid-centered cells, with each block having multiple levels of detail stored. Given a target histogram, we want to compute the level of detail that minimizes the error (maximizing the salient information available) for a given size constraint.

The technique described in chapter 4 is used to compute an LOD selection. This tech- nique stores metadata called Histogram Spectra. This metadata consists of a matrix, stored for each block, that contains the per-bin difference between the histogram of a block at a given level of detail and the ground truth histogram of a block computed at the maximum level of detail. For a given target histogram and a given LOD, this enables an estimate of the amount of salient information that has been lost as a result of downsampling. The metadata is generated during preprocessing, with the ground truth data only being needed for comparison during the preprocessing process. During the computation of level of detail selections in the interactive portion of the workflow, only this compact metadata needs to be accessed to estimate error, rather than needing to access the original volume. Using the data from this estimation algorithm, a greedy optimization algorithm is applied to com- pute the minimal error LOD selection subject to a user-defined working set size constraint.

This enables fast, salience-aware, LOD selection for large out-of-core volumes even when interval volume salience changes within the interactive portion of the workflow.
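The following C++ sketch outlines one plausible greedy selection loop of this kind. The specific error estimate (a target-histogram-weighted sum over a block's Histogram Spectra row) and the refine-the-best-gain-per-byte strategy are assumptions made for illustration; chapter 4 defines the actual metadata and optimization used in this work.

// Illustrative sketch: greedily refine the block whose LOD upgrade gives the
// largest estimated error reduction per additional byte, until the user-defined
// working set budget is reached.
#include <cstddef>
#include <vector>

struct Block {
    std::vector<std::vector<double>> spectra;   // spectra[lod][bin]: per-bin error vs. ground truth
    std::vector<std::size_t> sizeBytes;         // storage cost per LOD, coarsest first
};

static double blockError(const Block& b, std::size_t lod, const std::vector<double>& target) {
    double e = 0.0;
    for (std::size_t k = 0; k < target.size(); ++k) e += target[k] * b.spectra[lod][k];
    return e;
}

std::vector<std::size_t> selectLODs(const std::vector<Block>& blocks,
                                    const std::vector<double>& targetHist,
                                    std::size_t budgetBytes) {
    std::vector<std::size_t> lod(blocks.size(), 0);      // start every block at the coarsest LOD
    std::size_t used = 0;
    for (const Block& b : blocks) used += b.sizeBytes[0];

    bool improved = true;
    while (improved) {
        improved = false;
        double bestGainPerByte = 0.0;
        std::size_t bestBlock = 0;
        for (std::size_t i = 0; i < blocks.size(); ++i) {
            const std::size_t l = lod[i];
            if (l + 1 >= blocks[i].sizeBytes.size()) continue;     // already at the finest LOD
            const std::size_t extra = blocks[i].sizeBytes[l + 1] - blocks[i].sizeBytes[l];
            if (used + extra > budgetBytes) continue;              // would exceed the budget
            const double gain = blockError(blocks[i], l, targetHist)
                              - blockError(blocks[i], l + 1, targetHist);
            const double gainPerByte = gain / static_cast<double>(extra == 0 ? 1 : extra);
            if (gainPerByte > bestGainPerByte) {
                bestGainPerByte = gainPerByte;
                bestBlock = i;
                improved = true;
            }
        }
        if (improved) {
            used += blocks[bestBlock].sizeBytes[lod[bestBlock] + 1]
                  - blocks[bestBlock].sizeBytes[lod[bestBlock]];
            ++lod[bestBlock];
        }
    }
    return lod;
}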

The transient response of LOD selection algorithms is important in interactive work-

flows. Two aspects of this are data flow cycle convergence and working set stability. Data

flow cycle convergence refers to the tendency for the LOD selection to converge to a single

solution given a set of parameters, despite cycles in the data flow. Working set stability

refers to the magnitude of the change in the resulting working set for a given change in the

target histogram.

In figure 7.1b, it can be seen that there are two cycles involving the level of detail selector, one completely automated, and one involving the user. In the automated cycle, the cursor sampler, histogram expression evaluator, and LOD selector are involved. When the LOD selection changes, it can affect the histograms sampled by the cursor sampler, which will change the target distribution, which may change the optimal LOD selection.

In practice it was observed that the system converges within one or two iterations. This is reasonable because the potential change introduced in the target histogram by a change in

LOD tends to be relatively small compared to the overall set of events contributing to the target histogram. In the cycle involving the user, all elements of the system are involved.

However, the same property that allows the automated cycle to converge also allows the cycle involving the user to converge. In both cases, working set stability helps to contribute to fast convergence.

Working set stability affects both the cycle convergence and the number of transfers re- quired from out-of-core to in-core for a given change in the target histogram. Consider the case that a small change is made to the target histogram. Because this is a small change, it will likely have a small influence on per-block salience, as computed with Histogram

Spectra, used in the LOD selection algorithm. This reduces the chance that a large number

of blocks will have different optimal LODs chosen. This stability is enabled by the incre-

mental transfer function construction aspect of this technique, where small changes tend to

be made to the target distribution, relative to the size of the entire target distribution.

7.2.4 Transfer Function Construction

The transfer function is constructed by using a user-provided color ramp, opacity fac- tor, and the target histogram. The general goal of the construction algorithm is to gener- ate transfer functions that emphasize values with high frequencies in the target histogram, while deemphasizing values with low frequencies in the target histogram.

The opacity component of the transfer function, Ta(u), is constructed such that it is

linearly proportional to the target histogram, H(u), for every value, u. A user-provided

opacity factor is used as the coefficient of proportionality to adjust the compromise between

clarity and occlusion. This approach is taken so that values with high frequency in the target

histogram will be more strongly visible than those with low frequency.

The color component of the transfer function, Trgb(u), is constructed such that its con-

trast is linearly proportional to the target histogram, H(u), for every value, u. This is

accomplished by warping a user-provided color ramp. The contrast, C(u), for a point, u,

in the transfer function is defined as the color difference between a color in the transfer

function at u − h and at u + h, where h is a step size.

When using a linearly interpolated texture as a transfer function, using the width of one

texel was found to be an effective step size. While many color difference metrics could

be applied, we found that using the L2 norm of the difference between the colors in the

CIE 1976 color space to be effective. We found this metric to be more effective in giving

visually-intuitive results than using the L2 norm of the difference between the colors in the

sRGB color space.
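The following C++ sketch illustrates this construction: opacity proportional to the target histogram, and a user-provided color ramp warped through the histogram's normalized cumulative distribution so that color contrast concentrates on frequently occurring values. The color-difference details (CIE 1976 versus sRGB) are omitted, and the ramp is treated as an opaque callable; this is an illustration rather than the exact implementation used here.

// Illustrative sketch: build a transfer function whose opacity is proportional
// to the target histogram H(u) and whose color ramp is warped by the normalized
// CDF of H, so that contrast follows the histogram.
#include <array>
#include <cstddef>
#include <functional>
#include <vector>

using RGB = std::array<float, 3>;
struct TFEntry { RGB rgb; float a; };

std::vector<TFEntry> buildTransferFunction(const std::vector<double>& targetHist,
                                           const std::function<RGB(double)>& colorRamp,
                                           double opacityFactor) {
    const std::size_t n = targetHist.size();
    std::vector<TFEntry> tf(n);

    // Normalized cumulative distribution of the target histogram.
    std::vector<double> cdf(n);
    double total = 0.0;
    for (double h : targetHist) total += h;
    double run = 0.0;
    for (std::size_t k = 0; k < n; ++k) {
        run += targetHist[k];
        cdf[k] = (total > 0.0) ? run / total : (k + 1.0) / n;
    }

    for (std::size_t k = 0; k < n; ++k) {
        double a = opacityFactor * targetHist[k];        // T_a(u) proportional to H(u)
        tf[k].a = static_cast<float>(a > 1.0 ? 1.0 : a); // clamp to a valid opacity
        tf[k].rgb = colorRamp(cdf[k]);                   // warp the ramp through the CDF
    }
    return tf;
}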

7.2.5 Interaction

An example interaction sequence of this technique on the Flame volume is shown in

figure 7.2. Each cursor is assigned a letter which is subsequently used in the expressions used to compute the target histograms. Figure 7.2a shows the initial cursor the user is presented with, which provides a general overview of the dataset. In the next step (figure

7.2b), the user moves the cursor on the slice plane to a region that looks interesting, shrink- ing the cursor to focus on that region. That region is then grown by adding another cursor and applying the conjunction operator between it and the previous cursor. This exposes the common areas of interest between the two, allowing for greater opacity to be applied in the result as seen in figure 7.2c. Removing the slice planes, it can be seen in figure 7.2d that there is still some clutter in the background. This can be removed with another cursor and expression as shown in figure 7.2e, yielding the result in figure 7.2f. The entire process is interactive and allows for incremental exploration.

7.3 Results

The technique was implemented using C++ and CUDA, with CUDA being used for the volume rendering and cursor sampling. The test platform was a Linux workstation with 4 GiB of main memory, an NVIDIA GPU, and a 128 GB OCZ Vertex 2 SSD. The test implementation maintained the loaded volume data within GPU memory, while the standard Linux VM was allowed to manage the file cache. Between 100MiB and 800MiB of GPU memory was used for working set space.

Two multiresolution CFD datasets were used: the 14GiB Flame dataset shown in figure

7.2 and the 62GiB Nek dataset used to produce figures 7.3 and 7.4. Both of them were in-

teractively manipulated on our test platform. Timings were analyzed to identify scalability

as well as transient response.

Scalability was analyzed by performing an automated sequence of actions for different dataset sizes and different working set size constraints. The action sequence was similar to that used in figure 7.2, involving the movement of histogram cursors and the editing of histogram expressions. Figure 7.3 shows the results. The per-frame times depend primarily on the user-defined working set size, rather than the volume size. This is because the amount of data that can be possibly loaded for a frame, the number of samples that need to be taken for cursor histograms, and the number of samples that need to be taken for rendering all depend on the working set size rather than the volume size. The ratio of the working set size to the volume size affects the quality of the results, but does not strongly affect the frame times. Due to this relationship, it is possible for the user to apply the technique to very large data, even with fairly limited working set sizes.

Figure 7.4 shows the typical amounts of time needed to process the various aspects needed for a given frame, as a function of the time since the program has started, using the same automated test procedure used to generate figure 7.3. The system file caches were

flushed immediately prior to execution of the run, thus no volume data was resident at the start of execution and some modest initial loading is necessary. Importantly, it can be seen that the loading times required during interactions with the system are reasonable. This further reinforces the case that the technique is well conditioned – a small change in the input target histogram will tend to yield a small change in the resulting LOD selection.

This facilitates interactive, incremental construction of interval value salience for transfer

function design and level of detail selection.

7.4 Conclusion and Extensions

Histogram expressions combined with an interactive slice plane interface enable in-

cremental, interactive construction of target histograms describing salience. These target

histograms are then used to directly construct transfer functions and level of detail selec-

tions simultaneously, enabling interactive transfer function design on large-scale data.

The technique could easily be extended to support joint distributions between two vari- ables, such as gradient magnitude and value, to enable more complex transfer function design techniques. Additionally, preservation of the mental mapping between color and value could be considered in the transfer function construction algorithm. Finally, view- dependent transfer function construction and LOD selection could also be useful exten- sions.

(a) Initial overview (b) A: an initial region of interest

(c) A ∧ B: adding a region of interest (d) A ∧ B: no slices, more opacity

(e) A ∧ B ∧ ¬C: reducing clutter (f) A ∧ B ∧ ¬C: result

Figure 7.2: An example of the technique being applied to the Flame test volume, discussed in section 7.2.5

[Plot for figure 7.3: top quintile of frame time (s) versus data size (GiB) for 100MiB to 800MiB working set sizes.]

Figure 7.3: The performance as a function of volume size and working set size is largely a function of the working set size, rather than the volume size, facilitating scalability for large-scale data.

[Plot for figure 7.4: time per frame (s) for loading, cursor histogram sampling, LOD selection, and rendering versus time elapsed since program start (s).]

Figure 7.4: An example of the per-frame performance, as a function of running time, for a test run using the 62GiB Nek dataset with a 600MiB working set limit. In this case cursors are being moved around and expressions edited, yielding incremental updates to the target histogram.

Chapter 8: Extensions and Conclusion

Salience discovery techniques seek to facilitate the discovery of salience of different in-

terval volumes. Salience-aware techniques leverage the tendency for different parts of vol-

ume data to be of differing importance. In this dissertation, four techniques were proposed

in each of these categories. Two additional techniques were presented that can build on

these salience-aware and salience discovery techniques. All of these techniques can make

use of precomputed metadata to move compute-intensive components of the application

from the interactive portion of the workflow to the preprocessing phase of the workflow,

increasing scalability.

The salience discovery techniques presented in chapters 6 and 7 facilitate salience-aware techniques by enabling user identification of salient portions of data. In “Transformations for Volumetric Range Distribution Queries” (chapter 6), a framework for data transformations that enable fast queries for the distributions of subvolume ranges was presented. In “Interactive Transfer Function Design on Large Multiresolution Volumes” (chapter 7), a procedure was shown that enables level of detail selections and transfer functions to be incrementally and interactively constructed, making use of the salience-aware level of detail selection technique proposed in chapter 4. “Efficient Rendering of Extrudable Curvilinear Volumes” (chapter 5) can utilize this technique, enabling the construction of both transfer functions and appropriate level of detail selections for adaptive mesh refinement rendering.
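To make the idea behind chapter 6 concrete, the following sketch follows the classic integral (summed) histogram construction rather than the transformations actually proposed there: per-brick histograms are prefix-summed along each spatial axis so that the value distribution of any axis-aligned brick range is recovered with a constant number of table lookups, independent of the range's size.

import numpy as np

def integral_histograms(brick_hists):
    # brick_hists has shape (Bz, By, Bx, bins): one value histogram per brick.
    s = brick_hists.astype(np.int64)
    for axis in range(3):                 # prefix-sum along each spatial axis
        s = np.cumsum(s, axis=axis)
    # Pad with a zero border so inclusion-exclusion needs no bounds checks.
    return np.pad(s, ((1, 0), (1, 0), (1, 0), (0, 0)))

def range_histogram(S, lo, hi):
    # Histogram of all bricks with indices lo (inclusive) to hi (exclusive).
    (z0, y0, x0), (z1, y1, x1) = lo, hi
    return (S[z1, y1, x1] - S[z0, y1, x1] - S[z1, y0, x1] - S[z1, y1, x0]
            + S[z0, y0, x1] + S[z0, y1, x0] + S[z1, y0, x0] - S[z0, y0, x0])

# Usage: an 8x8x8 brick grid with 64-bin histograms.
bricks = np.random.default_rng(2).poisson(3.0, size=(8, 8, 8, 64))
S = integral_histograms(bricks)
h = range_histogram(S, (1, 1, 1), (5, 6, 7))
assert np.array_equal(h, bricks[1:5, 1:6, 1:7].sum(axis=(0, 1, 2)))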

The salience-aware techniques presented in chapters 2 and 4 leverage the results of salience discovery techniques to enable more efficient data analysis. In “Load-Balanced Isosurfacing on Multi-GPU Clusters” (chapter 2), metadata was generated in a preprocessing phase to enable fast interactive load balancing for isosurfacing on multi-GPU clusters in the context of interactively changing salience. “Stereo Frame Decomposition for Error-Constrained Remote Visualization” (chapter 3) can be used to enable remote interaction with this cluster-hosted load-balanced isosurfacing solution. In “Histogram Spectra for Multivariate Time-Varying Volume LOD Selection” (chapter 4), metadata was generated in a preprocessing phase to enable interactive level of detail selection in the context of interactively changing salience.
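As an illustration of how precomputed per-block metadata can drive interactive load balancing, the sketch below applies a simple greedy longest-processing-time assignment (not the algorithm of chapter 2): per-block cost estimates for the current isovalue and salience are distributed across GPUs so that no device accumulates a disproportionate share of the work.

import heapq

def balance(block_costs, num_gpus):
    # block_costs: dict block_id -> estimated cost (e.g., from precomputed
    # per-block metadata such as expected triangle counts). Returns gpu -> [block_id].
    heap = [(0.0, gpu) for gpu in range(num_gpus)]   # (accumulated cost, gpu)
    heapq.heapify(heap)
    assignment = {gpu: [] for gpu in range(num_gpus)}
    for block, cost in sorted(block_costs.items(), key=lambda kv: -kv[1]):
        load, gpu = heapq.heappop(heap)              # least-loaded GPU so far
        assignment[gpu].append(block)
        heapq.heappush(heap, (load + cost, gpu))
    return assignment

# Example with hypothetical per-block cost estimates.
costs = {"b0": 9.0, "b1": 4.0, "b2": 7.5, "b3": 1.0, "b4": 3.5}
print(balance(costs, num_gpus=2))

Because the costs are read from metadata rather than recomputed from the raw data, the assignment can be refreshed each time the salience changes without touching the full volume.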

Two possible categories of extensions are support for new applications and extensions that focus on workflow optimization. The above techniques primarily focus on supporting data volumes with one to four dimensions. Further extensions could increase the scalability of these techniques to support higher-dimensional volumes, enabling their use in more applications.

Workflow optimization extensions could automatically generate metadata considering usage patterns. Additionally, for datasets originating from simulations on large compute cluster resources, metadata could be generated in-place on the clusters to reduce data movement and allow for analysis to begin before simulations end. Finally, extensions could be considered to support workflows outside the field of scientific visualization, such as in finance or business analytics.
