Explicit Cache Management for Volume Ray-Casting on Parallel Architectures Jönsson, Daniel
Published in: Symposium on Parallel Graphics and Visualization. Eurographics Association. Published: 2012-01-01.

Citation for published version (APA): Jönsson, D., Ganestam, P., Doggett, M., Ynnerman, A., & Ropinski, T. (2012). Explicit Cache Management for Volume Ray-Casting on Parallel Architectures. In Symposium on Parallel Graphics and Visualization. Eurographics Association.

Eurographics Symposium on Parallel Graphics and Visualization (2012)
H. Childs, T. Kuhlen, and F. Marton (Editors)

Explicit Cache Management for Volume Ray-Casting on Parallel Architectures

Daniel Jönsson1, Per Ganestam2, Michael Doggett2, Anders Ynnerman1 and Timo Ropinski1
1C-Research, Linköping University, Sweden  2Lund University, Sweden

Abstract

A major challenge when designing general purpose graphics hardware is to allow efficient access to texture data.
Although different rendering paradigms vary with respect to their data access patterns, the data caching provided by the graphics architecture offers no flexibility. In this paper we focus on volume ray-casting and show the benefits of algorithm-aware data caching. Our Marching Caches method exploits inter-ray coherence and thus utilizes the memory layout of the highly parallel processors by allowing them to share data through a cache which marches along with the ray front. By exploiting Marching Caches we can apply higher-order reconstruction and enhancement filters to generate more accurate and enriched renderings with improved rendering performance. We have tested our Marching Caches with seven different filters, e.g., Catmull-Rom, B-spline, and ambient occlusion projection, and could show that a speed-up of four times can be achieved compared to using the caching implicitly provided by the graphics hardware, and that the memory bandwidth to global memory can be reduced by orders of magnitude. Throughout the paper, we introduce the Marching Cache concept, provide implementation details, and discuss the performance and memory bandwidth impact when using different filters.

Categories and Subject Descriptors (according to ACM CCS): I.3.3 [Computer Graphics]: Volume Rendering—

1. Introduction

Modern Graphics Processing Units (GPUs) are complex parallel devices which allow SIMD processing. To acquire high image fidelity and high rendering performance at the same time, algorithms need to take advantage of several aspects of a GPU's architecture, which itself is subject to change as it undergoes technological advancement. One aspect of high importance which needs to be considered is the increasing gap between computational capability and memory bandwidth of modern GPUs [OHL∗08]. While compute power grows, memory bandwidth does not grow at the same rate. Algorithms running on modern GPUs can be roughly classified by whether they are compute bound or memory bound. The former spend a major amount of cycles performing computations, while the latter use most of their cycles accessing data usually located in graphics memory. While compute bound algorithms are supported by the high degree of parallelism of modern GPUs, memory bound algorithms suffer from the increasing gap between computational capabilities and memory bandwidth. To reduce this effect, specialized memory is used to grant fast data access through caching. This goes so far that the Larrabee architecture, otherwise built solely with standard CPUs, integrates dedicated texturing hardware to allow better performance for memory bound algorithms [SCS∗08]. However, when providing such dedicated hardware, two main problems occur besides the costs. First, the same data caching strategy is used independent of the algorithm executed by the parallel processing units. This is problematic, since data access patterns vary drastically even among algorithms performing similar tasks. For instance, when implementing volume rendering through texture slicing, the 3D texture representing the volumetric data set is accessed in a slice pattern, while ray-casting accesses the 3D texture in a pattern following each viewing ray. It is obvious that these data access patterns require different caching strategies to minimize cache misses. Second, today's texturing hardware is designed to offer hardware support for linear interpolation only, while in many application areas it can be necessary to apply more complex filters when accessing data [HAM11].

© The Eurographics Association 2012.

Figure 1: Comparison of data reconstruction and performance improvement for the Marschner-Lobb dataset using our method. (a) Trilinear interpolation (speed: -2.5x). (b) Catmull-Rom (speed: 3.3x). (c) B-spline (speed: 3.3x). Hardware trilinear interpolation was used in (a) for comparison, while our method implements it in software, which causes the slowdown.

Within this paper we address these two problems by introducing explicit data caching strategies for memory bound volume rendering algorithms. By using our approach, we can increase rendering performance and improve image quality by enabling the application of higher-order reconstruction and enhancement filters when accessing data. We have chosen volume rendering as our application scenario, as it is known to be memory bound, dealing with often large 3D data sets. To reduce the limitations resulting from the memory bound behavior, we introduce the Marching Caches concept as an explicit caching strategy for state-of-the-art GPU-based volume ray-casting. Marching Caches improve performance by reducing the amount of volume data read from slow memory. This is done by exploiting inter-ray coherence, whereby computational power is traded for memory bandwidth through neighboring rays sharing fast memory which they incrementally update. Thus, it becomes possible to apply higher-order reconstruction and enhancement filters to generate more accurate and enriched renderings with improved rendering performance, as demonstrated in Figure 1. This is mainly achieved by taking advantage of data locality, which has previously been pointed out as being important by Rixner et al. in the development of the Imagine Stream Processor [RDK∗98] [Cor11]. Thus, by moving data closer, in terms of the architecture, to the thread execution, performance can be improved. Within our approach, we utilize a thread's ability to access fast memory regions and use it as an explicit cache, by moving voxel data into the memory that is shared between the group of threads. Moving the data from slow memory into the fast memory regions ensures data is available for reuse by several threads at very low latency, unlike having to read the data again from slow memory or running the risk that data is evicted by the implicit cache.

We have tested our Marching Caches with seven different filters: Catmull-Rom, B-spline, ambient occlusion projection, trilinear, bilateral, gradient and filtered gradient. For some of these filters we were able to obtain a speed-up of four times compared to using the caching implicitly provided by the graphics hardware. Furthermore, the proposed explicit caching reduces the memory bandwidth to global memory by orders of magnitude. Besides these benefits, our explicit caching also allows the programmer, who knows the data access pattern, to reorder computations to make maximum use of the available shared memory. Thus, it allows further improvements for specialized rendering algorithms. While we mainly focus on volume ray-casting, a similar analysis to the one performed within this paper can be applied to other data intensive algorithms and may potentially lead to dedicated explicit caching schemes with similar gains.

2. Previous Work

The concept of tracing several rays together has been used previously in several contexts, including volume rendering [MKW∗02]. Meissner et al. ray cast four rays together to ensure enough computation is available to cover the latency of the compositing pipeline. For the generation of realistic images using interactive ray tracing, the concept of packet tracing [WBWS01] also involves tracing several rays at the same time. In particular, the rays in each packet are forced to travel together, to ensure the same data is used for each ray as it traces its path through the 3D dataset. While originally used in a four-wide configuration to make best use of CPU SIMD engines [WBWS01], it has also been applied to GPU-based ray tracing [HSHH07].
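To make the higher-order reconstruction discussed above concrete, the sketch below evaluates a one-dimensional Catmull-Rom filter. This is an illustrative example, not code from the paper; a 3D filter of this kind is typically applied separably along each axis, touching 4×4×4 = 64 voxels per sample (versus 8 for trilinear interpolation), which is exactly the kind of footprint that makes shared caching pay off.

```python
def catmull_rom_weights(t):
    """Catmull-Rom basis weights for the four samples surrounding a
    lookup position; t in [0, 1) is the fractional offset from the
    second sample. The four weights always sum to one."""
    t2, t3 = t * t, t * t * t
    return [
        0.5 * (-t3 + 2.0 * t2 - t),
        0.5 * (3.0 * t3 - 5.0 * t2 + 2.0),
        0.5 * (-3.0 * t3 + 4.0 * t2 + t),
        0.5 * (t3 - t2),
    ]

def catmull_rom_1d(samples, x):
    """Reconstruct a value at continuous coordinate x from a list of
    samples; needs one sample of padding on each side of x."""
    i = int(x)   # index of the sample just left of x
    t = x - i    # fractional position between samples
    w = catmull_rom_weights(t)
    return sum(wk * samples[i - 1 + k] for wk, k in zip(w, range(4)))
```

Unlike the B-spline, the Catmull-Rom filter interpolates the data exactly at integer coordinates, e.g. `catmull_rom_1d([0, 1, 4, 9, 16], 2.0)` returns `4.0`, while positions between samples blend all four neighbors.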
Figure 2: A vast amount of reads from slow memory, indicated by grey boxes, are necessary when performing filtering for regular ray-casting (a). Our method (b) works by incrementally updating a cache which marches along a ray front. Dark (short dashes) and light (long dashes) green represent fast memory, where only the light green is updated from slow memory. The neighboring rays ri, rj and rk work together to fill up the missing values. Using Marching Caches in this example, ri will read two values from global memory and the other rays one each.
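The sharing scheme of Figure 2 can be captured in a small counting model. The sketch below is illustrative only — the tile size, filter width and step count are made-up parameters, not measurements from the paper. It contrasts regular ray-casting, where every ray re-reads its full filter footprint from global memory at each step, with a marching cache that is filled once and afterwards only refreshes its leading slab, dividing the fetches round-robin among the rays, as in the caption's example where ri reads two values and rj, rk one each.

```python
def naive_reads(tile, steps, w):
    """Regular ray-casting: each ray in a tile x tile group reads its
    full w**3 filter footprint from global memory at every step."""
    return tile * tile * steps * w ** 3

def marching_cache_reads(tile, steps, w):
    """Marching cache: the group shares a cache covering the joint
    footprint of all rays, (tile + w - 1)**2 voxels per slab and w
    slabs deep. It is filled once; each later step fetches only the
    one new slab exposed as the cache marches along the ray front."""
    side = tile + w - 1
    return side * side * w + (steps - 1) * side * side

def cooperative_fill(missing_cells, num_rays):
    """Divide the cells to fetch round-robin among the rays, as in
    Figure 2 where ri reads two values and the other rays one each."""
    loads = [[] for _ in range(num_rays)]
    for i, cell in enumerate(missing_cells):
        loads[i % num_rays].append(cell)
    return loads

# An 8x8 ray group taking 100 steps with a width-4 filter:
print(naive_reads(8, 100, 4))            # 409600 global reads unshared
print(marching_cache_reads(8, 100, 4))   # 12463 global reads shared
print([len(l) for l in cooperative_fill(list(range(4)), 3)])  # [2, 1, 1]
```

With these made-up numbers the shared cache cuts global-memory traffic by roughly a factor of 33; the orders-of-magnitude reductions reported in the paper additionally benefit from the overlap of neighboring samples' filter footprints.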