Efficient Management of Last-level Caches in Graphics Processors for 3D Scene Rendering Workloads

Jayesh Gaur† Raghuram Srinivasan‡∗ Sreenivas Subramoney† Mainak Chaudhuri§

† Intel Architecture Group, Bangalore 560103, INDIA ‡ The Ohio State University, Columbus, OH 43210, USA § Indian Institute of Technology, Kanpur 208016, INDIA

ABSTRACT

Three-dimensional (3D) scene rendering is implemented in the form of a pipeline in graphics processing units (GPUs). In different stages of the pipeline, different types of data get accessed. These include, for instance, vertex, depth, stencil, render target (same as pixel color), and texture sampler data. The GPUs traditionally include small caches for vertex, render target, depth, and stencil data as well as multi-level caches for the texture sampler units. Recent introduction of reasonably large last-level caches (LLCs) shared among these data streams in discrete as well as integrated graphics hardware architectures has opened up new opportunities for improving 3D rendering. The GPUs equipped with such large LLCs can enjoy far-flung intra- and inter-stream reuses. However, there is no comprehensive study that can help graphics cache architects understand how to effectively manage a large multi-megabyte LLC shared between different 3D graphics streams.

In this paper, we characterize the intra-stream and inter-stream reuses in 52 frames captured from eight DirectX game titles and four DirectX benchmark applications spanning three different frame resolutions. Based on this characterization, we propose graphics stream-aware probabilistic caching (GSPC) that dynamically learns the reuse probabilities and accordingly manages the LLC of the GPU. Our detailed trace-driven simulation of a typical GPU equipped with 768 shader thread contexts, twelve fixed-function texture samplers, and an 8 MB 16-way LLC shows that GSPC saves up to 29.6% and on average 13.1% of LLC misses across 52 frames compared to the baseline state-of-the-art two-bit dynamic re-reference interval prediction (DRRIP) policy. These savings in the LLC misses result in a speedup of up to 18.2% and on average 8.0%. On a 16 MB LLC, the average speedup achieved by GSPC further improves to 11.8% compared to DRRIP.

General Terms

Algorithms, design, measurement, performance

Keywords

Caches, graphics processing units, 3D scene rendering

1. INTRODUCTION

High-quality and high-performance 3D scene rendering is central to the success of several important graphics applications. A general 3D rendering pipeline found in a typical GPU consists of a front-end that processes the input geometry by transforming the vertices of each primitive polygon from the local co-ordinates to the perspective view co-ordinates. The back-end of the pipeline assigns color to each pixel within each transformed primitive by possibly blending multiple render targets in the screen-space frame buffer (also known as the back buffer in DirectX¹), applies texture maps to each pixel to bring realism to the scene, and resolves the visible pixels in the frame buffer through a depth test. Today's GPUs routinely carry out hierarchical depth tests and early depth tests to reduce depth buffer bandwidth and eliminate shading of pixels belonging to the occluded parts of a surface [12, 35, 36, 37]. Also, these processors incorporate stencil tests to apply per-pixel masks for realizing sophisticated control over the set of retained or discarded pixels in the rendering pipeline. The stencil buffer stores these masks and is often used together with the depth buffer. In summary, a 3D rendering pipeline generates access streams to different data structures such as vertices of the geometry primitives, hierarchical depth buffer (HiZ buffer), regular depth buffer (Z buffer), render targets (the pixel colors of the surfaces being rendered), texture maps, and stencil buffer.
Categories and Subject Descriptors

B.3 [Memory Structures]: Design Styles

∗Contributed to this work as an intern at Intel India.

¹Rendering takes place in the back buffer. A color or texture surface that needs to be rendered is referred to as a render target. When a frame is completely rendered and ready for presentation, the back buffer is swapped with the front buffer and the front buffer pixels get displayed.

MICRO '46, December 7-11, 2013, Davis, CA, USA
Copyright 2013 ACM 978-1-4503-2638-4/13/12 ...$15.00.
http://dx.doi.org/10.1145/2540708.2540742

Traditionally, the GPUs have included small independent on-die caches for each access stream type. For example, a single level of vertex and vertex index cache, Z cache, render target cache (also known as the color cache), stencil cache, HiZ cache, and multiple levels of texture caches can be found in any typical GPU. We will refer to these caches collectively as render caches. While these small render caches (a few tens to a few hundreds of KB) improve performance by exploiting near-term temporal locality, they fail to offer much in terms of exploiting far-flung reuses even within a frame of animation. Recent discrete and integrated graphics hardware architectures from Nvidia, AMD, and Intel have included reasonably large last-level caches that can be shared by all the data streams.
For example, the Fermi and Kepler architectures from Nvidia respectively include 768 KB and up to 1.5 MB of shared L2 cache [34, 49]. The GPUs used in the AMD 7900, 7800, and 7700 series (Tahiti, Pitcairn, and Cape Verde of the Southern Islands family) designed based on the AMD Graphics Core Next architecture have 64 KB to 128 KB of shared read/write L2 cache attached to each memory controller channel [32]. Finally, Intel's integrated GPUs found in Sandy Bridge, Ivy Bridge, and Haswell share a large multi-megabyte LLC with the CPU cores [9, 21, 22, 39, 44, 52].

Data from different 3D graphics streams can co-exist in such a large shared graphics LLC. An efficiently-managed LLC can offer far-flung intra-stream reuses and significantly accelerate inter-stream reuses, which was impossible in the architectures with an independent cache hierarchy for each different stream.

Inspired by the data in Figure 1, we characterize the reuses observed within each graphics data stream as well as across different streams (Section 2). This characterization shows that Belady's optimal policy benefits from both intra-stream and inter-stream reuses. The inter-stream reuses primarily stem from the process of dynamic texture mapping where the dynamically produced render targets (i.e., color surfaces) get consumed by the texture samplers and reused for texturing other surfaces [14]. Based on this detailed characterization, we systematically derive our proposal of graphics stream-aware probabilistic caching (GSPC), which dynamically learns the reuse probability of a block belonging to a 3D graphics stream and accordingly modulates the RRPV of the block (Section 3).
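The flavor of this idea — per-stream reuse counters driving the insertion age of a block — can be illustrated with a minimal sketch. The stream names, the counter organization, and the 0.5 threshold below are illustrative assumptions for exposition, not the actual GSPC hardware mechanism:

```python
# Hedged sketch: a per-stream reuse-probability estimate chooses the
# insertion RRPV of an RRIP-managed cache. Stream names, counters, and
# the threshold are illustrative assumptions, not the GSPC design itself.
RRPV_BITS = 2
RRPV_MAX = (1 << RRPV_BITS) - 1   # 3: predicted distant (or no) reuse
RRPV_LONG = RRPV_MAX - 1          # 2: predicted intermediate-future reuse

class StreamReuseTracker:
    """Counts fills and subsequent hits per graphics stream to estimate
    the probability that a freshly filled block will be reused."""
    def __init__(self):
        self.fills = {}
        self.reuses = {}

    def record_fill(self, stream):
        self.fills[stream] = self.fills.get(stream, 0) + 1

    def record_hit(self, stream):
        self.reuses[stream] = self.reuses.get(stream, 0) + 1

    def reuse_probability(self, stream):
        fills = self.fills.get(stream, 0)
        return self.reuses.get(stream, 0) / fills if fills else 0.0

def insertion_rrpv(tracker, stream, threshold=0.5):
    """Streams observed to reuse their blocks get a longer expected
    lifetime; low-reuse streams are marked for early eviction."""
    if tracker.reuse_probability(stream) >= threshold:
        return RRPV_LONG
    return RRPV_MAX

tracker = StreamReuseTracker()
for _ in range(100):
    tracker.record_fill("render_target")
for _ in range(70):
    tracker.record_hit("render_target")
for _ in range(100):
    tracker.record_fill("display")

print(insertion_rrpv(tracker, "render_target"))  # frequently reused -> 2
print(insertion_rrpv(tracker, "display"))        # never reused -> 3
```

A display-like stream that is never reused is inserted at the maximum RRPV and leaves the cache quickly, while a render-target-like stream that feeds later texture sampling is retained longer.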
Efficient management of the LLC shared among the 3D graphics streams is the central focus of this paper. This is the first study that characterizes the reuse profile of 3D scene rendering workloads and exploits the reuse behavior of different graphics data to design policies for the LLC of a GPU. Our detailed trace-driven simulation results (Sections 4 and 5) show that GSPC saves 13.1% of LLC misses averaged over 52 discrete frames drawn from twelve DirectX 3D rendering workloads compared to the baseline DRRIP policy. The LLC miss savings achieved by our proposal lead to an average speedup of 8.0% compared to the baseline. For a 16 MB 16-way LLC, the speedup further improves to 11.8%.

Figure 1: Number of LLC misses in twelve DirectX applications for NRU and Belady's optimal policy normalized to two-bit DRRIP in an 8 MB 16-way LLC.

1.1 Related Work

In this section, we review the contributions on the management of LLCs in general-purpose processors and studies exploring memory management in 3D rendering hardware.

1.1.1 General-purpose LLC Management

We focus on three classes of the general-purpose LLC management algorithms that most closely relate to our proposal.
We discuss algorithms for deciding the age of a block on insertion and subsequent hits in the LLC, algorithms for predicting dead blocks, and algorithms for dynamically partitioning the LLC among multiple independent threads in a multi-core setting to meet certain performance, fairness, or quality goals.

To understand the potential of improving the LLC performance for 3D graphics workloads, Figure 1 shows the number of LLC misses for two different LLC replacement policies, namely, single-bit not-recently-used (NRU) and Belady's optimal [2, 33] normalized to the baseline two-bit dynamic re-reference interval prediction (DRRIP) policy [19]. The experiments are conducted on a non-inclusive/non-exclusive 8 MB 16-way LLC shared by all the graphics streams. The results are generated in an offline cache simulator, which takes as input the sequence of load/store accesses to the LLC in a typical GPU with 768 shader thread contexts (96 shader cores×8 threads per core) and twelve fixed-function (i.e., hardwired) texture samplers (one for every eight shader cores). The 3D graphics workloads consist of 52 discrete frames selected from eight DirectX games and four DirectX benchmark applications (details of the applications will be discussed in Section 2).

Dynamic insertion policy (DIP) adaptively inserts a block into the LLC at the least recently used (LRU) or the most recently used (MRU) position of the access recency stack depending on the outcome of a set-sampling-based duel between LRU insertion and MRU insertion policies [40]. On a cache hit, a block is always upgraded to the MRU position. The replacement policy always victimizes the block at the LRU position. This algorithm tries to eliminate the single-use blocks from the LLC as early as possible without disturbing the rest of the contents of the LLC. A subsequent proposal has shown how to
employ this policy in a shared LLC of a multi-core processor so that each thread can choose the best insertion policy [20].

The two-bit DRRIP policy inserts every cache block into the LLC with a re-reference prediction value (RRPV) of two or three, signifying the possibility of a reuse in the intermediate future or no reuse in the near future, respectively. The choice between these two possible insertion RRPVs is made through a dynamic set-dueling technique [40]. On a hit, the RRPV of the block is made zero, signifying a possibility of a reuse in the immediate future. A block with RRPV three is selected for victimization. Such a block is predicted to have no reuse in the immediate or intermediate future. If no such block exists, the RRPVs of all the blocks in the target set are incremented in steps of one until a block with RRPV three is found. Ties are broken by selecting the block with the minimum physical way id.

As shown in Figure 1, the NRU policy performs worse than DRRIP in several applications and increases the LLC miss count by 6.2% on average. Belady's optimal policy saves 36.6% of LLC misses averaged over all the frames compared to the baseline DRRIP policy, pointing to the large opportunity of saving LLC misses, DRAM traffic, and system bandwidth.

In a more recent work, the notions of re-reference interval prediction and re-reference prediction value (RRPV) have been proposed [19]. The age of a block in the LLC is determined based on its RRPV. If n bits are used to store the RRPV, the static re-reference interval prediction (SRRIP) algorithm statically assigns an RRPV of 2^n − 2 to a block on insertion into the LLC. On a hit, the RRPV of the block is updated to zero. A block with RRPV 2^n − 1 is selected as the victim. If there is no such block in the target set, the RRPV of each block in the set is incremented until the RRPV of at least one block attains a value of 2^n − 1. The dynamic re-reference interval prediction (DRRIP) algorithm dynamically chooses between two insertion RRPVs, namely, 2^n − 2 and 2^n − 1, based on the outcome of a set-dueling.
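The SRRIP insertion, promotion, and victim-selection rules described above can be sketched as a simplified single-set model (the set-dueling machinery that turns SRRIP into DRRIP is omitted, and ties are broken arbitrarily rather than by way id):

```python
# Simplified single-set model of SRRIP with n = 2 RRPV bits.
# Insertion uses RRPV 2^n - 2; a hit promotes the block to RRPV 0; the
# victim is any block with RRPV 2^n - 1, aging the set until one exists.
N_BITS = 2
RRPV_MAX = (1 << N_BITS) - 1     # 3
RRPV_INSERT = RRPV_MAX - 1       # 2 (DRRIP would sometimes insert at 3)

class SRRIPSet:
    def __init__(self, ways):
        self.blocks = {}          # tag -> RRPV, at most `ways` entries
        self.ways = ways

    def access(self, tag):
        if tag in self.blocks:    # hit: predict near-immediate reuse
            self.blocks[tag] = 0
            return "hit"
        if len(self.blocks) >= self.ways:
            self._evict()
        self.blocks[tag] = RRPV_INSERT
        return "miss"

    def _evict(self):
        # Age the whole set until some block reaches RRPV_MAX, then
        # evict it (hardware breaks ties by the lowest way id).
        while not any(v == RRPV_MAX for v in self.blocks.values()):
            for t in self.blocks:
                self.blocks[t] += 1
        victim = next(t for t, v in self.blocks.items() if v == RRPV_MAX)
        del self.blocks[victim]

s = SRRIPSet(ways=2)
s.access("A"); s.access("B")      # two misses fill the set at RRPV 2
s.access("A")                     # hit: A's RRPV drops to 0
s.access("C")                     # miss: aging pushes B to 3, B evicted
print(sorted(s.blocks))           # ['A', 'C']
```

The reused block A survives the eviction because its RRPV was reset to zero on the hit, while the never-reused block B ages out first — the behavior that makes RRIP resistant to scans of single-use data.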
Thread-aware DRRIP (TA-DRRIP) applies the technique proposed in [20] to allow multiple independent threads to execute DRRIP in a multi-core shared LLC. Recent proposals exploit signature-based hit prediction (SHiP) to improve the RRIP policies by using the program counters, memory addresses, or code path signatures of the load/store instructions [50] or extend the RRIP policies to the LLCs shared between CPU cores and a GPU running traditional GPGPU-style scientific computation workloads [28]. In contrast to these algorithms, our proposal deals with 3D scene rendering workloads and dynamically modulates the RRPV of a cache block based on the observed reuse probability of the 3D graphics stream the block belongs to. Our policy takes into account the reuse probability within and across the 3D graphics streams.

A large body of work has studied texture cache architectures, including single-level texture caches [13], two-level texture caches [7], texture caches in parallel renderers [16, 47], prefetching into texture caches with deep FIFO structures [1, 17, 26, 45], victim caching [6], four-dimensional and six-dimensional tiling of texture data [13], and customized DRAM architectures for fast texture access [8, 43]. A texture cache architecture with on-the-fly decompression of texture data from a second-level compressed texture cache has also been explored [45]. Reordering of ray-object intersection tests to enhance the locality of geometry and texture has been explored for a ray-tracing-based rendering engine [38]. All these studies have explored optimizations of the 3D graphics workloads and hardware to improve data locality that can be exploited by the individual render caches attached to the different 3D graphics pipeline stages.

The dead block prediction algorithms correlate the program counters of the load/store instructions with the death of the
cache blocks that these instructions touch [15, 23, 24, 25, 27, 29]. These algorithms victimize the predicted dead blocks early to make room for more useful blocks in the LLC. Probabilistic escape LIFO is a light-weight dead block prediction technique that does not require the program counter signature and relies only on the fill order of the cache blocks within a cache set [5]. Simple measures of dynamic reuse probability in conjunction with a clever partitioning of the address space have also been used to effectively identify the dead and live LLC blocks [4, 11]. In this paper, we apply the concept of dynamic reuse probability-based LLC management to the domain of GPUs running 3D scene rendering workloads.

Algorithms have been proposed to explicitly partition the shared LLC among the competing threads of a multi-core processor. The utility-based cache partitioning (UCP) algorithm carries out a coarse-grain partitioning of the LLC by dynamically assigning a number of ways to each thread [41]. The UCP algorithm has been extended to LLCs shared between CPU cores and a GPU where the graphics processor is employed to execute GPGPU-style workloads [28].

However, it is not clear how these workloads can extract the full potential of a large LLC shared between the different pipeline stages generating different streams of accesses with possibly disparate patterns. We explore solutions to this problem by understanding the memory behavior of 3D graphics workloads and incorporating this learning to develop efficient LLC management algorithms.

2. REUSE PROFILE OF 3D RENDERING WORKLOADS

In this section, we present a detailed characterization of the LLC behavior of 52 3D frame rendering jobs drawn from twelve DirectX applications. In the process, we identify the access patterns that can be potentially exploited to improve the efficiency of caching in the LLC. The details of the DirectX applications are presented in Table 1. The first column lists the applications along with the abbreviated names, if any, within parentheses.
Among these, four are benchmark applications, namely, 3D Mark Vantage Graphics Tests 1 and 2 (3DMarkVAGT1 and 3DMarkVAGT2) [53], Unigine Heaven 2.1 and the Unigine 3D engine [46]. The remaining eight applications are 3D games. The second column shows the DirectX version of the applications. The last column lists the frame resolution for each application. In this paper, we consider graphics workloads built with the DirectX APIs (the Direct3D APIs within DirectX) only, since a large number of Windows PC games are designed using the DirectX APIs and these games are routinely used to benchmark the performance of the commercial GPUs.

The promotion/insertion pseudo-partitioning (PIPP) policy improves UCP by designing smart insertion and promotion policies for cache blocks within each partition [51]. Subsequent proposals such as Vantage [42] and PriSM [31] eliminate the limitations of way-grain partitioning and allow each thread to have an arbitrary fine-grained partition. Our proposal does not carry out any explicit partitioning of the LLC among the 3D graphics streams. In fact, since the existing cache partitioning techniques do not take into account cross-thread sharing and treat the threads as independent, these techniques cannot be applied directly to the 3D graphics streams, which have significant inter-stream data sharing. We induce implicit fine-grain partitions among the streams by efficiently managing the RRPVs of the cache blocks based on the intra- and inter-stream reuses and propose for the first time an effective way of improving the LLC performance of the 3D graphics applications.

Table 1: Details of the DirectX applications

Application                          DirectX   Resolution
3D Mark Vantage GT1 (3DMarkVAGT1)    10        1920×1200
3D Mark Vantage GT2 (3DMarkVAGT2)    10        1920×1200
Assassin's Creed (Assn Creed)        10        1680×1050
BioShock                             10        1920×1200
Devil May Cry 4 (DMC)                10        1680×1050
Civilization V (Civilization)        11        1920×1200
Dirt 2 (Dirt)                        11        1680×1050
HAWX 2 (HAWX)                        11        1920×1200
Unigine Heaven 2.1 (Heaven)          11        2560×1600
Lost Planet 2 (Lost Planet)          11        1920×1200
Stalker COP                          11        1680×1050
Unigine 3D engine (Unigine)          11        1920×1200

1.1.2 Memory Management in 3D Rendering

The GPUs have traditionally incorporated caching of polygon vertices (vertex cache), depth buffer (depth cache), frame or color buffer (color or render target cache), and texture (texture or sampler cache). Among these, texture caches [10] have drawn significant attention of the designers due to the high memory bandwidth consumed by the texture mapping process [3]. Texture caching is an attractive solution to reduce the texture memory bandwidth demand. The bilinear, trilinear, and anisotropic filters applied to texture stored in the form of a MIP map pyramid [48] offer a significant amount of spatial and temporal locality. This observation has inspired a large number of studies on texture cache architectures.

We trace the DirectX calls generated while rendering each application frame and replay this trace of calls through a detailed in-house simulator modeling a high-end GPU. The GPU has a non-inclusive/non-exclusive 8 MB 16-way set-associative LLC shared by all the data streams. This LLC configuration corresponds to an integrated GPU of today or a futuristic discrete GPU with a large read/write cache shared by all the graphics data to reduce the bandwidth demand on the DRAM system. A miss in the LLC always fills the requested block into the LLC and the requesting render cache.
An eviction from the LLC does not send invalidation to the internal render caches of the GPU. All the results presented in this section are generated by an offline cache simulator, which has been validated against the LLC of the detailed GPU simulator. The offline model digests the LLC load/store access trace collected from the detailed simulator for each frame.

The pixel shader stage uses the programmable shader threads to compute the color of each pixel in a render target. This is also the stage where the texture data get sampled through specific commands indicating the type of sampling (e.g., point, bilinear, trilinear, anisotropic, etc.) issued to the fixed-function texture sampler units. Certain texturing techniques such as displacement mapping (used to render realistic bumps, crevices, water waves, etc.) may require the vertex shader stage to invoke the texture sampler units. All sampler accesses go through the texture cache hierarchy and the texture cache misses constitute the read-only texture sampler stream to the LLC. The output merger stage carries out the late depth test and stencil test to decide the final set of visible pixels. The stencil buffer accesses go through the stencil (STC) cache and the stencil cache misses constitute the stencil stream to the LLC.

2.1 DirectX Rendering Pipeline

We begin our analysis by understanding the anatomy of the DirectX rendering pipeline and how it interacts with the graphics cache hierarchy. Figure 2 shows the organization of the DirectX 10 rendering pipeline and Figure 3 shows a high-level view of how various render caches interface with the rendering pipeline and the LLC of the GPU. The input assembler stage
reads the scene geometry data such as the vertices and vertex indices from memory (through the LLC and the vertex and vertex index caches) and uses them to assemble the geometric primitives such as triangles, lines, etc. The vertex (VTX) and vertex index (VTX index) cache misses constitute the vertex stream to the LLC. The vertex shader stage uses the programmable shader thread contexts to transform each vertex from the local co-ordinates to the world co-ordinates, from the world co-ordinates to the view space (also known as the camera or eye space) co-ordinates, and finally from the view space to the perspective projection space, which projects each vertex of the input geometry onto a projection plane within the view pyramid of the camera.

The output merger stage also blends the corresponding pixels of multiple non-opaque render targets to generate the final color of the pixels in the back buffer. The render target cache (RT cache) is used to hold the pixel colors of the render targets during creation, before blending, and after blending. DirectX 10 allows eight render targets to be simultaneously bound to the output merger stage and each can have blending enabled or disabled as the application needs. The RT cache misses constitute the render target stream to the LLC. The final displayable pixel color values written to the back buffer constitute the displayable color stream to the LLC. The rendering of one complete frame may require multiple passes through this pipeline. For example, certain blended render targets can be later accessed by the samplers to use them as textures for certain surfaces within the same frame.

The optional geometry shader stage takes the assembled primitives in the perspective projection space, creates or destroys primitives, and writes the newly created vertices to the memory hierarchy through the stream-out stage for later consumption.
Since none of the rendered frames considered in this paper uses this stage, we show the vertex stream as a read-only stream to the LLC.

Figure 2: DirectX 10 rendering pipeline (input assembler → vertex shader → geometry shader → stream output; rasterizer → pixel shader → output merger).

Using a render target as a shader resource for texture sampling is usually known as render to texture [30] or dynamic texturing [14] and constitutes the primary source of inter-stream reuses (from render target production to sampler consumption) in the frames that we consider. Also, the depth buffer contents (when rendered from the light source view) can be consumed by the texture sampler during shadow mapping. However, this particular type of inter-stream reuse is not observed much in the frames that we consider. Shadow can also be implemented by creating a render target holding the depth values of the pixels viewed from each light source and reusing the render target as a texture sampler input (as in render to texture). The DirectX 11 pipeline introduces three new stages between the vertex shader and the geometry shader to carry out hardwired tessellation of the geometry primitives. These stages are hull shader, tessellator, and domain shader.
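The mapping from pipeline stage to render cache to LLC stream described in this section can be restated as a small lookup table. This is purely an illustrative summary of the text; no additional hardware structure is implied, and the labels simply follow the names used above:

```python
# Illustrative restatement of the stage -> render cache -> LLC stream
# taxonomy from the text; no new hardware structure is implied.
STREAMS = {
    "input assembler":       ("VTX / VTX index caches", "vertex stream (read-only)"),
    "rasterizer depth tests": ("HiZ and Z caches",       "HiZ and Z streams"),
    "pixel shader sampling": ("texture caches",          "texture sampler stream (read-only)"),
    "output merger stencil": ("STC cache",               "stencil stream"),
    "output merger blending": ("RT cache",               "render target stream"),
    "back buffer writes":    (None,                      "displayable color stream"),
}

for stage, (cache, stream) in STREAMS.items():
    print(f"{stage:24s} -> {cache or 'no render cache':24s} -> {stream}")
```

A miss in any of the listed render caches becomes an access of the corresponding stream at the shared LLC, which is the view of the traffic used throughout the rest of the characterization.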

The rasterization stage clips the primitives and the portions of the primitives that fall outside the view frustum, transforms each perspective image point to a pixel co-ordinate in the screen space or the back buffer (known as viewport transform), discards back-facing primitives (known as back-face culling), and interpolates the vertex attributes (color, normals, texture co-ordinates, depth value, etc.) of a primitive to calculate the attribute values for each pixel covering the primitive. Also, this is the stage where hierarchical and early depth tests, if enabled, are carried out to discard some of the hidden surfaces. These tests use the HiZ and Z caches, and these cache misses constitute the HiZ and Z streams to the LLC. The early depth test is disabled for those render targets whose pixels may undergo modifications to their depth values in the pixel shader.

Figure 3: Graphics streams and the LLC interface. The VTX, VTX index, HiZ, RT, Z, STC, and texture caches feed the vertex, HiZ, render target, Z, stencil, and texture streams into the graphics processor LLC, which connects to the DRAM banks.

2.2 Graphics Data Streams

This section analyzes the 3D graphics data streams and identifies the ones that have the biggest impact on the overall LLC performance. Figure 4 shows the stream-wise distribution of the accesses to the LLC. Across the board, the major fraction of the LLC traffic is contributed by the render target and texture sampler accesses. On average, these two streams constitute 40% and 34% of the LLC accesses, respectively. The only other stream that contributes at least 10% of the LLC accesses on average is the Z stream. The vertex and HiZ streams have 4% and 7% of the LLC accesses on average, respectively. The remaining 5% of the LLC accesses come from stencil, display, and other accesses such as shader code, constants, etc. Based on this distribution, in the remainder of this section, we analyze the texture sampler, render target, and Z accesses in greater detail.

2.3 Inter- and Intra-stream Reuse Analysis

We first explore the nature of the reuses enjoyed by the texture sampler accesses to the LLC. Textures are either created a priori and not modified during rendering (static texture) or

created during rendering as render targets and possibly updated in every frame (dynamic texture). Dynamic texture is the primary source of the inter-stream reuses seen in the texture sampler accesses to the LLC (from render target production to texture sampler consumption). To identify these reuses, we tag every render target block in the LLC with an RT bit. When such a block gets consumed by the texture sampler stream from the LLC, the bit is reset and the LLC hit is counted toward an inter-stream reuse. The RT bit is also reset when a render target block gets evicted from the LLC (we are not interested in tracking render target to texture reuses that do not happen in the LLC). Apart from these inter-stream reuses, both static and dynamic texture can undergo intra-stream reuses, i.e., a block that does not have the RT bit set may get reused from the LLC by the texture samplers before it gets evicted.

Figure 4: Stream-wise distribution of the LLC accesses in an 8 MB 16-way LLC.

The display stream is the end-result of rendering of a frame, is consumed by the display driver, and does not enjoy any reuse. In Section 5, we will explore the performance benefit of not caching this stream in the LLC.
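The RT-bit bookkeeping used for this classification can be sketched as a minimal model of one cache. The class and field names are illustrative; the mechanism (set on render target fill, cleared on the first sampler consumption or on eviction) follows the text:

```python
# Minimal model of the RT-bit classification described above: render
# target fills set the bit; a sampler hit on a set bit counts as an
# inter-stream reuse and clears it; eviction also clears it.
class ReuseClassifier:
    def __init__(self):
        self.rt_bit = {}      # block address -> RT bit
        self.inter = 0        # render target -> texture reuses
        self.intra = 0        # texture -> texture reuses

    def fill(self, addr, stream):
        self.rt_bit[addr] = (stream == "render_target")

    def sampler_hit(self, addr):
        if self.rt_bit.get(addr):
            self.inter += 1   # first sampler touch of a render target block
            self.rt_bit[addr] = False
        else:
            self.intra += 1

    def evict(self, addr):
        self.rt_bit.pop(addr, None)

c = ReuseClassifier()
c.fill(0x100, "render_target")
c.fill(0x140, "texture")
c.sampler_hit(0x100)   # inter-stream: produced as RT, consumed as texture
c.sampler_hit(0x100)   # a subsequent sampler hit is intra-stream
c.sampler_hit(0x140)   # static texture reuse is intra-stream
print(c.inter, c.intra)   # 1 2
```

Note that after the first sampler consumption the block is treated as a texture block, so later hits count as intra-stream reuses, matching the epoch bookkeeping introduced below.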

Figure 5: Hit rates for the texture sampler, render target, and Z accesses in an 8 MB 16-way LLC.

Figure 6: Upper panel: classification of the texture sampler reuses into inter-stream and intra-stream. Lower panel: percentage of the render targets consumed by the texture sampler stream through LLC hits.

Figure 5 presents the LLC hit rates enjoyed by the texture sampler (topmost panel), render target (middle panel), and Z (bottom panel) accesses for three different policies, namely, Belady's optimal policy, DRRIP, and NRU. For the texture sampler stream, Belady's optimal policy experiences an average hit rate of 53.4%. Across the board, DRRIP and NRU deliver significantly lower hit rates averaging at 22.0% and 18.4%, respectively. On the other hand, for the render target accesses, DRRIP delivers an average hit rate (50.1%) that is within 10% of Belady's optimal (59.8% average). In this case, however, NRU lags significantly and delivers an average hit rate of only 41.5%.

The upper panel of Figure 6 shows the texture sampler hits in the LLC classified into inter- and intra-stream hits for Belady's optimal policy, DRRIP, and NRU normalized to the number of texture sampler hits enjoyed by Belady's optimal policy. On average, 55% of all texture sampler hits enjoyed by Belady's optimal policy come from inter-stream reuses, while DRRIP and NRU fall significantly short across the board. The lower panel of Figure 6 further explores these inter-stream reuses by presenting the percentage of blocks with the RT bit set that are consumed by the texture sampler from the LLC. On average, 51% of all render target blocks are consumed by the texture samplers for Belady's optimal policy, while DRRIP and NRU achieve only 16% and 13%, respectively. In Assassin's Creed, the potential render target to texture consumption rate is as high as 90%, but only a small portion of it materializes in DRRIP and NRU (21% and 15%, respectively). For the Z accesses, DRRIP and NRU perform similarly
on average, exhibiting a hit rate of about 58%, while Belady's optimal policy achieves a hit rate of 77.1%. From these results, it is clear that a properly managed large LLC can significantly reduce the volume of DRAM accesses in the GPUs. The largest performance improvement can come from optimizing the texture sampler accesses. In the next section, we delve deeper into the analysis of reuses enjoyed by these three streams within each frame and identify the avenues for improvement.

In summary, DRRIP and NRU are not very efficient in facilitating render target to texture consumption through the LLC. A good policy should manage the render target blocks in such a way that maximizes the texture reuses. Without specific hints from the driver, it is not possible to know which render targets will get consumed as textures in the future. Therefore, our proposal will treat all render targets as potential texture sources and manage them based on the observed probability of inter-stream consumption.

Next, we analyze the intra-stream reuses in the texture sampler accesses. We divide the life of a cache block in the LLC, from the time it is filled to the time it is evicted, into epochs demarcated by the LLC hits the block enjoys. The epoch between hit counts k and k + 1 will be denoted by E_k. The first epoch is E_0, which a block enters after it is filled into the LLC. Also, if a block with the RT bit set gets consumed by the texture sampler, the block becomes a texture block and enters E_0. A block currently in E_k gets promoted to E_{k+1} on observing an LLC hit. Naturally, the set of blocks in E_k is a subset of those in E_{k-1} for all k ≥ 1. The death ratio of epoch E_k is defined as (|E_k| − |E_{k+1}|)/|E_k|. This is the fraction of blocks in E_k that get evicted from the LLC and fail to make it to E_{k+1}. The ratio |E_{k+1}|/|E_k| will be referred to as the reuse probability of E_k. The goal of a good LLC management policy would be to assign high victimization priority to the blocks belonging to the epochs with high death ratios. In the following, we explore the death ratio of the epochs within the texture sampler stream.

The E_2 blocks, in contrast, retain a reasonably high reuse probability (nearly half), with an average death ratio of 0.53, meaning that a randomly picked texture block from E_2 is almost equally likely to enjoy at least one more hit or get evicted from the LLC. In several applications, the reuse probability of the E_2 blocks is higher than half. In summary, a good policy must differentiate between the E_0 and E_1 texture blocks and handle them differently than the blocks belonging to the higher epochs.
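The epoch bookkeeping defined above can be computed offline from per-block hit counts, as in this sketch. The trace below is hypothetical; the only assumption is that a block which received h hits before eviction passed through epochs E_0 through E_h:

```python
# Offline computation of epoch death ratios and reuse probabilities as
# defined in the text: |E_k| counts the blocks that received at least
# k hits, so death ratio of E_k is (|E_k| - |E_{k+1}|) / |E_k|.
from collections import Counter

def epoch_stats(hits_per_block, max_epoch=3):
    counts = Counter()
    for h in hits_per_block:
        for k in range(min(h, max_epoch) + 1):
            counts[k] += 1            # block was a member of E_k
    stats = {}
    for k in range(max_epoch):
        if counts[k]:
            reuse = counts[k + 1] / counts[k]
            stats[k] = {"reuse_probability": reuse,
                        "death_ratio": 1.0 - reuse}
    return stats

# Hypothetical eviction-time hit counts for ten texture blocks.
trace = [0, 0, 0, 0, 1, 1, 2, 2, 3, 4]
s = epoch_stats(trace)
print(s[0]["death_ratio"])   # 4 of 10 blocks never saw a hit
```

With this trace, |E_0| = 10, |E_1| = 6, |E_2| = 4, and |E_3| = 2, so the E_0 death ratio is 0.4 and the E_2 reuse probability is 0.5 — the kind of per-epoch profile the policy discussion above is built on.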
While we can assume the E2 block currently in Ek gets promoted to Ek+1 on observing an blocks to be mostly live, the reuse probabilities of the E0 and LLC hit. Naturally, the set of blocks in Ek is a subset of those E1 texture partitions should be learned dynamically, since they in Ek−1 for all k ≥ 1. The death ratio of epoch Ek is defined as vary widely across the applications, E1 in particular. (|Ek| − |Ek+1|)/|Ek|. This is the fraction of blocks in Ek that To further understand the low texture hit rate of two-bit get evicted from the LLC and fail to make it to Ek+1. The DRRIP, Figure 8 presents the percentage of the render tar- ratio |Ek+1|/|Ek| will be referred to as the reuse probability of get blocks and texture blocks filled by DRRIP into the LLC Ek. The goal of a good LLC management policy would be to with RRPV equal to three. These blocks are predicted to have assign high victimization priority to the blocks belonging to the no reuses in the intermediate- or near-future. While DRRIP epochs with high death ratios. In the following, we explore the correctly learns that the texture blocks have high death ratios death ratio of the epochs within the texture sampler stream. and inserts, on average, 36% of these blocks with RRPV=3, our analysis shows that this percentage needs to be much higher to 1 0.9 prevent the texture blocks from thrashing the LLC. Also, on a 0.8 hit to a texture block, DRRIP promotes it to RRPV zero ex- 0.7 0.6 Epoch>=3 pecting a reuse in the near-future. However, our analysis shows 0.5 Epoch=2 0.4 Epoch=1 that the E1 texture blocks have very low optimal reuse prob- 0.3 Epoch=0 ability (0.27 on average). Turning to the render target blocks, 0.2 0.1 we find that about one quarter of these are filled with RRPV 0
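The epoch accounting defined above can be made concrete with a short sketch. The helper below is illustrative only; the input `hit_counts` (the number of LLC hits each block received between fill and eviction) is a hypothetical trace, not data from the paper:

```python
def death_ratios(hit_counts, max_epoch=3):
    """Per-epoch death ratios from a list of per-block hit counts.

    A block that enjoyed h hits passed through epochs E0..Eh and was
    evicted while in epoch Eh, so |Ek| is the number of blocks that
    received at least k hits.
    """
    hit_counts = list(hit_counts)
    # |Ek| for k = 0 .. max_epoch+1
    population = [sum(1 for h in hit_counts if h >= k)
                  for k in range(max_epoch + 2)]
    ratios = {}
    for k in range(max_epoch + 1):
        if population[k] > 0:
            # death ratio of Ek = (|Ek| - |Ek+1|) / |Ek|;
            # the reuse probability of Ek is the complement |Ek+1| / |Ek|
            ratios[k] = (population[k] - population[k + 1]) / population[k]
    return ratios
```

For example, five blocks with hit counts [0, 0, 0, 1, 2] give death ratios {0: 0.6, 1: 0.5, 2: 1.0}: three of the five blocks die in E0, one of the remaining two dies in E1, and the last dies in E2.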

Figure 7: Upper panel: epoch-wise distribution of the intra-stream texture sampler hits. Lower panel: death ratio of each epoch of the texture blocks. The LLC executes Belady's optimal policy.

The upper panel of Figure 7 shows the epoch-wise distribution of the intra-stream texture hits when the LLC executes Belady's optimal policy. Across the board, most of the intra-stream texture sampler hits come from E0, averaging 79% of all texture sampler hits. A much smaller fraction (15%) of the hits comes from the E1 blocks. For the E2 and E≥3 blocks, this fraction is 4% and 2%, respectively. Since this is the behavior of the optimal policy, we conclude that it is enough to keep track of the first three epochs (0, 1, and 2) of the texture blocks. These data, however, do not offer any information about the reuse probability or death ratio of the individual epochs.

The lower panel of Figure 7 presents the death ratio of each of the first three epochs of the texture blocks in the presence of Belady's optimal policy. Even though we have seen that most texture sampler hits come from the E0 blocks, the death ratio of E0 is extremely high, averaging 0.81. This essentially means that the reuse probability of an E0 texture block is only 0.19. Across the board, we find that the E0 texture blocks have very low (at most 0.3) reuse probability. The death ratio of the E1 blocks is only slightly lower than that of the E0 blocks, averaging 0.73. This means that the E0 blocks that enjoy hits and move to E1 are highly likely to get evicted before enjoying any further hits. For Assassin's Creed, BioShock, DMC, HAWX, and Lost Planet, the death ratio of the E1 blocks is, in fact, higher than that of the E0 blocks. Only the E2 blocks show a reasonably high reuse probability (nearly half) with an average death ratio of 0.53, meaning that a randomly picked texture block from E2 is almost equally likely to enjoy at least one more hit or to get evicted from the LLC. In several applications, the reuse probability of the E2 blocks is higher than half. In summary, a good policy must differentiate between the E0 and E1 texture blocks and handle them differently from the blocks belonging to the higher epochs. While we can assume the E2 blocks to be mostly live, the reuse probabilities of the E0 and E1 texture partitions should be learned dynamically, since they vary widely across the applications, E1 in particular.

To further understand the low texture hit rate of two-bit DRRIP, Figure 8 presents the percentage of the render target blocks and texture blocks filled by DRRIP into the LLC with RRPV equal to three. These blocks are predicted to have no reuses in the intermediate or near future. While DRRIP correctly learns that the texture blocks have high death ratios and inserts, on average, 36% of these blocks with RRPV=3, our analysis shows that this percentage needs to be much higher to prevent the texture blocks from thrashing the LLC. Also, on a hit to a texture block, DRRIP promotes it to RRPV zero, expecting a reuse in the near future. However, our analysis shows that the E1 texture blocks have a very low optimal reuse probability (0.27 on average). Turning to the render target blocks, we find that about one quarter of these are filled with RRPV equal to three. This may be detrimental to the inter-stream reuses because the render targets that have the potential to be consumed by the samplers may get evicted from the LLC early. Our analysis shows that under the optimal policy about half of the render targets are consumed by the sampler on average, and in a few applications this fraction is much higher (lower panel of Figure 6). The render targets should be offered high protection when there is a high probability of render target consumption from the LLC by the texture samplers.

Figure 8: Percentage of the render target and texture fills with RRPV=3 in two-bit DRRIP.
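The RRPV machinery referred to throughout (fills at RRPV two or three, promotion to zero on a hit, victimization at RRPV three with aging) can be sketched for a single set as follows. This is a generic two-bit RRIP model for illustration, not the simulator used in the paper, and the tie-break on the minimum tag merely stands in for the least-way-id rule:

```python
class RRIPSet:
    """One set of a two-bit RRIP cache (RRPV values 0..3)."""

    def __init__(self, ways):
        self.ways = ways
        self.rrpv = {}  # tag -> RRPV

    def access(self, tag, insert_rrpv=2):
        """Returns True on a hit, False on a miss (with fill)."""
        if tag in self.rrpv:
            self.rrpv[tag] = 0            # hit: predict near-future reuse
            return True
        if len(self.rrpv) == self.ways:   # miss in a full set: pick a victim
            while not any(v == 3 for v in self.rrpv.values()):
                for t in self.rrpv:       # age every block until some block
                    self.rrpv[t] += 1     # reaches RRPV three
            # deterministic tie-break (stand-in for least physical way id)
            victim = min(t for t, v in self.rrpv.items() if v == 3)
            del self.rrpv[victim]
        self.rrpv[tag] = insert_rrpv      # fill
        return False
```

In a two-way set, accessing a, b, a, c evicts b: the hit on a promotes it to RRPV zero, so the aging loop drives b to RRPV three first.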

Before closing this section, we present the epoch-wise death ratios for the Z stream blocks under Belady's optimal policy in Figure 9. The trend in the death ratio of the Z blocks differs significantly from that of the texture blocks. The death ratios of the E0, E1, and E2 blocks are 0.61, 0.38, and 0.26, respectively. Since only the E0 Z blocks have a relatively high death ratio, we do not keep track of the epochs for the Z stream. Instead, we maintain the collective reuse probability experienced by all the Z blocks and use it to decide their insertion RRPV.

Figure 9: Death ratio of each epoch of the Z stream blocks with Belady's optimal policy.

3. STREAM-AWARE LLC MANAGEMENT

In this section, we progressively incorporate the observations from the last section to derive three increasingly better LLC management policies for 3D graphics. We partition the LLC accesses into four streams, namely, Z, texture sampler, render target, and the rest. Each LLC access is tagged with the identity of the source cache (Z, texture, render target, or otherwise), but we do not need to store the stream identity of any block (except the render targets) in the LLC. All our policies dedicate sixteen sets in every 1024 LLC sets to learning the various reuse probabilities pertaining to the streams. These sets are identified by simple Boolean functions on the LLC index bits. We will refer to these sets as samples. The samples always execute the two-bit static re-reference interval prediction (SRRIP) policy for all the streams. In other words, a block is filled into a sample with RRPV equal to two. On a hit to a block in a sample, the RRPV of the block is updated to zero. A block with RRPV three is selected as the victim, with ties broken by victimizing the block with the least physical way id. In fact, in all our policies, the victim selection algorithm is the same as this. Since the samples execute the SRRIP policy, they are expected to experience reuse probabilities that are much lower than what Belady's optimal policy could have experienced. These small reuse probabilities detected in the samples must be amplified in the non-samples by controlling the RRPV of the non-sample blocks. Different reuse probability amplification techniques form the crux of our policy proposals, which we discuss next. Table 2 summarizes the activities of the LLC sample sets.

Table 2: Activities of the LLC sample sets
- Update RRPV and select victims based on the SRRIP policy.
- Learn the reuse probabilities of the different graphics streams by updating a few counters on fills and hits to the sample sets. These probabilities are amplified by our policy proposals in the non-sample sets of the LLC.

Our first policy proposal incorporates rudimentary probabilistic caching for the Z and texture sampler streams. Each LLC block is assumed to have an RT bit to identify the render target blocks. This bit is set on a render target access or fill and reset on a texture sampler consumption or LLC eviction. We associate four eight-bit saturating counters with each LLC bank. These will be referred to as FILL(Z), HIT(Z), FILL(TEX), and HIT(TEX). As the names suggest, the FILL(Z) counter counts the number of Z stream fills into the samples. The HIT(Z) counter counts the number of Z stream hits enjoyed by the samples. Similarly, we define the FILL(TEX) and HIT(TEX) counters. The FILL(TEX) counter is also incremented when a texture sampler access gets satisfied from the LLC by a block with the RT bit set. The HIT(TEX) counter is incremented when a texture sampler access gets satisfied from the LLC by a block with the RT bit reset. There is a separate seven-bit counter ACC(ALL) that counts all accesses to the samples. Whenever this counter saturates, the FILL and HIT counters of the Z and texture sampler streams are halved and the ACC(ALL) counter is reset.

When filling a new Z block into a non-sample set of an LLC bank, the reuse probability of the Z stream is checked in that bank. If this probability is below the threshold 1/(t+1), i.e., FILL(Z) > t·HIT(Z), the new block is filled with RRPV of three; otherwise it is filled with RRPV of two. Similarly, when filling a texture block into a non-sample set, if FILL(TEX) > t·HIT(TEX), the block is inserted with RRPV of three; otherwise the texture block is filled with RRPV zero, because filling it with RRPV two hurts performance. All render target blocks are filled into the non-samples with RRPV zero to give them the highest possible protection so that the render target to texture sampler reuses can happen through the LLC. All other blocks are filled with RRPV two. On a hit, the RRPV of the block is updated to zero irrespective of the stream. We will refer to this policy as graphics stream-aware probabilistic Z and texture caching (GSPZTC). Table 3 summarizes all the actions of the LLC controller (except the halving of the counters) relevant to this policy. We discuss the influence of the parameter t in Section 5. Since we choose t to be a power of two, the RRPV computation requires only a left-shift by a constant amount followed by a comparison and a two-to-one multiplexing.

Table 3: LLC actions in the GSPZTC policy
Event | Action
Sample sets
Z fill | RRPV ← 2, FILL(Z)++, ACC(ALL)++
Z hit | RRPV ← 0, HIT(Z)++, ACC(ALL)++
TEX fill | RRPV ← 2, FILL(TEX)++, ACC(ALL)++
RT → TEX hit | RRPV ← 0, FILL(TEX)++, ACC(ALL)++
TEX hit | RRPV ← 0, HIT(TEX)++, ACC(ALL)++
Other fill | RRPV ← 2, ACC(ALL)++
Other hit | RRPV ← 0, ACC(ALL)++
Non-sample sets
Z fill | RRPV ← (FILL(Z) > t·HIT(Z)) ? 3 : 2
TEX fill | RRPV ← (FILL(TEX) > t·HIT(TEX)) ? 3 : 0
RT fill | RRPV ← 0
Other fill | RRPV ← 2
Any hit | RRPV ← 0

To compare with this policy, we derive a graphics stream-aware DRRIP (GS-DRRIP) policy from the thread-aware DRRIP policy [19], which uses set-dueling to decide between the insertion RRPVs of two and three for each of the four streams.
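The non-sample insertion rules of the GSPZTC policy, including the shift-based threshold test, can be sketched as below. The function and parameter names are ours, not the paper's:

```python
def gspztc_fill_rrpv(stream, fill, hit, log2_t=3):
    """Insertion RRPV for a fill into a non-sample set under GSPZTC.

    fill, hit: the per-bank FILL/HIT counters learned in the sample
    sets for this stream. With t = 2**log2_t, the test FILL > t*HIT
    (i.e., reuse probability below 1/(t+1)) reduces to a left shift
    followed by a comparison.
    """
    low_reuse = fill > (hit << log2_t)
    if stream == 'Z':
        return 3 if low_reuse else 2
    if stream == 'TEX':
        return 3 if low_reuse else 0   # RRPV two hurts texture performance
    if stream == 'RT':
        return 0                       # always protect render targets
    return 2                           # all other streams
```

With the default t = 8, a bank that recorded FILL(TEX) = 100 and HIT(TEX) = 10 inserts new texture blocks with RRPV three (100 > 80), while HIT(TEX) = 20 would yield RRPV zero.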
Our second policy proposal refines GSPZTC by incorporating the texture sampler epochs, while keeping everything else unchanged. To keep track of the E0, E1, and E≥2 epochs, we incorporate two state bits with each LLC block. The epochs E0, E1, and E≥2 are denoted by the states 00, 01, and 10. The state 11 replaces the RT bit, i.e., a render target block is identified by the state 11. When a texture sampler access gets satisfied by a render target block in the LLC, its state changes from 11 to 00. Also, when a texture sampler access misses the LLC, the newly filled block is installed with state 00. Every subsequent texture sampler hit to such a block increases the state by one (done through simple logic as opposed to an incrementer) until the state reaches 10, which remains unchanged until the block is evicted from the LLC. The transitions between these states are shown in Figure 10. In a few cases, the LLC may receive a render target access to a block in the state 00, 01, or 10. Such a situation may arise if an existing render target object is reused by the DirectX application for producing a new render target. In these cases, the block transitions to the state 11 and its RRPV is updated according to the RRPV update rule for a render target hit (same as shown in Table 3). These transitions are omitted from Figure 10 for brevity.

Figure 10: Two additional state bits per LLC block.

We replace the FILL(TEX) and HIT(TEX) counters with four saturating counters per LLC bank: FILL(E,TEX) and HIT(E,TEX), each of size eight bits, where E denotes the epoch and can take the values zero and one (as already discussed, we maintain the reuse probabilities of the first two epochs only). The FILL(E,TEX) counter is incremented when a texture sampler request to an LLC sample set causes the accessed block to enter the state 00 or 01, denoting epoch E = 0 or 1, respectively. The HIT(E,TEX) counter is incremented when a texture sampler request to an LLC sample set is satisfied by a block currently in the state 00 or 01, denoting epoch E = 0 or 1, respectively.

When a texture sampler access to a non-sample set of a particular LLC bank causes the accessed block to enter the state 00 (this happens on a texture fill or a render target to texture hit), the RRPV of the block is set to three if FILL(0,TEX) > t·HIT(0,TEX) in that bank; otherwise the RRPV is set to zero. If the accessed block enters the state 01 (this happens only on a texture sampler hit in the state 00), the RRPV of the block is set to three if FILL(1,TEX) > t·HIT(1,TEX); otherwise the RRPV is set to zero. In all other cases, a texture sampler hit updates the RRPV of the block to zero. We will refer to this policy as graphics stream-aware probabilistic Z and texture caching with texture sampler epochs (GSPZTC+TSE). Table 4 summarizes the additional LLC controller actions relevant to this policy on top of the GSPZTC policy. In this table, we have shortened FILL(E,TEX) and HIT(E,TEX) to FILL(E) and HIT(E), respectively. The primary difference of this policy from DRRIP or GS-DRRIP (other than not resorting to set-dueling) is that on a texture sampler hit, it does not always update the RRPV to zero, but deduces the new RRPV from the reuse probability of the new epoch.

Table 4: LLC actions for the texture sampler epochs
Event | Action
Sample sets
RT fill/RT hit | state ← 11
TEX fill | FILL(0)++, state ← 00
RT → TEX hit | FILL(0)++, state ← 00
TEX hit | If state is 00 { HIT(0)++, FILL(1)++, state ← 01 }; Else if state is 01 { HIT(1)++, state ← 10 }; Else { state ← 10 }
Non-sample sets
RT fill/RT hit | RRPV updated as in Table 3, state ← 11
TEX fill | RRPV ← (FILL(0) > t·HIT(0)) ? 3 : 0, state ← 00
RT → TEX hit | RRPV ← (FILL(0) > t·HIT(0)) ? 3 : 0, state ← 00
TEX hit | If state is 00 { RRPV ← (FILL(1) > t·HIT(1)) ? 3 : 0, state ← 01 }; Else { RRPV ← 0, state ← 10 }

The GSPZTC and GSPZTC+TSE policies fill the render target blocks with RRPV zero. This is done to facilitate render target to texture sampler reuses through the LLC. However, such a static policy unnecessarily increases the effective cache occupancy of the render target blocks in situations where such inter-stream reuses are unlikely. Our third policy incorporates a dynamic mechanism to manage the render target blocks on top of GSPZTC+TSE by observing the probability of consuming the render targets as textures. We associate two new eight-bit saturating counters, PROD and CONS, with each LLC bank. The PROD counter is incremented when a render target block is filled into an LLC sample set. The CONS counter is incremented when a texture sampler access to an LLC sample set hits a block in the state 11 (recall that this state identifies a render target block). At this time the state of the block changes to 00, as already discussed in Table 4. The CONS/PROD ratio is an estimate of the probability that a render target block is consumed by the texture sampler from the LLC. The PROD and CONS counters are halved together with the FILL and HIT counters within each LLC bank.

When a render target block is filled into a non-sample set of a particular LLC bank, the inter-stream reuse probability in that bank is consulted. The new block is filled with RRPV three if PROD > 16·CONS, i.e., if the inter-stream reuse probability is below 1/16. The new block is filled with RRPV two if 16·CONS ≥ PROD > 8·CONS; otherwise, i.e., if the render target to texture reuse probability is at least 1/8, the block is filled with RRPV zero. On a hit to a render target block from a render target access (due to blending operations), the RRPV of the block is always updated to zero.

The reuse probability thresholds (i.e., 1/16 and 1/8) need to be small because we detect these reuse probabilities from the LLC samples, which execute SRRIP. The lower panel of Figure 6 shows that even DRRIP, which is expected to be better than SRRIP, has an average inter-stream reuse probability of 0.16. If our policy detects a reuse probability as small as 1/8 in the samples, it tries to amplify this probability in the non-samples by offering the highest possible protection to the render target blocks. If the detected reuse probability is smaller, our policy offers a lower level of protection to the render target blocks. We will refer to this policy as graphics stream-aware probabilistic caching (GSPC). Table 5 summarizes the new LLC controller actions (except the halving of the PROD and CONS counters) relevant to this policy. On top of two-bit DRRIP, this policy requires two state bits per LLC block and eight eight-bit and one seven-bit saturating counters per LLC bank (two for Z, four for texture sampler, two for render target to texture, and one to count all accesses to the LLC sample sets).

Table 5: Additional LLC actions for the GSPC policy
Event | Action
Sample sets
RT fill | state ← 11, PROD++
RT hit (blending) | state ← 11
RT → TEX hit | state ← 00, CONS++
Non-sample sets
RT fill | state ← 11; If PROD > 16·CONS then RRPV ← 3; Else if 16·CONS ≥ PROD > 8·CONS then RRPV ← 2; Else RRPV ← 0
RT hit (blending) | state ← 11, RRPV ← 0
RT → TEX hit | state ← 00, RRPV updated as in Table 4
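The graded render target insertion of the GSPC policy can be sketched as a direct transcription of the three threshold cases; the function name is ours:

```python
def gspc_rt_fill_rrpv(prod, cons):
    """Insertion RRPV for a render target fill into a non-sample set
    under GSPC, from the per-bank PROD/CONS sample counters."""
    if prod > 16 * cons:   # estimated RT-to-texture reuse probability < 1/16
        return 3
    if prod > 8 * cons:    # probability between 1/16 and 1/8
        return 2
    return 0               # probability at least 1/8: maximum protection
```

Note that the second test does not need to re-check 16·CONS ≥ PROD: the first branch has already filtered out that case, which is what makes the hardware comparison chain short.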
4. SIMULATION ENVIRONMENT

We evaluate our proposal on 52 discrete frames captured from eight DirectX game titles and four DirectX benchmark applications. The details of these applications were presented in Table 1. We simulate the rendering of each frame entirely, capturing several distinct phase changes that occur as rendering progresses. The DirectX calls generated while rendering each of these frames are replayed through a detailed GPU simulator. In this paper, we focus on the GPU architectures that dedicate the entire LLC capacity to caching different graphics data, as found in the discrete GPUs. We simulate a high-end GPU with 96 shader cores clocked at 1.6 GHz. Each core has eight thread contexts. Every cycle, one SIMD instruction each from two selected threads is issued to the two parallel ALU pipelines within each core. Each pipeline can execute four-wide single-precision SIMD operations including multiply-accumulates. Each core has a peak throughput of sixteen single-precision floating-point operations every cycle, leading to an aggregate peak throughput of nearly 2.5 TFLOPS across the 96 shader cores. The microarchitecture of each shader core resembles that of the Gen7 core used in Intel's Ivy Bridge GPU [21], but we configure our simulated GPU such that the aggregate peak shader throughput either exceeds or closely matches that of the recent commercial discrete GPUs.2 The shader cores share twelve texture samplers clocked at 1.6 GHz. Each sampler has a throughput of four 32-bit texels per cycle, leading to a peak texture fill rate of 76.8 GTexels/second. We model a three-level texture cache hierarchy with the third-level texture cache being 384 KB 48-way set-associative with 64-byte blocks. In addition to the texture cache hierarchy, we model a 1 KB 16-way vertex index cache, a 16 KB 128-way vertex cache, a 12 KB 24-way HiZ cache, a 16 KB 16-way stencil cache, a 24 KB 24-way render target cache, and a 32 KB 32-way Z cache.

The GPU has a non-inclusive/non-exclusive 8 MB 16-way set-associative LLC. This LLC capacity corresponds to a futuristic discrete GPU with a large read/write cache shared by all graphics data. We expect the LLC capacity of the GPUs to increase in the future, as the LLC is usually more power- and bandwidth-efficient up to a certain capacity compared to the GDDRx DRAM.3 Our default LLC configuration allows caching of all graphics data. We will also explore the impact of not caching the final displayable color data. The LLC has a block size of 64 bytes and runs at 4 GHz (modeled after the 3.9 GHz LLC of the Intel Core i7-3770 processor [18]). The round-trip load-to-use latency of an LLC bank (2 MB per bank) is a minimum of twenty cycles. The large, fast, banked LLC helps absorb a significant amount of the DRAM bandwidth demand. We model a dual-channel DDR3-1600 memory system. The DRAM part is 8-way banked with a burst length of eight and 15-15-15 (tCAS-tRCD-tRP) latency parameters.

For a four-way banked 8 MB 16-way set-associative LLC with 64-byte blocks, our GSPC policy incurs an additional overhead of 32 KB in two state bits per LLC block and 284 bits in saturating counters (see Section 3) on top of the baseline DRRIP policy. This is less than 0.5% of the LLC data array bits.

2 The aggregate peak shader throughput of our simulated GPU exceeds that of the GeForce GTX 760 and closely matches that of the GeForce GTX 780M, both based on the GK104 Kepler architecture from Nvidia.
3 Nvidia has doubled the shared L2 cache capacity in the highest-end Kepler GPUs compared to the Fermi GPUs.

5. SIMULATION RESULTS

We evaluate our proposal in this section. First, we present a detailed performance and hardware overhead analysis in Sections 5.1, 5.2, and 5.3. Next, we evaluate our best proposal on different memory system and GPU configurations in Section 5.4 to understand its sensitivity to changed environments.

5.1 Analysis of LLC Misses

Our policy proposals have a threshold parameter t that must be fixed first. As we have already mentioned in Section 3, we expect 1/(t+1) to be small because this probability is detected from the SRRIP samples. Figure 11 shows the percent change in the number of LLC misses relative to t = 16 as t is varied. We only consider power-of-two values of t to keep the implementation simple. These data are collected for the GSPZTC policy. A positive (negative) change indicates more (fewer) LLC misses. While on average the four values of t experience almost the same number of LLC misses, there are visible losses in a few applications for t = 2 and t = 4. For t = 2, 3D Mark Vantage GT1, Dirt, HAWX, and Unigine suffer at least 2% additional LLC misses relative to t = 16. For t = 4, Dirt and Unigine suffer at least 1% additional LLC misses. On the other hand, Assassin's Creed delivers the best performance for t = 2, saving more than 4% LLC misses relative to t = 16. We use t = 8 in this paper, since it offers the most robust performance across the board.

Figure 11: Sensitivity to parameter t.

Table 6: Evaluated policies
Policy | Description
DRRIP | Dynamic re-reference interval prediction
NRU | Single-bit not-recently-used
SHiP-mem | Memory signature-based hit prediction
GS-DRRIP | Graphics stream-aware DRRIP
GSPZTC | Graphics stream-aware probabilistic Z and texture caching
GSPZTC+TSE | GSPZTC with texture sampler epochs
GSPC | Graphics stream-aware probabilistic caching
GSPC+UCD | GSPC with uncached displayable color
DRRIP+UCD | DRRIP with uncached displayable color

Figure 12 presents the LLC miss counts for several policies normalized to two-bit DRRIP. In addition to NRU, GS-DRRIP, and our proposals (GSPZTC, GSPZTC+TSE, and GSPC), we evaluate SHiP-mem [50] and the impact of not caching the final displayable color data (GSPC+UCD and DRRIP+UCD). Table 6 summarizes all the evaluated policies. Signature-based hit prediction (SHiP) infers, at the time of filling a block into the LLC, whether the block is likely to experience future reuses. Accordingly, the newly filled block is assigned an RRPV of three (no near-future reuse) or two (possible future reuses). This inference about future reuses is carried out by learning the reuse counts observed by different types of signatures associated with the cache block, e.g., the program counter of the instruction that fills the block into the LLC (SHiP-PC), the instruction sequence leading to the instruction that fills the block into the LLC (SHiP-Iseq), and the memory region containing the block (SHiP-mem). For graphics data, it is not possible to associate a program counter or an instruction sequence with every LLC fill because a large number of accesses come from the fixed-function hardware such as the texture samplers as well as the render target blending units and the depth test units. Therefore, we can only evaluate SHiP-mem from this family of policies. In SHiP-mem, as proposed originally, we divide the physical address space into contiguous 16 KB regions. For each region, we learn the count of reuses by hashing a 14-bit region identifier (address bits [27:14]) into a 16K-entry table (per LLC bank) of three-bit saturating counters. On an LLC hit to a block belonging to a particular region, the corresponding region counter is incremented by one. If a block gets evicted from the LLC without experiencing any reuse, the corresponding region counter is decremented by one. A newly filled block is assigned an RRPV of three if the corresponding region counter is zero; otherwise the block is inserted with an RRPV of two.

Figure 12: LLC miss count normalized to DRRIP.

In Figure 12, for each application, we evaluate eight policies: NRU, SHiP-mem, GS-DRRIP, GSPZTC, GSPZTC+TSE, GSPC, GSPC+UCD, and DRRIP+UCD (from left to right). All bars are normalized to two-bit DRRIP. NRU suffers from an average increase of 6.2% in the LLC misses compared to DRRIP. SHiP-mem saves visible amounts of LLC misses in BioShock (3.8%), DMC (6.4%), Lost Planet (1.1%), and Stalker COP (2%), but, on average, experiences the same volume of LLC misses as DRRIP. In most applications, a 16 KB contiguous physical address region contains blocks from different streams and, as a result, it is not possible to decipher the correct behavior of a stream by observing the collective reuse behavior of a region, as done by SHiP-mem. GS-DRRIP is able to save moderate to large amounts of LLC misses, with Assassin's Creed and BioShock being the biggest beneficiaries, enjoying 8.5% and 5.1% fewer LLC misses compared to DRRIP, respectively. On average, GS-DRRIP saves 2.9% LLC misses compared to DRRIP.

Turning to our policy proposals, we observe that GSPZTC is highly effective across the board (except for DMC and Dirt), saving 4.8% LLC misses compared to DRRIP on average. The applications that enjoy more than 5% savings in LLC misses compared to DRRIP are 3D Mark Vantage GT1 (7.6%), Assassin's Creed (15.3%), and Civilization (5.1%). There are two reasons why GSPZTC performs better than GS-DRRIP across the board (except in BioShock, Dirt, and Lost Planet). First, GSPZTC offers better protection to the render targets by filling all render target blocks with RRPV zero. Second, GS-DRRIP often fails to converge to the global optimum due to the nature of the feedback-based dueling that it uses [19, 20]. In certain situations, the dueling algorithm gets stuck in a local optimum.
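For context, the feedback mechanism that GS-DRRIP inherits from DRRIP can be sketched as a generic set-dueling selector. This is the standard PSEL scheme, not the paper's exact implementation: two small groups of leader sets each fix one insertion RRPV, a saturating counter tracks which group misses more, and the follower sets adopt the winner. With one such duel running independently per stream, a duel can settle on a choice that is only locally best.

```python
class DuelingSelector:
    """Generic set-dueling between 'static' (insert at RRPV 2) and
    'bimodal' (insert at RRPV 3) leader sets."""

    def __init__(self, bits=10):
        self.max_psel = (1 << bits) - 1
        self.mid = 1 << (bits - 1)
        self.psel = self.mid            # start undecided

    def miss_in_leader(self, policy):
        # a miss in a leader group is a vote against that group's policy
        if policy == 'static':
            self.psel = min(self.psel + 1, self.max_psel)
        else:  # 'bimodal'
            self.psel = max(self.psel - 1, 0)

    def follower_insert_rrpv(self):
        # high PSEL: the static leaders miss more, so the followers
        # insert with RRPV three; low PSEL: prefer RRPV two
        return 3 if self.psel >= self.mid else 2
```

Because PSEL only sees the aggregate miss counts of the two leader groups, a phase-dependent workload can keep it oscillating around the midpoint or parked on a locally good choice.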

Inclusion of the texture sampler epochs further saves LLC misses significantly across the board (see the GSPZTC+TSE bar). Compared to GSPZTC, the most remarkable gains are enjoyed by 3D Mark Vantage GT1 and GT2, Assassin's Creed, DMC, and Lost Planet. The applications enjoying more than 10% savings in the LLC misses with GSPZTC+TSE compared to DRRIP are 3D Mark Vantage GT1 (11.8%), 3D Mark Vantage GT2 (10%), Assassin's Creed (31.3%), DMC (16.6%), and Lost Planet (11.9%). On average, 11.5% of the LLC misses are saved by the GSPZTC+TSE policy compared to DRRIP.

Our final policy, GSPC, incorporates a dynamic caching algorithm for the render targets. This optimization offers less protection to the render targets in the phases where the probability of render target to texture reuses is low. However, since all of our applications heavily exploit render target to texture reuses (see Figure 6), it is unlikely that this optimization will bring much benefit to this set of applications. As shown in Figure 12, a few applications do benefit from this optimization, DMC and Dirt in particular (see the GSPC bar). Dirt was suffering from an increase in the LLC misses with GSPZTC+TSE, which we are able to eliminate in GSPC. In GSPC, no application suffers from additional LLC misses compared to DRRIP. Assassin's Creed and DMC enjoy more than 20% savings in the LLC misses compared to DRRIP. On average, GSPC saves 11.8% LLC misses.

Uncached displayable color further improves GSPC (see the GSPC+UCD bar), saving additional LLC misses across the board, the most notable being HAWX and Stalker COP. On average, this policy saves 13.1% LLC misses compared to DRRIP. The individual application-wise savings range from 1.7% in Dirt to 29.6% in Assassin's Creed. We note that these impressive savings in the LLC misses come with a negligible bookkeeping overhead that is less than 0.5% of the total data array capacity of the LLC. Not caching the displayable color in DRRIP does not lead to any significant benefits, however (see the DRRIP+UCD bar). Only a few applications enjoy additional savings in LLC misses. These are BioShock (4.2%), DMC (6.7%), Lost Planet (1.1%), and Stalker COP (2%). On average, DRRIP with uncached displayable color experiences the same number of LLC misses as the baseline DRRIP. The primary reason why DRRIP is less sensitive to uncached displayable color is that it already inserts these blocks with RRPV three, while GSPC often fails to learn this and ends up offering the same level of protection to the display target blocks as to the other render target blocks (the displayable color is a render target).

Figure 13: Texture sampler hit rate, render target production to texture consumption rate, render target hit rate, and Z hit rate.

Figure 13 further analyzes the policies in terms of the texture sampler access hit rate, the percentage of the render target blocks consumed by the texture samplers from the LLC, the hit rate of the render target accesses due to blending, and the Z hit rate. These results are averaged over the 52 frames. The texture sampler hit rate and the render target to texture consumption rate gradually increase in GSPZTC and GSPZTC+TSE, as expected. Slight drops in these two dimensions are observed in GSPC due to the introduction of probabilistic render target insertion, which, due to its statistical nature, victimizes some render targets before they can be consumed by the texture samplers. However, this is compensated by uncaching the displayable color data. The render target hit rate (LLC hits enjoyed by the render target accesses originating from blending operations) increases through GSPZTC+TSE and GSPC, since these two policies gradually create more space in the LLC by eliminating the less useful texture and render target blocks early. In fact, the average render target hit rate achieved by GSPC (57.7%) comes very close to what Belady's optimal policy achieves (59.8%, as was shown in Figure 5). The Z hit rate drops visibly in GSPZTC and GSPZTC+TSE compared to GS-DRRIP due to the unnecessarily high LLC occupancy of some of the render target blocks in these two policies, leading to premature eviction of the Z blocks. GSPC is able to address this drawback to a great extent by evicting the less useful render targets early. GS-DRRIP, however, achieves the best Z hit rate among all the policies.

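To make the insertion mechanics discussed above concrete, the following sketch models one set of a two-bit RRIP cache extended with stream-aware, probabilistic insertion. This is a minimal illustration, not the paper's implementation: the stream names and the per-stream RRPV/probability assignments below are hypothetical placeholders, whereas GSPC learns its reuse probabilities online per stream.

```python
import random

RRPV_BITS = 2
RRPV_MAX = (1 << RRPV_BITS) - 1  # 3 for two-bit RRIP

class StreamAwareRRIPSet:
    """One set of a W-way cache using two-bit re-reference
    interval prediction (RRIP) with per-stream insertion."""

    def __init__(self, ways, insert_rrpv, insert_prob):
        self.ways = ways
        self.tags = [None] * ways
        self.rrpv = [RRPV_MAX] * ways
        self.insert_rrpv = insert_rrpv  # stream -> preferred insertion RRPV
        self.insert_prob = insert_prob  # stream -> insertion probability

    def _victim(self):
        # Standard RRIP victim search: evict the first block whose
        # RRPV is RRPV_MAX; if none exists, age every block and retry.
        while True:
            for w in range(self.ways):
                if self.rrpv[w] == RRPV_MAX:
                    return w
            for w in range(self.ways):
                self.rrpv[w] += 1

    def access(self, tag, stream):
        if tag in self.tags:
            # Hit: promote to near-immediate re-reference.
            self.rrpv[self.tags.index(tag)] = 0
            return True
        w = self._victim()
        self.tags[w] = tag
        # Probabilistic, stream-aware insertion: with probability p the
        # block gets the stream's preferred RRPV; otherwise it is
        # inserted at RRPV_MAX and becomes eligible for early eviction.
        if random.random() < self.insert_prob[stream]:
            self.rrpv[w] = self.insert_rrpv[stream]
        else:
            self.rrpv[w] = RRPV_MAX
        return False
```

Setting, say, the render target stream to (RRPV 0, probability 1.0) reproduces GSPZTC-style protection, while a probability below 1.0 mimics GSPC's statistical victimization of the less useful render targets.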
5.2 Analysis of Hardware Overhead

The GSPC policy requires four replacement state bits (two new bits in addition to the two existing RRPV bits) per LLC block. Figure 14 compares the LLC miss counts for four policies with identical replacement state bit overhead. The results are normalized to two-bit DRRIP. In the LRU, four-bit DRRIP, and four-bit GS-DRRIP policies, a block, after experiencing a hit, may take a long time to become eligible for replacement. Further, the LRU policy inserts all blocks with the highest possible protection. This particularly hurts texture management. On average, LRU suffers from a 7.2% increase in the LLC miss count compared to two-bit DRRIP. Four-bit DRRIP is only slightly better than two-bit DRRIP (0.4% on average), while four-bit GS-DRRIP saves 1.7% LLC misses compared to the two-bit DRRIP baseline. GSPC saves 11.8% LLC misses compared to the baseline, on average. These results clearly bring out the fact that GSPC makes efficient use of the additional two state bits and saves an impressive volume of LLC misses with small logic and storage overhead (less than 0.5% additional overhead per LLC block on top of the baseline).

Figure 14: Comparison between iso-overhead policies.

In all our subsequent analyses, we will continue to use two-bit DRRIP and two-bit GS-DRRIP, since they offer much better cost-performance compared to their four-bit counterparts. We will also disable caching of displayable color in the LLC, but will not mention it explicitly in the policy names, i.e., NRU, GS-DRRIP, GSPC, and DRRIP will stand for NRU+UCD, GS-DRRIP+UCD, GSPC+UCD, and DRRIP+UCD, respectively.

5.3 Performance Analysis

An increase in the volume of the LLC hits saves latency as well as improves the average delivered bandwidth of the memory subsystem, since the LLC is more bandwidth-efficient than the DRAM modules. In this section, we explore how the LLC miss savings translate to performance improvement. We measure the performance of each application in terms of the number of frames rendered per second.

Figure 15: Normalized perf. on an 8 MB 16-way LLC.

Figure 15 evaluates the performance of NRU, GS-DRRIP, and GSPC relative to DRRIP on our baseline 8 MB 16-way LLC. On average, NRU delivers 7% worse performance compared to DRRIP. GS-DRRIP fails to convert most of its LLC miss savings into performance gains. On average, GS-DRRIP improves performance by only 0.8% compared to DRRIP. It is important to note that the GPUs traditionally exploit fast thread switching to partially hide the latency of memory accesses. As a result, it is necessary to save a significantly large volume of LLC misses to achieve reasonable performance improvements in graphics applications. GSPC improves performance across the board with gains ranging from 1.8% in Heaven to 18.2% in Assassin's Creed compared to DRRIP. On average, GSPC improves performance by 8% compared to DRRIP and by 16% compared to NRU. Averaged over all frames, GSPC delivers 26.1 frames per second.

Figure 16: Normalized perf. on a 16 MB 16-way LLC.

To understand how our GSPC proposal scales to a bigger LLC, Figure 16 shows the performance of NRU, GS-DRRIP, and GSPC relative to DRRIP on a 16 MB 16-way LLC. The trends are similar to those observed on the 8 MB LLC. NRU loses 3% performance on average compared to DRRIP. It fails to offer any visible gain in any of the applications. GS-DRRIP improves performance by 4% on average compared to DRRIP. However, DMC loses by 11.2% with GS-DRRIP compared to DRRIP. Across the board, GSPC achieves impressive performance improvements compared to DRRIP. While DMC is the only application that suffers from a loss in performance (by 3.5%), some of the significant gainers are Assassin's Creed (by 27%), Lost Planet (by 17.3%), and Stalker COP (by 17.5%). On average, GSPC improves performance by 11.8% compared to DRRIP and by 15.2% compared to NRU. In absolute terms, GSPC delivers an average frame rate of 32.4 frames per second, which translates to a 24.1% improvement over its own performance on an 8 MB LLC.

5.4 Sensitivity Studies

In this section, we evaluate GSPC in two different environments, one with a faster DRAM system and another with a less aggressive GPU. In both the studies, the GPU is equipped with an 8 MB 16-way LLC. The upper panel of Figure 17 shows the performance of NRU and GSPC normalized to DRRIP for an architecture with a dual-channel eight-way banked DDR3-1867 10-10-10 DRAM system. While NRU suffers from an average performance loss of 7%, GSPC continues to improve performance across the board, achieving an average speedup of 7.1% compared to DRRIP. The gains are slightly smaller than with the slower baseline DRAM model, as expected.

Figure 17: Perf. normalized to DRRIP for a DDR3-1867 DRAM system (upper panel) and for a less aggressive GPU (lower panel), using an 8 MB 16-way LLC.

The lower panel of Figure 17 presents the performance of NRU and GSPC normalized to DRRIP for a GPU with 512 shader thread contexts (64 cores × 8 threads per core) and eight texture samplers. Everything else, including the dual-channel eight-way banked DDR3-1600 15-15-15 DRAM system, is left unchanged from our baseline GPU. On this less aggressive graphics processor, NRU suffers from a 5.3% loss in performance compared to DRRIP. GSPC continues to improve performance across the board, achieving an average speedup of 5.9% compared to DRRIP. The less aggressive GPU has internal bottlenecks and therefore, rendering performance is expected to have less sensitivity toward memory subsystem optimizations. Overall, these sensitivity studies clearly show that GSPC is a robust algorithm that continues to deliver significant performance improvements even when the DRAM systems are made faster or in scenarios where the GPU is not very efficient. In addition to these sensitivity results, the last section has already explored the sensitivity of GSPC to the LLC capacity.

6. SUMMARY

We have presented a family of reuse probability-based last-level caching schemes for 3D graphics. This is the first study to explore caching opportunities of 3D graphics workloads in the context of multi-megabyte last-level caches. We present a detailed characterization of the rendering of 52 DirectX frames drawn from eight game applications and four benchmark applications, spanning three different resolutions and two different versions of DirectX. The characterization results show that the depth buffer values, render target colors, and texture sampling requests are the primary contributors to the last-level cache access traffic. Our proposal systematically incorporates optimizations in the caching policies to improve the volume of the last-level cache reuses enjoyed by these access streams. Our best graphics stream-aware probabilistic caching proposal achieves an average performance improvement of 8% compared to two-bit DRRIP on an 8 MB 16-way last-level cache while incurring a negligible book-keeping overhead that is less than 0.5% of the total data array capacity of the LLC. On a 16 MB 16-way last-level cache, the speedup improves further to 11.8%.

7. ACKNOWLEDGMENTS

The authors thank Rakesh Ramesh and Suresh Srinivasan for their help during the early phases of this work, and Aravindh Anantaraman, Ajay Joshi, and Supratik Majumder for useful discussions. This research effort is funded by Intel Corporation.

8. REFERENCES

[1] B. Anderson et al. Accommodating Memory Latency in a Low-cost Rasterizer. In Proceedings of the SIGGRAPH/EUROGRAPHICS Workshop on Graphics Hardware, pages 97–101, August 1997.
[2] L. A. Belady. A Study of Replacement Algorithms for a Virtual-storage Computer. In IBM Systems Journal, 5(2): 78–101, 1966.
[3] E. Catmull. A Subdivision Algorithm for Computer Display of Curved Surfaces. PhD thesis, University of Utah, 1974.
[4] M. Chaudhuri et al. Introducing Hierarchy-awareness in Replacement and Bypass Algorithms for Last-level Caches. In Proceedings of the 21st International Conference on Parallel Architecture and Compilation Techniques, pages 293–304, September 2012.
[5] M. Chaudhuri. Pseudo-LIFO: The Foundation of a New Family of Replacement Policies for Last-level Caches. In Proceedings of the 42nd International Symposium on Microarchitecture, pages 401–412, December 2009.
[6] C. J. Choi et al. Performance Comparison of Various Cache Systems for Texture Mapping. In Proceedings of the 4th International Conference on High Performance Computing in Asia-Pacific Region, pages 374–379, May 2000.
[7] M. Cox, N. Bhandari, and M. Shantz. Multi-level Texture Caching for 3D Graphics Hardware. In Proceedings of the 25th International Symposium on Computer Architecture, pages 86–97, June/July 1998.
[8] M. F. Deering, S. A. Schlapp, and M. G. Lavelle. FBRAM: A New Form of Memory Optimized for 3D Graphics. In Proceedings of the 21st SIGGRAPH Annual Conference on Computer Graphics and Interactive Techniques, pages 167–174, July 1994.
[9] M. Demler. Iris Pro Takes On Discrete GPUs. In Microprocessor Report, September 9, 2013.
[10] M. Doggett. Texture Caches. In IEEE Micro, 32(3): 136–141, May/June 2012.
[11] J. Gaur, M. Chaudhuri, and S. Subramoney. Bypass and Insertion Algorithms for Exclusive Last-level Caches. In Proceedings of the 38th International Symposium on Computer Architecture, pages 81–92, June 2011.
[12] N. Greene, M. Kass, and G. Miller. Hierarchical Z-buffer Visibility. In Proceedings of the 20th SIGGRAPH Annual Conference on Computer Graphics and Interactive Techniques, pages 231–238, August 1993.
[13] Z. S. Hakura and A. Gupta. The Design and Analysis of a Cache Architecture for Texture Mapping. In Proceedings of the 24th International Symposium on Computer Architecture, pages 108–120, May 1997.
[14] M. Harris. Dynamic Texturing. Available at http://developer.download.nvidia.com/assets/gamedev/docs/DynamicTexturing.pdf.
[15] Z. Hu, S. Kaxiras, and M. Martonosi. Timekeeping in the Memory System: Predicting and Optimizing Memory Behavior. In Proceedings of the 29th International Symposium on Computer Architecture, pages 209–220, May 2002.
[16] H. Igehy, M. Eldridge, and P. Hanrahan. Parallel Texture Caching. In Proceedings of the SIGGRAPH/EUROGRAPHICS Workshop on Graphics Hardware, pages 95–106, August 1999.
[17] H. Igehy, M. Eldridge, and K. Proudfoot. Prefetching in a Texture Cache Architecture. In Proceedings of the SIGGRAPH/EUROGRAPHICS Workshop on Graphics Hardware, pages 133–142, August/September 1998.
[18] Intel Core i7-3770 Processor. http://ark.intel.com/products/65719/.
[19] A. Jaleel et al. High Performance Cache Replacement using Re-reference Interval Prediction (RRIP). In Proceedings of the 37th International Symposium on Computer Architecture, pages 60–71, June 2010.
[20] A. Jaleel et al. Adaptive Insertion Policies for Managing Shared Caches. In Proceedings of the 17th International Conference on Parallel Architecture and Compilation Techniques, pages 208–219, October 2008.
[21] D. Kanter. Intel's Ivy Bridge Graphics Architecture. April 2012. Available at http://www.realworldtech.com/ivy-bridge-gpu/.
[22] D. Kanter. Intel's Sandy Bridge Graphics Architecture. August 2011. Available at http://www.realworldtech.com/sandy-bridge-gpu/.
[23] S. Khan, Y. Tian, and D. A. Jiménez. Dead Block Replacement and Bypass with a Sampling Predictor. In Proceedings of the 43rd International Symposium on Microarchitecture, pages 175–186, December 2010.
[24] S. Khan et al. Using Dead Blocks as a Virtual Victim Cache. In Proceedings of the 19th International Conference on Parallel Architectures and Compilation Techniques, pages 489–500, September 2010.
[25] M. Kharbutli and Y. Solihin. Counter-based Cache Replacement and Bypassing Algorithms. In IEEE Transactions on Computers, 57(4): 433–447, April 2008.
[26] M. J. Kilgard. Realizing OpenGL: Two Implementations of One Architecture. In Proceedings of the SIGGRAPH/EUROGRAPHICS Workshop on Graphics Hardware, pages 45–56, August 1997.
[27] A-C. Lai, C. Fide, and B. Falsafi. Dead-block Prediction & Dead-block Correlating Prefetchers. In Proceedings of the 28th International Symposium on Computer Architecture, pages 144–154, June/July 2001.
[28] J. Lee and H. Kim. TAP: A TLP-aware Cache Management Policy for a CPU-GPU Heterogeneous Architecture. In Proceedings of the 18th International Symposium on High Performance Computer Architecture, pages 91–102, February 2012.
[29] H. Liu et al. Cache Bursts: A New Approach for Eliminating Dead Blocks and Increasing Cache Efficiency. In Proceedings of the 41st International Symposium on Microarchitecture, pages 222–233, November 2008.
[30] F. D. Luna. Introduction to 3D Game Programming with DirectX 10. Wordware Publishing Inc.
[31] R. Manikantan, K. Rajan, and R. Govindarajan. Probabilistic Shared Cache Management (PriSM). In Proceedings of the 39th International Symposium on Computer Architecture, pages 428–439, June 2012.
[32] M. Mantor and M. Houston. AMD Graphic Core Next: Low Power High Performance Graphics & Parallel Compute. In Symposium on High-Performance Graphics, August 2011.
[33] R. L. Mattson et al. Evaluation Techniques for Storage Hierarchies. In IBM Systems Journal, 9(2): 78–117, 1970.
[34] S. Molner. Design Tradeoffs in the Kepler Architecture. In Symposium on High-Performance Graphics, August 2012.
[35] S. Morein. ATI Radeon HyperZ Technology. In SIGGRAPH/EUROGRAPHICS Workshop on Graphics Hardware, August 2000.
[36] D. Nehab, J. Barczak, and P. V. Sander. Triangle Order Optimization for Graphics Hardware Computation Culling. In Proceedings of the Symposium on Interactive 3D Graphics and Games, pages 207–211, March 2006.
[37] E. Persson. Depth In-depth. Available at http://developer.amd.com/media/gpu_assets/Depth_in-depth.pdf.
[38] M. Pharr et al. Rendering Complex Scenes with Memory-coherent Ray Tracing. In Proceedings of the 24th SIGGRAPH Annual Conference on Computer Graphics and Interactive Techniques, pages 101–108, August 1997.
[39] T. Piazza. Intel Processor Graphics. In Symposium on High-Performance Graphics, August 2012.
[40] M. K. Qureshi et al. Adaptive Insertion Policies for High Performance Caching. In Proceedings of the 34th International Symposium on Computer Architecture, pages 381–391, June 2007.
[41] M. K. Qureshi and Y. N. Patt. Utility-Based Cache Partitioning: A Low-Overhead, High-Performance, Runtime Mechanism to Partition Shared Caches. In Proceedings of the 39th International Symposium on Microarchitecture, pages 423–432, December 2006.
[42] D. Sanchez and C. Kozyrakis. Vantage: Scalable and Efficient Fine-grain Cache Partitioning. In Proceedings of the 38th International Symposium on Computer Architecture, pages 57–68, June 2011.
[43] A. Schilling, G. Knittel, and W. Strasser. Texram: A Smart Memory for Texturing. In IEEE Computer Graphics and Applications, 16(3): 32–41, May 1996.
[44] A. L. Shimpi. Intel Iris Pro 5200 Graphics Review: Core i7-4950HQ Tested. June 2013. Available at http://www.anandtech.com/show/6993/intel-iris-pro-5200-graphics-review-core-i74950hq-tested.
[45] J. Torborg and J. Kajiya. Talisman: Commodity Real-time 3D Graphics for the PC. In Proceedings of the 23rd SIGGRAPH Annual Conference on Computer Graphics and Interactive Techniques, pages 353–363, August 1996.
[46] Unigine: Real-time 3D Engine. http://unigine.com.
[47] A. Vartanian, J-L. Bechennec, and N. Drach-Temam. Evaluation of High Performance Multicache Parallel Texture Mapping. In Proceedings of the 12th International Conference on Supercomputing, pages 289–296, July 1998.
[48] L. Williams. Pyramidal Parametrics. In Proceedings of the 10th SIGGRAPH Conference on Computer Graphics and Interactive Techniques, pages 1–11, July 1983.
[49] C. Wittenbrink, E. Kilgariff, and A. Prabhu. Fermi GF100 GPU Architecture. In IEEE Micro, 31(2): 50–59, March/April 2011.
[50] C-J. Wu et al. SHiP: Signature-Based Hit Predictor for High Performance Caching. In Proceedings of the 44th International Symposium on Microarchitecture, pages 430–441, December 2011.
[51] Y. Xie and G. H. Loh. PIPP: Promotion/Insertion Pseudo-partitioning of Multi-core Shared Caches. In Proceedings of the 36th International Symposium on Computer Architecture, pages 174–183, June 2009.
[52] M. Yuffe et al. A Fully Integrated Multi-CPU, GPU, and Memory Controller 32 nm Processor. In Proceedings of the International Solid-State Circuits Conference, pages 264–266, February 2011.
[53] 3D Mark Benchmark. http://www.3dmark.com/.