Efficient Management of Last-level Caches in Graphics Processors for 3D Scene Rendering Workloads

Jayesh Gaur† Raghuram Srinivasan‡∗ Sreenivas Subramoney† Mainak Chaudhuri§

† Intel Architecture Group, Bangalore 560103, INDIA ‡ The Ohio State University, Columbus, OH 43210, USA § Indian Institute of Technology, Kanpur 208016, INDIA

ABSTRACT

Three-dimensional (3D) scene rendering is implemented in the form of a pipeline in graphics processing units (GPUs). In different stages of the pipeline, different types of data get accessed. These include, for instance, vertex, depth, stencil, render target (same as pixel color), and texture sampler data. The GPUs traditionally include small caches for vertex, render target, depth, and stencil data as well as multi-level caches for the texture sampler units. Recent introduction of reasonably large last-level caches (LLCs) shared among these data streams in discrete as well as integrated graphics hardware architectures has opened up new opportunities for improving 3D rendering. The GPUs equipped with such large LLCs can enjoy far-flung intra- and inter-stream reuses. However, there is no comprehensive study that can help graphics cache architects understand how to effectively manage a large multi-megabyte LLC shared between different 3D graphics streams.

In this paper, we characterize the intra-stream and inter-stream reuses in 52 frames captured from eight DirectX game titles and four DirectX benchmark applications spanning three different frame resolutions. Based on this characterization, we propose graphics stream-aware probabilistic caching (GSPC) that dynamically learns the reuse probabilities and accordingly manages the LLC of the GPU. Our detailed trace-driven simulation of a typical GPU equipped with 768 shader thread contexts, twelve fixed-function texture samplers, and an 8 MB 16-way LLC shows that GSPC saves up to 29.6% and on average 13.1% of LLC misses across 52 frames compared to the baseline state-of-the-art two-bit dynamic re-reference interval prediction (DRRIP) policy. These savings in the LLC misses result in a speedup of up to 18.2% and on average 8.0%. On a 16 MB LLC, the average speedup achieved by GSPC further improves to 11.8% compared to DRRIP.

General Terms

Algorithms, design, measurement, performance

Keywords

Caches, graphics processing units, 3D scene rendering

1. INTRODUCTION

High-quality and high-performance 3D scene rendering is central to the success of several important graphics applications. A general 3D rendering pipeline found in a typical GPU consists of a front-end that processes the input geometry by transforming the vertices of each primitive polygon from the local co-ordinates to the perspective view co-ordinates. The back-end of the pipeline assigns color to each pixel within each transformed primitive by possibly blending multiple render targets in the screen-space frame buffer (also known as the back buffer in DirectX¹), applies texture maps to each pixel to bring realism to the scene, and resolves the visible pixels in the frame buffer through a depth test. Today's GPUs routinely carry out hierarchical depth tests and early depth tests to reduce depth buffer bandwidth and eliminate shading of pixels belonging to the occluded parts of a surface [12, 35, 36, 37]. Also, these processors incorporate stencil tests to apply per-pixel masks for realizing sophisticated control over the set of retained or discarded pixels in the rendering pipeline. The stencil buffer stores these masks and is often used together with the depth buffer. In summary, a 3D rendering pipeline generates access streams to different data structures such as vertices of the geometry primitives, hierarchical depth buffer (HiZ buffer), regular depth buffer (Z buffer), render targets (the pixel colors of the surfaces being rendered), texture maps, and stencil buffer.
Categories and Subject Descriptors

B.3 [Memory Structures]: Design Styles

∗Contributed to this work as an intern at Intel India.

¹Rendering takes place in the back buffer. A color or texture surface that needs to be rendered is referred to as a render target. When a frame is completely rendered and ready for presentation, the back buffer is swapped with the front buffer and the front buffer pixels get displayed.

MICRO '46, December 7-11, 2013, Davis, CA, USA
Copyright 2013 ACM 978-1-4503-2638-4/13/12 ...$15.00.
http://dx.doi.org/10.1145/2540708.2540742

Traditionally, the GPUs have included small independent on-die caches for each access stream type. For example, a single level of vertex and vertex index cache, Z cache, render target cache (also known as the color cache), stencil cache, HiZ cache, and multiple levels of texture caches can be found in any typical GPU. We will refer to these caches collectively as render caches. While these small render caches (a few tens to a few hundreds of KB) improve performance by exploiting near-term temporal locality, they fail to offer much in terms of exploiting far-flung reuses even within a frame of animation. Recent discrete and integrated graphics hardware architectures from Nvidia, AMD, and Intel have included reasonably large last-level caches that can be shared by all the data streams.
For example, the Fermi and Kepler architectures from Nvidia respectively include 768 KB and up to 1.5 MB of shared L2 cache [34, 49]. The GPUs used in the AMD 7900, 7800, and 7700 series (Tahiti, Pitcairn, and Cape Verde of the Southern Islands family) designed based on the AMD Graphics Core Next architecture have 64 KB to 128 KB of shared read/write L2 cache attached to each memory controller channel [32]. Finally, Intel's integrated GPUs found in Sandy Bridge, Ivy Bridge, and Haswell share a large multi-megabyte LLC with the CPU cores [9, 21, 22, 39, 44, 52].

Data from different 3D graphics streams can co-exist in such a large shared graphics LLC. An efficiently-managed LLC can offer far-flung intra-stream reuses and significantly accelerate inter-stream reuses, which was impossible in the architectures with an independent cache hierarchy for each different stream.

Inspired by the data in Figure 1, we characterize the reuses observed within each graphics data stream as well as across different streams (Section 2). This characterization shows that Belady's optimal policy benefits from both intra-stream and inter-stream reuses. The inter-stream reuses primarily stem from the process of dynamic texture mapping where the dynamically produced render targets (i.e., color surfaces) get consumed by the texture samplers and reused for texturing other surfaces [14]. Based on this detailed characterization, we systematically derive our proposal of graphics stream-aware probabilistic caching (GSPC), which dynamically learns the reuse probability of a block belonging to a 3D graphics stream and accordingly modulates the RRPV of the block (Section 3).
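The flavor of this idea — per-stream reuse counters driving the insertion age of a block — can be illustrated with a minimal sketch. The stream names, the counter organization, and the 0.5 threshold below are illustrative assumptions for exposition, not the actual GSPC hardware mechanism:

```python
# Hedged sketch: a per-stream reuse-probability estimate chooses the
# insertion RRPV of an RRIP-managed cache. Stream names, counters, and
# the threshold are illustrative assumptions, not the GSPC design itself.
RRPV_BITS = 2
RRPV_MAX = (1 << RRPV_BITS) - 1   # 3: predicted distant (or no) reuse
RRPV_LONG = RRPV_MAX - 1          # 2: predicted intermediate-future reuse

class StreamReuseTracker:
    """Counts fills and subsequent hits per graphics stream to estimate
    the probability that a freshly filled block will be reused."""
    def __init__(self):
        self.fills = {}
        self.reuses = {}

    def record_fill(self, stream):
        self.fills[stream] = self.fills.get(stream, 0) + 1

    def record_hit(self, stream):
        self.reuses[stream] = self.reuses.get(stream, 0) + 1

    def reuse_probability(self, stream):
        fills = self.fills.get(stream, 0)
        return self.reuses.get(stream, 0) / fills if fills else 0.0

def insertion_rrpv(tracker, stream, threshold=0.5):
    """Streams observed to reuse their blocks get a longer expected
    lifetime; low-reuse streams are marked for early eviction."""
    if tracker.reuse_probability(stream) >= threshold:
        return RRPV_LONG
    return RRPV_MAX

tracker = StreamReuseTracker()
for _ in range(100):
    tracker.record_fill("render_target")
for _ in range(70):
    tracker.record_hit("render_target")
for _ in range(100):
    tracker.record_fill("display")

print(insertion_rrpv(tracker, "render_target"))  # frequently reused -> 2
print(insertion_rrpv(tracker, "display"))        # never reused -> 3
```

A display-like stream that is never reused is inserted at the maximum RRPV and leaves the cache quickly, while a render-target-like stream that feeds later texture sampling is retained longer.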
Efficient management of the LLC shared among the 3D graphics streams is the central focus of this paper. This is the first study that characterizes the reuse profile of 3D scene rendering workloads and exploits the reuse behavior of different graphics data to design policies for the LLC of a GPU. Our detailed trace-driven simulation results (Sections 4 and 5) show that GSPC saves 13.1% of LLC misses averaged over 52 discrete frames drawn from twelve DirectX 3D rendering workloads compared to the baseline DRRIP policy. The LLC miss savings achieved by our proposal lead to an average speedup of 8.0% compared to the baseline. For a 16 MB 16-way LLC, the speedup further improves to 11.8%.

Figure 1: Number of LLC misses in twelve DirectX applications for NRU and Belady's optimal policy normalized to two-bit DRRIP in an 8 MB 16-way LLC.

1.1 Related Work

In this section, we review the contributions on the management of LLCs in general-purpose processors and studies exploring memory management in 3D rendering hardware.

1.1.1 General-purpose LLC Management

We focus on three classes of the general-purpose LLC management algorithms that most closely relate to our proposal.
We discuss algorithms for deciding the age of a block on insertion and subsequent hits in the LLC, algorithms for predicting dead blocks, and algorithms for dynamically partitioning the LLC among multiple independent threads in a multi-core setting to meet certain performance, fairness, or quality goals.

To understand the potential of improving the LLC performance for 3D graphics workloads, Figure 1 shows the number of LLC misses for two different LLC replacement policies, namely, single-bit not-recently-used (NRU) and Belady's optimal [2, 33] normalized to the baseline two-bit dynamic re-reference interval prediction (DRRIP) policy [19]. The experiments are conducted on a non-inclusive/non-exclusive 8 MB 16-way LLC shared by all the graphics streams. The results are generated in an offline cache simulator, which takes as input the sequence of load/store accesses to the LLC in a typical GPU with 768 shader thread contexts (96 shader cores×8 threads per core) and twelve fixed-function (i.e., hardwired) texture samplers (one for every eight shader cores). The 3D graphics workloads consist of 52 discrete frames selected from eight DirectX games and four DirectX benchmark applications (details of the applications will be discussed in Section 2).

Dynamic insertion policy (DIP) adaptively inserts a block into the LLC at the least recently used (LRU) or the most recently used (MRU) position of the access recency stack depending on the outcome of a set-sampling-based duel between LRU insertion and MRU insertion policies [40]. On a cache hit, a block is always upgraded to the MRU position. The replacement policy always victimizes the block at the LRU position. This algorithm tries to eliminate the single-use blocks from the LLC as early as possible without disturbing the rest of the contents of the LLC. A subsequent proposal has shown how to
employ this policy in a shared LLC of a multi-core processor so that each thread can choose the best insertion policy [20].

The two-bit DRRIP policy inserts every cache block into the LLC with a re-reference prediction value (RRPV) of two or three, signifying the possibility of a reuse in the intermediate future or no reuse in the near future, respectively. The choice between these two possible insertion RRPVs is made through a dynamic set-dueling technique [40]. On a hit, the RRPV of the block is made zero, signifying a possibility of a reuse in the immediate future. A block with RRPV three is selected for victimization. Such a block is predicted to have no reuse in the immediate or intermediate future. If no such block exists, the RRPVs of all the blocks in the target set are incremented in steps of one until a block with RRPV three is found. Ties are broken by selecting the block with the minimum physical way id.

As shown in Figure 1, the NRU policy performs worse than DRRIP in several applications and increases the LLC miss count by 6.2% on average. Belady's optimal policy saves 36.6% of LLC misses averaged over all the frames compared to the baseline DRRIP policy, pointing to the large opportunity of saving LLC misses, DRAM traffic, and system bandwidth.

In a more recent work, the notions of re-reference interval prediction and re-reference prediction value (RRPV) have been proposed [19]. The age of a block in the LLC is determined based on its RRPV. If n bits are used to store the RRPV, the static re-reference interval prediction (SRRIP) algorithm statically assigns an RRPV of 2^n − 2 to a block on insertion into the LLC. On a hit, the RRPV of the block is updated to zero. A block with RRPV 2^n − 1 is selected as the victim. If there is no such block in the target set, the RRPV of each block in the set is incremented until the RRPV of at least one block attains a value of 2^n − 1. The dynamic re-reference interval prediction (DRRIP) algorithm dynamically chooses between two insertion RRPVs, namely, 2^n − 2 and 2^n − 1, based on the outcome of a set-dueling.
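The SRRIP insertion, promotion, and victim-selection rules described above can be sketched as a simplified single-set model (the set-dueling machinery that turns SRRIP into DRRIP is omitted, and ties are broken arbitrarily rather than by way id):

```python
# Simplified single-set model of SRRIP with n = 2 RRPV bits.
# Insertion uses RRPV 2^n - 2; a hit promotes the block to RRPV 0; the
# victim is any block with RRPV 2^n - 1, aging the set until one exists.
N_BITS = 2
RRPV_MAX = (1 << N_BITS) - 1     # 3
RRPV_INSERT = RRPV_MAX - 1       # 2 (DRRIP would sometimes insert at 3)

class SRRIPSet:
    def __init__(self, ways):
        self.blocks = {}          # tag -> RRPV, at most `ways` entries
        self.ways = ways

    def access(self, tag):
        if tag in self.blocks:    # hit: predict near-immediate reuse
            self.blocks[tag] = 0
            return "hit"
        if len(self.blocks) >= self.ways:
            self._evict()
        self.blocks[tag] = RRPV_INSERT
        return "miss"

    def _evict(self):
        # Age the whole set until some block reaches RRPV_MAX, then
        # evict it (hardware breaks ties by the lowest way id).
        while not any(v == RRPV_MAX for v in self.blocks.values()):
            for t in self.blocks:
                self.blocks[t] += 1
        victim = next(t for t, v in self.blocks.items() if v == RRPV_MAX)
        del self.blocks[victim]

s = SRRIPSet(ways=2)
s.access("A"); s.access("B")      # two misses fill the set at RRPV 2
s.access("A")                     # hit: A's RRPV drops to 0
s.access("C")                     # miss: aging pushes B to 3, B evicted
print(sorted(s.blocks))           # ['A', 'C']
```

The reused block A survives the eviction because its RRPV was reset to zero on the hit, while the never-reused block B ages out first — the behavior that makes RRIP resistant to scans of single-use data.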
Thread-aware DRRIP (TA-DRRIP) applies the technique proposed in [20] to allow multiple independent threads to execute DRRIP in a multi-core shared LLC. Recent proposals exploit signature-based hit prediction (SHiP) to improve the RRIP policies by using the program counters, memory addresses, or code path signatures of the load/store instructions [50] or extend the RRIP policies to the LLCs shared between CPU cores and a GPU running traditional GPGPU-style scientific computation workloads [28]. In contrast to these algorithms, our proposal deals with 3D scene rendering workloads and dynamically modulates the RRPV of a cache block based on the observed reuse probability of the 3D graphics stream the block belongs to. Our policy takes into account the reuse probability within and across the 3D graphics streams.

A large body of work has studied texture cache architectures, including single-level texture caches [13], two-level texture caches [7], texture caches in parallel renderers [16, 47], prefetching into texture caches with deep FIFO structures [1, 17, 26, 45], victim caching [6], four-dimensional and six-dimensional tiling of texture data [13], and customized DRAM architectures for fast texture access [8, 43]. A texture cache architecture with on-the-fly decompression of texture data from a second-level compressed texture cache has also been explored [45]. Reordering of ray-object intersection tests to enhance the locality of geometry and texture has been explored for a ray-tracing-based rendering engine [38]. All these studies have explored optimizations of the 3D graphics workloads and hardware to improve data locality that can be exploited by the individual render caches attached to the different 3D graphics pipeline stages.

The dead block prediction algorithms correlate the program counters of the load/store instructions with the death of the
cache blocks that these instructions touch [15, 23, 24, 25, 27, 29]. These algorithms victimize the predicted dead blocks early to make room for more useful blocks in the LLC. Probabilistic escape LIFO is a light-weight dead block prediction technique that does not require the program counter signature and relies only on the fill order of the cache blocks within a cache set [5]. Simple measures of dynamic reuse probability in conjunction with a clever partitioning of the address space have also been used to effectively identify the dead and live LLC blocks [4, 11]. In this paper, we apply the concept of dynamic reuse probability-based LLC management to the domain of GPUs running 3D scene rendering workloads.

Algorithms have been proposed to explicitly partition the shared LLC among the competing threads of a multi-core processor. The utility-based cache partitioning (UCP) algorithm carries out a coarse-grain partitioning of the LLC by dynamically assigning a number of ways to each thread [41]. The UCP algorithm has been extended to LLCs shared between CPU cores and a GPU where the graphics processor is employed to execute GPGPU-style workloads [28].

However, it is not clear how these workloads can extract the full potential of a large LLC shared between the different pipeline stages generating different streams of accesses with possibly disparate patterns. We explore solutions to this problem by understanding the memory behavior of 3D graphics workloads and incorporating this learning to develop efficient LLC management algorithms.

2. REUSE PROFILE OF 3D RENDERING WORKLOADS

In this section, we present a detailed characterization of the LLC behavior of 52 3D frame rendering jobs drawn from twelve DirectX applications. In the process, we identify the access patterns that can be potentially exploited to improve the efficiency of caching in the LLC. The details of the DirectX applications are presented in Table 1. The first column lists the applications along with the abbreviated names, if any, within parentheses.
Among these, four are benchmark applications, namely, 3D Mark Vantage Graphics Tests 1 and 2 (3DMarkVAGT1 and 3DMarkVAGT2) [53], Unigine Heaven 2.1 and the Unigine 3D engine [46]. The remaining eight applications are 3D games. The second column shows the DirectX version of the applications. The last column lists the frame resolution for each application. In this paper, we consider graphics workloads built with the DirectX APIs (the Direct3D APIs within DirectX) only, since a large number of Windows PC games are designed using the DirectX APIs and these games are routinely used to benchmark the performance of the commercial GPUs.

The promotion/insertion pseudo-partitioning (PIPP) policy improves UCP by designing smart insertion and promotion policies for cache blocks within each partition [51]. Subsequent proposals such as Vantage [42] and PriSM [31] eliminate the limitations of way-grain partitioning and allow each thread to have an arbitrary fine-grained partition. Our proposal does not carry out any explicit partitioning of the LLC among the 3D graphics streams. In fact, since the existing cache partitioning techniques do not take into account cross-thread sharing and treat the threads as independent, these techniques cannot be applied directly to the 3D graphics streams, which have significant inter-stream data sharing. We induce implicit fine-grain partitions among the streams by efficiently managing the RRPVs of the cache blocks based on the intra- and inter-stream reuses and propose for the first time an effective way of improving the LLC performance of the 3D graphics applications.

Table 1: Details of the DirectX applications

Application                          DirectX   Resolution
3D Mark Vantage GT1 (3DMarkVAGT1)    10        1920×1200
3D Mark Vantage GT2 (3DMarkVAGT2)    10        1920×1200
Assassin's Creed (Assn Creed)        10        1680×1050
BioShock                             10        1920×1200
Devil May Cry 4 (DMC)                10        1680×1050
Civilization V (Civilization)        11        1920×1200
Dirt 2 (Dirt)                        11        1680×1050
HAWX 2 (HAWX)                        11        1920×1200
Unigine Heaven 2.1 (Heaven)          11        2560×1600
Lost Planet 2 (Lost Planet)          11        1920×1200
Stalker COP                          11        1680×1050
Unigine 3D engine (Unigine)          11        1920×1200

1.1.2 Memory Management in 3D Rendering

The GPUs have traditionally incorporated caching of polygon vertices (vertex cache), depth buffer (depth cache), frame or color buffer (color or render target cache), and texture (texture or sampler cache). Among these, texture caches [10] have drawn significant attention of the designers due to the high memory bandwidth consumed by the texture mapping process [3]. Texture caching is an attractive solution to reduce the texture memory bandwidth demand. The bilinear, trilinear, and anisotropic filters applied to texture stored in the form of a MIP map pyramid [48] offer a significant amount of spatial and temporal locality. This observation has inspired a large number of studies on texture cache architectures.

We trace the DirectX calls generated while rendering each application frame and replay this trace of calls through a detailed in-house simulator modeling a high-end GPU. The GPU has a non-inclusive/non-exclusive 8 MB 16-way set-associative LLC shared by all the data streams. This LLC configuration corresponds to an integrated GPU of today or a futuristic discrete GPU with a large read/write cache shared by all the graphics data to reduce the bandwidth demand on the DRAM system. A miss in the LLC always fills the requested block into the LLC and the requesting render cache.
An eviction from the LLC does not send invalidation to the internal render caches of the GPU. All the results presented in this section are generated by an offline cache simulator, which has been validated against the LLC of the detailed GPU simulator. The offline model digests the LLC load/store access trace collected from the detailed simulator for each frame.

The pixel shader stage uses the programmable shader threads to compute the color of each pixel in a render target. This is also the stage where the texture data get sampled through specific commands indicating the type of sampling (e.g., point, bilinear, trilinear, anisotropic, etc.) issued to the fixed-function texture sampler units. Certain texturing techniques such as displacement mapping (used to render realistic bumps, crevices, water waves, etc.) may require the vertex shader stage to invoke the texture sampler units. All sampler accesses go through the texture cache hierarchy and the texture cache misses constitute the read-only texture sampler stream to the LLC. The output merger stage carries out the late depth test and stencil test to decide the final set of visible pixels. The stencil buffer accesses go through the stencil (STC) cache and the stencil cache misses constitute the stencil stream to the LLC.

2.1 DirectX Rendering Pipeline

We begin our analysis by understanding the anatomy of the DirectX rendering pipeline and how it interacts with the graphics cache hierarchy. Figure 2 shows the organization of the DirectX 10 rendering pipeline and Figure 3 shows a high-level view of how various render caches interface with the rendering pipeline and the LLC of the GPU. The input assembler stage
reads the scene geometry data such as the vertices and vertex indices from memory (through the LLC and the vertex and vertex index caches) and uses them to assemble the geometric primitives such as triangles, lines, etc. The vertex (VTX) and vertex index (VTX index) cache misses constitute the vertex stream to the LLC. The vertex shader stage uses the programmable shader thread contexts to transform each vertex from the local co-ordinates to the world co-ordinates, from the world co-ordinates to the view space (also known as the camera or eye space) co-ordinates, and finally from the view space to the perspective projection space, which projects each vertex of the input geometry onto a projection plane within the view pyramid of the camera.

The output merger stage also blends the corresponding pixels of multiple non-opaque render targets to generate the final color of the pixels in the back buffer. The render target cache (RT cache) is used to hold the pixel colors of the render targets during creation, before blending, and after blending. DirectX 10 allows eight render targets to be simultaneously bound to the output merger stage and each can have blending enabled or disabled as the application needs. The RT cache misses constitute the render target stream to the LLC. The final displayable pixel color values written to the back buffer constitute the displayable color stream to the LLC. The rendering of one complete frame may require multiple passes through this pipeline. For example, certain blended render targets can be later accessed by the samplers to use them as textures for certain surfaces within the same frame.

The optional geometry shader stage takes the assembled primitives in the perspective projection space, creates or destroys primitives, and writes the newly created vertices to the memory hierarchy through the stream-out stage for later consumption.
Since none of the rendered frames considered in this paper uses this stage, we show the vertex stream as a read-only stream to the LLC.

Figure 2: DirectX 10 rendering pipeline (input assembler → vertex shader → geometry shader → stream output; rasterizer → pixel shader → output merger).

Using a render target as a shader resource for texture sampling is usually known as render to texture [30] or dynamic texturing [14] and constitutes the primary source of inter-stream reuses (from render target production to sampler consumption) in the frames that we consider. Also, the depth buffer contents (when rendered from the light source view) can be consumed by the texture sampler during shadow mapping. However, this particular type of inter-stream reuse is not observed much in the frames that we consider. Shadow can also be implemented by creating a render target holding the depth values of the pixels viewed from each light source and reusing the render target as a texture sampler input (as in render to texture). The DirectX 11 pipeline introduces three new stages between the vertex shader and the geometry shader to carry out hardwired tessellation of the geometry primitives. These stages are hull shader, tessellator, and domain shader.
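The mapping from pipeline stage to render cache to LLC stream described in this section can be restated as a small lookup table. This is purely an illustrative summary of the text; no additional hardware structure is implied, and the labels simply follow the names used above:

```python
# Illustrative restatement of the stage -> render cache -> LLC stream
# taxonomy from the text; no new hardware structure is implied.
STREAMS = {
    "input assembler":       ("VTX / VTX index caches", "vertex stream (read-only)"),
    "rasterizer depth tests": ("HiZ and Z caches",       "HiZ and Z streams"),
    "pixel shader sampling": ("texture caches",          "texture sampler stream (read-only)"),
    "output merger stencil": ("STC cache",               "stencil stream"),
    "output merger blending": ("RT cache",               "render target stream"),
    "back buffer writes":    (None,                      "displayable color stream"),
}

for stage, (cache, stream) in STREAMS.items():
    print(f"{stage:24s} -> {cache or 'no render cache':24s} -> {stream}")
```

A miss in any of the listed render caches becomes an access of the corresponding stream at the shared LLC, which is the view of the traffic used throughout the rest of the characterization.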

The rasterization stage clips the primitives and the portions of the primitives that fall outside the view frustum, transforms each perspective image point to a pixel co-ordinate in the screen space or the back buffer (known as viewport transform), discards back-facing primitives (known as back-face culling), and interpolates the vertex attributes (color, normals, texture co-ordinates, depth value, etc.) of a primitive to calculate the attribute values for each pixel covering the primitive. Also, this is the stage where hierarchical and early depth tests, if enabled, are carried out to discard some of the hidden surfaces. These tests use the HiZ and Z caches, and these cache misses constitute the HiZ and Z streams to the LLC. The early depth test is disabled for those render targets whose pixels may undergo modifications to their depth values in the pixel shader.

Figure 3: Graphics streams and the LLC interface. The VTX, VTX index, HiZ, RT, Z, STC, and texture caches feed the vertex, HiZ, render target, Z, stencil, and texture streams into the graphics processor LLC, which connects to the DRAM banks.

2.2 Graphics Data Streams

This section analyzes the 3D graphics data streams and identifies the ones that have the biggest impact on the overall LLC performance. Figure 4 shows the stream-wise distribution of the accesses to the LLC. Across the board, the major fraction of the LLC traffic is contributed by the render target and texture sampler accesses. On average, these two streams constitute 40% and 34% of the LLC accesses, respectively. The only other stream that contributes at least 10% of the LLC accesses on average is the Z stream. The vertex and HiZ streams have 4% and 7% of the LLC accesses on average, respectively. The remaining 5% of the LLC accesses come from stencil, display, and other accesses such as shader code, constants, etc. Based on this distribution, in the remainder of this section, we analyze the texture sampler, render target, and Z accesses in greater detail.

2.3 Inter- and Intra-stream Reuse Analysis

We first explore the nature of the reuses enjoyed by the texture sampler accesses to the LLC. Textures are either created a priori and not modified during rendering (static texture) or

created during rendering as render targets and possibly updated in every frame (dynamic texture). Dynamic texture is the primary source of the inter-stream reuses seen in the texture sampler accesses to the LLC (from render target production to texture sampler consumption). To identify these reuses, we tag every render target block in the LLC with an RT bit. When such a block gets consumed by the texture sampler stream from the LLC, the bit is reset and the LLC hit is counted toward an inter-stream reuse. The RT bit is also reset when a render target block gets evicted from the LLC (we are not interested in tracking render target to texture reuses that do not happen in the LLC). Apart from these inter-stream reuses, both static and dynamic texture can undergo intra-stream reuses, i.e., a block that does not have the RT bit set may get reused from the LLC by the texture samplers before it gets evicted.

Figure 4: Stream-wise distribution of the LLC accesses in an 8 MB 16-way LLC.

The display stream is the end-result of rendering of a frame, is consumed by the display driver, and does not enjoy any reuse. In Section 5, we will explore the performance benefit of not caching this stream in the LLC.
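The RT-bit bookkeeping used for this classification can be sketched as a minimal model of one cache. The class and field names are illustrative; the mechanism (set on render target fill, cleared on the first sampler consumption or on eviction) follows the text:

```python
# Minimal model of the RT-bit classification described above: render
# target fills set the bit; a sampler hit on a set bit counts as an
# inter-stream reuse and clears it; eviction also clears it.
class ReuseClassifier:
    def __init__(self):
        self.rt_bit = {}      # block address -> RT bit
        self.inter = 0        # render target -> texture reuses
        self.intra = 0        # texture -> texture reuses

    def fill(self, addr, stream):
        self.rt_bit[addr] = (stream == "render_target")

    def sampler_hit(self, addr):
        if self.rt_bit.get(addr):
            self.inter += 1   # first sampler touch of a render target block
            self.rt_bit[addr] = False
        else:
            self.intra += 1

    def evict(self, addr):
        self.rt_bit.pop(addr, None)

c = ReuseClassifier()
c.fill(0x100, "render_target")
c.fill(0x140, "texture")
c.sampler_hit(0x100)   # inter-stream: produced as RT, consumed as texture
c.sampler_hit(0x100)   # a subsequent sampler hit is intra-stream
c.sampler_hit(0x140)   # static texture reuse is intra-stream
print(c.inter, c.intra)   # 1 2
```

Note that after the first sampler consumption the block is treated as a texture block, so later hits count as intra-stream reuses, matching the epoch bookkeeping introduced below.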

Figure 5: Hit rates for the texture sampler, render target, and Z accesses in an 8 MB 16-way LLC.

Figure 6: Upper panel: classification of the texture sampler reuses into inter-stream and intra-stream. Lower panel: percentage of the render targets consumed by the texture sampler stream through LLC hits.

Figure 5 presents the LLC hit rates enjoyed by the texture sampler (topmost panel), render target (middle panel), and Z (bottom panel) accesses for three different policies, namely, Belady's optimal policy, DRRIP, and NRU. For the texture sampler stream, Belady's optimal policy experiences an average hit rate of 53.4%. Across the board, DRRIP and NRU deliver significantly lower hit rates averaging at 22.0% and 18.4%, respectively. On the other hand, for the render target accesses, DRRIP delivers an average hit rate (50.1%) that is within 10% of Belady's optimal (59.8% average). In this case, however, NRU lags significantly and delivers an average hit rate of only 41.5%.

The upper panel of Figure 6 shows the texture sampler hits in the LLC classified into inter- and intra-stream hits for Belady's optimal policy, DRRIP, and NRU normalized to the number of texture sampler hits enjoyed by Belady's optimal policy. On average, 55% of all texture sampler hits enjoyed by Belady's optimal policy come from inter-stream reuses, while DRRIP and NRU fall significantly short across the board. The lower panel of Figure 6 further explores these inter-stream reuses by presenting the percentage of blocks with the RT bit set that are consumed by the texture sampler from the LLC. On average, 51% of all render target blocks are consumed by the texture samplers for Belady's optimal policy, while DRRIP and NRU achieve only 16% and 13%, respectively. In Assassin's Creed, the potential render target to texture consumption rate is as high as 90%, but only a small portion of it materializes in DRRIP and NRU (21% and 15%, respectively). For the Z accesses, DRRIP and NRU perform similarly
on average, exhibiting a hit rate of about 58%, while Belady's optimal policy achieves a hit rate of 77.1%. From these results, it is clear that a properly managed large LLC can significantly reduce the volume of DRAM accesses in the GPUs. The largest performance improvement can come from optimizing the texture sampler accesses. In the next section, we delve deeper into the analysis of reuses enjoyed by these three streams within each frame and identify the avenues for improvement.

In summary, DRRIP and NRU are not very efficient in facilitating render target to texture consumption through the LLC. A good policy should manage the render target blocks in such a way that maximizes the texture reuses. Without specific hints from the driver, it is not possible to know which render targets will get consumed as textures in the future. Therefore, our proposal will treat all render targets as potential texture sources and manage them based on the observed probability of inter-stream consumption.

Next, we analyze the intra-stream reuses in the texture sampler accesses. We divide the life of a cache block in the LLC, from the time it is filled to the time it is evicted, into epochs demarcated by the LLC hits the block enjoys. The epoch between hit counts k and k + 1 will be denoted by E_k. The first epoch is E_0, which a block enters after it is filled into the LLC. Also, if a block with the RT bit set gets consumed by the texture sampler, the block becomes a texture block and enters E_0. A block currently in E_k gets promoted to E_{k+1} on observing an LLC hit. Naturally, the set of blocks in E_k is a subset of those in E_{k-1} for all k ≥ 1. The death ratio of epoch E_k is defined as (|E_k| − |E_{k+1}|)/|E_k|. This is the fraction of blocks in E_k that get evicted from the LLC and fail to make it to E_{k+1}. The ratio |E_{k+1}|/|E_k| will be referred to as the reuse probability of E_k. The goal of a good LLC management policy would be to assign high victimization priority to the blocks belonging to the epochs with high death ratios. In the following, we explore the death ratio of the epochs within the texture sampler stream.

The E_2 blocks, in contrast, retain a reasonably high reuse probability (nearly half), with an average death ratio of 0.53, meaning that a randomly picked texture block from E_2 is almost equally likely to enjoy at least one more hit or get evicted from the LLC. In several applications, the reuse probability of the E_2 blocks is higher than half. In summary, a good policy must differentiate between the E_0 and E_1 texture blocks and handle them differently than the blocks belonging to the higher epochs.
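The epoch bookkeeping defined above can be computed offline from per-block hit counts, as in this sketch. The trace below is hypothetical; the only assumption is that a block which received h hits before eviction passed through epochs E_0 through E_h:

```python
# Offline computation of epoch death ratios and reuse probabilities as
# defined in the text: |E_k| counts the blocks that received at least
# k hits, so death ratio of E_k is (|E_k| - |E_{k+1}|) / |E_k|.
from collections import Counter

def epoch_stats(hits_per_block, max_epoch=3):
    counts = Counter()
    for h in hits_per_block:
        for k in range(min(h, max_epoch) + 1):
            counts[k] += 1            # block was a member of E_k
    stats = {}
    for k in range(max_epoch):
        if counts[k]:
            reuse = counts[k + 1] / counts[k]
            stats[k] = {"reuse_probability": reuse,
                        "death_ratio": 1.0 - reuse}
    return stats

# Hypothetical eviction-time hit counts for ten texture blocks.
trace = [0, 0, 0, 0, 1, 1, 2, 2, 3, 4]
s = epoch_stats(trace)
print(s[0]["death_ratio"])   # 4 of 10 blocks never saw a hit
```

With this trace, |E_0| = 10, |E_1| = 6, |E_2| = 4, and |E_3| = 2, so the E_0 death ratio is 0.4 and the E_2 reuse probability is 0.5 — the kind of per-epoch profile the policy discussion above is built on.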
While we can assume the E2 block currently in Ek gets promoted to Ek+1 on observing an blocks to be mostly live, the reuse probabilities of the E0 and LLC hit. Naturally, the set of blocks in Ek is a subset of those E1 texture partitions should be learned dynamically, since they in Ek−1 for all k ≥ 1. The death ratio of epoch Ek is defined as vary widely across the applications, E1 in particular. (|Ek| − |Ek+1|)/|Ek|. This is the fraction of blocks in Ek that To further understand the low texture hit rate of two-bit get evicted from the LLC and fail to make it to Ek+1. The DRRIP, Figure 8 presents the percentage of the render tar- ratio |Ek+1|/|Ek| will be referred to as the reuse probability of get blocks and texture blocks filled by DRRIP into the LLC Ek. The goal of a good LLC management policy would be to with RRPV equal to three. These blocks are predicted to have assign high victimization priority to the blocks belonging to the no reuses in the intermediate- or near-future. While DRRIP epochs with high death ratios. In the following, we explore the correctly learns that the texture blocks have high death ratios death ratio of the epochs within the texture sampler stream. and inserts, on average, 36% of these blocks with RRPV=3, our analysis shows that this percentage needs to be much higher to 1 0.9 prevent the texture blocks from thrashing the LLC. Also, on a 0.8 hit to a texture block, DRRIP promotes it to RRPV zero ex- 0.7 0.6 Epoch>=3 pecting a reuse in the near-future. However, our analysis shows 0.5 Epoch=2 0.4 Epoch=1 that the E1 texture blocks have very low optimal reuse prob- 0.3 Epoch=0 ability (0.27 on average). Turning to the render target blocks, 0.2 0.1 we find that about one quarter of these are filled with RRPV 0
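The epoch accounting defined above can be made concrete with a short sketch. The helper below is illustrative only; the input `hit_counts` (the number of LLC hits each block received between fill and eviction) is a hypothetical trace, not data from the paper:

```python
def death_ratios(hit_counts, max_epoch=3):
    """Per-epoch death ratios from a list of per-block hit counts.

    A block that enjoyed h hits passed through epochs E0..Eh and was
    evicted while in epoch Eh, so |Ek| is the number of blocks that
    received at least k hits.
    """
    hit_counts = list(hit_counts)
    # |Ek| for k = 0 .. max_epoch+1
    population = [sum(1 for h in hit_counts if h >= k)
                  for k in range(max_epoch + 2)]
    ratios = {}
    for k in range(max_epoch + 1):
        if population[k] > 0:
            # death ratio of Ek = (|Ek| - |Ek+1|) / |Ek|;
            # the reuse probability of Ek is the complement |Ek+1| / |Ek|
            ratios[k] = (population[k] - population[k + 1]) / population[k]
    return ratios
```

For example, five blocks with hit counts [0, 0, 0, 1, 2] give death ratios {0: 0.6, 1: 0.5, 2: 1.0}: three of the five blocks die in E0, one of the remaining two dies in E1, and the last dies in E2.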

Figure 7: Upper panel: epoch-wise distribution of the intra-stream texture sampler hits. Lower panel: death ratio of each epoch of the texture blocks. The LLC executes Belady's optimal policy.

The upper panel of Figure 7 shows the epoch-wise distribution of the intra-stream texture hits when the LLC executes Belady's optimal policy. Across the board, most of the intra-stream texture sampler hits come from E0, averaging 79% of all texture sampler hits. A much smaller fraction (15%) of the hits comes from the E1 blocks. For the E2 and E≥3 blocks, this fraction is 4% and 2%, respectively. Since this is the behavior of the optimal policy, we conclude that it is enough to keep track of the first three epochs (0, 1, and 2) of the texture blocks. These data, however, do not offer any information about the reuse probability or death ratio of the individual epochs.

The lower panel of Figure 7 presents the death ratio of each of the first three epochs of the texture blocks in the presence of Belady's optimal policy. Even though we have seen that most texture sampler hits come from the E0 blocks, the death ratio of E0 is extremely high, averaging 0.81. This essentially means that the reuse probability of an E0 texture block is only 0.19. Across the board, we find that the E0 texture blocks have very low (at most 0.3) reuse probability. The death ratio of the E1 blocks is only slightly lower than that of the E0 blocks, averaging 0.73. This means that the E0 blocks that enjoy hits and move to E1 are highly likely to get evicted before enjoying any further hits. For Assassin's Creed, BioShock, DMC, HAWX, and Lost Planet, the death ratio of the E1 blocks is, in fact, higher than that of the E0 blocks. Only the E2 blocks show a reasonably high reuse probability (nearly half) with an average death ratio of 0.53, meaning that a randomly picked texture block from E2 is almost equally likely to enjoy at least one more hit or to get evicted from the LLC. In several applications, the reuse probability of the E2 blocks is higher than half. In summary, a good policy must differentiate between the E0 and E1 texture blocks and handle them differently from the blocks belonging to the higher epochs. While we can assume the E2 blocks to be mostly live, the reuse probabilities of the E0 and E1 texture partitions should be learned dynamically, since they vary widely across the applications, E1 in particular.

To further understand the low texture hit rate of two-bit DRRIP, Figure 8 presents the percentage of the render target blocks and texture blocks filled by DRRIP into the LLC with RRPV equal to three. These blocks are predicted to have no reuses in the intermediate or near future. While DRRIP correctly learns that the texture blocks have high death ratios and inserts, on average, 36% of these blocks with RRPV=3, our analysis shows that this percentage needs to be much higher to prevent the texture blocks from thrashing the LLC. Also, on a hit to a texture block, DRRIP promotes it to RRPV zero, expecting a reuse in the near future. However, our analysis shows that the E1 texture blocks have a very low optimal reuse probability (0.27 on average). Turning to the render target blocks, we find that about one quarter of these are filled with RRPV equal to three. This may be detrimental to the inter-stream reuses because the render targets that have the potential to be consumed by the samplers may get evicted from the LLC early. Our analysis shows that under the optimal policy about half of the render targets are consumed by the sampler on average, and in a few applications this fraction is much higher (lower panel of Figure 6). The render targets should be offered high protection when there is a high probability of render target consumption from the LLC by the texture samplers.

Figure 8: Percentage of the render target and texture fills with RRPV=3 in two-bit DRRIP.
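The RRPV machinery referred to throughout (fills at RRPV two or three, promotion to zero on a hit, victimization at RRPV three with aging) can be sketched for a single set as follows. This is a generic two-bit RRIP model for illustration, not the simulator used in the paper, and the tie-break on the minimum tag merely stands in for the least-way-id rule:

```python
class RRIPSet:
    """One set of a two-bit RRIP cache (RRPV values 0..3)."""

    def __init__(self, ways):
        self.ways = ways
        self.rrpv = {}  # tag -> RRPV

    def access(self, tag, insert_rrpv=2):
        """Returns True on a hit, False on a miss (with fill)."""
        if tag in self.rrpv:
            self.rrpv[tag] = 0            # hit: predict near-future reuse
            return True
        if len(self.rrpv) == self.ways:   # miss in a full set: pick a victim
            while not any(v == 3 for v in self.rrpv.values()):
                for t in self.rrpv:       # age every block until some block
                    self.rrpv[t] += 1     # reaches RRPV three
            # deterministic tie-break (stand-in for least physical way id)
            victim = min(t for t, v in self.rrpv.items() if v == 3)
            del self.rrpv[victim]
        self.rrpv[tag] = insert_rrpv      # fill
        return False
```

In a two-way set, accessing a, b, a, c evicts b: the hit on a promotes it to RRPV zero, so the aging loop drives b to RRPV three first.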

Before closing this section, we present the epoch-wise death ratios for the Z stream blocks under Belady's optimal policy in Figure 9. The trend in the death ratio of the Z blocks differs significantly from that of the texture blocks. The death ratios of the E0, E1, and E2 blocks are 0.61, 0.38, and 0.26, respectively. Since only the E0 Z blocks have a relatively high death ratio, we do not keep track of the epochs for the Z stream. Instead, we maintain the collective reuse probability experienced by all the Z blocks and use it to decide their insertion RRPV.

Figure 9: Death ratio of each epoch of the Z stream blocks with Belady's optimal policy.

3. STREAM-AWARE LLC MANAGEMENT

In this section, we progressively incorporate the observations from the last section to derive three increasingly better LLC management policies for 3D graphics. We partition the LLC accesses into four streams, namely, Z, texture sampler, render target, and the rest. Each LLC access is tagged with the identity of the source cache (Z, texture, render target, or otherwise), but we do not need to store the stream identity of any block (except the render targets) in the LLC. All our policies dedicate sixteen sets in every 1024 LLC sets to learning the various reuse probabilities pertaining to the streams. These sets are identified by simple Boolean functions on the LLC index bits. We will refer to these sets as samples. The samples always execute the two-bit static re-reference interval prediction (SRRIP) policy for all the streams. In other words, a block is filled into a sample with RRPV equal to two. On a hit to a block in a sample, the RRPV of the block is updated to zero. A block with RRPV three is selected as the victim, with ties broken by victimizing the block with the least physical way id. In fact, in all our policies, the victim selection algorithm is the same as this. Since the samples execute the SRRIP policy, they are expected to experience reuse probabilities that are much lower than what Belady's optimal policy could have experienced. These small reuse probabilities detected in the samples must be amplified in the non-samples by controlling the RRPV of the non-sample blocks. Different reuse probability amplification techniques form the crux of our policy proposals, which we discuss next. Table 2 summarizes the activities of the LLC sample sets.

Table 2: Activities of the LLC sample sets
- Update RRPV and select victims based on the SRRIP policy.
- Learn the reuse probabilities of the different graphics streams by updating a few counters on fills and hits to the sample sets. These probabilities are amplified by our policy proposals in the non-sample sets of the LLC.

Our first policy proposal incorporates rudimentary probabilistic caching for the Z and texture sampler streams. Each LLC block is assumed to have an RT bit to identify the render target blocks. This bit is set on a render target access or fill and reset on a texture sampler consumption or LLC eviction. We associate four eight-bit saturating counters with each LLC bank. These will be referred to as FILL(Z), HIT(Z), FILL(TEX), and HIT(TEX). As the names suggest, the FILL(Z) counter counts the number of Z stream fills into the samples. The HIT(Z) counter counts the number of Z stream hits enjoyed by the samples. Similarly, we define the FILL(TEX) and HIT(TEX) counters. The FILL(TEX) counter is also incremented when a texture sampler access gets satisfied from the LLC by a block with the RT bit set. The HIT(TEX) counter is incremented when a texture sampler access gets satisfied from the LLC by a block with the RT bit reset. There is a separate seven-bit counter ACC(ALL) that counts all accesses to the samples. Whenever this counter saturates, the FILL and HIT counters of the Z and texture sampler streams are halved and the ACC(ALL) counter is reset.

When filling a new Z block into a non-sample set of an LLC bank, the reuse probability of the Z stream is checked in that bank. If this probability is below the threshold 1/(t+1), i.e., FILL(Z) > t·HIT(Z), the new block is filled with RRPV of three; otherwise it is filled with RRPV of two. Similarly, when filling a texture block into a non-sample set, if FILL(TEX) > t·HIT(TEX), the block is inserted with RRPV of three; otherwise the texture block is filled with RRPV zero, because filling it with RRPV two hurts performance. All render target blocks are filled into the non-samples with RRPV zero to give them the highest possible protection so that the render target to texture sampler reuses can happen through the LLC. All other blocks are filled with RRPV two. On a hit, the RRPV of the block is updated to zero irrespective of the stream. We will refer to this policy as graphics stream-aware probabilistic Z and texture caching (GSPZTC). Table 3 summarizes all the actions of the LLC controller (except the halving of the counters) relevant to this policy. We discuss the influence of the parameter t in Section 5. Since we choose t to be a power of two, the RRPV computation requires only a left-shift by a constant amount followed by a comparison and a two-to-one multiplexing.

Table 3: LLC actions in the GSPZTC policy
Event | Action
Sample sets
Z fill | RRPV ← 2, FILL(Z)++, ACC(ALL)++
Z hit | RRPV ← 0, HIT(Z)++, ACC(ALL)++
TEX fill | RRPV ← 2, FILL(TEX)++, ACC(ALL)++
RT → TEX hit | RRPV ← 0, FILL(TEX)++, ACC(ALL)++
TEX hit | RRPV ← 0, HIT(TEX)++, ACC(ALL)++
Other fill | RRPV ← 2, ACC(ALL)++
Other hit | RRPV ← 0, ACC(ALL)++
Non-sample sets
Z fill | RRPV ← (FILL(Z) > t·HIT(Z)) ? 3 : 2
TEX fill | RRPV ← (FILL(TEX) > t·HIT(TEX)) ? 3 : 0
RT fill | RRPV ← 0
Other fill | RRPV ← 2
Any hit | RRPV ← 0

To compare with this policy, we derive a graphics stream-aware DRRIP (GS-DRRIP) policy from the thread-aware DRRIP policy [19], which uses set-dueling to decide between the insertion RRPVs of two and three for each of the four streams.
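The non-sample insertion rules of the GSPZTC policy, including the shift-based threshold test, can be sketched as below. The function and parameter names are ours, not the paper's:

```python
def gspztc_fill_rrpv(stream, fill, hit, log2_t=3):
    """Insertion RRPV for a fill into a non-sample set under GSPZTC.

    fill, hit: the per-bank FILL/HIT counters learned in the sample
    sets for this stream. With t = 2**log2_t, the test FILL > t*HIT
    (i.e., reuse probability below 1/(t+1)) reduces to a left shift
    followed by a comparison.
    """
    low_reuse = fill > (hit << log2_t)
    if stream == 'Z':
        return 3 if low_reuse else 2
    if stream == 'TEX':
        return 3 if low_reuse else 0   # RRPV two hurts texture performance
    if stream == 'RT':
        return 0                       # always protect render targets
    return 2                           # all other streams
```

With the default t = 8, a bank that recorded FILL(TEX) = 100 and HIT(TEX) = 10 inserts new texture blocks with RRPV three (100 > 80), while HIT(TEX) = 20 would yield RRPV zero.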
Our second policy proposal refines GSPZTC by incorporating the texture sampler epochs, while keeping everything else unchanged. To keep track of the E0, E1, and E≥2 epochs, we incorporate two state bits with each LLC block. The epochs E0, E1, and E≥2 are denoted by the states 00, 01, and 10. The state 11 replaces the RT bit, i.e., a render target block is identified by the state 11. When a texture sampler access gets satisfied by a render target block in the LLC, its state changes from 11 to 00. Also, when a texture sampler access misses the LLC, the newly filled block is installed with state 00. Every subsequent texture sampler hit to such a block increases the state by one (done through simple logic as opposed to an incrementer) until the state reaches 10, which remains unchanged until the block is evicted from the LLC. The transitions between these states are shown in Figure 10. In a few cases, the LLC may receive a render target access to a block in the state 00, 01, or 10. Such a situation may arise if an existing render target object is reused by the DirectX application for producing a new render target. In these cases, the block transitions to the state 11 and its RRPV is updated according to the RRPV update rule for a render target hit (same as shown in Table 3). These transitions are omitted from Figure 10 for brevity.

Figure 10: Two additional state bits per LLC block.

We replace the FILL(TEX) and HIT(TEX) counters with four saturating counters per LLC bank: FILL(E,TEX) and HIT(E,TEX), each of size eight bits, where E denotes the epoch and can take the values zero and one (as already discussed, we maintain the reuse probabilities of the first two epochs only). The FILL(E,TEX) counter is incremented when a texture sampler request to an LLC sample set causes the accessed block to enter the state 00 or 01, denoting epoch E = 0 or 1, respectively. The HIT(E,TEX) counter is incremented when a texture sampler request to an LLC sample set is satisfied by a block currently in the state 00 or 01, denoting epoch E = 0 or 1, respectively.

When a texture sampler access to a non-sample set of a particular LLC bank causes the accessed block to enter the state 00 (this happens on a texture fill or a render target to texture hit), the RRPV of the block is set to three if FILL(0,TEX) > t·HIT(0,TEX) in that bank; otherwise the RRPV is set to zero. If the accessed block enters the state 01 (this happens only on a texture sampler hit in the state 00), the RRPV of the block is set to three if FILL(1,TEX) > t·HIT(1,TEX); otherwise the RRPV is set to zero. In all other cases, a texture sampler hit updates the RRPV of the block to zero. We will refer to this policy as graphics stream-aware probabilistic Z and texture caching with texture sampler epochs (GSPZTC+TSE). Table 4 summarizes the additional LLC controller actions relevant to this policy on top of the GSPZTC policy. In this table, we have shortened FILL(E,TEX) and HIT(E,TEX) to FILL(E) and HIT(E), respectively. The primary difference of this policy from DRRIP or GS-DRRIP (other than not resorting to set-dueling) is that on a texture sampler hit, it does not always update the RRPV to zero, but deduces the new RRPV from the reuse probability of the new epoch.

Table 4: LLC actions for the texture sampler epochs
Event | Action
Sample sets
RT fill/RT hit | state ← 11
TEX fill | FILL(0)++, state ← 00
RT → TEX hit | FILL(0)++, state ← 00
TEX hit | If state is 00 { HIT(0)++, FILL(1)++, state ← 01 }; Else if state is 01 { HIT(1)++, state ← 10 }; Else { state ← 10 }
Non-sample sets
RT fill/RT hit | RRPV updated as in Table 3, state ← 11
TEX fill | RRPV ← (FILL(0) > t·HIT(0)) ? 3 : 0, state ← 00
RT → TEX hit | RRPV ← (FILL(0) > t·HIT(0)) ? 3 : 0, state ← 00
TEX hit | If state is 00 { RRPV ← (FILL(1) > t·HIT(1)) ? 3 : 0, state ← 01 }; Else { RRPV ← 0, state ← 10 }

The GSPZTC and GSPZTC+TSE policies fill the render target blocks with RRPV zero. This is done to facilitate render target to texture sampler reuses through the LLC. However, such a static policy unnecessarily increases the effective cache occupancy of the render target blocks in situations where such inter-stream reuses are unlikely. Our third policy incorporates a dynamic mechanism to manage the render target blocks on top of GSPZTC+TSE by observing the probability of consuming the render targets as textures. We associate two new eight-bit saturating counters, PROD and CONS, with each LLC bank. The PROD counter is incremented when a render target block is filled into an LLC sample set. The CONS counter is incremented when a texture sampler access to an LLC sample set hits a block in the state 11 (recall that this state identifies a render target block). At this time the state of the block changes to 00, as already discussed in Table 4. The CONS/PROD ratio is an estimate of the probability that a render target block is consumed by the texture sampler from the LLC. The PROD and CONS counters are halved together with the FILL and HIT counters within each LLC bank.

When a render target block is filled into a non-sample set of a particular LLC bank, the inter-stream reuse probability in that bank is consulted. The new block is filled with RRPV three if PROD > 16·CONS, i.e., if the inter-stream reuse probability is below 1/16. The new block is filled with RRPV two if 16·CONS ≥ PROD > 8·CONS; otherwise, i.e., if the render target to texture reuse probability is at least 1/8, the block is filled with RRPV zero. On a hit to a render target block from a render target access (due to blending operations), the RRPV of the block is always updated to zero.

The reuse probability thresholds (i.e., 1/16 and 1/8) need to be small because we detect these reuse probabilities from the LLC samples, which execute SRRIP. The lower panel of Figure 6 shows that even DRRIP, which is expected to be better than SRRIP, has an average inter-stream reuse probability of 0.16. If our policy detects a reuse probability as small as 1/8 in the samples, it tries to amplify this probability in the non-samples by offering the highest possible protection to the render target blocks. If the detected reuse probability is smaller, our policy offers a lower level of protection to the render target blocks. We will refer to this policy as graphics stream-aware probabilistic caching (GSPC). Table 5 summarizes the new LLC controller actions (except the halving of the PROD and CONS counters) relevant to this policy. On top of two-bit DRRIP, this policy requires two state bits per LLC block and eight eight-bit and one seven-bit saturating counters per LLC bank (two for Z, four for texture sampler, two for render target to texture, and one to count all accesses to the LLC sample sets).

Table 5: Additional LLC actions for the GSPC policy
Event | Action
Sample sets
RT fill | state ← 11, PROD++
RT hit (blending) | state ← 11
RT → TEX hit | state ← 00, CONS++
Non-sample sets
RT fill | state ← 11; If PROD > 16·CONS then RRPV ← 3; Else if 16·CONS ≥ PROD > 8·CONS then RRPV ← 2; Else RRPV ← 0
RT hit (blending) | state ← 11, RRPV ← 0
RT → TEX hit | state ← 00, RRPV updated as in Table 4
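The graded render target insertion of the GSPC policy can be sketched as a direct transcription of the three threshold cases; the function name is ours:

```python
def gspc_rt_fill_rrpv(prod, cons):
    """Insertion RRPV for a render target fill into a non-sample set
    under GSPC, from the per-bank PROD/CONS sample counters."""
    if prod > 16 * cons:   # estimated RT-to-texture reuse probability < 1/16
        return 3
    if prod > 8 * cons:    # probability between 1/16 and 1/8
        return 2
    return 0               # probability at least 1/8: maximum protection
```

Note that the second test does not need to re-check 16·CONS ≥ PROD: the first branch has already filtered out that case, which is what makes the hardware comparison chain short.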
4. SIMULATION ENVIRONMENT

We evaluate our proposal on 52 discrete frames captured from eight DirectX game titles and four DirectX benchmark applications. The details of these applications were presented in Table 1. We simulate the rendering of each frame entirely, capturing several distinct phase changes that occur as rendering progresses. The DirectX calls generated while rendering each of these frames are replayed through a detailed GPU simulator. In this paper, we focus on the GPU architectures that dedicate the entire LLC capacity to caching different graphics data, as found in the discrete GPUs. We simulate a high-end GPU with 96 shader cores clocked at 1.6 GHz. Each core has eight thread contexts. Every cycle, one SIMD instruction each from two selected threads is issued to the two parallel ALU pipelines within each core. Each pipeline can execute four-wide single-precision SIMD operations including multiply-accumulates. Each core has a peak throughput of sixteen single-precision floating-point operations every cycle, leading to an aggregate peak throughput of nearly 2.5 TFLOPS across the 96 shader cores. The microarchitecture of each shader core resembles that of the Gen7 core used in Intel's Ivy Bridge GPU [21], but we configure our simulated GPU such that the aggregate peak shader throughput either exceeds or closely matches that of the recent commercial discrete GPUs.2 The shader cores share twelve texture samplers clocked at 1.6 GHz. Each sampler has a throughput of four 32-bit texels per cycle, leading to a peak texture fill rate of 76.8 GTexels/second. We model a three-level texture cache hierarchy with the third-level texture cache being 384 KB 48-way set-associative with 64-byte blocks. In addition to the texture cache hierarchy, we model a 1 KB 16-way vertex index cache, a 16 KB 128-way vertex cache, a 12 KB 24-way HiZ cache, a 16 KB 16-way stencil cache, a 24 KB 24-way render target cache, and a 32 KB 32-way Z cache.

The GPU has a non-inclusive/non-exclusive 8 MB 16-way set-associative LLC. This LLC capacity corresponds to a futuristic discrete GPU with a large read/write cache shared by all graphics data. We expect the LLC capacity of the GPUs to increase in the future, as the LLC is usually more power- and bandwidth-efficient up to a certain capacity compared to the GDDRx DRAM.3 Our default LLC configuration allows caching of all graphics data. We will also explore the impact of not caching the final displayable color data. The LLC has a block size of 64 bytes and runs at 4 GHz (modeled after the 3.9 GHz LLC of the Intel Core i7-3770 processor [18]). The round-trip load-to-use latency of an LLC bank (2 MB per bank) is a minimum of twenty cycles. The large, fast, banked LLC helps absorb a significant amount of the DRAM bandwidth demand. We model a dual-channel DDR3-1600 memory system. The DRAM part is 8-way banked with a burst length of eight and 15-15-15 (tCAS-tRCD-tRP) latency parameters.

For a four-way banked 8 MB 16-way set-associative LLC with 64-byte blocks, our GSPC policy incurs an additional overhead of 32 KB in two state bits per LLC block and 284 bits in saturating counters (see Section 3) on top of the baseline DRRIP policy. This is less than 0.5% of the LLC data array bits.

2 The aggregate peak shader throughput of our simulated GPU exceeds that of the GeForce GTX 760 and closely matches that of the GeForce GTX 780M, both based on the GK104 Kepler architecture from Nvidia.
3 Nvidia has doubled the shared L2 cache capacity in the highest-end Kepler GPUs compared to the Fermi GPUs.

5. SIMULATION RESULTS

We evaluate our proposal in this section. First, we present a detailed performance and hardware overhead analysis in Sections 5.1, 5.2, and 5.3. Next, we evaluate our best proposal on different memory system and GPU configurations in Section 5.4 to understand its sensitivity to changed environments.

5.1 Analysis of LLC Misses

Our policy proposals have a threshold parameter t that must be fixed first. As we have already mentioned in Section 3, we expect 1/(t+1) to be small because this probability is detected from the SRRIP samples. Figure 11 shows the percent change in the number of LLC misses relative to t = 16 as t is varied. We only consider power-of-two values of t to keep the implementation simple. These data are collected for the GSPZTC policy. A positive (negative) change indicates more (fewer) LLC misses. While on average the four values of t experience almost the same number of LLC misses, there are visible losses in a few applications for t = 2 and t = 4. For t = 2, 3D Mark Vantage GT1, Dirt, HAWX, and Unigine suffer at least 2% additional LLC misses relative to t = 16. For t = 4, Dirt and Unigine suffer at least 1% additional LLC misses. On the other hand, Assassin's Creed delivers the best performance for t = 2, saving more than 4% LLC misses relative to t = 16. We use t = 8 in this paper, since it offers the most robust performance across the board.

Figure 11: Sensitivity to parameter t.

Table 6: Evaluated policies
Policy | Description
DRRIP | Dynamic re-reference interval prediction
NRU | Single-bit not-recently-used
SHiP-mem | Memory signature-based hit prediction
GS-DRRIP | Graphics stream-aware DRRIP
GSPZTC | Graphics stream-aware probabilistic Z and texture caching
GSPZTC+TSE | GSPZTC with texture sampler epochs
GSPC | Graphics stream-aware probabilistic caching
GSPC+UCD | GSPC with uncached displayable color
DRRIP+UCD | DRRIP with uncached displayable color

Figure 12 presents the LLC miss counts for several policies normalized to two-bit DRRIP. In addition to NRU, GS-DRRIP, and our proposals (GSPZTC, GSPZTC+TSE, and GSPC), we evaluate SHiP-mem [50] and the impact of not caching the final displayable color data (GSPC+UCD and DRRIP+UCD). Table 6 summarizes all the evaluated policies. Signature-based hit prediction (SHiP) infers, at the time of filling a block into the LLC, whether the block is likely to experience future reuses. Accordingly, the newly filled block is assigned an RRPV of three (no near-future reuse) or two (possible future reuses). This inference about future reuses is carried out by learning the reuse counts observed by different types of signatures associated with the cache block, e.g., the program counter of the instruction that fills the block into the LLC (SHiP-PC), the instruction sequence leading to the instruction that fills the block into the LLC (SHiP-Iseq), and the memory region containing the block (SHiP-mem). For graphics data, it is not possible to associate a program counter or an instruction sequence with every LLC fill because a large number of accesses come from the fixed-function hardware such as the texture samplers as well as the render target blending units and the depth test units. Therefore, we can only evaluate SHiP-mem from this family of policies. In SHiP-mem, as proposed originally, we divide the physical address space into contiguous 16 KB regions. For each region, we learn the count of reuses by hashing a 14-bit region identifier (address bits [27:14]) into a 16K-entry table (per LLC bank) of three-bit saturating counters. On an LLC hit to a block belonging to a particular region, the corresponding region counter is incremented by one. If a block gets evicted from the LLC without experiencing any reuse, the corresponding region counter is decremented by one. A newly filled block is assigned an RRPV of three if the corresponding region counter is zero; otherwise the block is inserted with an RRPV of two.

Figure 12: LLC miss count normalized to DRRIP.

In Figure 12, for each application, we evaluate eight policies: NRU, SHiP-mem, GS-DRRIP, GSPZTC, GSPZTC+TSE, GSPC, GSPC+UCD, and DRRIP+UCD (from left to right). All bars are normalized to two-bit DRRIP. NRU suffers from an average increase of 6.2% in the LLC misses compared to DRRIP. SHiP-mem saves visible amounts of LLC misses in BioShock (3.8%), DMC (6.4%), Lost Planet (1.1%), and Stalker COP (2%), but, on average, experiences the same volume of LLC misses as DRRIP. In most applications, a 16 KB contiguous physical address region contains blocks from different streams and, as a result, it is not possible to decipher the correct behavior of a stream by observing the collective reuse behavior of a region, as done by SHiP-mem. GS-DRRIP is able to save moderate to large amounts of LLC misses, with Assassin's Creed and BioShock being the biggest beneficiaries, enjoying 8.5% and 5.1% fewer LLC misses compared to DRRIP, respectively. On average, GS-DRRIP saves 2.9% LLC misses compared to DRRIP.

Turning to our policy proposals, we observe that GSPZTC is highly effective across the board (except for DMC and Dirt), saving 4.8% LLC misses compared to DRRIP on average. The applications that enjoy more than 5% savings in LLC misses compared to DRRIP are 3D Mark Vantage GT1 (7.6%), Assassin's Creed (15.3%), and Civilization (5.1%). There are two reasons why GSPZTC performs better than GS-DRRIP across the board (except in BioShock, Dirt, and Lost Planet). First, GSPZTC offers better protection to the render targets by filling all render target blocks with RRPV zero. Second, GS-DRRIP often fails to converge to the global optimum due to the nature of the feedback-based dueling that it uses [19, 20]. In certain situations, the dueling algorithm gets stuck in a local optimum.
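For context, the feedback mechanism that GS-DRRIP inherits from DRRIP can be sketched as a generic set-dueling selector. This is the standard PSEL scheme, not the paper's exact implementation: two small groups of leader sets each fix one insertion RRPV, a saturating counter tracks which group misses more, and the follower sets adopt the winner. With one such duel running independently per stream, a duel can settle on a choice that is only locally best.

```python
class DuelingSelector:
    """Generic set-dueling between 'static' (insert at RRPV 2) and
    'bimodal' (insert at RRPV 3) leader sets."""

    def __init__(self, bits=10):
        self.max_psel = (1 << bits) - 1
        self.mid = 1 << (bits - 1)
        self.psel = self.mid            # start undecided

    def miss_in_leader(self, policy):
        # a miss in a leader group is a vote against that group's policy
        if policy == 'static':
            self.psel = min(self.psel + 1, self.max_psel)
        else:  # 'bimodal'
            self.psel = max(self.psel - 1, 0)

    def follower_insert_rrpv(self):
        # high PSEL: the static leaders miss more, so the followers
        # insert with RRPV three; low PSEL: prefer RRPV two
        return 3 if self.psel >= self.mid else 2
```

Because PSEL only sees the aggregate miss counts of the two leader groups, a phase-dependent workload can keep it oscillating around the midpoint or parked on a locally good choice.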

Inclusion of the texture sampler epochs further saves LLC misses significantly across the board (see the GSPZTC+TSE bar). Compared to GSPZTC, the most remarkable gains are enjoyed by 3D Mark Vantage GT1 and GT2, Assassin's Creed, DMC, and Lost Planet. The applications enjoying more than 10% savings in the LLC misses with GSPZTC+TSE compared to DRRIP are 3D Mark Vantage GT1 (11.8%), 3D Mark Vantage GT2 (10%), Assassin's Creed (31.3%), DMC (16.6%), and Lost Planet (11.9%). On average, 11.5% of the LLC misses are saved by the GSPZTC+TSE policy compared to DRRIP.

Our final policy, GSPC, incorporates a dynamic caching algorithm for the render targets. This optimization offers less protection to the render targets in the phases where the probability of render target to texture reuses is low. However, since all of our applications heavily exploit render target to texture reuses (see Figure 6), it is unlikely that this optimization will bring much benefit to this set of applications. As shown in Figure 12, a few applications do benefit from this optimization, DMC and Dirt in particular (see the GSPC bar). Dirt was suffering from an increase in the LLC misses with GSPZTC+TSE, which we are able to eliminate in GSPC. In GSPC, no application suffers from additional LLC misses compared to DRRIP. Assassin's Creed and DMC enjoy more than 20% savings in the LLC misses compared to DRRIP. On average, GSPC saves 11.8% LLC misses.

Uncached displayable color further improves GSPC (see the GSPC+UCD bar), saving additional LLC misses across the board, the most notable being HAWX and Stalker COP. On average, this policy saves 13.1% LLC misses compared to DRRIP. The individual application-wise savings range from 1.7% in Dirt to 29.6% in Assassin's Creed. We note that these impressive savings in the LLC misses come with a negligible bookkeeping overhead that is less than 0.5% of the total data array capacity of the LLC. Not caching the displayable color in DRRIP does not lead to any significant benefits, however (see the DRRIP+UCD bar). Only a few applications enjoy additional savings in LLC misses. These are BioShock (4.2%), DMC (6.7%), Lost Planet (1.1%), and Stalker COP (2%). On average, DRRIP with uncached displayable color experiences the same number of LLC misses as the baseline DRRIP. The primary reason why DRRIP is less sensitive to uncached displayable color is that it already inserts these blocks with RRPV three, while GSPC often fails to learn this and ends up offering the same level of protection to the display target blocks as to the other render target blocks (the displayable color is a render target).

Figure 13: Texture sampler hit rate, render target production to texture consumption rate, render target hit rate, and Z hit rate.

Figure 13 further analyzes the policies in terms of the texture sampler access hit rate, the percentage of the render target blocks consumed by the texture samplers from the LLC, the hit rate of the render target accesses due to blending, and the Z hit rate. These results are averaged over the 52 frames. The texture sampler hit rate and the render target to texture consumption rate gradually increase in GSPZTC and GSPZTC+TSE, as expected. Slight drops in these two dimensions are observed in GSPC due to the introduction of probabilistic render target insertion, which, due to its statistical nature, victimizes some render targets before they can be consumed by the texture samplers. However, this is compensated by uncaching the displayable color data. The render target hit rate (LLC hits enjoyed by the render target accesses originating from blending operations) increases through GSPZTC+TSE and GSPC, since these two policies gradually create more space in the LLC by eliminating the less useful texture and render target blocks early. In fact, the average render target hit rate achieved by GSPC (57.7%) comes very close to what Belady's optimal policy achieves (59.8%, as was shown in Figure 5). The Z hit rate drops visibly in GSPZTC and GSPZTC+TSE compared to GS-DRRIP due to the unnecessarily high LLC occupancy of some of the render target blocks in these two policies, leading to premature eviction of the Z blocks. GSPC is able to address this drawback to a great extent by evicting the less useful render targets early. GS-DRRIP, however, achieves the best Z hit rate among all the policies.

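To make the insertion mechanics discussed above concrete, the following sketch models one set of a two-bit RRIP cache extended with stream-aware, probabilistic insertion. This is a minimal illustration, not the paper's implementation: the stream names and the per-stream RRPV/probability assignments below are hypothetical placeholders, whereas GSPC learns its reuse probabilities online per stream.

```python
import random

RRPV_BITS = 2
RRPV_MAX = (1 << RRPV_BITS) - 1  # 3 for two-bit RRIP

class StreamAwareRRIPSet:
    """One set of a W-way cache using two-bit re-reference
    interval prediction (RRIP) with per-stream insertion."""

    def __init__(self, ways, insert_rrpv, insert_prob):
        self.ways = ways
        self.tags = [None] * ways
        self.rrpv = [RRPV_MAX] * ways
        self.insert_rrpv = insert_rrpv  # stream -> preferred insertion RRPV
        self.insert_prob = insert_prob  # stream -> insertion probability

    def _victim(self):
        # Standard RRIP victim search: evict the first block whose
        # RRPV is RRPV_MAX; if none exists, age every block and retry.
        while True:
            for w in range(self.ways):
                if self.rrpv[w] == RRPV_MAX:
                    return w
            for w in range(self.ways):
                self.rrpv[w] += 1

    def access(self, tag, stream):
        if tag in self.tags:
            # Hit: promote to near-immediate re-reference.
            self.rrpv[self.tags.index(tag)] = 0
            return True
        w = self._victim()
        self.tags[w] = tag
        # Probabilistic, stream-aware insertion: with probability p the
        # block gets the stream's preferred RRPV; otherwise it is
        # inserted at RRPV_MAX and becomes eligible for early eviction.
        if random.random() < self.insert_prob[stream]:
            self.rrpv[w] = self.insert_rrpv[stream]
        else:
            self.rrpv[w] = RRPV_MAX
        return False
```

Setting, say, the render target stream to (RRPV 0, probability 1.0) reproduces GSPZTC-style protection, while a probability below 1.0 mimics GSPC's statistical victimization of the less useful render targets.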
5.2 Analysis of Hardware Overhead

The GSPC policy requires four replacement state bits (two new bits in addition to the two existing RRPV bits) per LLC block. Figure 14 compares the LLC miss counts for four policies with identical replacement state bit overhead. The results are normalized to two-bit DRRIP. In the LRU, four-bit DRRIP, and four-bit GS-DRRIP policies, a block, after experiencing a hit, may take a long time to become eligible for replacement. Further, the LRU policy inserts all blocks with the highest possible protection. This particularly hurts texture management. On average, LRU suffers from a 7.2% increase in the LLC miss count compared to two-bit DRRIP. Four-bit DRRIP is only slightly better than two-bit DRRIP (0.4% on average), while four-bit GS-DRRIP saves 1.7% LLC misses compared to the two-bit DRRIP baseline. GSPC saves 11.8% LLC misses compared to the baseline, on average. These results clearly bring out the fact that GSPC makes efficient use of the additional two state bits and saves an impressive volume of LLC misses with small logic and storage overhead (less than 0.5% additional overhead per LLC block on top of the baseline).

Figure 14: Comparison between iso-overhead policies.

In all our subsequent analyses, we will continue to use two-bit DRRIP and two-bit GS-DRRIP, since they offer much better cost-performance compared to their four-bit counterparts. We will also disable caching of displayable color in the LLC, but will not mention it explicitly in the policy names, i.e., NRU, GS-DRRIP, GSPC, and DRRIP will stand for NRU+UCD, GS-DRRIP+UCD, GSPC+UCD, and DRRIP+UCD, respectively.

5.3 Performance Analysis

An increase in the volume of the LLC hits saves latency as well as improves the average delivered bandwidth of the memory subsystem, since the LLC is more bandwidth-efficient than the DRAM modules. In this section, we explore how the LLC miss savings translate to performance improvement. We measure the performance of each application in terms of the number of frames rendered per second.

Figure 15: Normalized perf. on an 8 MB 16-way LLC.

Figure 15 evaluates the performance of NRU, GS-DRRIP, and GSPC relative to DRRIP on our baseline 8 MB 16-way LLC. On average, NRU delivers 7% worse performance compared to DRRIP. GS-DRRIP fails to convert most of its LLC miss savings into performance gains. On average, GS-DRRIP improves performance by only 0.8% compared to DRRIP. It is important to note that the GPUs traditionally exploit fast thread switching to partially hide the latency of memory accesses. As a result, it is necessary to save a significantly large volume of LLC misses to achieve reasonable performance improvements in graphics applications. GSPC improves performance across the board with gains ranging from 1.8% in Heaven to 18.2% in Assassin's Creed compared to DRRIP. On average, GSPC improves performance by 8% compared to DRRIP and by 16% compared to NRU. Averaged over all frames, GSPC delivers 26.1 frames per second.

Figure 16: Normalized perf. on a 16 MB 16-way LLC.

To understand how our GSPC proposal scales to a bigger LLC, Figure 16 shows the performance of NRU, GS-DRRIP, and GSPC relative to DRRIP on a 16 MB 16-way LLC. The trends are similar to those observed on the 8 MB LLC. NRU loses 3% performance on average compared to DRRIP. It fails to offer any visible gain in any of the applications. GS-DRRIP improves performance by 4% on average compared to DRRIP. However, DMC loses by 11.2% with GS-DRRIP compared to DRRIP. Across the board, GSPC achieves impressive performance improvements compared to DRRIP. While DMC is the only application that suffers from a loss in performance (by 3.5%), some of the significant gainers are Assassin's Creed (by 27%), Lost Planet (by 17.3%), and Stalker COP (by 17.5%). On average, GSPC improves performance by 11.8% compared to DRRIP and by 15.2% compared to NRU. In absolute terms, GSPC delivers an average frame rate of 32.4 frames per second, which translates to a 24.1% improvement over its own performance on an 8 MB LLC.

5.4 Sensitivity Studies

In this section, we evaluate GSPC in two different environments, one with a faster DRAM system and another with a less aggressive GPU. In both the studies, the GPU is equipped with an 8 MB 16-way LLC. The upper panel of Figure 17 shows the performance of NRU and GSPC normalized to DRRIP for an architecture with a dual-channel eight-way banked DDR3-1867 10-10-10 DRAM system. While NRU suffers from an average performance loss of 7%, GSPC continues to improve performance across the board, achieving an average speedup of 7.1% compared to DRRIP. The gains are slightly smaller than with the slower baseline DRAM model, as expected.

Figure 17: Perf. normalized to DRRIP for a DDR3-1867 DRAM system (upper panel) and for a less aggressive GPU (lower panel), using an 8 MB 16-way LLC.

The lower panel of Figure 17 presents the performance of NRU and GSPC normalized to DRRIP for a GPU with 512 shader thread contexts (64 cores × 8 threads per core) and eight texture samplers. Everything else, including the dual-channel eight-way banked DDR3-1600 15-15-15 DRAM system, is left unchanged from our baseline GPU. On this less aggressive graphics processor, NRU suffers from a 5.3% loss in performance compared to DRRIP. GSPC continues to improve performance across the board, achieving an average speedup of 5.9% compared to DRRIP. The less aggressive GPU has internal bottlenecks and therefore, rendering performance is expected to have less sensitivity toward memory subsystem optimizations. Overall, these sensitivity studies clearly show that GSPC is a robust algorithm that continues to deliver significant performance improvements even when the DRAM systems are made faster or in scenarios where the GPU is not very efficient. In addition to these sensitivity results, the last section has already explored the sensitivity of GSPC to the LLC capacity.

6. SUMMARY

We have presented a family of reuse probability-based last-level caching schemes for 3D graphics. This is the first study to explore caching opportunities of 3D graphics workloads in the context of multi-megabyte last-level caches. We present a detailed characterization of the rendering of 52 DirectX frames drawn from eight game applications and four benchmark applications, spanning three different resolutions and two different versions of DirectX. The characterization results show that the depth buffer values, render target colors, and texture sampling requests are the primary contributors to the last-level cache access traffic. Our proposal systematically incorporates optimizations in the caching policies to improve the volume of the last-level cache reuses enjoyed by these access streams. Our best graphics stream-aware probabilistic caching proposal achieves an average performance improvement of 8% compared to two-bit DRRIP on an 8 MB 16-way last-level cache while incurring a negligible book-keeping overhead that is less than 0.5% of the total data array capacity of the LLC. On a 16 MB 16-way last-level cache, the speedup improves further to 11.8%.

7. ACKNOWLEDGMENTS

The authors thank Rakesh Ramesh and Suresh Srinivasan for their help during the early phases of this work, and Aravindh Anantaraman, Ajay Joshi, and Supratik Majumder for useful discussions. This research effort is funded by Intel Corporation.

8. REFERENCES

[1] B. Anderson et al. Accommodating Memory Latency in a Low-cost Rasterizer. In Proceedings of the SIGGRAPH/EUROGRAPHICS Workshop on Graphics Hardware, pages 97–101, August 1997.
[2] L. A. Belady. A Study of Replacement Algorithms for a Virtual-storage Computer. In IBM Systems Journal, 5(2): 78–101, 1966.
[3] E. Catmull. A Subdivision Algorithm for Computer Display of Curved Surfaces. PhD thesis, University of Utah, 1974.
[4] M. Chaudhuri et al. Introducing Hierarchy-awareness in Replacement and Bypass Algorithms for Last-level Caches. In Proceedings of the 21st International Conference on Parallel Architecture and Compilation Techniques, pages 293–304, September 2012.
[5] M. Chaudhuri. Pseudo-LIFO: The Foundation of a New Family of Replacement Policies for Last-level Caches. In Proceedings of the 42nd International Symposium on Microarchitecture, pages 401–412, December 2009.
[6] C. J. Choi et al. Performance Comparison of Various Cache Systems for Texture Mapping. In Proceedings of the 4th International Conference on High Performance Computing in Asia-Pacific Region, pages 374–379, May 2000.
[7] M. Cox, N. Bhandari, and M. Shantz. Multi-level Texture Caching for 3D Graphics Hardware. In Proceedings of the 25th International Symposium on Computer Architecture, pages 86–97, June/July 1998.
[8] M. F. Deering, S. A. Schlapp, and M. G. Lavelle. FBRAM: A New Form of Memory Optimized for 3D Graphics. In Proceedings of the 21st SIGGRAPH Annual Conference on Computer Graphics and Interactive Techniques, pages 167–174, July 1994.
[9] M. Demler. Iris Pro Takes On Discrete GPUs. In Microprocessor Report, September 9, 2013.
[10] M. Doggett. Texture Caches. In IEEE Micro, 32(3): 136–141, May/June 2012.
[11] J. Gaur, M. Chaudhuri, and S. Subramoney. Bypass and Insertion Algorithms for Exclusive Last-level Caches. In Proceedings of the 38th International Symposium on Computer Architecture, pages 81–92, June 2011.
[12] N. Greene, M. Kass, and G. Miller. Hierarchical Z-buffer Visibility. In Proceedings of the 20th SIGGRAPH Annual Conference on Computer Graphics and Interactive Techniques, pages 231–238, August 1993.
[13] Z. S. Hakura and A. Gupta. The Design and Analysis of a Cache Architecture for Texture Mapping. In Proceedings of the 24th International Symposium on Computer Architecture, pages 108–120, May 1997.
[14] M. Harris. Dynamic Texturing. Available at http://developer.download.nvidia.com/assets/gamedev/docs/DynamicTexturing.pdf.
[15] Z. Hu, S. Kaxiras, and M. Martonosi. Timekeeping in the Memory System: Predicting and Optimizing Memory Behavior. In Proceedings of the 29th International Symposium on Computer Architecture, pages 209–220, May 2002.
[16] H. Igehy, M. Eldridge, and P. Hanrahan. Parallel Texture Caching. In Proceedings of the SIGGRAPH/EUROGRAPHICS Workshop on Graphics Hardware, pages 95–106, August 1999.
[17] H. Igehy, M. Eldridge, and K. Proudfoot. Prefetching in a Texture Cache Architecture. In Proceedings of the SIGGRAPH/EUROGRAPHICS Workshop on Graphics Hardware, pages 133–142, August/September 1998.
[18] Intel Core i7-3770 Processor. http://ark.intel.com/products/65719/.
[19] A. Jaleel et al. High Performance Cache Replacement using Re-reference Interval Prediction (RRIP). In Proceedings of the 37th International Symposium on Computer Architecture, pages 60–71, June 2010.
[20] A. Jaleel et al. Adaptive Insertion Policies for Managing Shared Caches. In Proceedings of the 17th International Conference on Parallel Architecture and Compilation Techniques, pages 208–219, October 2008.
[21] D. Kanter. Intel's Ivy Bridge Graphics Architecture. April 2012. Available at http://www.realworldtech.com/ivy-bridge-gpu/.
[22] D. Kanter. Intel's Sandy Bridge Graphics Architecture. August 2011. Available at http://www.realworldtech.com/sandy-bridge-gpu/.
[23] S. Khan, Y. Tian, and D. A. Jiménez. Dead Block Replacement and Bypass with a Sampling Predictor. In Proceedings of the 43rd International Symposium on Microarchitecture, pages 175–186, December 2010.
[24] S. Khan et al. Using Dead Blocks as a Virtual Victim Cache. In Proceedings of the 19th International Conference on Parallel Architectures and Compilation Techniques, pages 489–500, September 2010.
[25] M. Kharbutli and Y. Solihin. Counter-based Cache Replacement and Bypassing Algorithms. In IEEE Transactions on Computers, 57(4): 433–447, April 2008.
[26] M. J. Kilgard. Realizing OpenGL: Two Implementations of One Architecture. In Proceedings of the SIGGRAPH/EUROGRAPHICS Workshop on Graphics Hardware, pages 45–56, August 1997.
[27] A-C. Lai, C. Fide, and B. Falsafi. Dead-block Prediction & Dead-block Correlating Prefetchers. In Proceedings of the 28th International Symposium on Computer Architecture, pages 144–154, June/July 2001.
[28] J. Lee and H. Kim. TAP: A TLP-aware Cache Management Policy for a CPU-GPU Heterogeneous Architecture. In Proceedings of the 18th International Symposium on High Performance Computer Architecture, pages 91–102, February 2012.
[29] H. Liu et al. Cache Bursts: A New Approach for Eliminating Dead Blocks and Increasing Cache Efficiency. In Proceedings of the 41st International Symposium on Microarchitecture, pages 222–233, November 2008.
[30] F. D. Luna. Introduction to 3D Game Programming with DirectX 10. Wordware Publishing Inc.
[31] R. Manikantan, K. Rajan, and R. Govindarajan. Probabilistic Shared Cache Management (PriSM). In Proceedings of the 39th International Symposium on Computer Architecture, pages 428–439, June 2012.
[32] M. Mantor and M. Houston. AMD Graphic Core Next: Low Power High Performance Graphics & Parallel Compute. In Symposium on High-Performance Graphics, August 2011.
[33] R. L. Mattson et al. Evaluation Techniques for Storage Hierarchies. In IBM Systems Journal, 9(2): 78–117, 1970.
[34] S. Molner. Design Tradeoffs in the Kepler Architecture. In Symposium on High-Performance Graphics, August 2012.
[35] S. Morein. ATI Radeon HyperZ Technology. In SIGGRAPH/EUROGRAPHICS Workshop on Graphics Hardware, August 2000.
[36] D. Nehab, J. Barczak, and P. V. Sander. Triangle Order Optimization for Graphics Hardware Computation Culling. In Proceedings of the Symposium on Interactive 3D Graphics and Games, pages 207–211, March 2006.
[37] E. Persson. Depth In-depth. Available at http://developer.amd.com/media/gpu_assets/Depth_in-depth.pdf.
[38] M. Pharr et al. Rendering Complex Scenes with Memory-coherent Ray Tracing. In Proceedings of the 24th SIGGRAPH Annual Conference on Computer Graphics and Interactive Techniques, pages 101–108, August 1997.
[39] T. Piazza. Intel Processor Graphics. In Symposium on High-Performance Graphics, August 2012.
[40] M. K. Qureshi et al. Adaptive Insertion Policies for High Performance Caching. In Proceedings of the 34th International Symposium on Computer Architecture, pages 381–391, June 2007.
[41] M. K. Qureshi and Y. N. Patt. Utility-Based Cache Partitioning: A Low-Overhead, High-Performance, Runtime Mechanism to Partition Shared Caches. In Proceedings of the 39th International Symposium on Microarchitecture, pages 423–432, December 2006.
[42] D. Sanchez and C. Kozyrakis. Vantage: Scalable and Efficient Fine-grain Cache Partitioning. In Proceedings of the 38th International Symposium on Computer Architecture, pages 57–68, June 2011.
[43] A. Schilling, G. Knittel, and W. Strasser. Texram: A Smart Memory for Texturing. In IEEE Computer Graphics and Applications, 16(3): 32–41, May 1996.
[44] A. L. Shimpi. Intel Iris Pro 5200 Graphics Review: Core i7-4950HQ Tested. June 2013. Available at http://www.anandtech.com/show/6993/intel-iris-pro-5200-graphics-review-core-i74950hq-tested.
[45] J. Torborg and J. Kajiya. Talisman: Commodity Real-time 3D Graphics for the PC. In Proceedings of the 23rd SIGGRAPH Annual Conference on Computer Graphics and Interactive Techniques, pages 353–363, August 1996.
[46] Unigine: Real-time 3D Engine. http://unigine.com.
[47] A. Vartanian, J-L. Bechennec, and N. Drach-Temam. Evaluation of High Performance Multicache Parallel Texture Mapping. In Proceedings of the 12th International Conference on Supercomputing, pages 289–296, July 1998.
[48] L. Williams. Pyramidal Parametrics. In Proceedings of the 10th SIGGRAPH Conference on Computer Graphics and Interactive Techniques, pages 1–11, July 1983.
[49] C. Wittenbrink, E. Kilgariff, and A. Prabhu. Fermi GF100 GPU Architecture. In IEEE Micro, 31(2): 50–59, March/April 2011.
[50] C-J. Wu et al. SHiP: Signature-Based Hit Predictor for High Performance Caching. In Proceedings of the 44th International Symposium on Microarchitecture, pages 430–441, December 2011.
[51] Y. Xie and G. H. Loh. PIPP: Promotion/Insertion Pseudo-partitioning of Multi-core Shared Caches. In Proceedings of the 36th International Symposium on Computer Architecture, pages 174–183, June 2009.
[52] M. Yuffe et al. A Fully Integrated Multi-CPU, GPU, and Memory Controller 32 nm Processor. In Proceedings of the International Solid-State Circuits Conference, pages 264–266, February 2011.
[53] 3D Mark Benchmark. http://www.3dmark.com/.