Modular and High Performance Shader Development

Shader Components: Modular and High Performance Shader Development YONG HE, Carnegie Mellon University TIM FOLEY, NVIDIA TEGUH HOFSTEE, Carnegie Mellon University HAOMIN LONG, Tsinghua University KAYVON FATAHALIAN, Carnegie Mellon University at a sucient rate. Many of these commands pertain to binding Modern game engines seek to balance the conicting goals of high ren- shader parameters (textures, buers, etc.) to the graphics pipeline. dering performance and productive software development. To improve CPU This problem is suciently acute that a new generation of real-time performance, the most recent generation of real-time graphics APIs provide graphics APIs [Khronos Group, Inc. 2016; Microsoft 2017] has been new primitives for performing ecient batch updates to shader parameters. designed to provide new primitives for performing ecient batch However, modern game engines featuring large shader codebases have strug- updates to shader parameters. However, modern game engines have gled to take advantage of these benets. The problem is that even though shader parameters can be organized into ecient modules bound to the been unable to fully take advantage of this functionality. pipeline at various frequencies, modern shading languages lack correspond- The problem is that even though shader parameters can now be ing primitives to organize shader logic (requiring these parameters) into organized into ecient modules and bound to the pipeline only at modules as well. The result is that complex shaders are typically compiled to the necessary frequencies, modern shading languages lack corre- use a monolithic block of parameters, defeating the design, and performance sponding primitives to organize shader logic (that requires these benets, of the new parameter binding API. In this paper we propose to parameters) into modules as well. As a result, complex shaders are resolve this mismatch by introducing shader components, a rst-class unit compiled into monolithic programs that access a monolithic block of modularity in a shader program that encapsulates a unit of shader logic of parameters. These parameters are typically updated en masse by and the parameters that must be bound when that logic is in use. We show engines, defeating the design, and performance benets, of the new that by building sophisticated shaders out of components, we can retain parameter binding API. essential aspects of performance (static specialization of the shader logic in use and ecient update of parameters at component granularity) while In this paper we propose a solution to this mismatch by introduc- maintaining the modular shader code structure that is desirable in today’s ing shader components, a rst-class unit of modularity in a shader high-end game engines. program that encapsulates both the shader logic of a feature and the parameters that must be bound when that feature is in use. From CCS Concepts: •Computing methodologies ! Graphics systems and in- the perspective of a shader writer, shader components provide a terfaces; mechanism that can be used to organize large shader codebases. Additional Key Words and Phrases: shading languages, real-time rendering From the perspective of a game engine, shader components provide ACM Reference format: both the logical units (e.g., a specic material, an animation eect, Yong He, Tim Foley, Teguh Hofstee, Haomin Long, and Kayvon Fatahalian. and a lighting model) used to assemble shading features into com- 2017. Shader Components: Modular and High Performance Shader Develop- plete shaders, and also the granularity at which blocks of shader ment. ACM Trans. Graph. 36, 4, Article 100 (July 2017), 11 pages. parameters (needed as inputs to these features) are bound to the DOI: http://dx.doi.org/10.1145/3072959.3073648 graphics pipeline. By supporting shader components directly in a shading language, our system is able to provide services such as 1 INTRODUCTION static checking of shader components against interfaces, specializa- A modern game engine must achieve high performance to render tion of complete shaders to a specic set of components (features) detailed scenes with complex materials and lighting. At the same in use, and generation of shader parameter block interfaces that time, high productivity is required when authoring and maintaining facilitate ecient binding. These services allow a well-written en- the large libraries of shader code that dene materials and lighting gine to retain the benets of modular shader software development, in these scenes. Today, a major challenge limiting the performance while also realizing the low CPU overhead and high rendering per- of real-time rendering systems is the inability for the CPU (even formance promised by modern graphics APIs. a multi-core CPU) to supply the GPU with rendering commands Our specic contributions include: • The design of the shader component concept and its imple- Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed mentation in a shader compiler library. This implementa- for prot or commercial advantage and that copies bear this notice and the full citation tion includes a shading language front-end and host side on the rst page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or APIs that engines use to assemble components into com- republish, to post on servers or to redistribute to lists, requires prior specic permission plete shaders and to allocate and bind blocks of shader and/or a fee. Request permissions from [email protected]. parameters. © 2017 Copyright held by the owner/author(s). Publication rights licensed to ACM. 0730-0301/2017/7-ART100 $15.00 • A reference implementation of a large library of shader com- DOI: http://dx.doi.org/10.1145/3072959.3073648 ponents and a Vulkan-based rendering engine that achieves ACM Transactions on Graphics, Vol. 36, No. 4, Article 100. Publication date: July 2017. 100:2 • Yong He, Tim Foley, Teguh Hofstee, Haomin Long, and Kayvon Fatahalian high rendering performance when using shaders assembled paper will ignore xed-function state and focus only on generating from these components. Specically, we demonstrate that and selecting shader variants; statements made about the handling our design simultaneously achieves modular shader author- of variants can be extended, without loss of generality, to PSOs. ing and ecient shader parameter binding, yielding over 2× lower CPU cost than a system implemented using current 2.2 Composing Shaders from Modules AAA engine best practices. Consider the nested loops from the previous section. The right shader variant to use in the inner loop might depend on contribu- 2 BACKGROUND tions from many sources. The object might contribute animation, 2.1 Organizing a Renderer for Performance tessellation, or other geometric eects. The material might con- In order to achieve high performance, a renderer must minimize tribute logic to fetch and composite texture layers. The render pass changes to GPU state. This state comprises the conguration of itself might contribute logic to evaluate light sources, write to a G- xed-function pipeline stages, compiled shader kernels to be used buer, etc. Figure 1 depicts one way of composing a shader variant by programmable stages, and a set of shader parameter bindings that from these dierent concerns. provide argument values to kernel parameters. For an engine featuring a large library of shaders and eects it Changing parts of the GPU state has costs, in both CPU and is desirable that disparate concerns or features can be dened as GPU time. CPU costs include time spent by the application and composable modules. For example, any given surface shader should driver to prepare and send hardware commands. GPU costs include work with any given light, and vice versa. Dierent modules are possible hardware pipeline stalls waiting for work using an old illustrated as dierent boxes at the left in Figure 1, grouped by state to complete. Both kinds of costs can negatively impact frame concern. rates, depending on whether an application is CPU- or GPU-bound. In other domains, it is common to compose separately-developed Reducing CPU costs during rendering frees CPU resources for other modules dynamically at runtime, by using a level of indirection key engine tasks such as AI, animation, audio, or input processing. (e.g., using function pointers). However, in the context of real-time Every millisecond counts when, e.g., only 11ms are available per shaders, the cost of runtime indirection is usually prohibitive, and frame in VR. almost all shader composition is achieved by static specialization. In order to reduce state changes, engines often organize rendering As illustrated in Figure 1, a typical way for an engine to perform work around dierent rates of change. For example, a rendering specialization is to use the C preprocessor to perform ad hoc code pass can be thought of in terms of a set of nested loops, e.g.: generation. The code for a given feature is guarded with #ifdefs and enabled by a #define. In this case, a shader compiler must be for(v in view) for(m in materials) invoked for each combination of features, to generate a statically for(o in objects) specialized variant that “links” dierent features together. When draw(v, m, o); shipping a game, all required variants are typically compiled ahead The key observation is that some state (e.g., per-object transforma- of time, and stored in a variant database. tions) changes at a higher rate than others (e.g., view parameters). Because specialized variants are generated statically, the desired In order to maximize CPU eciency when changing GPU state, behavior of composing modules dynamically at runtime must be graphics APIs provide mechanisms to group state into objects that achieved by looking up an appropriate variant in the database.

Modular and High Performance Shader Development

GLSL 4.50 Spec

First Person Shooting (FPS) Game

009NAG – September 2012

Advanced Computer Graphics to Do Motivation Real-Time Rendering

Developer Tools Showcase

Real-Time Rendering Techniques with Hardware Tessellation

Slang: Language Mechanisms for Extensible Real-Time Shading Systems

NVIDIA Quadro P620

Nvidia Quadro T1000

A Qualitative Comparison Study Between Common GPGPU Frameworks

Graphics Shaders Mike Hergaarden January 2011, VU Amsterdam

Embedded Solutions Nvidia Quadro Mxm Modules