
Thesis no: BCS-2016-13

Direct3D 11 vs 12: A Performance Comparison Using Basic Geometry

Mikael Olofsson

Faculty of Computing
Blekinge Institute of Technology
SE-371 79 Karlskrona, Sweden

This thesis is submitted to the Faculty of Computing at Blekinge Institute of Technology in partial fulfillment of the requirements for the degree of Bachelor of Science in Computer Science. The thesis is equivalent to 10 weeks of full-time studies.

Contact Information:
Author: Mikael Olofsson
E-mail: [email protected]

University advisor: Stefan Petersson
Dept. of Creative Technologies

Faculty of Computing
Blekinge Institute of Technology
SE-371 79 Karlskrona, Sweden
Internet: www.bth.se
Phone: +46 455 38 50 00
Fax: +46 455 38 50 57

Abstract

Context. Computer-rendered imagery, such as that in computer games, is a field under steady development. To render games, an application programming interface (API) is used to communicate with a graphical processing unit (GPU). Both the interfaces and the processing units are part of this steady development, which pushes the limits of graphical rendering.

Objectives. This thesis investigates whether the Direct3D 12 API provides higher rendering performance than its predecessor, Direct3D 11.

Methods. The method used is an experiment in which a benchmark was developed that renders basic shaded geometry using both of the APIs while measuring their performance. The focus was on testing API interaction and comparing Direct3D 11 against Direct3D 12.

Results. Statistics gained from the benchmark suggest that, in this experiment, Direct3D 11 offered the best rendering performance in the majority of the cases tested, although Direct3D 12 performed better in specific scenarios.

Conclusions. The benchmark gave results that contradict other studies. This could depend on the implementation, software or hardware used. In the tests, Direct3D 12 came closer to its Direct3D 11 counterpart when more cores were used. A platform with more processing cores available to execute in parallel could reveal whether Direct3D 12 offers better performance in that experimental setting. In this study Direct3D 12 was implemented so as to imitate Direct3D 11. If the implementation were further aligned with Direct3D 12 recommendations, other results might be observed. Further study could be conducted to give a better evaluation of rendering performance.

Keywords: DirectX, Direct3D, rendering, performance, geometry

Contents

Abstract

1 Introduction
  1.1 Background
  1.2 Research Question

2 Programming Using Direct3D
  2.1 Rendering Pipeline
    2.1.1 Primitives
    2.1.2 Shader Stages
  2.2 Immediate and Deferred Rendering

3 Method
  3.1 Benchmark
    3.1.1 Application Structure
    3.1.2 Test Definitions

4 Performance Tests
  4.1 Test Parameters
  4.2 Test 1
  4.3 Test 1b
  4.4 Test 2
  4.5 Test 2b
  4.6 Test 3
  4.7 Test 3b
  4.8 Test 4
  4.9 Test 4b

5 Analysis and Discussion
  5.1 Analysis
  5.2 Discussion

6 Conclusions and Future Work

References

Chapter 1 Introduction

This chapter introduces the thesis and motivates the work. It gives the background for the choice of topic and states the research question derived from it.

1.1 Background

"The graphics processing units (GPU) have become important in providing processing power for high performance computing applications" [1]. One example of an application that uses this power to its benefit is a computer game. One of the challenges within this field is to provide users with a pleasing graphical experience. To achieve this, the GPU is used to render geometry and present it to the user. For this to be done, an API (application programming interface) that communicates with the GPU can be used.

This thesis serves as a general benchmark comparison between two APIs. With evolving software it can be important to evaluate performance. The knowledge gained from evaluation could be used to decide whether to adopt a newer API or keep using the existing choice.

Direct3D 12 is the newest version of Direct3D. It is designed to be faster and more efficient than any previous version. In the world of PC gaming, the main program often does most, and sometimes all, of the work. Direct3D 12 aims to make more efficient use of multi-core CPUs. One important factor for a developer choosing an API is that the majority of PC gaming hardware available already supports Direct3D 12. This means that many users will be able to play games developed with Direct3D 12 without the need for additional hardware [2].

Based on previous work done within this area, Direct3D 12 should show a significant improvement over its predecessor. In an article published on the DirectX Developer Blog, results from the 3DMark benchmark show improved CPU utilization and better distribution of work among threads [3].

1.2 Research Question

The goal of this thesis is to measure and compare the performance of two APIs. This goal originates from the question:

Will standard usage of Direct3D 12 have higher rendering performance when compared to an equivalent Direct3D 11 implementation?

In order to evaluate the question, an experiment was conducted. Since the research question revolves around "standard usage", this term needs to be defined. For the purpose of this thesis it is defined as rendering flat shaded geometry using Direct3D according to the recommendations in the MSDN documentation, with the aim of keeping the Direct3D 12 implementation as close as possible to its Direct3D 11 counterpart. Rendering performance is measured as the time required to render the geometry.

To conduct the experiment, a benchmark was developed using the C++ programming language. This benchmark generated the data used to evaluate performance.

Chapter 2 Programming Using Direct3D

This chapter presents the basics of Direct3D. It introduces the basic principles of the rendering pipeline, the parts of it used by the benchmark in the experiment, and the two types of rendering: immediate rendering and deferred rendering.

2.1 Rendering Pipeline

The rendering pipeline refers to all the stages necessary to generate a 2D image given a geometric description of a scene with a positioned and oriented camera [4]. Figure 2.1 shows the stages available in this pipeline and the GPU memory resources. An arrow from the memory resource pool to a stage indicates that the stage can access the resources as input. The Pixel Shader stage and the Output-Merger stage have bidirectional arrows, indicating that both can read from and write to the GPU resources. As seen in the figure, most of the stages pass their output as input to the next stage in the pipeline; for example, the Input-Assembler reads geometric data from the resources and passes it to the Vertex Shader. For the purpose of this study, only a few vital parts are used to render basic shaded geometry; these are shown in Figure 2.2.

Input-Assembler Stage
Vertex Shader Stage
Hull Shader Stage
Tessellator Stage
Domain Shader Stage
Geometry Shader Stage
Stream Output Stage
Rasterizer Stage
Pixel Shader Stage
Output-Merger Stage

Memory Resources (Buffer, Texture, Constant Buffer)

Figure 2.1: Stages of the rendering pipeline [4]

Input-Assembler Stage
Vertex Shader Stage
Rasterizer Stage
Pixel Shader Stage
Output-Merger Stage

Memory Resources (Buffer, Texture, Constant Buffer)

Figure 2.2: Simplified pipeline used in the benchmark application

2.1.1 Primitives

Triangles, lines and points are three basic primitives that can be used to render geometry [5]. These primitives have in common that they can be defined from a number of vertices. Frank D. Luna states that "Mathematically, the vertices of a triangle are where two edges meet; the vertices of a line are the endpoints; for a single point, the point itself is the vertex." [5]. These primitives are the building blocks of 3D programming. The most common primitive in games is the triangle, from which many objects are built. Thus every improvement to the pipeline that allows for more efficient rendering of triangles is valuable for performance in programs that use many objects.

2.1.2 Shader Stages

A shader is a program that is executed in parallel by the multiple cores of the graphics card [5]. These programs can be written to shape the visual output as the application requires. Depending on the implementation, they can be used to render a simple 2D interface or to render 3D objects with advanced lighting techniques.

Vertex Shader

The vertex shader program inputs a vertex and outputs a vertex. What happens during this stage depends on the implementation of the program. A common case is that each input vertex is specified in world space and during the vertex shader stage it is transformed to homogeneous clip space in preparation for rendering the 2D representation of the world.
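The transformation described above can be illustrated with a minimal CPU-side sketch in C++. It is not shader code and not part of the benchmark: the types `Vec4` and `Mat4`, the function names and the row-vector convention (vector times matrix) are assumptions made for illustration. The perspective divide is included only to show what happens to the clip-space position after the vertex shader.

```cpp
#include <cassert>

// Hypothetical helper types for illustration; not taken from the benchmark.
struct Vec4 { float x, y, z, w; };
struct Mat4 { float m[4][4]; };  // row-major, used with row vectors (v * M)

// Multiplies a row vector by a 4x4 matrix: out[col] = sum over row of in[row] * m[row][col].
Vec4 Transform(const Vec4& v, const Mat4& mat) {
    const float in[4] = {v.x, v.y, v.z, v.w};
    float out[4] = {0.0f, 0.0f, 0.0f, 0.0f};
    for (int col = 0; col < 4; ++col) {
        for (int row = 0; row < 4; ++row) {
            out[col] += in[row] * mat.m[row][col];
        }
    }
    return {out[0], out[1], out[2], out[3]};
}

// Typical vertex shader work: world space position to homogeneous clip space,
// using a combined view-projection matrix supplied by the application.
Vec4 WorldToClip(const Vec4& worldPosition, const Mat4& viewProjection) {
    return Transform(worldPosition, viewProjection);
}

// Step performed after the vertex shader: divide by w to leave homogeneous space.
Vec4 PerspectiveDivide(const Vec4& clip) {
    assert(clip.w != 0.0f);
    return {clip.x / clip.w, clip.y / clip.w, clip.z / clip.w, 1.0f};
}
```

With an identity matrix the vertex passes through unchanged, which corresponds to input that is already given in clip space.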

Pixel Shader

The pixel shader program is executed for each pixel fragment. This is the shader step that computes color according to the implementation. A simple implementation could return a constant color or a color based on interpolated vertex attributes. More advanced techniques can also be applied, such as per-pixel lighting, shadows and reflections.

2.2 Immediate and Deferred Rendering

Immediate rendering refers to calling rendering APIs or commands from a Direct3D device, which queues the commands in a buffer that then executes on the GPU [6]. When using deferred rendering, the commands are instead stored in a command buffer that can be played back at some other time. A deferred context is used to record the commands, both for rendering and for state settings, to a command list [6]. Multiple threads can work in parallel with deferred contexts, although each thread needs its own context and command list. Queuing up commands in this fashion generates rendering overhead in Direct3D. The gain is that command lists execute much more efficiently during playback [6].
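The record-then-play-back idea can be sketched without any Direct3D calls. The following C++ example is only a CPU-side model under stated assumptions: `DrawCommand`, `CommandList`, `RecordInParallel` and `PlayBack` are hypothetical names, and summing vertex counts stands in for GPU execution. It shows each thread recording into its own list and a single thread playing the lists back afterwards.

```cpp
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical stand-ins for illustration; these are not Direct3D types.
struct DrawCommand {
    std::size_t startVertex;
    std::size_t vertexCount;
};
using CommandList = std::vector<DrawCommand>;

// Each thread records into its own list, mirroring the rule that every
// thread needs its own deferred context and command list.
std::vector<CommandList> RecordInParallel(std::size_t threadCount,
                                          std::size_t verticesPerThread) {
    assert(threadCount > 0);
    std::vector<CommandList> lists(threadCount);
    std::vector<std::thread> workers;
    for (std::size_t t = 0; t < threadCount; ++t) {
        workers.emplace_back([&lists, t, verticesPerThread] {
            // Thread t only touches lists[t], so no locking is needed.
            lists[t].push_back({t * verticesPerThread, verticesPerThread});
        });
    }
    for (auto& worker : workers) {
        worker.join();
    }
    return lists;
}

// Playback happens later on one thread, one list at a time.
std::size_t PlayBack(const std::vector<CommandList>& lists) {
    std::size_t verticesDrawn = 0;
    for (const auto& list : lists) {
        for (const auto& command : list) {
            verticesDrawn += command.vertexCount;  // stands in for GPU execution
        }
    }
    return verticesDrawn;
}
```

Recording with four threads and 1000 vertices per thread yields four lists that together cover 4000 vertices when played back.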

Direct3D 11 uses the immediate context to play back the generated command lists. Only one command list can be processed at a time. Direct3D 12 instead uses a command queue to handle this. The difference between them is that Direct3D 11 submits commands to the buffer in a single-threaded manner, while Direct3D 12 allows for a multithreaded workload distribution for this task [7].

Chapter 3 Method

This chapter focuses on the experiment conducted to measure the performance of the two APIs. The goal is to gather performance data, using an application that renders basic shaded geometry, in order to evaluate the difference in execution time between Direct3D 11 and Direct3D 12.

3.1 Benchmark

A benchmark application was written to render basic shaded geometry. The application uses either the Direct3D 11 or the Direct3D 12 pipeline to generate a screen image. Each image is filled with a number of points which have a fixed color. Each test uses three variables:

• Thread count
• Number of points rendered
• Which API is used

The primary focus is to study the performance of API interaction when rendering geometry. To keep the number of variables to a minimum, no culling or anti-aliasing of geometry is performed. Triangles are often used to visualize the geometry of objects in games. For this thesis, points were used to render geometry. The choice to use points instead of triangles is based on the fact that, when it comes to rasterization, points are interpreted as though they were composed of two triangles, which use triangle rasterization rules. Conveniently, there is no culling for points, which aligns with the aim of keeping the test at a basic level [8]. The only shader stages used are the vertex and pixel shaders. Their use is intended to be strictly defined. The vertex shader passes data to the pixel shader through the rasterization stage without doing any additional calculations.

Since one of the aims of Direct3D 12 is to use all CPU and GPU cores more efficiently, the application focuses on a multithreaded approach. The reasoning behind a multithreaded implementation is to be able to use more of the CPU's capacity than a single-threaded application would. The concept used to achieve this was simple: the submitted graphical workload was distributed equally among the available threads. No other parts of the project were optimized with multithreading.
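An equal distribution of this kind can be sketched in C++ as follows. The sketch is an assumption about how such a split might be computed, not the benchmark's own code; `VertexRange` and `SplitWorkload` are hypothetical names. Each thread receives a contiguous range of vertices, and any remainder is spread over the first threads so that range sizes differ by at most one.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper type; the name is not from the benchmark.
struct VertexRange {
    std::size_t first;  // index of the first vertex in the range
    std::size_t count;  // number of vertices in the range
};

// Splits totalVertices into threadCount contiguous ranges whose sizes
// differ by at most one vertex.
std::vector<VertexRange> SplitWorkload(std::size_t totalVertices,
                                       std::size_t threadCount) {
    assert(threadCount > 0);
    const std::size_t base = totalVertices / threadCount;
    const std::size_t remainder = totalVertices % threadCount;
    std::vector<VertexRange> ranges;
    ranges.reserve(threadCount);
    std::size_t first = 0;
    for (std::size_t t = 0; t < threadCount; ++t) {
        // The first `remainder` threads take one extra vertex each.
        const std::size_t count = base + (t < remainder ? 1 : 0);
        ranges.push_back({first, count});
        first += count;
    }
    return ranges;
}
```

For example, 400 000 vertices over eight threads gives eight ranges of 50 000 vertices each, while 10 vertices over four threads gives ranges of 3, 3, 2 and 2.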

Direct3D 12 was compared against Direct3D 11 and its threaded counterpart, the deferred context pipeline [9]. Together with the variables defined and the choice of geometry, this serves as a basis for evaluating and answering the research question.

Measurements were logged automatically by the developed benchmark. The values of the variables were defined before the program was run to ensure that both APIs go through the same amount of work. Since the aim is an application that is aligned with the concept of a benchmark tool, user interaction was kept to a minimum. The user specified the variables used for the test before running it and had no further control over the data collection.

3.1.1 Application Structure

The application was designed with simplicity and correctness in mind, following the principles displayed within the MSDN documentation [2]. The main program initializes the Windows interface and handles the basic message loop, while leaving the rest of the processing time for API testing. The purpose of the main loop is to iterate over the tests, whose parameters are defined in a separate file. Each test uses a DirectX variable that represents the API interface. This is initialized with a vertex buffer that matches the size needed for the maximum number of vertices used during each specific test. The variable is used to render and measure the performance of each API. The execution time measured each frame is divided into two parts: one part that populates the command lists, which is referred to as time spent by the CPU, and a second part that covers the execution time of the prepared lists, which is referred to as time spent by the GPU.

For each test the API variable is initialized with the Direct3D 11 version during the first stage, then released and initialized as the Direct3D 12 version for the second stage. After each test the measured times are committed to a file.

Timer Class

The application uses a CPU timer to measure the execution time. The basic concept is a class that uses the QueryPerformanceFrequency function to find the frequency of the performance counter. This is then used together with the QueryPerformanceCounter function to get timestamps and calculate elapsed time. To calculate the time spent on the GPU's execution, additional steps have to be taken. For Direct3D 11, a timer class based on timestamp queries is used in order to force the CPU to wait for the GPU's execution to finish. In Direct3D 12 this is achieved by using a fence in conjunction with the command queue's Signal function. Once it is established that the CPU waits for GPU execution to finish, the CPU timer can be used to measure the execution time.
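A CPU timer of this kind can be sketched portably in C++. The sketch below is not the benchmark's timer class: it swaps the Windows functions QueryPerformanceFrequency and QueryPerformanceCounter for `std::chrono::steady_clock`, which plays the same role of a monotonic high-resolution counter, and the class and member names are hypothetical.

```cpp
#include <cassert>
#include <chrono>

// Portable stand-in for a timer built on QueryPerformanceFrequency and
// QueryPerformanceCounter. The class name and interface are hypothetical.
class CpuTimer {
public:
    using Clock = std::chrono::steady_clock;

    // Stores the timestamp that later measurements are relative to.
    void Start() {
        start_ = Clock::now();
        started_ = true;
    }

    // Returns the milliseconds elapsed since the last call to Start().
    double ElapsedMilliseconds() const {
        assert(started_);
        const std::chrono::duration<double, std::milli> elapsed =
            Clock::now() - start_;
        return elapsed.count();
    }

private:
    Clock::time_point start_{};
    bool started_ = false;
};
```

A caller would invoke `Start()` before populating the command lists and read `ElapsedMilliseconds()` once the work being measured is known to have finished.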

Shaders

The shaders used in the application have been minimized to reduce their impact on performance. During the vertex shader stage the vertices are simply passed on to the rasterization stage, because they already have their position defined in homogeneous clip space coordinates. The pixel shader returns a constant color and does no additional work.

3.1.2 Test Definitions

Each test specifies a number of samples, the increment of vertices per sample and the number of threads to use. In each test the vertex buffer is initialized to the maximum size needed. Rendering starts at zero points, and for each sample the number of vertices to render increases by the chosen increment. To make the measured times more reliable, each test also has a variable that determines the number of times it should be run. An average of the measured times is calculated when the data gathering is complete.

Chapter 4 Performance Tests

This chapter presents the data obtained while running the benchmark on a testing platform. Each test ends with its conclusions. These are drawn from a single test platform and therefore cannot be used to give a general estimate of performance outside the experiment conducted.

4.1 Test Parameters

Test computer specifications:

Component     Description
CPU           Intel(R) Xeon(R) CPU E5-1620 v4 3.50GHz (Four cores)
GPU           NVIDIA GeForce GTX 1080
Driver        Version 368.39
OS            Microsoft Windows 10, Build 10.0.10586.0
DirectX       Microsoft DirectX 11 and Microsoft DirectX 12
Development   Microsoft Visual Studio Community 2015 with Update 1

Every test was executed in windowed mode at a resolution of 800x600. Each test produces the same visual output. During measurement the Present function of the swap chain was disabled to ensure that no synchronization was used when rendering frames.

Sections 4.2 to 4.9 each present one test generated by the benchmark together with the parameters used. Rendering performance is measured in milliseconds. The workload is separated into two categories, CPU and GPU. The CPU part starts measuring at the beginning of the render call and ends when all command lists have been filled with commands; the GPU part measures the execution of the lists on the GPU.
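The test definition from Section 3.1.2 and the CPU/GPU split described above can be combined into a small C++ sketch of a measurement loop. This is an illustration under assumptions, not the benchmark's code: all type and function names are hypothetical, the frame is rendered by a caller-supplied function, and the total time is assumed to be the sum of the CPU part and the GPU part.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical names for illustration; not the benchmark's actual types.
struct TestDefinition {
    std::size_t sampleCount;      // number of samples (vertex counts) to measure
    std::size_t vertexIncrement;  // vertices added per sample
    std::size_t threadCount;      // threads used to fill command lists
    std::size_t repetitions;      // runs averaged per sample
};

struct FrameTime {
    double cpuMs;  // time spent filling the command lists
    double gpuMs;  // time spent executing the lists
};

struct SampleResult {
    std::size_t vertices;
    double meanCpuMs;
    double meanTotalMs;  // assumption: total = CPU part + GPU part
};

using RenderFunction =
    std::function<FrameTime(std::size_t vertices, std::size_t threads)>;

// Runs every sample `repetitions` times and stores the mean times.
std::vector<SampleResult> RunTest(const TestDefinition& def,
                                  const RenderFunction& renderFrame) {
    assert(def.repetitions > 0);
    std::vector<SampleResult> results;
    results.reserve(def.sampleCount);
    for (std::size_t s = 0; s < def.sampleCount; ++s) {
        const std::size_t vertices = s * def.vertexIncrement;  // starts at zero
        double cpuSum = 0.0;
        double totalSum = 0.0;
        for (std::size_t r = 0; r < def.repetitions; ++r) {
            const FrameTime t = renderFrame(vertices, def.threadCount);
            cpuSum += t.cpuMs;
            totalSum += t.cpuMs + t.gpuMs;
        }
        results.push_back({vertices, cpuSum / def.repetitions,
                           totalSum / def.repetitions});
    }
    return results;
}
```

Keeping the render function as a parameter lets the same loop drive either API implementation, so both go through the same amount of work.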

Each test renders zero to 400 000 vertices, in steps of 4000 vertices. This amount was chosen after preliminary tests with both higher and lower counts showed linear patterns. Each base test rendered all vertices with one draw call per thread; complementary tests labeled "b" were executed with the same parameters but with one draw call per vertex instead. This is intended to shift the focus to API interaction. To make measurements more stable, each test was executed 1000 times and the mean of the measured times was used for the results. Since the CPU has four cores with two threads running on each core, tests were limited to a maximum of eight threads.

4.2 Test 1

This test uses one thread to fill the command list, and one draw call for the vertices.

[Figure: two panels, "DX11" and "DX12". x-axis: vertices (0 to 400 000); y-axis: milliseconds (0 to 0.7). Series: D3D11 CPU, D3D11 Total, D3D12 CPU, D3D12 Total.]

Figure 4.1: Test 1, rendered with Direct3D 11 and Direct3D 12

[Figure: one panel, "DX11 vs DX12". x-axis: vertices (0 to 400 000); y-axis: milliseconds (0 to 0.7). Series: D3D11 Total, D3D12 Total.]

Figure 4.2: Test 1, total render time for both APIs

Test 1 Conclusion

In Figure 4.1, Direct3D 11 shows lower CPU execution time overall, while Direct3D 12 shows lower total execution time up to around 144 000 points drawn. Both APIs show linear behavior, with constant time for the CPU portion of the execution, as seen in Figure 4.2.

4.3 Test 1b

This test uses one thread to fill the command list, and one draw call per vertex.

[Figure: two panels, "DX11" and "DX12". x-axis: vertices (0 to 400 000); y-axis: milliseconds (0 to 14). Series: D3D11 CPU, D3D11 Total, D3D12 CPU, D3D12 Total.]

Figure 4.3: Test 1b, rendered with Direct3D 11 and Direct3D 12

[Figure: one panel, "DX11 vs DX12". x-axis: vertices (0 to 400 000); y-axis: milliseconds (0 to 14). Series: D3D11 Total, D3D12 Total.]

Figure 4.4: Test 1b, total render time for both APIs

Test 1b Conclusion

In Figure 4.3, Direct3D 11 shows lower total execution time. Both APIs show linear behavior, as seen in Figure 4.4.

4.4 Test 2

This test uses two threads to fill the command lists, and two draw calls for the vertices.

[Figure: two panels, "DX11" and "DX12". x-axis: vertices (0 to 400 000); y-axis: milliseconds (0 to 0.6). Series: D3D11 CPU, D3D11 Total, D3D12 CPU, D3D12 Total.]

Figure 4.5: Test 2, rendered with Direct3D 11 and Direct3D 12

[Figure: one panel, "DX11 vs DX12". x-axis: vertices (0 to 400 000); y-axis: milliseconds (0 to 0.6). Series: D3D11 Total, D3D12 Total.]

Figure 4.6: Test 2, total render time for both APIs

Test 2 Conclusion

In Figure 4.5, Direct3D 11 shows lower CPU execution time overall, while Direct3D 12 shows lower total execution time up to around 160 000 points drawn. Both APIs show linear behavior, with constant time for the CPU portion of the execution, as seen in Figures 4.5 and 4.6.

4.5 Test 2b

This test uses two threads to fill the command lists, and one draw call per vertex.

[Figure: two panels, "DX11" and "DX12". x-axis: vertices (0 to 400 000); y-axis: milliseconds (0 to 8). Series: D3D11 CPU, D3D11 Total, D3D12 CPU, D3D12 Total.]

Figure 4.7: Test 2b, rendered with Direct3D 11 and Direct3D 12

[Figure: one panel, "DX11 vs DX12". x-axis: vertices (0 to 400 000); y-axis: milliseconds (0 to 8). Series: D3D11 Total, D3D12 Total.]

Figure 4.8: Test 2b, total render time for both APIs

Test 2b Conclusion

In Figure 4.7, Direct3D 11 shows lower total execution time. Both APIs show linear behavior, as seen in Figure 4.8. There is less difference in execution time when compared to test 1b.

4.6 Test 3

This test uses four threads to fill the command lists, and four draw calls for the vertices.

[Figure: two panels, "DX11" and "DX12". x-axis: vertices (0 to 400 000); y-axis: milliseconds (0 to 0.6). Series: D3D11 CPU, D3D11 Total, D3D12 CPU, D3D12 Total.]

Figure 4.9: Test 3, rendered with Direct3D 11 and Direct3D 12

[Figure: one panel, "DX11 vs DX12". x-axis: vertices (0 to 400 000); y-axis: milliseconds (0 to 0.6). Series: D3D11 Total, D3D12 Total.]

Figure 4.10: Test 3, total render time for both APIs

Test 3 Conclusion

In Figure 4.9, Direct3D 11 shows lower CPU execution time overall, while Direct3D 12 shows lower total execution time up to around 224 000 points drawn. Both APIs show linear behavior, with constant time for the CPU portion of the execution, as seen in Figure 4.10.

4.7 Test 3b

This test uses four threads to fill the command lists, and one draw call per vertex.

[Figure: two panels, "DX11" and "DX12". x-axis: vertices (0 to 400 000); y-axis: milliseconds (0 to 6). Series: D3D11 CPU, D3D11 Total, D3D12 CPU, D3D12 Total.]

Figure 4.11: Test 3b, rendered with Direct3D 11 and Direct3D 12

[Figure: one panel, "DX11 vs DX12". x-axis: vertices (0 to 400 000); y-axis: milliseconds (0 to 6). Series: D3D11 Total, D3D12 Total.]

Figure 4.12: Test 3b, total render time for both APIs

Test 3b Conclusion

In Figure 4.11, Direct3D 11 shows lower total execution time. Both APIs show linear behavior, as seen in Figure 4.12. There is less difference in execution time when compared to test 2b.

4.8 Test 4

This test uses eight threads to fill the command lists, and eight draw calls for the vertices.

[Figure: two panels, "DX11" and "DX12". x-axis: vertices (0 to 400 000); y-axis: milliseconds (0 to 1.8). Series: D3D11 CPU, D3D11 Total, D3D12 CPU, D3D12 Total.]

Figure 4.13: Test 4, rendered with Direct3D 11 and Direct3D 12

[Figure: one panel, "DX11 vs DX12". x-axis: vertices (0 to 400 000); y-axis: milliseconds (0 to 1.8). Series: D3D11 Total, D3D12 Total.]

Figure 4.14: Test 4, total render time for both APIs

Test 4 Conclusion

In Figure 4.13, Direct3D 11 shows lower total execution time. Both APIs show linear behavior, as seen in Figure 4.14.

4.9 Test 4b

This test uses eight threads to fill the command lists, and one draw call per vertex.

[Figure: two panels, "DX11" and "DX12". x-axis: vertices (0 to 400 000); y-axis: milliseconds (0 to 6). Series: D3D11 CPU, D3D11 Total, D3D12 CPU, D3D12 Total.]

Figure 4.15: Test 4b, rendered with Direct3D 11 and Direct3D 12

[Figure: one panel, "DX11 vs DX12". x-axis: vertices (0 to 400 000); y-axis: milliseconds (0 to 6). Series: D3D11 Total, D3D12 Total.]

Figure 4.16: Test 4b, total render time for both APIs

Test 4b Conclusion

In Figure 4.15, Direct3D 11 shows lower total execution time, although the two are very similar in the beginning. Both APIs show linear behavior, as seen in Figure 4.16. There is less difference in execution time when compared to test 2b.

Chapter 5 Analysis and Discussion

This chapter contains the analysis and discussion of the tests. The reflections here are based on experience gained from the MSDN documentation and from working with the thesis project.

5.1 Analysis

This thesis focuses on API interaction and rendering of basic shaded geometry. The research question asked whether Direct3D 12 would achieve higher rendering performance than Direct3D 11 when rendering basic shaded geometry. The conclusions from the tests suggest that, in this specific experiment, Direct3D 12 does not automatically offer higher rendering performance. In most cases Direct3D 11 performs better, but there are some cases where Direct3D 12 is ahead.

In the tests that use one draw call per thread, Direct3D 12 executes faster at the lower end of the graphs in all cases except when using eight threads. In the complementary tests, which are intended to shift the focus to API interaction, Direct3D 11 shows better performance in all cases. It is worth noting that this gap decreases significantly when more threads are used, and in test 4b the execution times at the lower vertex counts are nearly the same.
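Because both APIs showed linear behavior, the vertex count up to which one API is faster can be read as the intersection of two straight lines. The C++ sketch below illustrates that calculation only. The coefficients of each line would have to come from fitting the measured data; none are taken from this thesis, and all names are hypothetical.

```cpp
#include <cassert>

// Linear model of total render time: t(n) = interceptMs + slopeMsPerVertex * n.
// Coefficients are placeholders to be obtained from measured data.
struct LinearModel {
    double interceptMs;
    double slopeMsPerVertex;
};

struct Crossover {
    bool exists;      // false if the lines are parallel or cross below zero
    double vertices;  // vertex count at which both models give the same time
};

// Evaluates the model at a given vertex count.
double TimeAt(const LinearModel& model, double vertices) {
    assert(vertices >= 0.0);
    return model.interceptMs + model.slopeMsPerVertex * vertices;
}

// Solves a.intercept + a.slope * n = b.intercept + b.slope * n for n.
Crossover FindCrossover(const LinearModel& a, const LinearModel& b) {
    const double slopeDiff = a.slopeMsPerVertex - b.slopeMsPerVertex;
    if (slopeDiff == 0.0) {
        return {false, 0.0};
    }
    const double n = (b.interceptMs - a.interceptMs) / slopeDiff;
    if (n < 0.0) {
        return {false, 0.0};
    }
    return {true, n};
}
```

A model with a lower intercept but a steeper slope is faster only below the crossover, which is the pattern described for the tests with one draw call per thread.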

5.2 Discussion

This result somewhat contradicts other studies on the subject, which show that Direct3D 12 has given higher performance. One of these studies clearly shows that the CPU time needed is significantly lower in Direct3D 12 than in Direct3D 11. In that study the benchmark Star Swarm, developed by Oxide Games, stress tests API efficiency [10].

One possible reason why this is not evident in this experiment is that the benchmark application developed does not use Direct3D 12 to its full extent. When designing the benchmark, the structure of Direct3D 12 was implemented with the intent to mimic the functionality of Direct3D 11. This is most likely not the optimal case for Direct3D 12. It is even stated that using fences to wait for the previous frame to finish is not best practice [2]. This leaves the CPU waiting and essentially wastes valuable execution time that could be used in other ways.
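The blocking pattern described above can be modeled on the CPU alone. The following C++ sketch is an assumption-laden illustration, not the ID3D12Fence interface and not the benchmark's code: `SimpleFence` and `RenderAndWait` are hypothetical names, and a worker thread stands in for the GPU. It shows the submitting thread doing nothing useful between submission and the signal.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstdint>
#include <mutex>
#include <thread>

// Hypothetical CPU-only model of a fence; not the ID3D12Fence interface.
class SimpleFence {
public:
    // Called by the producer (standing in for the GPU) when all work up to
    // `value` has finished.
    void Signal(std::uint64_t value) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            assert(value >= completed_);  // fence values only move forward
            completed_ = value;
        }
        reached_.notify_all();
    }

    // Blocks the calling thread until the fence has reached `value`.
    void WaitFor(std::uint64_t value) {
        std::unique_lock<std::mutex> lock(mutex_);
        reached_.wait(lock, [&] { return completed_ >= value; });
    }

    std::uint64_t CompletedValue() {
        std::lock_guard<std::mutex> lock(mutex_);
        return completed_;
    }

private:
    std::mutex mutex_;
    std::condition_variable reached_;
    std::uint64_t completed_ = 0;
};

// Submits one "frame" to a worker thread and blocks until it signals.
std::uint64_t RenderAndWait(SimpleFence& fence, std::uint64_t frameIndex) {
    std::thread gpu([&fence, frameIndex] { fence.Signal(frameIndex); });
    fence.WaitFor(frameIndex);  // the CPU does no useful work while waiting here
    gpu.join();
    return fence.CompletedValue();
}
```

An implementation that avoids this stall would let the CPU start preparing the next frame before waiting, at the cost of managing more than one frame's resources at a time.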

An observation made when running the benchmark was that Direct3D 11 utilizes more of the CPU and allocates more threads for the application than Direct3D 12. This could be an indication that the drivers for Direct3D 11 handle the optimization automatically, while in Direct3D 12 the user has to be more specific with resource usage. This reasoning aligns with the observation that differences in performance are less noticeable when using more threads, as the CPU spends less time waiting during the Direct3D 12 execution.

In the primary tests, which use one draw call per thread, Direct3D 12 did execute faster at the lower vertex counts. This could be beneficial when rendering objects with less complex geometry, which is often the case within graphical rendering.

The testing approach for this thesis might not be aligned with the recommended use of the Direct3D 12 API. This could be because Direct3D 12 offers the ability to set several stage settings with the pipeline structure, while the test specified is the bare minimum needed to render shaded geometry. These settings are divided in Direct3D 11, which allows for more basic use. This could mean that more efficiency might be achieved if all stages in the pipeline were necessary.

Chapter 6 Conclusions and Future Work

The conclusion of this experiment is that Direct3D 12 did offer higher rendering performance, but only in specific cases, when rendering basic shaded geometry with a benchmark designed as described in the method of this thesis.

A further conclusion is that, for Direct3D 12 to give higher rendering performance, it may not be as simple as imitating the implementation of a Direct3D 11 application. To use it to its full extent, care must be taken to ensure that the application is implemented with designs that align with the recommended usage of the Direct3D 12 API. With Direct3D 12 the user has more control and more responsibilities. The driver for Direct3D 11 does much for its user, for example offloading the render thread and optimizing resource residency [11].

Knowing that the benchmark application shows performance that contradicts other studies, future work could evaluate whether the benchmark used the Direct3D 12 API in the manner intended. Other factors could be the hardware and software used; studies with the developed benchmark could therefore give other results on a platform other than the one used in this thesis. Considering that multi-core support is one of the aims of Direct3D 12, the study could benefit from tests with more cores available. This would reveal whether, given these test parameters, Direct3D 12 could give higher rendering performance than Direct3D 11 if more work could be done in parallel.

References

[1] K. Karimi, N. Dickson, and F. Hamze, A Performance Comparison of CUDA and OpenCL. Cornell University, 2010. [Online] Available from: http://arxiv.org/abs/1005.2581 Accessed: 22 April 2015.
[2] MSDN Direct3D 12 Programming Guide. Available from: https://msdn.microsoft.com/en-us/library/windows/desktop/dn899121(v=vs.85).aspx Accessed: 20 September 2016.
[3] DirectX Developer Blog. Available from: http://blogs.msdn.com/b/ /archive/2014/03/20/directx-12.aspx Accessed: 20 September 2016.
[4] MSDN. Available from: https://msdn.microsoft.com/en-us/library/windows/desktop/ff476882(v=vs.85).aspx Accessed: 20 September 2016.
[5] F. D. Luna, Introduction to 3D Game Programming with DirectX 11. Dulles: David Pallai, 2012.
[6] MSDN Rendering. Available from: https://msdn.microsoft.com/en-us/library/windows/desktop/ff476892(v=vs.85).aspx Accessed: 20 September 2016.
[7] AMD DirectX 12. Available from: https://msdn.microsoft.com/en-us/library/windows/desktop/ff476892(v=vs.85).aspx Accessed: 20 September 2016.
[8] MSDN Rasterization Rules. Available from: https://msdn.microsoft.com/en-us/library/windows/desktop/cc627092(v=vs.85).aspx Accessed: 20 September 2016.
[9] J. Zink, M. Pettineo, and J. Hoxley, Practical Rendering and Computation with Direct3D 11. A K Peters: CRC Press, 2011.
[10] DirectX 12 Performance Preview. Available from: http://www.anandtech.com/show/8962/the-directx-12-performance-preview-amd-nvidia-star-swarm Accessed: 20 September 2016.
[11] GDC Advanced Rendering with DirectX12. Available from: http://developer.download.nvidia.com/gameworks/events/GDC2016/AdvancedRenderingwithDirectX11andDirectX12.pdf Accessed: 21 September 2016.