CALIFORNIA STATE UNIVERSITY, NORTHRIDGE

Comparative Effectiveness of CPU and GPU Ray Tracing With Parallel Computation

A thesis submitted in partial fulfillment of the requirements

For the degree of Master of Science in Computer Science

By

Dustin Patrick Delmer

May 2017

Copyright © 2017 Dustin Patrick Delmer

The thesis of Dustin Patrick Delmer is approved:

______Dr. Robert McIlhenny Date

______Dr. John Noga Date

______Dr. G. Michael Barnes, Chair Date

California State University, Northridge

In Loving Memory of my Father,

Daniel William Christian Delmer

Table of Contents

Copyright ...... ii

Signature Page ...... iii

Dedication ...... iv

List of Figures ...... vi

Abstract ...... vii

1. Introduction ...... 1

2. Related Work ...... 3

3. Technology Overview ...... 5

3.1. CPU vs GPU ...... 5

3.2. ISPC ...... 5

3.3. CUDA ...... 6

4. Implementation ...... 8

4.1. Ray Tracing Algorithm ...... 8

4.2. C++ Serial Implementation ...... 10

4.3. ISPC Implementation ...... 10

4.4. CUDA Implementation ...... 13

4.5. Dynamic Scene Generation ...... 16

5. Results/Comparison ...... 17

5.1. Hardware ...... 17

5.2. Rendering Results from Default Scene ...... 19

5.3. Dynamic Scene: Sphere Count and Resolution Costs ...... 20

5.4. Data Collection Techniques and Results ...... 21

6. Conclusion ...... 32

References ...... 33

List of Figures

Figure 4.1: Ray tracing figure ...... 9

Figure 4.2: Reflection and refraction ...... 9

Figure 4.3: foreach_tiled loop ...... 11

Figure 4.4: ISPC Multi-core - the task function ...... 12

Figure 4.5: ISPC multi-core - launch[nTasks] ...... 12

Figure 4.6: CUDA - malloc and memcpy ...... 14

Figure 4.7: CUDA - Kernel function ...... 15

Figure 4.8: CUDA - Kernel call ...... 16

Figure 4.9: Code snippet from common/rand_sphere.h ...... 17

Figure 5.1: Default config, 640x480, five static spheres and a light ...... 19

Figure 5.2: Sphere counts (8, 64, 216, 512, 1000) ...... 22

Figure 5.3: Sphere Count vs Time - All Techniques ...... 24

Figure 5.4: Sphere Count vs Time - ISPC Only ...... 25

Figure 5.5: Sphere Count vs Time - CUDA vs ISPC ...... 26

Figure 5.6: Resolution vs Time - All Techniques ...... 27

Figure 5.7: Resolution vs Time - ISPC Only ...... 28

Figure 5.8: Resolution vs Time - CUDA vs ISPC ...... 29

Figure 5.9: Bar Chart Comparison - ISPC vs CUDA ...... 30

Abstract

Comparative Effectiveness of CPU and GPU Ray Tracing With Parallel Computation

By

Dustin Patrick Delmer

Master of Science in Computer Science

In this thesis, a comparison of GPU and CPU based computation using consumer grade hardware and parallel programming languages in a raytracer is presented. The raytracers presented make use of C++, Intel's SIMD CPU language and compiler, ISPC, and Nvidia's GPGPU language/compiler, CUDA. Performance was measured for three levels of image resolution (256², 512², 1024²) and five levels of image complexity (sphere counts: 8, 64, 216, 512, 1000). Image resolution had the greatest impact on performance, while image complexity had a constant effect on performance. As image resolution increased, the parallel GPU solution offered the best results. This thesis discusses the advantages and disadvantages of CPU vs GPU parallel programming.

1. Introduction

Parallel programming has become increasingly relevant over the past decade. Manufacturers have shifted their focus from increasing clock speeds to augmenting the core and thread counts in each generation of new CPUs and GPUs. In this environment, writing applications that take full advantage of these multi-core components has become more and more essential. Several languages and compilers exist to help programmers write code capable of generating efficient parallel programs. In this thesis two of these languages will be discussed: ISPC and CUDA. This thesis will show how these languages can be utilized to augment a raytracer, and will compare and contrast them.

Ray tracing is a technique for generating an image by tracing the path of light through pixels in an image plane and simulating the effects of its encounters with virtual objects. These algorithms are used heavily in 3D rendering, animation, and computer graphics.

ISPC is an Intel developed compiler with extensions for "single program, multiple data" (SPMD) programming [1]. ISPC simplifies the task of spawning multiple parallel program instances across a CPU, providing a thin abstraction layer between the programmer and the hardware to spare the programmer the burden of writing extremely low-level intrinsics to achieve high performance. ISPC methods are exposed to C++ code through *.isph headers, and objects produced by .ispc files are compiled into libraries and executables along with C++ code using conventional C++ compilers.

CUDA is an Nvidia developed GPGPU (general purpose programmability on the graphics processing unit) compiler and language that allows programmers to allocate massively parallel tasks directly to the GPU [2]. Through CUDA, parallel functions can be written in a seemingly serial manner and instanced in many threads, across many blocks, within a GPU. When using CUDA, the CPU's role is allocating memory, copying data, and launching Kernel functions. CUDA programs take advantage of the large number of cores present in modern GPUs to do the bulk of their computation.

This report has six sections. Section 2 presents work related to comparisons of serial and parallel programming. Section 3 presents CPU and GPU architectures and the ISPC and CUDA technologies used in this report. Section 4 presents the implementation of the raytracer in C++, ISPC, and CUDA. Section 5 presents the results of running these implementations with five sphere counts, over three image resolutions. Lastly, section 6 briefly discusses the results.

2. Related Work

Parallel programming is widely used in the world of heavy computation. In this chapter, different works related to ISPC, CUDA, and parallel programming will be discussed.

Gil Rapaport et al. presented a comparison between OpenCL, ISPC, and CHORuS for vectorization performance boosts on modern CPU hardware in C/C++ [8]. CHORuS is a lightweight, static extension to C in which the programmer expresses computations as composable vector operations applied to scalar Kernels. CHORuS has the advantage of not introducing a whole new language, compiler, and file extension the way that ISPC and CUDA do. CHORuS was tested against OpenCL and ISPC using the sample programs shipped with ISPC. Overall, ISPC outperformed OpenCL and CHORuS in the single-core results. Curiously, the authors included multicore implementations of OpenCL and CHORuS, but only a single-core implementation of ISPC.

Chris Gregg et al. presented a comparison between GPU and CPU computing with respect to memory-transfer overhead [9]. This paper places heavy emphasis on the relative cost of memory transfers in GPU based programs and their contribution to total computation time. The paper proposes a taxonomy for categorizing GPU Kernels, where Kernels are GPU functions callable by the CPU. The five proposed categories are: Non-Dependent (ND), Dependent-Streaming (DS), Single-Dependent-Host-to-Device (SDH2D), Single-Dependent-Device-to-Host (SDD2H), and Dual-Dependent (DD). These categories are based on what kind of memory transfer a given application requires. Under this taxonomy, a raytracer would be both Single-Dependent-Host-to-Device and Single-Dependent-Device-to-Host, since it requires the pixels to be transferred from the ray tracing Kernel back to the CPU, and the spheres, or scene, to be transferred to the Kernel, but not back to the CPU.

Stamos Katsigiannis et al. presented work comparing the performance of multicore CPU and GPU algorithms for video compression [10]. This comparison used CUDA for the GPU implementation and C for the CPU implementation. The hardware used was an AMD Phenom II X4 965 CPU and an NVIDIA Tesla C2070 GPU. Though it is not clear whether the CPU implementation was properly vectorized, the GPU implementation showed a 2.8 to 11x speed-up in video encoding and a 3.3 to 21x speed-up in video decoding, a massive difference in performance. It is worth noting that the Tesla C2070 is a very high-end GPU, while the CPU used was not quite as high-end; the results of this work may have been closer if hardware of a similar level had been compared.

3. Technology Overview

3.1 CPU vs GPU computation

In general, computation is done on either the CPU or the GPU. When writing a parallel program, one decision that must be made is where the computation will take place. Not all systems have GPUs; this is something to consider when choosing a target processor. CPUs are generally much less powerful than GPUs when measured in terms of maximum possible floating point computations per second, but CPUs offer more control. It can be difficult to fully utilize a GPU due to the parallel nature of the hardware: a GPU is optimized to run the same instruction on many cores at the same time, and if there aren't enough items to run the instruction on, many cores are left idle. Another disadvantage of the GPU is that memory must sometimes be copied from main memory to the GPU and back between each cycle of computation; this can reduce performance. The flow of memory between the CPU and GPU depends on the algorithm in question [9]. Compared to GPUs, CPUs excel at serial tasks.

3.2. ISPC

To generate the parallel, CPU based raytracer, ISPC was used. ISPC is an open source, Intel developed compiler/language with extensions for "single program, multiple data" (SPMD) programming. ISPC simplifies the task of spawning multiple parallel program instances across a CPU, providing a thin abstraction layer between the programmer and the hardware to spare the programmer the burden of writing extremely low-level intrinsics to achieve high performance [1]. ISPC utilizes the LLVM compiler infrastructure for back-end code generation and optimization [11].

ISPC is tightly coupled with C/C++ code. Code syntax in ISPC is essentially identical to C/C++; this is ideal since the goal of ISPC is to assist developers in writing parallel C/C++ code without intrinsics. In ISPC, functions are exposed to C++ code through ISPC stub headers. A stub header is automatically generated from its parent ISPC file: once compiled, foo.ispc will produce foo.o and foo_ispc.h. This stub header contains the declarations for any functions in the ispc file that include the export keyword in their definitions. Objects produced by *.ispc files are compiled into libraries and executables along with C++ code using conventional C++ compilers. In general, a program that utilizes ISPC has two phases of compilation. The first phase uses the ISPC compiler to generate object files and stub headers from ispc files. In the second phase, the artifacts generated by the ISPC compiler are fed to a C++ compiler along with conventional C++ source code. C++ files can access functions defined in ISPC objects by including their corresponding stub headers. The resulting compiled library or executable yields a program with parallel instructions for the CPU.

ISPC is aware of several hardware architectures. It currently supports the SSE2, SSE4, AVX1, AVX2, AVX-512, and Xeon Phi "Knights Corner" instruction sets [1]. By explicitly passing a specific instruction set to ISPC via the --target compiler flag, ISPC can be further tuned to emit instructions ideal for a machine's particular processor. Changing the instruction set can have a significant impact on a program's performance.

ISPC is used in third party libraries such as Embree [12]. Embree is a collection of high-performance ray tracing Kernels, and is used by multiple high-profile movie studios.

3.3. CUDA

CUDA is an Nvidia developed GPGPU (general purpose programmability on the graphics processing unit) compiler and language that allows programmers to allocate massively parallel tasks directly to the GPU. Through CUDA, parallel functions can be written in a seemingly serial manner and instanced in many threads, across many blocks, within a GPU. When using CUDA, the CPU's role is allocating memory, copying data, and launching Kernel functions. CUDA programs take advantage of the large number of cores present in modern GPUs to do the bulk of their computation.

CUDA is used in the gaming industry for real time physics and other graphical effects, such as explosions, fire, and liquid/gas simulations. CUDA has also found applications in fields unrelated to graphics, such as machine learning and AI. Through CUDA, GPUs can be used to train neural networks, and to run the models those networks generate for classification and prediction in the cloud [13].

4. Implementation

4.1. Ray Tracing Algorithm

As a metric of comparison for the parallel programming languages discussed, a ray tracing algorithm was used. Ray tracing is a technique used to generate a 2D representation of a 3D scene. Ray tracing has been around for decades, but is still used heavily in computer graphics.

In order for a raytracer to generate an image, a scene must contain at least one light, and one object. A simple scene could contain a single sphere, and a single light source. Each element in a scene has several properties:

● center/position - an (x, y, z) coordinate representing the object's position in the scene.

● transparency - a 0 to 1 value representing the level of transparency of the object, 0 being completely transparent and 1 being opaque.

● reflectivity - a 0 to 1 value representing the level of reflectivity of the object. As this value increases, the amount of light that bounces off the surface also increases.

● emission color - an RGB value representing the color emitted by the object. This value is used for light sources.

● surface color - an RGB value representing the color of the surface itself.

These properties, along with a radius, allow us to represent spheres and lights in a 3D scene.
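The object record described above can be sketched as a small C++ struct. This is an illustrative sketch only; the field names and the isLight helper are hypothetical, not the thesis' actual definitions:

```cpp
#include <cassert>

// A minimal sketch of the scene-object record described above.
struct Vec3 { float x, y, z; };

struct Sphere {
    Vec3  center;        // x, y, z position in the scene
    float radius;
    float transparency;  // 0 = completely transparent, 1 = opaque
    float reflectivity;  // higher values bounce more light
    Vec3  surfaceColor;  // RGB color of the surface itself
    Vec3  emissionColor; // RGB color emitted; non-zero only for lights
};

// A light source is simply a sphere with a non-zero emission color.
bool isLight(const Sphere &s) {
    return s.emissionColor.x > 0 || s.emissionColor.y > 0 || s.emissionColor.z > 0;
}
```

With this representation, the simple one-sphere, one-light scene mentioned above is just two Sphere values, one of which has a non-zero emission color.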

Once a scene is defined, we can apply the algorithm to generate an image.

To create an image we must assign a camera, or eye, position within the scene. This position is the origin from which we cast rays to populate the pixel colors of the 2D image that represents our scene from the perspective of the camera. Each cell in the 2D grid represents a pixel in our result, and each pixel is an RGB value. To determine the color of a given pixel, a ray is drawn from the camera, through the pixel, into the scene. If the ray collides with an object, a color value is calculated, and two new rays are generated: a refraction ray and a reflection ray.

Figure 4.1: High level view of a raytracer

A refraction ray is a child of the initial ray; it represents the light that passes through the surface, and its potency is a function of the surface's translucency. A reflection ray is also a child of the initial ray, but is a function of the surface's reflectivity; it represents the light bouncing off the surface. Each of these rays is computed recursively, and each can spawn its own reflection and refraction rays. To prevent these rays from causing an infinite loop, a maximum recursion depth is assigned.

Figure 4.2: Reflection and Refraction
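The collision test at the heart of this process can be sketched with the standard geometric ray-sphere intersection. This is a hedged reconstruction under that standard formulation, not the thesis' exact sphereIntersect() code; the parameter layout is illustrative:

```cpp
#include <cassert>
#include <cmath>

// Geometric ray-sphere intersection: returns true and the nearest
// positive hit distance t if a ray (origin o, normalized direction d)
// hits a sphere (center c, given radius).
bool intersectSphere(float ox, float oy, float oz,
                     float dx, float dy, float dz,
                     float cx, float cy, float cz,
                     float radius, float &t) {
    float lx = cx - ox, ly = cy - oy, lz = cz - oz;  // origin -> center
    float tca = lx * dx + ly * dy + lz * dz;         // projection onto ray
    if (tca < 0) return false;                       // sphere is behind ray
    float d2 = lx * lx + ly * ly + lz * lz - tca * tca; // squared center-to-ray distance
    float r2 = radius * radius;
    if (d2 > r2) return false;                       // ray misses the sphere
    float thc = std::sqrt(r2 - d2);
    t = tca - thc;                                   // near intersection point
    if (t < 0) t = tca + thc;                        // origin inside the sphere
    return true;
}
```

For example, a ray cast from the origin straight at a radius-2 sphere ten units away hits at distance 8 (the near surface).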

For this thesis, a simple serial C++ raytracer was used as a basis from which parallel raytracers written in ISPC and CUDA were created. Specifically, the Scratchapixel raytracer was used [6].

Ray tracing is trivial to parallelize because each pixel in the grid is independent of every other pixel in the grid. This means, in theory, every single pixel could be calculated in parallel, if enough cores were provided. This makes ray tracing an ideal metric for comparing parallel languages.

4.2. C++ Serial Implementation

C++ was the base language used for this thesis. C++ is heavily used in programs and applications that are computationally intensive. Compared to similar object oriented languages such as Java and C#, C++ offers better performance at the cost of higher complexity. The serial implementation of the ray tracing algorithm used in this thesis is single threaded C++.

4.3. ISPC parallel implementation

To create a parallel CPU based raytracer equivalent to the serial C++ one, ISPC was used. The ISPC compiler was obtained from the ISPC website [1]. To start, the original C++ serial renderer was copied to a new directory. In this directory, normalize(), dot(), and cross() functions were added to a spheres.ispc file for vector computation. float3 and sphere struct definitions were also added to spheres.ispc to represent colors, rays, and spheres. Render(), trace(), sphereIntersect(), and mix() were moved from the C++ file to the spheres.ispc file and translated to ISPC.

Figure 4.3: foreach_tiled loop

As seen in Figure 4.3, foreach_tiled was used within the render method to force parallel execution. The heart of ISPC's SPMD vectorization lies in its foreach and foreach_tiled constructs; in this loop, each (x, y) combination represents a single pixel in the image. These constructs create simple loops that spawn parallel program instances, in the form of gangs, across the CPU. In ISPC, a gang is a set of program instances that run concurrently on a single core (SPMD). The size of the gang depends on the target machine's CPU architecture and the instruction set specified when compiling with ISPC. For this thesis the avx2-i32x16 target was used, which forms gangs 16 lanes wide. This means that on a single CPU core, 16 pixels are computed in parallel, each pixel being handled by a single program instance within the gang. The render function was defined with the export keyword so that it could be called from the main C++ file.
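The lane-to-pixel mapping can be visualized with a plain C++ sketch. This is illustrative only, not ISPC: each step of the inner loop stands in for one 16-wide gang shading 16 consecutive pixels on a core (the real foreach_tiled iterates over 2D tiles rather than row segments):

```cpp
#include <cassert>

const int GANG_WIDTH = 16;  // lanes per gang for the avx2-i32x16 target

// Count how many gang-sized work units a width x height image needs,
// assuming the width is a multiple of the gang width.
int countGangs(int width, int height) {
    int gangs = 0;
    for (int y = 0; y < height; ++y)
        for (int x = 0; x < width; x += GANG_WIDTH)  // one gang per 16 pixels
            ++gangs;
    return gangs;
}
```

A 640x480 image therefore decomposes into 19,200 gang executions, each shading 16 pixels in lockstep.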

A second version of the ISPC raytracer was constructed to use multiple cores. This version of the raytracer is very similar to the single-core implementation, but utilizes features from ISPC to further parallelize the code to run on all of the CPU’s cores. This required the introduction of a task method for partitioning the work into smaller pieces.

Figure 4.4: ISPC Multi-core - the task function

On line 188 of Figure 4.4 we can see the task qualifier, which marks a function that can be launched asynchronously as an independent unit of work. In this code block, one render_tile_task covers a 16 by 16 pixel tile, whose index is determined by taskIndex on line 202. Notice that taskIndex is not explicitly set: each launched task is given a unique taskIndex at run-time, and this index is used to map the current 16x16 tile to a unique block of pixels in the image.

Figure 4.5: ISPC multi-core - launch[nTasks]

On line 227 of Figure 4.5, in the multi-core implementation, render_tile_task is called. A task method must be called using launch. The nTasks value is calculated from the number of tiles required to draw the requested image. For example, an image with a 256 by 256 resolution requires a 16 by 16 grid of tiles, or 256 tiles, since each tile contains 16x16 pixels. This means 256 tasks are spawned, each of which can be assigned to one of the CPU's cores.
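The task-count arithmetic above can be sketched as follows; the function and constant names are hypothetical, not the thesis' actual code:

```cpp
#include <cassert>

const int TILE = 16;  // each task renders a 16x16-pixel tile

// Number of tasks to launch: one per tile, rounding up so
// partial tiles at the image edges are still covered.
int nTasks(int width, int height) {
    int tilesX = (width  + TILE - 1) / TILE;
    int tilesY = (height + TILE - 1) / TILE;
    return tilesX * tilesY;
}
```

For a 256x256 image this yields the 256 tasks described above; a 640x480 image would need 40 x 30 = 1200 tasks.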

The launch and sync methods must be defined. Fortunately, ISPC ships with a tasksys.cpp file which can be compiled with an application to define launch and sync methods for scheduling tasks across multiple cores. The tasksys.cpp file, along with Linux pthreads [14], was used to support task delegation across the CPU.

For both the single and multi-core programs, the spheres.ispc file was fed to the ISPC compiler to generate spheres.o and spheres_ispc.h. The spheres_ispc.h stub header was included by the main spheres.cc file, giving spheres.cc access to the render function and spheres struct.

The sphere and light definitions were left in the main spheres.cc file, but were re-worked slightly for compatibility with the ISPC defined sphere struct. These spheres, along with three float vectors representing the red, green, and blue channels of the image pixels, were fed to the ISPC render function. The spheres.cc file was compiled against spheres.o and spheres_ispc.h using gcc to generate the final program.

To measure the computation time, the timing.h header included with the ISPC distribution was used. This header was used in all of the raytracers to measure the execution times fairly.

4.4. CUDA Implementation

The CUDA implementation of the ray tracing algorithm differs from the ISPC implementation in a few ways. One difference is the extra steps required for copying memory to and from the GPU. This was done using cudaMalloc, cudaMemcpy, and cudaFree.

Figure 4.6: CUDA - malloc and memcpy

Nearly all CUDA programs use these three functions. cudaMalloc allocates memory on the GPU; in this thesis it was used to allocate the red, green, and blue arrays representing the pixel RGB values, shown on lines 366-368 of Figure 4.6. cudaMemcpy copies memory between the host (CPU) and the device (GPU); in this thesis it was used to copy the spheres from the CPU to the GPU (line 379) and to copy the resolved pixels from the GPU back to the CPU (lines 386-388). cudaFree releases memory allocated to the program on the GPU; in this thesis it was used to free the GPU RGB pixel arrays.

The parallelization of the CUDA implementation lies in its Kernel function. The Kernel is the function launched by the CPU that runs on the GPU; in this ray tracing implementation, the Kernel is the render function.

Figure 4.7: CUDA - Kernel function

The Kernel function is always marked with the __global__ qualifier, which makes it visible to the CPU. Within the render Kernel, the position of the pixel being rendered by a given thread is derived from the block index and thread index. As seen in Figure 4.7, the x position of a thread's pixel is computed as the block index times the block dimension, plus the thread index. For example, if a thread is the third thread in the third block along the x dimension, then its block index would be 2 and its thread index would be 2. If these blocks were 4 threads wide along the x dimension, the equation would yield an x position of 2 * 4 + 2 = 10. The y position is determined in a very similar manner. Using these positions, each thread can be assigned a unique pixel. A Kernel function call is easily identified by the "<<<" and ">>>" marks that accompany it (line 303 in Figure 4.8).
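The index arithmetic above can be written out in plain C++ for illustration (in a real Kernel, CUDA supplies blockIdx, blockDim, and threadIdx implicitly; here they are ordinary parameters):

```cpp
#include <cassert>

// Pixel coordinate along one axis for a given CUDA thread:
// blockIdx * blockDim + threadIdx. The same formula is applied
// independently to the x and y axes.
int pixelCoord(int blockIdx, int blockDim, int threadIdx) {
    return blockIdx * blockDim + threadIdx;
}
```

Plugging in the worked example from the text, the third thread of the third block in 4-wide blocks lands on pixel column 10.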

Figure 4.8: CUDA - Kernel call

The values inside these triple angle brackets represent the grid dimensions and block dimensions. The gridSize argument tells the Kernel how many threading blocks to spawn; in this dimension notation, a gridSize of (2, 2, 2) would generate a 2x2x2 cube of thread blocks. Similarly, the blockSize dimensions tell the Kernel how many threads should be spawned within each threading block. The total number of threads generated can be computed by taking the product of all six dimensions (the three integers composing the gridSize and the three composing the blockSize).

Once the render Kernel is called, threads are spawned and run across the GPU in parallel. When the render Kernel finishes, the GPU RGB arrays are copied back to the CPU, and the resulting image is drawn.

The CUDA implementation of the ray tracing algorithm was built from a single source file, sphere.cu. This file was compiled with NVCC, the NVIDIA CUDA Compiler. NVCC uses a host C++ compiler under the hood and hides much of the compilation from the user. Overall, the CUDA implementation felt higher level and easier to work with.

4.5 Dynamic Scene Generation

For data collection purposes, all three implementations of the program allow the user to assign the resolution of the image (width and height), as well as the sphere count within the image. A common C++ header was created and shared between all three implementations for the generation of random spheres within the scene. To place spheres within the scene efficiently, a viewing-frustum calculation assists in the placement of the randomly generated spheres.

Figure 4.9: Code snippet from common/rand_sphere.h

The viewing frustum calculation on lines 14 and 15 of Figure 4.9 allows the rand_sphere method to place spheres within the viewing frustum of the camera; no randomly generated sphere will appear off screen. The camera sits at (0, 0, 0), and the z variable represents the distance of the randomly generated sphere from the camera in the Z direction (depth). The x and y variables similarly represent positions along the X and Y axes, and are bounded by the frustum calculation.
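The placement idea can be sketched as follows. This is a hedged reconstruction, not the thesis' actual rand_sphere.h: it assumes a camera at the origin looking down -Z, a square image, and a symmetric frustum, so a point at depth z stays on screen while |x| and |y| are below z * tan(fov/2). All names are hypothetical:

```cpp
#include <cassert>
#include <cmath>
#include <cstdlib>

struct Pos { float x, y, z; };

// Half-width of the visible region at depth z for a given field of view.
float frustumHalfWidth(float z, float fovDegrees) {
    return z * std::tan(fovDegrees * 0.5f * 3.14159265f / 180.0f);
}

// Pick a random on-screen position between depths zNear and zFar.
Pos randSpherePos(float zNear, float zFar, float fovDegrees) {
    float z = zNear + (zFar - zNear) * (std::rand() / (float)RAND_MAX);
    float half = frustumHalfWidth(z, fovDegrees);
    float x = -half + 2 * half * (std::rand() / (float)RAND_MAX);
    float y = -half + 2 * half * (std::rand() / (float)RAND_MAX);
    return {x, y, -z};  // negative Z: in front of the camera
}
```

Bounding x and y by the half-width at the sphere's own depth is what keeps every randomly generated sphere inside the camera's view.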

5. Results

5.1. Hardware Used

All compilation and run-time results gathered for this thesis were done on one machine. Because this thesis is comparing relative performance of GPU and CPU based programs, it is important to discuss the hardware used.

CPU

Name: Intel Core i7-4790K
Cores: 4
Threads: 8
Processor Base Frequency: 4.00 GHz
Cache: 8 MB SmartCache
Instruction Set Extensions: SSE4.1/4.2, AVX 2.0
Original Retail Price: $339-350 (Q2'14)

Table 5.1: CPU specs

GPU

Name: Gigabyte GeForce GTX 970 G1 Gaming
Memory: 4 GB GDDR5
GPU Clock: 1.178 GHz
Memory Interface: 256-bit
Original Retail Price: $329 (September 2014)

Table 5.2: GPU specs

Neither the CPU nor the GPU was overclocked for the tests performed. As shown in the tables above, the CPU and GPU had similar price points and were released within a few months of each other. Note the clock speed discrepancy between the two processors: 4.00 GHz vs 1.178 GHz, nearly a 4:1 ratio in favor of the CPU. The GPU makes up for this with the sheer number of cores it can run in parallel.

5.2. Rendering Results from Default Scene

To compare the rendering times of the three approaches (serial C++, ISPC, and CUDA), the timing header included with the ISPC distribution was used. The timer was started immediately before and stopped immediately after the render call in the C++ and ISPC implementations. The unit of time used by this header is clock cycles, measured in millions. In the CUDA implementation, the cudaMemcpy calls were included in the render time, since that step is not necessary in the pure CPU based implementations. All of the raytracer implementations were set to a maximum recursion level of 5. This means at most 2⁵ = 32 reflection and refraction rays were spawned for each pixel.

Figure 5.1: Default config, 640x480, five static spheres and a light

          C++ Serial   ISPC single-core   ISPC multi-core     CUDA
Trial 1     2627.837             91.012            21.078   15.673
Trial 2     2635.000             90.836            20.261   15.731
Trial 3     2614.023             89.467            19.073   15.824
Trial 4     2611.201             90.802            20.401   15.730
Trial 5     2612.016             88.417            20.129   15.765
Average     2620.015             90.107            20.188   15.745

(All values are clock cycles, in millions.)

Table 5.3: Rendering times for the default scene.

As seen in Table 5.3, both ISPC and CUDA showed significantly faster rendering times than the serial C++ implementation. The ISPC single-core times were on average 29 times faster than the serial program (2620.015 / 90.107). The ISPC multi-core results were about 130 times faster. The CUDA times were about 166 times faster than the average serial C++ times, so CUDA ran in roughly three quarters of the time of multi-core ISPC. Again, these times include the overhead of copying the pixel and sphere arrays to and from the GPU.
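The speedup ratios quoted above can be recomputed directly from the Table 5.3 averages (millions of clock cycles):

```cpp
#include <cassert>

// Speedup of a parallel implementation over the serial baseline,
// using the average cycle counts from Table 5.3.
double speedup(double serialCycles, double parallelCycles) {
    return serialCycles / parallelCycles;
}
```

Using the averages, 2620.015 / 90.107 ≈ 29 for single-core ISPC, 2620.015 / 20.188 ≈ 130 for multi-core ISPC, and 2620.015 / 15.745 ≈ 166 for CUDA.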

5.3 Dynamic Scene: Sphere Count and Resolution Costs

To analyze the ISPC, CUDA, and Serial approaches further, features were added to the programs to allow for variable resolution and sphere counts.

Increasing the number of spheres in the scene increases the amount of work done by the raytracer in two ways. The first computational cost is paid when the trace method of the algorithm checks for intersections: each ray drawn must be checked against each sphere in the scene to see whether an intersection is present. Because of this, the cost of finding each ray's intersection with the scene increases linearly with the number of spheres in the scene.

The second affected portion of the ray tracing algorithm is the number of rays spawned. For each pixel in the generated image, the algorithm draws at least one ray. If this initial ray intersects with a sphere, two new rays are drawn based on the refraction and reflection angles and recursively fed back into the trace method. Because the algorithm is capped at five levels of recursion, each pixel in a generated image is the product of up to 1 + 2⁵ rays. As the amount of free volume in the scene decreases, the probability of a spawned ray intersecting with a sphere and recursing increases. For these dynamically generated scenes all spheres were set to a fixed radius of 2, but were allowed to overlap. Even so, it is easy to see that the probability of a ray intersecting with a sphere increases as the number of spheres increases.

The cost of increasing the resolution is more obvious: as the resolution increases, the number of pixels increases. An image with a resolution of 256x256 has four times as many pixels as a 128x128 image, and therefore ends up with roughly four times the workload.

5.4 Data Collection Techniques and Results

For the analysis, rendering times were gathered from each of the four implementations, at three different resolutions, using five different sphere counts. The resolutions used were 256x256, 512x512, and 1024x1024. Each step in resolution yields a 4x increase in pixel count, since both the width and height double. The sphere counts tested were 8, 64, 216, 512, and 1000. Each resolution was tested with each sphere count, for all four implementations of the raytracer.

To gather data with as little interference from outside processes as possible, the tests were executed overnight through Ubuntu's virtual console, with lightdm (Ubuntu's display manager) disabled. A light wrapper script was written to run the serial, ISPC single-core, ISPC multi-core, and CUDA implementations with each configuration. Ten trials were run for each combination, and the results of each trial were piped to a corresponding text file (e.g., ./results/cuda/512x512_64.txt).

The text files generated by these test renders were parsed and averaged. The data gathered was used to generate the graphs and tables below.

Figure 5.2: Sphere counts (8, 64, 216, 512, 1000)

Figure 5.2 shows images generated at each of the five sphere count levels. These particular samples were done using the CUDA implementation, at a resolution of 512x512. Despite the similarity in appearance between the 216, 512, and 1000 sphere count images, there is a significant difference in their render times.

Figure 5.3: Sphere Count vs Time - All Techniques

Figure 5.3 plots the average render times for serial C++, ISPC (single-core), ISPC (multi-core), and CUDA, measured at 512 by 512 pixels, with sphere counts of 8, 64, 216, 512, and 1000. Because the ISPC and CUDA implementations are so much faster than the serial implementation, their relative performance is not really visible in this figure. It is immediately clear, however, that both ISPC and CUDA offer huge performance gains over non-vectorized, serial C++ code.

Figure 5.4: Sphere Count vs Time - ISPC Only

Figure 5.4 uses the same axes and data as Figure 5.3, but isolates the single-core and multi-core ISPC implementations of the raytracer. As expected, the multi-core implementation outperforms the single-core implementation. An interesting takeaway from this graph is that the multi-core implementation is about four times faster than the single-core implementation. This is satisfying: the CPU used for this thesis has four cores, so in an optimal scenario a multi-core program should be about four times faster than a single-core program on this CPU.

Figure 5.5: Sphere Count vs Time - CUDA vs ISPC

Figure 5.5 isolates the multi-core ISPC implementation and the CUDA implementation, using the same data referenced in Figure 5.3. This graph gives an idea of how closely CUDA and ISPC perform at 512 by 512 pixels. At all five sphere counts, ISPC and CUDA are quite similar.

At sphere counts of 64 and above, CUDA maintains roughly a 10-25% lead over ISPC at this resolution; at 8 spheres the two are within a few percent of each other (see Table 5.4 for reference).
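The size of CUDA's lead can be computed from the 512x512 rows of Table 5.4, here measured as (ISPC − CUDA)/ISPC. Note the 8-sphere scene, where CUDA is actually marginally slower:

```python
# 512x512 multi-core ISPC and CUDA times from Table 5.4 (millions of cycles).
ispc = {8: 12.38, 64: 314.37, 216: 1104.38, 512: 2017.97, 1000: 3289.21}
cuda = {8: 12.77, 64: 234.70, 216: 831.97, 512: 1660.32, 1000: 2899.22}

for spheres in ispc:
    # Positive percentage => CUDA finished faster than multi-core ISPC.
    lead = (ispc[spheres] - cuda[spheres]) / ispc[spheres] * 100
    print(f"{spheres:4d} spheres: CUDA lead {lead:+.1f}%")
```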

Figure 5.6: Resolution vs Time - All Techniques

In Figure 5.6, three resolutions are compared: 256 by 256, 512 by 512, and 1024 by 1024 pixels. In Figures 5.6 through 5.8, the sphere count is fixed at 512. This figure plots all four raytracers, similar to Figure 5.3, and like Figure 5.3 it mostly serves to show the gap between the C++ serial implementation and the parallel ISPC/CUDA implementations. The following figures isolate the parallel implementations.

Figure 5.7: Resolution vs Time - ISPC Only

Figure 5.7 isolates the ISPC implementations from Figure 5.6. It maintains the roughly 4:1 performance ratio between the multi-core and single-core ISPC implementations seen in Figure 5.4.

Figure 5.8: Resolution vs Time - CUDA vs ISPC

Figure 5.8 isolates the ISPC and CUDA implementations of the raytracer from Figure 5.6. Here we can see CUDA pull away from ISPC as the pixel count increases. At 1024 by 1024 pixels, with 512 spheres, ISPC takes about 50% longer than CUDA to complete the render (7388.55 vs. 5045.91 million clock cycles, from Table 5.4). In Figure 5.5, increasing the sphere count had very similar effects on ISPC and CUDA, but here we can see that as the resolution increases, CUDA becomes more favorable.
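The widening gap is easy to quantify from the 512-sphere rows of Table 5.4 (values copied from the table); the ISPC/CUDA time ratio grows from near parity at 256x256 to roughly 1.5 at 1024x1024:

```python
# Multi-core ISPC vs CUDA at 512 spheres, across resolutions (Table 5.4).
ispc = {256: 640.60, 512: 2017.97, 1024: 7388.55}
cuda = {256: 599.99, 512: 1660.32, 1024: 5045.91}

for res in ispc:
    # Ratio > 1 means ISPC took longer than CUDA at this resolution.
    print(f"{res}x{res}: ISPC/CUDA time ratio {ispc[res] / cuda[res]:.2f}")
```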

Figure 5.9: Bar Chart Comparison - ISPC vs CUDA

In Figure 5.9, every resolution-sphere combination tested is shown for ISPC (multi-core) and CUDA. Here we can see that the columns are almost identical at 256 by 256 pixels. ISPC maintains similar performance through the 512 by 512 trials, but at 1024 by 1024 CUDA is the clear victor. It is also apparent that the race between ISPC and CUDA is not affected much by the sphere count: as the sphere count increases, the percentage difference between ISPC and CUDA remains fairly static.

Resolution - Spheres        Serial    ISPC-single    ISPC-multi       CUDA
256x256 - 8                 307.44        13.9443          5.51       3.72
256x256 - 64               9917.03         397.76         81.02      80.64
256x256 - 216             40457.18        1350.01        338.83     285.46
256x256 - 512             82500.22        2579.54        640.60     599.99
256x256 - 1000           129422.17        4255.04        954.42    1015.38
512x512 - 8                1662.14          68.81         12.38      12.77
512x512 - 64              38980.59        1281.24        314.37     234.70
512x512 - 216            167628.16        4833.70       1104.38     831.97
512x512 - 512            321159.87        9096.01       2017.97    1660.32
512x512 - 1000           516826.77       14637.50       3289.21    2899.22
1024x1024 - 8              5138.14         222.05         42.26      34.35
1024x1024 - 64           156560.96        4597.48        987.17     721.40
1024x1024 - 216          654972.35       17224.46       3801.04    2476.48
1024x1024 - 512         1298173.63       31700.10       7388.55    5045.91
1024x1024 - 1000        2099290.93       51857.37      12110.69    8422.52

Table 5.4: Rendering Times in Millions of Clock Cycles

Table 5.4 contains the data used to produce the graphs shown in Figures 5.3 through 5.9. All times are recorded in millions of clock cycles. The single-core ISPC implementation of the raytracer is at least 20 times faster than the pure C++ serial implementation in every trial, despite the fact that both programs run on a single CPU core. This is a testament to the CPU vectorization ISPC performs under the hood.
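The at-least-20x claim can be verified mechanically; the lists below copy the Serial and ISPC-single columns of Table 5.4 row by row:

```python
# Serial and single-core ISPC times from Table 5.4, in millions of cycles,
# in row order (256x256, 512x512, 1024x1024; 8 to 1000 spheres each).
serial = [307.44, 9917.03, 40457.18, 82500.22, 129422.17,
          1662.14, 38980.59, 167628.16, 321159.87, 516826.77,
          5138.14, 156560.96, 654972.35, 1298173.63, 2099290.93]
ispc_single = [13.9443, 397.76, 1350.01, 2579.54, 4255.04,
               68.81, 1281.24, 4833.70, 9096.01, 14637.5,
               222.05, 4597.48, 17224.46, 31700.10, 51857.37]

# Per-trial speedup of single-core ISPC over serial C++.
speedups = [s / i for s, i in zip(serial, ispc_single)]
print(f"min speedup {min(speedups):.1f}x, max speedup {max(speedups):.1f}x")
```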

6. Conclusion

As shown in the results, CUDA generally outperformed ISPC in a simple raytracer when compared on a system using an Intel Core i7-4790K and a Gigabyte GeForce GTX 970. These processors launched within the same year at almost identical price points. This data suggests that GPU-based computation will outperform CPU-based computation for parallel applications running on machines with comparable CPUs and GPUs.

It would be interesting to test different algorithms with these languages to see how the results change, or if they change at all. Based on the taxonomy table in [9], it would be interesting to test a Non-Dependent Kernel algorithm on ISPC and CUDA, to see how much further CUDA could pull ahead. Similarly, a Directly-Dependent algorithm might close the gap between CUDA and ISPC, or even favor ISPC, depending on how much memory-transfer overhead was involved.

At a fixed, relatively low resolution, ISPC and CUDA have similar performance. As scene resolution increases, approaching the resolutions of standard monitor and TV displays, the CUDA/GPU implementation shows a significant advantage. This gap in performance will likely widen as resolutions continue to increase, especially if GPUs maintain a faster rate of core-count growth than CPUs.

As someone new to both languages, I found CUDA easier to work with. It felt a bit simpler, and there are vastly more resources available for learning, debugging, and optimizing CUDA due to its popularity.

References

1. ISPC: Intel SPMD Program Compiler. https://ispc.github.io/

2. CUDA Parallel Computing Platform. http://www.nvidia.com/object/cuda_home_new.html

3. Rademacher, P., Ray Tracing: Graphics for the Masses. https://www.cs.unc.edu/~rademach/xroads-RT/RTarticle.html

4. Pharr, M., Mark, W.R., ispc: A SPMD Compiler for High-Performance CPU Programming. Proceedings of Innovative Parallel Computing (InPar), San Jose, CA, May 2012. https://cloud.github.com/downloads/ispc/ispc/ispc_inpar_2012.pdf

5. Good, S., Little Performance Explorations: ISPC. http://www.palladiumconsulting.com/2014/10/little-performance-explorations-ispc/

6. Scratchapixel, raytracer.cpp. https://www.scratchapixel.com/code.php?id=3&origin=/lessons/3d-basic-rendering/introduction-to-ray-tracing

7. Barney, B., Introduction to Parallel Computing, Lawrence Livermore National Laboratory. https://computing.llnl.gov/tutorials/parallel_comp/, viewed 1/29/2017

8. Rapaport, Zaks, Ben-Asher, Streamlining Whole Function Vectorization in C Using Higher Order Vector Semantics. https://www.computer.org/csdl/proceedings/ipdpsw/2015/7684/00/7684a718-abs.html

9. Gregg, Hazelwood, Where is the data? Why you cannot debate CPU vs. GPU performance without the answer. http://ieeexplore.ieee.org/abstract/document/5762730/

10. Katsigiannis, Dimitsas, Maroulis, A GPU vs CPU performance evaluation of an experimental video compression algorithm. http://ieeexplore.ieee.org/document/7148134/

11. LLVM. http://llvm.org/

12. Embree. https://embree.github.io/

13. CUDA Machine Learning. http://www.nvidia.com/object/machine-learning.html

14. POSIX thread (pthread) Library. http://www.yolinux.com/TUTORIALS/LinuxTutorialPosixThreads.html
