Department of Computer Science
National Sun Yat-sen University
Master Thesis

A Simulator for a Novel GPU to Support the Verifying and Profiling in Real World Applications

Student: Hsu-Kang Dow
Advisor: Dr. Steve W. Haga

October 2013


Dedicated to

My parents


摘要 (Chinese Abstract)

This thesis presents a verification and profiling simulator for a modern GPU, built on the open-source Attila simulator and supporting OpenGL ES 2.0 and GLSL ES compilation and execution. The completed simulator can verify the shading-language compiler and collect the statistics needed for profile-assisted compiler optimization. For research use, it reports performance statistics and the values processed at each pipeline stage, providing reference values for hardware verification and debugging. To let the Attila simulator accept the GLSL shading language, this thesis provides a converter with two parts: an API-level converter that handles the program data linking used during GLSL compilation, and a translator between the Attila ISA and the NSYSU ISA that maps register I/O. Compared with the SystemC simulator, the new Attila-based simulator is 300 to 2000 times faster and avoids the precomputation otherwise required for system-level simulation, a capability that is indispensable for simulating complex applications.

Keywords: GPU, Attila simulator, OpenGL ES, GLSL ES


Abstract

This thesis presents a simulator based on Attila, a modern GPU architecture and open-source project capable of running games and benchmarks. The simulator has been modified to support OpenGL ES 2.0 and GLSL ES compilation and execution, and is therefore an important extension of the Attila simulator. In addition, because our compiler targets the NSYSU GPU architecture, the simulator allows the code produced by our GLSL ES compiler to be verified. The simulator also enables future research by recording statistics from running real-world applications; these data can then be used for profile-assisted compiler optimizations.

Along with the simulator, we provide an NSYSU GPU to Attila converter. The converter consists of two parts: an API converter and an assembly converter. It solves the data-linking problems for attribute, uniform, and varying data that occur when adapting the Attila simulator to run assembly produced by the NSYSU GLSL compiler. Compared to the NSYSU GPU's current SystemC simulator, the new Attila-based simulator is 300 to 2000 times faster. It also avoids the need to precompute the simulator's inputs for system-wide simulation. These benefits are essential for simulating non-trivial applications.

Keywords: GPU, Attila, Simulator, OpenGL ES, GLSL ES.

Contents

1. Introduction
1.1 Profiling
1.2 Verifying
1.3 Real World Applications
1.4 Attila Simulator
1.5 Converter
2 Related Works
2.1 NSYSU GPU
2.2 NSYSU SystemC Simulator
2.3 Attila GPU
2.4 Attila Tracing
3 Methodology
3.1 An Overview of the Converter
3.2 Data Flow (Attribute, Uniform, Varying)
3.3 Attila OpenGL API Driver Modification
3.4 Converter for NSYSU to ATTILA Assembly
3.5 Load / Store Instructions & Memory Design
3.6 Miscellaneous
4 Performance Comparison and Result
4.1 GLBenchmark
5 References

A Simulator for a Novel GPU to Support the Verifying and Profiling in Real World Applications
Author: Hsu-Kang Dow
Advisor: Dr. Steve W. Haga
National Sun Yat-sen University

1. Introduction

This thesis presents a simulator for a novel GPU to support the verifying and profiling of real-world applications. The novel GPU [1] was developed by the Department of Computer Science at National Sun Yat-sen University (NSYSU). The design target of this GPU is the embedded market, so reducing power consumption is a vital requirement, and optimization therefore plays an important role in its development. Applications and games on handheld devices are increasingly popular, and many of them include 3D graphics. To meet the needs of this market, an embedded GPU with power awareness was introduced.

Programmable shaders are also important to modern game design, so the NSYSU GPU supports OpenGL ES 2.0 [2] and GLSL ES [3]. Along with the GPU, a GPU simulator is required for verifying the implementation of both the hardware and the software. The current simulator is written in SystemC; it directly models the hardware behavior and is cycle accurate, which makes it slow. Another disadvantage of the current SystemC simulator is that it lacks full-system simulation: it simulates only the hardware behavior, not the communication between the OpenGL API calls (on the CPU) and the shader programs (on the GPU). A simple benchmark such as red cube has simple communication that a programmer can set up by hand, but this does not scale to larger programs. The current SystemC simulator is therefore too limited for real-world applications.

1.1 Profiling

Because of the nature of embedded systems, it is hard to increase performance by simply adding more hardware, as NVidia and AMD do for desktop GPUs. This is where optimization comes in. Before we optimize the code, we need information about where the bottleneck is and which resources can be reused. Profiling provides this information by tracing tagged control flow, collecting memory-related information, and identifying calculations that can be reused, such as replacing a distant object with a texture, for better performance and power savings.

The environment of an embedded system is different from a PC, so our optimizations focus on different aspects; for example, most handheld devices have a lower resolution than a desktop PC. What we built is a tool that helps programmers automate optimizations that are usually done by hand. For instance, a distant tree might cover only three pixels, yet rasterizing the full scene to obtain those three pixels can take hours; a programmer can instead paste a texture in place of the distant object. Profiling the scene while rendering objects makes this kind of decision possible. To obtain the data we need, we first need a simulator that can run applications and gather information at runtime, so that this information can drive profile-assisted compiler optimization. This was a requested feature, and the new simulator enables it.

1.2 Verifying

The NSYSU GPU comes with a simulator coded in SystemC, whose functions map directly onto the hardware. It is designed for hardware verification and models the logic at a near-transistor level of detail, so simulating the entire process takes a long time. Because it is cycle accurate, this SystemC simulator is slow when used to verify the code generated by the compiler. Another problem is that the current simulator lacks a connection to the API: the simple benchmarks do not contain complex API call sequences, which leads to insufficient API-to-GPU communication for further analysis and verification. We therefore need a fast simulator that can quickly verify our optimized results and that is capable of processing complex scenes, so that our compiler-generated code can be fully tested.

1.3 Real World Applications

Another goal of this work is to expand our infrastructure to handle more complicated and more-state-of-the-art shader programs, such as might be found in modern real-world applications. Prior to this thesis, the NSYSU GPU project was unable to support such applications. As for the pre-existing SystemC simulator, it is too buggy and too slow to simulate complex shader codes. And as for the OpenGL ES API function calls, real-world applications are so intricate that it becomes infeasible to compute the API-to-GPU communications by hand. Consequently, the current test programs for the SystemC simulator are toy benchmarks.

Such simple test programs do not need smart compilers, and demonstrating compiler benefits on them is unpersuasive. Modern games, by contrast, provide complex scenes and room for the compiler to optimize. With the ability to run games and benchmarks, we can put our compiler to a real challenge. There is, however, a limitation of my implementation: we need the GLSL source code of the original program in order to compile it. Unfortunately, for many real-world applications the shader code has been precompiled in the factory, thus losing information we need to do the simulation.

Alternatively, we acquired a complex OpenGL benchmark, GLBenchmark [4], from ITRI (Industrial Technology Research Institute). GLBenchmark is a state-of-the-art benchmark program for handheld devices. It includes several complex scenes and shader stress tests. To simulate GLBenchmark, we need to add new API support to the Attila driver and fix the GLSL compiler for full-system simulation, which is work in progress.

1.4 Attila Simulator

There are already open-source GPU simulators, but we cannot simply use one of them, because they simulate commercial GPUs rather than our GPU, which uses a novel architecture.

To meet all the requirements listed above, we introduce a new simulator based on the Attila simulator [5]. Attila is a cycle-level, execution-driven simulator for modern GPU architectures. It is an actively developed open-source project written in C++; the latest version was released in 2011, with a bug-fix update in 2013. It simulates a modern GPU, provides both a VS/FS and a unified-shader version of the architecture, and has a powerful trace tool for replaying real games.

As for verification, Attila offers both a simulator (cycle accurate) and an emulator (not cycle accurate). The Attila emulator provides fast emulation, which is suitable for verifying the shader code generated by our shader compiler. A custom SystemC simulation usually runs at about 10k instructions per second, whereas a simulator written in C++ reaches roughly 1 million instructions per second. In my final results, programs run at least 100 times faster, and some up to 1000 times faster, on the Attila emulator.

By using the Attila trace tool, we can acquire various real-world applications with complex scenes and interesting shader programs. The tool can also be used to detect and record control flow at run time for further compiler optimization. There is, however, a limitation when replaying a trace on the simulator. The trace tool records the shader programs used by the 3D game or application, but most applications ship with shaders precompiled in the factory; in that case only low-level assembly can be recorded during tracing, and the control-flow and data-linking information we need is lost. We therefore need the GLSL source code to run a full simulation of a real-world application.

Finally, Attila also provides statistics tools and a signal-traffic dump for analyzing the bottlenecks of the traced program. In addition to the statistics that Attila already collects, we can insert our own statistics because Attila is open source.

1.5 Converter

Both the NSYSU GPU and Attila support a programmable pipeline (a shading language), but they use different instruction set architectures and memory allocation. We therefore need a converter that translates the code produced by our compiler to fit the Attila architecture.

The Attila official website claims to support OpenGL 2.0, but it actually only supports OpenGL 2.0 games that do not need any of the key OpenGL 2.0 features. To understand what this means, recall that OpenGL 2.0 is backwards compatible: legacy programs written for OpenGL 1.5 or earlier still work in OpenGL 2.0. Attila therefore only uses the ARB shading language [9][10], which has no control flow. The earlier standard only allowed a predefined set of input parameters for each vertex/fragment, as opposed to the OpenGL 2.0 method of user-defined attributes for vertices and user-defined varyings for fragments. Because it does not truly meet the OpenGL 2.0 standard, an Attila trace only contains information for the parameters that are defined in OpenGL 1.5 and loses information for OpenGL 2.0 API calls such as glUniform().

Of course, our purpose in developing a simulator is to verify our novel GPU's compiler for GLSL programs, so Attila needed to be made truly OpenGL 2.0 compliant. To do so, we had to fix several problems.

First, Attila does not have a GLSL compiler. This is probably the main reason its developers have not put in the effort to truly support OpenGL 2.0. But we have an OpenGL 2.0 compiler, so this problem is solved.

Second, Attila does not support control flow in ARB programs. The Attila ISA [12] does have a conditional jump, but it cannot be used for our purpose as-is, because the ARB format cannot express control flow. The solution is to modify the simulator so that the existing Attila control-flow support works with our GPU's instructions.

Third, Attila does not support load and store instructions. Memory-access instructions are not necessary for compiling most shader programs, but our compiler is not very efficient, so it often needs memory accesses. I therefore augmented the Attila simulator with new load and store instructions (LDV/STV) and added a virtual local memory to each shader core.

The fourth problem is the lack of OpenGL function calls for GLSL-related data linking. Without GLSL support, the Attila team did not implement the OpenGL 2.0 functions related to linking uniform data. This can be solved by a combination of OpenGL 1.5 function calls and an analysis of the shader compiler table generated by our LLVM-based compiler [6].

The last problem is that Attila only supports fixed pipeline names (I/O parameter names) in ARB programming; this is inherent to the ARB programming standard. It is solved by analyzing the shader compiler table and generating a purpose-built predefined ARB program to achieve the I/O mapping.

2 Related Works

To further explain the tools and stages involved in making the converter, we first examine the architectures of the NSYSU GPU and Attila. The Attila GPU can use either a VS/FS or a unified-shader configuration, but the unified shader has the same structure as VS/FS plus a scheduler and a distributor to control the shaders' behavior. I therefore introduce both the NSYSU GPU and the Attila GPU in the VS/FS configuration, which makes them easier to compare.

The Attila package contains both a simulator and an emulator. The simulator is cycle accurate and execution driven, designed to mimic the hardware behavior. The emulator runs only the functions of the instructions, not the command-decoding process or the pipeline behavior. I use the emulator as the base of this thesis because the purpose of our simulator is to verify the code generated by our compiler and to trace tagged control flow for optimization; using the faster emulator while giving up cycle accuracy is therefore a reasonable choice.

2.1 NSYSU GPU

Figure 2.1. Overview of the NSYSU GPU architecture. This figure is from the NSYSU GPU project's second-phase proposal. The four blocks on the top are software source code and data from the programmer. The blue bucket is the FPGA board containing an ARM processor, RAM, and the NSYSU 3D graphics engine.

Figure 2.1 illustrates a simple flow graph describing how the NSYSU GPU works, from game source code to the DRAM on the board, and eventually to generating a frame of the scene in the frame buffer. The arrows and numbers indicate the instruction flow.

Inside the 3D graphics engine there are two shader cores, a vertex shader and a fragment shader, both with programmable pipelines. Between the VS and FS is a rasterizer performing various functions such as culling, viewport trimming, clipping, and fragment generation. There is also an SRAM residing in each core to store the shader executable.

Outside the 3D graphics engine are the frame buffer and DRAM. The DRAM stores the game executable for the CPU and the shader compiler executable that compiles the game's shader code. The result of each frame is placed in the frame buffer and then passed through a DAC to convert the signal for display.

The flow of a generic game running on the device is as follows. First, the game code and the shader compiler written by the game/application programmer are compiled in the factory into executables (binaries) for the target machine, in this case an ARM processor. The shader program is not compiled at this point. After the game executable is loaded into DRAM and the program starts running, the shader compiler compiles the shader programs for the target machine, in this case the NSYSU vertex shader and fragment shader. The compiled shader code then acts like a program in the VS/FS pipeline, manipulating vertices and fragments to create the frame.

The NSYSU SystemC simulator simulates the 3D graphics engine in Figure 2.1 and collaborates with an LLVM-based compiler to compile the shader programs within the source code. Because this simulator is intended to describe the hardware behavior and verify the hardware implementation, it is very slow, which makes it a poor tool for verifying the complex shader programs needed by the compiler team.

The API and device driver are implemented using the MESA infrastructure [7]. This is a collaboration between NSYSU and ITRI [8].

2.2 NSYSU SystemC Simulator

The NSYSU simulator is built from the hardware design of the vertex shader and fragment shader. Both directly model the hardware behavior: an instruction such as add is simulated through instruction fetch, instruction decode, and execution. The instruction is fetched from the simulated SRAM, the decoded command drives the pipeline state, and the ALU then calculates the value. This is cycle accurate but also slow: a red cube with 24 vertices takes 60 seconds to render. That is too slow for compiler verification, let alone for profiling an entire game.

Another problem is that the SystemC simulator only simulates the GPU behavior, not the full system. When a game runs on a computer, the program invokes OpenGL API calls that go through the GPU vendor's driver to set up the OpenGL state table (which holds information such as the modelview matrix, the projection matrix, parameters, and the vertex data). Later, another API call causes this table to be sent to the GPU via the motherboard bus. To perform the entire process we would also need a driver for the NSYSU GPU, but the driver has no connection to the SystemC simulator. Some data and tables are hand coded and sent to the simulator: since the mechanism for passing this data to the SystemC simulator does not exist, the simulator user must manually identify these values from the source code and then initialize the simulator with them. For a simple application such as red cube or morphing ball this is possible, but consider a game running at 30 frames per second and changing the state table every frame: this is beyond what can be hand coded. We need a simulator with an automated process for real-world applications.

Finally, the current SystemC simulator is buggy and performs poorly. If we can provide a reference result, together with a tool that dumps each instruction's values, it can help us debug and fix the current simulator.

2.3 Attila GPU

Attila was developed in 2006. Its goal is to research and develop high-performance GPU architectures for modern GPUs. The Attila GPU adopts recent-generation algorithms and the hardware architecture of rasterization-based GPUs (ATi R580, NVidia G80 and G90). Attila provides not only the hardware model but also software support for popular graphics APIs: originally it came with an OpenGL driver and an Attila driver as a low-level interface layer to the Attila hardware, and support has since been extended with a DirectX 9 driver, which allows a wide variety of games to be traced.

Figure 2.2 shows the specialized-shader version of the Attila architecture, with a Vertex Shader (VS) and a Fragment Shader (FS). It is similar to the NSYSU GPU: both pass data from the VS through the rasterization pipeline and then into the FS. The VS and FS are attached to the memory controller so that shader programs can be loaded onto the shaders. Every block in Figure 2.2 is implemented as a C++ class and comes with a configuration file to control the number of shaders, the memory timing, the memory size, the clocks, and so on. This grants Attila the flexibility to simulate different kinds of GPU; with proper configuration settings, we can use Attila to mimic the NSYSU GPU in the specialized-shader version now, or in the unified-shader version in the future.

Figure 2.2. Overview of the Attila GPU pipeline. Figure modified from the Attila Project, V. M. Del Barrio. The figure shows the Attila architecture with specialized shaders (vertex shader and fragment shader). Data flows from top to bottom, and colors indicate the different pipeline stages. ROPs (Render Output Units) perform the final blending of the pixel image and handle transactions between local memory and the buffers.

2.4 Attila Tracing

While developing a GPU and its drivers, it is hard to find complex benchmarks that fit our needs. The Attila designers encountered this problem as well, and created a tracing tool to fill the gap. Figure 2.3 shows the workflow of the trace tool. It has four stages: collect the tracefile, verify, simulate, and analyze.

Figure 2.3. Attila tracing stages and communication workflow. Picture taken from the Attila Project. Red blocks (GLInterceptor, GLPlayer, ATTILA OpenGL Driver, Attila Simulator) are tools provided by Attila. Green cylinders: the Trace is generated by the trace tool; Statistics and Signal Traffic are files generated by the simulation tool. Blue blocks are external components. Cited from the Attila official website: attila.au.upc.edu

In the collect stage, GLInterceptor acts as an opengl32.dll wrapper that records all incoming OpenGL API calls into a tracefile. While the application is running, the calls are written to the file along with the shader program strings. The trace is a replay file of the graphics actions. Here is an example of a tracefile (Figure 2.4):

[GLSL program binding]
glCreateShader(GL_VERTEX_SHADER)=1
glCreateShader(GL_FRAGMENT_SHADER)=2
glShaderSource(1,1,U0x7076652,U0x7076656)
glShaderSource(2,1,U0x7076652,U0x7076656)
glCompileShader(1)
glCompileShader(2)
glCreateProgram()=3
glAttachShader(3,1)
glAttachShader(3,2)
glLinkProgram(3)
glUseProgram(3)

[Uniform binding]
glGetUniformLocation(3,ambient_material)=2
glUniform4fv(2,1,{1,1,1,1})
glGetUniformLocation(3,diffuse_light)=3
glUniform4fv(3,1,{0.8,0.8,0.8,1})

[Attribute binding]
glBindAttribLocation(3,0,rm_Vertex)
glBindAttribLocation(3,1,rm_Normal)
glBindAttribLocation(3,2,cube_texs)
glVertexAttribPointer(0,3,GL_FLOAT,0,0,*3)
glEnableVertexAttribArray(0)
glVertexAttribPointer(1,3,GL_FLOAT,0,0,*4)
glEnableVertexAttribArray(1)
glDrawArrays(GL_TRIANGLES,0,2880)

Figure 2.4. An example of OpenGL API calls from the red_cube benchmark tracefile. The bracketed groups (GLSL program binding, uniform binding, attribute binding) will be discussed further in Section 3.2. For the moment, the key observation of this figure is simply that the Attila OpenGL driver supported none of these API calls.

As the figure shows, the file contains all of the OpenGL function calls invoked by the application. GLInterceptor is an OpenGL library wrapper that keeps recording the OpenGL function calls and the parameters they use; after the parameters are properly saved, the wrapper calls the real OpenGL library that the GPU vendor provides in the system folder.
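To make the wrapper idea concrete, the following is a minimal sketch of how such an interceptor could record one call and forward it to the vendor library. It is an assumption-laden illustration, not the actual GLInterceptor source; the Windows library path and the single wrapped function are examples only.

// Hypothetical wrapper for one OpenGL entry point: log the call and its parameters
// to the tracefile, then forward to the real opengl32.dll provided by the GPU vendor.
#include <cstdio>
#include <windows.h>

typedef void (APIENTRY *PFN_GLDRAWARRAYS)(unsigned int mode, int first, int count);

static std::FILE* traceFile = std::fopen("tracefile.txt", "a");
static HMODULE    realGL    = LoadLibraryA("C:\\Windows\\System32\\opengl32.dll");
static PFN_GLDRAWARRAYS realDrawArrays =
    (PFN_GLDRAWARRAYS)GetProcAddress(realGL, "glDrawArrays");

extern "C" __declspec(dllexport) void APIENTRY
glDrawArrays(unsigned int mode, int first, int count)
{
    std::fprintf(traceFile, "glDrawArrays(%u,%d,%d)\n", mode, first, count); // record the call
    realDrawArrays(mode, first, count);                                      // forward to the real driver
}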

Once we have the tracefile, we enter the verify stage and replay it with GLPlayer to check that the trace works. GLPlayer still uses the computer's graphics chip to process the tracefile. This stage provides fast verification that the trace was recorded properly and also produces a reference for the later stages.

After the tracefile is verified, we can run it on the Attila simulator. It first passes through the driver layers, which contain the OpenGL driver and the shader compiler. The generated output becomes AGP transactions for the Attila simulator, carrying the operation and instruction control. The Attila simulator then produces the frame of the target scene and dumps the related statistics files for further performance examination.

The Attila OpenGL library transforms fixed-function state into shader code, and about 200 OpenGL API calls are supported. The shader code format is ARB vertex and fragment programs, an OpenGL 1.4 standard. The Attila OpenGL driver does not support the OpenGL ES 2.0 Shading Language (GLSL ES), and OpenGL 1.4 ARB programs [11] have no flow control, which is essential to our compiler development. So we need a way to get past this barrier.

Although Attila does not have a GLSL compiler, we do. With proper modification and conversion, we can adapt the NSYSU assembly to Attila assembly for simulation.

3 Methodology

The goal of this thesis is to use the Attila simulator's ability to trace real-world applications and replay them on the simulator, combined with our shader compiler that supports GLSL, to achieve full-system simulation of benchmarks and games containing GLSL shader programs. We begin with an overview of the entire flow for making Attila support GLSL and OpenGL ES 2.0 data binding.

Figure 3.1. A simplified version of Figure 2.3. The diagram illustrates the flow when running and tracing an OpenGL 1.5 application. The trace cylinder is enlarged to show that both the API function calls and the ARB programs are recorded inside the trace. There are also Buffer Descriptor and Memory Region files that store the pointers and record the vertex data and texture data used by the OpenGL 1.5 application.

Figure 3.1 is essentially Figure 2.3 with the Verify and Analysis stages removed, because they are not related to the simulator conversion. When we trace an OpenGL 1.5 application with an ARB shader program, the Attila GLInterceptor creates three files: the OpenGL API Calls (tracefile.txt), the Buffer Descriptor (bufferdescriptor.dat), and the Memory Region (memoryregion.dat), which together hold all the information needed to replay the scene. The OpenGL API Calls file contains two kinds of records: the OpenGL function calls invoked by the application, and the ARB shader program strings used by the programmable shaders. The Buffer Descriptor keeps the pointers used in the OpenGL API calls as references for accessing the data stored in the Memory Region, which contains the vertex data (attributes) and the textures used by the application.

In the simulation stage, the OpenGL API calls are sent to the Attila driver layer, which sets up the GLState table and the shader registers using the information in the API calls and the Memory Region. The ARB program strings are compiled by the ARB compiler and sent to the simulator's code space, where they wait to be executed.

This is how an OpenGL application works when it uses ARB shader programs. When using the tracing tool on an OpenGL 2.0 application containing GLSL, however, things change: the trace tool provided by Attila can neither record the shader program strings correctly nor compile them. We therefore have to create the proper tools for Attila to support them.

3.1 An Overview of the Converter

Figure 3.2. Overview of the flow that enables Attila to support OpenGL 2.0 and GLSL. The figure shows the trace of an OpenGL 2.0 program containing GLSL; the Attila trace tool has only limited ability to record GLSL source code. The diagram illustrates the additional tools (orange) provided to make Attila support OpenGL 2.0 and GLSL.

Figure 3.2 shows an OpenGL 2.0 application traced by Attila. GLInterceptor records the OpenGL API calls just as it does for an OpenGL 1.5 application, but the ARB part no longer exists, because the application does not use ARB shader programs. In this case we need the GLSL source code, which is fed to our GLSL compiler. Our GLSL compiler creates the assembly for the NSYSU GPU. We want to execute this assembly on the Attila simulator, but the two use different instruction formats, so we provide a GLSL-assembly-to-AttilaASM converter to perform the translation.

Shader programs usually use uniforms as parameters that control behavior inside the shader. To support this parameter passing, we have to make a connection between the OpenGL API calls and the shader program. We therefore need an OpenGL API converter that modifies the original OpenGL API calls so that the desired parameters are passed into the shader program. By combining the information from the OpenGL API and the GLSL compiler, we create a shader compiler table that keeps track of every uniform's name, index, and register, so the converter can decide the correct location of the data. To pass the data into the simulator, instead of writing a new driver, I use the original Attila driver and generate a dummy ARB shader program that loads the constant registers. By doing so, we keep simulating the system communication between the CPU and the GPU. Once we have the complete shader compiler table, we can convert the NSYSU assembly into Attila assembly.

3.2 Data Flow (Attribute, Uniform, Varying)

[Input (attribute) mapping]
mov r32, i0
mov r33, i1
mov r34, i2
mov r35, i3
mov r36, i4
mov r37, i5
mov r38, i6
mov r39, i7

[Code for storing constant registers (uniforms)]
stv r0, c5, 5
stv r0, c4, 6
stv r0, c2, 7
stv r0, c3, 8
stv r0, c0, 9
stv r0, c1, 10
stv r0, c7, 11
stv r0, c9, 12
...

[GLSL assembly]
add r30.x, r0, 0
add r2.x, r0, 31
add r15.x, r0, 1.000
stv r0, r15, 22
add r15.x, r0, -2.5258
...

[Output (varying) mapping]
mov r32, r16
mov o0, r32
mov o7, r33
mov o8, r34
mov o9, r35
mov o10, r36

Figure 3.3. An example of a vertex shader in AttilaASM, with its four sections. Input (attribute) mapping: set up the inputs of the shader program. Code for storing constant registers: store the uniforms held in constant registers into the virtual memory. GLSL assembly: the code generated by the GLSL compiler. Output (varying) mapping: write the position and varyings to the Attila output registers.

An AttilaASM file has four sections: attribute loading code, code for storing constant registers, the GLSL assembly, and the output mapping, as shown in Figure 3.3. There are slight differences between the vertex shader and the fragment shader, but the four sections are the same.

Vertex shader      NSYSU                  ATTILA
Input              r32 – attribute[0]     i0 – attribute[0]
                   ...                    ...
                   r39 – attribute[7]     i7 – attribute[7]
Output             r32 – gl_position      o0 – result.position
                   r33 – varying[0]       o7 – result.texcoord[1]
                   ...                    ...
                   r36 – varying[3]       o10 – result.texcoord[4]

Fragment shader    NSYSU                  ATTILA
Input              r32 – color            i0 – fragment.color
                   r33 – varying[0]       i7 – fragment.texcoord[1]
Output             r0 – color             o0 – result.color
                   r1 – xy
                   r2 – z

Table 3.1. NSYSU/Attila input and output register assignments for the vertex and fragment shaders. For vertex shader input, NSYSU uses register r32 and Attila uses register i0 for attribute[0]. NSYSU uses register r32 for the default gl_position output, while Attila uses register o0. Registers r33 to r36 are mapped to o7 to o10 for the four varyings. The same scheme applies to the fragment shader.

The attribute loading code is where we set up the inputs of the shader program; there is also code for output at the end of the shader program. Table 3.1 shows how we connect Attila registers to NSYSU registers. For example, the default registers for storing attributes 0-7 are r32-r39 in NSYSU and i0-i7 in Attila.

In Figure 3.3, the first eight lines load the corresponding Attila input registers (i0-i7) into NSYSU registers. The Attila input registers are filled with data when the programmer invokes the OpenGL API call glVertexAttribPointer(), as in Figure 2.4. The Attila OpenGL driver layer (AOGL) processes glVertexAttribPointer() and stores the pointer to the vertex data in the GLState table. The GLState table is then sent to the GPU simulator through AGPTransactions, which are sequences of CPU/GPU bus transmissions that push the data into GPU registers. When the GPU simulates the scene, the vertex data streams into the input registers on the shader cores.

The second section of AttilaASM is the code that stores the constant registers into virtual memory. In OpenGL 2.0, the programmer invokes glGetUniformLocation() and glUniform() to load data into a target uniform. For example, in Figure 2.4 the programmer calls:

glGetUniformLocation(3,ambient_material)=2
glUniform4fv(2,1,{1,1,1,1})

The first call, "glGetUniformLocation(3, ambient_material)=2", asks the driver to search for the uniform "ambient_material" in shader program number 3; the function returns 2, which is then used as the index for the ARB shader program. The next line, "glUniform4fv(2,1,{1,1,1,1})", assigns a vector of size 1 with value {1, 1, 1, 1} to uniform location 2. The problem is that the Attila OpenGL driver does not actually support these GLSL data-linking calls.

Instead, we use the older OpenGL API calls intended for ARB programs, together with the shader compiler table generated by our GLSL compiler, to achieve the same functionality. Earlier versions of OpenGL provide the function glProgramLocalParameter4fARB(target, index, x, y, z, w) to set the value of a parameter inside an ARB shader program. The uniform call above therefore becomes glProgramLocalParameter4fARB(GL_VERTEX_PROGRAM_ARB, 2, 1, 1, 1, 1). But this alone is not enough to link the data into the GLSL program, because the uniform name and memory address are still missing.

While the GLSL compiler compiles the shader program, it also generates a shader compiler table alongside the program. Figure 3.4 is an example of the shader compiler table from the morphing_ball benchmark. The original table does not contain the register information shown in the rightmost field of Figure 3.4; the OpenGL API converter, a tool I provide in Figure 3.2, fills in the register field later. The OpenGL API converter also creates a dummy ARB shader program for I/O purposes: by default, the order in which parameters are declared in an ARB shader program is the order in which they are placed in Attila's constant registers. The OpenGL API converter therefore searches the tracefile for each uniform, writes it into the dummy ARB shader program, and updates that uniform's register in the shader compiler table.

Shader Compiler Table
Address   uniform name        size            register
=======================================================
0         shaderr             1      (NULL)   c6
1         timeflag            1      (NULL)   c11
2         NormalMatrix        3      (NULL)   c20
5         light_Pos           1      (NULL)   c5
6         eye_Pos             1      (NULL)   c4
7         diffuse_light       1      (NULL)   c2
8         diffuse_material    1      (NULL)   c3
9         ambient_light       1      (NULL)   c0
10        ambient_material    1      (NULL)   c1
11        specularExp         1      (NULL)   c7
12        specular_material   1      (NULL)   c9
13        specular_light      1      (NULL)   c8
14        ModelViewMatrix     4      (NULL)   c12
18        ProjectionMatrix    4      (NULL)   c16

Figure 3.4. An example of the shader compiler table from the morphing_ball benchmark. The table contains the memory address used by the GLSL compiler, the uniform name, the size of the uniform (in units of vector4), and the corresponding Attila constant register. For example, the uniform timeflag is one vector4, stored at NSYSU memory address 1 and bound to Attila constant register c11.

To make this clearer, recall the example from Figure 2.4 in which the programmer calls "glGetUniformLocation(3,ambient_material)=2". The OpenGL API converter creates a line in the dummy ARB shader program:

PARAM ambient_material = program.local[2]

When the ARB compiler in the Attila driver compiles this code, the data in program.local[2] (which is {1,1,1,1}, set by glProgramLocalParameter4fARB(GL_VERTEX_PROGRAM_ARB, 2, 1, 1, 1, 1)) is sent to Attila constant register c1, because ambient_material is the second parameter declared in the dummy ARB shader program (constant registers start from c0).

Finally, the second section of the code in Figure 3.3 contains the instruction:

stv r0, c1, 10

This instruction moves the data in Attila constant register c1 into memory address 10, the address used by the GLSL compiler for this uniform. If a program wants to use the uniform ambient_material, the GLSL compiler generates code that loads memory address 10. Thus, if we set up the virtual memory I created for Attila with exactly the same addresses as the NSYSU GPU, we do not have to remap the registers used by Attila and NSYSU.
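As a concrete illustration of this chain, the sketch below walks the ambient_material example through the conversion: it emits the PARAM line for the dummy ARB program and the rewritten glProgramLocalParameter4fARB call, using the constant register recorded in the shader compiler table. The table contents and the print-based output are assumptions for illustration; the real OpenGL API converter works on the tracefile and table files directly.

// Sketch of the per-uniform rewrite performed by the OpenGL API converter
// (illustrative data; not the converter's actual source).
#include <cstdio>
#include <map>
#include <string>

struct UniformEntry { int arbIndex; std::string constReg; };  // ARB local index and Attila register

int main()
{
    // Built while scanning the tracefile and the shader compiler table:
    std::map<std::string, UniformEntry> table = { {"ambient_material", {2, "c1"}} };

    // Trace: glGetUniformLocation(3,ambient_material)=2 ; glUniform4fv(2,1,{1,1,1,1})
    const std::string name = "ambient_material";
    const float v[4] = {1, 1, 1, 1};
    const UniformEntry& u = table[name];

    // 1) Declaration added to the dummy ARB vertex program; declaration order decides
    //    which constant register (here c1) the ARB compiler will bind it to.
    std::printf("PARAM %s = program.local[%d]\n", name.c_str(), u.arbIndex);

    // 2) Replacement API call written into the converted tracefile.
    std::printf("glProgramLocalParameter4fARB(GL_VERTEX_PROGRAM_ARB,%d,%g,%g,%g,%g)\n",
                u.arbIndex, v[0], v[1], v[2], v[3]);
    return 0;
}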

The next section is the GLSL assembly. This is the main body of the GLSL shader program produced by the GLSL compiler, but it is not exactly the code that our compiler emits: several problems have to be fixed or modified in order to render the correct scene. Some NSYSU instructions must be implemented as combinations of multiple Attila instructions, and the Attila simulator itself also needs to be augmented to support memory loads and stores.

The last section is the assembly for the shader outputs. The output of the fragment shader is simple: according to Table 3.1 we only need to move r0 (NSYSU's default register for the color output of the fragment shader) into o0 (Attila's default register for result.color). For the vertex shader, I use the otherwise redundant texture-coordinate registers to pass the varyings. For example, texcoord[1] of the ARB programming language uses input register i7 of the fragment program, and it now stands for the first varying of the GLSL fragment program. This concludes the data-linking path between the Attila trace, the simulator, and the NSYSU assembly.

Besides the data linking, several categories of problems need to be addressed in order to achieve full-system simulation.

The first category is OpenGL-driver-related problems, involving the OpenGL built-in gl_ModelViewMatrix and gl_ProjectionMatrix and an Attila texture bug.

The second category is shader-instruction-related problems, involving NSYSU GPU instruction conversion, load/store instruction support, memory setup, and control flow.

The third category is miscellaneous problems such as mask/swizzle/select translation, float and integer format translation, viewport code trimming, and some Attila configuration changes.

The following sections proceed top-down, from the modification of the OpenGL API to the low-level assembly conversion. A short list of the problems is given in Figure 3.5.

Here is a short list of the problems.

1. Attila OpenGL API Driver modification

i. ModelViewMatrix ProjectionMatrix transpose

ii. OpenGL texture bug in Attila

2. Converter for NSYSU to Attila

i. Create shader compiler table

ii. Transcode

iii. Create adjustPC table

iv. Loading shader compiler table

v. Attribute / Varying register setup

vi. Memory setup

vii. Instruction conversion

viii. Output registers setup

ix. Viewport code trim

3. LDV/STV (load & store vector instructions) & Memory design

4. Misc.

i. Mask/Swizzle/Select

ii. Float & integer format

iii. Expand code size, disable optimizations

Figure 3.5. A list of problems encountered while implementing the converter. The first part lists the problems encountered in the API conversion and a bug in the Attila simulator. The second part lists the problems encountered in the assembly conversion and the modules that solve them, in order. The third is the implementation of the load and store instructions. The fourth covers miscellaneous problems.

3.3 Attila OpenGL API Driver Modification

The first conversion problem we encountered is that Attila's built-in matrices, such as the ModelviewMatrix and ProjectionMatrix, are stored transposed compared to the NSYSU GPU. A programmer who uses the built-in matrix setup calls will therefore get an incorrect matrix product when running the assembly code generated by our GLSL compiler.

The cause of the error is that, by default, Attila mathematically places the matrix in front of the vector when computing the product, whereas the code generated by our GLSL compiler does the reverse: matrices are popped from the stack and placed behind the vector. Figure 3.6 illustrates a simple GLSL statement that calculates gl_position as the product of the vertex coordinate and the modelview matrix.

\[
\begin{bmatrix} x & y & z & w \end{bmatrix}
\cdot
\begin{bmatrix}
1 & 2 & 3 & 4 \\
5 & 6 & 7 & 8 \\
9 & 10 & 11 & 12 \\
13 & 14 & 15 & 16
\end{bmatrix}
=
\begin{bmatrix}
1 & 2 & 3 & 4 \\
5 & 6 & 7 & 8 \\
9 & 10 & 11 & 12 \\
13 & 14 & 15 & 16
\end{bmatrix}^{T}
\cdot
\begin{bmatrix} x \\ y \\ z \\ w \end{bmatrix}
\]

gl_position = ModelViewMatrix * rm_Vertex

Figure 3.6. Comparison of matrix placement between Attila and NSYSU. The left-hand side of the equation shows how NSYSU computes the product; the right-hand side gives the same result but places the vector behind the matrix, as Attila does. To obtain the same correct result, we have to transpose the matrix stored in Attila.

The solution is to change the way Attila stores the OpenGL built-in matrices. To do so, we tap into the Attila OpenGL driver layer and transpose the built-in matrices in the OpenGL state table right when they are sent to the simulator. If we modified the matrix in the GLState (OpenGL state table) itself, a program that reuses the previous state would become wrong; and if we modified the matrix-resolving function, a program that uses generic matrices (set through the OpenGL matrix setup calls rather than through uniforms) would become wrong.

My solution is to transpose both the Modelview and Projection matrices in the GLState right before the final constant binding. I transpose only the first ModelViewMatrix, so programs that use extra ModelView matrices are not affected; this can easily be extended by modifying ACDX.cpp line 841.
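The change itself amounts to transposing a 4x4 matrix in place before the constant binding. The snippet below is only a sketch of that operation, assuming a row-major float[16] layout; the actual modification lives inside Attila's ACDX driver code.

// In-place transpose of a 4x4 matrix stored as 16 consecutive floats,
// applied to the ModelView/Projection matrices before constant binding.
#include <utility>

void transpose4x4(float m[16])
{
    for (int r = 0; r < 4; ++r)
        for (int c = r + 1; c < 4; ++c)
            std::swap(m[r * 4 + c], m[c * 4 + r]);   // mirror across the main diagonal
}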

Another bug was found while implementing the texture API in the trace. If a texture unit is present in the shader program and a draw command is triggered before the API binds the texture to the shader, the simulator crashes, even when the specific section of code that uses the texture unit can never be executed (it sits in an if-statement that never fires in that draw call). This happens because ARB programs have no control flow, so this situation could never arise with an ARB program.

My solution is to move one dummy texture bind in front of the first frame, to prevent the uninitialized memory access that leads to the simulator crash.

3.4 Converter for NSYSU to ATTILA Assembly

The primary objective of the converter is to translate the assembly that our GLSL compiler generates for the NSYSU GPU into instructions that Attila understands. The assembly converter performs the following nine steps:

1. Create shader compiler table

2. Transcode

3. Create adjustPC table

4. Loading shader compiler table

5. Attribute / Varying register setup

6. Memory setup

7. Instruction conversion

8. Output registers setup

9. Viewport code trim

Step 1 creates the shader compiler table provided by our GLSL compiler, together with the constant-register information given by the API converter. The table is saved into an array for the later instruction conversion.

Step 2 unifies the form of the NSYSU assembly. Two jobs are done in this step. The first is to change all NSYSU instructions from upper case to lower case and to trim off the PC, binary code, immediate bit-format, and vector-identifier fields. The second is to change all immediates into float format. The NSYSU disassembly uses different representations for integer and floating-point values, for example:

PC: 12 (96) 51E06010d3EDA000: LDI R15.1000, -1071536165 (-2.525870)

PC: 40 (320) 51E0202000001280: LDIF R15.0100, R0, 3.000000 (1077936128)

For the LDI instruction, the value in decimal (float) format is inside the parentheses, but LDIF puts the float value in the third field of the three-address code. The transcode step unifies these instructions to:

ldi r15.1000, -2.525870
ldif r15.0100, r0, 3.000000

The transcode step also trims off the irrelevant fields mentioned above.

Step 3 creates the adjustPC table. The purpose of this table is to supply program counter (PC) information when converting control-flow-related instructions such as BEQ and JMP. We cannot reuse the jump addresses of the NSYSU assembly, for two reasons: first, NSYSU JMP instructions use absolute addresses while the Attila JMP instruction uses relative addresses; second, the instruction conversion expands certain NSYSU instructions from one instruction into several, which changes the relative addresses. The adjustPC table therefore records, for each original program counter in the NSYSU disassembly, its position after shifting for every instruction that was expanded into multiple lines. We look up this table when converting control-flow instructions such as BEQ and JMP and calculate the correct new address to branch or jump to.

Table 3.2. NSYSU-to-Attila instruction map used in the morphing ball benchmark. Several NSYSU instructions are implemented by the same Attila instruction; for example, both mulf and mul are implemented by Attila's mul instruction. Some NSYSU instructions are implemented by a combination of Attila instructions; for example, div is implemented by the combination of Attila's rcp and mul instructions.

Step 4 loads the previously created shader compiler table into an array.

Steps 5 and 6 create the first two sections of code shown in Figure 3.3: the input mapping and the storing of constant registers into memory.

Step 7 converts the NSYSU instructions into Attila instructions and creates the main body of the shader program (the GLSL assembly in Figure 3.3). Table 3.2 lists the instructions used in the morphing ball benchmark. When converting from NSYSU to Attila, not every instruction maps one-to-one: some NSYSU instructions are not supported by the Attila instruction set and must be converted into a combination of instructions that accomplishes the same functionality. For example, the NSYSU DIV instruction has to be converted into two Attila instructions: an RCP instruction first takes the reciprocal of the divisor, and a MUL instruction then multiplies the dividend by that reciprocal. For control flow, BEQS (branch if equal) is implemented by the combination of STPEQI (set predicate register when equal) and JMP: STPEQI sets the predicate register p0 if the two operands are equal, and JMP then checks p0 and, if it is true, jumps to PC + [relative address]. The relative address is not necessarily the same as the immediate of the NSYSU instruction, because instructions within the jump distance may have been expanded into multi-line sequences, which changes the relative distance between instructions. We therefore look up the adjustPC table from Step 3 to obtain the correct relative address; a sketch of this expansion is shown below.
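The following sketch illustrates both expansions and the adjustPC lookup. Register names, the predicate syntax, and the table contents are illustrative assumptions; the real converter emits Attila mnemonics from the parsed NSYSU disassembly.

// Sketch of two one-to-many instruction expansions and the PC-relative fix-up.
#include <cstdio>
#include <vector>

int main()
{
    // div r3, r1, r2  ->  reciprocal of the divisor, then multiply.
    std::puts("rcp r31, r2");
    std::puts("mul r3, r1, r31");

    // beqs r4, r5, <absolute target>  ->  set predicate p0, then a predicated jump.
    // Attila jumps are PC-relative, and earlier expansions shift the code, so the
    // offset is recomputed through the adjustPC table (original PC -> expanded PC).
    std::vector<int> adjustPC = {0, 1, 3, 4, 6};  // e.g. original PCs 1 and 3 each expanded into two lines
    int sourcePC = 1, targetPC = 4;
    int relative = adjustPC[targetPC] - adjustPC[sourcePC];
    std::printf("stpeqi p0, r4, r5\n");
    std::printf("jmp p0, %d\n", relative);
    return 0;
}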

Step 8 creates the last section of code, the output (varying) mapping, in Figure 3.3.

Step 9 trims off the viewport code used by the NSYSU vertex shader. This code is unnecessary for Attila, so we have to detect it and delete it from the source assembly. The detection is a peephole scan through the source assembly using a shell sed command:

sed -n 'N;N;N;N;N;:L;N;/mov r16.x, r15.wyzw\nnop\nmov r17, r15.xyzw\nnop\nrcp r16, r16\nnop\nmul r15, r15.xyzw, r16.xyzw/=;s/^[^\n]*\n//;bL'

This sed command finds the specific pattern 'mov, nop, mov, nop, rcp, nop, mul'; once it matches, the following commands trim off the next 30 lines and the previous 19 lines, completing the viewport code trim.

3.5 Load / Store instructions & Memory Design

The realization of the load and store instructions (LDV/STV) follows the guide on the Attila official website. Here is the list of steps I followed to add load and store support to the Attila emulator:

1) Add the new opcodes 0x38 (LDV) and 0x39 (STV) to the ShOpcode enum type in ShaderInstruction.h

2) Add the new opcodes 38h and 39h to the translateShOpcodeTable table in ShaderInstruction.cpp

3) Add the new disassembled names ldv and stv to the shOpcode2Str table in ShaderInstruction.cpp

4) Specify that LDV and STV are integer instructions in the ShaderInstruction(u8bit *code) function in ShaderInstruction.cpp

5) Add the LDV and STV opcodes to the setNumOperands function, specifying two input operands, in ShaderInstruction.cpp

6) Declare that LDV and STV are scalar instructions in the setIsScalar function in ShaderInstruction.cpp

7) In the setIsSOACompatible function in ShaderInstruction.cpp, treat LDV and STV as scalar instructions when a scalar write mask is used

8) Add LDV and STV to the setHasResult function in ShaderInstruction.cpp if required

12) Declare the emulation functions shLDV and shSTV for the new instructions in ShaderEmulator.h

13) Implement the emulation functions shLDV and shSTV for the new instructions in ShaderEmulator.cpp

14) Add shLDV and shSTV to the shInstrEmulationTable table in ShaderEmulator.cpp

15) Add the throughput and latency of the new instructions to the *RepeatRateTable and *ExecLatencyTable tables defined in ShaderArchitectureParameters.cpp. In this step I followed the format of the ADD instruction, which does not accurately emulate the timing behavior of memory-access instructions.

(Steps 9-11 of the guide are omitted here.)

Besides the implementation of the load and store instructions, we also need to build a virtual memory that emulates the SRAM of the NSYSU GPU for our GLSL assembly. At first the memory was accessed globally, which led to image degradation when the unified shaders ran in parallel: a result stored at a given memory location by a previous shader invocation was overwritten by the current one. The solution is to assign an independent virtual memory to each core to protect data integrity. A sketch of the emulation functions and the per-core memory appears below.
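A minimal sketch of what the two emulation functions and the per-core memory could look like is given below, under assumptions about the surrounding interfaces: the real functions live in ShaderEmulator.cpp and operate on Attila's own register and QuadFloat types, so every name here is illustrative.

// Sketch of LDV/STV emulation with a private virtual memory per shader core,
// so parallel unified shaders cannot overwrite each other's intermediate values.
#include <array>
#include <cstddef>

struct Vec4 { float x, y, z, w; };

class ShaderCoreMemory {
    std::array<Vec4, 512> mem{};                     // SRAM-like storage local to one core
public:
    void store(std::size_t addr, const Vec4& v) { mem[addr] = v; }
    Vec4 load (std::size_t addr) const          { return mem[addr]; }
};

// ldv rDst, addr : load a vector from the core-local memory into a register.
void shLDV(ShaderCoreMemory& m, Vec4& dst, std::size_t addr) { dst = m.load(addr); }

// stv rSrc, cN, addr : store the given register into the core-local memory.
void shSTV(ShaderCoreMemory& m, const Vec4& src, std::size_t addr) { m.store(addr, src); }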

3.6 Miscellaneous

Miscellaneous problems such as mask/swizzle/select translation must be handled in order to change the NSYSU vector-operation format into the Attila format; as a brief example, NSYSU writes r15.0101 as a mask while Attila writes r15.yz (a sketch of this kind of translation follows below). We also need to expand the code-space size of the Attila simulator, because the code space is easily exceeded by GLSL assembly; I currently set the code size to 2 Kbytes, and the size can be changed at line 48 of Attila's assembler.cpp. All instruction optimizations of Attila are disabled to avoid incorrect control flow and memory accesses.
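The sketch below shows the kind of string translation involved; it assumes that the i-th mask bit selects the i-th of x, y, z, w (the actual NSYSU bit ordering may differ), so it illustrates the mechanism rather than the converter's exact rule.

// Translate an NSYSU bitmask such as "1000" into an Attila-style component suffix
// such as ".x", assuming bit i corresponds to the i-th of x, y, z, w.
#include <cstddef>
#include <iostream>
#include <string>

std::string maskToSwizzle(const std::string& bits)
{
    static const char comp[4] = {'x', 'y', 'z', 'w'};
    std::string out;
    for (std::size_t i = 0; i < 4 && i < bits.size(); ++i)
        if (bits[i] == '1') out += comp[i];          // keep only the enabled components
    return out;
}

int main()
{
    std::cout << "r15." << maskToSwizzle("1000") << "\n";   // r15.x  (cf. ldi r15.1000 ...)
    std::cout << "r15." << maskToSwizzle("0100") << "\n";   // r15.y  (cf. ldif r15.0100 ...)
}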

4 Performance Comparison and Result


Figure 4.1. Comparison of red cube on different platforms. (a) The result from a vendor GPU. (b) The NSYSU GPU. (c) The Attila simulator.

Red cube is a very simple benchmark containing 24 vertices and no texture; it uses a user-defined ModelviewMatrix in GLSL. We compared the performance of the Attila simulator against the NSYSU SystemC simulator using the same GLSL assembly for red cube. Running on an Intel Core 2 Quad 2.4 GHz CPU, the NSYSU SystemC simulator takes 62 seconds to render each frame, while the Attila simulator with the same configuration takes 0.13 seconds per frame. The Attila simulator is therefore 477 times faster than the NSYSU SystemC simulator.


Figure 4.2. Comparison of morphing ball on different platforms. (d) The result from the SystemC simulator. (e) The Attila simulator. (f) The difference map; the PSNR is 35.062 dB.

A more complicated benchmark, morphing ball, contains a 24-vertex surface with a wooden texture and a colored ball of 2880 vertices that bounces over the floor under the control of GLSL. Figure 4.2 shows the morphing ball benchmark run on the NSYSU SystemC simulator and on the Attila simulator. The colors produced by the Attila simulator deviate slightly from the SystemC result; comparing the two images with the PSNR metric gives a signal-to-noise ratio of 35.062 dB. The majority of the difference is caused by the floating-point accuracy in the tracefile. Speed-wise, the SystemC simulator takes more than 30 minutes to render each frame, while the Attila simulator needs only 5.05 seconds per frame using the assembly produced by the converter provided in this thesis. The performance can be improved further by removing the NOP instructions (originally added to prevent hazards in the NSYSU GPU pipeline) and by optimizing the matrix transpose; the best result for morphing ball on GLSL Attila reaches 1 frame per second. Table 4.1 compares the SystemC simulator and the Attila simulator for GLSL.

FPS (frames per second)   SystemC    GLSL Attila   Speed up
Red cube                  0.0167     4.811         288.1
Morphing ball             0.00056    0.198         353.57

Table 4.1. Comparison between the SystemC simulator and the GLSL Attila simulator. For red cube, the SystemC simulator renders 0.0167 frames per second and GLSL Attila renders 4.811 frames per second, a speedup factor of 288.1. For morphing ball, SystemC renders 0.00056 frames per second while GLSL Attila renders 0.198 frames per second, a speedup factor of 353.57.

From Table 4.1 we can observe that the more complex the benchmark, the better the speedup. GLSL Attila is therefore an ideal tool for verifying and profiling complex benchmark programs.

4.1 GLBenchmark

Figure 4.3. Comparison between a vendor GPU and the GLSL Attila simulator. The top image is the result from the GLSL Attila simulator; the bottom image is the result from an NVidia GTX 480 graphics card. The palm tree, actor, and statue in this frame differ slightly from the correct scene.

GLBenchmark is a popular 3D benchmark for embedded systems that uses OpenGL ES 2.0. It is a real-world application that many online media outlets and reviewers use as a reference for evaluating the performance of products. We are currently working on running GLBenchmark on the Attila simulator with the assembly converted from our GLSL compiler. The results generated by the Attila simulator can later be used as golden values when testing the hardware, so the correctness and functionality of the simulator in recreating GLBenchmark are vital to this research project.

Running GLBenchmark on Attila with our GLSL compiler already produces results that are similar to the reference, though not yet perfect; Figure 4.3 shows the current output of the new GLSL Attila simulator. To fully simulate GLBenchmark, more work is still needed on the Attila driver to support several API features that the Attila OpenGL driver does not yet handle: Framebuffer Objects (FBO), texture cube maps, and matrix arrays in GLSL.

5 References

[1] Liang-Bi Chen, Ruei-Ting Gu, Wei-Sheng Huang, Chien-Chou Wang, Wen-Chi Shiue, Tsung-Yu Ho, Yun-Nan Chang, Shen-Fu Hsiao, Chung-Nan Lee, and Ing-Jer Huang. An 8.69 Mvertices/s 278 Mpixels/s Tile-based 3D Graphics SoC HW/SW Development for Consumer Electronics. Proc. of the 2009 IEEE/ACM Asia and South Pacific Design Automation Conference (ASP-DAC'09), Yokohama, Japan, pp. 131-132, Jan. 2009.

[2] Aaftab Munshi and Jon Leech. OpenGL ES Common Profile Specification Version 2.0.25 (Full Specification), 2010. http://www.khronos.org/registry/gles/specs/2.0/es_full_spec_2.0.25.pdf

[3] The OpenGL ES Shading Language. http://www.khronos.org/registry/gles/specs/2.0/GLSL_ES_Specification_1.0.17.pdf

[4] GLBenchmark. http://gfxbench.com/result.jsp

[5] V. del Barrio, C. Gonzalez, J. Roca, A. Fernandez, and E. E. ATTILA: A Cycle-Level Execution-Driven Simulator for Modern GPU Architectures. March 2006.

[6] The LLVM Compiler Infrastructure. http://llvm.org

[7] Mesa 3D. http://www.mesa3d.org

[8] ITRI EGL1.4 & OGL ES 2.0 API Function List Documentation; ITRI NSYSU GPU Device Driver Documentation; ITRI MDK Platform System Test Environment Documentation.

[9] ARB Vertex Program specification. http://oss.sgi.com/projects/ogl-sample/registry/ARB/vertex_program.txt

[10] ARB Fragment Program specification. http://oss.sgi.com/projects/ogl-sample/registry/ARB/fragment_program.txt

[11] Shader Assembly Language (ARB/NV) Quick Reference Guide for OpenGL. http://www.renderguild.com/gpuguide.pdf

[12] Attila Shader ISA table. http://attila.ac.upc.edu/wiki/index.php/ATTILA_Shader_ISA_Public
