
GPU Benchmarking - for handsets from a UI perspective

ERIK BÄCKSTRÖM KRISTIAN SYLWANDER

Master’s Thesis at LTH
Supervisor: Fredrik Ademar
Examiner: Michael Doggett

Abstract

The graphics processing unit (GPU) is becoming an essential part of mobile phones and handsets, significantly improving their ability to handle graphics. With this new chip it becomes possible to develop not only stunning new games but also new, innovative 3D GUIs (graphical user interfaces). Today there exist a couple of benchmarking programs targeting mobile GPUs, but all of these tools focus on game performance rather than GUI performance. The aim of this master’s thesis is to develop a benchmarking program for mobile GPUs that focuses on GUI performance. We have identified five features that are important when benchmarking a handset from a GUI perspective: scissoring, texture uploads, alpha blending, load times and frame rate. Additionally, we have benchmarked five different devices using our tool. From the tests we could conclude that there were big differences in performance compared to the specification sheets.

Acknowledgement

This master’s thesis is done as a concluding part of our education at LTH (Lund Institute of Technology) with a focus on computer science. During the course of our master’s thesis we have received invaluable supervision and help from a number of people, both at TAT and at LTH. We would therefore like to take the opportunity to thank these people for their advice, knowledge and dedication. First we would like to thank our supervisors at TAT: Fredrik Ademar, Fredrik Berglund and Dan Gärdenfors. We also want to thank Michael Doggett, our examiner at LTH. Special thanks are directed to Michael Winberg and Emil Wassberger for their invaluable help. We would also like to thank our girlfriends Sania and Pernilla, for all their love and support, and for their comments on the thesis.

Contents

1 Introduction
  1.1 The Problem
  1.2 Our Aim
  1.3 Collaboration

2 Background
  2.1 OpenGL
    2.1.1 OpenGL 1
    2.1.2 OpenGL 2
  2.2 OpenGL ES
  2.3 EGL

3 Benchmarking a GPU
  3.1 Current Tools on the Market
    3.1.1 GLBenchmark
    3.1.2 JBenchmark
    3.1.3 Futuremark 3D Mobile
  3.2 Why a new Benchmarking tool?

4 Method

5 Result
  5.1 Test Framework
    5.1.1 PolluxGL
  5.2 GUI Characteristics
  5.3 Test Cases
    5.3.1 Basic
    5.3.2 Fill Rates
    5.3.3 Textures
    5.3.4 Frame Buffer Objects
    5.3.5 Triangle Throughput
    5.3.6 Shaders
    5.3.7 GUI Specific
    5.3.8 Pipeline Performance
    5.3.9 Optimizations
  5.4 Visualize Result

6 Benchmarked Devices
  6.1 Tested Devices
  6.2 Benchmarking Result
    6.2.1 Basic
    6.2.2 Fill rates
    6.2.3 Textures
    6.2.4 Frame Buffer Objects
    6.2.5 Triangle Throughput
    6.2.6 Shaders
    6.2.7 GUI Specific

7 Discussion
  7.1 Incorporation in a development process

8 Conclusion

Appendices

A Test Case Descriptions
  A.1 Frame Buffer Clear
  A.2 Draw Triangle
  A.3 Color Fill
  A.4 Z Fill
  A.5 Color-Z Fill
  A.6 Texel Rate
  A.7 Texture Upload Rate
  A.8 Texture Upload Format
  A.9 Alpha Blending
  A.10 Scene Alpha
  A.11 Scissoring
  A.12 Frame Buffer Object (FBO)
  A.13 Frame Buffer Count
  A.14 Triangle Count
  A.15 Triangle Count VBO
  A.16 Vertex Buffer Object (VBO) count
  A.17 Photo River
  A.18 Photo River with Mip-map
  A.19 Program Switch
  A.20 Shader Precision
  A.21 Shader Limitation
  A.22 Subtexture updates
  A.23 Bump Mapping
  A.24 Timer test
  A.25 Non power of two textures
  A.26 Pixel Pipeline Performance

Bibliography

Chapter 1

Introduction

In the first quarter of 2009 mobile phone sales dropped 8.6 percent compared to the first quarter of 2008, while smartphone sales increased by 12.7 percent during the same period [1]. One reason for this increase is the introduction of the iPhone, which started the trend of touch screens and put focus on graphics and usability in handsets [2]. Vendors are now trying to create the most visually appealing graphical user interfaces (GUIs) to meet the competition from Apple. But to be able to create these more advanced graphics, new hardware is needed in the phones. The key component is the graphics processing unit (GPU). It enables handsets to render advanced 3D graphics, mostly used for 3D games. Since the mobile game industry is estimated to generate 10 million dollars this year it is not surprising that the focus is on games, but the GPU can also be used to create new, innovative 3D GUIs, and since the user experience is becoming one of the main selling points for mobile phones this is a great opportunity for operators and vendors [3]. Mobile manufacturers need to have the latest technology in their phones to be able to compete on the global market, but they also need to cut their costs. Different mobile phones therefore perform very differently even though they support the same standards. The graphics standard is not always followed, and sometimes functions are poorly implemented, which can lead to unexpected behavior and/or bad performance, especially on a prototype device [4]. Graphics application programming interfaces (APIs) are often implemented with the main purpose of running games, which in turn can lead to functions not commonly used in games being implemented incorrectly or inefficiently due to short time frames during development. These functions can, for instance, be useful in a GUI [4].

1.1 The Problem

The problem is that the benchmarking tools available today focus on game performance, which makes them unsuitable for GUI benchmarking. Another aspect is that when developing a GUI, or anything else using graphics, one would like to know which specific features are fast or slow on a certain device. That information is, to our knowledge, not provided by any of the tools available today, as these use scenario-based test cases. The lack of a sufficient 3D GUI benchmarking tool gives rise to another problem.


When an unfamiliar device is used in a development process intended to create a graphics application of some kind, the device is not benchmarked prior to the start of the project. This will almost certainly lead to surprises about performance during development, which in turn can cause costly delays.

1.2 Our Aim

Our aim is to develop a benchmarking tool for finding the limitations and bottlenecks of a mobile GPU. The tool will be designed with the development of a 3D GUI in mind; hence the tests performed by the program will focus on GUI-specific features. It will also focus on individual low-level features in order to find minor bottlenecks caused by these features. We will also investigate the possibility of using this program to prevent delays caused by insufficient knowledge about the device from occurring in the development process.

1.3 Collaboration

This master’s thesis has been carried out in collaboration with The Astonishing Tribe (TAT) and the Department of Computer Science at the Faculty of Engineering at Lund University. TAT is a company that develops GUI solutions for mobile devices. They have, among many other things, developed the GUI for Android (Google’s operating system for mobile devices), and their products can be found in over 300 million devices worldwide [5].

Chapter 2

Background

A very important factor that made it possible to run advanced 3D games on mobile devices is the introduction of OpenGL for Embedded Systems (OpenGL ES or GLES), a platform-independent API for rendering 2D and 3D graphics on devices with limited resources. In this section we give a brief background to this API as well as an introduction to the benchmarking tools available on the market today.

2.1 OpenGL

Open Graphics Library (OpenGL) is a cross-platform API for rendering 2D and 3D computer graphics [6]. In other words, OpenGL is a set of commands for the creation of geometric objects and commands that control how these objects are rendered. One of the strengths of OpenGL is the possibility to develop extensions that add new features not specified by the OpenGL standard, which keeps OpenGL current with the latest graphics and rendering algorithms available. Furthermore, it is important to let programmers adapt their applications to different hardware in order to achieve the goal of a cross-platform API. OpenGL was initially developed by Silicon Graphics Inc. (SGI) as an initiative to create a single, vendor-independent graphics API [7]. However, when the Architecture Review Board (ARB) was founded in 1992, it took over maintenance of OpenGL until it became part of the Khronos Group in 2006.

2.1.1 OpenGL 1

The first version (1.0) of OpenGL was released in 1992 [6]. It contained features like per-vertex lighting, texturing and blending. When OpenGL version 1.1 was released, vertex arrays were introduced as one of the more important additions. With vertex arrays it is possible to transfer geometric data with far fewer commands compared to the previously used begin/end paradigm [8]. In 1998 OpenGL version 1.2 was released, and along with it came three-dimensional texturing, new blending modes and a new, improved vertex array draw element function. The draw element function benefits from the additional index data, which lets OpenGL process the vertex data without having to scan the index data

in order to find out which vertices are referenced [9]. OpenGL version 1.3 was introduced in 2001. In this revision it became possible to use compressed textures in order to reduce memory bandwidth when rendering textured primitives. Some other interesting additions were the support for cube map textures, multi-sampling and multi-texturing [10]. The next year, 2002, OpenGL version 1.4 was released. Among the new features were automatic mipmap generation, support for depth maps and image-based shadowing. Some new blending modes were also introduced [11]. In OpenGL version 1.5 the vertex buffer object was added to the library, meaning that vertex data could be stored in dedicated graphics memory. The vertex buffer object leads to much faster rendering of complex geometry compared to the traditional vertex arrays [12].

2.1.2 OpenGL 2

OpenGL 2.0 was released in 2004. It came with support for high-level programmable shaders – hence the change of major version number. This version also introduced the OpenGL Shading Language (GLSL), a high-level language for developing vertex and fragment shader programs. Important additions in this version were multiple render targets, where shader programs can write to multiple color buffers, and support for non-power-of-two textures [13]. The next version (2.1) came two years later and supported pixel buffer objects, which made it possible to use buffer objects with both vertex array and pixel data. There were also some additions to the GLSL language [14]. Despite the major change that came with the programmable pipeline introduced in OpenGL 2.0, this version is still compatible with the previous versions. In fact, all versions of OpenGL are compatible with each other.

Shaders

Shaders are used to replace certain parts of the fixed-function pipeline (orange parts in figure 2.1) in order to create a more dynamic pipeline, as seen in figure 2.2. This gives the programmer the possibility to change the vertices and fragments from within the pipeline, which can be used to create stunning new effects that were not possible before. There are two types of shaders: the vertex shader and the fragment shader (pixel shader). The vertex shader operates during the vertex shader stage (figure 2.2) and is used to carry out traditional vertex-based operations such as transforming the position by a matrix, computing the lighting equation to generate a per-vertex color, and generating or transforming texture coordinates [15]. The fragment shader, on the other hand, operates during the fragment shader stage and is used to perform fragment-based operations such as texturing and per-fragment lighting. Two fundamental object types need to be created in order to render with shaders: the shader object and the program object. A shader object contains compiled shader code written by the programmer in GLSL. Such an object can contain either vertex or fragment shader code and is compiled with a specific shader compiler. A program object is created by linking two shader objects, one containing a vertex shader and one containing a fragment shader.
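As an illustration, a minimal GLSL ES shader pair of the kind described above might look as follows. The attribute and uniform names are our own and not taken from any particular test case:

```glsl
// Vertex shader: transforms each vertex by a model-view-projection matrix.
attribute vec4 aPosition;
uniform mat4 uMvpMatrix;
void main() {
    gl_Position = uMvpMatrix * aPosition;
}

// Fragment shader: outputs a constant color for every fragment.
precision mediump float;
uniform vec4 uColor;
void main() {
    gl_FragColor = uColor;
}
```

Each source would be compiled into a shader object, and the two objects linked into one program object before rendering.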


2.2 OpenGL ES

The goal of GLES was to provide a slimmed-down and cleaned-up version of OpenGL [16]. Hence, many of the old, redundant features were removed to create a smaller and simpler library. GLES was developed by the Khronos Group and the first version was released in 2003 [17]. The library is available in two profiles: Common and Common-Lite. The main difference between these profiles is that the Common-Lite profile only supports fixed-point data types, whereas Common supports both fixed-point and floating-point data types. The first version of GLES was based on the OpenGL 1.3 specification with some functionality removed. For instance, this version only supported vertex arrays, as the old begin/end paradigm is insufficient in many ways when the complexity of the geometry increases. Among the features removed in this version were the support for one- and three-dimensional textures, display lists (in favor of vertex arrays) and quad and polygon primitives. The next version, OpenGL ES 1.1, was released in 2004 and was based on the OpenGL 1.5 specification. This version added some new functionality to the library, e.g. automatic mipmap generation, multi-texturing and vertex buffer objects.

Figure 2.1. Rendering pipeline for OpenGL 1.5 and OpenGL ES 1.1 [18]

GLES 2.0 came in early 2007 and was based on the OpenGL 2.0 specification. It implements a fully programmable rendering pipeline (see figure 2.2), and hence most of the old fixed-function pipeline was removed (see figure 2.1). Along with the new pipeline, most of the old transform and lighting functionality was removed as well and replaced by shader programs written by the application programmer. As a consequence, GLES 2.0 is not backward compatible with previous versions of GLES.

2.3 EGL

Embedded-System Graphics Library (EGL) is an interface between OpenGL ES and the underlying native platform window system [19]. EGL provides functions for creating a rendering surface onto which OpenGL ES can draw. It also handles graphics context management and rendering synchronization. From within EGL a user can specify, among other things, which color format should be used when rendering, the amount of Full Screen Anti-Aliasing (FSAA), and the size of the different buffers (color, depth and stencil buffer).

Figure 2.2. Rendering pipeline for OpenGL 2.0 and OpenGL ES 2.0 [18]
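A typical EGL setup follows the sequence sketched below in C. The window handle and the attribute choices are illustrative assumptions on our part, not taken from any particular device or from PolluxGL, and error handling is omitted:

```c
#include <EGL/egl.h>

/* Illustrative EGL initialization sketch; native_window is a
   platform-specific window handle (assumption). */
EGLDisplay dpy = eglGetDisplay(EGL_DEFAULT_DISPLAY);
eglInitialize(dpy, NULL, NULL);

const EGLint attribs[] = {            /* request a 565 color buffer ...  */
    EGL_RED_SIZE, 5, EGL_GREEN_SIZE, 6, EGL_BLUE_SIZE, 5,
    EGL_DEPTH_SIZE, 16,               /* ... and a 16-bit depth buffer   */
    EGL_RENDERABLE_TYPE, EGL_OPENGL_ES2_BIT,
    EGL_NONE
};
EGLConfig cfg; EGLint n;
eglChooseConfig(dpy, attribs, &cfg, 1, &n);

EGLSurface surf = eglCreateWindowSurface(dpy, cfg, native_window, NULL);
const EGLint ctx_attribs[] = { EGL_CONTEXT_CLIENT_VERSION, 2, EGL_NONE };
EGLContext ctx = eglCreateContext(dpy, cfg, EGL_NO_CONTEXT, ctx_attribs);
eglMakeCurrent(dpy, surf, surf, ctx);  /* OpenGL ES calls are now valid */
```

The attribute list is where the buffer sizes and color format mentioned above are chosen.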

Chapter 3

Benchmarking a GPU

The term "benchmarking", when it comes to a GPU, means determining how well the GPU can render and process graphical data such as vertices and pixels. This performance can be measured in many different ways, e.g. FPS (frames per second), fill rates, texel rates etc. FPS is the number of frames that can be rendered per second. Fill rate most commonly refers to the number of pixels a GPU can draw in one second. The fill rate can be calculated by multiplying the number of raster operation units (ROPs) by the GPU's clock frequency [20]. The texel rate is the rate at which a device can draw textures to the screen. It is calculated as the number of texturing units (TMUs) multiplied by the clock frequency of the GPU. Values obtained in this way are theoretical. The real fill rate and texel rate depend on several other factors such as drivers, memory bandwidth and the hardware architecture. In our tests we measure the actual fill rate and texel rate, since these are the values that are interesting when developing for new hardware.

3.1 Current Tools on the Market

Several benchmarking programs for mobile phones exist today. However, the majority of these programs focus on game performance rather than GUI performance. Only Futuremark's 3DMarkMobile ES 2.0 has two specific test scenarios for GUIs. Unfortunately, there is no information available about these user-interface tests at the moment, and our attempt to contact Futuremark for information was unsuccessful.

3.1.1 GLBenchmark

GLBenchmark is an OpenGL ES performance and quality benchmarking program that targets mobile devices [21]. The first versions of the program (supporting OpenGL ES 1.0 and 1.1) run a test suite that covers several important areas, like user experience benchmarking, real-world application performance, high- and low-level graphics, CPU and logic benchmarks and system analytics (altogether over 80 subtests). Some of the tested features include floating-point and integer performance, lighting, fill rate, texturing, triangle throughput and rendering quality. There is also a complex game scene in which rendering

techniques like shadows, particles and skinning are implemented (see figure 3.1). The later version, supporting OpenGL ES 1.1, houses all geometric data in dedicated graphics memory by using vertex buffer objects (VBOs). It is also worth mentioning that the vertex and fragment counts are increased 16 times and 8 times, respectively, in some of the test cases. GLBenchmark 2.0 was first demonstrated at the Mobile World Congress in 2008 [22]. This version of the program benchmarks the new high-end mobile 3D API OpenGL ES 2.0. The goal was to demonstrate new high-quality and popular features using this new graphics library. Below is a list of the new features tested by GLBenchmark 2.0:

• Texture-based and direct lighting
• Bump, environment and radiance mapping
• Soft shadows
• Vertex shader based skinning
• Automatic levels of detail (LOD)
• Multi-pass deferred rendering
• Noise textures
• ETC texture compression

The majority of the test cases in GLBenchmark are measured in triangles per second except for some, which are measured in texels per second or just in plain frames per second.

Figure 3.1. GLBenchmark 1.x complex scene (left), GLBenchmark 2.0 (right) [23] [24]

3.1.2 JBenchmark

JBenchmark is a collection of benchmarking programs targeting different graphics APIs based on Java ME and Android. For instance, JBenchmark 239 is simply a version of GLBenchmark 1.0 targeting the Java bindings of the OpenGL ES library, JSR 239 [25]. Other available Java benchmarks are JBenchmark 1.0/2.0 for MIDP 1.0/MIDP 2.0 devices and JBenchmark HD and PRO for M3G (JSR 184) devices. At the moment there is no Java binding for OpenGL ES 2.0. For a full specification of the test cases, see the JBenchmark homepage www.jbenchmark.com.


3.1.3 Futuremark 3D Mobile

Futuremark is one of the leading benchmarking companies in the world. Their software is often used as a reference when comparing desktop GPUs, with products like 3DMark Vantage and 3DMark06. Futuremark currently has two products aimed at mobile GPUs using OpenGL ES technology: 3DMarkMobile ES 2.0 and 3DMarkMobile ES 1.1, for OpenGL ES 2.0 and OpenGL ES 1.1 respectively [26]. They also have similar programs for the Java bindings of the OpenGL ES library, JSR 184 and JSR 239, both based on 3DMarkMobile 1.1. 3DMarkMobile 1.1 runs two complex game tests with approximately 10K-20K polygons in each scene (see figure 3.2). After these game tests, some minor test cases are performed where different features like vertex and pixel processing and CPU processing are tested. These minor tests benchmark [27]:

• Pixel processing - A simple fill rate test with single and multi-texturing. Measured in texels per second.
• Vertex processing - Benchmarks raw vertex processing performance with a Gouraud-shaded object that generates approximately 50K polygons. Measured in polygons.
• CPU processing - Skinning performed on the CPU.
• Image quality - Observe an object while different OpenGL features are turned on and off.

Figure 3.2. Game test screenshots for 3DMarkMobile ES 1.1 [28]

3DMarkMobile 2.0 follows the same principle as its predecessor. First the program runs two characteristic game tests (see figure 3.3). In the first test about 50K polygons are displayed in each frame, compared to about 100K in the second [29]. The tests are designed so that the first is more fragment heavy while the second is more polygon demanding. Among the minor tests we find:

• Shader tests - Test different kinds of shader loads. These loads can be modified by the user by varying the load between the vertex and fragment shader.
• Batch/state change - Measures performance when there are many batches and constant state switches in the hardware. Measured in frames per second.


• Texture filtering test - Determines the visual quality obtained with different texture filters.

Figure 3.3. Game test screenshots for 3DMarkMobile ES 2.0 [30]

3.2 Why a new Benchmarking tool?

When developing a 3D GUI it is crucial to make sure that the performance of the interface never drops below a level where the user experiences delays. In practice, the frame rate should never drop below a certain value that depends on the device (often 30 FPS). In games the tolerance is much higher when rendering very complex scenes with lots of effects, but a user will quickly tire of a slow and inefficient 3D GUI. As the benchmarking tools of today all focus on games, there is a need for a test tool that can benchmark a device from a GUI perspective as well as test low-level features, making the developer aware of the limitations of the hardware. Such a tool does not exist today.

The benchmarking tools available for the OpenGL ES 2.0 graphics library are, as mentioned, focused on game benchmarking (the only exception being Futuremark 3D Mobile ES 2.0, which has two GUI-based scenarios). These tools render some large scenes and then produce an overall score as the result. The results generated by these benchmarking programs are relative. It would be more appropriate with an absolute result that describes the performance of the tested device without the need for comparison with another. In early development stages, when a company has received a new, unfamiliar device, it is important to answer the questions "What is this device good/bad at?" and "Are there any potential bottlenecks?". If those questions can be answered satisfactorily it will benefit the developers and help them create fast and efficient 3D GUIs at a higher pace. Our opinion is that the benchmarking tools of today cannot answer those questions, because they are scenario-based, which makes it difficult to isolate and benchmark specific features in the graphics library. In a complete scene containing many geometries, textures etc. it is difficult to distinguish fast features from slow ones.
For example, one may want to know the difference between rendering with vertex arrays and vertex buffer objects (VBOs), or which texture format is optimal for the target device. In conclusion, there is a need for a benchmarking tool that tests low-level features from a GUI perspective and produces an absolute result, unlike the relative results produced by the benchmarking tools of today.

Chapter 4

Method

The aim of this master’s thesis is to write a benchmarking program for GPUs in handheld devices. This program should be able to run several test cases and, at the end of each test case, produce a result with information about the test. We identified three steps to make this program a reality.

• Test Framework - The core of the program. The test framework has several important tasks, among them ensuring that the correct parameters are measured and stored during each test case. It also decides when one test case is finished and the next can be started. The framework should contain all the logic needed to support the graphics features used by the executed test cases, for instance texture loading, matrix mathematics, mesh handling and render functions.

• Test Cases - In order to make a good benchmarking tool one needs appropriate test cases. As this tool focuses on 3D GUIs, we can find these test cases by interviewing engineers at TAT about what GUI-specific features are and how graphics in a GUI differ from graphics in a game.

• Visualize the result - The result produced by the benchmarking program should be presented in a legible fashion that is easy to comprehend. The presentation should be platform independent and not require any additional software.

To find out what features a typical GUI for handheld devices uses, we conducted interviews and focus group meetings with engineers and interaction designers at TAT in Malmö. We chose to use a focus group because we wanted to engage the engineers in a discussion in order to learn which features are commonly used in a GUI and which functions have caused problems previously. Interviews were used to discuss certain features in more detail and how to create good tests. From this information we determined which features should be benchmarked and what the important measurement parameters were.


Chapter 5

Result

5.1 Test Framework

We built the test framework from scratch and wrote it in C so that it would be as platform independent as possible. This is more work than altering an already written framework such as the PowerVR SGX OpenGL ES 2.0 SDK, but we wanted full control over all parts of the software in order to get correct measurements. Some parts of the PowerVR SDK, such as the matrix mathematics, were used as a template when creating our own matrix mathematics functions. For texture loading we chose the TGA format for two reasons: it is fairly easy to work with, and it can be compressed without losing too much data. The meshes are all loaded in OBJ format because this format, like the TGA format, is very easy to work with. To be able to run our benchmarking tool on a PC with Windows XP for development purposes we used AMD's OpenGL ES 2.0 emulator.

5.1.1 PolluxGL

We have named the software we developed to benchmark mobile devices PolluxGL. The benchmarking program consists of a number of tests that are specified in separate files. Each file consists of three parts: setup, display and release. The setup and release functions are only called once, at the beginning and the end of the test. The display function is called either for a fixed number of frames or until it returns a "finished" command. During each test the FPS is sampled after each call to the display function (each frame update). The highest and lowest FPS values from the test run are also stored. The values are stored in an array and written to file after all tests have finished running. To record more test data there are functions for creating diagrams and adding data to them, and for adding single values and labels.

Design of PolluxGL

The PolluxGL benchmarking tool is intended to run several test cases enclosed in a test suite. A test case is defined as a struct containing primarily the name of the test case,

a description and three function pointers. A test case is created by implementing three functions, Setup#, Display# and Release# (where # stands for the name of the test case), in a separate file. These functions are then assigned to the three corresponding function pointers in the test struct (see figure 5.1).

Figure 5.1. Definition of a test case in PolluxGL

In the setup function, everything graphics-related that is specific to the test case is initialized: the creation of a shader program, camera setup, a perspective matrix and the creation of some geometry. The display function is called repeatedly until the test is complete. A test is complete when a test-specific frame count is reached or a flag is set. It is in the display function that all the rendering is performed. When the test has ended, the release function is called to deallocate any memory allocated by the test case. The core of PolluxGL (polluxGL.c) contains three important functions.

• initPolluxGL() - Initializes the test framework and prepares the first test case for execution. • runPolluxGL() - The "main-loop" of the program. For a more detailed explanation of this method see "The Main Loop" section below. • exitPolluxGL() - Ends the program by freeing memory allocated by PolluxGL.

A preferable way to run PolluxGL is to initialize an EGL context specific to the target device, e.g. in a file called main.c. In this file the initPolluxGL() function is called to initialize the program. In the traditional "graphics loop", continuous calls to runPolluxGL() are made, and when the program is done a final call to exitPolluxGL() ends it. The EGL context is not created inside PolluxGL, due to the major differences in the initialization process of each device. If the initialization were performed in PolluxGL, the code would be swamped with #ifdef directives and would be difficult to read and maintain as the list of supported devices grows.

PolluxGL "Main Loop"

The main function in PolluxGL is runPolluxGL(). In this function all the test cases are executed and the relevant data is collected. Each test case can go through four stages: Setup,


Display, Release and/or Failed. Here is a brief explanation of each stage.

• Setup stage - Calls the setup function of the current test case and calculates the load time of the test. If the setup was successful the function returns "OK", otherwise "FAILED".

• Display stage - Calls the display function of the current test, calculates FPS and gathers other test-specific measurements. This is done for as long as the test runs. If something goes wrong during a call, the test fails.

• Release stage - Calls the release function of the current test, compiles the gathered test data and writes the result to the result buffer. It also resets the measurement parameters to prepare for the next test case.

• Failed stage - If something goes wrong during a test case this stage is reached. The Failed stage is very similar to the Release stage; the only difference is that the test case is flagged as failed.

A flow diagram of the main loop is shown in figure 5.2.

Figure 5.2. Flow diagram for the main loop of PolluxGL


When all the test cases are done, the program creates a result file containing all the gathered data for each test case. It also contains specific information about the renderer of the device along with all supported OpenGL ES 2.0 extensions. This file is in XML format, which makes it easier to visualize the result (see the Visualize Result section below).

5.2 GUI Characteristics

In this section we discuss some of the most important techniques and parameters for rendering GUIs, identified during our interviews and focus group sessions.

Alpha blending is a technique in OpenGL to achieve transparency/translucency. It is resource demanding and therefore not commonly used in games. In GUIs it is used more frequently, to achieve user-friendly and attractive interfaces. When alpha blending is used, OpenGL must read back pixels from the frame buffer, mix them with the new color and then write the pixels back to the buffer [31]. If you do not know where the transparent and translucent objects will be positioned in the scene, or if they are moving around, it is difficult to get a good real-time result. This is because the Z-buffer cannot be used without first sorting the objects so that all opaque objects are rendered first.

Scissoring is used to update only a part of the screen. Since most 3D games involve moving around in a 3D space, the whole screen has to be updated every frame. In GUIs, on the other hand, there might be only small parts of the screen that are being manipulated, and therefore there is no need to update the whole screen. In these cases scissoring is a useful function that lets you update only a part of the screen.

Load times in games are commonly accepted as a natural part of the game. This is not the case with GUIs. Taking websites as an example, Google showed that an increase in load time from 0.4 s to 0.9 s led to a decrease in traffic and revenue by 20 percent [32]. Similarly, Amazon.com noticed a 1 percent decrease in sales for every 100 ms increase in load time [32]. Although there are big differences between websites and mobile GUIs, this shows how small changes in load time can affect the user experience.

Frame rate is another very important parameter. The frame rate describes how many times the screen is redrawn in one second. If the frame rate is low, the GUI will appear slow and unresponsive.
This is often accepted in games for short periods when a lot is happening on the screen, for instance large explosions etc. However, in GUIs it is not tolerable that the user experiences delay due to a limited frame rate. Texture uploads cannot always be carried out before the GUI is loaded. If a user wants to view their photos, these have to be uploaded from the phone’s memory to the GPU at runtime. A common technique in games is to put many textures together into a single file, but if the texture changes over time this can not be done, resulting in more texture files.

5.3 Test Cases

PolluxGL comes with 26 test cases to benchmark the performance of a device. Screenshots from four of the test cases are shown in figure 5.5. In this section we briefly discuss our test cases; for a complete test list, please refer to Appendix A. In the optimization section we describe possible improvements and optimizations that can be made to the test cases. The test cases are divided into different categories, each described further below. Many of the test cases, for instance the Polygon Count test, benchmark a certain feature by straining the GPU in different ways; when the frame rate drops below a preset level the test ends and the result is calculated at that moment. This level is set to 30 fps, because below this limit the user starts to experience delay in the GUI.

5.3.1 Basic

The purpose of the test cases in this category is to benchmark the most basic operations in the OpenGL ES 2.0 API, such as frame buffer clears and the rendering of a simple triangle. This category also includes a timer test, whose purpose is to measure the time it takes to create or execute different content such as buffer clears, vertex buffer objects etc. Because these test cases are very basic and use the absolute minimum required to produce very simple graphics, it is crucial that the benchmarked device does not fail any of them. For the most part these tests should not cause any problems, but we kept them in order to be absolutely certain; the tested device could, for instance, be a very early prototype. Another reason for keeping these test cases is that we wanted to use the measurements for comparison with other devices.

Included Test Cases:

• Frame buffer clear (A.1)

• Simple Triangle Test (A.2)

• Timer Test (A.24)

5.3.2 Fill Rates

The Fill Rate category measures the fill rate for different scenarios. A fill rate can be measured in both Mpixels/s and Mtexels/s. When measuring in Mpixels/s we want to find the maximum number of colored pixels the renderer can produce in one second. This number can easily be calculated as the screen resolution (number of pixels) times the FPS. Mtexels/s measures the maximum number of textured pixels; the formula for calculating that value is the screen resolution times the FPS times the number of TMUs (Texture Mapping Units) available. It is the TMU's job to apply texture operations to pixels. The test cases included in this category measure the fill rate of the color and Z buffers and the texture fill rate.

Included Test Cases:

• Color Fill Rate (A.3)

• Z - Fill Rate (A.4)


• Color - Z Fill Rate (A.5)

• Texture Fill Rate (A.6)

5.3.3 Textures

Texture mapping is one of the most common ways to color pixels, and it is therefore important that a device performs well when handling textures. We have implemented six test cases to benchmark the texture performance of a device. Each benchmarks a texture-related feature, e.g. non-power-of-two texture performance, subtexture updates and texture upload speed. The texture upload speed is of special interest for GUIs: users often scroll through their photos at runtime, which requires a high texture upload speed and texture bandwidth if the list contains many photos.

Included Test Cases:

• Non-power-of-two textures (A.25)

• Photo River (A.17)

• Photo River Mipmapping (A.18)

• Subtexture updates (A.22)

• Texture uploads (A.7)

• Texture format (A.8)

5.3.4 Frame Buffer Objects

Frame buffer objects (FBOs) allow an application to render directly to a texture or an off-screen surface without the need to create additional rendering contexts [33]. For more information about FBOs, please refer to the section "Terminology" above. Frame buffer objects are fairly important and a good feature to use if you, for instance, want to place a 2D UI in 3D space or have "live" icons that are updated in real time. To benchmark device performance when using frame buffer objects there are two things to test. First and foremost we need to check whether the device is capable of creating an FBO using a certain color format. If so, we can continue with a more demanding test, which is to determine the maximum number of FBOs a device can render before performance drops below an acceptable level (30 fps).

Included Test Cases:

• FBO count (A.13)

• FBO format (A.12)


5.3.5 Triangle Throughput

One of the most important performance parameters is the triangle throughput: how many triangles per second is a device capable of handling? This number is one of the first a developer asks for when trying to figure out the limits of a device. Because there are two ways of storing vertex data (client-side arrays and vertex buffer objects), we needed two test cases. Both are identical, except that the vertex data is stored as either client-side arrays or vertex buffer objects. These tests measure the number of triangles/s the benchmarked device can handle before the frame rate drops below 30 fps. A third test in this category measures the number of vertex buffer objects that can be created and used on a device. This test uses very little geometry per VBO in order to generate an accurate measurement.

Included Test Cases:

• Polygon Count (A.14)

• Polygon Count using VBO (A.15)

• VBO Count (A.16)

5.3.6 Shaders

OpenGL ES 2.0 implements a programmable graphics pipeline, which makes shaders an essential part of the graphics API. Shaders are described in more detail in the section "OpenGL 2". There are a few things of interest in terms of shader performance, such as support for different shader precision formats and the maximum number of texture look-ups and matrix multiplications in the fragment shader. Another common task is to switch shader programs often and rapidly during the course of a program, and for that reason we have a test case that does this. The last test case in this category is dedicated to a heavier shader, to benchmark a device's ability to run a more sophisticated shader. In our case we chose a bump mapping shader.

Included Test Cases:

• Shader Precision (A.20)

• Shader Limitation (A.21)

• Switch Shader Program (A.19)

• Heavier Shader - Bump Mapping (A.23)

5.3.7 GUI Specific

Alpha blending and scissoring are two features commonly used in GUIs. Alpha blending is mostly used for creating transparent objects; on a handheld device with limited screen space it can be very useful for presenting information in different layers. Scissoring is a technique for updating only the part of the screen that has changed since the last frame. In a GUI it is more likely than in a game that only a small portion of the screen updates every frame, so applying scissoring will probably increase GUI performance greatly.

Included Test Cases:

• Scene Alpha (A.10)

• Alpha Blending (A.9)

• Scissoring (A.11)

5.3.8 Pixel Pipeline Performance

The purpose of the Pixel Pipeline Performance test is to find the bottleneck in the pixel pipeline. This test case is motivated by the fact that the graphics bottleneck is almost always in the pixel pipeline rather than in the vertex processing. It is based on the method for finding this bottleneck suggested by Kari Pulli et al. [34]; the test case follows figures 5.3 and 5.4 below in order to determine the pixel performance bottleneck. For a more detailed description of how this is achieved, see Appendix A.

Included Test Cases:

• Pixel Pipeline Performance (A.26)

Figure 5.3. Part one of the Pixel Pipeline test


Figure 5.4. Second part of the Pixel Pipeline test

5.3.9 Optimizations

There is always room for improvements and optimizations to increase performance and achieve a better benchmarking result. In this section we discuss which optimizations could have been made. The vertices in some of the test cases could have been given in screen space, which means that no matrix multiplication is needed in the vertex shader in order to position the vertices. The main reason we did not apply this optimization is that doing so is not a common case in 3D graphics. The texture coordinates could be defined as 16-bit shorts instead of 32-bit floats in order to reduce memory bandwidth; the same principle applies to the normal data. In the Polygon Count test cases as well as in the VBO Count test the different test meshes are placed randomly in the scene and given a random color. These operations are in fact unnecessary and cause a slight performance decrease, but since they make the test cases a bit more realistic and the decrease was very small, we decided to keep them.

5.4 Visualize Result

To be as platform-independent as possible, the result is presented as a web page; the only requirement is access to a web browser. Since the page is written entirely in HTML, JavaScript and CSS, there is no need for a web server or an internet connection. The result file is saved in an XML format, parsed using jQuery [35], a JavaScript library for making scripts that work in all major web browsers, and then presented as a list (see figure 5.6). A successful test is marked green with the text OK, and a failed test is marked red with the text NOK. When the user clicks on a test, more information is presented: first a brief description of the test and how many triangles and vertices it uses; then general results such as average FPS, load time and test time; and last a section with test-specific results and diagrams to help visualize them. The diagrams are created with the JavaScript library Flot [36]. If the frame rate drops below 30 fps in any test, a warning sign appears in the title of the test to alert the user.


Figure 5.5. Screenshots from four of the test cases. 1. Photo River, 2. Subtexture updates, 3. Scissoring, 4. Alpha blending

Figure 5.6. The result displayed as a web page

Chapter 6

Benchmarked Devices

6.1 Tested Devices

During the course of our master's thesis we have benchmarked five different devices, but due to secrecy we cannot publish the names of all of them. They will instead be referred to as A, B, C and D. To determine how well a particular device performs, we use the iPhone 3GS as a reference device. Devices A and B are both targeted at the mobile phone market and are based on Windows Mobile and Linux respectively; both are intended for very high-end smart phones. Device C is developed for the tablet PC market and is based on a Linux operating system. Device D is also Linux-based, targeted at the automobile market. The reference device, the iPhone 3GS, uses an Apple OS called iPhone OS. It is built around a PowerVR SGX 535 GPU, developed by Imagination Technologies, and an ARM Cortex-A8, integrated as a system on a chip (SoC) [37].

Device   Target market     Platform
A        High-end mobile   Windows Mobile
B        High-end mobile   Linux
C        Tablet PC         Linux
D        Automobile        Linux
iPhone   High-end mobile   iPhone OS

6.2 Benchmarking Result

In this part we present the results obtained after running our test program on the devices. Device C and the iPhone had the frame rate limited to around 60 FPS, which makes it more difficult to get a fair result from some of the tests; the frame rate has probably been capped to reduce power consumption. The results can be viewed in table 6.1. Not all of the devices were able to run all the test cases; where a device could not, it is marked NA in the result table. The results are presented in the same order as the test cases were introduced.


Table 6.1. A selection of results produced by PolluxGL

Test                                   Device A  Device B  Device C  Device D  iPhone
Frame buffer clear (frames/s)          200       149       59        79        61
Draw triangle (frames/s)               286       118       59        74        61
Color fill rate (Mpixel/s)             58.4      43.8      22.7      33.6      9.4
Texel rate (Mtexel/s)                  79.8      76.8      81.4      71        79.9
Texture upload rate (Mbit/s)           37.2      84.4      116.9     38.7      98
Load large RGB texture (ms)            671       671       409       722       222
Load large RGBA texture (ms)           679       679       477       1622      207
Tri count, before FPS < 30 (Mtri)      8.7       2.3       NA        0.7       3.2
VBO tri count, before FPS < 30 (Mtri)  19.4      2.7       4         0.8       10.2
Bump map shader (frames/s)             78        43        59        33        61
Render w. scissoring (ms)              2983      6133      8627      NA        8316
Render wo. scissoring (ms)             7125      7688      8602      NA        8321

6.2.1 Basic

Since the iPhone and device C are capped at a frame rate of 60 FPS, it is difficult to compare the devices in the basic tests. However, it is faster to render a triangle on device A than to do a frame buffer clear, whilst on device B it is the other way around. The reason is that on device B one has to clear the color buffer before drawing each frame. If the entire screen is updated each frame, one could increase performance on device A by not clearing the color buffer.

6.2.2 Fill rates

The fill rates are very basic and important measurements of device performance. How high a fill rate one needs depends on the resolution of the screen and the kind of application being developed. The results from device C and the iPhone are not meaningful since these devices are capped at 60 FPS; the reason they get different results is that they have different resolutions. From the results it is clear that the texel rate is much higher than the color fill rate. This is due to the fact that texturing is a very common task when rendering 3D scenes, and the GPUs are therefore optimized for it.

6.2.3 Textures

When it comes to the texture upload rate, two of the devices really stand out: device C and the iPhone. They reach 117 and 98 Mbit/s respectively, while the other three only manage around 37 Mbit/s. When comparing color formats there is little difference between an RGB and an RGBA texture for most of the tested devices. Device D, however, took approximately 2.3 times longer to upload the RGBA texture. It is also noteworthy that the iPhone was faster when uploading the RGBA texture than the RGB one. From these results we can conclude that the iPhone should use textures with an alpha channel, but device D absolutely should not. In the photo-river test only device B got a significant performance increase, 11 FPS, when using auto-generated mip-maps; the other devices only gained 1 FPS. A surprising result is that even though device C has a higher texture upload rate, 116.9 Mbit/s, than the iPhone's 98 Mbit/s, it takes almost twice as long to upload a large texture on device C. One explanation could be that the iPhone keeps the texture data in its cache even after we told it to remove the texture. When rendering textures with non-power-of-two dimensions the iPhone ran at 61 FPS, but only because it discarded the textures and rendered black squares instead. The most remarkable result was given by device A, which ran at 64 FPS, whilst devices C, B and D ran at 32, 27 and 21 FPS respectively. Using non-power-of-two textures can be important on mobiles when, for example, displaying photos.

6.2.4 Frame Buffer Objects

The variation in how many FBOs can be drawn before the frame rate drops below 30 FPS is quite large. The iPhone can only handle 6 FBOs, compared to device C which can handle up to 67. On device D we could not create FBOs at all. FBOs can for example be used when rendering to a texture to display multiple views or desktops on a cube; for these purposes the iPhone is not to be recommended.

6.2.5 Triangle Throughput

Using VBOs instead of client-side arrays can give a huge performance boost. The iPhone can produce more than three times as many triangles when using VBOs, thus allowing much larger scenes to be rendered. The NA for device C is because it could not draw client-side arrays at all, which in itself is valuable to know when troubleshooting the device.

6.2.6 Shaders

Since OpenGL ES 2.0 is completely based on shaders, shader performance is very important when benchmarking a GPU. When rendering an almost full-screen quad with a bump-map shader, device D ran at 33 FPS, which indicates a bottleneck in the fragment shader. Devices A, C and the iPhone displayed no problems rendering the quad. If the result is compared to the draw triangle test, device B performs very well, losing only 30 percent of its frame rate. This is very good performance, especially compared to device A, which ran at 200 FPS when rendering a simple triangle but now runs at 78 FPS; this indicates a bottleneck in the fragment shader of device A. In our shader tests, devices C and D could only do five matrix multiplications per frame in the fragment shader before the FPS dropped below 30, which limits the effects that can be achieved. The other three devices managed 7 MMPF, the maximum in the test.


Figure 6.1. The result for the Alpha Blending test case

6.2.7 GUI Specific

When it comes to alpha blending, the iPhone is by far superior to the others. When drawing 21 alpha-blended planes the average FPS of the other devices is around 14, while the iPhone runs at 60 FPS. Even when drawing over 30 planes the iPhone runs at over 30 FPS (see figure 6.1), making it an excellent device for heavy use of alpha blending. Scissoring does not always give a large performance increase. Device A made the biggest gain, with a 4.1 s decrease in runtime; on the other devices a more complex scene would probably be required to get a more noticeable result.

Chapter 7

Discussion

Before we began to create test cases we investigated the differences between rendering a GUI and a 3D game. In our interviews we discovered that there are some, but not that many, important differences. When rendering a 3D scene, whether it is a game or not, information about fill rate and texel rate is always important. Therefore we have included several tests that measure basic 3D performance, on top of which we have a handful of tests for features commonly used in GUIs.

The test framework has proven easy to integrate with different platforms. Once a new device is running there have been few or no problems in getting the framework to run; the biggest problem has been getting individual test cases to work correctly. On some devices, for example, one needs to clear the frame buffer before rendering to it, while on others this operation reduces performance. These errors can be difficult to detect at run time, so the tests need to be supervised by an operator. For example, when running the non-power-of-two texture test on the iPhone the results might seem fine, but on closer examination you will see that it is actually discarding the textures and drawing black squares instead. These problems forced us to customize the test suite for every new device, some more than others. A way to minimize this problem could be to create a setup program that performs initial tests for known issues from previous devices. After the setup program has executed, necessary adjustments could be applied to different tests in the suite, for example whether the triangle test should clear the color buffer every frame or not.

When designing the test cases we chose not to do any scenario-based tests; that is, we did not do any tests where we rendered a complete GUI. This might seem strange, since most other benchmarking software uses scenarios for its measurements. A scenario test is good for comparing devices and getting a quick overall performance measurement.
If we were to render a GUI scenario, the test result would only say how well the device could run that specific scene. With our tests we want to be able to specify exactly what is making the device run slow or fast, and thus give the developer a better understanding of how to optimize and what to expect from the device.

From the test results it is clear that the iPhone 3GS is a very powerful device; interestingly enough, its performance in the FBO test is far beneath the competitors'. From our results it is also clear that you should avoid alpha blending on some devices, like device B. The devices that are most interesting to compare are the iPhone and device B, since they have an SGX535 and an SGX530 respectively, the SGX535 being essentially an overclocked SGX530. What is also interesting is that device D should be able to produce 27 Mtri/s and an effective 664 Mpix/s (with overdraw) according to the spec sheet. In our tests it only managed 756 Ktri/s and 33 Mpix/s, less than 3 percent of the stated triangle figure. This is probably due to problems with the drivers, but it nevertheless shows that the theoretical value can be far from the truth, and the need for a tool like PolluxGL.

7.1 Incorporation in a development process

Even though PolluxGL, in its current state, can give very good information about a GPU's performance, some other pieces must be in place for it to be useful in an organization. PolluxGL has to be incorporated into the development process, which means there must be an easy way to store, access and interpret the data produced by the program. Currently PolluxGL can record relevant data about a GPU and present it in a lucid manner, but it cannot help the user with interpretation of the data or the organization of multiple tests. To handle all the information produced by PolluxGL, the data must be stored centrally, making it readily available to everyone in the organization. With centralized storage, the risk of the same work being performed multiple times is also reduced. Preferably the data would be uploaded to a server automatically, but since PolluxGL runs on development devices there is often no direct network access; instead the data would most likely have to be uploaded manually, either directly to the server or via a web application. The most convenient way to access the data would probably be through a web portal. The portal could be based on the result visualizer we developed during our project, extended with functionality to compare tests and help with interpretation of the data. It could also be extended to incorporate other benchmarking data, to give a more complete overview of the performance of a new device. If PolluxGL is incorporated into the development process it has the potential to reduce project lead time, and it would most likely help the development team to better utilize the GPU of a device. To get an estimate of how many resources can be saved, a more in-depth study of how much time is lost to troubleshooting and poor GPU performance would have to be made.

Chapter 8

Conclusion

There are five features that are of special interest when rendering a 3D GUI: scissoring, texture uploads, alpha blending, frame rate and load times. We have developed a method to quantify these parameters and implemented it as tests. The tests are performed by the test program, PolluxGL, that we have developed; after running PolluxGL, the developer receives information about a mobile device's 3D GUI performance. From the tests we could conclude that device performance deviated significantly from the values stated in the specification sheets.

For PolluxGL to be effective in a large organization it has to be integrated into the development process. We believe that this would be fairly easy to do, and that it would lead to shortened lead times and more beautiful products.


Appendix A

Test Case Descriptions

A.1 Frame Buffer Clear

Purpose: Test the device's frame buffer clearing speed by clearing the color and depth buffer bits. Since these clear operations are essential, it is crucial that this test case performs well on any particular device; if not, the succeeding test cases will perform badly as well.

Description: Clears the color and depth buffer bits repeatedly. This test doesn't render anything, which means that the test progress is not visible.

Measures: FPS

Expected result: A high FPS is expected since this is one of the most basic tasks a GPU can perform.

A.2 Draw Triangle

Purpose: Test that basic rendering of simple geometry is done correctly. It is essential that this test case performs very well since the following tests use a lot more geometry.

Description: Draws a simple red triangle in the centre of the screen using client-side vertex arrays.

Measures: FPS

Expected result: The triangle should be red and in the centre of the screen. FPS should be high and in the vicinity of the Frame Buffer Clear test, probably a bit higher since the frame buffer is not cleared.

A.3 Color Fill

Purpose: Measure the raw color fill rate. Even though textures have become the most common method of coloring geometry in computer graphics, vertex color fill is still very important, for instance when performing per-vertex lighting and similar features. Hence we wrote a test case covering this area.

Description: Draws 8 screen-aligned quads, each consisting of two triangles, onto the frame buffer in shifting colors. The fill rate is calculated as:

screen width × screen height × average FPS

Measures: FPS and pixels/s

Expected result: Color should shift randomly. FPS should be slightly lower than the draw triangle test.

A.4 Z Fill

Purpose: Measure the raw Z-buffer fill rate. The Z fill rate is very important because the Z buffer rejects a lot of occluded geometry before it is processed further.

Description: Draws 8 screen-aligned quads, each consisting of two triangles, onto the frame buffer with GL_DEPTH_TEST enabled and all colors masked out. The fill rate is calculated as in the Color Fill test case.

Measures: FPS and pixels/s

Expected result: Depending on the device it could be both higher and lower than the color fill test case. The screen should only show a nice clean color during the test since no colors are written to the frame buffer.

A.5 Color-Z Fill

Purpose: Measure the combined color and Z-buffer fill rate. This test case is a combination of the two previous ones: it both performs z-comparison and writes color and z-values to the frame buffer.

Description: Draws 8 screen-aligned quads, each consisting of two triangles, onto the frame buffer with GL_DEPTH_TEST enabled. The fill rate is calculated as in the Color Fill test case.

Measures: FPS and pixels/s

Expected result: FPS should be about the same as in the color fill test.


A.6 Texel Rate

Purpose: Measure the texel rate by rendering a number of textures onto a full-screen plane. This number is determined by the number of TMUs (Texture Mapping Units) that the device supports, which is queried at the beginning of the program. A high texel rate is important in GUIs and in computer graphics overall, as texturing is one of the most common coloring methods.

Description: Renders a plane containing two triangles onto which several textures are mapped. The textures are 512x512 in resolution.

Measures: FPS and texels/s

Expected result: FPS should be considerably lower than in the fill tests.

A.7 Texture Upload Rate

Purpose: Measures the maximum texture upload rate (number of uploads per second) before FPS drops below 30. Since texture switching is an important GUI feature (think of a user scrolling through a photo album with many different pictures), a high upload rate is preferable.

Description: Draws 18 planes in a 3x6 matrix. These planes are then textured with different textures at increasing speed. The textures are freed and loaded every time they are used. The textures are 128x128.

Measures: FPS, uploads/s and bits/s

Expected result: The higher the upload rate, the better the texture performance.

A.8 Texture Upload Format

Purpose: Test the texture creation speed of different texture formats (GL_RGB, GL_RGBA, GL_LUMINANCE, GL_LUMINANCE_ALPHA).

Description: Renders a scene with textured and alpha blended planes.

Measures: FPS

Expected result: The device should perform a stable fps above 30.

A.9 Alpha Blending

Purpose: Measure the device's ability to render alpha-blended objects. Alpha blending is frequently used in 2D GUIs and will probably be a common feature in 3D GUIs as well. It is therefore important that a device performs well in this test case.


Description: Renders full screen alpha blended planes. Every 50 frames we add a new plane to the scene.

Measures: FPS

Expected result: The result should be stable above 30 fps for the first couple of planes and then decrease as more planes are added to the scene.

A.10 Scene Alpha

Purpose: Measures performance when rendering a more demanding scene with a combination of features enabled, such as texturing and alpha blending.

Description: Renders a scene with textured and alpha blended objects.

Measures: FPS

Expected result: The device should perform a stable fps above 30.

A.11 Scissoring

Purpose: Test for FPS gain when using scissoring.

Description: A woman with a stone texture and a rotating monkey head are rendered to the screen. After 1/3 of the frames a scissoring box is put around the rotating monkey head. After 2/3, the scissoring box is put around an empty space in the scene.

Measures: FPS

Expected result: An increase in number of FPS when scissoring is used and when render- ing outside the scissor box should be noticeable in the diagram.

A.12 Frame Buffer Object (FBO)

Purpose: Test for supported pixel color formats and, to some extent, FBO performance.

Description: Tries to create FBOs with different color pixel formats. Draws 12 FBOs to the screen and assigns a random color to each of them. The size of the created FBOs is 128x128.

Measures: FPS

Expected result: GL_RGB, GL_RGBA4, GL_RGB5_A1 and GL_RGB565 should be supported.


A.13 Frame Buffer Count

Purpose: Determine the maximum number of FBOs that can be rendered before FPS drops below 30.

Description: Renders an increasing number of FBOs. The size is set to 128x128 pixels.

Measures: FPS and FBO count.

Expected result: At least a couple of frame buffer objects.

A.14 Triangle Count

Purpose: Determine the maximum number of polygons (triangles) that can be rendered before FPS drops below 30. This is a very important figure that lets the developer know how much geometry can be processed each second. However, the geometry rendered in this test case is purely vertex colored and no texturing is enabled; using textures will decrease the amount of geometry rendered each second.

Description: Renders an increasing number of spheres as client-side vertex arrays. Each sphere consists of about 10 000 triangles. This is a considerably high triangle count per mesh, but it is essential to make sure the set-up overhead for each mesh object is as small as possible.

Measures: FPS and polygons/s

Expected result: Some million triangles/s.

A.15 Triangle Count VBO

Purpose: Determine the maximum number of polygons (triangles) that can be rendered before FPS drops below 30. This test case is exactly the same as the previous one, with the only exception that the vertex data is stored in dedicated graphics memory.

Description: Renders an increasing number of spheres as VBOs. Each sphere consists of about 10 000 triangles.

Measures: FPS and polygons/s

Expected result: Should be somewhat to significantly higher than in the previous test case.

A.16 Vertex Buffer Object (VBO) count

Purpose: Determining the maximum number of VBOs that can be rendered before the FPS drops below 30. This test case is basically the same as the previous one. However, the geometry is limited to primitive planes in order to find the maximum number of VBOs. We intend to test a device's ability to fetch vertex data from dedicated graphics memory to make sure that these API calls are not a bottleneck.

Description: Draws an increasing number of planes in different colors, rendered as VBOs.

Measures: FPS and number of VBOs

Expected result: N/A

A.17 Photo River

Purpose: This test case is mainly used as a reference for the photo-river with mip-map test case (described below).

Description: Draws many photos moving back and forth along the z-axis.

Measures: FPS

Expected result: A lower FPS than with mip-mapping enabled.

A.18 Photo River with Mip-map

Purpose: Test whether auto-generated mip-map textures give a performance increase over textures without mip-mapping enabled.

Description: Draws many photos moving back and forth along the z-axis, with auto-generated mip-maps enabled.

Measures: FPS

Expected result: A higher FPS than the photo-river test should be achieved.
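The performance gain from mip-mapping trades against extra texture memory: a full mip chain adds roughly one third on top of the base level. A small Python sketch of that calculation (texel counts only; the byte cost depends on the pixel format):

```python
def mip_chain_texels(w, h):
    """Total texel count of a complete mip chain down to 1x1."""
    total = 0
    while True:
        total += w * h
        if w == 1 and h == 1:
            break
        # Each level halves both dimensions, clamped at 1
        w, h = max(1, w // 2), max(1, h // 2)
    return total
```

For a 256x256 base level the chain holds 87 381 texels, i.e. exactly 4/3 of the 65 536 base texels.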

A.19 Shader Program Switch

Purpose: Switching shader programs during rendering is a common operation. In this test case we switch between four different shaders at decreasing intervals.

Description: Renders a sphere and changes the shader program with decreasing intervals.

Measures: FPS

Expected result: The device should not have any problems with shader switching, and a frame rate above 30 is a must.
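The decreasing switch intervals can be generated with a simple halving schedule. The concrete numbers below (600 frames, a starting interval of 120 frames, a floor of 15) are illustrative assumptions, not the values used in the tool:

```python
def switch_schedule(total_frames=600, start_interval=120, min_interval=15):
    """Frame numbers at which the shader program is switched; the interval
    between switches halves each time, down to a minimum."""
    frame, interval, schedule = 0, start_interval, []
    while frame + interval <= total_frames:
        frame += interval
        schedule.append(frame)
        interval = max(min_interval, interval // 2)
    return schedule
```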

A.20 Shader Precision

Purpose: Test how the shader precision affects the performance of the device.

Description: Renders a scene with a rotating mesh using Phong shaders with different precision.

Measures: FPS

Expected result: The frame rate should drop when a lower precision mode is used.

A.21 Shader Limitation

Purpose: To find possible bottlenecks in the shaders, e.g. how many matrix multiplications and texture look-ups can be done in the fragment shader.

Description: Renders a sphere with different shaders. Each shader performs a number of matrix multiplications or texture look-ups.

Measures: FPS, number of texture look-ups and number of multiplications in the vertex/fragment shader.

Expected result: The vertex shader should sustain a large number of matrix multiplications, while the fragment shader should manage only a few (3-7).

A.22 Subtexture updates

Purpose: Measure performance when updating parts of a texture compared with updating the whole texture.

Description: Renders a plane while updating only parts of its texture. After 1/3 of the frames the whole texture is swapped. During the last third the texture is reduced in size, the plane is moved far away, and only parts of the texture are updated.

Measures: FPS

Expected result: FPS should be higher when doing sub-texture updates. The reason for changing the size of the texture and moving the plane further away is to see whether the size has an effect on the FPS.
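The expected benefit of sub-texture updates comes from the reduced upload volume. A back-of-the-envelope sketch, assuming 4 bytes per texel (RGBA8); the 32x32 and 256x256 sizes are illustrative, not the ones used in the test:

```python
def upload_bytes(width, height, bytes_per_texel=4):
    """Bytes transferred when uploading a width x height texel region."""
    return width * height * bytes_per_texel

# Updating a 32x32 sub-region instead of the full 256x256 texture
# uploads 64x fewer bytes per frame.
full = upload_bytes(256, 256)
sub = upload_bytes(32, 32)
```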

A.23 Bump Mapping

Purpose: See how well the device can handle a heavier shader.

Description: Renders a big quad with a brick texture and a shader which does normal mapping on it.

Measures: FPS

Expected result: Preferably over 30 FPS.

A.24 Timer test

Purpose: Measure the time it takes to create different objects such as VBOs and FBOs. The time to clear the different buffers is also measured.

Description: This test case does not render anything to the screen. It simply creates a lot of content and calculates the average time to create each measured object.

Measures: Time in ms to create different contents.

Expected result: Reasonable times.
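The averaging itself can be sketched in a few lines; `create` is a placeholder for any object-creation call (e.g. FBO or VBO creation), not an API from the tool:

```python
import time

def avg_creation_ms(create, iterations=100):
    """Average wall-clock time in milliseconds for one call to create()."""
    start = time.perf_counter()
    for _ in range(iterations):
        create()
    elapsed = time.perf_counter() - start
    return elapsed / iterations * 1000.0
```

In the real test the created objects must also be released between iterations so that memory pressure does not skew later measurements.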

A.25 Non power of two textures

Purpose: Measure performance when using textures that have non-power-of-two resolutions.

Description: This test case renders 64 non-power-of-two textures. Every 300 frames another 64 textures are rendered to the screen. This test case has a second phase in which we replace the non-power-of-two textures with regular power-of-two textures and then execute the same test procedure on these textures. The purpose of the second phase is to be able to compare the performance of non-power-of-two textures with regular textures.

Measures: FPS

Expected result: Non-power-of-two textures should be supported and perform slightly slower than regular textures.
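When non-power-of-two textures are unsupported or slow, a common workaround is padding the image up to the next power of two, which wastes memory. A sketch of that overhead for a hypothetical 300x300 photo:

```python
def next_pow2(n):
    """Smallest power of two >= n."""
    p = 1
    while p < n:
        p *= 2
    return p

def npot_padding_waste(w, h):
    """Texels wasted when padding a w x h image to power-of-two dimensions."""
    return next_pow2(w) * next_pow2(h) - w * h

# A 300x300 photo padded to 512x512 wastes 172 144 of 262 144 texels.
waste = npot_padding_waste(300, 300)
```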

A.26 Pixel Pipeline Performance

Purpose: The purpose of the pixel pipeline performance test is to find the bottleneck in the pixel pipeline.

Description: This test is quite complex and therefore requires some explanation. As described in the section "Pixel Pipeline Performance", this test case follows figures 5.3 and 5.4 in order to determine the pixel pipeline bottleneck. The test consists of several independent test stages. At each test stage a stage time and an FPS value are measured. After each succeeding test stage we check whether the performance has increased or not, and based on that we choose a path in the test tree described by figures 5.3 and 5.4. If we reach a leaf, the bottleneck is found. The first test stage renders a scene containing six textured meshes (about 10 000 triangles each). If no draw calls are executed with the same scene in the following stage, we can determine whether the bottleneck lies within the graphics pipeline or is limited by the application. This stage will always result in increased performance, because our application processing is very limited. After that, the scene is changed to a single simple triangle in order to find out whether the bottleneck is limited by the rendering or by the buffer swaps. In the succeeding stage we still use the same scene but reduce the viewport to only 8x8 pixels. By doing so it is possible to determine whether the bottleneck lies within the pixel processing or the geometry processing. The following test stage uses the same scene as the first stage but without texturing. If we get an increase in performance, the bottleneck is the texturing, and we try to determine whether it is due to the memory bandwidth or the texture mapping logic by reducing the texture size to 1x1 pixels. If we instead get a decrease in performance, the bottleneck is limited either by frame buffer operations or by the color buffer bandwidth. In order to determine which of these limitations applies, we render the same scene as before but without blending and fragment tests.

Measures: The pixel pipeline bottleneck, and suggestions for improvements to get rid of that bottleneck.

Expected result: The Pixel Pipeline bottleneck.
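The decision tree walked by this test can be approximated in code. The sketch below is a loose interpretation of the description above: the stage names, the 5% "faster" margin, and the exact branch order are our assumptions, and the authoritative tree is the one given by figures 5.3 and 5.4.

```python
def diagnose(fps, margin=1.05):
    """Rough bottleneck classification from per-stage FPS readings.

    Expected keys: 'baseline' (six textured meshes), 'no_draw',
    'small_viewport' (8x8), 'no_texture', 'tiny_texture' (1x1),
    'no_blend' (no blending or fragment tests).
    """
    def faster(a, b):
        # Require a clear margin so measurement noise is not misread
        return fps[a] > fps[b] * margin

    if not faster('no_draw', 'baseline'):
        return 'application-limited'
    if not faster('small_viewport', 'baseline'):
        return 'geometry processing'
    if faster('no_texture', 'baseline'):
        if faster('tiny_texture', 'no_texture'):
            return 'texture memory bandwidth'
        return 'texture mapping logic'
    if faster('no_blend', 'baseline'):
        return 'frame buffer operations'
    return 'color buffer bandwidth'
```

For example, a device that speeds up both without texturing and with a 1x1 texture is classified as limited by texture memory bandwidth.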
