2010:011 HIP BACHELOR'S THESIS

GPU-based real time rendering of 3D-Video

Kennet Johansson

Luleå University of Technology
BSc Programmes in Engineering
BSc programme in Computer Engineering
Department of Skellefteå Campus
Division of Leisure and Entertainment

2010:011 HIP - ISSN: 1404-5494 - ISRN: LTU-HIP-EX--10/011--SE

GPU-based real time rendering of 3D-Video

Kennet Johansson

Supervised by Roger Olsson (MIUN)

Abstract

This thesis was the result of a project to create a rendering pipeline capable of rendering a variable number of disparate views from a V+D (video plus depth) video source for use with a lenticular display. I initially based the work on a thesis written in 2006 by E.I. Verburg at the Technische Universiteit Eindhoven called 'GPU-based Rendering to a Multiview Display', but a lack of implementation details led me to create my own algorithm focusing on multiple render targets. Within I explain the background of the project, 3D video and its formats, and the details of the rendering engine and the algorithm that was developed, with a final discussion on the practical usefulness of the resulting images, which amounts to the algorithm working but being potentially unnecessary due to the rapid increase in GPU processing power.

Sammanfattning

This thesis was the result of a project to create a rendering pipeline capable of rendering a variable number of disparate views from a V+D (video plus depth) video source for use with lenticular displays. I initially based the work on an earlier thesis written by E.I. Verburg at the Technische Universiteit Eindhoven called 'GPU-based Rendering to a Multiview Display', but the lack of implementation details led me to create my own algorithm focusing on the use of multiple render targets. In this document I explain the background of the project, 3D video and its formats, and the details of the rendering pipeline and the developed algorithm, followed by a final discussion of its practical usefulness, which concludes that the algorithm works but is possibly unnecessary due to the rapid increase in graphics processor performance.

Abbreviations and terms

• fps: Frames per second. A measurement of rendering efficiency.
• API: Application programming interface.
• GPU: Graphics processing unit.
• OpenGL: A free multi-platform graphics API. (1.2)
• Shading language: A miniature programming language for writing programs for the GPU.
• GLSL: OpenGL shading language. A shading language developed alongside OpenGL. (1.2.1)
• Vertex shader: A program run on the GPU which primarily manipulates the positions of the vertices of the geometry that make up graphical objects.
• Geometry shader: A program run on the GPU which primarily manipulates the shapes that are created from the vertices and which ultimately are rasterized to produce the fragments.
• Fragment shader: A program run on the GPU which primarily manipulates the so called fragments which ultimately become the output pixels.
• V+D: Video plus depth, a 3D video format (1.1.2.1).
• MRT, MRTs: Multiple render targets. A feature of modern GPUs where multiple colors can be output from the fragment shader at the same time.

Index

1. Introduction
1.1 Background
1.1.1 Realistic 3D
1.1.2 3D-video
1.1.2.1 3D-video formats
1.2 OpenGL and general graphics
1.2.1 GLSL, the GPU and general shader programming
1.3 Purpose and goal
1.3.1 Technical specification
1.4 Related work
2. Methods
2.1 Engine
2.1.2 Rendering
2.1.2.1 Passes
2.1.3 Utilities
2.1.3.1 Content creation/loading
2.1.3.2 Mathematics
2.1.3.3 Other utilities
2.2 View Rendering
2.2.1 Micro-polygon displacement mapping
2.2.2 Multiple render targets
2.2.3 Multiple views with MRTs
2.2.4 Implementation
2.2.4.1 Data
2.2.4.2 Initialization
2.2.4.3 Shaders
2.2.4.4 Geometry shader
2.2.4.5 Rendering
2.2.4.6 Optimizations
2.2.5 Multiple views without MRTs
3. Results and future work
3.1 Visuals
3.2 Performance
3.3 Conclusion
4. Discussion
5. References

1. Introduction

This section contains some basic information on the background of the project (section 1.1) as well as general information about relevant concepts and tools (section 1.2), the technological specifications that I sought to meet (section 1.3) and finally some information on related work (section 1.4).

1.1 Background

This thesis work was done for the Realistic 3D research group at Mid Sweden University over a period of about 8 weeks in the spring of 2010. I received a proposal to write a rendering engine for extrapolating compressed 3D-video in response to my application for thesis work relating to systems and/or graphics programming, which I've made the focus of my studies. The following sections briefly describe the research as well as information about 3D video and 3D video formats.

1.1.1 Realistic 3D

Realistic 3D is a collective term for a number of research projects being conducted at Mid Sweden University. The focus of the research is on the capture, coding, synthesis and uses of auto-stereoscopic 3D-video in entertainment as well as other areas, for example medicine. [1]

1.1.2 3D-video

3D-video means video which, when viewed correctly, gives the impression of depth. This effect is created when the observer's eyes view an object or scene from different angles. Today there are two sets of technologies for creating this effect:

• Stereoscopy: The traditional kind of 3D video utilizing specially designed peripherals that either filter the light shown on the screen in such a way that the eyes see two different views or which are otherwise capable of showing two different video streams.
• Auto-stereoscopy: Display devices that are capable of creating a 3D effect without the use of peripherals.

For plain stereoscopy the alternatives include 3D glasses of different sorts: the familiar, usually red-blue, anaglyphic glasses [2]; newer models which work with the polarization of light or with synchronized shutters that show every other frame to each eye; and less common variants such as headgear with actual screens for each eye. While stereoscopy seems to have been largely accepted in cinema over recent years there has been a desire to avoid having to use glasses for viewing, which has led to the development of auto-stereoscopic displays. While the idea is not new, it is only in recent years that the technology has reached an acceptable level and begun to see some usage. Currently the most used and developed technologies are lenticular displays [3] and parallax barriers [4]. Many types of displays require only two views to create their 3D effect, but some displays may use a larger number of views which are then seen depending on one's position relative to the screen or depending on user input.

1.1.2.1 3D-video formats

The existing 3D-video formats can be roughly divided into two groups: image based and geometry assisted formats (though it is notable that several depth based formats include more than one video source).

Image based formats are specifically created for use with displays using a particular number of views. A subset of these are formats specifically for stereoscopic (two view) displays. These formats include two video streams that can then be used with minimal overhead. There are of course types of stereoscopic displays that only need one video stream (such as anaglyphic images) but including both views as separate images makes the formats more general.

Geometry assisted video formats basically consist of a video stream and a geometry stream (usually in the form of a depth map) which is used to create as many views as necessary. The challenge of this approach is creating views of sufficient quality. The basic idea of these formats is to create new views by distorting the original view (image) based on the geometry. The main reason to use this approach is compatibility: using these formats any number of views can be rendered, making it possible to display them on any type of screen, including stereoscopic and even normal displays (by performing no distortion). This approach will however inevitably cause artifacts if it is taken too far. Most of the more advanced formats were created largely to address this problem.

The most basic depth based format is called V+D, or video plus depth, and it is the one that was to be used in this project, largely due to its simplicity and compression efficiency. Other depth based formats include multi-view plus depth (MVD), which is a combination of multi-view video (MVV, at its simplest multiple video streams) and video plus depth, and layered depth video (LDV), which is like V+D except that some parts of the image are kept in a separate video stream representing objects in the foreground, allowing for de-occlusion as well as occlusion (which is important for creating a realistic 3D effect). [5]

Image 1: Example V+D image.

1.2 OpenGL and general graphics

One goal of the project was to remain platform independent and, as far as possible, based on open source libraries (the latter of which OpenGL is not, though open source implementations exist) and therefore OpenGL was chosen as the graphics API for the project. OpenGL basically works as a state machine that receives commands and primitives and creates pixels at a selected location, which may be linked to a visible window or to another part of memory intended for use by other parts of a program. Originally the API was based on a set of fixed functionality, but OpenGL version 2.0 introduced programmable stages (called shaders) to the pipeline and version 3.0 deprecated much of the fixed functionality as well as officially including a further programmable shader stage, leaving it to the users to create (and recreate) their own functionality. [6] One of the goals of the project was also to investigate possible uses of OpenGL version 4.0, particularly possible uses of the two new shader stages introduced with it, but this goal had to be largely dropped due to lack of time.

1.2.1 GLSL, the GPU and general shader programming

Modern computers tend to have dedicated graphics processing units (GPUs) that aid in rendering graphics. Initially the functionality of these was quite limited, but as time went on more functionality was added to the point where it finally became possible to write custom programs to be run directly on the GPU. At first these programs were written in low level assembly languages, but eventually higher level programming languages were developed to make the process easier, and GLSL was one of those languages. The GPU consists of a set of parallel processors and is intended to perform the same operations on large sets of input values, which is perfect for computer graphics but also useful in other fields, something that is becoming more and more apparent.

Image 2: A simplified representation of the GPU pipeline of OpenGL 3. The geometry shader is an optional stage.

Current hardware provides fixed functionality for a number of basic stages along with three programmable stages at specific parts of the rendering pipeline:

• Vertex processor: Primarily manipulates the positions of the vertices of the geometry that make up graphical objects.
• Geometry processor: Primarily manipulates the shapes that are created from the vertices and which ultimately are rasterized to produce the fragments.
• Fragment processor: Primarily manipulates the so called fragments which ultimately become the output pixels.

Along with these stages, emerging specifications have introduced another two stages related to tessellation (the division of input primitives into suitable structures for rendering). One goal of the project was to investigate the possibilities of these new stages but it had to be dropped due to lack of time.
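To give a concrete feel for these stages, below is a minimal sketch of a GLSL 1.50 vertex and fragment shader pair, written as C string constants as they would appear in an engine source file. The names used here are purely illustrative and are not taken from the project code.

/* Minimal GLSL 1.50 pass-through shaders stored as C strings. The vertex
   shader stage forwards vertex positions, the fragment shader stage decides
   the color of each resulting fragment by sampling a texture. */
static const char *example_vertex_shader =
    "#version 150\n"
    "in vec4 position;\n"
    "in vec2 texcoord;\n"
    "out vec2 v_texcoord;\n"
    "void main() {\n"
    "    v_texcoord = texcoord;   /* pass the texture coordinate on */\n"
    "    gl_Position = position;  /* no transformation in this sketch */\n"
    "}\n";

static const char *example_fragment_shader =
    "#version 150\n"
    "uniform sampler2D color_texture;\n"
    "in vec2 v_texcoord;\n"
    "out vec4 out_color;\n"
    "void main() {\n"
    "    out_color = texture(color_texture, v_texcoord);\n"
    "}\n";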

1.3 Purpose and goal

The primary goal of this project was the creation of a rendering engine capable of rendering a variable number of views from V+D video in real time. It was decided that the engine would be based on a previous thesis on the topic ('GPU-based Rendering to a Multiview Display' by E.I. Verburg [7]). In addition two secondary goals were set, namely:

• Investigating the potential uses of geometry shaders for the algorithm, particularly to potentially take some of the load off the vertex shader, which was listed by previous work as a bottleneck.
• Investigating the potential uses of the newest revisions of OpenGL and GLSL, version 4.0.

The second of the secondary goals had to be dropped due to lack of time, however.

Not included among the goals of the project were:

• Compression and related processing.
• Real time reading of video streams.
• The transfer from the texture atlas to the screen.

1.3.1 Technical specification

All programming was to be in the C programming language for maximum compatibility with previous projects by Realistic 3D. The chosen graphics API was OpenGL version 3.2 and the chosen shading language was GLSL version 1.5. The performance goal for the engine was to be able to render, in real time (25-30 fps), 25 views to a full HD (1920x1080) output atlas on an NVIDIA GeForce 8800 GT graphics card or equivalent. Finally the code should, as far as possible, be based on open source and platform independent code.

1.4 Related work

As written above, the project was to be based on the previous thesis 'GPU-based Rendering to a Multiview Display'. That thesis primarily describes a proposed method for generating multiple disparate views using a technique called micro-polygon displacement mapping, which is essentially displacement mapping with particularly small cells for the grid. Displacement mapping is a technique for creating surface detail on a rendered surface by displacing the vertices that make up the surface according to a height map.

Several texture manipulation algorithms were investigated in 'GPU-based Rendering to a Multiview Display' but it finally came down to a decision between this one and a technique called relief mapping, which is a ray-tracing based technique. The reason why the selection was between these two algorithms was that they shared the property of being able to occlude other parts of the image, that is to say: a higher part of the image (as determined by the height map) can conceal lower parts of the image when seen from an angle. This still leaves one unable to de-occlude parts of the image behind objects in the foreground, but that is a limitation of the V+D format; an algorithm dealing with rendering of LDV (see section 1.1.2.1) or other more advanced formats would itself be different. The reasons why that thesis chose displacement mapping over relief mapping were firstly that relief mapping, aside from its occluding property, also includes self shadowing, which is an unnecessary feature since no lighting is performed when rendering to 3D video. Secondly, when a test was actually created for this type of rendering, it was shown that the algorithm was too resource intensive to be practical.
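For reference, the basic displacement mapping operation can be summarized as moving each vertex along its surface normal by the height map value at its texture coordinate (this is my own summary of the general technique, not a formula taken from Verburg's thesis):

    p' = p + h(u, v) * n

where p is the original vertex position, h(u, v) is the height map value at the vertex's texture coordinate and n is the surface normal. In this project the grid faces the camera, so the displacement is effectively along the z-axis and the height map is the V+D depth map.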

Even with these facts, however, I developed my own algorithm for generating multiple views in one rendering pass, which was desirable for the sake of reducing the number of passes and thus overhead. The reason for this was the lack of details in Verburg's thesis, which led to me having to make my own interpretations of the available information, as described in section 2.2.

2. Methods

The first step was the creation of a basic but flexible rendering engine (section 2.1). This engine was then used to create an easily configurable rendering pipeline for the view creation algorithm (section 2.2, except subsection 2.2.5). This section describes the engine and the view rendering algorithm carried out by the rendering pipeline, as well as Verburg's algorithm, which uses the same basic method but a different approach (section 2.2.5). The engine was, as previously mentioned, programmed in the C programming language using the OpenGL API version 3.2 and GLSL 1.5 for graphics.

2.1 Engine

Since the purpose of the project was to create a pipeline for rendering of 3D-video and not the creation of a general purpose engine, the functionality was kept to a minimum with space open for future expansion. The basis of the rendering part of the engine is the RenderPass structure (further described in section 2.1.2.1) which contains the variables necessary to render one pass. The created passes are intended to be manually gathered in a list, with each pass being run through a rendering function that applies the relevant settings to OpenGL. This is still mostly an organizational tool, however, and many of the practical parts related to setting up custom frame buffers, render targets and such, as well as the deletion of resources, are left to the user. There is however also a set of utilities created to make life simpler for the developer, which mostly concern content creation and loading (detailed in section 2.1.3).

2.1.2 Rendering

As stated above, the engine is largely an organizational tool. Furthermore it is fairly specific in the sort of rendering that it is capable of doing at this time. The rendering part of the engine is organized into one file with one structure (described in the next subsection) and one function which renders one pass (Render()). Rendering a full frame is configured separately, most likely as a list of passes that is iterated through, with each pass being run through the rendering function sequentially (this is the case with the pipeline). The final effect is determined by the user. In the final pipeline, for instance, a number of views are created and stored in textures which are then finally multiplexed into a texture at a specific location that may be reached by the end user for any further manipulation.

2.1.2.1 Passes

The data provided in the render pass structure is of the following specification:

• vertex_array: The vertex array to render.
• render_mode: The desired mode for the render (e.g. GL_TRIANGLE_STRIP).
• num_vertex_elements: The number of elements in the vertex array.
• shader_program: The shader program to use.
• framebuffer: The desired frame-buffer to render to. 0 is the default.
• out_height: The height of the view-port.
• out_width: The width of the view-port.
• extra_info: Extra information formatted according to need. Used in the binding_function.
• binding_function: A function to be called before rendering, used for setting specific variables or making specific updates.

When rendering, the vertex array, frame-buffer and shader program are set; the binding function is called (if there is one) with the extra info as input; and finally the view-port (which is the intended size and offset of the image to be rendered) is set to its specified values and the relevant buffers are cleared. The engine is currently limited in that it only draws vertex array objects, and only one at a time, and the buffers to be cleared are hard coded at this time (as evidenced by the lack of a variable related to it), but these limitations could easily be solved by rewriting the rendering function and adding some fields to the RenderPass structure. The reason that these parts are not already a part of the engine is, as stated above, that the engine was not the main focus of the project.
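A rough sketch of what the structure could look like in C follows; the field names come from the list above, but the types shown are illustrative rather than copied from the engine source.

#include <GL/gl.h>

/* Illustrative sketch of the RenderPass structure described above;
   the field types are approximate, not the exact engine definitions. */
typedef struct RenderPass
{
    GLuint  vertex_array;         /* vertex array object to render          */
    GLenum  render_mode;          /* e.g. GL_TRIANGLE_STRIP                 */
    GLsizei num_vertex_elements;  /* number of elements in the vertex array */
    GLuint  shader_program;       /* shader program to use                  */
    GLuint  framebuffer;          /* frame-buffer to render to, 0 = default */
    int     out_height;           /* view-port height                       */
    int     out_width;            /* view-port width                        */
    void   *extra_info;           /* passed to binding_function             */
    void  (*binding_function)(void *extra_info); /* called before rendering */
} RenderPass;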

2.1.3 Utilities

Aside from the rendering part of the engine, a set of utilities was created. Again, since the engine was not the focus of the project, only a small set of utilities was created. The following subsections list and detail these.

2.1.3.1 Content creation/loading

The content creation utilities included two functions for creating geometry to be used for rendering. The first one, MakeScreenSquare(), was made to simply create a vertex array object containing vertex data for drawing a textured square filling the screen, for use when combining effects or similar. The other function, MakeGrid(), however had the more important role of creating a textured grid of variable size. This function and its use are elaborated upon further in section 2.2.2. As for content loading, the utilities included a function for reading files from disk (ReadFile()) and a set of functions for compiling and linking shader programs (BuildShader() and BuildShaderProgram()).
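As an illustration of what MakeGrid() produces, the following sketch generates the interleaved vertex data for such a grid. The real function additionally builds the index data and the OpenGL vertex array and buffer objects, and its exact signature may differ.

#include <stdlib.h>

/* Simplified sketch: fills a caller-allocated array (4 floats per vertex,
   (cells_x + 1) * (cells_y + 1) vertices) with interleaved position (x, y)
   and texture coordinate (u, v) data for a grid covering the screen in
   normalized device coordinates. Only the vertex layout is shown here. */
static void fill_grid_vertices(float *out, int cells_x, int cells_y)
{
    int x, y, i = 0;
    for (y = 0; y <= cells_y; ++y)
    {
        for (x = 0; x <= cells_x; ++x)
        {
            float u = (float)x / (float)cells_x;
            float v = (float)y / (float)cells_y;
            out[i++] = u * 2.0f - 1.0f;  /* x position in [-1, 1] */
            out[i++] = v * 2.0f - 1.0f;  /* y position in [-1, 1] */
            out[i++] = u;                /* texture coordinate u  */
            out[i++] = v;                /* texture coordinate v  */
        }
    }
}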

One notable utility that was largely ignored for the project was a utility for reading image files into textures. The reason for this was that the intended functionality of the pipeline did not include reading of V+D video streams; this was to be left to other projects. Instead the data to be used for rendering was expected to be bound to specific OpenGL texture locations that the library would then use automatically when rendering. For testing, the FreeImage library [8] was used to load images.

2.1.3.2 Mathematics

The mathematics part of the utilities is concerned with the creation of matrices. The algorithm used requires a shear matrix to create disparity between views and a projection matrix for correction (both described in section 2.2.2.3). Also included is a function for matrix multiplication.
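As an illustration of the shear matrix in question, the following sketch builds a column-major 4x4 matrix that offsets the x coordinate by a factor k times the z coordinate. The function name and layout conventions here are illustrative rather than taken from the engine source.

/* Builds a 4x4 shear matrix (column-major, as used by OpenGL) that
   offsets the x coordinate by k times the z coordinate:
     x' = x + k * z,  y' = y,  z' = z.
   Illustrative sketch; the engine's own matrix functions may differ. */
static void make_shear_x_by_z(float m[16], float k)
{
    int i;
    for (i = 0; i < 16; ++i)
        m[i] = 0.0f;
    m[0]  = 1.0f;  /* column 0: x basis vector           */
    m[5]  = 1.0f;  /* column 1: y basis vector           */
    m[10] = 1.0f;  /* column 2: z basis vector           */
    m[8]  = k;     /* row 0 of column 2: x picks up k*z  */
    m[15] = 1.0f;  /* column 3: w                        */
}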

2.1.3.3 Other utilities

In addition to the already mentioned utilities, the engine includes a simple timer for testing purposes.

2.2 View Rendering

The central part of the project was, as previously stated, to create a pipeline capable of rendering a number of disparate views from a video plus depth (V+D) source. What this practically means is to use the provided depth map to distort the input image and to repeat this process to create the required number of views. There are several ways of doing this, but the one used in this project was a method based on displacement mapping. Initially the work was supposed to be based on the work presented in the MSc thesis 'GPU-based Rendering to a Multiview Display' [7]. However, due to the lack of an explicit description of the techniques used in that work, I adopted a displacement mapping technique of my own design. The following subsections describe my own version and the status of its implementation (sections 2.2.1, 2.2.2 and 2.2.3, with the implementation in 2.2.4) as well as the method presented in the former project and what led me away from it (sections 2.2.1 and 2.2.5, see also section 1.4).

2.2.1 Micro-polygon displacement mapping

The way that micro-polygon displacement mapping (see section 1.4) is used to generate views is by applying the technique to the grid and viewing the result from different angles with virtual cameras. This can be done in multiple ways, but my work uses the fact that when an object is viewed from the side through orthogonal projection (which is the type of projection that was used) the only changes to the vertices are in the x-direction, hence the shear matrix (section 2.1.3.2). This is also central to the distortion part of the algorithm, as detailed in section 2.2.3.

Image 3: Displacement mapping and orthogonal projection. The angle of the projection was, as stated above, actually achieved by shearing the displaced surface in the x-direction by the z value.

2.2.2 Multiple render targets

Multiple render targets, or MRTs, is a feature of modern GPUs that makes it possible to render to multiple targets at the same time. What this means is that it becomes possible to reduce the number of rendering passes needed to perform certain algorithms. This means that less data may need to be sent to the GPU and fewer operations have to be performed in the vertex and geometry shader stages, and although the cost of saving the rendered views increases with the number of views and their complexity, using MRTs can reduce both data overhead and program complexity.

The limitation of MRTs is that it is an entirely fragment shader based feature. When creating the views, the fragments for each location in every render target are based on the same geometry and transformations as the fragments in the same locations of all the other render targets for that pass. In other words: you cannot have different vertex transformations for different views within the same pass and thus cannot change the angle between virtual cameras as required by the algorithm. This fact is central to my implementation (as described below) and the main reason that I diverged from the thesis [7] on which I intended to base the pipeline, as explained in section 2.2.5.
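For reference, the OpenGL side of MRT rendering looks roughly like the following sketch: color textures are attached to a frame buffer object and glDrawBuffers() selects which attachments the fragment shader outputs are written to. The identifiers are illustrative, the objects are assumed to be created elsewhere, and the pipeline itself attaches texture array layers instead (see section 2.2.4.2).

#include <GL/gl.h>

/* Rough sketch: attaches two color textures to a frame buffer object and
   selects both as active draw buffers so that a fragment shader with two
   outputs writes one view to each. Assumes an OpenGL 3.2 context with the
   core entry points loaded and fbo/view_tex0/view_tex1 created elsewhere. */
static void bind_two_render_targets(GLuint fbo, GLuint view_tex0, GLuint view_tex1)
{
    static const GLenum buffers[2] = { GL_COLOR_ATTACHMENT0, GL_COLOR_ATTACHMENT1 };

    glBindFramebuffer(GL_FRAMEBUFFER, fbo);
    glFramebufferTexture2D(GL_FRAMEBUFFER, GL_COLOR_ATTACHMENT0,
                           GL_TEXTURE_2D, view_tex0, 0);
    glFramebufferTexture2D(GL_FRAMEBUFFER, GL_COLOR_ATTACHMENT1,
                           GL_TEXTURE_2D, view_tex1, 0);
    glDrawBuffers(2, buffers);  /* shader output 0 -> view_tex0, output 1 -> view_tex1 */
}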

2.2.3 Multiple views with MRTs

The method that I thought up to solve this problem using MRTs (which, as previously stated, cannot directly change the angle between multiple views) was to distort the texture within the fragment shader, based on the initial distortion by the vertex shader (displacement mapping and virtual camera transform), by "pulling" the texture by an amount determined by the fragment's height, in one direction or the other depending on whether the view is to the left or the right of the initial view.
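Expressed compactly, the texture coordinate used for the i:th view generated in a pass becomes

    t_i = t + (m * i, 0)

where t is the original texture coordinate, i is the view index within the pass and m is a modification value computed from the fragment's depth, the provided disparity value and a scaling constant, with the direction of the pull depending on whether the pass renders views to the left or to the right of the original view (see section 2.2.4.3).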

2.2.4 Implementation

This section describes the rendering pipeline created around the algorithm described above. A degree of knowledge in graphics programming is expected in sections 2.2.4.1 – 2 and is helpful in the rest of the section as well.

2.2.4.1 Data

The data of the rendering pipeline is stored in two structures, ViewRenderInfo and ViewRenderData. ViewRenderInfo keeps track of the following variables:

• grid_size_x, y: The number of cells in the triangle grid.
• input_size_x, y: The size of the views after rendering; in other words, the sizes of the render targets.
• output_size_x, y: The size of the output atlas.
• view_size_x, y: The size of each view as organized on the output atlas.
• num_views: The number of views to be rendered (25 as the real time target).
• num_views_x, y: The number of views in each direction on the output atlas.
• num_views_per_side: The number of views to be created for each side.
• view_disparity: A value representing the amount of change between each view.
• max_draw_buffers: The maximum number of targets per render pass.

ViewRenderData contains a list of RenderPass structures and a variable representing the number of passes. In addition to the above it contains two binding functions (one for the view rendering passes and one for the output combiner) and one extra_info structure (for the view renderers, containing the locations and data for the matrix and the number of views used by the shaders), as described in section 2.1.2.1. The binding functions set the variable data of the shaders and make sure that the correct textures are bound to the correct locations. In addition, a number of variables are kept track of for cleanup purposes.
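Expressed as a C structure, ViewRenderInfo contains roughly the following fields; only the names come from the list above, while the types shown are my assumptions.

/* Illustrative sketch of ViewRenderInfo based on the list above;
   the field types are assumptions, not the exact engine definitions. */
typedef struct ViewRenderInfo
{
    int   grid_size_x, grid_size_y;      /* cells in the triangle grid       */
    int   input_size_x, input_size_y;    /* size of each rendered view       */
    int   output_size_x, output_size_y;  /* size of the output atlas         */
    int   view_size_x, view_size_y;      /* size of each view on the atlas   */
    int   num_views;                     /* total number of views (e.g. 25)  */
    int   num_views_x, num_views_y;      /* views per direction on the atlas */
    int   num_views_per_side;            /* views created for each side      */
    float view_disparity;                /* amount of change between views   */
    int   max_draw_buffers;              /* maximum targets per render pass  */
} ViewRenderInfo;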

2.2.4.2 Initialization

To prepare the pipeline for rendering you call the Init() function, which takes grid_size, input_size, output_size, num_views, view_disparity and max_draw_buffers as inputs. The function then computes a number of other variables and enables some settings:

• Depth testing is enabled, being required to create occlusion.
• num_views_x, y are calculated according to the relation between the output_size and the input_size.
• view_size is calculated according to the relation between num_views_x, y and the output_size.
• max_draw_buffers is set to the smaller of the value provided by the user and the maximum queried through OpenGL (the maximum for current hardware is 8; the program has provisions for up to 16).
• The vertex array objects and vertex buffer objects are filled by the MakeGrid() and MakeScreenSquare() utility functions.
• The number of render passes per side is calculated by the formula: ceil((num_views / 2) / max_draw_buffers).

This information is then used to set up the basic ViewRenderData structure with "the number of passes per side" * 2 + 1 passes (the view generating passes plus the combiner pass).

Next the textures are initialized. The pipeline creates three textures: one ordinary 2D texture (of size output_size_x * y) to receive the output and two 2D texture arrays (of size input_size_x * y with num_views_per_side layers) to hold the rendered views.

After this the frame buffers are set up. The number of frame buffers is the same as the number of render passes. Each of the frame buffers associated with the view rendering has a render buffer of the same size as the views attached to its depth component in order to be able to calculate occlusion (note that it may have been possible to have only a single render buffer, though the current implementation has one per view pass). Each frame buffer then has the relevant texture layers attached. Interestingly, different layers of 2D texture arrays can be rendered to at the same time without problems. The views are divided among the passes so that each pass renders the smallest possible number of views.

Next the shaders are compiled and linked and their relevant variables are set. The pipeline creates three shader programs out of four shaders, the specifics of which are detailed in the next section.

Finally the created data is put into the generated render passes. For each pass a matrix is created out of a shearing matrix, defined as shearing the x-coordinate based on the z-coordinate by an amount defined by adding up the view_disparity multiplied by the number of views for each pass, and an orthogonal projection matrix with one side pulled in, in order to remove an artifact caused by moving the vertices at the edge of the view-port (see Image 4).

Image 4: Above: distorted. Below: original.
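The layer attachment mentioned above can be done with glFramebufferTextureLayer(). Below is a rough sketch of the setup for one view rendering pass, with illustrative names and no error checking; the frame buffer, texture array and depth render buffer are assumed to have been created and sized beforehand.

#include <GL/gl.h>

/* Rough sketch of the frame buffer setup for one view rendering pass:
   a number of layers of a 2D texture array become the color attachments
   and a render buffer provides the depth component needed for occlusion. */
static void attach_view_layers(GLuint fbo, GLuint view_array, GLuint depth_rb,
                               int first_layer, int layer_count)
{
    int i;
    glBindFramebuffer(GL_FRAMEBUFFER, fbo);
    for (i = 0; i < layer_count; ++i)
        glFramebufferTextureLayer(GL_FRAMEBUFFER, GL_COLOR_ATTACHMENT0 + i,
                                  view_array, 0, first_layer + i);
    glFramebufferRenderbuffer(GL_FRAMEBUFFER, GL_DEPTH_ATTACHMENT,
                              GL_RENDERBUFFER, depth_rb);
}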

2.2.4.3 Shaders

As previously stated, the pipeline creates three shader programs out of four shaders. The shader programs include one program for each side of the view renderers and one for the output. The two view renderer programs are, however, made from the same shaders, just with different uniform variables set at creation, rather than using the same program for both sides and setting a larger number of variables each frame. Both programs are made up of one vertex shader and one fragment shader.

• The view rendering vertex shader is quite simple. It applies the local value of the height map onto the current vertex through a texture fetch and then multiplies the displaced vertex by the matrix, sending the new position and the texture coordinate on to the next stage.
• The fragment shader begins by getting the height value of the local fragment through the gl_FragCoord variable and uses this along with the provided disparity value (which is the same one used for calculating the shear matrix; note also that it has been suggested that "disparity" may have been a bad choice of name) and a constant to calculate the distortion value. The reason why another value (the constant) is introduced is that the distortion caused by using the raw values by themselves is too large. The distortion value is then used along with the original texture coordinate to create a modification value to be applied to the distorted views. Finally the program iterates through the specified number of views, setting the output color for the i:th render target to the sample pointed to by the modified texture coordinate, which is defined as (texcoord + vec2(mod * i, 0)). This is the central part of the developed rendering algorithm.

The output shader program that creates the atlas is also simple. The vertex shader passes the entered values on to the next stage and the fragment shader calculates the current view and the position within that view based on the entered parameters (num_views_x, y and view_size_x, y).
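To make the central loop concrete, the following is my reconstruction of what the view rendering fragment shader could look like in GLSL 1.50, written as a C string constant. Only the gl_FragCoord read and the per-view texture offset (texcoord + vec2(mod * i, 0)) come from the description above; the names and the exact expression for the modification value are illustrative.

/* Reconstruction of the core of the view rendering fragment shader as
   described above. Variable names and the exact calculation of mod_value
   are illustrative; only the per-view offset is taken from the text. */
static const char *view_fragment_shader =
    "#version 150\n"
    "#define NUM_TARGETS 4\n"
    "uniform sampler2D color_texture;\n"
    "uniform float disparity;\n"
    "uniform float scale_constant;\n"
    "in vec2 texcoord;\n"
    "out vec4 views[NUM_TARGETS];\n"
    "void main() {\n"
    "    float height = gl_FragCoord.z;  /* depth of this fragment */\n"
    "    float mod_value = height * disparity * scale_constant;\n"
    "    for (int i = 0; i < NUM_TARGETS; ++i) {\n"
    "        /* pull the texture further for each successive view */\n"
    "        views[i] = texture(color_texture, texcoord + vec2(mod_value * float(i), 0.0));\n"
    "    }\n"
    "}\n";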

2.2.4.4 Geometry shader

One of the goals stated in the introduction was the investigation of whether a geometry shader could be used to take some of the load off the vertex shader, either by subdividing the triangles of the grid or by separating the reading of the height values from the matrix multiplication, but the first turned out to be too resource intensive (also stated as unsuitable by other sources [9]) and the second was less efficient than leaving all the computation in the vertex shader.

2.2.4.5 Rendering

Rendering one frame is performed by calling the RenderViews() function, which calls the engine's Render() function for each of the RenderPass structures in sequence, as described in section 2.1.2.

2.2.4.6 Optimizations

With the large number of render passes that may occur, due to the fact that the best result is achieved by having the smallest possible number of views per pass (as further described in section 3.1), it turns out that the view creating fragment shader is the primary bottleneck (though it is true that a sufficiently large grid will cause slowdown as well).

One important property was the fact that when multiplexing the views to fit a lenticular screen, not all pixels of a full-size view are used. Because of this the output atlas is defined as being the same size as the intended screen and the size of the views is calculated to fit that size as well as possible (as described in section 2.2.4.2). Consequently the output size of the views could be defined as smaller than the original views, since they would be down sampled anyway. This led to vastly improved frame rates.

Another important optimization had to do with the MRTs. Originally the views to be rendered by the fragment shader were declared as an array of vec4 with a size defined as the maximum number of draw buffers (which can be reached as a constant defined in GLSL). However, when changing this to a constant value defined in the shader it became possible to reduce the data output even further, since the GPU believes that it should output the defined number of colors regardless of the number of surfaces that are bound to the program. This means that there have to be multiple fragment shaders loaded depending on the maximum number of render targets (a feature not yet fully implemented), but the performance gains outweigh the trouble.
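In GLSL terms the change amounts to sizing the output array with a shader-specific constant instead of the built-in gl_MaxDrawBuffers, roughly along these lines (the variable name is illustrative):

/* Before: the output array is sized by the GLSL built-in gl_MaxDrawBuffers,
   so the GPU writes the maximum number of colors regardless of how many
   render targets are actually bound. */
static const char *output_decl_before = "out vec4 views[gl_MaxDrawBuffers];";

/* After: the array size is a constant baked into each shader variant,
   matching the actual number of render targets for that pass. */
static const char *output_decl_after  = "#define NUM_TARGETS 4\n"
                                        "out vec4 views[NUM_TARGETS];";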

2.2.5 Multiple views without MRTs

As previously stated, 'GPU-based Rendering to a Multiview Display' is very light on the details of its implementation. What is clear is essentially that the technique used to create disparity is the above described micro-polygon displacement mapping. What confused me about it was primarily four things:

1. The paper contains an excessive amount of information relating to basic rendering concepts, many of which are never used.
2. (Related to the above) Despite most likely not using multiple render targets (described in detail in section 2.2.2) to any greater extent, great mention is made of the feature, even stating that the engine depends on it, without ever saying how it is used.
3. There are multiple sections that are cryptic, offering solutions to problems that are never adequately explained (see for instance sections 3.7.2, 4.1.3 on page 61 and particularly 4.1.4 of 'GPU-based Rendering to a Multiview Display').

4. The bottleneck of the application is said to be the vertex processing, despite this not being a large problem in my implementation even though it renders more than double the number of views at a supposedly higher grid resolution (in fact the bottleneck of my implementation is primarily the fragment shader, more on this in section 2.2.4.6).

These points led me to think that the solution used multiple render targets in order to cut down on the number of passes needed, which posed a number of problems, largely in that it would probably cause a good amount of artifacts (which it seemed that some of the cryptic pages might have been talking about). This became the basis of my implementation as described in sections 2.2.3 and 2.2.4, but eventually, as the fourth point became apparent, I formulated another theory for how it was done. Be aware that this is only a theory with no testing to back it; however, two sections (section 4.1.4 of 'GPU-based Rendering to a Multiview Display' and section V of 'Multiview Rendering on a Graphics Processing Unit', which is a shorter report on the same work, also written by Verburg [10]) suggest that this is how it was done: rather than using MRTs, which as previously stated cannot make vertices differ between views, Verburg's method actually creates a grid that makes four views at a time (once per side) by dividing the grid into four parts, altering the height of the vertices in each of the four corners by different amounts and then shearing them all by the same matrix. This neatly explains problems 3 and 4, since the cryptic sections mentioned above indeed solve certain problems with this approach, and the bottleneck being vertex processing is explained by there being four times the vertices for each view.

3. Results and future work

In this section I present the results of this project, firstly as pure visuals and comparisons between views generated with the described method and views generated through brute force, secondly as a performance analysis and finally as an appraisal of the actual usefulness of the technique. Sections 2.1 – 2.2.4.6 describe the code written for the project.

3.1 Visuals

Image 5: The test V+D image used for this section.

Image 6: Output atlas created using two views per pass (max 2 draw buffers). The lower leftmost image is the one furthest to the left and the upper rightmost is the one furthest to the right. The middle one is the original.

The composition of the grid is based on the selected size of the output and the number of views and aims to maintain the same aspect ratio as far as possible.

Image 7: Detail of images 6-7 from the atlas above. The left one is the original rendering and the right one is created in the same pass with the technique described in section 2.2.4.3.

Image 8: Detail of the same views on an atlas created using a brute force technique (essentially setting the maximum number of render targets as one, causing all views to be rendered purely using matrices).

Image 9: Detail of the same views on an atlas using a different V+D image. Notice the more visible artifacts around the text on the right image.

3.2 Performance

For performance tests I ran the pipeline using the following configurations:

• 384x216 grid, 384x216 output view
• 384x216 grid, 960x540 output view
• 960x540 grid, 384x216 output view
• 960x540 grid, 960x540 output view

Each configuration was run with 1, 2 and 4 maximum views per pass. The sizes are based on the size of the input texture and on its dimensions divided by the number of views in each direction of the atlas. Each of the tests rendered 25 views.

Graphs 1-4: Frame rate in frames per second plotted against the number of views per pass (1, 2 and 4). Panel configurations: Grid 384x216 / View 384x216, Grid 384x216 / View 960x540, Grid 960x540 / View 384x216 and Grid 960x540 / View 960x540.

As can be seen from the graphs (most particularly graph 3), the developed algorithm is the most effective with large triangle grids. Since the size of the views is bound by the output atlas, the way to increase the detail of the rendering is to increase the size of the grid. However, when testing an optimized brute force algorithm it was revealed that an acceptable frame rate could be reached with only about a 70% reduction of the grid size, and the resulting artifacts of this reduction are not very noticeable.

Image 10: Left: Grid size 960x540. Right: Halved grid size. Artifacts are most noticeable around the text and lamp.

The performance results of the brute force algorithm were as follows:

Graph 5: Brute force performance in frames per second for the configurations G 384x216 / V 960x540, G 960x540 / V 960x540, G 384x216 / V 384x216 and G 960x540 / V 384x216 (G = grid size, V = view size).

These results suggest (comparing graph 3 with column 3 of graph 4) that my algorithm is only more efficient than the brute force algorithm when large grids are considered.

3.3 Conclusion

It has become clear that while my algorithm can create acceptable visual results, it may be unnecessary due to the rapid increase in GPU processing power, which has made it possible to render sufficient amounts of data to produce acceptable results even without much algorithmic overhead. It is written in Verburg's paper that the reduced size of the depth map did not reduce quality as much as might be expected, and in fact it mentions algorithms for reducing these errors as possible future developments. My algorithm was designed largely out of a desire to natively use a larger grid resolution, which indeed it does more efficiently than the brute force alternative. The conclusion is thus that it is better to consider possible solutions using a more direct approach that deals with viewing artifacts in a less direct fashion. On a lighter note, however, this is proof that today's hardware is ready for the challenge of handling real time 3D video.

4. Discussion

One of the biggest problems that I had with this project was essentially that I had a shorter than intended amount of time to work on it due to various things. The "about 8 weeks" that I wrote of in my abstract is a generous figure. Furthermore I had never worked with OpenGL version 3 before, although I had worked a good amount with version 2 and they are honestly not so different. Even so, a few days went by while I configured my environment and got used to the changes (which I like, by the way). The troubles that led to the creation of the whole texture distortion algorithm were, as mentioned, initially the cryptic nature of Verburg's paper. I was actually able to contact him before I figured out how he had probably actually done it, but I never got an answer from him (as expected since he was under an NDA; still, it would have been courteous to at least tell me how off I was). I'll also note that while I say "A degree of knowledge in graphics programming is expected in sections 2.2.4.1 – 2 and is helpful in the rest of the section as well." in section 2.2.4, the truth is that a degree of programming knowledge is useful throughout, although I've tried to make it as accessible as possible. Finally, although I may sound somewhat self-deprecating about the final result, I am quite proud of the engine and the pipeline. The engine, I feel, was well designed with some potential for expansion, and the algorithm did work and gave results that in most cases were sufficient (although no replacement for rendering each view, now that that has been shown to be possible).

5. References

[1] http://www.miun.se/sr/Research/Realistic-3D/
[2] http://en.wikipedia.org/wiki/Anaglyphic
[3] http://en.wikipedia.org/wiki/Lenticular_lens
[4] http://en.wikipedia.org/wiki/Parallax_barrier
[5] Aljoscha Smolic, Karsten Mueller, Philipp Merkle, Peter Kauff, Thomas Wiegand, "An overview of available and emerging 3D video formats and depth enhanced stereo as efficient generic solution", Fraunhofer Institute for Telecommunications - Heinrich-Hertz-Institut, Berlin, Germany
[6] The Khronos Group Inc., OpenGL core specifications, http://www.opengl.org/registry/
[7] Verburg, E.I., "GPU-based Rendering to a Multiview Display", MSc Thesis, Technische Universiteit Eindhoven, 2006
[8] http://freeimage.sourceforge.net/
[9] NVIDIA Corporation, "GPU Programming Guide, GeForce 8 and 9 Series", December 19, 2008
[10] Edgar Verburg, Guido T. G. Volleberg, "Multiview Rendering on a Graphics Processing Unit", Signal Processing Group, Philips Applied Technologies
