
LiU-ITN-TEK-A--17/007--SE

Development and Research in Previsualization for Advanced Live-Action on CGI Film Recording
Aron Tornberg, Sofia Wennström

2017-02-27

Department of Science and Technology, Linköping University, SE-601 74 Norrköping, Sweden

Development and Research in Previsualization for Advanced Live-Action on CGI Film Recording. Master's thesis in Media Technology, carried out at the Institute of Technology at Linköping University. Aron Tornberg, Sofia Wennström

Supervisor: Joel Kronander. Examiner: Jonas Unger

Norrköping 2017-02-27

Copyright

The publishers will keep this document online on the Internet - or its possible replacement - for a considerable time from the date of publication barring exceptional circumstances. The online availability of the document implies a permanent permission for anyone to read, to download, to print out single copies for your own use and to use it unchanged for any non-commercial research and educational purpose. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional on the consent of the copyright owner. The publisher has taken technical and administrative measures to assure authenticity, security and accessibility. According to intellectual property law the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement. For additional information about the Linköping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its WWW home page: http://www.ep.liu.se/

© Aron Tornberg, Sofia Wennström LINKÖPING UNIVERSITY

Development and Research in Previsualization for Advanced Live-Action on CGI Film Recording

Master of Science in Engineering Department of Science and Technology

Master’s Thesis

February 26, 2017

Authors: Aron Tornberg, Sofia Wennström
Examiner: Jonas Unger
Supervisor: Joel Kronander

Abstract

This report documents the theory, work and results of a master's thesis in Media Technology at Linköping University. The aim of the thesis is to come up with solutions for improving the film studio Stiller Studios's previsualization system. This involves a review and integration of game engines for previsualization in a motion control green screen studio, a camera calibration process with blur detection and automatic selection of images, as well as research into camera tracking and depth detection. The implementation and research done are based on literature within the computer graphics and computer vision fields as well as discussions with the Stiller Studios employees. The work also includes a robust camera simulation for testing of camera calibration methods using virtual images, capable of modeling the inverse of Brown's distortion model, something largely unexplored in existing literature. The visual quality of the previsualization was substantially improved, as was Stiller Studios's camera calibration process. The work concludes that the CGI filmmaking industry is developing rapidly, leading to discussions about alternative solutions and also the importance of modularity.

Contents

Abstract

1 Introduction
   1.1 Background
   1.2 Purpose
   1.3 Limitations
   1.4 Typographic conventions
   1.5 Planning

2 Existing system
   2.1 Film studio setup
      2.1.1 Pipeline
      2.1.2 Shooting process
      2.1.3 Flair
      2.1.4 DeckLink
   2.2 Camera calibration
   2.3 Problem definition
      2.3.1 Low quality rendering
      2.3.2 Slow and imprecise camera calibration
      2.3.3 Motion control safety and errors
         Safety
         Mechanical errors
      2.3.4 Rudimentary compositing
      2.3.5 Suggested improvements

3 Theory
   3.1 Rendering
      3.1.1 Offline rendering
         Ray tracing
      3.1.2 Real-time rendering
         Rasterization
         Real-time global illumination
         Real-time shadows
         Forward and deferred rendering
   3.2 Camera fundamentals
      3.2.1 Pinhole camera
      3.2.2 Real camera
         Diffraction
         Proportions
         Lens distortion
         Aperture
         Exposure
         Center of projection and angle of view
         Zoom and prime lenses
   3.3 Camera model
      3.3.1 Other camera models
      3.3.2 Modelling blur
   3.4 Geometric camera calibration
      3.4.1 Zhang's method
      3.4.2 Reprojection error
      3.4.3 Camera tracking
      3.4.4 Control point detection
         Corner detection
         Center of squares
         Circles
      3.4.5 Precalculation of lens distortion
      3.4.6 Iterative refinement of control points
      3.4.7 Fiducial markers
      3.4.8 Image selection
      3.4.9 Blur detection
   3.5 Depth detection and shape reconstruction

4 Method
   4.1 Previsualization using game engines
      4.1.1 CryENGINE + Film Engine
      4.1.2 Ogre3D + MotionBuilder
      4.1.3 Stingray
      4.1.4 Unity 5
      4.1.5 Unreal Engine 4
      4.1.6 Compilation
         Graphics
         Development
         Usability
      4.1.7 Previsualization tool implementation and selection
         Unreal Engine implementation
         Film Engine implementation
   4.2 Improved camera calibration
      4.2.1 OpenCV
      4.2.2 Previzion
      4.2.3 A new camera calibration process
      4.2.4 Conditions
      4.2.5 Calibration process
      4.2.6 Calibration
      4.2.7 Calibration pattern design and marker detection
         ArUco pattern
         Concentric circle pattern
      4.2.8 Image selection
      4.2.9 Iterative refinement of control points
      4.2.10 Calculating extrinsic parameters relative to the robot's coordinates
         Offline calibration
   4.3 Camera tracking solution review
      4.3.1 Ncam
      4.3.2 Trackmen
      4.3.3 Mo-Sys
   4.4 Commercial multi-camera systems
      4.4.1 Markerless systems
      4.4.2 Stereo cameras
      4.4.3 OptiTrack
   4.5 Camera calibration simulation
      4.5.1 The camera matrix
      4.5.2 Inverse lens distortion
      4.5.3 Blur and noise

5 Results
   5.1 Previsualization using graphics engine
   5.2 Camera calibration
   5.3 Image selection

6 Discussion and future work
   6.1 Graphics engine integration
   6.2 Modular architecture
   6.3 Camera tracking
   6.4 Compositing
   6.5 Object tracking
   6.6 Augmented reality
   6.7 Camera calibration
   6.8 Using calibration data

7 Conclusions

Appendix A Inverse distortion

Chapter 1

Introduction

Purely practical movie production is often limited by time, location, money and the possibilities of practical effects. Building sets, costumes and animatronics takes time. Certain locations are not accessible for shooting when needed, or at all. Large crowds of people can be hard to come by. All of these factors can also be prohibitively expensive, and some things simply cannot be done in a satisfying manner using practical effects.

Computer generated images are in many cases indistinguishable from reality to the human eye and are used more and more in films and advertising. Most of the photos in the Ikea catalog are computer generated. However, in many cases realistic computer generated images are hard and expensive to produce and often fall short in realism (especially when dealing with human beings). The best of both worlds can be achieved by mixing realities and shooting live action on CGI using green screens. This allows scenes that would be too expensive or even impossible to film in full live action, while still getting the realism of real human actors.

A major disadvantage of shooting on green screen with CGI is that the director cannot see or interact with the virtual elements in the scene and thus does not get a realistic view of what the end result will look like, and is also not able to make changes as necessary by moving objects and actors around during the shooting session. This disadvantage can be mitigated by the use of previsualization, where the director is given a rough take of what the final cut will look like by combining the filmed material with the virtual environment in real-time. To achieve this, a number of problems should be solved, in descending order of importance:

• At minimum, a solution for compositing the camera feed with a rendering of the virtual scene, placing actors and props in the virtual environment, is needed.

• To allow camera movement, the parameters of the camera in the virtual scene should match those of the real camera.

• To allow renderings with dynamic scenes and more advanced camera movements, a real-time rendering of the virtual scene is needed.

• To allow the director to make corrections between takes the previsualization tool should allow for easy and fast changes within the virtual scene.

Photorealistic rendering has long been the domain of offline rendering alone. Advances in computer hardware and rendering algorithms have improved, and are still improving, real-time rendering massively. Several game engine developers are starting to take advantage of this fact to market themselves as tools for filmmaking, and several short films have been made to demonstrate these capabilities [1][2].


1.1 Background

Stiller Studios at Lidingö is one of the world's most technologically advanced film studios. Instead of letting a film crew travel around the world recording different environments, everything is set up and recorded in a single location. This is done with several digital tools which cooperate in a green screen studio to create a scene that does not exist in reality.

What makes Stiller Studios stand out is the film material delivered to the customer. Stiller Studios markets itself as the only green screen studio in the world that generates perfectly matched foreground and background clips without first having processed them, something that is extremely time-saving. The company is especially specialized in previsualization, which is used to give an idea of the final result before post-processing.

The existing previsualization tool could not handle light, reflections and other visual phenomena required for realistic renderings. Moreover, there was development potential in terms of including camera calibration and depth detection. Therefore two students from the M.Sc. program in Media Technology at Linköping University were given these research areas to examine and develop as part of their thesis work.

1.2 Purpose

The thesis work is meant to address the previsualization problems and wishes of Stiller Studios, in collaboration with a supervisor and examiner at Linköping University. The following questions are designed to correspond to the assignments of the work:

• How can Stiller Studios’s previsualization be improved?

• Which methods are suitable for improving the previsualization?

• What is the intended outcome of improving the previsualization?

1.3 Limitations

Since the work corresponds to full time studies for two students during one semester, it must be limited in terms of capabilities and performance. The hardware and software solutions therefore depend on what is considered most appropriate for the studio structure within the timeframe available.

1.4 Typographic conventions

The following typographic conventions are used in this report:

• Italic text refers to a variable in a mathematical expression.

• Italic and bold text refers to the first presentation of a product or organization.

1.5 Planning

At start-up, the students had little or no experience of the various tools Stiller Studios planned to integrate with the previsualization system. Therefore, a preparatory evaluation of the tools was conducted, followed by an implementation of those that seemed most suitable for the task. The preparatory work also included a summary of the company pipeline and workflow. In addition to this, further developments, such as camera calibration and depth detection, were included in the planning schedule.

Chapter 2

Existing system

2.1 Film studio setup

FIGURE 2.1: Virtual view of Stiller Studios' green screen studio, showing the cyclops.

Stiller Studios's previsualization system is based on several components that interact. The first step consists of a 3D view of the scene, set up according to customer requirements. With the help of a plugin built for the animation software program Maya, it is possible to digitally match that scene with the actual green screen studio. When the matching is considered complete (all elements are positioned so that the upcoming shooting is considered physically feasible), real camera data is connected to the virtual one. The camera in the studio is attached to a motion control robot known as the cyclops (figure 2.1) and is set to the correct position according to the scene in Maya. The software that controls the cyclops is called Flair and does not communicate with Maya, but with the real-time engine MotionBuilder. Data is therefore sent from Maya via MotionBuilder to the cyclops in order to adjust the camera position for the scene. The final image is then given by a combination of foreground and background images in QTAKE, an advanced video system with integrated assistance for keying (removing the green color from the green screen).

Foreground images are given by the camera and background images by MotionBuilder. This process is shown in figure 2.2.

FIGURE 2.2: The process of CGI combined with live action camera data.

MotionBuilder runs in two instances, one that provides compressed image data to QTAKE for real-time rendering and one that renders high-resolution images for post-processing. The shooting data is saved in the database software FileMaker.

2.1.1 Pipeline

The Stiller Studios pipeline is written in Python and is based on a specific folder structure. It would be possible to rewrite the pipeline in C++ to speed it up, but it is not a priority or even necessary for the studio at the moment.

The pipeline server runs FileMaker and waits for REST commands that send and receive database information. The commands are given by the user via a Python command-line interface. Each command has its own Python file, containing the command functionality.

For each project in FileMaker there are several film clips. Every project has a unique ID. The clips are divided into locations with different scenes. Under each scene there are different shots and under each shot there are assets (video elements). Assets are shots of the same scene but from different angles. All shots are saved in a table where a row is created for each take. This can be rated so that the customer afterwards can check whether a shot went well or not.

The command-line interface is based on an API package, which has the same structure as FileMaker does (film→scene→shot→asset).

The current pipeline uses UDP for communication with all of the system's different applications and computers. The advantage of UDP is that it is very easy and fast. The disadvantage is that messages are not guaranteed to arrive at all, or in the correct order, which may become a problem, especially when it comes to getting precise movement data from a camera robot. According to Stiller Studios's software developers this should not be a problem, partly because the data is sent locally and partly because new data is sent continuously, and Stiller Studios has not experienced it working poorly.
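As a hedged illustration of the structure described above (and not Stiller Studios' actual code), a single command file in such a Python command-line interface could send its message over UDP roughly like this; the host, port, command name and payload format are all assumptions made for the sketch:

    # Hypothetical sketch of one pipeline command sending a message over UDP.
    # Host, port and message format are assumptions for illustration only.
    import json
    import socket

    STUDIO_HOST, STUDIO_PORT = "192.168.0.10", 9000   # placeholder address

    def prepare(project_id, scene, shot):
        """Send a 'prepare' message telling the studio hardware what to load."""
        message = json.dumps({
            "command": "prepare",
            "project": project_id,      # film -> scene -> shot -> asset hierarchy
            "scene": scene,
            "shot": shot,
        }).encode("utf-8")
        sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        # UDP is connectionless: the datagram is sent without delivery guarantees.
        sock.sendto(message, (STUDIO_HOST, STUDIO_PORT))
        sock.close()

    if __name__ == "__main__":
        prepare(project_id="P001", scene="city", shot="shot_010")

The sketch also illustrates the UDP trade-off discussed above: sending is a single, very cheap call, but nothing confirms that the datagram arrived or arrived in order.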

2.1.2 Shooting process

Before the studio starts recording it must be prepared. This is done via a prepare REST command. The command sends out information so that all hardware components for the shooting know what to do. If any hardware is not working, the user must be informed. This is something that is currently under development.

A Raspberry Pi (credit card-sized computer) works as the leader of the studio setup. It is attached to a RED camera that, via a serial port, tells it when a recording starts. The Raspberry Pi receives and controls the sent commands and then forwards the message to all other hardware involved (computers running 3D rendering and the camera/cyclops). When a prepare command is sent, the Raspberry Pi tells the computers which virtual world to load and the cyclops that camera data is needed. Additional information from the Raspberry Pi is sent when the camera starts recording, such as animations that should be triggered in MotionBuilder.

Once a scene has been recorded, the camera magazine gets copied by inserting it into a computer that scans the data and places it in the correct folder according to the project structure explained in section 2.1.1.

2.1.3 Flair

Motion control data from the camera can be streamed from Flair via UDP, TCP or serial ports. It is possible to manually set the format in which the data should be sent in Flair.

The different modes for streamed data tested in Flair are called XYZ, Axis and MotionBuilder. XYZ and MotionBuilder contain values of the data type float that represent the camera's position, target, roll, zoom and focus. Axis sends the actual axis values of the robot.

The MotionBuilder mode was written by the developers of Flair at the request of Stiller Studios and differs from the XYZ mode by also sending time data with the packages [3]. This allows MotionBuilder to know not only the position of the camera, but also the point in time to which the camera has moved, and it can therefore adjust the timeline in MotionBuilder. This is important for scenes containing animation.

From the position of the camera and target, it is trivial to figure out a direction vector. With this direction vector together with the roll it is possible to calculate the orientation of the camera.
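A minimal sketch of this last step, assuming a right-handed coordinate system with +Z as the world up axis and the roll given in degrees; these conventions, and the sign of the roll, would have to be matched to Flair's actual output:

    # Sketch: camera orientation from the position, target and roll values
    # streamed by Flair's XYZ/MotionBuilder modes. Axis conventions assumed.
    import numpy as np

    def camera_orientation(position, target, roll_deg, world_up=(0.0, 0.0, 1.0)):
        """Return a 3x3 rotation matrix whose columns are the camera's
        right, up and forward axes."""
        position = np.asarray(position, dtype=float)
        target = np.asarray(target, dtype=float)
        forward = target - position
        forward /= np.linalg.norm(forward)

        # Build an orthonormal basis around the viewing direction.
        right = np.cross(forward, np.asarray(world_up, dtype=float))
        right /= np.linalg.norm(right)
        up = np.cross(right, forward)

        # Apply the roll as a rotation around the forward axis (Rodrigues' formula).
        a = np.radians(roll_deg)
        K = np.array([[0.0, -forward[2], forward[1]],
                      [forward[2], 0.0, -forward[0]],
                      [-forward[1], forward[0], 0.0]])
        R_roll = np.eye(3) + np.sin(a) * K + (1.0 - np.cos(a)) * (K @ K)

        return R_roll @ np.column_stack((right, up, forward))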

2.1.4 DeckLink

Stiller Studios's computers are equipped with DeckLink capture cards from Blackmagic which can take video signals via SDI connections. This image signal can be handled through Blackmagic's DeckLink SDK. An Interface Definition Language (IDL) file can be included and compiled into a Visual Studio project to generate a header with functions that can be called from a C++ application to communicate with the DeckLink card installed on the computer.

Via the SDK, it is possible to find and iterate through all DeckLink cards installed on the computer and also to find and iterate through all video out connections available on these cards. The SDK provides functions to get the image data in different formats (resolution, color space and color depth) as a byte string which can be used freely by the user. Exactly which formats are supported may vary between cards. Some cards also make it possible to efficiently perform alpha keying to combine images with transparency.

2.2 Camera calibration

An important part of filming live action on CGI is camera calibration, making sure that the correct camera parameters for the current lens, such as angle of view and distortion, are known. A more in-depth explanation of these parameters is found in section 3.2 on camera fundamentals.

Stiller Studios's calibration process is done by setting up a large "ruler" at the far end of the robot's rails using movable walls and tape, where the position of the ruler is lined up with the rails and robot. The robot is moved as far away from the ruler as possible, the lens's focus distance is set to infinity and the camera looks at the zero position of the ruler. The length of the ruler visible to the camera is estimated visually and the angle of view and focal length are calculated using trigonometry as seen in figure 2.3. This process is repeated for all focus distances written on the lens, noting discrepancies between the actual focus distance and the given focus distance.

Lens distortion, bulging or contraction of the image due to the optics, is handled in Nuke (post-processing software). An image of a grid is taken by the camera and then Nuke calculates the distortion using the warping of the grid.

The position of the center of projection, or no-parallax point, relative to the sensor is calculated by rotating the camera and finding the rotational position where parallax does not occur.

FIGURE 2.3: Sketch of the Stiller Studios camera calibration process.
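As a sketch of the trigonometry in figure 2.3, with made-up numbers: if a ruler of visible length L stands at distance d from the camera, the horizontal angle of view follows from 2 arctan(L / 2d), and the focal length for a given sensor width follows from the same relation in reverse.

    # Illustrative numbers only: estimating the horizontal angle of view and
    # focal length from the visible ruler length, the camera-to-ruler distance
    # and the sensor width, as in the manual procedure sketched in figure 2.3.
    import math

    def angle_of_view(visible_length, distance):
        return 2.0 * math.atan(visible_length / (2.0 * distance))

    def focal_length(sensor_width, aov):
        return sensor_width / (2.0 * math.tan(aov / 2.0))

    aov = angle_of_view(visible_length=4.0, distance=10.0)    # metres
    print(math.degrees(aov))                                  # ~22.6 degrees
    print(focal_length(sensor_width=0.030, aov=aov))          # 0.075 m = 75 mm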

2.3 Problem definition

From observations and discussions with the Stiller Studios crew, a number of areas with room for improvement were identified.

2.3.1 Low quality rendering

Optimally the previsualization rendering of the virtual scene should be as close to the postproduction result as possible. The graphics provided by MotionBuilder, however, do not hold a particularly high quality, lacking advanced materials or rendering methods. MotionBuilder scenes often look flat and represent a poor attempt at conveying the feel of a scene. This makes MotionBuilder mostly useful as a purely technical tool for knowing the physical limits of the scene, where actors should and should not stand, rather than a creative tool providing the director a feel for the scene.

2.3.2 Slow and imprecise camera calibration

To provide a good previsualization and send the correct data to the customer for production, it is important that the camera calibration works well. Optical distortion and field of view must be known in order to accurately match the virtual background to the filmed foreground. The position of the center of projection is particularly important to know since this is the same thing as the no-parallax point, the point around which the camera can rotate without parallax, the effect that makes objects in the scene appear to move relative to each other. Thus, when Stiller Studios is filming against a static background the camera must only rotate around the no-parallax point to prevent the film material from sliding relative to the background.

The current camera calibration technique takes a massive amount of time to set up and execute. If someone disturbs the ruler or if it is not set up correctly the results will be ruined and the calibration has to be redone. The calibration also relies on the naked eye judging when things look right, which introduces a large degree of imprecision. The current method also fails to account for images that are squeezed or stretched by the lens and camera, or cases where the optical center is not in the middle of the image. Stiller Studios's method for camera calibration also prevents the company from using other types of lenses, such as zoom lenses where the center of projection and angle of view change by a large degree, since calibration would simply take too much time.

2.3.3 Motion control safety and errors

Stiller Studios's camera tracking relies on translating motor values from the camera robot into camera parameters. This has the benefit of allowing the studio to precisely plan and repeat exact camera movements. However, the use of motion control for camera tracking has two major drawbacks - safety and mechanical errors.

Safety

The robot is heavy, fast and requires safety measures. Operation of the robot is limited to certified motion control supervisors. Precisely planning the movement of the camera is not only a possibility but in many cases a necessity. The safety requirements for the motion control robot can slow down the shooting, especially if changes to the camera movement are required.

Mechanical errors

Due to the robot's weight and speed, mechanical errors exist. Rapid deceleration will result in camera shaking due to inertia. When the robot fully extends its arm it will be slightly lower than in its retracted state due to its weight. Mechanical errors also occur due to backlash in the robot's gears.

2.3.4 Rudimentary compositing

Stiller Studios combines the camera feed with virtual scenes by simply keying out the green screen and adding the camera feed over the virtual scene. No depth information about the filmed material is known or used. This limits the kinds of scenes that can be previsualized since the actors cannot go behind virtual objects, leading to unrealistic previsualization or requiring green screen props as stand-ins for the virtual objects.

2.3.5 Suggested improvements

From the problem definitions, four areas of improvement can be defined:

• Improved rendering.

• Improved camera calibration.

• Alternative camera tracking.

• Depth compositing.

In order to limit the scope of the project, the majority of the work has focused on rendering and camera calibration. In chapter 3 the underlying theories behind these concepts are explored, while in chapter 4 methods for solving these problems are evaluated and implemented.

Chapter 3

Theory

3.1 Rendering

Many people encounter computer renderings every day - TV, the internet, video games and billboards on the streets are just some examples. Rendering is the process where a geometric description of a virtual object is converted from a 3D representation into a 2D image that looks realistic. There are both offline and real-time rendering techniques [4], some of them presented below.

3.1.1 Offline rendering

Offline rendering is used by systems that prioritize accuracy over frame rate. A single frame can, depending on what is being rendered, take hours to complete.

Ray tracing

Physically correct simulations of how light is transported are sought when rendering realistic 3D scenes. This can be done by casting rays from the camera, through a pixel, into the scene and calculating shadows and reflections from how the rays bounce between different scene surfaces, a method called ray tracing. This requires heavy computation and is therefore traditionally performed offline.

3.1.2 Real-time rendering

Real-time rendering enables direct interaction between a user and a virtual environment and is especially common in the computer games industry. It consists of algorithms using rasterization for converting geometries into pixels, and techniques for defining what, how and where pixels should be drawn.

Rasterization

Rasterization converts a vector graphics image into a raster (pixel or dot) image. The technique is extremely fast but does not, unlike ray tracing, prescribe any way of simulating reflections or shadowing. To handle these issues rasterization is combined with certain real-time global illumination and real-time shadow methods, described in the following subsections. This will, however, not make the rendering result look as realistic as ray tracing.

Real-time global illumination

Two methods used for real-time global illumination are baking and voxel cone tracing. Baked global illumination has the limitation that it can only handle static objects, since the technique must store light information before it can be processed further. Based on a hierarchical voxel octree representation [5], voxel cone tracing on the other hand supports dynamic indirect lighting but does not look as good as baked lighting.

Real-time shadows

There are several ways of generating real-time shadows, for instance shadow mapping and shadow volumes. Shadow mapping is image based and shadow volumes are geometry based [6], with the result that shadow mapping is faster but not as precise as shadow volumes.

Forward and deferred rendering

There are rasterization techniques for determining the path by which pixels should be rendered in real-time. Two examples of these are forward rendering and deferred rendering [7][8].

Forward rendering supplies the graphics card with geometry that breaks down into vertices, which are then split and transformed into fragments, or pixels, that get rendered before being passed on to the screen. It is a fairly linear process and is done for each geometry in the scene before producing the final image [9].

Deferred rendering, on the other hand, performs its calculations directly on the pixels on the screen instead of relying on the total fragment count. This simplifies the use of many dynamic light sources within a scene, making it in that sense an optimized form of forward rendering. However it cannot handle everything that forward rendering does, for example the rendering of transparent objects. It also requires newer hardware to run.

3.2 Camera fundamentals

3.2.1 Pinhole camera

A simple model for understanding how a camera or an eye works is the pinhole camera. The idea behind a pinhole camera is essentially to take a lightproof box with a small hole (a pinhole) on one side. An upside-down projection of the view outside the pinhole is projected on the opposite side of the box, as a result of the pinhole blocking all rays of light not coming from the pinhole's direction, as seen in figure 3.1.

FIGURE 3.1: The pinhole camera model [10].

The image of a perfect pinhole camera, with the pinhole as a single point in space and no bending of light, can be described using the following equations

\[ -x = f\,\frac{X}{Z} \tag{3.1a} \]
\[ -y = f\,\frac{Y}{Z} \tag{3.1b} \]

where x and y are the coordinates of the projection on the image plane, f is the focal length or distance between the image plane and the pinhole, X and Y are the horizontal and vertical displacement from the pinhole and Z is the depth of the observed object. The angle of view of the image is given by the following formula

\[ a = 2 \arctan\left(\frac{d}{2f}\right) \tag{3.2} \]

where a is the angle of view, d is the dimension of the projected image (width or height depending on whether a horizontal or vertical angle of view is requested) and f is the focal length.

By removing the minus signs from equations 3.1 a correctly oriented image is given. This can be geometrically interpreted as putting the image plane in front of the pinhole and coloring it based on the rays that pass through. This mathematical abstraction of the pinhole camera is typically known as the projective transform and the pinhole as the center of projection [11]. This is the camera model commonly used when rendering virtual 3D images. When modeling real cameras there are a few other factors that need to be taken into account.
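A small sketch of equations 3.1 and 3.2, with the minus signs removed and all numbers made up for illustration:

    # Pinhole projection (equation 3.1) and angle of view (equation 3.2);
    # units are arbitrary but must be consistent.
    import math

    def project(point, focal_length):
        X, Y, Z = point
        return (focal_length * X / Z, focal_length * Y / Z)

    def angle_of_view(dimension, focal_length):
        return math.degrees(2.0 * math.atan(dimension / (2.0 * focal_length)))

    print(project((0.5, 0.25, 4.0), focal_length=0.035))        # image plane coordinates
    print(angle_of_view(dimension=0.036, focal_length=0.035))   # ~54.4 degrees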

3.2.2 Real camera

One of the biggest issues with the pinhole camera is that in order to get a clear image the pinhole needs to be very small. A larger hole will allow light from a larger area to reach the same location on the image plane or sensor in the camera, creating blur. A smaller hole however means less light and a darker image.

Diffraction

Another problem with real pinhole cameras is diffraction. While light is commonly modeled as perfect rays, this does not quite accurately describe how light behaves in real life due to the wave-particle duality of light. A consequence of the wave nature of light is that it bends slightly around corners. Thus, a small pinhole will have most light bend around its edges, resulting in an effective blurring of the image and also the creation of an artifact known as the Airy disk [12]. This limits how sharp an image a real camera can produce.

Proportions

Yet another problem with the pinhole camera is that the angle of view completely depends on the relation between the size of the projected image or sensor and the distance between the image plane and the pinhole, making very large or very small angles of view infeasible.

Lens distortion

To get around the pinhole camera problems mentioned above, real cameras use lenses. The lens allows for a larger opening while retaining a crisp image by focusing light rays originating from the same point in space to the same position on the sensor. The use of lenses does, however, result in a number of new problems.

Using lenses will result in various degrees of distortion, that is, deviation from the rectilinear projection of the pinhole camera model where straight lines in the scene remain straight after projection, as seen in figure 3.2. The most common forms of distortion introduced by lenses are radial and tangential distortion [11].

FIGURE 3.2: A rectangular grid warped by radial distortion.

Radial distortion occurs as a result of the radial symmetry of the lenses. Tangential distortion in turn occurs as a result of misalignment of the elements in the lens and camera, such as the sensor being at an angle relative to the optical axis of the lens.

A lens only correctly focuses light at a specific distance, and objects outside the focus distance are perceived as increasingly blurry. The range in which objects are considered sharp is called the depth of field, see figure 3.3. The human eye solves blurring by controlling the shape of the lens using muscles, thus changing the focus distance to whatever the person is looking at. Cameras usually solve the same problem by having one camera lens consist of an array of lens elements whose configuration can be changed mechanically, changing the focus distance.

FIGURE 3.3: Within the depth of field, objects are perceived as sharp [13].

Aperture

The problem of depth of field can be mitigated by the use of an aperture, which can close and focus the image in essentially the same way as the pinhole camera by only letting light in through a small opening, but with the same drawback of making the image darker.

How much light reaches the sensor is usually measured using the f-stop or f-number, which is equal to

\[ N = \frac{f}{D} \tag{3.3} \]

where N is the f-number, f is the focal length and D is the diameter of the effective aperture. The larger the f-number is, the less light there is. The larger the diameter of the aperture is, the more light enters the camera. A larger focal length means that less light reaches the sensor due to decreased energy density over the distance traveled. This means that the f-number gives a measurement of how much light is captured that is comparable across cameras with different focal lengths [14]. However, light is not only lost due to distance traveled but also through absorption by the optical elements of the lens. Therefore t-stops (transmission stops) are used, which also take the transmission efficiency of the lens into account.
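As a brief worked example (values chosen only for illustration), a 50 mm lens whose effective aperture diameter is 25 mm has

\[ N = \frac{f}{D} = \frac{50\ \text{mm}}{25\ \text{mm}} = 2, \]

commonly written f/2; stopping down to a 12.5 mm diameter doubles N to f/4 and, since the collected light scales with the aperture area, lets in a quarter of the light.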

Exposure

Another method of making an image brighter is to increase the exposure time. Rather than increasing the area through which light enters the camera, it is possible to increase the amount of light by increasing the time period the sensor is exposed to light for each image. This however creates problems if there is movement in the image during the exposure, leading to motion blur.

Center of projection and angle of view

Finding the center of projection and angle of view for a pinhole camera is somewhat trivial since these are well defined and easily measured properties of the camera. The center of projection is at the same position as the pinhole. By measuring the distance between the pinhole and the sensor as well as the size of the sensor, the angle of view can be determined using equation 3.2. This however is not the case for real cameras with complex lenses. A common misconception is that the center of projection is the nodal point or front nodal point of a lens, or that the no-parallax point is the same as the nodal point or front nodal point, either as a mistake in terminology or a misunderstanding of how lenses work.

The no-parallax point is the point on the camera around which no parallax will occur if the camera is rotated. This means that objects seen from the camera will not appear to change their relative location to each other when the camera is rotated. That the no-parallax point and the center of projection are the same thing can easily be understood since the center of projection is the point where all rays entering the camera converge. Drawing lines between the points in the scene and the center of projection shows that the angle between objects only changes with the position of the center of projection rather than its orientation. This also has the convenient effect that the center of projection can be found manually by rotating the camera around different points until no parallax is achieved.

The center of projection is the same as the entrance pupil, or apparent position of the aperture seen from the front of the camera. The position of the entrance pupil will also affect the angle of view of the camera; however, due to bending of light by the lens, the angle of view cannot be calculated by simple measurements as for the pinhole camera. A short explanation is that the aperture puts the same constraints on incoming rays of light as the pinhole does in the pinhole camera. All rays will pass through the aperture and thus the aperture will in practice work as a center of projection. While this is intuitive for a small aperture it also holds true if the aperture size is large. The same rays entering the small aperture will also enter the large one, however more rays will enter the large one, creating blur. The position of the aperture will determine which part of the blurred image to sharpen as a function of the center of projection and angle of view decided by the aperture position. A more detailed explanation and proof can be found in [15].

For a lens with the aperture at the front, the position of the center of projection is easily determined. For many lenses, however, the aperture is behind a number of lens elements, bending the light before it hits the aperture. This is why the center of projection is not necessarily at the aperture's physical position but rather at its apparent position, as seen in figure 3.4.

FIGURE 3.4: A raytraced telephoto lens, showing the entrance pupil and its relation or lack thereof to the nodal points or any physical part of the camera [15].

Zoom and prime lenses

Complex lenses used for professional photography and filmmaking usually fall into one of two categories - prime lenses and zoom lenses. Prime lenses are lenses with a supposedly constant focal length (and thus angle of view) and with the possibility to change the focus distance. However, due to "lens breathing" the focal length and the center of projection can change slightly when changing the focus of the camera. For zoom lenses this changing of focal length is not a bug but a feature, allowing the camera operator to not only change the focus distance but also the angle of view by a large amount, zooming in or out of the image as a result. What this means is that not only are the center of projection and angle of view hard to find for complex lenses, they also change depending on the mechanical settings of the lens.

3.3 Camera model

A more complete mathematical model of the pinhole camera is given by

\[ x = f_x\,\frac{X}{Z} + c_x \tag{3.4a} \]
\[ y = f_y\,\frac{Y}{Z} + c_y \tag{3.4b} \]

or, as a matrix using homogeneous coordinates,

\[ \begin{pmatrix} x \\ y \\ w \end{pmatrix} = \begin{pmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{pmatrix} \begin{pmatrix} X \\ Y \\ Z \end{pmatrix} \tag{3.5} \]

where fx and fy are the horizontal and vertical focal lengths and cx and cy give the optical center of the image. Sometimes a fifth parameter representing skew in the image is added to the model [11].

The reason why there are two focal lengths, if the focal length of the pinhole camera is equal to the distance from the image plane to the center of projection, is that the focal length is given in the coordinate system of the image rather than the coordinate system of the observed scene. Different values for fx and fy will squeeze or stretch the image horizontally or vertically. In real cameras this can be used to compensate for the lens squeeze of anamorphic lenses, or for the sensor elements representing each pixel being rectangular rather than square (square being the general case for pixels).

cx and cy represent the optical center of the image, also given in image coordinates. That is the intersection of the optical axis and the image plane, where the optical axis is the line going through the center of projection orthogonal to the image plane. Typically the optical center is requested to be the actual center of the image. For an image given in pixel coordinates that is 800 pixels wide and 600 pixels high, cx = 400 and cy = 300. However, since real cameras and lenses are not perfect this might vary slightly. There also exist applications where the optical center and the image center should differ. It is possible to think of the center of projection as an observer in itself and the image plane as a screen or window that the observer is looking at. Thus, most images will look right when looking at them from a centered position, aligning the observer's view with the center of projection. If an image is meant to be viewed from an angle, the position of the optical center should be changed accordingly. A notable example of this are VR displays creating an illusion of depth by tracking the eyes of the user and changing the position of the center of projection [16].

To represent the camera position and orientation in the world it is also possible to add a rotation and a translation matrix to the model, transforming the coordinates of objects given in a coordinate system of the scene to the coordinate system of the camera. This completes the pinhole camera model as

\[ q = M\,T\,R\,Q \tag{3.6} \]

where q is the image coordinates, M is the camera matrix describing the mapping of the pinhole camera from 3D points in the world to 2D points in an image, T is a translation matrix, R is a rotation matrix and Q is the object space coordinates.

To model real cameras, however, lens distortion must be taken into account as discussed above. This can be done by using Brown's distortion model [17]. Brown's distortion model corrects for lens distortion by mapping the image coordinates of a distorted image point to its undistorted position, thus correcting the distortion of the image. Brown's model is given by

\[ x_d = x_u\left(1 + K_1 r^2 + K_2 r^4 + \dots\right) + \left(P_2\left(r^2 + 2x_u^2\right) + 2P_1 x_u y_u\right)\left(1 + P_3 r^2 + P_4 r^4 + \dots\right) \tag{3.7a} \]

\[ y_d = y_u\left(1 + K_1 r^2 + K_2 r^4 + \dots\right) + \left(P_1\left(r^2 + 2y_u^2\right) + 2P_2 x_u y_u\right)\left(1 + P_3 r^2 + P_4 r^4 + \dots\right) \tag{3.7b} \]

\[ r = \sqrt{(x_u - x_c)^2 + (y_u - y_c)^2} \tag{3.7c} \]

where xd and yd are the undistorted image coordinates, xu and yu are the distorted image coordinates, xc and yc are the distortion center, Kn are the radial distortion parameters and Pn are the tangential distortion parameters. Since Brown's distortion model describes radial and tangential distortion using Taylor polynomials, it can in theory be used to model radial and tangential distortion of arbitrary complexity [18]. The camera matrix together with Brown's distortion model has been shown empirically to be capable of giving relevant estimations of the type of cameras commonly used in photography and filmmaking.

The variables of the camera matrix and the distortion functions are together referred to as the intrinsic parameters of the camera, that is, the parameters describing the camera's internal structure, whereas the orientation and position of the camera, given by the rotation and translation matrices, are referred to as the extrinsic parameters.
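A small sketch combining equations 3.5 and 3.7, following the common convention (used by e.g. OpenCV) of applying the radial and tangential terms to normalized image coordinates before the camera matrix, with only two radial and two tangential coefficients and the higher-order tangential scaling omitted; all numbers are made up:

    # Illustrative projection with intrinsic parameters and Brown-style
    # distortion; coefficient values are placeholders, not calibration results.
    import numpy as np

    fx, fy, cx, cy = 1200.0, 1200.0, 960.0, 540.0   # made-up intrinsic parameters
    K1, K2, P1, P2 = -0.12, 0.03, 0.001, -0.0005    # made-up distortion parameters

    def project(point_cam):
        """Project a 3D point given in camera coordinates to pixel coordinates."""
        X, Y, Z = point_cam
        xu, yu = X / Z, Y / Z                       # normalized pinhole projection
        r2 = xu * xu + yu * yu
        radial = 1.0 + K1 * r2 + K2 * r2 * r2       # two radial terms of eq. 3.7
        xd = xu * radial + 2.0 * P1 * xu * yu + P2 * (r2 + 2.0 * xu * xu)
        yd = yu * radial + P1 * (r2 + 2.0 * yu * yu) + 2.0 * P2 * xu * yu
        # The camera matrix of equation 3.5 maps the distorted normalized
        # coordinates to pixel coordinates.
        return np.array([fx * xd + cx, fy * yd + cy])

    print(project((0.3, -0.1, 2.0)))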

3.3.1 Other camera models

While the pinhole camera model with added distortion works well enough in most cases, it is not entirely physically accurate and has limitations. For example, it is generally unsuitable for fisheye lenses since the model breaks down when dealing with angles of view that exceed 180°. For these cases omnidirectional camera models can be used [19]. Another example of an alternative and more physically accurate camera model is the one proposed by Kumar et al. [20], modelling radial distortion as a change in the position of the center of projection.

3.3.2 Modelling blur

A real camera produces both depth of field blur and motion blur, which the virtual, perfect pinhole camera does not. Correctly matching the cameras might require either correcting the blur in the real camera image or applying blur to the virtual one. In general this is less of a problem when shooting on green screen since the actors and real props are usually at a similar distance from the camera or can be shot in several takes.

Simulating blur is generally easier than correctly sharpening an image, since information is lost in the blurring process. However, when attempting to sharpen blurred images it would probably be preferable to look into deconvolution algorithms, which are algorithms used to reverse the effects of convolution [21].

Nvidia's GPU Gems 3 discusses both the problem of simulating motion blur and that of depth of field [22]. The suggested motion blur method applies motion blur as a post-processing step by first calculating the world position of each pixel using the depth buffer. The pixel's previous position is then calculated using the view-projection matrix of the previous frame. The difference between these two positions is calculated to get the pixel velocity. The color of the pixels in the velocity direction is sampled and blended for each pixel, creating motion blur. This method however only renders motion blur caused by camera movement.

If enough computing power is available, both the motion blur and depth of field problems can be solved in largely physically accurate ways by multisampling. Motion blur can be simulated by taking several subsamples between each frame and blending them. The depth of field problem can be solved by rendering the scene several times, each time offsetting the center of projection slightly while still having the camera looking at the point that is the center of focus, and blending the samples. However, both of these methods can become extremely expensive since the whole scene is rendered anew for each sample. Too few samples will create no blur, or "spectres" where several copies of the blurred objects appear rather than a continuous blur.

3.4 Geometric camera calibration

"We focus on intricate motion control work, where virtual and real camera positions and paths need to be perfectly matched and output in real time as usable data."

from Stiller Studios’ website [23].

Having a virtual camera model capable of describing how real cameras in general generate images from scenes is not enough to match them. It is also necessary to find the actual values of the parameters describing the specific camera used and, as explained previously, this is non-trivial for cameras with complex lenses. The purpose of geometric camera calibration is to find the intrinsic and extrinsic parameters of a camera.

The chapter on camera calibration written by Zhang in the book "Emerging Topics in Computer Vision" [24] gives a general overview of photogrammetric camera calibration, that is, camera calibration using measurements from images taken by the camera. Several different calibration methods exist, but they generally consist of the following two parts:

1. Detecting geometry in one or more images.

2. Finding the camera parameters that map the detected geometry between the images taken or to a known model of the geometry.

Generally points are used as the geometry in step 1; however, different approaches exist, such as methods using lines and vanishing points [25]. Calibration methods can be divided into two categories - calibration with apparatus and self-calibration. Calibration with apparatus is done by locating known points in the image of the calibration apparatus and matching these to the points in a predefined model of the points' real locations. Self-calibration, in contrast, is done without a known model, but instead by identifying the same points from a number of views and finding the camera transform that maps these points to each other. Zhang [24] further divides calibration with apparatus into three categories based on the dimension of the calibration apparatus:

• 3D reference object based calibration.

• 2D plane based calibration.

• 1D line based calibration.

While self-calibration has the benefit of not requiring a well-defined and well-made calibration apparatus, it requires a larger number of parameters to be estimated and is thus also a harder problem. However, in some cases calibration using an apparatus is not possible, such as when calibration of a camera from a given video is desired.

The more dimensions the calibration apparatus has, the easier the parameters are to solve for; with a 3D apparatus it is possible to calibrate a camera using only one image of the apparatus. However, creating a 3D reference object with precise measurements is hard or expensive, while almost anyone can print a pattern on a paper or order a custom made poster of a pattern. A 2D apparatus in turn requires that several images with different views of the apparatus are used. 1D objects, such as a rod or string with beads attached along its length, have the advantage that it is easier to make sure that all points are visible from several different viewpoints at the same time for several different orientations of the calibration object. This is useful when calibrating the extrinsic parameters of several cameras relative to each other.

Calibration techniques can also be classified based on their constraints. By reducing the number of parameters that need to be solved for, the calibration can be simplified. Often the intrinsic parameters are kept constant for all images used in calibration. Methods also exist that allow for varying intrinsic parameters with known rotation of the camera [26].

3.4.1 Zhang's method

One popular calibration method is the one devised by Zhang in 1998 [18]. Zhang's method is a 2D plane based calibration method that uses the pinhole camera model with Brown's distortion model with two radial distortion parameters, as described in section 3.3. The method takes at least two lists of point pairs, each consisting of at least eight pairs. One element of each pair is a detected point from a view of the calibration plane and the other is the corresponding point in a known model of the calibration plane. The homography between the model points and the detected points is calculated, and from this the camera matrix and extrinsic parameters are estimated. With this information the least squares method is used to estimate the lens distortion parameters. This gives an initial guess for the camera parameters, which is in turn refined using the iterative Levenberg-Marquardt optimization algorithm to minimize the reprojection error.
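As an illustration, OpenCV's cv2.calibrateCamera implements a planar calibration of this kind; the sketch below assumes that corresponding model and image points have already been collected (point detection is discussed in section 3.4.4):

    # Sketch of planar (Zhang-style) calibration with OpenCV. object_points and
    # image_points are lists with one entry per view: Nx3 float32 model points
    # (Z = 0 for a planar target) and the corresponding Nx1x2 float32 detections.
    import cv2

    def calibrate(object_points, image_points, image_size):
        """image_size is (width, height) in pixels."""
        rms, camera_matrix, dist_coeffs, rvecs, tvecs = cv2.calibrateCamera(
            object_points, image_points, image_size, None, None)
        # rms is the root mean square reprojection error in pixels; dist_coeffs
        # holds the distortion parameters (by default K1, K2, P1, P2, K3).
        return rms, camera_matrix, dist_coeffs, rvecs, tvecs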

3.4.2 Reprojection error

The reprojection error is a measure of the quality of a camera model, computed by taking the average distance between a set of measured points projected using the camera model and the same points detected in an image taken by the actual camera. If the points' precise positions are detected in the image and the camera model is completely correct, the distance should be zero.
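A small sketch of this measure for one calibration view, using OpenCV's cv2.projectPoints and assuming the rotation and translation vectors come from a previous calibration:

    # Mean reprojection error in pixels for one view.
    import cv2
    import numpy as np

    def reprojection_error(model_points, detected_points, rvec, tvec,
                           camera_matrix, dist_coeffs):
        """model_points: Nx3 float32, detected_points: Nx2 or Nx1x2 float32."""
        projected, _ = cv2.projectPoints(model_points, rvec, tvec,
                                         camera_matrix, dist_coeffs)
        # Average Euclidean distance between projected and detected points.
        return np.mean(np.linalg.norm(
            projected.reshape(-1, 2) - detected_points.reshape(-1, 2), axis=1))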

3.4.3 Camera tracking

The traditional approach to camera tracking, or match moving, is usually carried out in post-production through self-calibration, by finding points and matching their positions between frames without information about their actual 3D positions.

Good camera calibration algorithms are not known for their speed. However, once the intrinsic parameters have been calculated they will not change unless the optics of the camera are changed. For a prime lens, only the focus can change; thus the change in optics is one-dimensional and can easily be stored for later use. To get a real-time camera tracking solution, only the camera's extrinsic parameters therefore have to be solved in real time. Finding the extrinsic parameters of a camera with known intrinsic parameters, using a known set of 3D points and their 2D projections in an image, is known as Perspective-n-Point or PnP and can be done much faster than general camera calibration [27].
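A sketch of such a pose update using OpenCV's PnP solver, assuming previously calibrated intrinsics and at least four 3D-2D correspondences per frame:

    # Real-time extrinsic estimation with Perspective-n-Point.
    import cv2

    def track_camera(model_points, image_points, camera_matrix, dist_coeffs):
        """model_points: Nx3 float32 world points, image_points: Nx2 float32
        detections of the same points in the current frame."""
        ok, rvec, tvec = cv2.solvePnP(model_points, image_points,
                                      camera_matrix, dist_coeffs,
                                      flags=cv2.SOLVEPNP_ITERATIVE)
        if not ok:
            return None
        R, _ = cv2.Rodrigues(rvec)               # rotation vector -> 3x3 matrix
        camera_position = (-R.T @ tvec).ravel()  # camera center in world coordinates
        return R, tvec, camera_position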

3.4.4 Control point detection

A crucial step in camera calibration is detecting the points used for calibration. For the calibration algorithm to work, the points used in calibration must be correctly detected. Precise detection of points is limited by noise, blur, imprecision in the manufacturing of the calibration pattern, image resolution and distortion.

Corner detection

One of the simplest calibration point solutions is using a chessboard pattern and a corner detection algorithm such as Harris corner detection [11]. An iterative gradient search can be applied to achieve subpixel accuracy [11]. However, these methods only work optimally for patterns viewed head-on, parallel to the image plane, and due to perspective and lens distortion this will not be the case.
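A minimal OpenCV sketch of this approach; the 9x6 pattern size and the file name are placeholders:

    # Chessboard corner detection with iterative subpixel refinement.
    import cv2

    gray = cv2.imread("view_of_chessboard.png", cv2.IMREAD_GRAYSCALE)  # placeholder
    found, corners = cv2.findChessboardCorners(gray, (9, 6))
    if found:
        # Iterative gradient-based refinement to subpixel accuracy.
        criteria = (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 30, 0.001)
        corners = cv2.cornerSubPix(gray, corners, (11, 11), (-1, -1), criteria)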

Center of squares

Another calibration point method is to apply the corner detection scheme to detect the corners of squares and find the middle of each square from these corners. The idea is that the error in corner detection will even out to some degree by using the average of several points.

Detecting the center of the square is trivial when viewing the frontal image of the plane. In this case the middle of a square is simply the average of the corners. However, this will not be the case when viewing the plane from an angle, due to perspective distortion making parts of the square that are further away appear smaller than parts closer to the camera. The real center can be found by defining lines between the diagonal corners of the square, as seen in figure 3.5. The intersection of these lines, as calculated by equation 3.8, is the center of the square. However, due to lens distortion the square center will still be imprecisely detected.

\[ x = \frac{\begin{vmatrix} \begin{vmatrix} x_1 & y_1 \\ x_2 & y_2 \end{vmatrix} & x_1 - x_2 \\ \begin{vmatrix} x_3 & y_3 \\ x_4 & y_4 \end{vmatrix} & x_3 - x_4 \end{vmatrix}}{\begin{vmatrix} x_1 - x_2 & y_1 - y_2 \\ x_3 - x_4 & y_3 - y_4 \end{vmatrix}} \tag{3.8a} \]

\[ y = \frac{\begin{vmatrix} \begin{vmatrix} x_1 & y_1 \\ x_2 & y_2 \end{vmatrix} & y_1 - y_2 \\ \begin{vmatrix} x_3 & y_3 \\ x_4 & y_4 \end{vmatrix} & y_3 - y_4 \end{vmatrix}}{\begin{vmatrix} x_1 - x_2 & y_1 - y_2 \\ x_3 - x_4 & y_3 - y_4 \end{vmatrix}} \tag{3.8b} \]

where (x, y) is the intersection point, (x1, y1) and (x2, y2) are points along one line and (x3, y3) and (x4, y4) are points along the other line.

FIGURE 3.5: A rectangle seen from an angle with the actual center of the rectangle in green and the average of the corner points in red.
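A sketch implementing equation 3.8 for the diagonals of a detected square; the corners are assumed to be given in order around the square, so that corners 0/2 and 1/3 are diagonally opposite:

    # Perspective-correct square center as the intersection of its diagonals.
    def line_intersection(p1, p2, p3, p4):
        """Intersection of the line through p1, p2 with the line through p3, p4
        (equation 3.8)."""
        x1, y1 = p1; x2, y2 = p2; x3, y3 = p3; x4, y4 = p4
        d12 = x1 * y2 - y1 * x2              # |x1 y1; x2 y2|
        d34 = x3 * y4 - y3 * x4              # |x3 y3; x4 y4|
        denom = (x1 - x2) * (y3 - y4) - (y1 - y2) * (x3 - x4)
        x = (d12 * (x3 - x4) - (x1 - x2) * d34) / denom
        y = (d12 * (y3 - y4) - (y1 - y2) * d34) / denom
        return x, y

    def square_center(corners):
        """corners: four (x, y) tuples in order around the square."""
        return line_intersection(corners[0], corners[2], corners[1], corners[3])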

Circles

The idea of reducing errors by using the average of several points can be extended further using circles. Circles can be thought of as polygons with an infinite number of corners. A contour detection algorithm can be used to detect circles, and an ellipse fitting function [11] or an average of the contour points can be used to find the center. However, as for the center of squares, the problem with perspective distortion remains, and since circles lack corners no trivial solution exists.

Mateos [28] suggests a solution by finding the shared tangents of four circles neighboring each other in a square grid and calculating bounding squares to find the circle centers using the line intersection method. As with the square and corner detection solutions, an error due to lens distortion will occur.

By using concentric circles, more contour points can be used to even out errors. Another benefit of concentric circles is that shapes that can be mistaken for concentric circles are unlikely to occur in the background of an image, reducing the risk of faulty point detection.
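A simple OpenCV sketch of circle center detection via contours and ellipse fitting; the file name is a placeholder and the return signature of cv2.findContours assumes OpenCV 4:

    # Circle control points: threshold, find contours, fit ellipses, keep centers.
    import cv2

    gray = cv2.imread("view_of_circle_pattern.png", cv2.IMREAD_GRAYSCALE)  # placeholder
    _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
    contours, _ = cv2.findContours(binary, cv2.RETR_LIST, cv2.CHAIN_APPROX_NONE)

    centers = []
    for contour in contours:
        if len(contour) >= 5:                  # fitEllipse needs at least 5 points
            (cx, cy), axes, angle = cv2.fitEllipse(contour)
            centers.append((cx, cy))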

3.4.5 Precalculation of lens distortion

One way to solve the problem of lens distortion bias in marker detection is to solve for lens distortion separately and correct for it before doing any other calibration. An advanced method for solving lens distortion separately is presented by Tang et al. [29] using what the authors call a calibration harp - a frame with very straight lines strung vertically. A picture of the harp is taken and the distortion model is optimized to minimize the curvature of the lines.

3.4.6 Iterative refinement of control points

Datta et al. [30] present a method for overcoming the problem of lens and perspective distortion errors using an iterative calibration method. The basic idea is that even if the first intrinsic and extrinsic parameters are not calculated exactly right, they can be used for reprojecting the image and still produce an image with less distortion that is viewed more from the front. Detecting control points in an image with less perspective and lens distortion results in higher accuracy. Thus, by detecting the control points again in the frontal image and projecting these back using the calibrated camera, rotation and translation matrices, an even more accurate set of control points is obtained that can be used for recalibration. This process can be repeated until convergence or until a sufficiently low reprojection error is reached.

A more advanced method based on the same idea of using the reprojected frontal image is presented by Wang et al. [31], with the main difference being the use of iterative digital image correlation of concentric circles, rather than the same detection algorithm used in the first step, for obtaining more precise control points from the frontal image.

3.4.7 Fiducial markers

When detecting control points it is generally not necessary to detect every single control point, or the same control points, in every image used for calibration. It is however usually necessary to correctly match each control point to the equivalent point in the model or between the images. One way to allow for identification of control points is to use markers with identities built into their design. Rice [32] divides these kinds of visual markers, or tags, into two broad coding schemes - template and symbolic. Template-based schemes work by matching the detected tag against a database of predefined marker images using autocorrelation. As such, in theory, any image can be used as a tag. An example of template-based tags are the ones used in ARToolKit (figure 3.6). Using a symbolic scheme, on the other hand, means that the tag is created and read using a set of well-defined rules for how the data is encoded in the tag. One of the most well known symbolic tags is the QR code (figure 3.7).

FIGURE 3.6: Template fiducial used in ARToolKit [33].

FIGURE 3.7: QR code, a symbolic data matrix tag [33].

Template-based tags benefit from allowing images with meaning to a human observer to be used as tags. However, they also present a number of problems. To avoid false detections the tags need to be as distinct as possible, which in turn hinders the use of arbitrary images or images with specific meanings. The images also need to be sufficiently distinct at different orientations, and only little research has been done into the effects of pixelation and perspective correction on the autocorrelation function. A large dataset of template images will also lead to a lot of potentially expensive comparison operations for each detected marker.

Symbolic tags, in contrast, usually work by dividing the tag into a number of data cells representing binary data by coloring the data cells black (0) or white (1). The marker is detected in an image, the data cells are sampled and a code representing the marker is built from the sampled data. Symbolic tags benefit from being very clearly defined. Since the identity of the marker is built from the data contained in the tag, it is not necessary to do a linear search through a database for the marker's identity. A risk for errors in the detected data exists. This can however be modeled as bit errors and solved the same way, by including redundant bits, parity bits, checksums or similar. Increasing the number of markers is also well defined, by simply increasing the number of data cells at the cost of more detail in the marker.

Tags can have different shapes, most commonly circular or square (figure 3.8). The benefits and drawbacks of using either circular or square markers have been discussed previously. However, in the context of symbolic data markers, the shape of the marker has some significance when it comes to how the data is structured in the marker. Square markers are usually structured as a simple data matrix and are quite easy to process. If the corners are found, a perspective transform can be used to find the frontal image, which in turn can be sampled along the x-axis and y-axis of the image with an offset defined by the data matrix's size. To optimally make use of the area of a circular tag, the data cells can instead be organized using polar coordinates, by radius and angle, since these can be sampled using trigonometry for frontal images. When seen from an angle, the problem of defining the perspective of a circle can possibly lead to errors if a lot of data is encoded along the radius of the circle.

FIGURE 3.8: Circular and square symbolic tags [34].
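To make the sampling of a square symbolic tag concrete, the following sketch rectifies a detected tag to a frontal view with a perspective transform and thresholds the center of each data cell. It assumes OpenCV (introduced in section 4.2.1) and a grayscale input; the helper, its fixed cell size and the threshold are illustrative, not the decoding rules of any particular tag standard.

```cpp
#include <opencv2/opencv.hpp>
#include <vector>

// Read an n x n grid of binary data cells from a square tag, given its four
// detected corners in clockwise order starting at the top-left corner.
std::vector<int> readTag(const cv::Mat& gray,
                         const std::vector<cv::Point2f>& corners, int n)
{
    const int cell = 16;                        // rectified cell size in pixels
    const float s = static_cast<float>(n * cell);
    std::vector<cv::Point2f> frontal = { {0, 0}, {s, 0}, {s, s}, {0, s} };

    // Homography from the detected corners to the frontal square.
    cv::Mat H = cv::getPerspectiveTransform(corners, frontal);
    cv::Mat rectified;
    cv::warpPerspective(gray, rectified, H, cv::Size(int(s), int(s)));

    std::vector<int> bits;
    for (int row = 0; row < n; ++row)
        for (int col = 0; col < n; ++col) {
            // Sample the center of each data cell; white -> 1, black -> 0.
            uchar v = rectified.at<uchar>(row * cell + cell / 2,
                                          col * cell + cell / 2);
            bits.push_back(v > 127 ? 1 : 0);
        }
    return bits;
}
```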

Rotation of fiducials can be handled for both circular and square tags by reading the data cells in a rotationally invariant manner and offsetting the data. This however reduces the number of unique data representations of the tag. Another solution is to add orientation descriptors to the marker as seen in figure 3.7 and figure 3.9.

FIGURE 3.9: Circular symbolic tag with orientation markers [35].

One interesting and quite different kind of fiducial marker is the chromaglyph [36]. Chromaglyphs consist of concentric circles in different colors (figure 3.10). As long as these colors are different enough they should provide a simple yet robust system for unique markers. The use of color could however be a problem in a green screen environment, and the question of what constitutes "different enough" also remains.

FIGURE 3.10: Chromaglyph markers [36].

3.4.8 Image selection

Calibration methods that rely on several images of a calibration plane need these images to be taken from different perspectives in order to provide enough information for the algorithm. Similar images will give similar, and thus redundant, information to the algorithm. More images mean more calculations and thus a longer time for the calibration to complete. Similar perspectives also increase the risk of error bias leading to faulty parameters. Ideally, the smallest number of images with the most variance, and thus information, should be used for calibration. This could possibly be done either by planning the capture of the images in detail or by selecting a number of images for calibration based on whether they look different to the human eye or not. Byrne et al. [37] present a method for automatic selection of images using the concept of the calibration line. By using the calibration line it is possible to describe each image as a single line and use the angle of these lines as image descriptors. This means that how similar two images are can be described by a one-dimensional metric, and also that it is possible to tell whether optimal coverage is achieved for the selected number of images. The calibration line can be approximated by finding the homography between four detected control points in an image and the same four control points in the reference. The calibration line is then calculated as a function of the homography matrix as

y = kx + m   (3.9a)

k = \frac{-h_{11}h_{32}^3 + h_{12}h_{31}^3 - h_{11}h_{31}^2 h_{32} + h_{12}h_{31}h_{32}^2}{h_{22}h_{31}h_{32}^2 - h_{21}h_{31}^2 h_{32} - h_{21}h_{32}^3 + h_{22}h_{31}^3}   (3.9b)

m = \frac{h_{21}h_{31} + h_{22}h_{32}}{h_{31}^2 + h_{32}^2} - \frac{h_{11}h_{31} + h_{12}h_{32}}{h_{31}^2 + h_{32}^2} k   (3.9c)

where y = kx + m is the line's equation with k being the slope and m the intersection of the line and the y-axis, and h_{rc} is the value at the r-th row and c-th column of the homography matrix.

\theta = \arctan(k)   (3.10)

gives the angle θ of the line. According to Byrne et al., and shown in their results, this angle can be used as a one-dimensional difference between images for the purpose of calibration. A line can rotate 180° before lining up with itself again. Thus, to get optimal variance for the images, the angles of the calibration lines should be as spread out over 180° as possible. The optimal angle between the calibration lines of the images used can be calculated as

\beta = \frac{180^\circ}{N}   (3.11)

where β is the optimal angle and N is the number of images used.
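As an illustration, equations (3.9) and (3.10), as reconstructed above, could be evaluated for one image as in the sketch below. The homography is assumed to be indexed from 1 as in the text (index 0 unused), and the function names are ours.

```cpp
#include <cmath>

// Slope, intercept and angle of the calibration line for one image,
// following equations (3.9) and (3.10). h is the 3x3 homography between
// four control points in the image and the same points in the reference.
struct CalibrationLine { double k, m, theta; };

CalibrationLine calibrationLine(const double h[4][4]) {
    double num = -h[1][1] * std::pow(h[3][2], 3) + h[1][2] * std::pow(h[3][1], 3)
                 - h[1][1] * h[3][1] * h[3][1] * h[3][2]
                 + h[1][2] * h[3][1] * h[3][2] * h[3][2];
    double den =  h[2][2] * h[3][1] * h[3][2] * h[3][2]
                 - h[2][1] * h[3][1] * h[3][1] * h[3][2]
                 - h[2][1] * std::pow(h[3][2], 3) + h[2][2] * std::pow(h[3][1], 3);
    double k = num / den;

    double r2 = h[3][1] * h[3][1] + h[3][2] * h[3][2];
    double m = (h[2][1] * h[3][1] + h[2][2] * h[3][2]) / r2
             - (h[1][1] * h[3][1] + h[1][2] * h[3][2]) / r2 * k;

    return { k, m, std::atan(k) };              // theta in radians
}
```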

3.4.9 Blur detection

Determining when an image is in focus and as sharp as possible is of interest for several reasons. When detecting markers of any variety, the image should be as sharp as possible to give an accurate determination of its position. When calibrating the camera it is preferable to accurately determine the focus distance of the optics so as to know what focus settings to use when filming from what distance. It could possibly be used for auto-focus. Traditionally, the lens focus setting for a particular distance is determined by having the camera look at a pattern, such as the Siemens star as seen in figure 3.11, and manually adjusting the focus.

FIGURE 3.11: The Siemens star pattern.

Research also exists into determining the focus of an image automatically. Pertuz et al. [38] did a review of a number of different focus measurement methods for use in shape-from-focus depth estimation, that is, using focus distance and the depth-of-field blur effect to estimate the shape and distance of objects. Pertuz et al. divide the operators used for focus measurement into six families:

1. Gradient-based operators. Using the gradient or first derivative of an image.

2. Laplacian-based operators. Using the second derivative or Laplacian of an image.

3. Wavelet-based operators. Using the capabilities of the wavelet transform to describe frequency and spatial content of an image.

4. Statistics-based operators. Using image statistics as texture descriptors to compute the focus.

5. DCT-based operators. Using the discrete cosine transform to compute focus based on the frequency of the image.

6. Miscellaneous operators. Operators not belonging to any of the other five categories.

Laplacian-based operators were shown to have the best performance overall, alongside wavelet-based operators. However, they were also shown to be the most sensitive to noise. One operator that showed generally good performance was Nayar's modified Laplacian

\Phi(x,y) = \sum_{(i,j) \in \Omega(x,y)} \Delta_m I(i,j)   (3.12a)

\Delta_m I(i,j) = |I * L_x| + |I * L_y|   (3.12b)

where

L_x = \begin{pmatrix} -1 & 2 & -1 \end{pmatrix}   (3.12c)

L_y = L_x^t   (3.12d)

and \Phi(x,y) is the focus at pixel (x,y), \Omega(x,y) is the local neighborhood of the pixel and I(i,j) is the value of pixel (i,j).
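A minimal OpenCV implementation of this focus measure, summed over the whole image rather than over a local neighborhood, could look as follows. The function name and the choice of a global sum are ours.

```cpp
#include <opencv2/opencv.hpp>

// Global focus measure based on Nayar's modified Laplacian, equation (3.12):
// the image is convolved with [-1 2 -1] horizontally and vertically and the
// absolute responses are summed. Larger values indicate a sharper image.
// A windowed sum around each pixel would give the local measure Phi(x, y).
double modifiedLaplacianFocus(const cv::Mat& gray)
{
    cv::Mat img, lx, ly;
    gray.convertTo(img, CV_32F);

    cv::Mat kx = (cv::Mat_<float>(1, 3) << -1.f, 2.f, -1.f);  // L_x
    cv::Mat ky = kx.t();                                      // L_y = L_x^t

    cv::filter2D(img, lx, CV_32F, kx);
    cv::filter2D(img, ly, CV_32F, ky);

    cv::Mat ml = cv::abs(lx) + cv::abs(ly);    // |I * L_x| + |I * L_y|
    return cv::sum(ml)[0];
}
```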

3.5 Depth detection and shape reconstruction

One of the most common ways of detecting depth in images is through triangulation. By using the projection of a point in 2D in two or more images, the 3D position of the point can be determined (figure 3.12). In theory this problem is trivial. If the camera's extrinsic and intrinsic parameters are known exactly, it is simply a matter of finding the intersection of the lines going from the camera's center of projection through the 2D projection of the point. In practice however, the camera's parameters are not known with 100% accuracy, and neither will the detection and matching of the control point be, nor will it have infinite resolution [39].

FIGURE 3.12: Locating a point in space using epipolar geometry and two cameras [40].

This way of detecting depth is fundamentally how human depth vision works. An estimation of the depth of objects can be made by combining the images created by both eyes. Many computer vision applications use the same principles for depth detection using stereo camera systems. One major problem, however, is to correctly match points between images. Methods for detecting and matching arbitrary points in images, such as SIFT (Scale-invariant feature transform), exist but are prone to error. This means that in practice stereo depth cameras will have a lot of noise and require a lot of guessing, cleaning of data and interpolation.

One interesting method for reconstructing a 3D object without having to deal with the problem of correctly matching points between images is space carving [41]. This method is however limited by requiring a large number of views and a background that is distinct and can easily be separated from the object being reconstructed, such as a green screen. Rather than projecting and finding the intersection of lines at single points, the silhouette of the object is detected and a virtual cone is projected out from the center of projection to the silhouette's contour, classifying all points outside the cone as empty. This is repeated for a large number of views, classifying more and more space as empty (essentially carving away virtual space) until an accurate model remains. A drawback of space carving, besides the elaborate setup required, is that it cannot deal with cavities since these will not be visible from any direction. Complex objects or scenes with several objects may also provide less than optimal results due to the objects obscuring each other from certain views.

The problem of finding the depth of filmed objects can be solved on several different levels, where the most basic is a way of finding an approximate depth of the whole image and the ideal is a perfect 3D reconstruction of the filmed material.

In a green screen studio, where foreground objects can easily be distinguished from the green background, a multi-camera setup can be used either for space carving or for rough trigonometry, identifying objects in front of the green screen as blobs and calculating an average distance. A single-camera setup could possibly be used for giving a rough approximation of an object's 3D position. That assumes that the floor of the studio is visible to the camera, that the camera parameters are known relative to the floor and that objects are placed on the floor. If the lowest points of objects are detected, it is simply a matter of projecting the points from the camera plane to the floor plane.
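As an illustration of the single-camera case, the sketch below back-projects a detected floor-contact pixel and intersects the viewing ray with the floor plane. It assumes a world frame with the floor at z = 0 and a known camera pose; the types and names are chosen for illustration and are not code from the studio system.

```cpp
#include <array>

using Vec3 = std::array<double, 3>;
using Mat3 = std::array<std::array<double, 3>, 3>;   // row-major 3x3

// Rough 3D position of a point standing on the studio floor from a single
// view. fx, fy, cx, cy are the camera intrinsics, Rwc maps camera
// coordinates to world coordinates and C is the camera center in world
// coordinates. The pixel (u, v) is the lowest detected point of a
// foreground object, i.e. where it touches the floor.
Vec3 projectToFloor(double fx, double fy, double cx, double cy,
                    const Mat3& Rwc, const Vec3& C, double u, double v)
{
    // Viewing ray in camera coordinates for the pinhole model.
    Vec3 rayCam = { (u - cx) / fx, (v - cy) / fy, 1.0 };

    // Rotate the ray into world coordinates.
    Vec3 rayWorld = { 0.0, 0.0, 0.0 };
    for (int r = 0; r < 3; ++r)
        for (int c = 0; c < 3; ++c)
            rayWorld[r] += Rwc[r][c] * rayCam[c];

    // Intersect C + s * rayWorld with the floor plane z = 0.
    double s = -C[2] / rayWorld[2];
    return { C[0] + s * rayWorld[0], C[1] + s * rayWorld[1], 0.0 };
}
```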

Chapter 4

Method

4.1 Previsualization using game engines

A previsualization solution for improved rendering at Stiller Studios should be able to:

• Provide a high-quality real-time rendering.

• Control the virtual camera with streamed motion control data.

• Build scenes that can easily be modified during recording.

• Manage timeline-based animation.

• Be automated to facilitate use.

• Receive image data via the DeckLink SDK in real time that can be composited with a scene.

High-end game engines are built to provide high-quality real-time rendering, tools and assets for building and modifying virtual scenes, and to be highly extendable and customizable with code. These factors make high-end game engines very interesting as potential previsualization tools. To find a suitable game (or other graphics) engine for the given task, an evaluation was conducted where graphics performance, usability and development possibilities were compared. To examine this, software documentation was used and, in some cases, meetings with the engines' developers themselves. Simple test implementations were also performed to give a more thorough understanding of the engines' different benefits and disadvantages.

4.1.1 CryENGINE + Film Engine

Film Engine, previously known as Cinebox, is a new film and previsualization tool still under development, built as an extension of the game engine CryENGINE [42]. Being built for film, Film Engine has both video-in and keying functionality. The tool handles complex graphics, has good interaction capabilities and high performance. Film Engine offers convenient export options as well as live sync to Maya, and also has live motion capture functionality. CryENGINE has a unique visual scripting language and supports Lua scripting. The Film Engine developers have also made it possible to script in Python. Scripts can be sent from other applications to Film Engine to control the software. The support for plugins, however, is less extensive than for both of the graphics engines Unity and Unreal Engine (evaluated below).


A great CryENGINE advantage is its fully dynamic real-time global illumination with voxel ray tracing. The source code for CryENGINE is hard to come by, but Amazon recently released their game engine Lumberyard [43], which is entirely based on CryENGINE and whose source code is free to download.

4.1.2 Ogre3D + MotionBuilder

The possibility to keep MotionBuilder and, if so, to add another rendering utility to the already existing previsualization was also investigated. Ogre3D [44] is not a game engine in itself, but it handles graphics and is often used as a component in games. The source code is open for modification and use. A disadvantage of Ogre3D is that there is no global illumination (except simple shadows). However, such a solution would mean that Stiller Studios could continue to use the system they have already mastered in full, which was considered a smooth and convenient solution. Stiller Studios is interested in using the wealth of models, tools and effects found in the different game engines. If an improved rendering were to be made in MotionBuilder [45], the task would still be to solve how the previsualization artists should work with MotionBuilder to build good-looking scenes. Possibly it would be desirable to build additional tools for this or, for example, to import scenes with materials from Maya. Game engines are made to build 3D scenes and produce fancy real-time renderings of those, making game engines easier to evaluate from an artistic previsualization point of view.

4.1.3 Stingray

Stingray [46] is a new gaming and visualization engine developed by the world's leading design software publisher Autodesk, which is also the publisher of both Maya and MotionBuilder. This gives Stingray great potential when it comes to connecting the engine with the already existing previsualization system. Stingray has scripting capabilities in Lua and source code written in C++/C#, to which access was given for this project thanks to interest shown by Autodesk themselves. It is essential to be able to expand the editor's own functionality in a modular way. Here Stingray is currently lacking, although it can be controlled by scripting. The fact that Stingray is new on the market also makes it difficult to handle due to the lack of available documentation. The graphics in Stingray support deferred rendering and global illumination based on Autodesk's Beast, which requires baking. Since the lighting of the scene is precomputed when baking, the scene, or the parts of the scene affected by the lighting, must remain static.

4.1.4 Unity 5

Another proposal for a possible previsualization software was the game engine Unity 5. Unity's scripting system, built on C#, makes this tool a strong candidate among the other applications. It is undoubtedly the game engine that, in the shortest development time, gave the most results when it came to exploring the possibility of adding custom camera functionality in the editor, thanks to a very detailed documentation [47]. Compared with for example Unreal Engine, Unity is not equally comprehensive and lacks well-developed support for cinematics. Unity also has a special license for companies that earn above a certain amount, which makes it less useful for a company like Stiller Studios compared to an engine like Unreal Engine. For global illumination Unity uses a product called Enlighten [48]. Besides baked lighting, Enlighten offers the option to use precomputed light paths. Instead of storing the resulting lightmaps as with baking, the light paths, or visibility between surfaces, are stored. This allows for real-time diffuse global illumination with a moving light source by saving processing power on light bounces. This however still requires the scene to remain static.

4.1.5 Unreal Engine 4

The source code for the game engine Unreal Engine 4 is open and available on the web-based Git repository hosting service GitHub. The editor of the game engine can be extended with plugins in C++ [49]. Unreal Engine has great advantages since it is possible for the program to create animations and save camera data in the scene. The game engine lacks, however, in its real-time global illumination (which requires baking) and also when it comes to the possibilities of controlling and modifying the editor. Writing plugins to expand the editor is no easy task when the available documentation hardly covers anything more than game mode and not the actual editor. In addition to writing plugins, it is also possible to script with Unreal Engine's visual scripting language Blueprints, but it is limited and lacks sufficient documentation for this specific task. Unreal Engine's graphics support deferred rendering and a baked global illumination solution called Lightmass. Something that could be of interest to note is that after Unreal Engine 4.10 was evaluated, Unreal Engine 4.11 and 4.12 were released, with 4.12 introducing a number of new features to facilitate using Unreal Engine as a movie making tool, such as the improved cinematic and animation tool Sequencer and the CineCameraActor allowing for greater control of the virtual camera settings.

4.1.6 Game engine compilation

The different applications evaluated for the previsualization mostly consist of various game engines. The tables below show a summary of the information generated during the evaluation. Since Ogre3D is not a game engine, it is left out of the compilation, which is meant to serve as a clear comparison between engines. Implementing Ogre3D would also mean a very different solution to the project, where the current previsualization system remains, which is another reason for keeping it separate.

| Graphics | Unreal Engine 4.10.2 | Stingray | Unity 5.3.2 | CryEngine |
| --- | --- | --- | --- | --- |
| Rendering path | Deferred rendering | Deferred rendering | Deferred/forward rendering | Deferred rendering |
| Anti-aliasing | Temporal anti-aliasing | Variation of temporal anti-aliasing | Doesn't have any anti-aliasing apart from post-process unless a forward rendering path is used with a gamma buffer | Combination of pattern recognition post-process and temporal anti-aliasing |
| Global illumination | Lightmass | Beast | Enlighten | Voxel ray tracing |

| Development | Unreal Engine 4.10.2 | Stingray | Unity 5.3.2 | CryEngine |
| --- | --- | --- | --- | --- |
| Source access | C++ | C++/C# (given for Stiller Studios) | Possible to get with special license (C++) | Lumberyard (C++) (not for Film Engine) |
| Editor plugins | C++ | Not yet | C++/C# | C++ |
| Editor scripting | Blutility (very limited) | Script editor (Lua) | C# | Lua/macros |
| Documentation | Decent | Insufficient | Detailed | Insufficient |

| Usability | Unreal Engine 4.10.2 | Stingray | Unity 5.3.2 | CryEngine |
| --- | --- | --- | --- | --- |
| Cutscene editor | Matinee | Lua script editor | Plugin | Track view |
| Editor console command | Yes | No | No | Yes |
| Stiller Studios prefers | Yes | No | No | Yes |
| License | Free | Fee | Fee for Unity Pro (required if the company income is more than 100,000 US dollars per year) | Free as Lumberyard, fee for CryENGINE/Film Engine |
| Exporting frames | Matinee | Lua script | Plugins | Film Engine |
| Assets (models, audio, images) | Great support | Great support | Great support | Great support |

4.1.7 Previsualization tool implementation and selection

Two software packages stood out in the selection of the most suitable tool for Stiller Studios - Film Engine and Unreal Engine. Film Engine is, unlike the other programs, actually made for the film industry. Functionality such as video-in and keying already exists, which facilitates the process of getting the software to work in the studio. Film Engine is however still under development and therefore not a finalized piece of software like Unreal Engine, which has really good graphics support. Unreal Engine is also open source and more fully documented. The reason why Ogre3D, Stingray and Unity did not make it to the top league mostly comes down to the Stiller Studios crew's own preferences. The previsualization artists did not approve of the other engines, whether with regard to graphics or usability, which led to a decision not to invest in these for further development. Due to great trust in both Film Engine and Unreal Engine, a deeper investigation was conducted for each one, examining the possibilities to integrate them with the rest of the previsualization system components, which is essential for the selection of the most suitable tool.

Unreal Engine implementation

Stiller Studios had a rudimentary solution for camera tracking in Unreal Engine even before this thesis, consisting of two parts - a simple MotionBuilder device plugin that takes camera orientation in Euler angles, position and optional extra data and sends this to a specified IP port over UDP, and a custom version of Unreal Engine with the source code modified to receive the camera data UDP packets and set the editor view accordingly. This solution has several problems. If the solution was updated or a newer version of Unreal Engine was installed, the solution had to be reimplemented in the source code and the entire source had to be recompiled, which leaves much room for errors and takes a long time. In addition, the solution locked the entire editor if no UDP packet was received. The solution also wrote to arrays out of bounds, possibly explaining some random crashes.

To facilitate continued development, the existing solution was reimplemented as a plugin. This was somewhat tricky due to the limited development opportunities, but it resulted in a more modular solution that is easier to maintain and test. To get around the problem with the locked editor, attempts were made to read the UDP packets in a separate thread. This however created a new problem. The motion data is sent from MotionBuilder at the same frequency as the frame rate of the camera. By blocking the rendering of the editor by having the UDP request in the same thread, the frame rate of Unreal Engine would be forced to be the same as the frame rate of the camera, as long as the frame rate of the camera was lower than the frame rate of Unreal Engine. By receiving the motion data asynchronously, the frame rate of Unreal Engine would not match the camera, creating notable motion artifacts when moving it. To avoid having the editor lock up when not receiving UDP packets, the solution was instead remade to continue Unreal Engine's update cycle if no packet was received after a certain amount of time.

The possibilities of getting camera image data into Unreal Engine have also been examined. A solution was found using OpenCV (section 4.2.1) to bring a webcam or other video feed into the engine as an animated texture, but in Unreal Engine this only works in game mode. Another problem arose when it came to getting the camera image from the DeckLink card, as the SDK's IDL file cannot be compiled in Unreal Engine C++ projects. A possible solution to this would be to build a separate library that communicates with the DeckLink card and in turn include this library in the Unreal Engine plugin.
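The timeout behaviour described above can be sketched with plain sockets as below. This only illustrates the idea; the actual plugin is not shown in the thesis and would be built against Unreal Engine's own networking and threading facilities rather than the raw POSIX calls used here.

```cpp
#include <sys/select.h>
#include <sys/socket.h>

// Blocking UDP receive with a timeout: the editor tick waits for a motion
// data packet but gives up after timeoutMs, so the update cycle can continue
// if no packet arrives. Returns the number of received bytes, or -1 on
// timeout or error.
int receiveWithTimeout(int sock, char* buffer, int bufferSize, int timeoutMs)
{
    fd_set readSet;
    FD_ZERO(&readSet);
    FD_SET(sock, &readSet);

    timeval timeout;
    timeout.tv_sec = timeoutMs / 1000;
    timeout.tv_usec = (timeoutMs % 1000) * 1000;

    // Wait until a datagram is available or the timeout expires.
    int ready = select(sock + 1, &readSet, nullptr, nullptr, &timeout);
    if (ready <= 0)
        return -1;                         // timeout or error: skip this frame

    return recv(sock, buffer, bufferSize, 0);
}
```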

Film Engine implementation

The fact that Film Engine is a cinematic tool gives many advantages. Putting up an executable version in the studio with camera data in and functioning in-editor keying took only a work day, which is significantly more time efficient than it would have been for the other surveyed programs. Given the access to Film Engine's motion capture API, it was possible to develop a plugin to connect the camera data from Flair and MotionBuilder to the virtual camera in Film Engine. Via the motion capture API it is also possible to control several other camera features in addition to the position and direction, such as aperture, focal distance and focal length. When live keying inside Film Engine there was a notable but constant delay of the camera image relative to the motion data. A naive but workable solution (since this delay seemed to be more or less constant) is to set up a corresponding delay in the movement of the camera data manually. This was done by sending a value D from MotionBuilder to Film Engine along with the camera data, setting up a queue in the Film Engine motion capture plugin, storing the motion data and sampling the motion data from the specified number of frames (D) earlier in the queue.
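The fixed-delay queue can be sketched as follows. The class and field names are illustrative and not part of the Film Engine motion capture API.

```cpp
#include <cstddef>
#include <deque>

struct CameraSample { float position[3]; float rotation[3]; };

// Fixed-delay buffer for motion data: each frame the newest sample is pushed
// and the sample D frames old is returned to drive the virtual camera,
// compensating for the (roughly constant) delay of the video feed.
class DelayedMotionQueue {
public:
    explicit DelayedMotionQueue(size_t delayFrames) : delay(delayFrames) {}

    CameraSample push(const CameraSample& latest) {
        samples.push_back(latest);
        // Keep exactly delay + 1 samples so front() is D frames behind.
        while (samples.size() > delay + 1)
            samples.pop_front();
        return samples.front();
    }

private:
    size_t delay;
    std::deque<CameraSample> samples;
};
```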

A rudimentary depth placement was also implemented by rendering the video-in feed from the camera on a plane in the scene, locking the plane to the view and scaling and moving it depth-wise. This depth placement allows actors to be behind virtual objects in well planned scenes with known depth. A dialogue took place with Film Engine's developers, who continuously received feedback to match the software to Stiller Studios's requirements. The perceived problems with Film Engine at the first implementation were:

• Live delay in terms of sync between the camera and the engine. The engine should preferably synchronize with the FPS (frames per second) of the input signal.

• Lack of FPS timeline with the opportunity to scroll back and forth in given animations.

• Unautomated system. The desired result should only be a few clicks away.

• Bugs and random application crashes.

Film Engine's development team took responsibility for developing the tool further with the above challenges in mind. The team visited the studio and was therefore well aware of the problems mentioned. To get around the limitations of the data sent directly from Flair, as well as for instance restrictions on the timeline in Film Engine, data from the camera motion was first sent to MotionBuilder and then to Film Engine. This made it possible to use MotionBuilder's timeline and to control other parameters, such as depth of field and field of view, externally.

4.2 Improved camera calibration

4.2.1 OpenCV

OpenCV is an open source computer vision library for C++ that includes many of the algorithms and methods for dealing with the problems discussed so far [11]. OpenCV includes methods for importing and exporting images and video, a wide array of imaging algorithms and operators, a camera calibration function based on Zhang's method, as well as functions for detecting control points, solving the Perspective-n-Point problem and more. Several extensions also exist for OpenCV, such as the fiducial marker library ArUco [50], which provides functions for generating and detecting square data matrix markers.
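For orientation, a minimal call to OpenCV's calibration function is sketched below. The point containers would be filled by the marker detection described in the following sections; the wrapper function itself is ours.

```cpp
#include <opencv2/calib3d.hpp>
#include <vector>

// Minimal use of OpenCV's Zhang-style calibration: objectPoints holds the
// known 3D positions of the control points on the calibration board (one
// vector per image, z = 0 for a planar board) and imagePoints the matching
// detected 2D positions.
void calibrateExample(const std::vector<std::vector<cv::Point3f>>& objectPoints,
                      const std::vector<std::vector<cv::Point2f>>& imagePoints,
                      cv::Size imageSize)
{
    cv::Mat cameraMatrix, distCoeffs;          // intrinsics and distortion
    std::vector<cv::Mat> rvecs, tvecs;         // per-image extrinsics

    double reprojectionError = cv::calibrateCamera(
        objectPoints, imagePoints, imageSize,
        cameraMatrix, distCoeffs, rvecs, tvecs);

    // cameraMatrix now holds fx, fy, cx, cy; distCoeffs holds k1, k2, p1, p2, k3.
    (void)reprojectionError;
}
```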

4.2.2 Previzion

Lightcraft Technology's Previzion software includes a camera calibration process. While the documentation for Previzion does not include any information on how the system actually works, only instructions on how to use it, some parts can be deduced [51]. Previzion's system is capable of solving the camera's intrinsic parameters as well as the center of projection's distance from the calibration target. The Previzion system works by rigging a large motorized calibration board (figure 4.1) in front of the camera. The board is aligned so that the center of the board is on the camera's optical axis. The aperture of the camera is closed to its smallest setting to avoid blur. When the calibration is activated, the motorized board swings around its vertical axis. Images are captured by Previzion using the camera and these are used for calibration.

FIGURE 4.1: The Previzion calibration board setup [51].

Previzion's calibration board consists of a number of square data matrix markers that become smaller the closer to the center of the board they are. This can probably be motivated by the fact that the markers should be small in order to allow for more markers; too small markers, however, are untrackable, and when the board is viewed at a sharp angle the outermost markers will be further away and appear smaller, and thus need to be larger to be detectable. The use of data matrices for markers greatly simplifies matching the control points to the predefined model, since each individual marker can be identified exactly without taking its position relative to the other markers into account. Since not all markers on the wide board will be visible to the camera at the same time, simpler markers would make correct identification of the markers a lot harder.

4.2.3 A new camera calibration process

The purpose of the camera calibration process is to find intrinsic parameters for different focus and zoom settings and also, given the motion control system, the position of the center of projection and the camera orientation relative to the motion control robot. The positional offset along the optical axis and the focal length are the most important factors to find when matching the virtual and real camera, since these naturally vary depending on the lens. Other intrinsic parameters, such as the optical center and lens distortion, are generally the result of flaws in the camera or lens itself or, in the case of errors in position and orientation, the result of misalignment between the camera and the robot or faulty calibration of the motion control robot.

4.2.4 Conditions

The setup of Stiller Studios's system and studio provides several benefits as well as limitations. The studio has highly controllable lighting of high quality. The camera movement is controlled by a high precision motion control robot, and the movement as well as zoom, focus and iris settings are recorded. The camera and most lenses used are of high quality. The use of the motion control system is however limited, primarily for safety reasons. It is not allowed to use software to take control of the robot, even though it is doable in theory. The reasons for this are lack of documentation, dangerous risks and huge costs if something goes wrong. In general the motion control operator prefers that the camera itself moves as little as possible, even if the movement is predefined by a pattern - once again for safety reasons and because of the time and attention needed to keep things safe. Still, the calibration process should be as automated as possible and require little to no understanding by the operator of how the calibration process works.

4.2.5 Calibration process

The suggested calibration process has been developed through ongoing discussion with Stiller Studios's motion control operator, since that is the person responsible for camera calibration and also the one who would use the methods. As mentioned, there was a strong preference for not moving the camera during calibration. A possible setup was therefore considered in the form of a precisely constructed rig at the end of the camera robot's rail, with a mountable motor controlling a rotating calibration board akin to the one used with Previzion. The proposed calibration process can be described with the following steps:

1. A calibration pattern board is mounted at the end of the robot rail.

2. The aperture is closed to its smallest size.

3. The desired focus and zoom for the robot is set.

4. The robot is positioned to look at the center of the calibration board at the correct distance along the rails for optimal focus and zoom.

5. The calibration process starts when the calibration board begins to rotate back and forth to sharp angles towards the camera and a series of images then gets collected.

6. The calibration is done using OpenCV’s camera calibration function.

7. The resulting data is saved.

8. Steps 3 through 7 are repeated for the desired focus and zoom values.

4.2.6 Calibration software

An application was developed to perform the calibration using C++ and OpenCV, capable of taking a video feed either via webcam, DeckLink or by reading images from a folder, and motion data either directly from Flair or from a Flair-exported JOB file with saved motion data. There are settings for calibration such as initial parameters, parameter locking, calibration target position and orientation, image input and motion data input. The program provides a live view of the video input together with a blur measure. The interface provides a single button for starting the capture and starting the calibration once capturing is completed. The calibration process itself is run asynchronously from the rest of the application, allowing the user to set up the camera for the next calibration while the calibration function is executed.

4.2.7 Calibration pattern design and marker detection

The calibration process is agnostic to what kind of calibration pattern is used, as long as enough control points can be clearly detected and matched to a known model. The marker detection code was structured using a strategy pattern for easy selection of the marker type and calibration pattern. Two custom pattern types were developed for testing: an ArUco-based fiducial pattern inspired by Previzion's calibration process, and a concentric circle pattern.

ArUco pattern

For ArUco-based patterns, the reference model for the board was not manually configured or generated as a mathematical description at the board's creation. Instead, an image of the calibration pattern using ArUco markers was given, the markers were detected and their coordinates were used as model points. This means that an ArUco calibration pattern can be made using an image editing program of choice and a database of ArUco markers. Detecting and identifying ArUco markers is a built-in functionality. Using OpenCV's subpixel function, the detection of the markers' corners can be improved. One issue that occurred in testing was the subpixel function detecting corners inside the ArUco tags' borders for small tags. This can be mitigated by making the ArUco tags' borders thicker. To make sure this problem would not occur, the detection algorithm was made to skip markers within an area small enough that the search area of the subpixel function risked being larger than the border of the marker itself. Another problem occurred with markers at the edge of the image. If parts of a marker's border were partly outside the image it would still be detected, but with incorrect corners. This was solved by simply skipping markers with a corner close to the edge.

Calibration patterns that can be used from a large number of distances are needed in order to allow calibration of a high variety of lenses with different optical centers and angles of view. A calibration board similar to the one used in Previzion solves this. Such a calibration board can be generated using quadtrees. By defining the calibration board as a quadtree, or a grid of quadtrees, and letting the depth of the quadtree be a function of the distance from the board's center, with larger depth closer to the center, along with using the Chebyshev distance metric, a grid subdivided into progressively smaller cells closer to the center is obtained. An ArUco fiducial is then placed in each of these cells.

FIGURE 4.2: ArUco board generated using quadtrees and the Chebyshev distance metric.
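The marker filtering described above - skipping markers near the image border or too small for safe subpixel refinement - can be sketched as follows, assuming the opencv_contrib ArUco module (API as of OpenCV 3.x). The dictionary choice and thresholds are illustrative, not values from the thesis implementation.

```cpp
#include <opencv2/aruco.hpp>
#include <opencv2/imgproc.hpp>
#include <cmath>
#include <vector>

// Detect ArUco markers and refine their corners, discarding markers that
// are too small or too close to the image border.
void detectRefinedMarkers(const cv::Mat& gray,
                          std::vector<std::vector<cv::Point2f>>& corners,
                          std::vector<int>& ids)
{
    auto dict = cv::aruco::getPredefinedDictionary(cv::aruco::DICT_5X5_250);
    cv::aruco::detectMarkers(gray, dict, corners, ids);

    const float minSide = 12.f, margin = 8.f;
    for (size_t i = corners.size(); i-- > 0;) {
        bool keep = true;
        for (const auto& p : corners[i]) {
            // Reject markers with a corner near the image edge.
            if (p.x < margin || p.y < margin ||
                p.x > gray.cols - margin || p.y > gray.rows - margin)
                keep = false;
        }
        // Reject markers so small that the subpixel search window could
        // reach outside the marker border.
        float dx = corners[i][0].x - corners[i][1].x;
        float dy = corners[i][0].y - corners[i][1].y;
        if (std::sqrt(dx * dx + dy * dy) < minSide)
            keep = false;

        if (keep) {
            cv::cornerSubPix(gray, corners[i], cv::Size(3, 3), cv::Size(-1, -1),
                             cv::TermCriteria(cv::TermCriteria::EPS +
                                              cv::TermCriteria::COUNT, 30, 0.01));
        } else {
            corners.erase(corners.begin() + i);
            ids.erase(ids.begin() + i);
        }
    }
}
```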

Concentric circle pattern

A grid pattern of concentric circles was also developed. The circles are detected using thresholding and OpenCV's built-in function for finding contours. The innermost contours are selected, and if they have centers that are approximately the same as those of their N containing contours, where N is the number of circles representing one concentric circle marker, the marker is added to the list of detected markers. The contour of all markers is detected and the dot product between neighboring contour markers is calculated to correctly match the detected markers with the reference pattern. The markers with the sharpest angle relative to their neighbors are the corners of the calibration pattern. These are used to calculate an approximate homography for the calibration pattern. The reference marker positions are transformed to match the captured image pattern using the homography, and the markers are matched with the reference by pairing each detected marker with its closest match in the transformed reference. One benefit of concentric circles over the standard circle pattern used in OpenCV is that they are more robust to background objects that might accidentally be classified as markers, since concentric circles, or a hierarchy of contours that all share the same center, are less likely to appear in the background by accident.

FIGURE 4.3: Concentric circle pattern.
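A sketch of the concentric center detection idea is given below: contours are extracted with their hierarchy and an innermost contour is kept only if its enclosing contours share approximately the same center. The nesting-depth handling and the tolerance are assumptions for illustration, not the exact logic of the thesis implementation.

```cpp
#include <opencv2/imgproc.hpp>
#include <cmath>
#include <vector>

// Candidate centers of concentric circle markers in a grayscale image.
// rings is the number of circles in one marker.
std::vector<cv::Point2f> findConcentricCenters(const cv::Mat& gray,
                                               int rings, float tol = 2.f)
{
    cv::Mat bw;
    cv::threshold(gray, bw, 128, 255, cv::THRESH_BINARY_INV);

    std::vector<std::vector<cv::Point>> contours;
    std::vector<cv::Vec4i> hierarchy;        // [next, prev, firstChild, parent]
    cv::findContours(bw, contours, hierarchy, cv::RETR_TREE,
                     cv::CHAIN_APPROX_SIMPLE);

    auto center = [&](int idx) {
        cv::Moments m = cv::moments(contours[idx]);
        if (m.m00 == 0) return cv::Point2f(-1e9f, -1e9f);   // degenerate contour
        return cv::Point2f(float(m.m10 / m.m00), float(m.m01 / m.m00));
    };

    std::vector<cv::Point2f> centers;
    for (int i = 0; i < (int)contours.size(); ++i) {
        if (hierarchy[i][2] != -1) continue;           // keep innermost contours only
        cv::Point2f c = center(i);
        int parent = hierarchy[i][3], depth = 0;
        bool concentric = true;
        while (parent != -1 && depth < rings) {        // walk up the hierarchy
            cv::Point2f pc = center(parent);
            if (std::abs(pc.x - c.x) > tol || std::abs(pc.y - c.y) > tol)
                concentric = false;
            parent = hierarchy[parent][3];
            ++depth;
        }
        if (concentric && depth >= rings - 1)
            centers.push_back(c);
    }
    return centers;
}
```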

4.2.8 Image selection

A maximum number of frames used for calibration should be set before calibration. This is important since too many frames will make the calibration very slow without improving the result. However, it is desirable not to force the user to determine which images to use. If frames are taken from similar perspectives this will create bias and possibly not give the algorithm enough information for a good calibration. Even worse, the error will likely not be visible in the reprojection error, since the faulty camera parameters work for the small subset of frames used. A naive approach would be to select a random set of images, or to set a timed interval to capture images with the right frequency to match the maximum number of images and the relative movement of the calibration board and camera.

Attempts were made to use the calibration line concept described in section 3.4.8. This however turned out to be less than ideal - either because of programming errors, lack of understanding of how it should be used, or perhaps the calibration line concept being less than ideal as is. Camera movements with rotation around the calibration pattern center, which can be empirically shown to have very good coverage for calibration, gave a near constant angle of the calibration line, while camera movements with pure translation of the pattern in the image plane, which provide no useful information for calibration, gave differing calibration line angles. Transposing the homography matrix, to take into account possible differences in matrix multiplication conventions between OpenCV and the paper, did not solve the problem with rotational movements and also produced near constant angles for the case of pure translation.

Instead, a method using OpenCV's Perspective-n-Point (PnP) function was devised. A camera matrix is constructed using the dimensions of the image, and the extrinsic parameters of the image are calculated using PnP. The vertical and horizontal rotation of the plane can then be extracted from the resulting rotation matrix, and the Euclidean distance between pairs of these two angles is used as a distance metric. Finding the optimal variance can be done by taking the set of captured images and finding the size N subset, where N is the number of images that should be used for calibration, whose smallest distance between two neighboring images is maximized. For simplicity, and to avoid having to store a huge number of images, a simpler solution was implemented, as described by the following pseudocode

function SELECT(ImageCapture images, Integer numberOfImages)
    repeat
        Add image from images to selectedImages
    until size of selectedImages equals numberOfImages
    repeat
        Add image from images to selectedImages
        Select imagePair from selectedImages closest to each other
        Select image img1 from imagePair closest to a third image img2 in selectedImages
        Remove img1 from selectedImages
    until endCriteria
    return selectedImages
end function

where endCriteria could be a maximum number of captured images, a minimum distance between the captured images for maximum coverage, or a combination thereof.

What this algorithm does in essence is to check whether each newly captured image adds more information to the current set of selected images. If the new image is further away - "more different" - than the image in the set currently closest to the other images, then it replaces that image.
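A compact C++ version of this greedy replacement step could look as follows. The descriptor struct and container choices are ours, with the two plane rotation angles from PnP used as the image descriptor as described above.

```cpp
#include <algorithm>
#include <cmath>
#include <limits>
#include <vector>

// Each image is reduced to a 2D descriptor: the vertical and horizontal
// rotation of the calibration plane. A new image replaces a selected one
// only if it increases the spread of the selected set.
struct Descriptor { double rx, ry; };

static double dist(const Descriptor& a, const Descriptor& b) {
    return std::hypot(a.rx - b.rx, a.ry - b.ry);
}

void addCandidate(std::vector<Descriptor>& selected, const Descriptor& img,
                  size_t maxImages)
{
    selected.push_back(img);
    if (selected.size() <= maxImages) return;     // still filling the set

    // Find the two selected images closest to each other.
    size_t a = 0, b = 1;
    double best = std::numeric_limits<double>::max();
    for (size_t i = 0; i < selected.size(); ++i)
        for (size_t j = i + 1; j < selected.size(); ++j)
            if (dist(selected[i], selected[j]) < best) {
                best = dist(selected[i], selected[j]);
                a = i; b = j;
            }

    // Of that pair, remove the image that is closer to the rest of the set.
    auto nearestOther = [&](size_t idx, size_t other) {
        double d = std::numeric_limits<double>::max();
        for (size_t i = 0; i < selected.size(); ++i)
            if (i != idx && i != other)
                d = std::min(d, dist(selected[idx], selected[i]));
        return d;
    };
    selected.erase(selected.begin() +
                   (nearestOther(a, b) < nearestOther(b, a) ? a : b));
}
```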

4.2.9 Iterative refinement of control points

For improved calibration, the iterative control point refinement method described by Datta et al. [30] was implemented as described: calibrating, undistorting the images and using the resulting intrinsic and extrinsic parameters to calculate the homography transform that yields the frontal image for each captured image. The markers in the frontal images are detected and transformed back using the inverted homography and OpenCV's projectPoints function for redistortion. Then the calibration is repeated. The homography for the frontal image is obtained by

H_r = M R^t M^{-1}   (4.1)

where H_r is the homography, M is the camera matrix and R the rotation matrix. However, since the image is also translated it will be off center. To make sure the pattern is visible, the correct translation needs to be added.

C = -\frac{M R^t T}{(R^t T)_y}   (4.2a)

H_t = \begin{pmatrix} 1 & 0 & C_x \\ 0 & 1 & C_y \\ 0 & 0 & C_z \end{pmatrix}   (4.2b)

H = H_t H_r   (4.2c)

where T is the translation vector.

4.2.10 Calculating extrinsic parameters relative to the robot's coordinates

A static calibration board with known position, orientation and size and a moving camera should in theory allow calculation of the extrinsic parameters of the camera for every image used. The motion capture data stream is however not synced with the video feed. There is likely both a slight delay and a difference in frame rate. When filming, the motion control will lock to the camera's frame rate. However, when the frame rate is not connected to the camera it is stuck at 50 Hz. This means that taking the difference between the orientation and position given by Flair and by the calibration will not necessarily give the right offset and orientation of the virtual camera. Thus, the extrinsic parameters are not guaranteed to fit the motion capture data. If the camera however is still for the first few frames, the motion capture data will match for the first frame and can thereby be used. This also holds true when calibrating using a moving target and a stationary camera, assuming that the calibration target is at its known position and orientation in the initial frame.

Offline calibration

The calibration can also be done in offline mode, using filmed material and recorded motion control data. This has two advantages:

1. Filmed video taken from the camera's physical memory is of much higher quality, with an image width of 4800 pixels versus the 1080 pixels of the video stream.

2. The recorded motion control data is synced with the image feed.

4.3 Camera tracking solution review

A number of commercial real-time camera tracking solutions were briefly looked into for integration into the Stiller Studios setup.

4.3.1 Ncam

The Ncam system (figure 4.4) is capable of camera tracking in an unstructured scene, that is, without requiring any specially constructed markers for reference. It uses a combination of stereo cameras and other sensors for detecting the camera's position and orientation [52].

FIGURE 4.4: Ncam sensor mounted on a film camera [52].

Ncam has not been tested live, and it is unclear how well it performs in scenes with a lot of movement, or whether a green screen studio has to be modified for Ncam in order for it to reliably find control points against the green backdrop.

4.3.2 Trackmen

Trackmen is a company that specializes in camera tracking [53] and has several optical tracking solutions using a variant of square data matrix fiducial markers, see figure 4.5. Instead of square data cells, a grid of circles is used.

FIGURE 4.5: Trackmen’s optical tracking solution with fiducial markers [53].

Trackmen's optical tracking can be done either by using only the camera image or by attaching an extra camera sensor to the film camera. The latter removes any worries about the markers being obscured by props, since they can be put on the floor or ceiling.

4.3.3 Mo-Sys

Mo-Sys has both an optical tracking system known as the StarTracker and several mechanical tracking systems using stands and cranes, as well as rotary sensors for zoom and focus on the camera lens [54]. Mo-Sys's StarTracker optical camera tracking system was briefly tested live at Stiller Studios. The StarTracker works by attaching a small computer with a gyro and an extra camera sensor with ultraviolet diodes to the film camera, as shown in figure 4.6. Circular reflectors are then placed in the studio in a random pattern visible to the camera sensor, usually on the ceiling. The diodes combined with the reflectors make sure that the control points are clearly and unambiguously detected.

FIGURE 4.6: Mo-Sys’s StarTracker ceiling markers and sensor system [54].

Before filming can start, a somewhat extensive calibration process needs to be performed. This is achieved by moving the camera around the studio and then letting the system construct a model of how the control points are placed. The camera tracking is done by finding the transform that gives the detected control points from the internal model, using information from the previous frames and the gyro as an initial guess. When testing the system, the tracking provided was, from the subjective perspective of Stiller Studios's employees, performing well. No exact pixel measurements or comparison to the motion control robot was made. Some delay during fast camera movements could be observed and, in certain cases, some gliding between foreground and background. This is likely due to an imprecisely positioned center of projection.

4.4 Commercial multi-camera systems

4.4.1 Markerless motion capture systems

Stiller Studios was looking into the markerless motion capture systems produced by The Captury [55] and Organic Motion [56]. These systems work by using a large number of cameras positioned around a person. Software is then used to process these images in real-time to create a skeleton model of the person. This skeleton should also be usable for giving a rough estimation of the depth of a person being filmed, down to the detail of individual body parts, since these are modeled by the system. It is possible that the images produced by these systems could be obtained and used for 3D reconstruction using a method like space carving. Organic Motion has a system in development for real-time 3D reconstruction of people.

4.4.2 Stereo cameras

The ZED camera is a consumer grade stereo depth camera that was tested briefly at the studio. The major problem of using the ZED camera at Stiller Studios is the fact that it uses USB for data and therefore has very limited cable length options, making it hard to place on the motion control robot. Using a depth camera would otherwise open up interesting possibilities for camera tracking, using the 3D reconstruction of the scene given by the depth camera. The Ncam works like this and could possibly also be useful for depth detection.

4.4.3 OptiTrack

Another multi-camera solution making methods such as space carving possible is the one provided by OptiTrack, a motion capture system that uses markers as indicators for 3D position and orientation. This would provide the additional benefit of allowing filming of marker-based motion capture material in the studio. The markers put on the camera itself could also possibly be used for camera tracking.

FIGURE 4.7: OptiTrack motion capture rig with cameras centered around the capture area [57].

4.5 Camera calibration simulation

A camera simulation application using Processing was developed to test the accuracy of the calibration. Processing is a programming language based on Java, designed for ease of use, electronic arts and visual design, and thus a good fit for simple virtual image generation. The application takes a template image of the calibration board as input. An empty virtual environment is initiated and the image is rendered as a texture on a quad in this environment. The extrinsic parameters of the camera can be defined using Processing's built-in camera method. Camera data in Processing takes the camera position, target and up direction as arguments, with roll defined using a rotation around the camera axis while letting the up direction vector be constant. The extrinsic parameters of the camera can also be defined using vector operations such as translation and rotation, which can be applied using predefined methods. Which method is used is a matter of preference or of what kind of movement is to be described. Movement of the camera can either be defined mathematically as a function of time or by reading camera data from a Flair JOB file giving the data in the form of position, target and roll. The aspect ratio and the field of view of the camera can be defined using Processing's perspective method. This does not however yield any control over the offset of the optical center. For greater control of the view cone, the Processing frustum method was used instead. The frustum method takes the left, right, bottom, top and depth coordinates of the near clipping plane and the depth of the far clipping plane, with the position of the center of projection at the origin.

4.5.1 The camera matrix

The left, right, bottom, top and depth coordinates of the near clipping plane together with the center of projection form the projection cone of the virtual pinhole camera and, as such, define the camera matrix. Converting desired focal lengths and optical offsets to the right frustum can be done using basic trigonometry. The depth of the near clipping plane is defined as n_d. The width and height of the near clipping plane (n_w, n_h) are given as a function of the horizontal and vertical focal lengths (f_x, f_y) and the width and height of the image (i_w, i_h) as

n_w = \frac{n_d i_w}{f_x}   (4.3a)

n_h = \frac{n_d i_h}{f_y}   (4.3b)

Thus, with the optical center at zero, the coordinates of the frustum can be defined as

left = -n_w/2, \quad right = n_w/2, \quad bottom = -n_h/2, \quad top = n_h/2   (4.4)

The optical offset can be modeled by adding the scaled optical offset o_x and o_y to the frustum coordinates, giving

left = -n_w/2 + o_x, \quad right = n_w/2 + o_x, \quad bottom = -n_h/2 + o_y, \quad top = n_h/2 + o_y   (4.5)

with o_x and o_y defined as

o_x = (i_w/2 - c_x)/f_x, \quad o_y = (i_h/2 - c_y)/f_y   (4.6)

where c_x and c_y are the optical center in image coordinates. This will however move the image plane so that the center of the image is no longer aligned with the optical axis. Generally the calibration pattern should be aligned with the center of the image rather than the optical axis. This can be achieved by translating the image in the camera's coordinate system by

x = o_x d, \quad y = o_y d   (4.7)

where d is the depth of the calibration pattern from the camera.
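The conversion from intrinsic parameters to frustum coordinates, following equations (4.3) to (4.6), is summarized by the sketch below, written in plain C++ rather than Processing and with names chosen for illustration.

```cpp
// Frustum coordinates of the near clipping plane for a virtual pinhole
// camera: focal lengths fx, fy and optical center (cx, cy) in pixels,
// image size (iw, ih) in pixels, near plane depth nd.
struct Frustum { double left, right, bottom, top; };

Frustum frustumFromIntrinsics(double fx, double fy, double cx, double cy,
                              double iw, double ih, double nd)
{
    double nw = nd * iw / fx;                 // near plane width,  eq. (4.3a)
    double nh = nd * ih / fy;                 // near plane height, eq. (4.3b)
    double ox = (iw / 2.0 - cx) / fx;         // scaled optical offset, eq. (4.6)
    double oy = (ih / 2.0 - cy) / fy;

    return { -nw / 2.0 + ox, nw / 2.0 + ox,   // eq. (4.5)
             -nh / 2.0 + oy, nh / 2.0 + oy };
}
```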

4.5.2 Inverse lens distortion

Correcting for lens distortion using Brown's distortion model can be done trivially using a post-processing shader. For each fragment, the corresponding source coordinate is computed and sampled from the input image. The same equations can also be used to apply lens distortion to an undistorted image. In order to analyze how correct the distortion parameters returned by camera calibration are, it is desirable to simulate lens distortion defined by the same parameters used for undistorting the image. This could also be of interest for matching the real and virtual camera not by undistorting the real camera image, but rather by distorting the virtual camera rendering. To achieve this it is necessary to solve the inverse of Brown's distortion model. No analytical solution exists for this problem. One possible solution considered was to calculate and store a sparse mapping of Brown's distortion model and interpolate between values. This however would be expensive in terms of memory, computation time and algorithmic complexity, while still possibly providing less accurate results due to the sparse mapping and interpolation. Another solution would be to solve the problem by iterative approximation, but only a brief academic discussion about this, by Abeles [58], was found. Abeles presents an algorithm for solving the inverse of Brown's distortion model (section 3.3) using two coefficients for radial distortion and no tangential distortion, which can be expressed as

function DISTORT(point p)
    d = (p − c)/f
    u = d
    repeat
        r = ||u||
        u = d/(1 + K1 * r^2 + K2 * r^4)
    until u converges
    return u * f + c
end function

where p is a point in image coordinates, c is the optical center and f are the focal lengths. The basic idea is to express the undistorted coordinates u as a function of the distorted coordinates d by dividing out the coefficient part m of the model as

m = 1 + K_1 r^2 + K_2 r^4   (4.8a)

d = um   (4.8b)

u = d/m   (4.8c)

r = ||u||   (4.8d)

However, since r is the distance of the point u from the optical center in normalized coordinates, this equation is unsolvable analytically. An initial guess of u = d is used to solve for u. If m is correct, u will not change and the correct answer has thus been found. Abeles further suggests that a similar solution can be found by adding the terms for tangential distortion. This is done by first subtracting the tangential part of the equation and then dividing by the radial part. This, and adding a third distortion parameter, gives the following algorithm

function DISTORT(point p)
    d = (p − c)/f
    u = d
    repeat
        r = ||u||
        ux = dx − p2 * (r^2 + 2 * ux * ux) − 2 * p1 * ux * uy
        uy = dy − p1 * (r^2 + 2 * uy * uy) − 2 * p2 * ux * uy
        u = u/(1 + k1 * r^2 + k2 * r^4 + k3 * r^6)
    until u converges
    return u * f + c
end function

This algorithm was trivially implemented in Processing using a GLSL (OpenGL Shading Language) post-processing shader.
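For reference, a CPU-side C++ sketch of the same fixed-point iteration is shown below. Parameter names follow the pseudocode, and the iteration limit and tolerance are arbitrary choices.

```cpp
#include <cmath>

// Given a normalized, centered coordinate (x, y) = ((px - cx)/fx, (py - cy)/fy)
// in the distorted output image, iteratively solve for the corresponding
// undistorted coordinate to sample from: subtract the tangential terms,
// divide by the radial factor and repeat with the refined estimate.
void distortLookup(double& x, double& y,
                   double k1, double k2, double k3, double p1, double p2,
                   int maxIterations = 20, double eps = 1e-9)
{
    const double dx = x, dy = y;    // target (distorted) normalized point
    double ux = dx, uy = dy;        // current undistorted estimate

    for (int i = 0; i < maxIterations; ++i) {
        double r2 = ux * ux + uy * uy;
        double radial = 1.0 + k1 * r2 + k2 * r2 * r2 + k3 * r2 * r2 * r2;
        double tx = dx - (2.0 * p1 * ux * uy + p2 * (r2 + 2.0 * ux * ux));
        double ty = dy - (p1 * (r2 + 2.0 * uy * uy) + 2.0 * p2 * ux * uy);
        double nx = tx / radial, ny = ty / radial;

        bool converged = std::abs(nx - ux) < eps && std::abs(ny - uy) < eps;
        ux = nx; uy = ny;
        if (converged) break;
    }
    x = ux; y = uy;
}
```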

One interesting question is the convergence of the algorithm. For one parameter of radial distortion with small values, the algorithm should at least move in the right direction each iteration. However, it does not necessarily converge for large distortion, tangential distortion or complex distortions with local maxima and minima. An evaluation of the method's convergence is presented in Appendix A.

4.5.3 Blur and noise

When testing a calibration algorithm it is of interest to test its robustness to noise and blur, which naturally occur in real images. Simulating physically accurate blur is a non-trivial problem. However, the purpose of blur in this context is not to calibrate the depth of field, do shape from focus or similar, but rather to see how the calibration method handles control points that cannot be perfectly detected. A simple way to blur an image is to apply a Gaussian blur filter. The filter calculates the color of each pixel as a weighted sum of the colors of the surrounding pixels, with radially decreasing weights for pixels further away and a falloff depending on the kernel size. Processing includes a Gaussian blur filter with varying kernel size, which was used for the simulation. Noise was applied as additive white Gaussian noise: a random variable with a Gaussian normal distribution, a mean of 0 and a standard deviation given by the user was added to each pixel.
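A corresponding OpenCV sketch of the two degradations is shown below; the thesis simulation itself uses Processing's built-in blur filter, so this is only an equivalent illustration. It assumes an 8-bit, 3-channel input and an odd kernel size.

```cpp
#include <opencv2/core.hpp>
#include <opencv2/imgproc.hpp>

// Degrade a rendered calibration image with Gaussian blur and additive
// white Gaussian noise.
cv::Mat degrade(const cv::Mat& input, int kernelSize, double noiseStdDev)
{
    cv::Mat blurred;
    cv::GaussianBlur(input, blurred, cv::Size(kernelSize, kernelSize), 0);

    // Zero-mean Gaussian noise with the given standard deviation per pixel.
    cv::Mat noise(blurred.size(), CV_32FC3);
    cv::randn(noise, cv::Scalar::all(0), cv::Scalar::all(noiseStdDev));

    cv::Mat result;
    blurred.convertTo(result, CV_32FC3);
    result += noise;
    result.convertTo(result, input.type());   // saturate back to the input format
    return result;
}
```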

Chapter 5

Results

5.1 Previsualization using graphics engine

The implemented result of Film Engine in use is shown in figure 5.1, a shot from a live recording in the studio with compositing. Corresponding results for the improved solution in Unreal Engine can be seen in figure 5.2.


FIGURE 5.1: Film Engine in live use.

FIGURE 5.2: Unreal Engine in live use.

5.2 Camera calibration

Table 5.1 shows the results of camera calibration using the camera simulation application written in Processing, with three different calibration pattern detectors. Asymmetric is the standard 11x4 asymmetric circles pattern found on the OpenCV website, while both Circles and Concentric used the 7x10 concentric circles pattern in figure 4.3 but with different detection algorithms: Circles uses OpenCV's standard findCirclesGrid function and Concentric the custom-made concentric circle detection algorithm. The iter suffix denotes the results from the iterative version of the respective detection method; in this case, however, only the first iteration was shown to actually improve the estimation of the parameters. The value to the right of each estimated parameter is the difference between the estimated value and the true value given by the simulation.

One interesting observation is that while the reprojection error (RE) is much better for Circles iter than for Concentric iter, it is reasonable to question whether the estimated parameters are actually better.

Parameter  Simulation  Circles          Circles iter     Concentric       Concentric iter  Asymmetric
K1         0.200       0.204 (0.004)    0.201 (0.001)    0.200 (0.000)    0.200 (0.000)    0.219 (0.019)
K2         0.000       -0.035 (0.035)   -0.006 (0.006)   -0.002 (0.002)   0.001 (0.001)    -0.268 (0.268)
P1         0.000       0.000 (0.000)    0.000 (0.000)    0.000 (0.000)    0.000 (0.000)    0.000 (0.000)
P2         0.000       0.000 (0.000)    0.000 (0.000)    0.000 (0.000)    0.000 (0.000)    0.000 (0.000)
K3         0.000       0.079 (0.079)    0.014 (0.014)    0.009 (0.009)    -0.009 (0.009)   1.140 (1.140)
Cx         600.000     599.507 (0.493)  599.512 (0.488)  599.549 (0.451)  599.500 (0.500)  599.441 (0.559)
Cy         450.000     449.487 (0.513)  449.509 (0.491)  449.499 (0.501)  449.485 (0.515)  449.357 (0.643)
Fx         800.000     798.939 (1.061)  799.977 (0.023)  799.090 (0.910)  799.982 (0.018)  798.990 (1.010)
Fy         800.000     798.939 (1.061)  799.975 (0.025)  799.090 (0.910)  799.995 (0.005)  799.064 (0.936)
RE         -           0.016            0.010            0.038            0.026            0.023

TABLE 5.1: Calibration results using virtual camera images; values in parentheses are the absolute differences from the simulated ground truth.
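For reference, the kind of run summarized in table 5.1 can be reproduced in outline with OpenCV's standard calibration API. The sketch below, in Python rather than the Processing-based simulation used here, corresponds roughly to the Circles detector: it detects a symmetric circles grid, calibrates, and returns the RMS reprojection error. The pattern size, spacing and image list are placeholders, and the custom concentric detector and the iterative refinement are not shown:

import cv2
import numpy as np

def calibrate_from_circle_grid(images, pattern_size=(7, 10), spacing=1.0):
    """Calibrate from images of a symmetric circles grid.
    Returns (camera_matrix, distortion_coefficients, rms_reprojection_error)."""
    # Planar object points: a regular grid of circle centres at z = 0.
    objp = np.zeros((pattern_size[0] * pattern_size[1], 3), np.float32)
    objp[:, :2] = np.mgrid[0:pattern_size[0], 0:pattern_size[1]].T.reshape(-1, 2) * spacing
    obj_points, img_points, image_size = [], [], None
    for img in images:
        gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
        image_size = gray.shape[::-1]
        found, centers = cv2.findCirclesGrid(gray, pattern_size,
                                             flags=cv2.CALIB_CB_SYMMETRIC_GRID)
        if found:
            obj_points.append(objp)
            img_points.append(centers)
    rms, K, dist, rvecs, tvecs = cv2.calibrateCamera(
        obj_points, img_points, image_size, None, None)
    return K, dist, rms  # rms corresponds roughly to the RE row in the tables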

5.3 Image selection

To test the devised image selection process, a large set of simulated camera images of the concentric circle pattern was generated, with many images being identical or differing only by pure translation, together with images where the calibration pattern is viewed from different angles. Five different selection processes were compared, each selecting four images from the set:

1. Our method, using the correct parameters for the camera matrix and zero distortion.

2. Random selection.

3. Manual selection.

4. Sequential selection, taking the first four images captured.

5. Our method, using a bad guess for the camera matrix with Fx = Fy = 8000, Cx = 300 and Cy = 225.

According to the results in table 5.2, our image selection method provides an image selection equal to or even better than manual selection. It is interesting to note that all selection processes produce decent reprojection errors even when the estimated parameters are bad, as with the random selection, or useless, as with the sequential selection. Thus the reprojection error is only a useful measure of calibration quality if the images used provide enough information to constrain the fit to a small fraction of the parameter search space.

Parameter  Simulation  Our method       Random            Manual           Sequential            Bad guess
K1         0.200       0.198 (0.002)    0.181 (0.019)     0.222 (0.022)    18.914 (18.714)       0.202 (0.002)
K2         0.000       -0.014 (0.014)   0.015 (0.015)     -0.336 (0.336)   219.151 (219.151)     -0.026 (0.026)
P1         0.000       0.001 (0.001)    -0.003 (0.003)    0.000 (0.000)    0.036 (0.036)         0.000 (0.000)
P2         0.000       0.000 (0.000)    0.002 (0.002)     -0.001 (0.001)   0.016 (0.016)         -0.001 (0.001)
K3         0.000       0.023 (0.023)    -0.013 (0.013)    1.504 (1.504)    -0.093 (0.093)        0.045 (0.045)
Cx         600.000     599.369 (0.631)  606.463 (6.463)   598.733 (1.267)  605.363 (5.363)       598.753 (1.247)
Cy         450.000     449.941 (0.059)  439.546 (10.454)  449.900 (0.100)  463.306 (13.306)      449.352 (0.648)
Fx         800.000     799.933 (0.067)  771.220 (28.780)  799.829 (0.171)  7758.321 (6958.321)   799.818 (0.182)
Fy         800.000     799.857 (0.143)  771.313 (28.687)  799.854 (0.146)  7760.709 (6960.709)   799.888 (0.112)
RE         -           0.118            0.119             0.114            0.108                 0.118

TABLE 5.2: Results of testing the image selection algorithm; values in parentheses are the absolute differences from the simulated ground truth.

Chapter 6

Discussion and future work

6.1 Graphics engine integration

Compared to the Unreal Engine solution, which only works as a renderer in the studio, Film Engine meets more of the initial objectives of implementing a new graphics engine and is therefore considered the most suitable tool with respect to Stiller Studios's requests. The engine provides high-quality rendering, Flair communication and possibilities to modify the scene during recording, and it also takes image data via the DeckLink SDK. The only functions missing are the timeline-based animation and the automation of the engine.

Even though Film Engine fulfills many of the desired requirements and works together with the rest of the system, there is still work to be done before the studio can fully let go of the old software setup. In particular, the fact that no FPS-timeline exists in Film Engine is a strong motive for not going through with an exchange yet. Without a timeline it is not possible to verify whether a scene is physically doable, which makes the engine unusable for real film jobs since no shots can be digitally planned before the actual recording. Another crucial Film Engine issue is bugs, a problem that the new engines, Stingray and Film Engine in particular, have contributed to. Both the timeline and the bug fixing are reportedly under development by the Film Engine development team.

Apart from the FPS-timeline, the lack of automation persists, but it can be solved through scripting, which after all fulfills the requirement of automation facilities. The automation issue, however, is something that is not fully developed for the current system either. Currently, setting up a previsualization between Maya, MotionBuilder and Flair requires many clicks and settings tweaks due to the lack of a logical structure. The employees at Stiller Studios compare their system with house building, where the house is built by placing one brick at a time without an architectural plan. This leads to very complex solutions, meant to work in the short run. In other words, automating Film Engine is only a small part of what the studio needs when it comes to system architecture. The idea of rebuilding the software solutions from scratch has been a recurrent topic during this project, but has never been considered doable during such a short period of time. It is worth mentioning that the web interface and Filemaker structure work fine, and a reconstruction would only include the above mentioned parts (Maya, MotionBuilder and Flair) of the previsualization system.

What would be really interesting, since the game and film engine industry is evolving fast, is to build a software architecture where it is possible to exchange just the engine part without having to deal with almost every unit in the system, in contrast to how it is structured now with several hard-coded solutions. Just the fact that the studio does not have access to all code, but must call their different external developers working around the world when an engine update is needed, clearly shows the deficiencies that exist.

6.2 Modular architecture

Both Unreal Engine and Film Engine have their advantages and disadvantages as previsualization rendering solutions; neither is a perfect solution at the moment. Film Engine, however, has the advantage of being made for film and is actively developed to solve the issues that Stiller Studios runs into. The main advantage of Unreal Engine is that the license gives access to the source code, and the engine is overall much more customizable. Both, or rather most, graphics engines are in constant development. It is possible that the Sequencer updates in Unreal Engine 4.12 dramatically increase the viability of using Unreal Engine for film. During just the few months this project has been ongoing, Autodesk has made a big effort to get the recently released Stingray out on the market, Film Engine has been released and renamed (from Cinebox), and Amazon has become a game engine developer. Even though existing solutions can be found in this area, many of the questions that emerged during the engine evaluations have also been current topics for discussion in different forums on the engines' websites, which indicates how new the 3D engine industry for filmmaking really is. As computing power continuously increases, the techniques and possibilities for rendering and camera tracking will grow too, for example real-time ray tracing.

Stiller Studios showed interest in selecting a game engine and making a completely integrated solution, doing away with as many different applications and computers as possible and solving most of the previsualization pipeline with a single piece of software. However, due to the constant change and development in the field, perhaps an integrated solution is the wrong approach. One of the benefits of a studio like Stiller Studios, with a technical crew, is that having a large number of computers, each performing a different task, is not really a problem and provides the benefit of modularity. A modular system with many specialized parts means that each part should be easy to replace. The graphics engine that seems to be the best today might not be the best tomorrow. Different graphics engines might also be better for different purposes and projects. Simply writing a plugin that controls the virtual camera based on data from network packets should, if possible at all, be trivial in most rendering solutions, while making a complete integrated solution that solves all problems is likely to be hard or at least take a lot more time.

For a futuristic integrated solution it would also be desirable to achieve functionalities such as chroma keying, depth compositing and distortion correction. Both Film Engine and Unreal Engine showed problems with getting image data from the DeckLink card, with Film Engine having a significant delay and Unreal Engine not allowing compilation of the DeckLink API without a workaround. The use of QTAKE can continue with a modular solution. Writing a separate application that takes two image signals from DeckLink cards and composites them together should not be too difficult either, and doing things like correcting lens distortion in a GLSL shader is trivial. Handling these kinds of tasks separately not only avoids possible unexpected issues when trying to work with the given tools, it also saves time by not having to reimplement solutions when changing the rendering solution.

One benefit of modular design was clearly shown when testing Mo-Sys's StarTracker. Since a solution for sending and receiving motion control data already existed in MotionBuilder, and solutions for receiving motion control data from MotionBuilder were implemented in both Unreal Engine and Film Engine, writing a MotionBuilder plugin for receiving motion tracking data from StarTracker directly allowed the use of StarTracker in three different engines while only having to implement it into one. An ideal solution would revolve less around finding the best tools for the job and super-gluing them to the pipeline, and more around defining how the different computers and applications communicate via video cards, network packets and files, and defining translators (such as MotionBuilder in the previous example).
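To make the translator idea concrete, the listing below is a hypothetical stand-alone receiver, not part of the existing pipeline: it unpacks camera pose packets from the network and hands them to whichever renderer is currently in use through a single callback. The packet layout (seven doubles: position, rotation and focal length) and the port number are assumptions made for the example:

import socket
import struct

# Assumed packet layout: x, y, z, pan, tilt, roll, focal length as seven doubles.
PACKET_FORMAT = "<7d"
PORT = 9000  # assumed port for the motion control / tracking data stream

def receive_camera_poses(apply_pose):
    """Listen for camera pose packets and forward them to a renderer callback.
    apply_pose takes (position, rotation, focal_length) and is the only
    engine-specific piece, so swapping renderers means swapping one function."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(("", PORT))
    while True:
        data, _ = sock.recvfrom(struct.calcsize(PACKET_FORMAT))
        x, y, z, pan, tilt, roll, focal = struct.unpack(PACKET_FORMAT, data)
        apply_pose((x, y, z), (pan, tilt, roll), focal)

Used this way, the network protocol rather than any particular engine becomes the stable interface of the pipeline.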

6.3 Camera tracking

Tracking systems, such as Mo-Sys's StarTracker, that do not require a motion control robot are versatile. A more in-depth study of different tracking solutions could definitely be of interest.

Placing tracking markers around the studio that can be used for high-precision offline tracking, and filming these with a camera using different real-time tracking solutions, could provide a good reference for measuring the quality of the different tracking solutions. In addition, this would make it possible to calibrate the camera on the motion control robot through its full range of motion, and possibly also to map the mechanical errors to the robot's motor values.

Using a tracking solution such as the StarTracker in conjunction with the motion control robot, by attaching the StarTracker to the robot, could also be very interesting. Since there are known flaws in the motion control system, there could perhaps be benefits in using Flair for motion control when needed, but using a separate system such as the StarTracker for camera tracking.

6.4 Compositing

Depth compositing would probably require an upgrade of the studio by adding one or more cameras. Good depth compositing would greatly expand the kinds of scenes that can be shot correctly in real time. Depending on how Stiller Studios chooses to upgrade the studio, various possibilities open up. Depth detection could also open up possibilities for light and shadow interaction between the filmed material and the virtual scene.

The quality of the chroma keying in the current system depends highly on the lighting and the filtering of the image. Real-time keying is an active area of research and improvements could definitely be made. One area of improvement that has not been discussed previously is color correction, using some sort of automatic method for matching the colors of the virtual scene and the filmed material.
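As a rough illustration of the kind of stand-alone compositing step discussed in section 6.2, the following is a naive green screen keyer in Python/OpenCV; the HSV thresholds are arbitrary placeholders, and a production-quality real-time keyer would need to be considerably more robust:

import cv2
import numpy as np

def composite_green_screen(foreground, background,
                           lower=(35, 60, 60), upper=(85, 255, 255)):
    """Naive chroma key: mask green pixels in the foreground (an HSV range)
    and fill them with the virtual background, which must have the same size."""
    hsv = cv2.cvtColor(foreground, cv2.COLOR_BGR2HSV)
    green = cv2.inRange(hsv, np.array(lower, np.uint8), np.array(upper, np.uint8))
    # Soften the matte edge slightly to hide hard keying artifacts.
    alpha = cv2.GaussianBlur(255 - green, (5, 5), 0).astype(np.float32) / 255.0
    alpha = alpha[:, :, None]
    out = (foreground.astype(np.float32) * alpha
           + background.astype(np.float32) * (1.0 - alpha))
    return out.astype(np.uint8)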

6.5 Object tracking

Besides the use for camera tracking and motion capture, optical tracking solutions could be useful for increasing the level of interactivity in the scene. By using a motion capture solution such as OptiTrack, or by placing fiducial markers on objects, objects in the scene could be tracked and fed into the rendering engine, allowing the directors and actors to move and interact with, for example, virtual furniture.

6.6 Augmented reality

Virtual and augmented reality (VR and AR) are fields that are developing fast, with several upcoming commercial solutions. VR and AR also relate closely to the purpose of previsualization: breaking the border between the virtual scene and reality and bringing filming on green screen closer to pure live action. Integrating VR or AR solutions into the studio would make it possible for the actors and director to actually be in the virtual scene as if it were almost real.

6.7 Camera calibration

While the developed calibration method is a significant improvement over Stiller Studios's existing calibration pipeline, there is still room for further improvement. Allowing automatic, controlled or preprogrammed camera movements would open up several possibilities and simplifications of the calibration process. Focus could be set automatically. Calibration methods with controlled movement, such as the one with known rotation by Frahm et al. [26], allow for varying focal length during the calibration by keeping the movement known. If this were used it could make the calibration process significantly faster.

The calibration could most likely be further improved by using advanced calibration methods such as the digital correlation method by Wang et al. [31] or separate distortion calibration, but there is also the question of how good the calibration actually needs to be. Camera calibration can be performed with just two images of a calibration pattern or one image of a 3D calibration object. Perhaps auto-calibration is good enough? It is possible that a separate process for the camera calibration is superfluous, and that simple camera movements in a static scene when filming, or a single image of a known reference object, would give a good enough calibration.

6.8 Using calibration data

When the parameters for different lenses with different zoom and focus settings have been found, these values need to be added to the previsualization. The motion control software Flair offers some support for this: it allows the user to store a list of lenses and some settings, such as the desired offset of the view point from the camera rig, the focus distance for different focus settings, the focal length, and the change in center of projection and field of view for different zoom settings.

The system has several limitations, though. Lens settings must be set up manually. The possibility to save and load backups of the lens settings exists, but this can only be done in batch and is saved in an unknown binary format. The system is also limited in what information can be provided. No settings for orientational offset exist besides a checkbox telling Flair whether the lens is a snorkel lens or a normal lens. The process of setting the values can also be quite confusing if the process given in the Flair manual [59] is not followed, as it demands setting offsets based on several physical measurements. The manual in turn incorrectly refers to the nodal point (one of the offsets that must be set) as the no-parallax point.

One possible alternative would be to store the values obtained from the camera calibration in a table-based data file. Since the focus and zoom are sent together with the motion control data from Flair, both to the calibration and to the rendering program, these values could be stored in the data table together with the calculated parameters for focal length, distortion, orientation and center of projection offset from the camera. The values could then be used when rendering, interpolating between neighboring focus and zoom values.
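The table-based alternative could look roughly like the following sketch, which stores calibrated parameter sets on a regular grid of zoom and focus samples and bilinearly interpolates between the four neighbouring entries; the parameter layout and the class itself are assumptions for illustration, not an existing Flair or studio format:

import numpy as np

class LensTable:
    """Calibrated lens parameters sampled on a regular (zoom, focus) grid.
    params has shape (len(zooms), len(focuses), 9) holding
    (fx, fy, cx, cy, k1, k2, k3, p1, p2) per sample; the layout is assumed."""

    def __init__(self, zooms, focuses, params):
        self.zooms = np.asarray(zooms, dtype=float)
        self.focuses = np.asarray(focuses, dtype=float)
        self.params = np.asarray(params, dtype=float)

    def lookup(self, zoom, focus):
        """Bilinearly interpolate between the four neighbouring samples."""
        zi, zt = self._locate(self.zooms, zoom)
        fi, ft = self._locate(self.focuses, focus)
        p = self.params
        row0 = (1 - ft) * p[zi, fi] + ft * p[zi, fi + 1]
        row1 = (1 - ft) * p[zi + 1, fi] + ft * p[zi + 1, fi + 1]
        return (1 - zt) * row0 + zt * row1

    @staticmethod
    def _locate(grid, value):
        # Lower cell index and the fractional position within that cell, clamped
        # so that readings outside the calibrated range use the nearest edge cell.
        i = int(np.clip(np.searchsorted(grid, value) - 1, 0, len(grid) - 2))
        t = (value - grid[i]) / (grid[i + 1] - grid[i])
        return i, float(np.clip(t, 0.0, 1.0))

The renderer would then call lookup with the zoom and focus values received from Flair each frame.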

Chapter 7

Conclusions

The thesis work was based on the following three questions:

• How can Stiller Studios’s previsualization be improved?

• Which methods are suitable for improving the previsualization?

• What is the intended outcome of improving the previsualization?

To answer the first question, four areas where the previsualization could be improved were identified:

• Improved rendering.

• Improved camera calibration.

• Alternative camera tracking.

• Depth compositing.

Several options for improving each of these areas were identified. While depth compositing and alternative camera tracking were explored mostly in theory, several commercial solutions exist that solve the problem of camera tracking without a motion control robot with decent results.

The visual quality of the previsualization was substantially improved by using game engines for rendering compared to Stiller Studios's former solution using MotionBuilder. The camera calibration process was also improved, simply by using camera calibration algorithms rather than visual estimation.

While some improvements have been made, there is room for much more. A lot more testing and implementation could and should be done, and each of the identified areas of improvement, such as object tracking and modular architecture, could be a new master's thesis on its own.

The intended outcome of improving the previsualization was to remove the borders between the real and the virtual in filmmaking. Ideally the previsualization should be ready for shooting advanced mixed reality scenes live, ready for TV or streaming without post-production. While not quite there yet, the work in this master's thesis is a step on the way. In order to stay among the most technically advanced studios it is not enough to be aware of new technologies on the market; it is also necessary to be able to implement them efficiently. What seems to be the best solution today may change tomorrow; that is the conclusion about how the advanced CGI filmmaking industry works.

Bibliography

[1] Unreal Engine. Real-time in unreal engine 4. https://www.youtube.com/watch?v=pGtb6uMUgZA. Accessed: 2016-10-24.

[2] Unity. Unity Adam demo. https://www.youtube.com/watch?v=GXI0l3yqBrA. Accessed: 2016-10-24.

[3] Mark Roberts Motion Control. Real-time data output to cgi. http://www.mocoforum.com/discus/messages/19/144.html?1255709935. Accessed: 2016-10-19.

[4] Salvator D. 3d pipeline tutorial. http://www.extremetech.com/computing/49076-extremetech-3d-pipeline-tutorial. Accessed: 2016-10-16.

[5] Crassin C., Neyret F., Sainz M., Green S., and Eisemann E. Interactive indirect illu- mination using voxel cone tracing. Comput. Graph. Forum, 30(7):1921–1930, 2011.

[6] Kolivand H. and Sunar M.S. Shadow mapping or shadow volume? International Journal of New Computer Architectures and their Applications (IJNCAA), 1(2):275–281, 2011.

[7] van Oosten J. Forward vs deferred vs forward+ rendering with DirectX 11. http://www.3dgep.com/forward-plus/. Accessed: 2016-10-16.

[8] Crystal Space. What is deferred rendering? http://www.crystalspace3d.org/docs/online/manual/About-Deferred.html. Accessed: 2016-10-16.

[9] Owens B. Forward rendering vs. deferred rendering. https://gamedevelopment.tutsplus.com/articles/forward-rendering-vs-deferred-rendering--gamedev-12342. Accessed: 2016-10-16.

[10] Pinhole camera model. https://en.wikipedia.org/wiki/Pinhole_camera_model. Accessed: 2016-10-24.

[11] Bradski G.R. and Kaehler A. Learning OpenCV. O'Reilly Media, Inc., first edition, 2008.

[12] Cambridge in color. Lens diffraction and photography. http://www.cambridgeincolour.com/tutorials/diffraction-photography.htm. Accessed: 2016-10-24.

[13] Distortion (optics). https://en.wikipedia.org/wiki/Distortion_(optics). Accessed: 2016-10-24.

[14] LA Video Filmmaker. F-stops, t-stops, focal length and lens aperture. http://www.lavideofilmmaker.com/cinematography/f-stops-focal-length-lens-aperture.html. Accessed: 2016-10-24.


[15] Littlefield R. Theory of the “no-parallax” point in panorama photography. 2006.

[16] Lee J. Head tracking for desktop vr displays using the wiiremote. https://www.youtube.com/watch?v=Jd3-eiid-Uw. Accessed: 2016-10-24.

[17] Brown D.C. Decentering distortion of lenses. Photometric Engineering, 32(3):444–462, 1966.

[18] Zhang Z. A flexible new technique for camera calibration. IEEE Transactions on pattern analysis and machine intelligence, 22(11):1330–1334, 2000.

[19] Scaramuzza D., Martinelli A., and Siegwart R. A toolbox for easily calibrating omnidirectional cameras. In 2006 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 5695–5701, Oct 2006.

[20] Kumar A. and Ahuja N. On the equivalence of moving entrance pupil and radial distortion for camera calibration. In 2015 IEEE International Conference on Computer Vision (ICCV), pages 2345–2353, Dec 2015.

[21] Kundur D. and Hatzinakos D. Blind image deconvolution. IEEE signal processing magazine, 13(3):43–64, 1996.

[22] Nguyen H. Gpu Gems 3. Addison-Wesley Professional, first edition, 2007.

[23] Stiller Studios. Welcome to stiller studios. http://stillerstudios.com/. Accessed: 2016-10-24.

[24] Medioni G. and Kang S.B. Emerging topics in computer vision. Prentice Hall PTR, 2004.

[25] Caprile B. and Torre V. Using vanishing points for camera calibration. International journal of computer vision, 4(2):127–139, 1990.

[26] Frahm J-M. and Koch R. Camera calibration with known rotation. In Computer Vision, 2003. Proceedings. Ninth IEEE International Conference on, pages 1418–1425. IEEE, 2003.

[27] Lepetit V., Moreno-Noguer F., and Fua P. Epnp: An accurate o (n) solution to the pnp problem. International journal of computer vision, 81(2):155–166, 2009.

[28] Mateos G.G. et al. A camera calibration technique using targets of circular features. In 5th Ibero-America Symposium On Pattern Recognition (SIARP). Citeseer, 2000.

[29] von Gioi R.G., Monasse P., Morel J-M., and Tang Z. Lens distortion correction with a calibration harp. In 2011 18th IEEE International Conference on Image Processing, pages 617–620, Sept 2011.

[30] Datta A., Kim J-S., and Kanade T. Accurate camera calibration using iterative refinement of control points. In Computer Vision Workshops (ICCV Workshops), 2009 IEEE 12th International Conference on, pages 1201–1208, Sept 2009.

[31] Vo M., Wang Z., Luu L., and Ma J. Advanced geometric camera calibration for machine vision. Optical Engineering, 50(11):110503–110503–3, 2011.

[32] Andrew Colin Rice. Dependable systems for sentient computing. PhD thesis, Citeseer, 2007.

[33] ARToolKit. Creating and training traditional template square markers. https://artoolkit.org/documentation/doku.php?id=3_Marker_Training:marker_training. Accessed: 2016-10-24.

[34] Cantag. Marker-based machine vision. http://www.cl.cam.ac.uk/~acr31/cantag/. Accessed: 2016-10-24.

[35] Lightcraft Technology. Circular barcodes. http://www.lightcrafttech.com/overview/setup/. Accessed: 2016-10-24.

[36] Culbertson B. Chromaglyphs for pose determination. http://shiftleft.com/mirrors/www.hpl.hp.com/personal/Bruce_Culbertson/ibr98/chromagl.htm. Accessed: 2016-10-24.

[37] Byrne B.P., Mallon J., and Whelan P.F. Efficient planar camera calibration via automatic image selection. 2009.

[38] Pertuz S., Puig D., and Garcia M.A. Analysis of focus measure operators for shape-from-focus. Pattern Recognition, 46(5):1415–1432, 2013.

[39] Hartley R. and Zisserman A. Multiple View Geometry in Computer Vision. Cambridge University Press, ISBN: 0521540518, second edition, 2004.

[40] Epipolar geometry. https://en.wikipedia.org/wiki/Epipolar_geometry. Accessed: 2016-10-24.

[41] Computerphile. Space carving. https://www.youtube.com/watch?v=cGs90KF4oTc. Accessed: 2016-10-24.

[42] Crytek. Cinebox Beta technical manual. 2015.

[43] Amazon Lumberyard. https://aws.amazon.com/lumberyard/. Accessed: 2016-10-16.

[44] Ogre3d. http://www.ogre3d.org. Accessed: 2016-10-19.

[45] Autodesk. Custom renderer api. http://docs.autodesk.com/MB/2014/ENU/MotionBuilder-SDK-Documentation/index.html?url=files/GUID-EBB95B3D-E75B-4033-ABB4-29EE7B1F9F4A.htm,topicNumber=d30e11937. Accessed: 2016-10-16.

[46] Autodesk. Stingray. http://www.autodesk.com/products/stingray/overview. Accessed: 2016-10-19.

[47] Unity Technologies. Unity manual. https://docs.unity3d.com/Manual/index.html. Accessed: 2016-10-19.

[48] Enlighten. Demystifying the enlighten precompute process. http://www.geomerics.com/blogs/demystifying-the-enlighten-precompute/. Accessed: 2016-10-19.

[49] Epic Games. Unreal engine 4 documentation. https://docs.unrealengine.com/latest/INT/. Accessed: 2016-09-25.

[50] "Applications of Artificial Vision" research group of the University of Cordoba. Aruco: a minimal library for augmented reality applications based on opencv. http://www.uco.es/investiga/grupos/ava/node/26. Accessed: 2016-10-19.

[51] Lightcraft Technology. Previzion manual. http://www.lightcrafttech.com/support/doc/. Accessed: 2016-10-19.

[52] Ncam. Ar/vr real time camera tracking. http://www.ncam-tech.com/. Accessed: 2016-10-19.

[53] Trackmen. Camera tracking solutions. http://www.trackmen.de/. Accessed: 2016-10-19.

[54] Mo-Sys. Camera motion systems. http://www.mo-sys.com/. Accessed: 2016-01-23.

[55] The Captury. Pure performance. http://www.thecaptury.com/. Accessed: 2016-10-25.

[56] Organic Motion. Markerless motion capture. http://www.organicmotion. com/. Accessed: 2016-10-25.

[57] OptiTrack. Optitrack motion capture. http://optitrack.com/products/flex-13/indepth.html. Accessed: 2016-10-19.

[58] Abeles P. Inverse radial distortion formula. http://peterabeles.com/blog/?p=73. Accessed: 2016-10-19.

[59] Wakley S. Flair operator's manual version 5. https://www.mrmoco.com/downloads/MANUAL.pdf. Accessed: 2016-10-24.

Appendix A

Inverse distortion

Shown from left to right: the image distorted using the inverse distortion algorithm with the given parameters and number of iterations; the same image sampled using distorted and then undistorted coordinates; and thirdly the difference between the distorted-and-undistorted image and the original.

FIGURE A.1: K1 = 0.1, iterations = 1.

FIGURE A.2: K1 = 0.1, iterations = 5.


FIGURE A.3: K1 = 1, iterations = 500.

FIGURE A.4: K1 = 25, iterations = 500.

FIGURE A.5: K1 = -0.1, iterations = 5.

FIGURE A.6: K1 = -0.1, iterations = 20.

FIGURE A.7: K1 = -0.1, iterations = 500.

FIGURE A.8: P1 = 0.01, iterations = 500.

FIGURE A.9: P1 = 0.1, iterations = 500.