PERFORMANCE EVALUATION FOR FULL 3D PROJECTOR CALIBRATION

METHODS IN SPATIAL AUGMENTED REALITY

A Thesis

Submitted to

the Temple University Graduate Board

In Partial Fulfillment

Of the Requirements for the Degree

MASTER OF SCIENCE

in ELECTRICAL ENGINEERING

By

Michael Korostelev

August, 2011

Thesis Approval(s):

Dr. Li Bai, Thesis Adviser, Electrical Engineering

Dr. Seong Kong, Electrical Engineering

Dr. Robert Yantorno, Electrical Engineering

Abstract

Spatial Augmented Reality (SAR) has presented itself as an interesting tool not only for novel ways of visualizing information but also for developing creative works in the performance arts. The main challenge is to determine accurate geometry of a projection space and to determine an efficient and effective way to project digital media and information to create an augmented space. In our previous implementation of SAR, we developed a projector-camera calibration approach using infrared markers. However, the projection suffers severe distortion due to the lack of depth information in the projection space. For this research, we propose to develop an RGBD sensor-projector system to replace our current projector-camera SAR system. Proper calibration between the camera or sensor and the projector links vision to projection, answering the question of which point in camera space maps to what point in the space of projection. Calibration will resolve the problem of capturing the geometry of the space and allow us to accurately augment the surfaces of volumetric objects and features. In this work three calibration methods are examined for performance and accuracy. Two of these methods are existing adaptations of 2D camera-projector calibrations (calibration using arbitrary planes and ray-plane intersection), while the third is our proposed novel technique, which utilizes point cloud information from the RGBD sensor directly. Through analysis and evaluation using re-projection error, results are presented, identifying the proposed method as practical and robust.

Contents

Abstract i

Contents ii

List of Figures v

1 Introduction 1

1.1 Art and Engineering ...... 1

1.2 Augmented Reality ...... 3

1.2.1 Types of AR ...... 5

1.2.2 Advances in the Field of SAR ...... 9

1.3 Previous Work at Temple ...... 9

2 Current and Related Work 11

2.1 "Dance and Engineering in Augmented Reality" ...... 12

2.1.1 Methodology ...... 13

2.1.2 Calibration ...... 14

2.1.3 Detection ...... 19

2.1.4 Tracking ...... 23

2.1.5 Warping and Projection ...... 29

2.2 Microsoft Kinect "Hacks" ...... 32

2.2.1 Kinect at Temple ...... 35

2.2.2 Getting Usable Sensor Data ...... 35

2.2.3 Interactive Particle Systems ...... 37

2.2.4 Superpowers with Projection ...... 38

2.2.5 Kinect 3D modeling ...... 39

2.2.6 The Point Cloud Library ...... 40

3 Calibration Methods 41

3.1 Projector Calibration Methods ...... 41

3.1.1 Arbitrary Planes and Calibrated Camera ...... 43

3.1.2 Ray-Plane Intersection Method ...... 46

3.2 3D Sensor - Projector Calibration ...... 51

3.2.1 Calibration of depth and image sensors ...... 54

3.2.2 Homography Based Alignment for Two View Correspondence . 57

3.2.3 Stereo Calibration for Two View Correspondence ...... 59

3.2.4 Comparison Between Homography Based and Stereo Calibration for Two View Correspondence ...... 62

3.2.5 Extracting Depth Information ...... 64

3.2.6 Use 3D points for projector intrinsics ...... 65

3.2.7 Determining Projector Intrinsics ...... 68

3.2.8 Find R and t between sensor and Projector, Extrinsics . . . . 73

4 Implementation 77

4.1 Displaying the Augmented Scene ...... 77

4.2 Evaluating Projection Performance ...... 79

5 Discussion and Conclusions 83

5.1 Performance ...... 83

5.2 Application ...... 85

5.3 Future Work ...... 86

5.4 Conclusion ...... 87

References 90

List of Figures

1 Jan Borchers, René Bohne and Gero Herkenrath, LumiNet Jacket and LumiNet, 2008, Media Computing Group at RWTH Aachen University, Germany 2

2 In this simple example a printed pattern is recognized by the phone’s camera, then augmented within the camera view with the cube. As the phone moves around the pattern, the cube is rendered in the correct place with correct pose...... 4

3 The Video See Through HMD(1) is the AR display. The display is combined with battery powered(3) portable computer(2) that is capable of real time tracking and reconstruction to allow for robust vision based AR applications...... 6

4 In our current system, an infra-red camera (b) senses infra-red markers in- tegrated into a panel (a). Software (c) then manipulates the coordinates of the markers to transform an image and then send it to a projection system (d) to impose it correctly into the environment...... 13

5 The homography is an invertible transformation from some plane in a projective space to another projective plane; even though the homography transformation is invariant to scale, it maps straight lines to straight lines...... 15

6 The projective rectification here is transforming the perspective warped camera image to the ideal parallel image...... 16

7 The homography we determined perspective warps the captured image to undistort the cameras view and transform it into the projected image plane. . . . 17

8 The drawback with the current calibration system exists since the 2D homography is only valid for the plane at which calibration was done, so at different distances, parallax problems are observed...... 18

9 Using the Hough transform in OpenCV allows for the detection of edges; analyzing these edges, we can find quadrangles in the image...... 20

10 Infrared fiducial markers are made from two IR LEDs diffused by ping-pong ball hemispheres...... 21

11 The camera view is thresholded based on luminance in the red channel. It can be seen that using the LEDs is a good solution for a dynamic application such as dance where tracking success is very important...... 22

12 The projection panel a consists of a plain white cardboard with an IR LED marker at each corner and one in the middle, off center. This is tracked and overlaid with projection in b...... 25

13 The detected blobs are classified by comparing the set to an ideal blob pattern. Each frame, we cycle through blob orders and compute a perspective transformation between the ideal and current pattern configuration...... 26

14 The final result of the warped media in OpenGL, when this is made full screen in the projector’s window, it overlays the projection panel...... 31

15 The final result of the warped media in OpenGL, when this is made full screen in the projector’s window, it overlays the projection panel...... 32

16 In the Microsoft Kinect a laser beam refracted through a lens creates a matrix of laser dots in the space; based on the disparity in the dots, the depth of objects in the space can be determined...... 33

17 The depth information captured with an open source Kinect driver is 640 x 480 x 640. The units are all relative to the origin of the camera...... 34

18 The smoothed depth image from the raw Kinect data is determined with the help of OpenCV's inpaint technique...... 36

19 The particle system uses attraction and repulsion forces for each particle as well as a larger attraction force to user’s hands that are detected in the scene. . . . . 37

20 A user waves a stick which attracts a swarm of particles...... 38

21 Lightning and trail effects are projected over the performer’s hands creating an illusion of superpowers...... 38

22 Virtual Playdoh allowed users to mold a warp-able NURBS surface using both of their hands and rendering in anaglyphic 3D...... 39

23 The pinhole camera model maps real space to image space...... 42

24 The two degrees of freedom, 1 in rotation and 1 in translation...... 46

25 The extrinsic parameters, that are used to compute the plane equation. . . . 49

26 Ray plane intersection in its entirety. The object is to determine vectors V from the camera center, to the projected calibration pattern...... 50

27 The depth and RGB image on top of each other, it can be seen that they do not perfectly correspond...... 54

28 Calibration [12] of depth and RGB cameras on the 3D sensor ...... 55

29 Physical 3D chessboard calibration pattern used that can be seen in both RGB and range cameras...... 56

30 Physical 3D chessboard calibration pattern with detected corners...... 57

31 Mismatched chessboard corners are matched after multiplying the RGB image by the determined homography...... 59

32 When not calibrated the position of each corner does not correspond. With a disparity of close to 30 pixels...... 63

33 Plot of correspondence when calibrated with homography...... 63

34 Plot of correspondence when calibrated with stereo calibration...... 64

35 The distance sensed in meters vs. ambiguous units.[12] ...... 66

36 Determining metric distance for object points in the camera...... 66

37 The projector's camera matrix provides information that will allow us to understand the perspective of the projected image...... 68

38 Scene generated with intrinsic matrix K1 ...... 71

39 Scene generated with intrinsic matrix K2 ...... 72

40 Scene generated with intrinsic matrix K3 ...... 72

41 Scene generated without the use of intrinsic matrix...... 73

42 Where $H_\pi$ is a planar homography and $e'$ are the epipoles, resulting in a fundamental matrix: $F = [e']_{\times} H_\pi$ ...... 74

43 The scene with default R and T, and on the right, the scene with R and T input. 77

44 On the left, the virtual 3D scene that is modified with rotation, translation and projection matrix, on the right, the scene overlaid by projection...... 77

45 The OpenGL viewing pipeline ...... 78

46 The reprojection was evaluated by projecting a virtual chessboard pattern over a real one...... 80

47 This plot compares the reprojection error between ray plane intersection and the proposed novel RGBD sensor method...... 81

48 This plot compares the distributions of the pixel locations determined with both projector calibration methods...... 82

49 A 3D scene overlay is demonstrated. Accurate overlay persists at all depths. . 85

1 Introduction

The goal of this work is to explore innovative techniques for calibrating an image projector with a 3D sensor as well as compare the technique specific to 3D sensors with projector-camera calibration for accuracy and robustness. The work is structured within the framework of a Spatial Augmented Reality (SAR) project with application to creative works in performance art. Application to performance art allows us to take more liberties in terms of technology as well as pushes our creative limits as engineers. The technology enables artists to augment and overlay spaces with virtual information and media, enhancing the user experience by making information easier to absorb as it becomes a natural part of the environment.

1.1 Art and Engineering

As computing and technology become a part of life, people build their lives around technology in virtually all fields of study and areas of interest. This becomes more problematic in creative disciplines, however. Science, technology, engineering and mathematics (STEM) have steep learning curves, and people who are interested in using these concepts creatively are left with minimal resources, most of which are out of the scope of their projects.

In recent years artists have been trying to adapt to the world of technology, and new creative fields like graphic and interactive design or video editing have emerged. These, however, focus on artists using available tools (software packages, equipment) designed by engineers and computer scientists. So even though they use these tools creatively, instead of involving STEM in their artworks they are largely consuming STEM without really using it creatively. Even more recently, small communities of artists have been building their works specifically around technology. Without academic exposure to STEM concepts they have to overcome difficulties and learn by "hacking", from trial and error, and by using the small amount of resources available on the Internet.

Figure 1: Jan Borchers, René Bohne and Gero Herkenrath, LumiNet Jacket and Lu- miNet, 2008, Media Computing Group at RWTH Aachen University, Germany

This combination of art and the use of technology has produced some very interesting works. Figure 1 shows a garment made to glow with light patterns in a performance setting. A thesis from an art school, Pratt Institute, published in 2010 states that "[capabilities within electronics] allows artists and designers to have more control of their tools and materials, including the option to create their own." [9, p. 61]

The fact that theses of this type are being published by art students, especially with the matters that they focus on, shows that efforts from our field as engineers are relevant, timely and address the problems that creative people are having in their communities. Gibb shows that the most important aspects of technology for creative people are accessibility, openness and, most importantly, simple implementation for rapid prototyping. [9, p. 66]

1.2 Augmented Reality

Augmented Reality (AR) is the art and science of interconnecting the virtual and the real. In AR, the real environment is not suppressed; rather than immersing a person into a completely synthetic world, AR attempts to embed synthetic supplements into the real environment. AR is an inexpensive technology but a rich medium for creative works, and creative works have the potential to take full advantage of what AR has to offer. AR is at the forefront of what technology can offer and poses research questions that exercise many areas of expertise. Imagine the near inevitable future where we take it for granted that virtual content is merged with our perception of and interaction with the world around us. Imagine a Jabberwocky roaming your desk, surgeons with true x-ray vision, a dancer creating supernovae with every plié, soldiers with extra sensory perception; imagine a world without bounds.

The main research challenges in AR currently deal with various areas of engineering and computer science. Tracking and registration are the most fundamental problems researchers are dealing with currently. In order to produce a convincing augmented reality experience, both the user and the real world environment around them, as well as the virtual elements produced by augmentation, must be placed onto some global coordinate system. These tracking techniques are usually camera based, but other sensors are sometimes also employed. The goal, of course, is to use the most non-invasive methods, meaning the user or the environment would need minimal equipment. To achieve this, optical tracking is highly preferred over incorporating hardware sensing onto the user or objects.

Figure 2: In this simple example a printed pattern is recognized by the phone’s camera, then augmented within the camera view with the cube. As the phone moves around the pattern, the cube is rendered in the correct place with correct pose.

After tracking, the next challenge presented in AR is real time rendering. Since AR, as in Figure 2, mainly concentrates on superimposing the real environment with graphical elements, fast and realistic rendering methods play an important role. [17, p. 8] The challenge with real time rendering aims at making the integrated virtual elements indistinguishable from reality to the observer. In order for this type of realism to be produced, rendering has to happen in real time with no latency and with photo realism, meaning the virtual objects need to be convincingly integrated into the environment (properly lit, casting shadows and susceptible to occlusion by real objects).

The AR system emerges when real time rendering is combined with tracking and registration. Next comes the question of how to display or present the processed environment to effectively immerse the user or audience.

1.2.1 Types of AR

There are currently three main ways of presenting an AR interface: Head-Mounted Displays, Mobile Based and Projector Based. It is important to note that there are several required components common to each of the three techniques: there needs to be a way to sense for tracking and registration, a computing device to process information, and a way to display the augmented environment. These components are what create the contrasts between the three techniques.

Head-Mounted Displays or HMDs are self explanatory; they employ the use of some type of goggles with small displays for each eye and a video camera mounted on the user's head. The camera simulates vision for the user by displaying the video it is capturing to the goggles. As a result the user is essentially looking at the world through this camera. As the video is captured, it is sent to some processing device that interfaces with both the camera and goggles. This device processes and "augments" the captured video and displays the result in the goggles for the user. Finally, an altered environment is produced, giving the user an enhanced view of the environment.

Looking through the goggles can be a little awkward at first.

Figure 3: The Video See Through HMD (1) is the AR display. The display is combined with a battery powered (3) portable computer (2) that is capable of real time tracking and reconstruction to allow for robust vision based AR applications.

However, even though the visual experience is inconsistent with the real world, the human vision system adapts to the new environment very flexibly. Depending on the application, the goggles can be a very good improvement for human perception. Army aviation is an example in this regard, where the aviator often needs to see in every direction. Through the HMD, the aviator sees a variety of situational information, including pilotage imagery, tactical, and operational data. [16, p. 112]

Currently, the main drawback with HMD technology is portability. An operator is required not only to wear the display and camera; a wearable computer for processing the video is also required. Apart from the active components, another requirement is a power source. The portability constraint calls for trade-offs to be made between performance, size and power consumption.

Mobile-Based Augmented Reality In recent years, mobile devices have evolved to the point where their computation power as well as a wide selection of integrated sensors now allows for AR software to be implemented. Most available smart phones have a fast processor, a large display and a high resolution camera, and as with HMDs they can be used to view an augmented video feed of the environment. The main mobile software platforms are Google's Android and Apple's iPhone OS; both operating systems allow developers access to devices' integrated sensors and cameras and support a wide selection of software libraries for image processing and rendering, such as OpenCV and OpenGL.

Most of the available applications for these devices use the camera as the primary sensor for finding fiduciary markers known to the software as points of reference. As the markers are detected in the camera's image, their orientation provides geometry information and allows for an image to be rendered in its place in the video see-through. Other applications rely on compass, GPS and accelerometer sensors available in the devices. The purpose of these apps is to enhance users' sense of location, giving them a more intuitive interface to visualize things around them. Layar, an iPhone and Android application, uses the GPS and compass to overlay place-markers for local businesses in the camera view. An entertainment app, Zombie Attack, uses the GPS to place fictional zombies on the map with the objective being to avoid them.

Mobile based AR has an exciting potential as has already been demonstrated, but still requires the user to have some kind of equipment on their person and is limited by that equipment’s capabilities.

Spatial Augmented Reality, or SAR, is built around using a projector to display graphic information on the physical environment. The work in this thesis is centered around this technology. The main advantage of using a projector to display AR is that it removes the user from the hardware required to display and process information. This allows multiple users to experience the effects as well as collaborate in the augmented space. With the other two AR techniques (mobile device based AR and wearable computing AR), the experience is limited to a single user and leaves no potential for an audience experience.

SAR creates a rich medium for new immersive ways to present information, involve an audience and stimulate collaboration. There has already been interest in SAR for creative art projects. By existing in the same space as the exhibition, the audience can understand that their existence within the design community can be flexible and cross-disciplinary. [2] SAR has already been used in advertising to promote various high profile releases. For example, Sony used SAR last year to project a spectacular display onto Rochester Castle to promote the release of AC/DC's Iron Man 2 soundtrack.

Another advantage is that, since SAR requires no portable hardware, permanent installations can be constructed with significantly higher sensing and processing capability. However, the complexity of SAR is increased since tracking and registration must happen not only in the camera view; this view must also be projected out to properly overlay real world objects. This real world registration is the focus of this work. In order to project convincingly, the projector, camera and 3D sensors must be calibrated. Some work has been done already using traditional cameras and projectors, but with the introduction of 3D sensors, interesting new methods can be employed.

1.2.2 Advances in the Field of SAR

Many researchers have implemented outstanding AR demonstrations. However, these demonstrations are tailored to technology-savvy people and often lack cohesion and relevance to anything but AR. There is great demand in the research community for artistic and creative applications of AR. Also, from the creative world, many artists are aching to create work that incorporates innovative technologies. [1]

Each of the three methods of AR by nature overlay real world objects with digital information. Work has already been done by companies like BMW to apply this to maintenance of equipment. In their proof of concept material, a maintenance person looks through video goggles at a car engine. As he looks at the hardware, instructions are overlaid into his view. For example, the system would point to necessary bolts on the engine cover that need to be removed to reveal the engine. A system like this can similarly be used by soldiers in the field when approaching unfamiliar equipment, instantly giving them the ability to troubleshoot problems.

1.3 Previous Work at Temple

In the past year, we have explored each method of the technology of AR in a variety of ways. The main project involves ongoing work to implement an AR system into a performance by the Boyer College of Music and Dance. A fun project called Virtual Playdoh let users model objects in 3D with their hands. We have in the past looked at HMD systems as well as mobile based applications. A researcher in the Computer Fusion Laboratory has created an AR panorama capture application for Android. These projects have all been the focus of a lot of attention not only from web blogs but also from local news and inter-university publications.

2 Current and Related Work

In this chapter, related work that led up to this thesis will be outlined. In the past year, through our involvement with the School of Music and Dance, we have created many different applications for stage performance. All of these employ the use of active projection mapping in spatial augmented reality, where objects are overlaid with projected media. During the process of development, various problems have presented themselves that open up a vast area for fundamental research. In this chapter, it is important to pay special attention to the calibration procedures that create the relationship between the sensing device, such as the camera, and the projector, which feeds back into the camera view. This relationship is very important in order to perform overlay accurately, and in the past it has only been done with manual alignment, with only recent attempts at automatic calibration.

In this work we present a simplistic 2D calibration method, the proposed new implementations of existing full 3D calibration techniques, as well as a proposal of a totally novel method that uses an RGBD (RGB plus Depth) sensor. Our experiences with the RGBD sensor have proved that it is a great tool to enhance standard vision systems. Here we will discuss a standard camera-projector system which was able to track and overlay media onto a plain white panel; then several proof of concept implementations of the RGBD sensor are also outlined in this chapter.

Recently we wrote a paper entitled "Dance in Augmented Reality: Calibration and Applications" that has been accepted to the 8th ACM Conference on Creativity and Cognition. The following section will describe the preliminary work done in the area over the time of development of this thesis.

2.1 "Dance and Engineering in Augmented Reality"

This project focuses on discussing our experience with this exciting field, the collaboration with creative people and the effects it may have on enhancing STEM education, as well as some technical intricacies we deal with in the course of development of our own AR platform. This platform makes use of a simple camera along with a projector to visually augment a space with digital media.

We display a plain white panel marked off with infrared indicators which is tracked and augmented with video in real time. A vision and projection system that is resilient against noise and occlusion combined with creative media and artistic direction gives us the ability to augment a live performance with computer generated special effects. Digital effects have been widely implemented in films during post processing of video. Currently artists are challenged with static set design and limited lighting and sound capabilities. Intelligent projection mapping can improve this creative process, generate ideas from artists and enhance the performance for the audience.

2.1.1 Methodology

AR applications require the use of multi-view geometry to reconstruct a 3D model of the AR workspace from sequences of video frames. With this model of the workspace and the information about the camera’s gaze and position, it is possible to augment the scene with virtual objects. Because of the nature of the application we take a more simplistic approach as a trade-off for better performance. The system needs to be totally consistent in a live performance, so it is important to avoid potential failures at all costs.

Figure 4: In our current system, an infra-red camera (b) senses infra-red markers integrated into a panel (a). Software (c) then manipulates the coordinates of the markers to transform an image and sends it to a projection system (d) to impose it correctly into the environment.

Figure 4 generally outlines the mechanics of the projection system, a planar surface marked by infrared fiducial points is seen and detected with an infrared camera, then media is transformed and appropriately projected to accurately augment the scene.

To do this there are four general components:

• Calibration

• Detection

• Tracking

• Warping and Projection

Each of these components has the potential to be an area rich in fundamental research. In the next several sections, the implementation and integration of each component will be explained.

2.1.2 Calibration

In order to accurately overlay objects in real space with projection once the objects are detected, we need to figure out where each object is in relation to the projector that produces the overlay. This is a problem because the detection happens with a camera; we only know where the objects are in the two dimensional camera space represented by a pixel coordinate system. A relationship must be derived that essentially converts the camera coordinate system into the projector's image plane.

It is important to note that a projector can be considered to be like a camera, but in reverse. Assuming there is some displacement between the devices, at least four known image point pairs are needed, which are represented in the 2D coordinates of each image plane. Still thinking of the projector as a camera, we can say that two cameras viewing the same scene see objects from a different perspective. To make a relationship between the two cameras' views we have to consider perspective transformation.

Even though there are different methods to draw a scene, i.e. perspective projection, affine projection and orthographic projection, the way cameras and projectors see or output images is perspective projection. Thus our task is to transform one view (from the camera) to another (the projector).

This perspective transform is known as the homography [11]:

Figure 5: The homography is an invertible transformation from some plane in a projective space to another projective plane; even though the homography transformation is invariant to scale, it maps straight lines to straight lines.

$$\Pi_h = H \pi_h$$

$$\begin{pmatrix} x' \\ y' \\ 1 \end{pmatrix} \sim \begin{pmatrix} h_{11} & h_{12} & h_{13} \\ h_{21} & h_{22} & h_{23} \\ h_{31} & h_{32} & h_{33} \end{pmatrix} \begin{pmatrix} x \\ y \\ 1 \end{pmatrix}$$

$$x' = \frac{h_{11}x + h_{12}y + h_{13}}{h_{31}x + h_{32}y + h_{33}}, \qquad y' = \frac{h_{21}x + h_{22}y + h_{23}}{h_{31}x + h_{32}y + h_{33}}$$

The homography has 8 degrees of freedom, and to compute the 8 parameters of the homography, at least four point correspondences are required to obtain 8 equations. Since the homography is scale invariant, the last (ninth) parameter at (3, 3) is arbitrary and set to a unitary value by default.
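As a rough illustration of this step, the sketch below estimates such a homography from four correspondences with OpenCV and applies it to a new point. The point values and function name are hypothetical placeholders, not data or code from our system.

#include <opencv2/core/core.hpp>
#include <opencv2/calib3d/calib3d.hpp>
#include <vector>

// Estimate a homography H from four (or more) known point pairs and use it
// to map a new camera point into the projector image plane.
cv::Mat estimateHomographyExample()
{
    std::vector<cv::Point2f> cameraPts, projectorPts;
    // Hypothetical correspondences: where the points appear in the camera image
    // and where the same points lie in the projector image plane.
    cameraPts.push_back(cv::Point2f(10, 12));    projectorPts.push_back(cv::Point2f(0, 0));
    cameraPts.push_back(cv::Point2f(620, 15));   projectorPts.push_back(cv::Point2f(800, 0));
    cameraPts.push_back(cv::Point2f(615, 470));  projectorPts.push_back(cv::Point2f(800, 600));
    cameraPts.push_back(cv::Point2f(8, 465));    projectorPts.push_back(cv::Point2f(0, 600));

    // Four pairs give the 8 equations needed for the 8 degrees of freedom;
    // with more pairs, findHomography solves in a least-squares sense.
    cv::Mat H = cv::findHomography(cameraPts, projectorPts);

    // Map an arbitrary camera pixel into the projector plane. perspectiveTransform
    // performs the division by the third homogeneous coordinate.
    std::vector<cv::Point2f> src(1, cv::Point2f(320, 240)), dst;
    cv::perspectiveTransform(src, dst, H);

    return H;
}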

Once the homography is determined, it can be used for projective rectification such as exemplified below in Figure 6 [11].

Figure 6: The projective rectification here is transforming the perspective warped camera image to the ideal parallel image.

Similar methodology is used for calibrating the projector-camera system. We

need to use the homography to perspective transform the camera image to fit the

projection. As stated, at least 4 points are required and since this is a 2 view system,

a correspondence needs to be made between the two devices. In a two camera system

this would actually be more difficult since pattern and feature detection would have

to be implemented for both views. However, since the projector is a reverse camera,

the information in its image plane is known without any sort of processing because

we know what is being projected.

To initiate calibration, a chessboard calibration pattern is projected onto a plane in space. A camera located close to the projector's lens then views this scene and attempts to detect the calibration pattern using the chessboard detection function available in OpenCV.

bool findChessboardCorners(const Mat& image, Size patternSize,
                           vector<Point2f>& corners,
                           int flags = CV_CALIB_CB_ADAPTIVE_THRESH + CV_CALIB_CB_NORMALIZE_IMAGE)

findChessboardCorners thresholds the camera’s image to determine the positions of corners of the squares on the chessboard grid. These points are defined as the places where the internal corners of the black squares meet or touch each other. Once found, the corners are ordered row by row and left to right. Depending on the input size of the chessboard, many points could be identified. In our calibration we use a 6 x 8 grid, giving us 48 points.

Figure 7: The homography we determined perspective warps the captured image to undistort the camera's view and transform it into the projected image plane.

This grid is projected, and once it is detected with the camera, a homography can be computed between the pixel locations of the projected points as seen in the camera and the pixel locations of the corners in the calibration pattern as specified in software. This relationship is effectively a perspective rectification between the projected pattern and its view in the camera. Figure 7 above shows the undistorted chessboard with the homography we obtain.
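A minimal sketch of this calibration step is shown below. The helper names (cameraFrame, projectedCorners) are hypothetical, and the pattern dimensions are only illustrative; this is not the exact calibration code of the system.

#include <opencv2/core/core.hpp>
#include <opencv2/calib3d/calib3d.hpp>
#include <vector>

// cameraFrame: grayscale view of the projected 6 x 8 chessboard.
// projectedCorners: the 48 corner positions in projector pixel coordinates,
// known in advance because we generated the projected pattern ourselves.
cv::Mat calibrateHomography(const cv::Mat& cameraFrame,
                            const std::vector<cv::Point2f>& projectedCorners)
{
    cv::Size patternSize(8, 6);               // interior corners of the grid
    std::vector<cv::Point2f> cameraCorners;

    bool found = cv::findChessboardCorners(cameraFrame, patternSize, cameraCorners,
                                           CV_CALIB_CB_ADAPTIVE_THRESH +
                                           CV_CALIB_CB_NORMALIZE_IMAGE);
    if (!found)
        return cv::Mat();                     // pattern not visible, try again

    // H_C maps camera pixel coordinates into projector pixel coordinates.
    return cv::findHomography(cameraCorners, projectedCorners);
}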

Figure 8: The drawback with the current calibration system exists since the 2D homography is only valid for the plane at which calibration was done, so at different distances, parallax problems are observed.

The planar homography of a camera-projector system is valid only when an image is projected on a planar screen. This is because there is always a 2D homography between the image planes of any two projective imaging devices if the devices take pictures of a planar object. [11] This is a big drawback of the current calibration system. Since we have only calibrated at a single plane in the environment, the homography is not valid for any other plane in that environment. As you move away from the plane originally calibrated on, the re-projection error increases due to parallax problems. A similar method has been used by Soon-Yong Park and Go Gwang Park [18]. They solve this problem by calibrating to a physical chessboard pattern while projecting a marker and manually aligning. In this work, global camera-projector system calibrations are examined that overcome this problem.

Once we have obtained the homographic relationship between what the camera sees and what the projector projects, we save the 3 x 3 homography matrix for future use in the other components of the software. It is important to note that this calibration is only valid for a static projector and camera; once calibrated, neither must be moved. Having saved the homography $H_C$ we can start detecting objects in the camera view. These objects' coordinates can then be warped by the homography transformation and overlaid with media in the projector's view.

2.1.3 Detection

In the application, our goal is to track and overlay a white, rectangular projection panel. To overlay, we need the coordinates of fiducial points or the corners of the panel, as well as some information about its orientation. There is a variety of ways that can be implemented to track the surfaces.

One approach, proposed in "A Projector-based Movable Hand-held Display System" [14], discusses the implementation of a real-time quadrangle detection and tracking algorithm to detect and track the white panel. In this approach, line segments are detected and extracted from the camera's image using the Hough Transform algorithm; then, based on the line segment length, they are classified as lengths or widths of the projection panel.

The line segments $L = l_1, l_2, \ldots, l_n$ are considered features during the tracking and detection. They are obtained using the Hough line detector in OpenCV (below). Out of the lines detected in the image, consistent longer lines are examined. The quadrangle is assumed to have four lines that qualify by a set of criteria. Every segment should be longer than some threshold and the sides opposing each other should be of a similar length. Next, a criterion for the angle between the lines is also used; they should range somewhere between 30 and 150 degrees to compensate for perspective tilt. The authors also considered the ratio of overlap at each side, which needed to be at least 0.7. Here $s_1$ is the length of the line segment, $s_2$ the length of the side of the quadrangle once created, and $s_3$ the overlapping length of both $s_1$ and $s_2$.

The ratios of overlap that can be associated with each line segment of the quadrangle are:

$$r_1 = \frac{s_3}{s_1} \quad \text{and} \quad r_2 = \frac{s_3}{s_2}$$

This ratio criterion is needed to definitively determine whether the lines actually make up a rigid panel and are not things like highlights or reflections in the image.

Figure 9: Using the Hough transform in OpenCV allows for the detection of edges; analyzing these edges, we can find quadrangles in the image.
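For reference, a minimal sketch of the line-segment extraction step as we understand it, using OpenCV's probabilistic Hough transform, is shown below. The thresholds are illustrative and are not the values used in [14].

#include <opencv2/core/core.hpp>
#include <opencv2/imgproc/imgproc.hpp>
#include <vector>

// Extract candidate line segments from an edge map of the camera image.
std::vector<cv::Vec4i> extractSegments(const cv::Mat& gray)
{
    cv::Mat edges;
    cv::Canny(gray, edges, 50, 150);

    std::vector<cv::Vec4i> segments;      // each segment is (x1, y1, x2, y2)
    cv::HoughLinesP(edges, segments,
                    1, CV_PI / 180,       // 1 pixel and 1 degree resolution
                    80,                   // accumulator threshold
                    60,                   // minimum segment length
                    10);                  // maximum gap allowed when joining segments
    return segments;
}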

A benefit of this method is its independence from hardware markers. This is very valuable; however, in their results they claimed frame rates of less than 20 fps. The simple marker based method we implement works at camera frame rate, which is very important for the performance arts application. The markers used are white ping-pong balls that glow with infrared light. They are very easy to construct, consisting of two 800 nm infrared LEDs (1.2 V, 20 mA) on opposing sides of a perf-board and a 33 Ω resistor. The emitted IR light is diffused with ping-pong ball hemispheres. The five markers are configured in parallel and powered by a 3 V source attached to the back of the projection panel. This type of fiducial was chosen over other ways to determine fiducial information primarily because of its ease of application, relatively low computational complexity and reliability of maintaining a presence in the camera view.

(a) IR markers

(b) IR marker components

Figure 10: Infrared fiducial markers are made from two IR LEDs diffused by ping-pong ball hemispheres.

Another benefit of this method is that it does not require constant tracking. The addition of a fifth marker not only acts as an aid in determining orientation but also makes this system extremely robust against occlusion. When the fifth marker is placed off center between the fiducial points, it establishes an ideal configuration that we must search for in the captured images.

This view is thresholded to detect 'blobs' which describe the fiducial markers. With dynamic movement of the performer, the challenge becomes robustly tracking the fiducial points. In the figure below it can be seen that the markers are very prominent in the camera view and are very easy to see and analyze.

Figure 11: The camera view is thresholded based on luminance in the red channel. It can be seen that using the LEDs is a good solution for a dynamic application such as dance where tracking success is very important.

The blob tracking algorithm used has existed for over 20 years; it was first presented by Gary Agin, then implemented by Dave Grossman of OpenCV. As the frame is raster scanned, regions that stand out are numbered and classified along with regions already encountered on a lower row. It is implemented in a two stage process: first over the rows, then over the starting and ending columns of a region. [10] For the x centroid $\bar{X}$, add the contribution of each of the rows. For every row, the positions of an ROI are at $y_1, y_2, \ldots, y_n$; thus sum the values $\sum_{i=1}^{n} y_i$, or equivalently $n \cdot \frac{y_1 + y_n}{2}$. Once this operation has been executed for all rows, the accumulated value has to be divided by the area $A$. For the y centroid $\bar{Y}$: in each row $x$, if the length is $n$, then sum $x \times n$ and divide by the area $A$.
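The same centroids can also be obtained with standard OpenCV building blocks. The sketch below is an equivalent approach (thresholding plus image moments), not the raster-scan implementation described above, and the threshold value is illustrative.

#include <opencv2/core/core.hpp>
#include <opencv2/imgproc/imgproc.hpp>
#include <vector>

// Threshold the IR camera view and return one centroid per detected blob.
std::vector<cv::Point2f> findMarkerCentroids(const cv::Mat& irGray)
{
    cv::Mat bin;
    cv::threshold(irGray, bin, 200, 255, cv::THRESH_BINARY);   // keep only bright IR spots

    std::vector<std::vector<cv::Point> > contours;
    cv::findContours(bin, contours, CV_RETR_EXTERNAL, CV_CHAIN_APPROX_SIMPLE);

    std::vector<cv::Point2f> centroids;
    for (size_t i = 0; i < contours.size(); i++) {
        cv::Moments m = cv::moments(contours[i]);
        if (m.m00 > 1e-3)                                       // skip degenerate blobs
            centroids.push_back(cv::Point2f(m.m10 / m.m00, m.m01 / m.m00));
    }
    return centroids;
}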

2.1.4 Tracking

Once blob detection is implemented and the centroid positions of the blobs are known, we need to analyze this information and determine which blob corresponds to which position on the projection panel. The orientation point helps with this. To determine the proper order of points and orientation, we must sort the blobs during every captured frame. The ideal blob pattern is known to be rectangular with one blob off center encapsulated by the rest. As the projection plane moves through the camera frame, the perspective can change to where the pattern is warped from this ideal to where the sides are no longer parallel. Many researchers use consistent tracking to determine the positions of objects in the camera frame. In order to recover from occlusions, algorithms such as mean shift are used. In this case, mean shift is difficult since the surface is featureless and of uniform color, and consistent tracking is difficult to use for occlusion recovery. Other methods that involve consistent tracking require generous memory resources as well as complex computation techniques. Knowing the four corners of the quadrangle of known size, and because calibrated cameras are used, the relative rotation and translation (or pose) can be represented in the form:

$$q_k = [r_x, r_y, r_z, t_x, t_y, t_z]$$

Here, $r_x, r_y, r_z$ represent the rotation, and $t_x, t_y, t_z$ correspond to the translation of the quadrangle in the camera space. Each position of the panel is represented as a discrete particle, with the posterior density $P(q_k \mid q_{k-1})$ being determined by a particle filter. [7] The filter's state dynamic model shows how this pose changes from frame to frame during capture, while the observation model assigns weights to the data. The current image is observed as $y_k$, so the past frames up to the current one can be defined as $y_{1:k}$. The density is determined approximately as $P(q_k \mid y_{1:k})$. The particle filter generates a set of weighted samples that describe the pose of the projection panel, $((q_k^1, w_k^1), \ldots, (q_k^S, w_k^S))$, with $S$ representing the number of samples.

In this method [14], the state dynamic model is based on uniform density U and an uncertainty factor e from the unpredictable movement of the panel. The dynamic model is then described as:

$$P(q_k \mid q_{k-1}) = U(q_{k-1} - e,\; q_{k-1} + e)$$

The interesting characteristic of the filter approach in [14] is the observation model. Since the features in the frame consist of the panel's edges in the current image $k$, $y_k = L^k = l_1^k, l_2^k, \ldots, l_n^k$, the particles can be reprojected onto the image. Using a line comparison technique that checks the angles and distances between corresponding lines (or features) in the image, the best matched line segments can be assigned higher weights based upon some criteria. Bad matches are assigned very low weighting factors. Panel fiducials are then extracted from the higher weighted particles and replace the current ones. Likelihood is determined from the overlap ratios determined in tracking:

$$P(y_k \mid q_k^n) = \sum_{t=1}^{4} r_1^t \times r_2^t$$

Finally, with the likelihood determined, the weights of each particle are given by [14]:

$$w_k^n = \frac{P(y_k \mid q_k^n)}{\sum_{n=1}^{S} P(y_k \mid q_k^n)}$$
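As a small illustration of the two formulas above (a sketch only, not code from [14]), the likelihoods could be normalized into particle weights as follows:

#include <vector>

// likelihoods[n] holds P(y_k | q_k^n) for each of the S particles,
// already computed from the overlap ratios r1 and r2.
std::vector<double> normalizeWeights(const std::vector<double>& likelihoods)
{
    double sum = 0.0;
    for (size_t n = 0; n < likelihoods.size(); n++)
        sum += likelihoods[n];

    std::vector<double> weights(likelihoods.size(), 0.0);
    if (sum > 0.0)
        for (size_t n = 0; n < likelihoods.size(); n++)
            weights[n] = likelihoods[n] / sum;   // w_k^n from the equation above
    return weights;
}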

From this result, the relative orientation of the panel is known in each frame, and the fiducial coordinates of the board can be used for projection of augmentation once modified by the homography from the calibration procedure in section 2.1.2.

In order to increase frame rate and speed, especially with low cost camera equipment, the approach developed for "Dance and Engineering in Augmented Reality" does not implement any filtering for tracking. The tracking procedure uses only information in the current frame to determine the pose and orientation of the projection panel. This method is fast, practical and immune to recovery problems from occlusion.

(a) Projection Panel (b) Augmented

Figure 12: The projection panel (a) consists of a plain white cardboard panel with an IR LED marker at each corner and one in the middle, off center. This is tracked and overlaid with projection in (b).

To find the projection panel in the single frame, we assume that the position of each marker on the panel is known in the coordinate system of the panel $\Pi$, and that the configuration of the LEDs is known to be four at the fiducial points with an orientation point off center. The $i$th LED's position can then be $P^i = (x_i, y_i, 0, 1)^T$, with $z$ being zero because all points are on the plane.

Figure 13: The detected blobs are classified by comparing the set to an ideal blob pattern. Each frame, we cycle through blob orders and compute a perspective transformation between the ideal and current pattern configuration.

In order to make a correspondence, the centroids of the LEDs are unprojected, removing lens distortions using the camera matrix obtained during calibration. The points are unprojected as follows:

$$\begin{pmatrix} \tilde{u}_i \\ \tilde{v}_i \end{pmatrix} = \mathrm{CamUnproject}\begin{pmatrix} u_i \\ v_i \end{pmatrix} = \begin{pmatrix} \dfrac{u_i - u_0}{f_u}\,\dfrac{r'}{r} \\[1.5ex] \dfrac{v_i - v_0}{f_v}\,\dfrac{r'}{r} \end{pmatrix}$$

Here $u_0$ and $v_0$ are the camera centers and $f_u$, $f_v$ are the horizontal and vertical focal parameters. The distortion $r$ is determined directly by $\sqrt{\left(\frac{u_i - u_0}{f_u}\right)^2 + \left(\frac{v_i - v_0}{f_v}\right)^2}$.
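In practice this unprojection can also be delegated to OpenCV, which applies the same normalization and radial correction given the calibrated camera matrix and distortion coefficients. A hedged sketch with placeholder intrinsic values follows; it is not the system's own implementation.

#include <opencv2/core/core.hpp>
#include <opencv2/imgproc/imgproc.hpp>
#include <vector>

// Map raw pixel centroids to normalized, undistorted image coordinates.
// K and distCoeffs come from the camera calibration; the values here are placeholders.
std::vector<cv::Point2f> unprojectCentroids(const std::vector<cv::Point2f>& pixels)
{
    cv::Mat K = (cv::Mat_<double>(3, 3) << 525.0,   0.0, 319.5,
                                             0.0, 525.0, 239.5,
                                             0.0,   0.0,   1.0);
    cv::Mat distCoeffs = cv::Mat::zeros(5, 1, CV_64F);   // plug in the real coefficients

    std::vector<cv::Point2f> normalized;
    // Output is ((u - u0)/fu, (v - v0)/fv) with the radial correction applied,
    // i.e. the same quantity computed by CamUnproject above.
    cv::undistortPoints(pixels, normalized, K, distCoeffs);
    return normalized;
}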

Once the markers are detected and unprojected into the image plane, their centroids are known in the camera's pixel coordinates. Because of our five point configuration, it is known that the orientation point will always be enclosed between the other four. To determine which four points are on the outside, we use the convex hull algorithm. As a result of the convex hull, we are given the indices of the blobs that are on the outside of the panel, the fiducial points. The remaining point is the orientation point. The problem with this is that the order of the indices is not necessarily correct with respect to the panel's orientation. It could be that the board is sideways or upside down, and since the indices are just ordered clockwise, we need to find some way to recover and match the orientation.

This matching requires some prior knowledge of the configuration of points; this will let us make a correspondence quickly at frame rate. Different orders are repeatedly chosen from these outside points. Since it is known that the points are fiducials, instead of getting permutations, it is quicker to move through the indices in a circular pattern. This is shown in the figure below. For every different point combination, a planar homography $H_N$ warps the four fiducials to a homogeneous plane. $H_N$ satisfies the relationship below. A similar method is proposed in 'Outside In LED tracking' [13]; however, it is not as robust, since it lacks the implementation of the convex hull, and combining permutations of points is slower than the circular combination method.

$$\begin{pmatrix} c_1 a_1 & c_2 a_2 & c_3 a_3 & c_4 a_4 \\ c_1 b_1 & c_2 b_2 & c_3 b_3 & c_4 b_4 \\ c_1 & c_2 & c_3 & c_4 \end{pmatrix} = H_N \begin{pmatrix} x_1 & x_2 & x_3 & x_4 \\ y_1 & y_2 & y_3 & y_4 \\ 1 & 1 & 1 & 1 \end{pmatrix}$$

where

$$\begin{pmatrix} a_1 & a_2 & a_3 & a_4 \\ b_1 & b_2 & b_3 & b_4 \end{pmatrix} = \begin{pmatrix} 0 & 1 & 1 & 0 \\ 1 & 0 & 0 & 1 \end{pmatrix}$$

The last point, the orientation point, is also transformed by the homography, $O' = [c_5 a_5 \;\; c_5 b_5]^T$. This is the point that we are most interested in. It will act as a key to determine whether the combination of points is a correct one.

Using this relationship, the LED that is the key (orientation) point is expected to lie close to its image-plane analog. This way, the combination with the smallest error between the orientation points will reveal the correct order of the points. Once identified, this combination will reveal the pose of the projection panel. We compare the transformed orientation point of the current LED order, $O'$, to the ideal orientation point position on the homogeneous plane, $O^i$. The method used to determine the error in the orientation point's position is a simple L2 norm:

$$|e| = \sqrt{\sum_{k=1}^{n} |O^i|^2 - \sum_{k=1}^{n} |O'|^2}$$

Once the combination with the smallest $e$ has been found, this combination is used to describe the pose of the panel for warping the media and overlaying with projection. We denote this proper configuration as $P_c$ in the camera coordinate frame.
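A condensed sketch of this ordering search is given below. The helper names, the ideal configuration arguments and the error test are illustrative; it is meant only to show how the convex hull and the circular shifts fit together, not to reproduce the system's implementation.

#include <opencv2/core/core.hpp>
#include <opencv2/imgproc/imgproc.hpp>
#include <opencv2/calib3d/calib3d.hpp>
#include <algorithm>
#include <cmath>
#include <vector>

// blobs: the five unprojected marker centroids in one frame.
// idealCorners / idealOrientation: the known panel configuration on the homogeneous plane.
std::vector<cv::Point2f> orderFiducials(const std::vector<cv::Point2f>& blobs,
                                        const std::vector<cv::Point2f>& idealCorners,
                                        const cv::Point2f& idealOrientation)
{
    // The convex hull gives the indices of the four outer fiducials;
    // the remaining blob is the orientation point.
    std::vector<int> hullIdx;
    cv::convexHull(blobs, hullIdx, true);

    std::vector<cv::Point2f> outer;
    cv::Point2f orientation;
    for (int i = 0; i < (int)blobs.size(); i++) {
        if (std::find(hullIdx.begin(), hullIdx.end(), i) != hullIdx.end())
            outer.push_back(blobs[i]);
        else
            orientation = blobs[i];
    }

    // Try the four circular shifts of the hull order; keep the one whose homography
    // maps the orientation point closest to its ideal position.
    std::vector<cv::Point2f> best;
    double bestErr = 1e30;
    for (int shift = 0; shift < 4; shift++) {
        std::vector<cv::Point2f> candidate;
        for (int j = 0; j < 4; j++)
            candidate.push_back(outer[(j + shift) % 4]);

        cv::Mat HN = cv::findHomography(candidate, idealCorners);
        if (HN.empty())
            continue;

        std::vector<cv::Point2f> in(1, orientation), out;
        cv::perspectiveTransform(in, out, HN);
        cv::Point2f d = out[0] - idealOrientation;
        double err = std::sqrt(d.x * d.x + d.y * d.y);
        if (err < bestErr) { bestErr = err; best = candidate; }
    }
    return best;   // the properly ordered fiducials P_c
}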

2.1.5 Warping and Projection

This is the final component of the system. After obtaining the homography relationship between the camera and projector and finding the fiducial points of our panel in the camera space, we need to warp their coordinates to fit the panel in the projector space using the calibration homography $H_C$. Using the properly ordered points in the camera frame, $P_c$, we transform to the projector's points $P_p$ as $P_p = H_C \times P_c$.

Fitting media to these new coordinates is made simple with the use of OpenGL. The display environment for this system is set up in an OpenGL window where we are given the ability to draw shapes by specifying vertices. When the shapes are drawn they can be texturized with outside media such as images or videos. We draw vertices in the OpenGL environment using $P_p$ and fit a texture accordingly. In our software, this is defined as the "WarpAblePlane" class that contains a warp method:

void WarpAblePlane::warp(const cv::Mat& H) {
    cv::Mat cr;
    cv::transform(cv::Mat(corners), cr, H);
    for (int i = 0; i < cr.rows; i++) {
        cv::Point3f p3f = cr.at<cv::Point3f>(i);
        corners[i].x = p3f.x / p3f.z;
        corners[i].y = p3f.y / p3f.z;
    }
}

Here the points of the quadrangle are transformed and normalized by the z coordinate. The application of the texture occurs as follows:

const GLfloat vVertices[] = {
    corners[0].x, corners[0].y, 0.0f,
    0.0f, 0.0f,                        // TexCoord 0
    corners[1].x, corners[1].y, 0.0f,
    1.0f, 0.0f,                        // TexCoord 1
    corners[2].x, corners[2].y, 0.0f,
    1.0f, 1.0f,                        // TexCoord 2
    corners[3].x, corners[3].y, 0.0f,
    0.0f, 1.0f                         // TexCoord 3
};

const GLfloat* tVerts = vVertices + 3;
GLsizei stride = 5 * sizeof(GLfloat); // 3 for position, 2 for texture

texture.bind();
glPushMatrix();
glEnableClientState(GL_VERTEX_ARRAY);
glVertexPointer(3, GL_FLOAT, stride, vVertices);
glEnableClientState(GL_TEXTURE_COORD_ARRAY);
glTexCoordPointer(2, GL_FLOAT, stride, tVerts);

This excerpt of code shows how the texture corners are fit to the corners of the quadrangle, where the corners vector contains the properly ordered $P_p$. The resulting projector window appears as follows:
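For completeness, the excerpt above only sets up the vertex and texture coordinate arrays; a sketch of the remaining legacy OpenGL calls that would issue the draw and restore state (not necessarily the exact calls in our renderer) is:

// Draw the textured quadrangle and restore state.
glDrawArrays(GL_TRIANGLE_FAN, 0, 4);          // four perimeter-ordered vertices cover the quad

glDisableClientState(GL_TEXTURE_COORD_ARRAY);
glDisableClientState(GL_VERTEX_ARRAY);
glPopMatrix();                                // matches the earlier glPushMatrix()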

Figure 14: The final result of the warped media in OpenGL, when this is made full screen in the projector’s window, it overlays the projection panel.

The final result of the overlay (Figure 15) appears accurately on the projection panel and follows with minor delay as the panel is moved through the projection space. This system is robust against occlusion and works in all indoor lighting conditions, but suffers from interference from sunlight IR.

There are several problems with the method of calibration of this system; they will be discussed in more detail in the Calibration Methods section.

Figure 15: The final result of the warped media in OpenGL, when this is made full screen in the projector's window, it overlays the projection panel.

2.2 Microsoft Kinect "Hacks"

The computer is now a tool for virtually any job, but ever since the introduction of the PC, the primary tools for human-computer interaction have been the keyboard and mouse. Novel techniques from the world of console gaming have influenced the way we interact with computers, not only for enhancing entertainment but for work and productivity as well. At times, complex tasks lack natural tools when performed on computers.

The Microsoft Kinect has been a game changing piece of hardware. Its actual purpose as a controller for the XBOX console is to allow players to interact with the gaming environments using their bodies alone, without any sort of handheld input devices, as had never been done in gaming previously.

The device is essentially a 3D camera, capable of capturing full spatial geometry in its field of view. The Kinect contains a conventional RGB camera, an IR camera, and an infrared laser that is refracted by a special lens to create a series of IR laser dots that cover the space. The IR camera sees these dots and, based on the disparity between nearest neighbors, the PrimeSense chip (like a GPU) computes the relative distance of objects in the space.

Figure 16: In the Microsoft Kinect, a laser beam refracted through a lens creates a matrix of laser dots in the space; based on the disparity in the dots, the depth of objects in the space can be determined.

This creates a point cloud that can be overlaid with the RGB camera image, effectively assigning a depth to each pixel in the camera frame. Figure 16 displays how this technique works. After obtaining a 3D point cloud and supplementing it with RGB information, intelligent algorithms can be used to detect bodies or objects in this point cloud. The 3D information makes detecting joints and a skeleton much easier than methods that were used previously using just camera information. As a result this controller is robust enough and practical to use in commercial applications such as console games.
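As an illustration of how a single depth pixel becomes a 3D point, a generic pinhole back-projection sketch is shown below; the focal length and principal point values are placeholders, not the Kinect's calibrated parameters.

#include <opencv2/core/core.hpp>

// Back-project one depth pixel (u, v) with metric depth z into a 3D point
// in the depth camera's coordinate frame.
cv::Point3f backProject(int u, int v, float z)
{
    const float fx = 580.0f, fy = 580.0f;   // placeholder focal lengths (pixels)
    const float cx = 319.5f, cy = 239.5f;   // placeholder principal point

    cv::Point3f p;
    p.x = (u - cx) * z / fx;
    p.y = (v - cy) * z / fy;
    p.z = z;
    return p;
}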

Even before the Kinect was officially released, hobbyists as well as researchers immediately saw the potential of an affordable 3D sensor. [19] When the Kinect was first released for sale, the editors of Make magazine, a publication which focuses on DIY projects, announced that they would be offering a $3,000 bounty to anyone who managed to release free software to make the Kinect work with a computer. This happened within days, and as soon as the software was released "hacks" began to surface. First, individual developers used the hacked software; soon after, Microsoft embraced the hacker community and an official API, OpenNI, was released. This API, though proprietary to the company that developed the Kinect (PrimeSense), let developers track the human skeleton from the 3D point cloud. With this the Kinect quickly became adapted from a gaming console controller to the PC.

Figure 17: The depth information captured with an open source Kinect driver is 640 x 480 x 640. The units are all relative to the origin of the camera.

Seeing the potential of the device, we have purchased several Kinects and incorporated them into the Dance and Engineering in Augmented Reality project.

2.2.1 Kinect at Temple

As a result of investigating the Kinect open source software and the OpenNI API, we have created several proof of concept projects in Augmented Reality. Most of these have been done in the context of performance art. The goal was to demonstrate interactive special effects that potentially could be implemented in a stage setting.

The software libraries and tools that we have implemented include both the open source drivers and OpenNI for the Kinect interface, OpenCV for easy implementation of image processing algorithms, and OpenFrameworks because of its excellent improvement over the OpenGL graphics API. This allowed us to create impressive works that capture audiences' attention and have won awards at recent competitions.

All these projects employ the same homography based calibration technique that was used in DEAR: after the Kinect is calibrated using the RGB image, a correspondence is made in the 3D point cloud. Then the coordinates of objects or body parts that are detected with the 3D sensor are translated into the projector space and overlaid with projection. In the following sections, the different effects demonstrations will be outlined.
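Concretely, once the calibration homography is saved, mapping a detected point (for example, a hand centroid in the RGB image) into projector coordinates is a single perspective transform. A minimal sketch, assuming the homography H has already been loaded, follows; the function name is hypothetical.

#include <opencv2/core/core.hpp>
#include <vector>

// H: the saved 3 x 3 camera-to-projector homography from calibration.
// cameraPoint: e.g. a hand centroid detected in the RGB image.
cv::Point2f toProjector(const cv::Mat& H, const cv::Point2f& cameraPoint)
{
    std::vector<cv::Point2f> in(1, cameraPoint), out;
    cv::perspectiveTransform(in, out, H);   // divides by the homogeneous coordinate
    return out[0];
}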

2.2.2 Getting Usable Sensor Data

With the implementation of OpenFrameworks, generating appealing effects becomes easy. In these demos, we did not use the proprietary OpenNI API and instead captured the depth information using open source Kinect drivers. In this depth image, the Kinect automatically zeroes values which are undefined because of occlusion. There is also a lot of noise in the raw depth image. So before being able to use the depth information, some kind of filtering was required.

To smooth the raw depth image, the inpainting algorithm was used. To explain simply, the way it works is that after the image is segmented, the texture or color at the border of defined segments is used to fill and propagate this color into the areas where the image is 'damaged'. [5] After segmentation, the areas are defined by convex hulls and the color information around the perimeter is mixed inside the regions. The result of this smoothing is shown in Figure 18:

Figure 18: The smoothed depth image from the raw Kinect data is determined with the help of OpenCV's inpaint technique.
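A minimal sketch of this smoothing step is shown below; the mask definition and inpainting radius are illustrative, and the demo code may differ.

#include <opencv2/core/core.hpp>
#include <opencv2/imgproc/imgproc.hpp>
#include <opencv2/photo/photo.hpp>

// rawDepth: 8-bit depth image from the Kinect driver with undefined pixels set to 0.
cv::Mat smoothDepth(const cv::Mat& rawDepth)
{
    // Mask of the 'damaged' regions: everywhere the sensor reported no depth.
    cv::Mat mask = (rawDepth == 0);

    cv::Mat filled;
    cv::inpaint(rawDepth, mask, filled, 3.0, cv::INPAINT_TELEA);   // 3 px inpainting radius
    return filled;
}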

Once a smooth depth image is obtained, various image processing techniques can be used to detect and track objects in the image. For the following demos, simple image subtraction, contour tracing and blob detection are used to find the users' hands, bodies and changes in the scene.

2.2.3 Interactive Particle Systems

The demonstration in Figures 19 and 20 shows the effect that the dancers are currently working with and employs binned particle systems in OpenFrameworks. Changes in the scene are detected within the depth image and attraction forces are set to where these changes occur. This technique is preferred by the performers because of its visual appeal and predictability. They are looking for predictable changes in the visualization that react to their movements. Another benefit is reaction to multiple sources of change; the particle system here supports multiple sources of attraction.

Figure 19: The particle system uses attraction and repulsion forces for each particle as well as a larger attraction force to user’s hands that are detected in the scene.

We experimented with a different particle system that uses tree and leaf structures; this is, however, more computationally intensive, so fewer particles can be rendered at the appropriate frame rate. This one, instead of setting attraction forces to users' hands, applies them to any changes in the scene.

37 Figure 20: A user waves a stick which attracts a swarm of particles.

2.2.4 Superpowers with Projection

This demo in Figure 21 shows the type of special effects that can be used to augment a performance. Here users' hands are tracked and overlaid with projections of trail and lightning effects. Currently this is implemented with only a single user. A contour of a body is found in the depth image, the extreme left and right points are defined to be the hands, and these are then tracked using a method similar to the particle filter used in DEAR.

Figure 21: Lightning and trail effects are projected over the performer’s hands creating an illusion of superpowers.

2.2.5 Kinect 3D modeling

This project is an updated version of last year's "Virtual Playdoh". Previously, Wii remotes were used as infrared cameras to track an LED control pen in stereo. It was developed around the idea that computers still lack a way to give a user a feeling of natural interaction for tasks like 3D modeling. The audience for this presentation consisted of 4th and 5th graders, so we needed to convey these ideas of human computer interaction in a way that kids would understand. Play-Doh was the most natural example of 3D modeling. To further add to the realism, the modeling scene was rendered in anaglyphic 3D. Since kids love Play-Doh and 3D glasses, and especially love playing video games, this presentation was a big hit.

Figure 22: Virtual Playdoh allowed users to mold a warp-able NURBS surface using both of their hands and rendering in anaglyphic 3D.

This year, to create a contrast between the gaming technologies, we used OpenNI to integrate the Xbox Kinect into Virtual Playdoh to track a user's hands instead of a physical tool like the control pen. As part of the presentation, we explain how anaglyphs (red/cyan 3D images) and stereo tracking work, as well as the technology behind the Kinect. The kids find out how their favorite video games work, and we trick them into getting interested in science, math and engineering.

2.2.6 The Point Cloud Library

In an effort to further distance ourselves from the closed source OpenNI, we are investigating methods to work with the raw point cloud data obtained from the Kinect sensor. Very recently an open source library, the Point Cloud Library (PCL) [4], was released as a standalone beta by Willow Garage. The PCL framework contains numerous state-of-the-art algorithms including filtering, feature estimation, surface reconstruction, registration, model fitting and segmentation.

Algorithms packaged with the library can be used to determine outliers in noisy point cloud data, combine clouds by stitching, segment relevant objects in the scene, extract descriptors and recognize objects based on the geometry captured in the scene. Apart from computational tools, the library also provides robust visualization tools, so less time is spent on rendering and more effort can be dedicated to the development of algorithms. This is still a very new field with a lot of potential for breakthrough publications and research. The library contains many bugs and is seeking help from independent developers and researchers. After the calibration problem has been solved, most of the focus will be on analysis using techniques like those provided in this library. Though still very young, as 3D sensors become more accessible, PCL is likely to become the OpenCV of 3D point clouds.
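As an illustration of the kind of processing PCL offers, the following hedged sketch applies PCL's statistical outlier removal filter to a Kinect point cloud. The wrapper function and parameter values are our own assumptions, not code from this project.

```cpp
#include <pcl/point_types.h>
#include <pcl/filters/statistical_outlier_removal.h>

// Remove points whose mean distance to their neighbors is unusually large,
// a typical first step for noisy Kinect clouds.
pcl::PointCloud<pcl::PointXYZ>::Ptr
removeOutliers(const pcl::PointCloud<pcl::PointXYZ>::Ptr& cloud)
{
    pcl::PointCloud<pcl::PointXYZ>::Ptr filtered(new pcl::PointCloud<pcl::PointXYZ>);

    pcl::StatisticalOutlierRemoval<pcl::PointXYZ> sor;
    sor.setInputCloud(cloud);
    sor.setMeanK(50);              // neighbors used to compute the mean distance
    sor.setStddevMulThresh(1.0);   // points beyond one sigma are discarded
    sor.filter(*filtered);
    return filtered;
}
```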

3 Calibration Methods

3.1 Projector Calibration Methods

Because the focus of much of this work is Spatial Augmented Reality, accurate projection is very important. We must be able to accurately project onto any point in the projector's field of view. For all of the work described above, the simple planar homography calibration method [18] was used to achieve acceptable overlay. This method was described in the calibration section of the DEAR project.

Even though this is a practical way to do calibration, the method is not accurate at different depths. Since the projector - camera system is an example of two view camera geometry, it suffers from parallax issues. To put it simply, one can imagine focusing on an object at some depth (the calibration depth); when another object is held closer to the eyes, an illusion of two objects is created, because two eyes form a two view system. The further that object is from the focused depth, the larger the disparity between the two objects in the illusion. As humans we counter this issue by constantly refocusing and rotating our pupils to follow objects around scenes. As a result we cannot focus on both far and near planes without the views splitting and creating an illusion of duplication. When closing one eye, however, even though the objects away from the plane in focus are blurry, no parallax exists, emulating a single camera.

Our main objective is to calibrate the projector - camera system in such a way that, instead of focusing on a certain depth as in the planar calibration method, we focus on the entire scene, like closing one eye. The camera then has to be replicated within the projector: to project onto what the camera is seeing, the projector has to have the same focal parameters, and the scene that is projected out has to be rotated and translated according to the displacement between the projector and the camera.

The goal of the calibration is to determine the projector's camera matrix and the displacement between the camera and projector. This matrix consists of the focal parameters and defines the frustum of the projector. For a camera it dictates the transformation of 3D real space coordinates to 2D coordinates in the camera's image plane. The classic pinhole relation, illustrated in Figure 23, shows how real world coordinates $M'$ relate to image coordinates $m'$ through the camera matrix $A$ (the intrinsic parameters of the camera) and the rotation and translation (extrinsic parameters) $[R|t]$:

$$m' \propto A [R|t] M'$$

Figure 23: The pinhole camera model maps real space to image space.

There has been some work done in the area of camera-projector systems in recent years, with most publications dated after 2005, so this is still a developing focus for academic researchers. Our proposed method for projector calibration makes use of the new 3D sensors and will be compared to the following existing projector calibration methods.

3.1.1 Arbitrary Planes and Calibrated Camera

This method, 'Projector Calibration Using Arbitrary Planes', defines the projective relationship between the projector's projection space and a two dimensional calibration pattern. It differs from the planar calibration generally used to rectify keystone effects, and it is the basis for systems that perform measurement in three dimensions. It is centered around the idea that the light emitted from the projector can be modeled as a reverse pinhole camera in perspective projection. The authors note that existing techniques use special objects like complex 3D calibration patterns and active patterns to perform calibration; they avoid having to project onto elaborate objects with enough depth characteristics and simply employ the standard camera calibration technique using a chessboard pattern. The result is a method that uses only a calibrated camera and a planar calibration pattern, and that is practical, accurate and not computationally intensive. The projection model for a standard projector is similar to that of a pinhole camera; the only difference is the direction of projection. Perspective projection can be described as:

$$\tilde{m} \propto P\tilde{M}$$

where $\tilde{M} = [x, y, z, 1]^T$ describes object space and $\tilde{m} = [u, v, 1]^T$ describes image space. This relationship is created by $P$, the matrix of intrinsic and extrinsic parameters:

$$P = \begin{bmatrix} \alpha & 0 & C_x \\ 0 & \beta & C_y \\ 0 & 0 & 1 \end{bmatrix} [R \mid t]$$

The intrinsic matrix contains $\alpha$ and $\beta$, the focal lengths (the distance between the camera center and the image plane) along the two image axes; two different lengths are used in the x and y directions because pixels in the camera's sensor are usually not perfectly square. The extrinsic matrix, as for a camera, consists of a rotation $R$ and a translation $t$. Since projectors and cameras both exist within the framework of perspective projection, any arbitrary plane in the space yields a homography relationship:

$$\tilde{m}_a \propto H_{ba}\tilde{m}_b$$

The points $\tilde{m}_b$ are 2D coordinates in the projector's projected pattern, while the points $\tilde{m}_a$ are the coordinates of the same pattern in the camera's 2D image. The inverse relationship $H_{ab}$ exists as the inverse matrix of $H_{ba}$. Considering this, the projector and camera pair must also have a fundamental matrix describing the epipolar geometry between the two. This matrix is a result of the two projections, from camera and projector, and an epipole:

$$F_{ba} \propto [\tilde{e}_a]_\times P_a P_b^{+}$$

To determine the fundamental matrix $F_{ba}$, a number of homographies must be determined by translating and rotating a plane through the camera and projector space. Consider $H_{ba_i}$ and $H_{ab_i}$ to be the homographies for plane $i$; an indefinite homography $H_{aa}$ is then determined from two different positions of the plane:

$$H_{aa} = H_{ba_i} H_{ab_j} \quad \text{where } i \neq j, \qquad \tilde{m} \propto H_{aa}\tilde{m}$$

The point $\tilde{m}$ must lie on the intersection line of planes $i$ and $j$, and must be a point that is projected from the focal point of $P_b$ onto the image plane of projection $P_a$. Thus the epipole $\tilde{e}_a$ is determined by decomposition of $H_{aa}$ [11]. From the epipole and the arbitrary plane homography, the matrix $F_{ba}$ is then calculated:

$$F_{ba} \propto [\tilde{e}_a]_\times H_{ba_i}$$

The same procedure is applied for the epipole $\tilde{e}_b$ and the fundamental matrix $F_{ab}$ [15]. Assuming the camera's intrinsic matrix is known, we let $P_a$ be the known camera intrinsics and $P_b$ the unknown projector intrinsics. From the above relationships it can then be said:

$$[\tilde{e}_a]_\times H_{ba} P_b \propto [\tilde{e}_a]_\times P_a$$

$$[\tilde{e}_b]_\times H_{ab} P_a \propto [\tilde{e}_b]_\times P_b$$

Since the camera matrix $P_a$ is assumed to be known, both of the planar homographies and the epipoles are determined, with $P_b$ still unknown. In the estimation, some reasonable intrinsic matrix is set, then the extrinsic and intrinsic parameters are optimized to fit the conditional expressions with the homographies and epipoles. This is straightforward since the focal points of both projector and camera are restricted by back projection lines on the epipoles $e_a$ and $e_b$; because of this the extrinsic matrix has only two degrees of freedom, as shown in Figure 24. Using an arbitrary scale factor in the homography [15], both the intrinsic and extrinsic parameters of $P_b$ are optimized with the same conditional expressions using the acquired homographies. The authors claim that when using two XGA cameras and an XGA projector, the re-projection using the calibrated projector was evaluated to have only 0.4 pixel ambiguity, which is close to the margin of error of the calibration of the cameras they used. The high accuracy is attributed to the evaluation of multiple homographies simultaneously.

Figure 24: The two degrees of freedom, one in rotation and one in translation.

3.1.2 Ray-Plane Intersection Method

This method's goal is to uncover the extrinsic and intrinsic parameters of the projector and camera in a projector - camera system. As before, both devices can be modeled with the pinhole camera geometry model, whose parameters include the image center or principal point, the focal lengths, the pixel size and the skew factor, with the extrinsics being the rotation and translation from world coordinates to the projector or camera coordinate system. The method requires determining the 3D points of the corners in the projected pattern and using them, together with the two dimensional points in the projected image, to get the intrinsics and extrinsics of the projector. The method can be broken up into several steps:

• Calibrate the camera

• Detect physical chessboard corners on calibration plane in camera image

• Project chessboard and detect corners

• Use ray-plane intersection to get 3D position of projected pattern’s corners

• Calibrate projector with 2D image points and 3D object points of the projected pattern

Camera calibration is a trivial problem. Easy to use functions exist in OpenCV that take in image points and object points to determine a 3 x 3 camera matrix. It is important to note that the image points are the coordinates of the detected chessboard corners in the camera's image, denoted in pixels. The object points are the physical corners of the chessboard in centimeters or some other real world units; for Zhang calibration with the chessboard pattern, these are also two dimensional. Several snapshots (at least four) of the calibration pattern are required to calibrate. To make life easier, OpenCV also contains a chessboard corner detection function.

Once the camera has been calibrated, the rest of the calibration procedure requires a known plane onto which to project. A large rigid poster has a small physical chessboard pattern attached to it, with enough room left to project onto. Using the calibrated camera and the corners detected on this panel, we can determine the panel's pose and normal vector. When the pose of this plane is known, we project a chessboard calibration pattern. As the panel is held at different positions, the projected pattern becomes warped and skewed; by detecting the projected points on the known plane, we can determine the projector's intrinsics.

The translation column of the camera's extrinsic matrix gives us the origin of the plane $p$, while the third column of the rotation matrix gives $n$, the normal vector of the surface of the plane carrying the chessboard pattern. The corresponding visual representation of the $K_{ext}$ matrix is shown in Figure 25.

$$K_{ext} = \begin{bmatrix} r_{11} & r_{12} & r_{13} & t_x \\ r_{21} & r_{22} & r_{23} & t_y \\ r_{31} & r_{32} & r_{33} & t_z \\ 0 & 0 & 0 & 1 \end{bmatrix}$$

Figure 25: The extrinsic parameters that are used to compute the plane equation.

Here $p$ is a vector giving the position of a point known to be on the plane, and $n$ is a nonzero normal vector to that plane. The desired plane is the set of all points $r$ such that:

$$n \cdot (r - p) = 0$$

with

$$n = a\hat{x} + b\hat{y} + c\hat{z}, \qquad r = x\hat{x} + y\hat{y} + z\hat{z}$$

where $\hat{x}, \hat{y}, \hat{z}$ are unit vectors. With $d$ defined as the dot product $d = -n \cdot p$, the plane equation becomes:

$$ax + by + cz + d = 0$$

After determining the plane equation, we need to create a representation of the 3D rays from the camera center going through the corners of the projected chessboard pattern, extracted with OpenCV's chessboard function. To get the ray through an image point relative to the image center $c_x, c_y$, a projective transformation is applied using the camera's intrinsics and extrinsics.

Figure 26: Ray plane intersection in its entirety. The objective is to determine the vectors V from the camera center to the projected calibration pattern.

$$\begin{bmatrix} sR_x \\ sR_y \\ sR_z \\ s \end{bmatrix} = [K_{int}\, K_{ext}]^{-1} \begin{bmatrix} c_x \\ c_y \\ 1 \end{bmatrix}$$

The vector $[R_x\ R_y\ R_z]$ that describes the ray is defined up to a scale factor $s$. This scale defines the length at which the ray intersects the 3D point on the image of the projected calibration pattern (points $P_i$ in the figure). To get the vectors $V$ in Figure 26, the intersection must be determined between the rays going through the image of the projected pattern and the physical plane where the pattern is actually being projected. We need to find the $s$ factors that satisfy the plane equation

ax + by + cz + d = 0

and substitute for the ray:

a(sRx) + b(sRy) + c(sRz) + d = 0

Since $a$, $b$, $c$, $d$ and $R_x$, $R_y$, $R_z$ are known, $s$ can be recovered. Once $s$ is determined, we have the 3D position of each corner of the projected calibration pattern in the object coordinate frame. Once this is done for several positions of the plane [8], we can calibrate the projector in the same way that the camera was calibrated.

The camera calibration function is used with the image points of the projected pattern (its actual pixel coordinates) and the object points, the ones determined with the ray-plane intersection process.
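The scale recovery at the heart of the ray-plane intersection step can be sketched as follows. This is an illustrative fragment under the assumption that the plane coefficients and ray direction are already known; it is not the original authors' code.

```cpp
// Given the plane ax + by + cz + d = 0 and a ray direction (Rx, Ry, Rz)
// through the camera center, recover the scale s and the 3D corner position.
struct Point3 { double x, y, z; };

Point3 intersectRayWithPlane(double a, double b, double c, double d,
                             double Rx, double Ry, double Rz)
{
    // a(sRx) + b(sRy) + c(sRz) + d = 0  =>  s = -d / (a*Rx + b*Ry + c*Rz)
    double s = -d / (a * Rx + b * Ry + c * Rz);

    Point3 p;
    p.x = s * Rx;
    p.y = s * Ry;
    p.z = s * Rz;
    return p;
}
```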

3.2 3D Sensor - Projector Calibration

The two methods outlined above, calibration using arbitrary planes and ray-plane intersection calibration, both relied on a camera to see for the projector and to determine the 3D points required to get the projector's intrinsic matrix. With the availability of the 3D depth sensor, we can replace the camera and make the process of calibrating the projector faster, easier and less demanding in terms of elaborate calibration pattern setups.

The RGBD sensor has only recently begun to be explored in the field. As can be seen from the wide array of applications and projects, it has the potential to become a tool almost as important as the camera.

As stated in a previous section, the RGBD sensor consists of a standard digital camera, an infrared camera and an infrared laser projector. The image sensors of the infrared camera and regular camera are physically separated by 3cm. This essentially is a two view stereo camera, with one being a standard camera image and the other a range image.

The general outline of the new calibration method is as follows:

• Project a calibration pattern on some surface

• Detect corners of projected calibration pattern in RGB camera

• Make depth sensor correspondence to corners in RGB image

• Use 3D object points determined to get projector matrix.

• Calibrate the 3D sensor's RGB camera to get the extrinsics between the 3D sensor and projector

Even though this method is fairly straightforward, it has not yet been implemented, and the contribution it makes to easing the calibration of projection mapping systems will enable many interesting results in the fields of spatial augmented reality and projection mapping.

There are, however, some complications in these steps. Since the depth sensor has both IR and RGB cameras separated by some distance, a correspondence must be made between the two in order to "look up" depths of RGB pixels. Another problem is determining the displacement (extrinsics) between the sensor and projector.

Then finally, when the intrinsics and extrinsics have been determined, the question becomes how to use them to accurately augment a scene with the projector.

In Zhang calibration for cameras, a planar chessboard pattern is used and it is not necessary to know depth values for the detected points, since the pattern is known to be planar: real unit values are used for the x, y positions of the corners and z is assumed to be 0 [20]. For accurate calculation of the projector intrinsics, the z coordinates of the chessboard corner 'object points' on the projected image must be known relative to the camera. The advantage of the RGBD sensor is that they can be found with the help of the sensor's depth image, where previously, in the other methods of projector calibration, complex calculations and estimation techniques were required.

The performance of this method will of course vary and improve with higher resolution and better quality sensors.

Since it is impossible to detect the printed or projected chessboard pattern in the range image, we use the RGBD sensor's camera. The chessboard is easy to detect in the regular image; once it is detected, the depth of each pixel could be 'looked up' in the range image. There is, however, a problem with that. Because the cameras are separated by a small distance, their images do not overlay and are taken from slightly different positions, so looking up a pixel $x_i, y_i$ from the regular image will land on a different object in the range image. In Figure 27, the actual range and camera images are added together to display the differences. Not only are the cameras separated by a distance, their intrinsics are also different, with the range camera having a smaller field of view and thus different focal characteristics.

Figure 27: The depth and RGB images on top of each other; it can be seen that they do not perfectly correspond.

3.2.1 Calibration of depth and image sensors

Before doing anything with the detected points, a correspondence must be made so that the depth and RGB images in the above figure align accurately. To do this we can use a calibration method similar to the one implemented in Dance and Engineering in Augmented Reality: both cameras see a calibration pattern, a homography is generated from the two views of it, and transforming one of the images by this homography aligns the depth and RGB images. The task is complicated by the fact that one of these cameras cannot see visible light, only the depths of solid objects from the reflected array of dots.

In order to make accurate correspondences in both RGB and range images and look up depths of RGB pixels, the images must overlay each other perfectly. To achieve this, corresponding points must be identified in both images. One way to make the chessboard calibration pattern visible in the range image is to use transparent plexiglass. In this case, a plexiglass pattern was constructed to calibrate the two images: solid squares that reflect infrared light were adhered to the sheet of plexiglass, making them appear as solid objects in the range image, while the 'white' squares were left clear so as not to reflect the RGBD sensor's infrared light.

This created a resemblance of a chessboard in the range image. The RGB camera would see the chessboard corners as normal, so a correspondence was made between the corners in the RGB and range images. Figure 28 shows the plexiglass calibration rig with chessboard corners detected.

Figure 28: Calibration [12] of depth and RGB cameras on the 3D sensor

However, the performance of this method was not as desired. Since the chessboard corner detection algorithm works by detecting points where the squares of the chessboard connect, it often lost the pattern due to noise in the depth image and the low resolution of the range sensor. Another factor that contributed to error was that when the glass became smudged, infrared light was reflected, producing erroneous data. To achieve more accurate results, a different approach to constructing a calibration rig for the range and image cameras was taken.

This calibration rig was physically three dimensional. It consists of a physical chessboard pattern constructed from foam board, with a solid white background set back some distance from the checkers. This pattern, pictured in Figure 29, worked flawlessly and is actually detected even better in the range image than in the RGB image.

Figure 29: Physical 3D chessboard calibration pattern used that can be seen in both RGB and range cameras.

Once the calibration pattern can be identified in both images (Figure 30), two methods were examined to make the images align: homography based alignment and stereo calibration.

Figure 30: Physical 3D chessboard calibration pattern with detected corners.

3.2.2 Homography Based Alignment for Two View Correspondence

The goal of this method is to produce a homography $H_{kinect}$, a projective transformation between the RGB and range images. When this homography is applied to one of the images, it skews and warps it to align with the other so that corresponding points overlay. As in the calibration method used in the previous work, Dance and Engineering in Augmented Reality, the homography is calculated between the detected corners in the depth image $I_d$ and the RGB image $I_{rgb}$.

$$I_d = H_{kinect} I_{rgb}$$

$$\begin{bmatrix} x_d \\ y_d \\ 1 \end{bmatrix} \sim \begin{bmatrix} h_{11} & h_{12} & h_{13} \\ h_{21} & h_{22} & h_{23} \\ h_{31} & h_{32} & h_{33} \end{bmatrix} \begin{bmatrix} x_{rgb} \\ y_{rgb} \\ 1 \end{bmatrix}$$

$$x_d = \frac{h_{11}x_{rgb} + h_{12}y_{rgb} + h_{13}}{h_{31}x_{rgb} + h_{32}y_{rgb} + h_{33}}, \qquad y_d = \frac{h_{21}x_{rgb} + h_{22}y_{rgb} + h_{23}}{h_{31}x_{rgb} + h_{32}y_{rgb} + h_{33}}$$

Below are examples of the resulting $H_{kinect}$ homographies for two trials.

Trial 1:
$$\begin{bmatrix} 1.09681 & -0.0170 & -50.2755 \\ -0.0039 & 1.1009 & -37.437 \\ 3.8396\mathrm{e}{-5} & -9.2104\mathrm{e}{-5} & 1 \end{bmatrix}$$

Trial 2:
$$\begin{bmatrix} 1.02605 & -0.00610 & -2.6444 \\ 0.0072 & 1.0168 & -3.3147 \\ 3.1977\mathrm{e}{-5} & 1.0884\mathrm{e}{-5} & 1 \end{bmatrix}$$

The chessboard detection step limits the speed of homography capture to slower than frame rate, about 0.6 seconds per frame, while the homography calculation itself takes negligible time in the C++ implementation. Because capture is quick, it is trivial to obtain a large set of homography data.

Once $H_{kinect}$ is known, this step does not need to be repeated, because the cameras are fixed relative to the specific RGBD device. The homography is saved for use in the next step.
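A minimal sketch of how such a homography can be obtained with OpenCV, assuming the chessboard corners have already been detected in both views; the wrapper function name is an illustrative assumption.

```cpp
#include <opencv2/opencv.hpp>
#include <vector>

// Estimate H_kinect from one snapshot of the 3D chessboard rig:
// it maps RGB pixel coordinates into the range image, I_d = H_kinect * I_rgb.
cv::Mat computeKinectHomography(const std::vector<cv::Point2f>& rgbCorners,
                                const std::vector<cv::Point2f>& depthCorners)
{
    return cv::findHomography(rgbCorners, depthCorners);
}

// Warping one image by the saved homography then aligns the two views, e.g.
//   cv::warpPerspective(rgbImage, alignedRgb, H, rgbImage.size());
```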

To overlay the images accurately, the x, y positions of all the pixels of one of the captured images are simply multiplied by the homography. This results in a warped image that corresponds to the range image. Using the homography from the trial marked '1' above, Figure 31 shows the detected corners before and after the multiplication by $H_{kinect}$. When the two images are matched and a correspondence is made, the result becomes usable for direct lookup of depth from the RGB image.

Figure 31: Mismatched chessboard corners are matched after multiplying the RGB image by the determined homography.

3.2.3 Stereo Calibration for Two View Correspondence

Theoretically, this method is more accurate than the previous, homography based method. Unlike the above, both the depth and RGB cameras must be calibrated and have known intrinsics. Stereo calibration allows us to determine the physical relationship between two cameras, specifically how corresponding points are situated in each camera's field of view. The goal is to determine the rotation and translation of each camera with respect to the other, as well as the essential and fundamental matrices describing the epipolar geometry of the camera pair.

For this, the same 3D calibration pattern was used. Calibrating a camera requires several poses of the calibration pattern to be captured, from which the camera intrinsics can be estimated. The same is done for both depth and RGB cameras, and the following intrinsic parameters were estimated.

   612.2128 0 283.4850        RGB camera matrix:    0 609.8435 262.0371       0 0 1 

   591.7439 0 301.5931        Depth camera matrix:    0 588.1731 247.1483       0 0 1 

fx fy cx cy

RGB 612.2128 609.8435 283.4850 262.4850

Depth 591.7439 588.1731 301.5931 247.1483

This is reasonable with both camera centers approximately at half the width and height of their respective images. Several calibration trials result in similar numbers, all within reasonable range. Following calibration of each camera individually, a stereo calibration is performed to determine the cameras’ rotation and translation relationship to each other.

Rotation matrix:
$$\begin{bmatrix} 0.98872 & 1.1545\mathrm{e}{-3} & -1.8120\mathrm{e}{-2} \\ -1.4197\mathrm{e}{-2} & 1.0048 & -1.2718\mathrm{e}{-2} \\ 1.7512\mathrm{e}{-2} & 1.3110\mathrm{e}{-2} & 0.9992 \end{bmatrix}$$

Translation vector:
$$\begin{bmatrix} 2.0863\mathrm{e}{-2} \\ -8.8119\mathrm{e}{-4} \\ -0.9871\mathrm{e}{-2} \end{bmatrix}$$

Again, these numbers are reasonable. The off-diagonal values in the rotation matrix are very small, showing that the cameras are oriented facing essentially the same direction, as they are on the sensor. The translation values, in meters, show that the cameras are aligned laterally and separated by about 2 cm, with their virtual camera centers separated by about 1 cm, explaining the difference in the field of view.

With the rotation and translation matrices known, we can then use the 'lookup' method, factoring in the transformations, to determine the RGB pixel location of each corresponding point in the range image, by a method similar to that proposed in [6].

$$P3D' = R \cdot P3D + T$$

$$P2D_{rgb}.x = (P3D'.x \cdot f_{x,rgb} / P3D'.z) + c_{x,rgb}$$

$$P2D_{rgb}.y = (P3D'.y \cdot f_{y,rgb} / P3D'.z) + c_{y,rgb}$$
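The lookup described by these equations can be sketched as follows, assuming R and T come from the stereo calibration above; the function and variable names are illustrative, not the thesis' actual code.

```cpp
#include <opencv2/opencv.hpp>

// Transform a 3D point from the depth camera frame into the RGB camera frame
// and project it with the RGB intrinsics, following [6].
cv::Point2f projectDepthPointToRgb(const cv::Point3f& P3D,
                                   const cv::Mat& R,   // 3x3, CV_64F
                                   const cv::Mat& T,   // 3x1, CV_64F
                                   double fx_rgb, double fy_rgb,
                                   double cx_rgb, double cy_rgb)
{
    cv::Mat p = (cv::Mat_<double>(3, 1) << P3D.x, P3D.y, P3D.z);
    cv::Mat q = R * p + T;                       // P3D' = R * P3D + T

    double x = q.at<double>(0), y = q.at<double>(1), z = q.at<double>(2);
    return cv::Point2f((float)(x * fx_rgb / z + cx_rgb),
                       (float)(y * fy_rgb / z + cy_rgb));
}
```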

3.2.4 Comparison Between Homography Based and Stereo Calibration for Two View Correspondence

Both methods are appropriate for aligning the two views. Stereo calibration is the more general approach and requires calibrated cameras with known intrinsics and extrinsics. This makes it more complex than the homography based approach, which only requires a single snapshot of the calibration pattern and no camera calibration.

Figure 32 shows the distribution of the chessboard's origin corner when no calibration is used. It is obvious that this is unusable, with a disparity of about 30 pixels between the centers of the distributions. The standard deviation of the distribution is larger for the depth image due to sensor noise, with the RGB distribution falling within one pixel of the distribution center. This deviation persists with both calibration methods because of its direct relationship to the nature of the image and depth sensors.

         std(x)    std(y)
depth    1.6229    1.4206
rgb      0.8959    0.3880

Calibrating with the homography, we see the same distributions of points, deviating less than 2 pixels in the depth image and less than 1 pixel in the RGB image from the distribution centers, as in the uncalibrated case. Figure 33 shows the distribution of pixels with the homography calibration method, and Figure 34 shows the stereo calibrated distribution of points.

Figure 32: When not calibrated, the position of each corner does not correspond, with a disparity of close to 30 pixels.

Figure 33: Plot of correspondence when calibrated with homography.

From both Figures 33 and 34 it can be seen that there is no significant accuracy improvement that puts either method at an advantage. The pixel deviations are within the same range as the uncalibrated case, and the distance between the centers of each distribution falls within an acceptable range for our applications.

Figure 34: Plot of correspondence when calibrated with stereo calibration.

                     Homography    Stereo Calibration    Uncalibrated
Center Disparity     0.8215        0.9783                27.9575
Deviation x rgb      0.8171        0.7492                0.8959
Deviation y rgb      0.6482        0.8841                0.3880
Deviation x depth    1.8901        2.0857                1.6229
Deviation y depth    1.6794        1.7301                1.4206

3.2.5 Extracting Depth Information

When this correspondence has been made with the homography, we can use the RGB data to look up the depth coordinates of each RGB pixel in the undistorted images. Using the depth camera intrinsics, each pixel $x_d, y_d$ is projected to 3D space with the following:

x3d = (xd − cxd) × depth(xd, yd)/fxd

y3d = (yd − cyd) × depth(xd, yd)/fyd

z3d = depth(x, y)

In reverse, we can project each 3D point to get its RGB counterpart.
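A small illustrative helper implementing the back-projection equations above might look like this; the names are assumptions rather than code from this project.

```cpp
#include <opencv2/opencv.hpp>

// Back-project a depth pixel (xd, yd) with depth in meters into a 3D point
// in the depth camera frame, using the depth camera intrinsics.
cv::Point3f depthPixelTo3D(int xd, int yd, float depthMeters,
                           double fx_d, double fy_d, double cx_d, double cy_d)
{
    float x3d = (float)((xd - cx_d) * depthMeters / fx_d);
    float y3d = (float)((yd - cy_d) * depthMeters / fy_d);
    return cv::Point3f(x3d, y3d, depthMeters);
}
```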

Once the cameras on the 3D sensor are calibrated we can move on to projector calibration.

3.2.6 Use 3D points for projector intrinsics

With the calibrated 3D sensor, we can use standard methods to detect a calibration pattern within the image. The difference from the ray-plane intersection method is that, since we know the depths of the pixels from the sensor, we do not need a physical calibration pattern, only a known projected pattern and a plane to project onto.

Once projected, the chessboard pattern is visible to the 3D sensor; we detect the chessboard pattern within the 3D sensor's RGB image and get a corresponding depth value for each corner. Thus we get full 3D object coordinates for each corner point, in ambiguous units.

To get real metric units for these points we use data that has been experimentally determined in [12]. The depth sensor values appear to be proportional to the inverse of the actual depth to an object. This relationship was obtained by measuring the distance of the center pixel in the depth image and regressing the data. The following equation transforms the ambiguous raw depth sensor units to meters:

$$d_m = \frac{1.0}{d_{raw} \cdot (-0.0030711016) + 3.3309495161}$$
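Expressed as code, this conversion is a one-line helper (coefficients taken from [12]); the function name is an assumption.

```cpp
// Convert a raw Kinect depth reading to meters using the regression from [12].
double rawDepthToMeters(int rawDepth)
{
    return 1.0 / (rawDepth * -0.0030711016 + 3.3309495161);
}
```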

Figure 35: The distance sensed in meters vs. ambiguous units. [12]

In order to determine usable object points, the depths are taken with respect to the camera capturing the points, and the x and y positions with respect to the origin of the pattern, i.e. the first corner. These values must also be in consistent physical units, in this case meters. With the depth of each point known in meters, the case in Figure 36 can be considered: here all points are projected from real space onto the image plane.

Figure 36: Determining metric distance for object points in the camera.

Here:

$$\lambda \begin{bmatrix} x \\ y \\ f \end{bmatrix} = \begin{bmatrix} X \\ Y \\ Z \end{bmatrix} \quad \text{where } \lambda = \frac{Z}{f}$$

A calibrated camera is needed to determine f, but we have already done that in a previous step. So with that as well as the depths and pixel values of each corner known, we can transform them to meters with the following equations.

$$X = \frac{Z \cdot x}{10 \cdot f_x}, \qquad Y = \frac{Z \cdot y}{10 \cdot f_y}$$

where

$$x = x_{image} - x_{origin\ of\ chessboard}, \qquad y = y_{image} - y_{origin\ of\ chessboard}$$

Knowing the real physical XYZ coordinates of each corner of the projected calibration pattern, we can now use them as object points in the projector calibration. The image points in this case are simply the x, y pixel coordinates of the projected pattern in the projector's image plane, which are just the raw values specified in the projected image. This is the reverse of standard camera calibration: when calibrating a camera, the image points are determined by detecting the chessboard in the camera view, whereas for the projector, which is the opposite of a camera, all image plane coordinates are known in advance.
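A hedged sketch of this projector calibration step, reusing cv::calibrateCamera with the metric 3D corner positions as object points and the known projected pixel coordinates as image points. The wrapper function, the example resolution and the variable names are assumptions.

```cpp
#include <opencv2/opencv.hpp>
#include <vector>

// The projector is treated as an inverse camera: its "image points" are the
// pixel coordinates of the projected pattern, and its "object points" are
// the metric 3D corner positions recovered from the RGBD sensor.
cv::Mat calibrateProjector(
    const std::vector<std::vector<cv::Point3f> >& objectPoints, // metric, per snapshot
    const std::vector<std::vector<cv::Point2f> >& imagePoints,  // projected pixels
    const cv::Size& projectorResolution)                        // e.g. 1024 x 768
{
    cv::Mat K, distCoeffs;
    std::vector<cv::Mat> rvecs, tvecs;
    cv::calibrateCamera(objectPoints, imagePoints, projectorResolution,
                        K, distCoeffs, rvecs, tvecs);
    return K;   // projector intrinsic matrix
}
```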

3.2.7 Determining Projector Intrinsics

With the set of metric 3D object points and the pixel coordinates of the projected calibration pattern, we can then use standard camera calibration methods, as in ray-plane intersection [8], to determine the projector's intrinsic parameters. Object points are determined by identifying the projected chessboard pattern on a surface in real space.

The metric coordinates of each chessboard corner are determined in the previous step; their depths are with respect to the camera, while their horizontal and vertical distances are with respect to the origin of the calibration pattern.

As with camera calibration, we need several snapshots of the projected calibration pattern from different angles. Again, at least four are required, but more snapshots will improve the estimation. The projector's camera matrix provides the transformation between image points (the pixels of the corners in the projected image) and rays in Euclidean 3D space. With this we are able to get the directions of the projected rays, and these rays define the perspective of objects with respect to the projector, as in Figure 37.

Figure 37: The projector's camera matrix provides information that allows us to understand the perspective of the projected image.

Given $n$ corners with correspondences $x_i \leftrightarrow X_i$, where $x_i$ is an image point and $X_i$ a corresponding 3D scene point, we can compute a projection matrix

$$P = K[R|t]$$

such that $x_i = PX_i$.

Each point correspondence will give two equations:

$$x_i = \frac{p_{11}X_i + p_{12}Y_i + p_{13}Z_i + p_{14}}{p_{31}X_i + p_{32}Y_i + p_{33}Z_i + p_{34}} \qquad \text{and} \qquad y_i = \frac{p_{21}X_i + p_{22}Y_i + p_{23}Z_i + p_{24}}{p_{31}X_i + p_{32}Y_i + p_{33}Z_i + p_{34}}$$

When multiplied out, it can be seen that linear equations are produced:

$$x_i(p_{31}X_i + p_{32}Y_i + p_{33}Z_i + p_{34}) = p_{11}X_i + p_{12}Y_i + p_{13}Z_i + p_{14}$$

$$y_i(p_{31}X_i + p_{32}Y_i + p_{33}Z_i + p_{34}) = p_{21}X_i + p_{22}Y_i + p_{23}Z_i + p_{24}$$

Concatenating the equations from $n \geq 6$ corners generates $2n$ simultaneous equations that can be written as $Ap = 0$, where $A$ is a $2n \times 12$ matrix. This will not have an exact solution, but a linear solution that minimizes $\|Ap\|$ is obtained from the vector corresponding to the smallest singular value of the singular value decomposition of $A$. This solution for a single snapshot is then used as the starting point of a non-linear minimization, over multiple snapshots, of the difference between the measured and projected points:

$$\min_{P} \sum_i^{n} \left\| (x_i, y_i) - P(X_i, Y_i, Z_i, 1) \right\|^2$$
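The linear (DLT) part of this estimation can be sketched as follows; cv::SVD::solveZ returns the unit vector that minimizes |Ap|. This is an illustrative fragment, not the thesis' implementation, and the helper name is an assumption.

```cpp
#include <opencv2/opencv.hpp>
#include <vector>

// Build the 2n x 12 matrix A (two rows per correspondence) and take the
// right singular vector of the smallest singular value as p.
cv::Mat estimateProjectionDLT(const std::vector<cv::Point3f>& X,  // scene points
                              const std::vector<cv::Point2f>& x)  // image points
{
    const int n = (int)X.size();                // needs n >= 6
    cv::Mat A = cv::Mat::zeros(2 * n, 12, CV_64F);

    for (int i = 0; i < n; ++i) {
        double Xi = X[i].x, Yi = X[i].y, Zi = X[i].z;
        double xi = x[i].x, yi = x[i].y;
        double row1[12] = { Xi, Yi, Zi, 1, 0, 0, 0, 0, -xi*Xi, -xi*Yi, -xi*Zi, -xi };
        double row2[12] = { 0, 0, 0, 0, Xi, Yi, Zi, 1, -yi*Xi, -yi*Yi, -yi*Zi, -yi };
        for (int j = 0; j < 12; ++j) {
            A.at<double>(2 * i,     j) = row1[j];
            A.at<double>(2 * i + 1, j) = row2[j];
        }
    }

    cv::Mat p;                                  // 12x1 unit vector minimizing |Ap|
    cv::SVD::solveZ(A, p);
    return p.reshape(1, 3);                     // 3x4 projection matrix P
}
```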

This projection matrix $P$ consists of the projector's intrinsics and extrinsics $[R|t]$. In order to extract the camera matrix $K$, the first $3 \times 3$ submatrix, which is a product of $K$ and $R$, is decomposed with QR decomposition. The translation vector $t$ is then defined as:

$$t = K^{-1}(p_{14}, p_{24}, p_{34})^T$$

The $[R|t]$ in this case is not important and can be discarded; it describes the relationship of the calibration pattern to the projector. Later we will determine the $[R|t]$ of the 3D sensor with respect to the projector, which is what matters.

The resulting intrinsic matrix $K$:

$$K = \begin{bmatrix} f_x & s & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix}$$

Here $s$ is a skew parameter describing the angle $\theta$ between the x and y axes, $s = \tan(\theta)$, and it should be very small.

To calibrate the projector, a calibration pattern of known pixel size is projected full screen in the projector's window. For this work we use a 6 × 8 calibration pattern with 64 pixel squares, so the image points are all known. The RGBD sensor is placed close to the projector's lens. When the chessboard pattern is detected in the image, the XYZ positions of the points are determined with the methods mentioned above, and after several snapshots ($k \geq 4$) the projector's camera matrix is determined. Below are the projection matrices for three trials, along with the views they produce:

   1163.121 0 396.337        K =   1  0 1079.167 269.162       0 0 1 

Figure 38: Scene generated with intrinsic matrix K1

   933.280 0 384.642        K =   2  0 1048.575 325.577       0 0 1 

Figure 39: Scene generated with intrinsic matrix K2

   1002.738 0 568.881        K =   3  0 925.723 373.264       0 0 1 

Figure 40: Scene generated with intrinsic matrix K3

Just as an example, the following Figure 41 shows the scene without the intrinsic matrix of the projector.

Now that we can properly generate the 3D scene with correct perspective, the next task is to determine the displacement relation between the projector and camera.

Figure 41: Scene generated without the use of intrinsic matrix.

3.2.8 Find R and t between sensor and Projector, Extrinsics

Points in space will be expressed in terms of a different coordinate frame, the world coordinate frame. Two coordinate frames are related via a rotation R and translation t.

The rotation and translation are given by:

$$[R|t] = \begin{bmatrix} r_{11} & r_{12} & r_{13} & t_1 \\ r_{21} & r_{22} & r_{23} & t_2 \\ r_{31} & r_{32} & r_{33} & t_3 \end{bmatrix}$$

[R|t] are external or extrinsic camera parameters and are not affected by the focal characteristics of the camera.

The fundamental matrix $F$ encapsulates the projective geometry between two views. It is independent of scene structure and depends only on the cameras' internal parameters and relative pose.

Figure 42: $H_\pi$ is a planar homography and $e'$ is the epipole; together they give the fundamental matrix $F = [e']_\times H_\pi$.

It is a $3 \times 3$ matrix of rank 2. If a point $X$ in 3-space is seen as $x$ in the first view and as $x'$ in the second, the image points must satisfy the relation:

$$x'^T F x = 0$$

It can be computed from image point correspondences alone, without knowledge of the camera intrinsics.

OpenCV has a built in function, cvFindFundamentalMat, that accepts two sets of image points as cvMat's and returns the fundamental matrix F.
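In the C++ API the equivalent call is cv::findFundamentalMat; the small wrapper below is an illustrative sketch, not code from this project.

```cpp
#include <opencv2/opencv.hpp>
#include <vector>

// Estimate the fundamental matrix from matched corner sets using RANSAC
// to reject bad correspondences.
cv::Mat fundamentalFromCorners(const std::vector<cv::Point2f>& pts1,
                               const std::vector<cv::Point2f>& pts2)
{
    return cv::findFundamentalMat(pts1, pts2, cv::FM_RANSAC);
}
```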

The Essential Matrix E is a specialized form of the fundamental matrix. The fundamental matrix can be considered a generalized case of it where cameras are not assumed to be calibrated.

$$\tilde{x}'^T E \tilde{x} = 0$$

where $\tilde{x}'$ and $\tilde{x}$ are homogeneous normalized image coordinates in the two image frames, respectively.

As a consequence of using normalized cameras, their respective coordinate systems are related by the rotation and translation mentioned above. The two sets of 3D coordinates are related by:

$$\tilde{x}' = R(\tilde{x} - t)$$

where $R$ is a $3 \times 3$ rotation matrix and $t$ is a 3-dimensional translation vector. It is then possible to define the essential matrix $E$ as:

$$E = R[t]_\times$$

where $[t]_\times$ is the matrix representation of the cross product with $t$.

The method of determining R and t from E is based on performing singular value decomposition of E.

In order to properly position the 3D scene in the projector's window, we need to shift the scene by the rotation and translation between the sensor and projector. To determine this, as above, we need to compare the two image planes. Since the image plane of the projector is simply what it is projecting, this can be done with the 'solvePnP' algorithm. This gives us a usable rotation measure; stereo calibration still has to be used to get the translation between the two lenses.
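A hedged sketch of that step with OpenCV's solvePnP; the assumption of zero distortion and the function name are ours.

```cpp
#include <opencv2/opencv.hpp>
#include <vector>

// The 3D corner positions seen by the sensor and their pixel coordinates in
// the projected image give the pose of the pattern relative to the projector.
void poseFromProjectedPattern(const std::vector<cv::Point3f>& objectPoints,
                              const std::vector<cv::Point2f>& projectedPixels,
                              const cv::Mat& projectorK,
                              cv::Mat& rvec, cv::Mat& tvec)
{
    cv::Mat distCoeffs = cv::Mat::zeros(5, 1, CV_64F);  // assume no distortion
    cv::solvePnP(objectPoints, projectedPixels, projectorK, distCoeffs,
                 rvec, tvec);
    // rvec is a Rodrigues rotation vector; cv::Rodrigues(rvec, R) yields the
    // 3x3 rotation matrix if needed.
}
```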

To project overlays onto real world objects, it is desirable to position the sensor as close to the projector's lens as possible; this way the $[R|t]$ values will be small and the scene will not have to be shifted much. The effect is that the scene is shifted to display as if it had been captured from the projector rather than from the sensor. Below are several of the determined rotation and translation vectors. For easier interpretation (and also for OpenGL) we use degrees rather than radians for rotation, and meters in the translation vectors. We use vectors for simplicity as well, since the corresponding full matrices contain additional parameters which are assumed negligible for the close proximity case of video projectors.

     −1.028   0.019              R =   T =    177.50   0.127           0.206   −0.141 

     −1.055   0.022              R =   T =    177.640   0.119           0.285   −0.130 

These rotations and translations depend on the actual placement of the sensor next to the projector, so they differ between trials, but they stay consistent as long as neither device is moved. Figure 43 shows the effect rotation and translation have on the scene display.

Finally, with the correct projection matrix implemented as well as correct rotation and translation the virtual scene and real scene are shown in Figure 44.

76 Figure 43: The scene with default R and T, and on the right, the scene with R and T input.

Figure 44: On the left, the virtual 3D scene that is modified with rotation, translation and projection matrix, on the right, the scene overlaid by projection.

4 Implementation

4.1 Displaying the Augmented Scene

Once the projector's intrinsic matrix and extrinsics $[R|t]$ have been determined, we must figure out how to properly display the scene so that the overlay aligns with the actual objects.

The point cloud captured with the 3D sensor exists in a 3D environment from the point of view of the sensor. When displayed, we have various options for how to view the scene in terms of perspective: the scene can be rotated, skewed, stretched and scaled. The projector calibration gives us all we need to properly transform this cloud for overlay. In essence, we have to rotate the scene so that it appears to be viewed from the projector instead of from the sensor, and then scale and stretch it to mimic the projector's focal parameters.

For these operations, OpenGL provides a perfect platform for display. Here it is easy to perform rotation and translation operations as well as provide a parameterized camera frustum through which to view the scene. When the projector's intrinsics are mimicked by the projection matrix of the frustum, the scene will be accurately projected out.

The OpenGL transformation pipeline can be thought of as the following analogy of a camera viewing a scene.

Figure 45: The OpenGL viewing pipeline

Finally, keep in mind that the viewing transformation commands must be called before any modeling transformations are performed, so that the modeling transformations take effect on the objects first. We have to construct an OpenGL projection matrix from the determined intrinsic parameters.
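One common way to build such a matrix from fx, fy, cx, cy is sketched below. The sign of the principal-point terms depends on how the image axes are mapped to OpenGL clip space, so this is one convention among several, not the thesis' exact code.

```cpp
// Fill a column-major OpenGL projection matrix (for glLoadMatrixd) that
// mimics a pinhole projector with the given intrinsics and image size.
void buildGlProjection(double fx, double fy, double cx, double cy,
                       int width, int height,
                       double zNear, double zFar,
                       double m[16])
{
    for (int i = 0; i < 16; ++i) m[i] = 0.0;

    m[0]  =  2.0 * fx / width;                  // scale x by the focal length
    m[5]  =  2.0 * fy / height;                 // scale y by the focal length
    m[8]  =  1.0 - 2.0 * cx / width;            // principal point offset (x)
    m[9]  =  2.0 * cy / height - 1.0;           // principal point offset (y)
    m[10] = -(zFar + zNear) / (zFar - zNear);   // depth range mapping
    m[11] = -1.0;                               // perspective divide by -z
    m[14] = -2.0 * zFar * zNear / (zFar - zNear);
}

// Usage: glMatrixMode(GL_PROJECTION); glLoadMatrixd(m);
```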


4.2 Evaluating Projection Performance

The main metric for evaluating performance will be reprojection error. This metric has been used as a standard in camera calibration and now in projector calibration systems.

The reprojection error can be considered a geometric distance between a projected point and a measured point. We seek a homography $\hat{H}$ and pairs of matched points $\hat{x}_i$ and $\hat{x}'_i$ that minimize the reprojection error function:

$$\sum_i^{n} d(x_i, \hat{x}_i)^2 + d(x'_i, \hat{x}'_i)^2 \quad \text{where } \hat{x}'_i = \hat{H}\hat{x}_i$$

We will need to determine correspondences between $\hat{x}_i$ and $\hat{x}'_i$ [11]. From this we get the algebraic error vector associated with the point correspondence and the camera mapping. $d(x_i, \hat{x}_i)$ and $d(x'_i, \hat{x}'_i)$ are the geometric distances between corresponding points. The metric is used to quantify how well the estimate of a 3-space point $\hat{X}$ matches its true projection $x$.

Another metric that can be used to compare the different calibration methods is the number of feature points used in the calibration pattern, as well as the types of patterns that produce the most accurate result. In standard Zhang calibration, the chessboard pattern is a 6 x 8 grid, 48 points in total. It may be that the different methods respond differently as the number of points is increased or decreased.

It can already be seen that the method is easier and more practical than traditional calibration methods. As for accuracy, there are many potential sources of error that propagate through to the main metric for evaluating the performance of this projector calibration method, re-projection error. In the first step, the calibration of the depth and image sensors can contribute error, and the resolution of the depth image with the Microsoft Kinect device is currently just 640 x 480. Reasonably practical results can still be expected for implementation.

Figure 46: The reprojection was evaluated by projecting a virtual chessboard pattern over a real one.

In order to evaluate the determined projector intrinsics as well as the relationship between the sensor and projector, the approach taken was to project the calibration pattern on top of itself. First a physical chessboard pattern is placed in the scene; then software that implements the methods in this thesis recognizes it and attempts to re-project points over the detected pattern's corners. After the projection is made, both are seen by a camera which measures the error between the projected corners and the physical corners. The setup is shown in Figure 46.

The following plot in Figure 47 illustrates the results of this test with over 1000 data points.

Figure 47: This plot compares the reprojection error between ray plane intersection and the proposed novel RGBD sensor method.

The distributions of the data sets were much more uniform with ray plane intersection, as shown in Figure 48.

Further comparisons of the data are shown in the table below:

Figure 48: This plot compares the distributions of the pixel locations determined with both projector calibration methods.

                    Ray/Plane    RGBD Sensor
Center Disparity    0.4673       2.8232
Deviation x         0.6145       0.8676
Deviation y         0.6512       1.0143

5 Discussion and Conclusions

Of the three methods described (calibration using arbitrary planes, ray plane intersection and the RGBD sensor method), two were implemented (ray plane intersection and the RGBD sensor method) for comparison. The unimplemented method required a higher resolution camera than that in the RGBD sensor (640 × 480) to produce conclusive results.

5.1 Performance

The ray plane intersection method and RGBD sensor method both produced reasonably accurate overlays, although ray plane intersection had more consistent results. At times, calibration with the RGBD sensor resulted in completely erroneous projection matrices, which were discarded by inspecting the intrinsic parameters. For the projector used, the focal lengths in both the x and y axes were within 30 pixels of 1050, not counting the times the RGBD sensor produced focal lengths orders of magnitude larger or smaller. The principal point cx, cy was close to the projector's center of 512, 384 with both examined methods. The similarity of the results can be attributed to the very low resolution of the RGB camera on the sensor, so any improvement would fall within the bounds of noise.

As for the feasibility of the novel RGBD sensor method, apart from sparse errors, it is a robust and quick method of projector calibration. The results are of reasonable accuracy, and no special physical calibration rigs are required apart from when calibrating the range and RGB cameras on the sensor itself.

When reprojection error was evaluated, the ray plane intersection method did produce better results: the distribution had a smaller deviation and the mean was much closer to the actual location of the physical surface projected onto. The RGBD sensor had a consistent error of about 2.5 pixels, but this can be attributed to errors propagated through the other calibrations along the way. In order to get usable sensor data, the RGB camera must be calibrated with the depth camera, which can introduce additional error affecting the final result. With ray plane intersection there are fewer initial calibrations to do, so fewer error sources are present.

In comparing the homography based and stereo calibration methods of making a correspondence between the range and RGB images, both produced good results with one pixel accuracy. Further experiments showed that the homography method began to falter when evaluated at different depths. These depths, however, were outside of the projector's focus range, so even though these errors exist, they do not affect projector calibration in a practical setting. The stereo calibration did not suffer from these errors, so if a high quality projector with a large focus range is used, this method could fare slightly better at producing accurate projector intrinsics.

5.2 Application

The main advantage of the RGBD sensor calibration method is its ease of implementation. Even though the methods produce slightly different results and one cannot be definitively picked over the other, this method is so easy to deploy that it becomes much more practical in actual application. All that is needed is to project an image, and the calibration occurs automatically. With the other method, a physically marked surface must be moved through the scene at different angles in order to capture a usable intrinsic matrix for the projector. The RGBD sensor calibration method proves to be as practical as the homography calibration in Spatial Augmented Reality displays. Using this method we have been able to correctly overlay spaces with projection, as exemplified in Figure 49.

Figure 49: A 3D scene overlay is demonstrated. Accurate overlay persists at all depths.

The image overlaying the space is just the depth information captured by the RGBD sensor, projected on top of the corresponding objects in the scene. Using a camera we can determine how well the projection matches the corresponding objects. Applications can change what is being projected onto the scene and create visually impressive AR displays which change the way the space is perceived.

5.3 Future Work

There is a lot of potential for further research in the area of calibration of projector-camera systems; this thesis has covered only a small number of potential cases. Work directly related to this content can be extended with the evaluation of a wider variety of sensors and projectors with many resolution combinations. In order to get usable results from calibration with arbitrary planes, a higher resolution camera should be used. It would also be very useful to develop software with easily variable projection matrix parameters to display the resulting scene; the current method of doing so is not efficient and requires some manual code manipulation. The current implementation of the calibration is done in C++ with the open source libraries OpenCV and libfreenect. OpenCV is available for C as well as Python, while libfreenect is just a C library. It would be useful to create a MATLAB script with the help of the camera calibration toolbox; MATLAB would allow for easier and more powerful evaluation of the results as well as encourage academic interest in the area. One other method we would like to explore in the future is calibration with structured light, which has been widely used in the calibration of projector-camera systems and in spatial augmented reality with projection mapping.

5.4 Conclusion

SAR creates a rich medium for new immersive ways to present information, involve the audience and stimulate collaboration and creativity. There has already been interest in SAR for creative art projects. Since SAR requires no portable hardware, permanent installations can be constructed with significantly higher sensing and processing capability. However, the complexity of SAR is increased since tracking and registration must happen not only in the camera view; this view must also be projected out to properly overlay real world objects.

Automatic calibration methods are a much needed enhancement for technology-driven art forms closely related to augmented reality, which have become more and more popular in the past years. In 3D projection mapping, as in SAR, a projector is used to overlay visual effects onto a real world object or scene. This use of image projection has created much excitement within the art community. An example of the use of this technology is projection on buildings to make them look like dynamic structures, enhancing individual walls or creating illusions of nonexistent light sources or warp effects. In order to overcome the distortion that occurs when an image is projected onto an arbitrary surface, the projector must be aligned manually and the image pre-distorted in such a way that the projected image matches the view seen from the projector's position [3]. This can be a painstaking process. Existing camera based automatic calibration methods, as discussed in this thesis, make this procedure easier; however, even though accurate and robust, these methods require elaborate calibration setups. For practical implementation, especially for artists and creative people outside the STEM fields, the most important factor is for things to "just work". With the proposed RGBD sensor - projector calibration, no physical calibration setup is required; only a projected pattern has to be used, allowing users to deploy projection mapping SAR systems in virtually any space.

As a byproduct of this project, it is important to consider the impact such applications can have on STEM education. We face the critical challenge of stimulating student interest in the science, technology, engineering and mathematics disciplines. Often science and engineering discoveries, while extraordinary and necessary, are distant and disconnected from the lay person. With research that is motivated by creativity, we hope to enhance the approachability of the STEM disciplines and not only generate general student interest in STEM but also engage students in interactive and dynamic learning environments. Today's students live in a technologically different society. Their lives are immersed in social media, the interactive environments presented by gaming, and the constant connectivity allowed by mobile devices and computing. A sixth sense of the presence of this digital world is almost created as they integrate with this uninterrupted sensory input. In a way, their reality is already augmented.

The systems they are so integrated in have two main components: the creative application front ends that allow for user interaction and engage the user, and the technologically complex STEM back ends that make the system really work. Younger people are almost always consumers in this new environment. Even though they easily adapt to this new world's components, they are rarely driven to contribute or produce these technology components. They are instead drawn to create content or media similar to that which they are exposed to. Young people want to become writers, artists or musicians. SAR creates an obvious connection between creativity and technology.

When a creative driving force is placed behind STEM concepts, students can become more inclined to explore areas of science and mathematics on their own. In the past few years, artists have explored new technology and integrated it into their works. As they strive for a tangible goal, they are motivated to learn these concepts during the creative process. Without formal education, people in the fine arts can become experts in mathematics. Not only does this thesis present a novel method for calibrating SAR systems; with it we also aim to harness art and creative motivation and use them to create a better educational experience in science, mathematics and engineering.

References

[1] ISMAR 2010 call for papers, January 2010, http://www.ismar10.org/index.php/Call for Papers/Participation.

[2] Design Korea 2010, December 2010, http://www.kimchiandchips.com/link.php.

[3] How to project on 3D geometry [online], March 2010, VVVV.org, http://vvvv.org/documentation/how-to-project-on-3d-geometry.

[4] The Point Cloud Library documentation, March 2011, http://pointclouds.org/about.html.

[5] Gary Bradski and Adrian Kaehler. Learning OpenCV. O'Reilly Media Inc., 2008.

[6] Nicolas Burrus. Kinect calibration [online], June 2011, http://nicolas.burrus.name/index.php/Research/KinectCalibration.

[7] S. Arulampalam et al. A tutorial on particle filters for on-line non-Gaussian Bayesian tracking. In IEEE Trans. on Signal Processing, pages 50-52, 2002.

[8] Gabriel Falcao, Natalia Hurtos, and Joan Massich. Plane-based calibration of a projector-camera system. VIBOT, 2008.

[9] Melissa Gibb. New Media, Art, Design, and the Arduino Microcontroller: A Malleable Tool. Master's thesis, Pratt Institute, New York, NY, 2010.

[10] David Grossman. Blob extraction library documentation, January 2006, http://opencv.willowgarage.com/wiki/cvBlobsLib.

[11] R. I. Hartley and A. Zisserman. Multiple View Geometry in Computer Vision. Cambridge University Press, ISBN: 0521540518, second edition, 2004.

[12] Ivan Dryanovski, William Morris, and Stéphane Magnenat. Automatic Calibration Tests. Willow Garage, 2011.

[13] Georg Klein. Visual Tracking for Augmented Reality. PhD thesis, University of Cambridge, 2006.

[14] Man Chuen Leung, Kai Ki Lee, Kin Hong Wong, and M.M.Y. Chang. A projector-based movable hand-held display system. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 1109-1114, 2009.

[15] Makoto Kimura, Masaaki Mochimaru, and Takeo Kanade. Projector calibration using arbitrary planes and calibrated camera. In IEEE Trans. on Signal Processing, 2007.

[16] Michael Haller, Mark Billinghurst, and Bruce H. Thomas. Emerging Technologies of Augmented Reality: Interfaces and Design. Idea Group Publishing, 2007.

[17] Oliver Bimber and Ramesh Raskar. Spatial Augmented Reality: Merging Real and Virtual Worlds. A K Peters, 2005.

[18] Soon-Yong Park and Go Gwang Park. Active calibration of camera-projector systems based on planar homography. In 2010 International Conference on Pattern Recognition, pages 320-323, 2010.

[19] Jenna Wortham. With Kinect controller, hackers take liberties. New York Times, Nov. 21, 2010.

[20] Zhengyou Zhang. A flexible new technique for camera calibration. In IEEE Transactions on Pattern Analysis and Machine Intelligence, December 2000.