COMPARISON OF OPEN SOURCE STEREO VISION ALGORITHMS

by

CHOUSTOULAKIS EMMANOUIL

Engineer of Applied Informatics and Multimedia

A THESIS

submitted in partial fulfillment of the requirements for the degree

MASTER OF SCIENCE

DEPARTMENT OF INFORMATICS ENGINEERING

SCHOOL OF APPLIED TECHNOLOGY

TECHNOLOGICAL EDUCATIONAL INSTITUTE OF CRETE

2015

Approved by:

Assistant Professor Kosmopoulos Dimitrios


Abstract

Stereo vision is the extraction of 3D information from a pair of images depicting the same scene viewed from different angles. It happens in nature in creatures that possess two eyes. It is also a very active research field, in which the pair of images is digital and, instead of eyes, is obtained by cameras. In order to achieve that, there are several methods and algorithms. In this Master Thesis there is a theoretical and experimental comparison of a few of them that are open source and can be found implemented online.


Acknowledgements

I would like to thank Professor Dimitris Kosmopoulos for his assistance in the completion of this thesis. I would also like to thank my family for their patience and support during my studies for this MSc degree. Finally, special thanks to the examining committee and to all fellow stereo vision researchers and engineers whose work contributed to the completion of this work.


Table of Contents

Abstract
Acknowledgements
Table of Figures
Chapter 1: Introduction and Goals
Chapter 2: Stereo Vision Basics
  2.1: Overview
  2.2: History of Stereo Vision
  2.3: Pinhole Camera Model
  2.4: Camera Resectioning/Parameters
  2.5: Epipolar geometry
  2.6: Fundamental matrix
  2.7: Image rectification
  2.8: Disparity map
  2.9: Stereo Matching
  2.10: Matching Cost
  2.11: Normalized Cross Correlation
  2.12: Ground truth
  2.13: Applications of Stereo Vision
  2.14: Challenges and difficulties
Chapter 3: Stereo Algorithms Evaluation Process
  3.1: Previous Work
  3.2: State-of-the-Art Middlebury Evaluation
  3.3: Thesis Evaluation Process
Chapter 4: Testing and Comparison of Stereo Algorithms
  4.1: Intro
  4.2: Common Algorithm Parameters
  4.3: Semi Global (Block) Matching
    4.3.1: Algorithm Overview
    4.3.2: Pixelwise Cost Calculation
    4.3.3: Aggregation of Costs
    4.3.4: Disparity Computation
    4.3.5: Implementation Details
    4.3.6: Testing and Results
  4.4: Block matching
    4.4.1: Algorithm overview
    4.4.2: Algorithm Analysis
    4.4.3: Implementation Details
    4.4.4: Testing and Results
  4.5: Loopy belief propagation
    4.5.1: Overview
    4.5.2: Markov Random Fields
    4.5.3: MRF Formulation
    4.5.4: DataCost
    4.5.5: SmoothnessCost
    4.5.6: Loopy Belief Propagation main part
    4.5.7: Implementation Details
    4.5.8: Testing and Results
  4.6: Fast stereo matching and disparity estimation
    4.6.1: Overview
    4.6.2: Algorithm Analysis
    4.6.3: Implementation Details
    4.6.4: Testing and Results
  4.7: Probability-Based Rendering for View Synthesis
    4.7.1: Algorithm Overview
    4.7.2: SSMP with RWR
    4.7.3: PBR with SSMP
    4.7.4: Implementation details
    4.7.5: Testing and Results
  4.8: Results Analysis and Conclusion
Chapter 5: Discussion and future work
References


Table of Figures

Figure 1: Simple stereo vision illustration
Figure 2: Wheatstone's Stereoscope
Figure 3: Brewster's Stereoscope
Figure 4: A typical ViewMaster device
Figure 5: Nintendo's Virtual Boy
Figure 6: Pinhole Camera Model
Figure 7: Epipolar Geometry Illustration
Figure 8: A groundtruth disparity map
Figure 9: Simple window matching illustration
Figure 10: Nintendo's 3DS
Figure 11: The NASA STEREO Project
Figure 12: Cones, teddy, tsukuba and venus left view
Figure 13: SGBM matching costs aggregation
Figure 14: SGBM Results
Figure 15: Block Matching Algorithm for SVM
Figure 16: Block Matching Sample Images
Figure 17: Block Matching Results
Figure 18: Markov Random Field illustration
Figure 19: Various cost functions
Figure 20: LBP message passing
Figure 21: Loopy Belief Propagation Results
Figure 22: Fast stereo Matching and Disparity Estimation
Figure 23: Fast stereo Matching and Disparity Estimation Results
Figure 24: SSMP Results
Figure 25: Percentage of bad matching pixels (lower value is better)
Figure 26: Current Virtual Reality Devices
Figure 27: Stereoscopic Endoscope


Chapter 1: Introduction and Goals

Stereoscopic vision (called stereopsis in nature) is the extraction of 3D information about a scene from a pair of images depicting different views of that scene. In nature the process occurs in the brain, which combines the images received from the two eyes.

Computer stereo vision is the extraction of 3D information from a pair of digital

images usually obtained by 2 CCD cameras. This is made possible using various

techniques and algorithms. Over the years many algorithms have been proposed to combine the 2 images into a disparity map, and their performance is measured by accuracy and speed.

The primary goal of this work is to illustrate a simple process of comparing methods

used for stereo matching. The secondary goal is to compare experimentally the

relevant algorithms that can be found implemented on various sources online by using

the aforementioned process and some of the state of the art datasets. The steps

followed in this work can be summarized here:

1. Extensive study of computer stereo vision resources and literature to gain as

thorough understanding as possible of all related terms and methodology.

2. Study any previous work related to the topic of this master thesis and determine the

state of the art.

3. Define the process for comparison, choosing a simple and comprehensible one.

4. Research online for implemented stereo algorithms and methods. Make any

changes to the code for optimal run, without affecting the core algorithm.


5. Test the said methods using all 4 stereo sets of the Middlebury platform and select the ones giving usable results.

6. Make the comparison; document, study and analyze the results.

7. Write the thesis document, including terminology, process description, algorithm and result analysis.

Any reference to existing work is of course acknowledged and documented here. Also, the results displayed here were produced by executing the relevant code with the optimal parameters for each method; none were taken from online sources and used as is.


Chapter 2: Stereo Vision Basics

2.1: Overview

In traditional stereo vision [1], two cameras, displaced horizontally from one another, are used to obtain two differing views of a scene, in a manner similar to human binocular vision. By comparing these two images, relative depth information can be obtained in the form of disparities, which are inversely proportional to the differences in distance to the objects. To compare the images, the two views must be superimposed in a stereoscopic device, with the image from the right camera shown to the observer's right eye and the image from the left camera shown to the left eye.

Figure 1: Simple stereo vision illustration

In real camera systems, however, several pre-processing steps are required.

1. The images must first have distortions, such as barrel distortion, removed to ensure that the observed image is purely projectional.

2. The image must be projected back to a common plane to allow comparison of

the image pairs, known as image rectification.

3. An information measure which compares the two images is minimized. This

gives the best estimate of the position of features in the two images, and

creates a disparity map.


4. Optionally, the disparity as observed by the common projection is converted back to a height map by inversion. Using the correct proportionality constant, the height map can be calibrated to provide exact distances.

2.2: History of Stereo Vision

In this chapter, the most important moments in the history [2] of stereoscopic vision are described.

Stereopsis was first explained by Charles Wheatstone in 1838. "...the mind perceives an object of three dimensions by means of the two dissimilar pictures projected by it on the two retinae..." was his exact definition. He recognized that each eye views the world from a slightly different horizontal position, and as a result each eye sees a slightly different image.

Also, objects at different distances appear at different horizontal positions to each eye (horizontal disparity), leading to the concept of depth. Wheatstone created the illusion of depth from flat pictures that differed only in horizontal disparity. To display his pictures separately to the two eyes, Wheatstone invented the stereoscope.

Figure 2: Wheatstone's Stereoscope


Although Wheatstone was the first man to explain and showcase stereoscopic vision, he was not the first to notice it and try to understand it. Leonardo da Vinci had also realized that objects at different distances project images to the eyes that differ in their horizontal positions. Despite his efforts, he concluded that it is impossible for a painter to portray a realistic depiction of depth in a scene on a single canvas. Da Vinci chose as his near object a column with a circular cross section and as his far object a flat wall; his column projects identical images of itself in the two eyes.

Stereoscopy became popular during Victorian times with the invention of the Prism Stereoscope by David Brewster. Combined with the advances of photography, tens of thousands of stereograms were produced.

Figure 3: Brewster's Stereoscope

In 1939 the View-Master [3] line was introduced. It is a series of special stereoscopes that are loaded with proprietary discs (called reels) containing a film of stereoscopic scenes. Transition between scenes happens with a switch that rotates the reel, and the viewer looks into the two lenses to view the scene. To let light through the film, most models feature two translucent white films in front of the reel; the viewer needs to hold the View-Master against a light source, although some models with self-illumination were introduced. The View-Master is best known as a toy for children and is still available, although it is less popular than it used to be. Mattel Corporation currently owns the rights to its production.

Figure 4: A typical ViewMaster device

In the 1960's Bela Julesz invented the Random Dot Stereogram. Unlike previous stereograms, in which each half image showed recognizable objects, each half image of the first random-dot stereograms showed a square matrix of about 10,000 small dots, with each dot having a 50% probability of being black or white. No recognizable objects could be seen in either half image. The two half images of a random-dot stereogram were essentially identical, except that one had a square area of dots shifted horizontally by one or two dot diameters, giving horizontal disparity. The gap left by the shifting was filled in with new random dots, hiding the shifted square.

Nevertheless, when the two half images were viewed one by each eye, the square area was almost immediately visible as being closer or farther than the background. Julesz called it a Cyclopean image, in the notion that each eye sees part of an object and the two are combined into one in the brain.

In the 70's Christopher Tyler invented the autostereogram: random dot stereograms that can be viewed without a stereoscope. A famous example is the


Magic Eye pictures, a series of books featuring images which allow people to view

3D images by focusing on 2D patterns.

In 1989 Antonio Medina Puerta demonstrated with photographs that retinal images with no disparity but with different shadows are fused stereoscopically, imparting depth to the imaged scene. He named the phenomenon "Shadow Stereopsis". He showed how effective the phenomenon is by taking two photographs of the Moon at different times, and therefore with different shadows, making the Moon appear in 3D stereoscopically, despite the absence of any other stereoscopic cue.

In 1995 Nintendo Corporation introduced the Virtual Boy [4]. It was a tabletop console and a first step into virtual reality devices for consumers. The goal was to "immerse players into their own private universe", according to Nintendo. That goal was never achieved, though, due to a number of factors: limited technology used to keep the cost down, bad design causing eye and neck strain, and a small catalog of games led to the commercial failure of the device, which was discontinued only a year later. Nevertheless, it was the first step in the use of stereo vision and virtual reality for entertainment purposes.

Figure 5: Nintendo's Virtual Boy


2.3: Pinhole Camera Model

The pinhole camera model [5] describes the mathematical relationship between the coordinates of a 3D point and its projection onto the image plane of an ideal pinhole camera, where the camera aperture is described as a point and no lenses are used to focus light. The model does not include, for example, geometric distortions or blurring of unfocused objects caused by lenses and finite-sized apertures. It also does not take into account that most practical cameras have only discrete image coordinates. This means that the pinhole camera model can only be used as a first-order approximation of the mapping from a 3D scene to a 2D image. Its validity depends on the quality of the camera and, in general, decreases from the center of the image to the edges as lens distortion effects increase.
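For reference, under this model a point with camera coordinates (X, Y, Z) projects to image coordinates (x, y) through the standard perspective relation

x = \frac{f X}{Z}, \qquad y = \frac{f Y}{Z}

where f is the focal length of the ideal pinhole camera.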

Figure 6: Pinhole Camera Model

2.4: Camera Resectioning/Parameters

Camera resectioning [6] is the process of estimating the parameters of a pinhole camera model approximating the camera that produced a given photograph or video.

Usually, the pinhole camera parameters are represented in a 3 × 4 matrix called the camera matrix. There are typically two types of camera parameters, intrinsic and extrinsic.

Intrinsic parameters encompass focal length, image sensor format and principal point.

Those parameters are contained in the intrinsic matrix.

Extrinsic parameters denote the coordinate system transformations from 3D world coordinates to 3D camera coordinates. Equivalently, the extrinsic parameters define the position of the camera center and the camera's heading in world coordinates.

T is the position of the origin of the world coordinate system expressed in coordinates of the camera-centered coordinate system (and NOT the position of the camera, as is often mistakenly assumed). C is the position of the camera expressed in world coordinates, and R is a rotation matrix; the two are related by T = -RC.

2.5: Epipolar geometry

Epipolar geometry [7] is the geometry of stereo vision. When two cameras view a 3D scene from two distinct positions, there are a number of geometric relations between the 3D points and their projections onto the 2D images that lead to constraints between the image points. These relations are derived based on the assumption that the cameras can be approximated by the pinhole camera model.


The figure below depicts two pinhole cameras looking at point X. In real cameras, the image plane is actually behind the center of projection, and produces an image that is rotated 180 degrees. Here, however, the projection problem is simplified by placing a virtual image plane in front of the center of projection of each camera to produce an unrotated image. OL and OR represent the centers of projection of the two cameras. X represents the point of interest in both cameras. Points xL and xR are the projections of point X onto the image planes.

Each camera captures a 2D image of the 3D world. This conversion from 3D to 2D is referred to as a perspective projection and is described by the pinhole camera model.

It is common to model this projection operation by rays that emanate from the camera, passing through its center of projection. Note that each emanating ray corresponds to a single point in the image .

Figure 7: Epipolar Geometry Illustration

2.6: Fundamental matrix

The fundamental matrix [8] F is a 3×3 matrix which relates corresponding points in stereo images. In epipolar geometry, with homogeneous image coordinates x and x' of corresponding points in a stereo image pair, Fx describes a line (an epipolar line) on which the corresponding point x' in the other image must lie. That means that every pair of corresponding points satisfies

x'^\top F x = 0

2.7: Image rectification

Image rectification [9] is a transformation process used to project two or more images onto a common image plane. This process has several degrees of freedom and there are many strategies for transforming images to the common plane.

It is used in computer stereo vision to simplify the problem of finding matching points between images. Distance to an object is then determined by triangulation based on epipolar geometry; more specifically, triangulation relates the depth of an object to its change in position when viewed from a different camera, given that the relative position of each camera is known.

2.8: Disparity map

Disparity [10] refers to the difference in the location of an object between the two corresponding (left and right) images seen by the left and right eye, which is created due to parallax (the eyes' horizontal separation). The brain uses this disparity to calculate depth information from the two-dimensional images.

In computer vision, the disparity map is an image that depicts how far from the viewing source the objects of the scene are. The depiction is based on intensities: brighter objects are closer to the source (they have a larger apparent displacement between the left and right image) and darker objects are further away (they have a smaller apparent displacement between the left and right view).

Figure 8: A groundtruth disparity map

In short, the disparity of a pixel is equal to the shift value that leads to the minimum sum-of-squared-differences for that pixel.
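The inverse relation between disparity and distance mentioned in the overview can be made explicit: for a rectified pair with baseline B and focal length f, a pixel with disparity d lies at depth

Z = \frac{f B}{d}

which is why brighter (larger-disparity) pixels in the map correspond to closer objects.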

Disparity map calculation is essentially the result of stereo matching. This work revolves around some of its calculation methods.

2.9: Stereo Matching

Stereo matching [11] is used for finding corresponding pixels in a pair of images, which allows the recovery of 3D positions by triangulation, using the known intrinsic and extrinsic orientation of the cameras. There are two families of methods [12] for stereo matching, local and global. Local methods attempt to match small regions of one image to another based on intrinsic features of the region. Global methods supplement local methods by considering physical constraints such as surface continuity or base-of-support. Local methods can be further classified by whether they correlate a small area patch among images (called correlation or area based methods) or match features (called feature based methods).


Correlation based (or area based) stereo matching considers a certain area in the left image (usually) and tries to find an equally sized area in the right image that is the closest match. That area is called the matching or correlation window. Since the images are rectified, the algorithm searches only horizontally, up to a predefined offset. Such algorithms produce dense disparity maps. Generally, a smaller matching window will give more detail but more noise, while a larger window will produce a smoother disparity map with less captured detail. Area based algorithms are inherently fast and memory efficient, so they are usually preferred. However, finding the optimal combination of window size and other algorithm parameters can be challenging and requires a lot of testing.

To find a match for a pixel in the left image, the left window is drawn centered on that pixel. It is then compared to several windows in the right image, beginning with a window at the same location (zero disparity) and moving left in increments of one pixel (increasing the disparity by one with each move). Whichever window in the right image gives the lowest cost is said to match the left window. The difference in x coordinates between the center of this match and the center of the left window is the disparity value of the pixel in question in the left image.
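To make the procedure concrete, the following is a minimal sketch in C++ with OpenCV of this search for a single pixel, using the absolute-difference cost of section 2.10. matchPixel, winRadius and maxDisparity are illustrative names, and the code is not taken from any of the implementations tested later.

#include <opencv2/core.hpp>
#include <cstdlib>
#include <limits>

// Sketch: window-based search for one left-image pixel. Assumes rectified
// 8-bit grayscale images and that (row, col) lies at least winRadius
// pixels away from every image border.
int matchPixel(const cv::Mat& left, const cv::Mat& right,
               int row, int col, int winRadius, int maxDisparity)
{
    int bestDisparity = 0;
    long bestCost = std::numeric_limits<long>::max();
    // Start at zero disparity and slide the right window left, pixel by pixel.
    for (int d = 0; d <= maxDisparity && col - d - winRadius >= 0; ++d) {
        long cost = 0;
        for (int dy = -winRadius; dy <= winRadius; ++dy)
            for (int dx = -winRadius; dx <= winRadius; ++dx)
                cost += std::abs(left.at<uchar>(row + dy, col + dx) -
                                 right.at<uchar>(row + dy, col + dx - d));
        if (cost < bestCost) { bestCost = cost; bestDisparity = d; }
    }
    return bestDisparity; // the shift with the lowest matching cost
}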

Figure 9: Simple window matching illustration

Feature based stereo matching computes the corresponding pixels by using features extracted from the images. Features are usually chosen to be lighting and viewpoint independent; as a result, they compensate for viewpoint changes and camera differences. Techniques used to find image features include edge, corner, blob and ridge detection. Such features include

• edge elements

• corners

• line and curve segments

• circles and ellipses

• regions defined either as blobs or polygons.

Area based algorithms are simple and efficient in general. Some of those algorithms

are ideal for real-time stereo matching applications. But, as discussed earlier, it can be

challenging to produce robust matching. Feature-based methods, on the other hand, can produce

fast and robust matching but usually require expensive feature extraction. Area

methods are more commonly used by researchers and this work will focus on them for

the most part.

2.10: Matching Cost

At the base of any matching algorithm is the matching cost, which measures the (dis-)similarity of a pair of locations, one in each image. Matching costs can be defined at the pixel level or over a certain area. Common examples are absolute intensity difference, squared intensity difference, filter-bank responses and gradient based measures. Binary matching costs are also possible, based on binary features such as edges.


2.11: Normalized Cross Correlation

The higher the normalized cross correlation [13] of two windows, the better they match. Normalized cross correlation is calculated by computing the mean and standard deviation of the intensity in each window. Then, the mean intensity is subtracted from each pixel's intensity. Corresponding (intensity minus mean) values from the left and right windows are multiplied together, and these products are summed over the entire window. Finally, the sum is divided by the number of pixels in either window and by each standard deviation.
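The verbal description above corresponds to the following formula, with W the matching window of N pixels, \mu_L, \mu_R the window means and \sigma_L, \sigma_R the window standard deviations:

NCC = \frac{1}{N} \sum_{(x,y) \in W} \frac{\left( I_L(x,y) - \mu_L \right) \left( I_R(x,y) - \mu_R \right)}{\sigma_L \, \sigma_R}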

2.12: Ground truth

Ground truth [14] is a term used in various fields to refer to the absolute truth of something; it is used to test the efficiency of an algorithm or a system. Widely used in machine learning, it denotes a set of measurements that are much more accurate than those of the system being tested.

In the case of stereo vision systems, the question is how well they can estimate 3D positions. The ground truth disparity map is composed of the positions given by a laser range finder, which is known to be much more accurate than any camera system.

Practically, it describes the perfect disparity between left and right image and it is compared to the produced disparity map using various metrics.

2.13: Applications of Stereo Vision

Stereo vision is well known to the general public mainly by 3D movies. But that is only a small part of a wide range of applications not only in everyday life, but also in industry and research and even space exploration.


• Robotics, to extract information about the relative position of 3D objects in the vicinity of autonomous systems, and object recognition, where depth information allows the system to separate occluding image components. Such robotic systems are primarily used in industrial applications.

• 3D displays [15] and head mounted displays, to provide stereoscopic imaging to the human eyes. In such applications, where the goal is depth perception, the basic requirement is to display offset images that are filtered separately to the left and right eye.

There are two methods to accomplish that: one where the user wears glasses to filter the offset images to each eye, and another where no glasses are required.

In the case where glasses (or filters) are used, there are two types: passive and active filters. Passive filters do not require power and can be either color filters or polarization filters. Active shutter filters, as their name suggests, have active shutters to filter the image, and require power.

In the glasses-free case, the light source splits the images directionally into the viewer's eyes. Such displays are called autostereoscopic. Maybe the most famous example of such technology is the Nintendo 3DS game console.

Figure 10: Nintendo's 3DS


Finally, there are the head mounted displays, where a separate display is positioned in front of each eye and the image is projected through lenses to assist the eyes' focusing. Such devices are used in a plethora of contexts, like the military (augmented reality applications), engineering (stereoscopic views of CAD schematics) and of course entertainment, like 3D gaming and movies or tours in virtual environments.

• Calculation of contour maps and geometry extraction for 3D building mapping, mainly from aerial surveys [16]. Those surveys are usually conducted with the use of unmanned aerial vehicles, or UAVs for short. The large number of aerial images captured is then converted by various automated systems into geo-referenced 2D high resolution orthophotos, 3D surface models and point clouds.

• The NASA STEREO project [17], one of the most important and largest-scale projects of its kind. STEREO stands for Solar Terrestrial Relations Observatory, and in essence it is a solar observation mission. Two nearly identical spacecraft were launched in 2006 into orbits around the Sun in a manner that enables stereoscopic imaging of the Sun.

Figure 11: The NASA STEREO Project


The goal is to study solar phenomena (principally coronal mass ejections: massive bursts of solar wind, plasma and magnetic fields ejected into space that can disrupt Earth communications and power networks) on the far side of the Sun. This practically enables solid forecasts of solar activity through a 360-degree view of the Sun at all times.

• Driverless cars [18] are a hot topic at the time of writing. Driverless cars, as the name suggests, can drive to their destination without requiring human intervention. Stereo vision is the way the car "sees" the world in front of it. Of course, a plethora of sensors is used as well, including laser, ultrasonic, GPS etc., so a 360-degree "map" of the surrounding world can be formed by the car. There are quite a few similarities with robot navigation in the process. Of course, this technology is not yet reliable, especially due to the risk of traffic accidents in case of errors or inaccuracy.

2.14: Challenges and difficulties

The correspondence problem refers to the problem of ascertaining which parts of one image correspond to which parts of another image, where the differences are due to movement of the camera, the elapse of time, and/or movement of objects in the photos.

Other major pitfalls include reflections and transparency. It is usually very hard for a machine to distinguish whether it is looking at an object or the reflection of that object. Similarly, it is hard for a computer vision system to recognize the existence of transparent objects between the view source and the target scene.


The third pitfall is continuous and textureless regions, in which it is very difficult to determine which point on the left image corresponds to which point on the right image. Finally, there can be technical difficulties like sensor noise and camera calibration errors.


Chapter 3: Stereo Algorithms Evaluation Process

This chapter contains a short analysis of previous work related to the topic of this thesis, as well as an analysis of the state-of-the-art evaluation method and of the process followed for the thesis.

3.1: Previous Work

• A Taxonomy and Evaluation of Dense Two-Frame Stereo Correspondence Algorithms [19] by D. Scharstein and R. Szeliski. It is the state-of-the-art evaluation and is analyzed in the next paragraph.

• An Experimental Comparison of Stereo Algorithms by R. Szeliski and R.

Zabih [20]. In this work by Szeliski and Zabih there is an effort to compare

experimentally a few stereo vision algorithms. They make use of two stereo

pairs, the well known set from Tsukuba University and another produced by them (a simple scene with a slanted surface). Their methodology consists of

comparison with ground truth depth maps and the measurement of novel

prediction errors.

• Review of Stereo Matching Algorithms for 3D Vision by L. Nalpantidis, G.

Sirakoulis and A. Gasteratos [21]. In this work there is a theoretical

comparison and summary of various methods. It considers both local and

global methods, computational intelligence techniques and the speed and

accuracy of those. Also, some hardware implementation techniques are

presented.

• Overview of Stereo Matching Research, by R.A. Lane and N.A. Thacker [22]. This is a literature survey of a few area and feature based methods. It includes a short description of those methods and some conclusions drawn. It is a relatively old paper and part of a large series of stereo vision publications.

3.2: State-of-the-Art Middlebury Evaluation

The state-of-the-art evaluation method for stereo vision algorithms is offered by Middlebury College. Its creators are Daniel Scharstein and Richard Szeliski, and the evaluation process is documented in their publication titled "A Taxonomy and Evaluation of Dense Two-Frame Stereo Correspondence Algorithms" [19].

The goal of the creators of this evaluation process was to compare a large number of methods within one common framework. For that reason they have focused on techniques that produce a univalued disparity map.

The evaluation process is very detailed and quite complicated. In essence, there are 2 error measurements, RMS error and percentage of bad matching pixels, evaluated over 3 different image areas.

RMS (root-mean-squared) error (measured in disparity units) between the computed disparity map d_C(x, y) and the ground truth map d_T(x, y) is computed by the following formula, with N the total number of pixels:

R = \left( \frac{1}{N} \sum_{(x,y)} \left| d_C(x,y) - d_T(x,y) \right|^2 \right)^{\frac{1}{2}}

Percentage of bad matching pixels is computed by this formula, where \delta_d is a disparity error tolerance:

B = \frac{1}{N} \sum_{(x,y)} \left( \left| d_C(x,y) - d_T(x,y) \right| > \delta_d \right)


Also the images are segmented into three different areas.

• textureless regions T : regions where the squared horizontal intensity gradient

averaged over a square window of a given size is below a given threshold.

Essentially, it is areas on the scene with little to no texture.

• occluded regions O: regions where the left-to-right disparity lands at a location

with a larger (nearer) disparity. This means that an occluded region is visible

on one of the images and not visible on the other.

• depth discontinuity regions D: regions where neighboring disparities differ by

more than a predefined gap, dilated by a window of a given width. It is

practically the areas in the scene where there is a sudden change in the depth

between the objects.

These regions were selected to support the analysis of matching results in typical problem areas.

The Middlebury College offers an online evaluation tool for computer vision researchers to upload and test their algorithms and compare them against many others.

There are also a few datasets offered in various resolutions for testing. The online evaluation tool utilizes 4 specific image pairs and compares the user-submitted disparity maps with the ground truth maps for the four pairs. Note that the online evaluation tool is at version two at the time of writing this thesis.

3.3: Thesis Evaluation Process

The first step was to collect all the open source algorithms that can be found implemented on various sources online. They were all tested, and the ones producing unusable results were discarded. The ones that gave meaningful results are analyzed here.

The comparison of the algorithms is based partly on the state-of-the-art evaluation, namely the percentage of bad matching pixels (sum of absolute differences of the disparity map and the ground truth image matrices, the BadMatchPercent formula from the previous paragraph). The focus is on finding how well each algorithm has performed in estimating the disparity map. Mean elapsed time is measured in all cases, but it is not directly comparable, since two different tools are used and OpenCV is vastly faster than Matlab because it is written in C++. Also, elapsed time depends on more factors, like the parameters of each algorithm, the image size and of course the hardware it runs on. The algorithms presented here were tested on an Intel

Celeron G1620 CPU with 4GB of RAM and an AMD HD6450 GPU.

A pixel by pixel subtraction is conducted between the result and the ground truth image matrices. There is a 30 pixel margin on all sides to eliminate empty image borders, since some algorithms produce disparity maps with black borders that could lower the score for no reason. Any result that is larger than the predefined threshold

(which is traditionally around 1.0) is considered a bad matching pixel. The threshold is the same for all images, so the comparison is fair. Each bad matching pixel found is added to a running count, and in the end the count is divided by the total number of pixels. The final result is the percentage of bad matching pixels in the disparity map.

The images used are the widely popular stereo datasets from Middlebury College.

They contain several right and left views of the same scene, as well as a ground truth image for evaluation. As mentioned before, the Middlebury online evaluation platform uses 4 standard image sets, a total of 8 images (cones, teddy, tsukuba, venus). Those images along with their ground truth will be used. Note that those four image sets are used in the second version of the online evaluation tool of Middlebury, which is still online at the time of writing. Version three should be online soon after this work is completed and will use different image sets for the evaluation of algorithms.

Figure 12: Cones, teddy, tsukuba and venus left view

The method described above is summed up in a mathematical equation, and the code that implements it was written by the author of this thesis. The platform chosen for the comparison is Matlab, due to the simplicity of its matrix operations.
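The author's Matlab code is not reproduced here; the following is an equivalent sketch of the metric in C++ with OpenCV, assuming both maps are single-channel 8-bit images with identical disparity scaling. badMatchPercent is an illustrative name.

#include <opencv2/core.hpp>
#include <cstdlib>

// Sketch of the bad-matching-pixel metric described above: pixel-by-pixel
// absolute difference against the ground truth, skipping a fixed border,
// thresholded and averaged. Assumes equal image sizes and scaling.
double badMatchPercent(const cv::Mat& disp, const cv::Mat& truth,
                       int margin = 30, double threshold = 1.0)
{
    long bad = 0, total = 0;
    for (int y = margin; y < disp.rows - margin; ++y)
        for (int x = margin; x < disp.cols - margin; ++x) {
            int diff = std::abs(disp.at<uchar>(y, x) - truth.at<uchar>(y, x));
            if (diff > threshold) ++bad;
            ++total;
        }
    // Fraction of bad pixels, as reported in the result tables below.
    return static_cast<double>(bad) / total;
}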


Chapter 4: Testing and Comparison of Stereo Algorithms

4.1: Intro

Stereo vision is still a popular topic when it comes to research. It is very active and there is a large number of algorithms being evaluated in the Middlebury platform.

Unfortunately, finding implementations of the various algorithms can be difficult, because communication with their creators is rarely successful and many of them refuse to help. Furthermore, the available implementations most of the time do not function as expected and produce unusable results.

In this work, the algorithms compared can be found implemented on the internet and their implementations are correct. That means they give satisfactory results not only under visual examination but also in comparison with the ground truth disparity maps. All the methods described here give a bad matching pixel percentage of less than 50%.

4.2: Common Algorithm Parameters

Each stereo algorithm is unique and features a certain number of inputs and parameters. But there are some common parameters among the ones analyzed in this work that also apply to the majority of existing stereo algorithms.

As expected, the input to all the algorithms is the stereo pair, traditionally the left and right views of the scene in that order. Some algorithms accept as input the image matrix (RGB or grayscale) and others accept plain image files, reading the matrix afterwards. The rest of the algorithm inputs are actually the parameters.

The first parameter is the window size, used when window/block matching is employed for the search for similarities. A smaller window size usually means a more detailed but coarser (noisier) disparity map. A larger window size gives a smoother disparity map overall, but with less detail captured. Of course, this parameter should be an odd number, since there is always a "center" pixel in the matching window.

The second parameter is the maximum disparity range, i.e. the minimum and maximum disparity values between which the matching algorithm will search for similarities between the blocks of the image pair. Disparity values outside that range are ignored. The minimum disparity value can be a negative number. In the case of the Middlebury test images the minimum value is always 0 and the maximum varies depending on the image.

4.3: Semi Global (Block) Matching

4.3.1: Algorithm Overview

Semi-Global (Block) Matching [23] successfully combines concepts of global and local stereo methods for accurate, pixel-wise matching at low runtime. This is probably the most popular algorithm for stereo matching: it has spawned many other algorithms and has been widely used by stereo vision researchers. As is evident from the results in this work, it gives results with a relatively high number of bad matching pixels and is surpassed by other algorithms. Despite that, it is fast and very effective for real-time stereo applications, since the number of bad matches in its output is not prohibitive.

The core algorithm considers pairs of images with known intrinsic and extrinsic orientation. The method has been implemented for rectified and unrectified images; in the latter case, epipolar lines are computed and followed explicitly while matching. Of course, in this work only rectified images are used (with known epipolar geometry).

The whole method is based on the idea of pixelwise matching of Mutual Information and approximating a global, two-dimensional smoothness constraint by combining many one-dimensional constraints. In a nutshell, the main algorithm has the following processing steps:

1. Pixelwise cost calculation.

2. Implementation of the smoothness constraint (aggregation of costs).

3. Disparity computation with sub-pixel accuracy and occlusion detection.

4.3.2: Pixelwise Cost Calculation

In step 1 (pixelwise cost calculation) the matching cost is calculated for a base image pixel (usually in the left image) from its intensity and that of the suspected correspondence in the match image. An important aspect is the size and shape of the area that is considered for matching: the robustness of matching increases with larger areas.

One way to perform pixelwise cost calculation is to use the Birchfield-Tomasi subpixel metric. The cost is calculated as the minimum absolute difference of intensities in the range of half a pixel in each direction along the epipolar line.

Another way to calculate the pixelwise cost is based on mutual information (abbreviated as MI), which is insensitive to recording and illumination changes. It is defined as the sum of the entropies of the two images minus their joint entropy, according to the following formula:

MI_{I_1,I_2} = H_{I_1} + H_{I_2} - H_{I_1,I_2}


H. Hirschmüller in his work favors the Mutual Information approach, contrary to the OpenCV implementation, which uses Birchfield-Tomasi.

4.3.3: Aggregation of Costs

Pixelwise cost calculation is generally ambiguous, since wrong matches can easily have a lower cost than correct matches due to factors like noise. Therefore, an additional constraint is added that supports smoothness by penalizing changes between neighboring disparities.

A global, 2D smoothness constraint is approximated by combining several 1D constraints.

Figure 13: SGBM matching costs aggregation

The matching costs in 1D are aggregated from all eight directions equally as illustrated on the figure above. The aggregated (or smoothed) cost for a pixel p and disparity d is calculated by summing the costs of all 1D minimum cost paths that end in pixel p at disparity d.
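In the original paper [23], the cost along a single direction r is defined recursively, where P1 and P2 are the penalties for small and larger disparity changes that reappear as parameters in section 4.3.5:

L_r(\mathbf{p}, d) = C(\mathbf{p}, d) + \min \big( L_r(\mathbf{p}-\mathbf{r}, d),\ L_r(\mathbf{p}-\mathbf{r}, d-1) + P_1,\ L_r(\mathbf{p}-\mathbf{r}, d+1) + P_1,\ \min_i L_r(\mathbf{p}-\mathbf{r}, i) + P_2 \big) - \min_k L_r(\mathbf{p}-\mathbf{r}, k)

The aggregated cost is the sum over all directions, S(\mathbf{p}, d) = \sum_r L_r(\mathbf{p}, d). The subtracted term only keeps the values from growing along the path and does not change which disparity is minimal.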


4.3.4: Disparity Computation

The disparity image D that corresponds to the reference image I is determined as in local stereo methods by selecting for each pixel p the disparity d that corresponds to the minimum cost.

For sub-pixel estimation, a quadratic curve is fitted through the neighboring costs and the position of the minimum is calculated.

4.3.5: Implementation Details

SGBM is implemented in OpenCV and is embedded in the library. It is also included in Matlab since version 2011b. The Matlab implementation did not produce usable results despite extensive experimentation; consequently, only the OpenCV version will be used.

OpenCV uses a modified version of the original Hirschmüller algorithm. Contrary to the original algorithm, which considers 8 directions, this one considers only 5 (single pass). Also, this variation matches blocks, not individual pixels, hence the Semi-Global Block Matching name. The parameters of this modified version can be tuned so that the algorithm behaves like the original one.

Also, the mutual information cost function is not implemented; instead, the simpler Birchfield-Tomasi sub-pixel metric is used. Finally, some pre- and post-processing steps from the Konolige Block Matching implementation are included, for example pre- and post-filtering. This is evident from the several identical parameters shared by the two algorithms.


The OpenCV SGBM implementation features the common parameters and a few more that are listed here (the OpenCV documentation is insufficient so the explanation is based on experimentation):

• preFilterCap: Clips the output to [-preFilterCap, preFilterCap].

• uniquenessRatio: Computed disparity d* is accepted only if

SAD(d)>=SAD(d*)*(1+uniquenessRatio/100) for any d!=d+/-1.

• speckleRange, speckleWindowSize: Parameters of the OpenCV function

filterSpeckles which is used to post process the disparity map. It replaces

blobs of similar disparities (the difference of two adjacent values does not

exceed speckleRange) whose size is less or equal to speckleWindowSize (the

number of pixels forming the blob) by the invalid disparity value.

• disp12MaxDiff: A left-right check is performed. Pixels are matched from left

to right image and then from the right back to the left. The disparity value is

accepted only if the distance of the first match and the distance of the second

match have maximum difference of disp12MaxDiff.

• fullDP: If set to true, the algorithm considers eight directions instead of five

(like the original) but with higher memory consumption.

• P1: Penalty for small disparity changes.

• P2: Penalty for higher disparity changes.

It should also be noted that the disparity range consists of two parameters, minDisparity and numberofDisparities. The first value is the minimum disparity for the search window. The second shows the maximum difference from the minimum disparity. It works the same way as with the next algorithm, Block Matching.
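For illustration, here is a minimal sketch of driving these parameters through the OpenCV 3.x C++ interface; the 2.4-era API used at the time sets the same fields directly on a StereoSGBM object, and MODE_HH below corresponds to fullDP = true. The file names and numeric values are placeholders, not the tuned values behind the results below.

#include <opencv2/core.hpp>
#include <opencv2/calib3d.hpp>
#include <opencv2/imgcodecs.hpp>

int main()
{
    cv::Mat left  = cv::imread("tsukuba_left.png",  cv::IMREAD_GRAYSCALE);
    cv::Mat right = cv::imread("tsukuba_right.png", cv::IMREAD_GRAYSCALE);

    int blockSize = 5;                       // matching window size (odd)
    cv::Ptr<cv::StereoSGBM> sgbm = cv::StereoSGBM::create(
        /*minDisparity*/      0,
        /*numDisparities*/    64,            // must be divisible by 16
        /*blockSize*/         blockSize,
        /*P1*/                8 * blockSize * blockSize,   // small-change penalty
        /*P2*/                32 * blockSize * blockSize,  // larger-change penalty
        /*disp12MaxDiff*/     1,
        /*preFilterCap*/      63,
        /*uniquenessRatio*/   10,
        /*speckleWindowSize*/ 100,
        /*speckleRange*/      32,
        cv::StereoSGBM::MODE_HH);            // 8 directions, like fullDP = true

    cv::Mat disp16, disp8;
    sgbm->compute(left, right, disp16);      // fixed-point disparities (x16)
    disp16.convertTo(disp8, CV_8U, 255.0 / (16.0 * 64.0));
    cv::imwrite("disparity_sgbm.png", disp8);
    return 0;
}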


4.3.6: Testing and Results

The produced disparity maps for Cones, Teddy, Tsukuba and Venus image pairs are the following:

Figure 14: SGBM Results

The percentage of bad matching pixels according to the Sum of Absolute Differences metric is the following:

          Cones     Teddy     Tsukuba   Venus     Avg time
SGBM      0.4943    0.4977    0.3923    0.4982    ~0.065 sec

4.4: Block matching

4.4.1: Algorithm overview

This method is based on the block matching algorithm. It is mainly used on video frames for motion estimation, but its principles apply successfully to stereo matching as well.

The block matching algorithm [24] involves dividing the current frame of a video into 'macro blocks' and comparing each macro-block with the corresponding block and its adjacent neighbors in the previous frame of the video. A vector is created that captures the movement of a macro-block from one location to another in the previous frame. This movement, calculated for all the macro-blocks comprising a frame, constitutes the motion estimated in the current frame.

The search area for a good macro-block match is decided by the 'search parameter' p, where p is the number of pixels on all four sides of the corresponding macro-block in the previous frame. The search parameter is a measure of motion: the larger the value of p, the larger the motion that can be captured, but the more computationally expensive the search becomes. Usually the macro-block is taken to be of size 16 pixels and the search parameter is set to 7 pixels.

4.4.2: Algorithm Analysis

The tested implementation was submitted to the OpenCV library by Kurt Konolige and is partly based on his work Small Vision Systems: Hardware and Implementation [12].

The paper revolves around the Small Vision Module or SVM, a compact, inexpensive real-time device for computing dense stereo range images.

In the case of stereo matching, the adjacent neighbor is the second image of the stereo pair.

Figure 15: Block Matching Algorithm for SVM


The algorithm that is implemented here has the following features:

• Laplacian of Gaussian transform (LOG for short), L1 norm (absolute difference) correlation.

• Variable disparity search in pixel unit.

• Postfiltering with an interest operator and left/right check.

• x4 range interpolation.

The LOG transform and L1 norm were chosen because they give good quality results and can be optimized on standard instruction sets available on DSPs and microprocessors.

The following images are copied directly from the paper and help in the explanation of the algorithm. The disparity maps are green on the paper but they are converted to grayscale here for uniformity reasons (this work examines grayscale disparity maps).

Figure 16: Block Matching Sample Images


Image (a) shows the grayscale input image. Image (b) depicts the typical disparity map produced by the algorithm. Brighter areas indicate higher disparities (closer objects) while darker areas indicate lower disparities (further objects). There are 64 possible levels of disparity in total; in image (b) the highest level present is around 40 while the lowest is about 5. There is obviously significant error in the upper left and right corners of the image, due to uniform areas without enough texture to determine the disparity.

In figure (c) the interest operator is applied as a post filter. Areas with insufficient texture are rejected and appear black in the produced image. Even after using this filter, some errors still remain in portions of the image with disparity discontinuities, in this case the side of the person's head. Those errors are caused by overlapping the correlation window on areas with very different disparities.

One way to eliminate those errors is by applying left/right check. The left/right check can be implemented efficiently by storing enough information when doing the original disparity correlation. As the author concludes, the combination of interest operator and left/right check has proven to be the most effective at eliminating bad matches. As mentioned by the author, correlation surface checks were not used, since they do not add to the quality of the range image and can be computationally expensive.

As mentioned earlier, the algorithm described in the paper was intended to be used with the Small Vision Module, which is a small programmable device with limited resources, so it was designed with storage efficiency in mind.


4.4.3: Implementation Details

This algorithm is implemented in OpenCV and is embedded in the library. It is also part of Matlab from 2011b onwards. The Matlab implementation, strangely, gave no usable results even after extensive experimentation with the parameters (similarly to SGBM), so only the OpenCV version is presented here.

The input and parameters include, of course, the common ones and some more. The OpenCV documentation does not sufficiently explain the parameters, so the analysis here is based mainly on experimentation. Most of them are optional, and only the ones used for the testing are analyzed. The disparity range is actually 2 parameters, one for the minimum disparity and one for the size of the disparity search range (minDisparity and numberofDisparities respectively; the final disparity range is [minDisparity, minDisparity+numberofDisparities]). The rest of the parameters are the following:

• preFilterSize: Window size of the prefilter.

• preFilterCap: Clips the output to [-preFilterCap, preFilterCap].

• textureThreshold: Calculates the disparity only at locations where the texture

is larger than or equal to this threshold.

• UniquenessRatio: Computed disparity d* is accepted only if

SAD(d)>=SAD(d*)*(1+uniquenessRatio/100) for any d!=d+/-1.

• speckleRange, speckleWindowSize: Parameters of the OpenCV function

filterSpeckles which is used to post process the disparity map. It replaces

blobs of similar disparities (the difference of two adjacent values does not

exceed speckleRange) whose size is less or equal to speckleWindowSize (the

number of pixels forming the blob) by the invalid disparity value.


• disp12MaxDiff: A left-right check is performed: pixels are matched from the left to the right image and then from the right back to the left. The disparity value is accepted only if the distance of the first match and the distance of the second match differ by at most disp12MaxDiff.
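A corresponding minimal sketch for this algorithm through the OpenCV 3.x interface follows (the 2.4-era StereoBM exposes the same fields through its state structure); all values are illustrative, not the tuned ones used for the results below.

#include <opencv2/core.hpp>
#include <opencv2/calib3d.hpp>
#include <opencv2/imgcodecs.hpp>

int main()
{
    cv::Mat left  = cv::imread("tsukuba_left.png",  cv::IMREAD_GRAYSCALE);
    cv::Mat right = cv::imread("tsukuba_right.png", cv::IMREAD_GRAYSCALE);

    // numberofDisparities = 64, window size = 15 (illustrative values)
    cv::Ptr<cv::StereoBM> bm = cv::StereoBM::create(64, 15);
    bm->setMinDisparity(0);
    bm->setPreFilterSize(9);          // prefilter window size
    bm->setPreFilterCap(31);          // clip prefilter output
    bm->setTextureThreshold(10);      // skip low-texture areas
    bm->setUniquenessRatio(15);       // reject ambiguous matches
    bm->setSpeckleWindowSize(100);    // post-filter small blobs...
    bm->setSpeckleRange(32);          // ...of similar disparity
    bm->setDisp12MaxDiff(1);          // left/right consistency check

    cv::Mat disp16, disp8;
    bm->compute(left, right, disp16); // CV_16S, disparities scaled by 16
    disp16.convertTo(disp8, CV_8U, 255.0 / (16.0 * 64.0));
    cv::imwrite("disparity_bm.png", disp8);
    return 0;
}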

4.4.4: Testing and Results

Produced disparity maps (cones, teddy, tsukuba, venus respectively):

Figure 17: Block Matching Results

The percentage of bad matching pixels according to the Sum of Absolute Differences metric is the following:

          Cones     Teddy     Tsukuba   Venus     Avg time
BM        0.4450    0.4583    0.3556    0.4707    ~0.028 sec

4.5: Loopy belief propagation

4.5.1: Overview

This method [25] by Nghia Kien Ho focuses on solving the stereo problem using Markov Random Fields and Loopy Belief Propagation. The method is heavy on mathematics and quite complicated. The creator offers an extensive analysis of the algorithm as well as an OpenCV implementation on his website.


4.5.2: Markov Random Fields

Markov Random Fields (abbreviated as MRF) are undirected graphical models that can encode spatial dependencies. They consist of nodes and links, as all graphical models do, but they can also feature cycles/loops. Given a 3x3 image, the stereo problem can be modeled using an MRF as follows:

Figure 18: Markov Random Field illustration

The blue nodes are the observed variables and represent pixel intensity values in this work. The pink nodes are the hidden variables and represent the unknown disparity values. The hidden variable values are referred to as labels. The links between the nodes represent a dependency; for example, the center node depends only on the four nodes it is connected to. This rather strong assumption, that each node depends only on the nodes it is connected to, is called the Markov assumption.

4.5.3: MRF Formulation

The stereo problem can be formulated in terms of an MRF as the following energy function:

E(X, Y) = \sum_i \mathrm{DataCost}(x_i, y_i) + \sum_i \sum_{j \in N(i)} \mathrm{SmoothnessCost}(x_i, x_j)

where Y are the observed nodes, X the hidden nodes, i is the pixel index and N(i) are the neighboring nodes of node x_i (see the diagram above).

This energy function sums up all the costs at each link, given an image Y and a labeling X. The aim is to find a labeling for X that produces the lowest energy; that labeling is essentially the disparity map. The energy function contains two other functions, DataCost and SmoothnessCost.

4.5.4: DataCost

The DataCost function returns the cost/penalty of assigning a label value x_i to data y_i. Good matches require a low cost and bad matches a high cost. Usually, the sum of absolute differences or the sum of squared differences is ideal to serve as the cost metric.

Practically, the function calculates the SAD (or any other metric chosen) between blocks (or even single pixels) in the two images of the stereo pair, taking into account the different tested disparity values. The sketch below illustrates all this:
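The pseudo code from the author's site is not reproduced here; the following C++ sketch captures the same idea, computing the SAD data cost of one candidate label for one pixel. dataCost and radius are illustrative names, and border handling is left to the caller.

#include <opencv2/core.hpp>
#include <cstdlib>

// Sketch of the DataCost idea: SAD between the window centered on
// (row, col) in the left image and the window shifted left by `label`
// pixels in the right image. Assumes rectified grayscale inputs.
int dataCost(const cv::Mat& left, const cv::Mat& right,
             int row, int col, int label, int radius)
{
    int cost = 0;
    for (int dy = -radius; dy <= radius; ++dy)
        for (int dx = -radius; dx <= radius; ++dx)
            cost += std::abs(left.at<uchar>(row + dy, col + dx) -
                             right.at<uchar>(row + dy, col + dx - label));
    return cost; // low cost = good match, high cost = bad match
}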

Naturally, the direction of the matching window on the right image depends on the stereo pair.


4.5.5: SmoothnessCost

The SmoothnessCost function enforces smooth labeling across adjacent hidden nodes.

To achieve that, a function that penalizes adjacent labels that are different is needed.

The following table shows some commonly used cost functions.

Figure 19: Various cost functions

The Potts model is a binary penalizing function with a single tunable lambda (λ) parameter. This value controls how much smoothing is applied. The linear and quadratic models have the extra parameter K, which is a truncation value that caps the maximum penalty.

As the creator of the method comments, the choice of the DataCost and SmoothnessCost functions is not clear-cut and should be based on experimentation.
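For concreteness, the three penalty models of Figure 19 can be written as follows. LAMBDA and K match the parameter names listed in section 4.5.7, but the function bodies are a sketch, not the author's code.

#include <algorithm>
#include <cstdlib>

// Penalty models for two adjacent labels a and b. LAMBDA controls the
// smoothing strength; K truncates (caps) the maximum penalty.
int pottsCost(int a, int b, int LAMBDA)
{
    return (a == b) ? 0 : LAMBDA;                   // binary penalty
}
int truncatedLinearCost(int a, int b, int LAMBDA, int K)
{
    return LAMBDA * std::min(std::abs(a - b), K);   // capped at LAMBDA*K
}
int truncatedQuadraticCost(int a, int b, int LAMBDA, int K)
{
    int d = a - b;
    return LAMBDA * std::min(d * d, K);             // capped at LAMBDA*K
}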

4.5.6: Loopy Belief Propagation main part

When the DataCost and SmoothnessCost functions have been chosen and the parameters tuned, the next step is to solve the energy function. Trying all possible label combinations (brute force) is computationally intractable, so finding an exact solution is definitely not easy and should not be expected. Instead, finding an approximate solution is a more viable approach.

The Loopy Belief Propagation (LBP) algorithm was chosen among others (Graph Cuts, ICM etc.) to find an approximate solution for the MRF. The original Belief Propagation algorithm [26] was proposed by Pearl in 1982 for finding exact marginals on trees. Trees are essentially graphs that contain no loops, but as it turned out, the same algorithm can successfully be applied to general graphs that do contain loops; the word "loopy" in the name originates from there.

LBP is a message passing algorithm. A node passes a message to an adjacent node only when it has received all its incoming messages, excluding the message sent to it by the destination node. The following figure illustrates the process:


Figure 20: LBP message passing

Node x1 wants to send a message to x2, so it waits for messages from all other nodes (A, B, C, D) before sending it. As explained earlier, it will not send the message from x2 to x1 back to x2. Node x1 maintains all possible beliefs about node x2. The choice between using costs/penalties or probabilities depends on the choice of the MRF energy formulation.

Pseudo code on the author's website illustrates the process discussed above; a sketch of one message update is given after the belief formula below. The first step is always the initialization of the messages. As mentioned earlier, each node has to wait for all incoming messages before sending its own message to the target node. This means that at the start of the algorithm each node would wait forever and receive nothing, so no message could ever be sent. To overcome that problem, all messages are initialized to some constant so the algorithm can proceed; the initialization value is typically 0 or 1.

The main part of LBP is iterative. By adjusting the respective parameters, the algorithm can run for a chosen number of iterations or until the change in energy drops below a threshold. For each iteration, messages are passed around the MRF. The passing scheme is arbitrary and any sequence is valid (the algorithm's creator chooses right, left, up and down); as mentioned, different sequences will produce different results.

Once the LBP iterations complete, the best label at every pixel can be found by calculating its belief, using the following formula, where msg_{k \to i}(l) is the message sent to node i from neighbor k for label l:

belief(x_i = l) = \mathrm{DataCost}(i, l) + \sum_{k \in N(i)} msg_{k \to i}(l)

In this cost formulation, the best label is the one that minimizes the belief.
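Here is a sketch of a single min-sum message update consistent with the description above; sendMessage and its arguments are illustrative names, and the author's implementation organizes this per pixel over the whole grid.

#include <algorithm>
#include <limits>
#include <vector>

// One min-sum LBP message from node i to neighbor j, for every label.
// incoming holds the current messages from i's other neighbors (j is
// excluded, as described above); dataCost[l] is DataCost(i, l).
std::vector<float> sendMessage(const std::vector<float>& dataCost,
                               const std::vector<std::vector<float>>& incoming,
                               float (*smoothnessCost)(int, int))
{
    const int labels = static_cast<int>(dataCost.size());
    std::vector<float> msg(labels);
    for (int l = 0; l < labels; ++l) {            // label hypothesized at j
        float best = std::numeric_limits<float>::max();
        for (int lp = 0; lp < labels; ++lp) {     // label tried at i
            float cost = dataCost[lp] + smoothnessCost(l, lp);
            for (const auto& m : incoming)
                cost += m[lp];
            best = std::min(best, cost);
        }
        msg[l] = best;                            // best support i can offer j
    }
    return msg;
}
// After the final iteration, belief(x_i = l) is dataCost[l] plus the sum
// of ALL incoming messages at label l; the minimizing label is selected.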

4.5.7: Implementation Details

As mentioned earlier, there is an OpenCV implementation available on the author's website. The common parameters and a few more are featured. The disparity range is called labels, and the window size is controlled by the variable wradius, which accepts even numbers; one is then added to the selected number so that the window size is odd, as it should be. The rest of the parameters are the following:

• BP_ITERATIONS: An integer that defines how many iterations/loops the

algorithm will run for.


• LAMBDA: This value controls how much smoothing is applied in the

SmoothnessCost function.

• SMOOTHNESS_TRUNC: Truncation value for the truncated linear model

that is used in the implementation.

4.5.8: Testing and Results

The produced disparity maps for the Cones, Teddy, Tsukuba and Venus image pairs are the following (note that the algorithm ran for 5 loops):

Figure 21: Loopy Belief Propagation Results

The percentage of bad matching pixels according to the Sum of Absolute Differences metric is the following (note that the time is measured for 5 loops):

          Cones     Teddy     Tsukuba   Venus     Avg time
LoopyBP   0.0953    0.4206    0.0410    0.4203    ~155.88 sec

4.6: Fast stereo matching and disparity estimation

4.6.1: Overview

This method is based on the paper "A Hybrid Algorithm for Disparity Calculation from Sparse Disparity Estimates Based on Stereo Vision" [27].


This excellent work proposes a hybrid method for disparity estimation that combines the existing block-based and region-based stereo matching approaches. It utilizes image segmentation through K-Means clustering, morphological filtering and connected component analysis, a SAD cost function, and disparity map reconstruction.

The process is very clearly documented by the authors and will be analyzed here step by step. The following diagram depicts an overview of the whole algorithm.

Figure 22: Fast Stereo Matching and Disparity Estimation

4.6.2: Algorithm Analysis

The first step is color conversion from RGB to Lab. The majority of imaging equipment captures images in RGB format; this format, however, does not properly approximate human vision, and the Lab color space was developed to do so more faithfully. The lightness component, abbreviated L, closely matches the human perception of lightness and is widely used by image processing algorithms. This algorithm retains only the L values of the pixels for further processing.
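As a quick illustration of this step, the following OpenCV snippet converts an image to Lab and keeps only the lightness channel; the file name is a placeholder:

```python
import cv2

left_bgr = cv2.imread('left.png')                     # OpenCV loads BGR
left_lab = cv2.cvtColor(left_bgr, cv2.COLOR_BGR2LAB)  # BGR -> Lab
L_left = left_lab[:, :, 0]                            # keep L (lightness) only
```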


Step two is image segmentation. It is performed on the L values of the left image pixels using a fast implementation of the K-Means algorithm. The image pixels are represented by a one-dimensional feature, namely the L value of each pixel. A histogram of the L values is then built and used instead of the actual pixel values for the subsequent iterations of the K-Means clustering; since the histogram has a small fixed number of bins, far fewer than the number of pixels, the runtime is significantly reduced.
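The following sketch shows the histogram trick this speedup rests on: the 256 possible L values are clustered, each bin weighted by its pixel count, so every K-Means iteration touches 256 entries instead of every pixel. The value of K, the iteration count and the initialization are illustrative, not the authors' settings:

```python
import numpy as np

def histogram_kmeans(L_img, K=8, n_iters=20):
    # L_img: uint8 lightness image; returns per-pixel cluster labels.
    hist = np.bincount(L_img.ravel(), minlength=256).astype(float)
    bins = np.arange(256, dtype=float)
    centers = np.linspace(0.0, 255.0, K)          # simple even-spread init
    for _ in range(n_iters):
        # Assign each histogram bin to its nearest cluster center.
        assign = np.argmin(np.abs(bins[:, None] - centers[None, :]), axis=1)
        for k in range(K):
            sel = assign == k
            if hist[sel].sum() > 0:               # weighted mean of the bins
                centers[k] = np.average(bins[sel], weights=hist[sel])
    assign = np.argmin(np.abs(bins[:, None] - centers[None, :]), axis=1)
    return assign[L_img]                          # map pixels via their bin
```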

Step three is segment boundary detection and refinement. Segment boundary detection is achieved by comparing the cluster assignment of each pixel with those of its 8 neighboring pixels. If any of them differs, the pixel is marked as one (it belongs to a segment boundary); otherwise it is marked as zero. Thus the boundary map is generated from the segmented left image. Since the clustering in step two is based only on the pixels' lightness values, the accuracy of the clustering is limited, and many pixels can be falsely identified as belonging to segment boundaries. To overcome that, the authors apply two morphological filters that refine the boundary map by removing such noisy pixels.
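A compact way to realize the 8-neighbor comparison is to shift the segmented image in all eight directions and OR the differences together; this sketch ignores the wrap-around that np.roll introduces at the image border, for brevity:

```python
import numpy as np

def boundary_map(seg):
    b = np.zeros(seg.shape, dtype=np.uint8)
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            if (dy, dx) == (0, 0):
                continue
            neigh = np.roll(np.roll(seg, dy, axis=0), dx, axis=1)
            b |= (neigh != seg).astype(np.uint8)   # any differing neighbor
    return b                                        # 1 = boundary pixel
```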

Two types of morphological filters are used, Fill and Remove. Fill finds isolated interior zero pixels that are surrounded by ones and sets them to one as well. Remove sets individual one pixels to zero if all four of their connected neighbors are one, leaving only the boundary pixels on. Furthermore, the authors use connected component analysis to remove small artefacts in the boundary map caused by segmentation errors; if disparities were calculated for those artefacts, they would most probably be false. Finally, the smallest connected components, which together contribute about 4% of the total number of pixels, are removed.
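A rough numpy rendering of the two filters as just described, using the four-connected neighborhood for both and leaving the one-pixel image border untouched for simplicity (the actual implementation may differ in neighborhood choice):

```python
import numpy as np

def fill_and_remove(b):
    out = b.copy()
    up, down = b[:-2, 1:-1], b[2:, 1:-1]
    left, right = b[1:-1, :-2], b[1:-1, 2:]
    core = b[1:-1, 1:-1]
    four_on = (up == 1) & (down == 1) & (left == 1) & (right == 1)
    inner = core.copy()
    inner[(core == 0) & four_on] = 1   # Fill: interior zeros become ones
    inner[(core == 1) & four_on] = 0   # Remove: interior ones switched off
    out[1:-1, 1:-1] = inner
    return out
```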


Step four is disparity calculation on the boundary map. The well-known SAD (Sum of Absolute Differences) cost function is used to determine the disparities of the boundary pixels only, using the L values of the left and right image pixels. A partial disparity map is built from these sparse disparity measurements.
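A sketch of this sparse SAD search, restricted to boundary pixels; the window radius and disparity range below are placeholders:

```python
import numpy as np

def boundary_disparities(L_left, L_right, boundary, max_d=64, w=3):
    H, W = L_left.shape
    disp = np.full((H, W), -1, dtype=np.int32)       # -1 marks "unknown"
    for y, x in zip(*np.nonzero(boundary)):
        if y < w or y >= H - w or x < w or x >= W - w:
            continue                                  # skip image margins
        ref = L_left[y - w:y + w + 1, x - w:x + w + 1].astype(int)
        best_cost, best_d = None, 0
        for d in range(min(max_d, x - w) + 1):        # stay inside the image
            cand = L_right[y - w:y + w + 1, x - d - w:x - d + w + 1].astype(int)
            cost = np.abs(ref - cand).sum()           # SAD over the window
            if best_cost is None or cost < best_cost:
                best_cost, best_d = cost, d
        disp[y, x] = best_d
    return disp                                       # partial disparity map
```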

The fifth and final step is disparity map reconstruction from boundaries. The algorithm scans through each row of the partial disparity map and computes the remaining disparities based on the ones already calculated. It operates in two stages (a sketch of both follows the list):

- Disparity propagation ('fill' stage): In this first stage the disparity map is scanned row-wise, left to right. Whenever two boundary pixels with identical disparity values are encountered, the intermediate pixels of that row (i.e. the pixels between the two boundaries) are 'filled' with that disparity value. An exception is made near the left and right ends of each row, which are filled with the disparity value of the nearest boundary pixel until a boundary pixel is encountered.

- Estimation from known disparities ('peek' stage): In the second stage the algorithm searches for pixels whose disparity has not been determined yet and estimates it from the disparity values of their neighboring pixels. When such a pixel is found, the known disparities of its neighbors are stored in an array and the unknown disparity is computed using statistical analysis.
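The two stages, sketched for a single row of the partial map (-1 = unknown); a plain median stands in for the paper's statistical analysis in the peek stage:

```python
import numpy as np

def reconstruct_row(row):
    out = row.copy()
    known = np.nonzero(out >= 0)[0]
    if known.size == 0:
        return out
    out[:known[0]] = out[known[0]]          # extend nearest boundary value
    out[known[-1] + 1:] = out[known[-1]]    # ... at both row ends
    # Fill stage: flood between boundary pixels with identical disparity.
    for a, b in zip(known[:-1], known[1:]):
        if out[a] == out[b]:
            out[a + 1:b] = out[a]
    # Peek stage: estimate leftovers from nearby known disparities.
    for i in np.nonzero(out < 0)[0]:
        neigh = out[max(0, i - 2):i + 3]
        vals = neigh[neigh >= 0]
        if vals.size:
            out[i] = int(np.median(vals))
    return out
```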

4.6.3: Implementation Details

The algorithm's input parameters, besides the common ones, are:


• K: the number of intensity-based clusters for the K-Means clustering in the second step.

• Disp_scale: the factor by which calculated disparities will be multiplied. The value should be such that max_disparity * disp_scale ≤ 255.

It should be noted that all the parameters except the image pair are optional. A random value will be used if they are not specified and the results might not be optimal.

4.6.4: Testing and Results

The produced disparity maps for Cones, Teddy, Tsukuba and Venus image pairs are the following:

Figure 23: Fast Stereo Matching and Disparity Estimation Results

The percentage of bad matching pixels according to the Sum of Absolute Differences metric is the following:

       Cones    Teddy    Tsukuba  Venus    Avg. time
FSM    0.0770   0.1338   0.0320   0.1202   ~10.14 sec


4.7: Probability-Based Rendering for View Synthesis

4.7.1: Algorithm Overview

The main objective of this work [28] is to synthesize a virtual view, given two reference images, without deterministic correspondences. The first challenge that occurred was to construct the probability of all probable matching points. The second was to render an intermediate view using the set of all candidate matching points together with that probability.

To address the aforementioned challenges, the authors of the paper present the probability-based rendering (PBR) approach, which robustly reconstructs an intermediate view using the steady-state matching probability (SSMP) density function.

SSMP: In this particular work the matching cost, typically referred to as a cost volume in the correspondence matching literature, is re-defined as the probability of two points being matched, enabling random walk with restart (RWR) to be applied to optimize the matching probability. The RWR uses edge weights between neighboring pixels to enhance the matching probability, similarly to the aggregation methods of local stereo matching.

PBR: The rendering process is re-formulated as an image fusion, so that all probable matching points represented by the SSMP can be considered together. This approach has two significant advantages: first, it suppresses the flicker artifacts of existing methods; second, the intermediate view is free from the hole filling problem, since the SSMP considers all positions of probable matching points.


4.7.2: SSMP with RWR

First of all, the SSMP is defined. Two images are assumed, left and right. The probability p measures how likely I_l(m1, m2) (a point on the left image) is to be matched to I_r(m1 - d, m2) (the corresponding point on the right image at disparity d), or the opposite. Also, the probability is inversely proportional to the cost, since a smaller matching cost means a higher matching probability. The above can be summarized in the following formulas:
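The formulas themselves are not reproduced here; a plausible reconstruction consistent with the definitions above, following the usual normalized-exponential convention (an inference, not a quotation from the paper), is:

$$p^{0}(m,d) = \frac{\exp(-e^{0}(m,d))}{Z(m)}, \qquad Z(m) = \sum_{d} \exp(-e^{0}(m,d))$$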

where p^0 is an initial matching probability calculated from an initial matching cost e^0, and Z(m) represents a normalization term. The variable m denotes the coordinates (m1, m2) and d denotes the disparity.

The next step is SSMP estimation using RWR. The random walk has been widely used to optimize probabilistic problems, as the authors note. A random walker iteratively transits to its neighboring points according to an edge weight; it also returns to its initial position with a restarting probability a (0 ≤ a ≤ 1) at each iteration. A matching probability in the SSMP can be obtained by the RWR in an iterative fashion as follows:
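The standard RWR recursion matching this description would be (a reconstruction, not a quotation):

$$p^{t+1}(m,d) = (1-a) \sum_{n \in N_m} w(m,n)\, p^{t}(n,d) + a\, p^{0}(m,d)$$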


where N_m denotes the four-neighborhood of a reference pixel m. Note that the above formula reduces to the plain random walk when the restarting probability is zero. Under the assumption that neighboring pixels tend to have similar matching probabilities when the range distance between the reference pixel m and its neighboring pixel n is small, an edge weight w(m,n) is computed by the following formula:
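Again reconstructing from the description (a Gaussian kernel on the range distance is the usual choice; the paper's exact form may differ):

$$w(m,n) = \frac{1}{Z_w(m)} \exp\!\left(-\frac{\lVert I(m) - I(n) \rVert_2^2}{\gamma}\right)$$

where Z_w(m) normalizes the weights over the four neighbors.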

where γ represents the bandwidth parameter, typically set to the intensity variance, and ||·||_2 denotes the l_2 norm. A steady-state solution p_s(m,d), which is referred to as the SSMP in this work, can then be obtained by iteratively updating p_{t+1}(m,d) until p_{t+1}(m,d) = p_t(m,d).
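A minimal sketch of this steady-state iteration, assuming precomputed per-direction edge weights; all names, shapes and the convergence test are illustrative rather than the authors' code:

```python
import numpy as np

def rwr_ssmp(p0, weights, a=0.05, tol=1e-6, max_iters=1000):
    """p0: H x W x D initial matching probabilities; weights: dict of
    four H x W edge-weight maps (one per neighbor direction), assumed
    normalized per pixel."""
    shifts = {'up': (1, 0), 'down': (-1, 0), 'left': (0, 1), 'right': (0, -1)}
    p = p0.copy()
    for _ in range(max_iters):
        agg = np.zeros_like(p)
        for d, (dy, dx) in shifts.items():
            neigh = np.roll(np.roll(p, dy, axis=0), dx, axis=1)
            agg += weights[d][..., None] * neigh      # weighted neighbor term
        p_new = (1 - a) * agg + a * p0                # restart term
        if np.max(np.abs(p_new - p)) < tol:           # steady state reached
            return p_new
        p = p_new
    return p
```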

According to the authors, this approach presents significant advantages. First, it does not require specifying a window size for reliable matching, contrary to conventional methods, thanks to the small number of adjacent neighbors. Second, there is no need to specify the number of iterations, since it yields a non-trivial solution in the steady state. Third, the method gives the optimal solution for a given energy functional.

4.7.3: PBR with SSMP

Now the two reference images and the sets of their corresponding SSMPs are given. The rendering process is cast into a probabilistic image fusion. The baseline between the left and right cameras is assumed to be normalized to 1, and β denotes the location of a virtual camera, where 0 ≤ β ≤ 1. Also, P_l(m,d) and P_r(m,d) encode the matching probability of a pixel on the left and right image (I_l^u(m,d) and I_r^u(m,d)) respectively, as follows:


where Z_l(m) and Z_r(m) are:

and <·> represents a rounding operator. The virtual view is then synthesized via an image fusion process. Specifically, a probabilistic average, E_l(I_l(m)) and E_r(I_r(m)), of the two reference images is computed from the corresponding probabilities P_l(m,d) and P_r(m,d) and the textures I_l^u(m,d) and I_r^u(m,d) along with the disparity hypothesis d, and then blended as follows:
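Consistent with this description, the probabilistic averages and the blend plausibly take the form below (a reconstruction; the exact weighting in the paper may differ):

$$E_l(I_l(m)) = \sum_d P_l(m,d)\, I_l^u(m,d), \qquad E_r(I_r(m)) = \sum_d P_r(m,d)\, I_r^u(m,d)$$

$$I_u(m) = (1-\beta)\, E_l(I_l(m)) + \beta\, E_r(I_r(m))$$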

The left and right disparity maps can be denoted as d_l^w(m) and d_r^w(m) respectively. The sampled points I_l^u(m,d) and I_r^u(m,d) are then converted into functions of m alone, I_l^u(m) and I_r^u(m) respectively. Furthermore, the matching probability functions P_l(m,d) and P_r(m,d) are simplified as sets of shifted Dirac delta functions as follows:

and


Then, the PBR of the previous equation becomes:

For a given fixed point m*, the PBR synthesizes the intermediate view I_u(m*) from the reference view I_l^u(m*,d) and the probability P_l(m*,d) as follows:

Finally, the PBR is able to handle occlusion and dis-occlusion (hole) regions by assuming that the background texture varies smoothly; the problematic regions have their textures synthesized in a probabilistic manner.

4.7.4: Implementation details

The implementation of this algorithm runs on Matlab. As discussed earlier, it does not require one of the common parameters (the window size). The most important parameter the user has to modify is the disparity range. Apart from that, there is a large number of parameters that can be tuned to control various aspects of the algorithm, but none of them need to be changed and they can be left at their default values.

4.7.5: Testing and Results

The produced disparity maps for the Cones, Teddy, Tsukuba and Venus image pairs are the following:

Figure 24: SSMP Results


The percentage of bad matching pixels according to the Sum of Absolute Differences metric is the following:

        Cones    Teddy    Tsukuba  Venus    Avg. time
SSMP    0.0883   0.1087   0.0379   0.4993   ~231.1 sec

4.8: Results Analysis and Conclusion

Below is a table that summarizes the percentage of bad matching pixels for all the algorithms. Highlighted in blue is the best (lowest) score for each image pair and in red the worst (highest).

OpenCV     Cones    Teddy    Tsukuba  Venus    Avg. time
BM         0.4450   0.4583   0.3556   0.4707   ~0.028 sec
SGBM       0.4943   0.4977   0.3923   0.4982   ~0.065 sec
LoopyBP    0.0953   0.4206   0.0410   0.4203   ~155.88 sec
Matlab
SSMP       0.0883   0.1087   0.0379   0.4993   ~231.1 sec
FSM        0.0770   0.1338   0.0320   0.1202   ~10.14 sec

The following diagram gives a visual representation of the performance of the five algorithms. Since the measured quantity is the percentage of bad matching pixels, a lower value indicates better performance and a higher value indicates more matching errors.


Figure 25: Percentage of bad matching pixels (lower value is better)

The most "difficult" image pairs to match appear to be Venus and Teddy: all the algorithms produce their highest numbers of bad matching pixels on those two sets. A possible reason is that these sets exhibit little variation in the scene and greater uniformity in some regions, which can "confuse" an algorithm and yield a large number of candidate matching points. The other two image sets, Cones and Tsukuba, give better scores with all the algorithms; they exhibit greater variation in the scene, so matching points are easier to identify, with Tsukuba giving the best result in every algorithm.

Overall, the most efficient algorithm of the ones tested here is Fast Stereo Matching and Disparity Estimation by G. R. M. Reddy and S. Mukherjee [27]. It gives a small number of bad matching pixels under all circumstances and has a low running time. The SSMP also gives excellent results, but it is highly complicated and exhibits a long running time due to the large number of steps required for its completion; as the Venus result shows, it cannot always handle uniform image regions effectively. Loopy Belief Propagation likewise handles uniformity in scenes poorly, as the Teddy and Venus results show; additionally, if a higher disparity range is selected, the running time becomes quite long even for a small number of loops. Finally, SGBM and BM share many common points in their OpenCV implementations, with BM giving slightly better scores. Both algorithms performed poorly in the matching process itself but have a very small runtime; they seem ideal for real-time applications, or wherever speed matters more than robust stereo matching.


Chapter 5: Discussion and future work

As mentioned earlier, stereo vision is employed in scientific, industrial, military and even consumer fields. Although many people still consider it a gimmick, it is steadily gaining traction and acceptance.

Maybe the most obvious research field that will employ stereo vision in the near future is virtual reality. Such applications have existed for a few years, but mostly for educational or entertainment purposes with limited scope, mainly virtual tours of rather small 3D environments. Nowadays, with increased computational power, virtual reality can also be used in immersive and interactive applications, like video games.

Several consumer virtual reality devices like Google Cardboard [29] and the Oculus Rift [30] have started to make their way to consumers, and more such devices are expected from many manufacturers in the near future.

Figure 26: Current Virtual Reality Devices

Another interesting project, scheduled for commercial release in October 2015, is a device dubbed a virtual reality toy, intended to be the greatest remodeling of the famous View-Master. The Mattel toy corporation is working with Google on the project, which is largely based on the Google Cardboard. The traditional reels are replaced by plastic cards and a smartphone: the user slides the smartphone inside the headset and scans the cards, and a 3D image based on the theme of the cards is displayed. The trademark switch of the new View-Master is now used to zoom or to focus on objects in the virtual scene.

Another field that has recently started to employ stereo vision is medicine, more specifically endoscopy. Traditional endoscopes feature a single camera that provides a two-dimensional image of the patient's examined internal organ. Stereoscopic endoscopes feature two cameras that provide three-dimensional imaging, thus allowing a more thorough visual examination by extracting information about the internal surface of the organs.

Figure 27: Stereoscopic Endoscope

Research is also very active in driverless cars, discussed in the first chapter, and is expanding to other vehicles, mainly autonomous drones and Unmanned Ground Vehicles. There are also several space exploration projects that employ stereo vision, such as an innovative planetary landing algorithm [31] proposed by S. Woicke and E. Mooij, used to extract planet surface information and safely guide the space vessel to touchdown.

Finally, there are the robotic systems, autonomous or otherwise, that use stereo imaging along with many other sensors. Robots are becoming more efficient and intelligent, and their use is set to expand in the near future into almost every sector imaginable.


References

1. http://en.wikipedia.org/wiki/Computer_stereo_vision.

2. https://en.wikipedia.org/wiki/Stereopsis#History_of_investigations_into_stereopsis.

3. https://en.wikipedia.org/wiki/View-Master#History.

4. https://en.wikipedia.org/wiki/Virtual_Boy.

5. http://en.wikipedia.org/wiki/Pinhole_camera_model.

6. http://en.wikipedia.org/wiki/Camera_resectioning.

7. http://en.wikipedia.org/wiki/Epipolar_geometry.

8. http://en.wikipedia.org/wiki/Fundamental_matrix_%28computer_vision%29.

9. http://en.wikipedia.org/wiki/Image_rectification.

10. http://www.jayrambhia.com/blog/disparity-maps/.

11. http://www.cs.stolaf.edu/wiki/index.php/Stereo_Matching.

12. Konolige, K. Small Vision Systems: Hardware and Implementation. Springer. 1998.

13. https://en.wikipedia.org/wiki/Cross-correlation#Normalized_cross-correlation.

14. https://en.wikipedia.org/wiki/Ground_truth.

15. http://techcrunch.com/2010/06/19/a-guide-to-3d-display-technology-its-principles-methods-and-dangers/.

16. http://www.self.gutenberg.org/articles/Aerial_survey.

17. http://www.nasa.gov/mission_pages/stereo/main/index.html.


18. https://en.wikipedia.org/wiki/Autonomous_car.

19. D. Scharstein, R. Szeliski. A Taxonomy and Evaluation of Dense Two-Frame Stereo Correspondence Algorithms. 2001.

20. R. Szeliski, R. Zabih. An Experimental Comparison of Stereo Algorithms. Springer. 2000.

21. L. Nalpantidis, G. Sirakoulis, A. Gasteratos. Review of Stereo Matching Algorithms for 3D Vision. 2007.

22. R.A. Lane, N.A. Thacker. Overview of Stereo Matching Research. 1998.

23. Hirschmuller, H. Semi-Global Matching - Motivation, Development and Applications. 2011.

24. http://en.wikipedia.org/wiki/Block-matching_algorithm.

25. Ho, Nghia Kien. http://nghiaho.com/?page_id=1366#LBP. [Online]

26. https://en.wikipedia.org/wiki/Belief_propagation.

27. S. Mukherjee, G.R.M. Reddy. A Hybrid Algorithm for Disparity Calculation from Sparse Disparity Estimates Based on Stereo Vision. IEEE. 2014.

28. B. Ham, D. Min, C. Oh, M.N. Do, K. Sohn. Probability-Based Rendering for View Synthesis. IEEE. 2014.

29. https://www.google.com/get/cardboard/.

30. https://www.oculus.com.

31. S. Woicke, E. Mooij. A Stereo-Vision Based Hazard-Detection Algorithm for Future Planetary Landers. 2014.
