COMPARISON OF OPEN SOURCE STEREO VISION ALGORITHMS
by
CHOUSTOULAKIS EMMANOUIL
Engineer of Applied Informatics and Multimedia
A THESIS
submitted in partial fulfillment of the requirements for the degree
MASTER OF SCIENCE
DEPARTMENT OF INFORMATICS ENGINEERING
SCHOOL OF APPLIED TECHNOLOGY
TECHNOLOGICAL EDUCATIONAL INSTITUTE OF CRETE
2015
Approved by:
Assistant Professor Kosmopoulos Dimitrios
Abstract
Stereo vision is the extraction of 3D information from a pair of images
depicting the same scene viewed from different angles. It occurs in nature in
creatures that possess two eyes. It is also a very active field in Computer Vision,
where the pair of images is digital and is obtained by cameras instead of eyes.
Several methods and algorithms exist to achieve this. This Master
Thesis presents a theoretical and experimental comparison of a few of them that are
open source and can be found implemented online.
Acknowledgements
I would like to thank Professor Dimitris Kosmopoulos for his assistance in the
completion of this thesis. I would also like to thank my family for their patience and
support during my studies for this MSc degree. Finally, special thanks to the
examining committee and to all fellow stereo vision researchers and engineers whose
work contributed to the completion of this work.
Table of Contents
Abstract
Acknowledgements
Table of Figures
Chapter 1: Introduction and Goals
Chapter 2: Stereo Vision Basics
    2.1: Overview
    2.2: History of Stereo Vision
    2.3: Pinhole Camera Model
    2.4: Camera Resectioning/Parameters
    2.5: Epipolar geometry
    2.6: Fundamental matrix
    2.7: Image rectification
    2.8: Disparity map
    2.9: Stereo Matching
    2.10: Matching Cost
    2.11: Normalized Cross Correlation
    2.12: Ground truth
    2.13: Applications of Stereo Vision
    2.14: Challenges and difficulties
Chapter 3: Stereo Algorithms Evaluation Process
    3.1: Previous Work
    3.2: State-of-the-Art Middlebury Evaluation
    3.3: Thesis Evaluation Process
Chapter 4: Testing and Comparison of Stereo Algorithms
    4.1: Intro
    4.2: Common Algorithm Parameters
    4.3: Semi Global (Block) Matching
        4.3.1: Algorithm Overview
        4.3.2: Pixelwise Cost Calculation
        4.3.3: Aggregation of Costs
        4.3.4: Disparity Computation
        4.3.5: Implementation Details
        4.3.6: Testing and Results
    4.4: Block matching
        4.4.1: Algorithm overview
        4.4.2: Algorithm Analysis
        4.4.3: Implementation Details
        4.4.4: Testing and Results
    4.5: Loopy belief propagation
        4.5.1: Overview
        4.5.2: Markov Random Fields
        4.5.3: MRF Formulation
        4.5.4: DataCost
        4.5.5: SmoothnessCost
        4.5.6: Loopy Belief Propagation main part
        4.5.7: Implementation Details
        4.5.8: Testing and Results
    4.6: Fast stereo matching and disparity estimation
        4.6.1: Overview
        4.6.2: Algorithm Analysis
        4.6.3: Implementation Details
        4.6.4: Testing and Results
    4.7: Probability-Based Rendering for View Synthesis
        4.7.1: Algorithm Overview
        4.7.2: SSMP with RWR
        4.7.3: PBR with SSMP
        4.7.4: Implementation details
        4.7.5: Testing and Results
    4.8: Results Analysis and Conclusion
Chapter 5: Discussion and future work
References
Table of Figures
Figure 1: Simple stereo vision illustration
Figure 2: Wheatstone's Stereoscope
Figure 3: Brewster's Stereoscope
Figure 4: A typical ViewMaster device
Figure 5: Nintendo's Virtual Boy
Figure 6: Pinhole Camera Model
Figure 7: Epipolar Geometry Illustration
Figure 8: A groundtruth disparity map
Figure 9: Simple window matching illustration
Figure 10: Nintendo's 3DS
Figure 11: The NASA STEREO Project
Figure 12: Cones, teddy, tsukuba and venus left view
Figure 13: SGBM matching costs aggregation
Figure 14: SGBM Results
Figure 15: Block Matching Algorithm for SVM
Figure 16: Block Matching Sample Images
Figure 17: Block Matching Results
Figure 18: Markov Random Field illustration
Figure 19: Various cost functions
Figure 20: LBP message passing
Figure 21: Loopy Belief Propagation Results
Figure 22: Fast stereo Matching and Disparity Estimation
Figure 23: Fast stereo Matching and Disparity Estimation Results
Figure 24: SSMP Results
Figure 25: Percentage of bad matching pixels (lower value is better)
Figure 26: Current Virtual Reality Devices
Figure 27: Stereoscopic Endoscope
Chapter 1: Introduction and Goals
Stereoscopic vision (called binocular vision in nature) is the extraction of 3D
information about a scene from a pair of images depicting different views of that
scene. In nature the process is called stereopsis, and it occurs in the brain, which
combines the images received from the two eyes.
Computer stereo vision is the extraction of 3D information from a pair of digital
images, usually obtained by two CCD cameras. This is made possible using various
techniques and algorithms. Over the years many algorithms have been proposed to
combine the two images, and their performance is measured by accuracy
and speed.
The primary goal of this work is to illustrate a simple process of comparing methods
used for stereo matching. The secondary goal is to compare experimentally the
relevant algorithms that can be found implemented on various sources online by using
the aforementioned process and some of the state of the art datasets. The steps
followed in this work can be summarized here:
1. Extensive study of computer stereo vision resources and literature to gain as
thorough understanding as possible of all related terms and methodology.
2. Study any previous work related to the topic of this master thesis and determine the
state of the art.
3. Define the process for comparison, choosing a simple and comprehensible one.
4. Research online for implemented stereo algorithms and methods. Make any
changes to the code for optimal run, without affecting the core algorithm.
5. Test the said methods using all four stereo sets of the Middlebury platform and select the ones giving usable results.
6. Make the comparison; document, study and analyze the results.
7. Write the thesis document, including terminology, process description, algorithm and result analysis.
Any reference to existing work is of course acknowledged and documented here. Also, the results displayed here were produced by executing the relevant code with the optimal parameters for each method; none was simply taken from an online source and used as-is.
Chapter 2: Stereo Vision Basics
2.1: Overview
In traditional stereo vision [1], two cameras, displaced horizontally from one another, are used to obtain two differing views on a scene, in a manner similar to human binocular vision. By comparing these two images, the relative depth information can be obtained, in the form of disparities, which are inversely proportional to the differences in distance to the objects. To compare the images, the two views must be superimposed in a stereoscopic device, the image from the right camera being shown to the observer's right eye and from the left one to the left eye.
Figure 1: Simple stereo vision illustration
In real camera systems, however, several pre-processing steps are required.
1. The image must first be freed of distortions, such as barrel distortion, to
ensure that the observed image is purely projectional.
2. The image must be projected back to a common plane to allow comparison of
the image pairs, known as image rectification.
3. An information measure which compares the two images is minimized. This
gives the best estimate of the position of features in the two images, and
creates a disparity map.
4. Optionally, the disparity as observed by the common projection is converted
back to the height map by inversion. Utilizing the correct proportionality
constant, the height map can be calibrated to provide exact distances.
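Step 4 can be sketched as follows. For a rectified pair the proportionality constant is the product of focal length and baseline, so depth is Z = f * B / d; the focal length and baseline values below are invented for the example:

```python
# Depth from disparity for a rectified stereo pair: Z = f * B / d, where
# f is the focal length in pixels and B the baseline between the cameras.
# The default values are illustrative only.
def depth_from_disparity(disparity_px, focal_px=700.0, baseline_m=0.12):
    """Distance (in metres) of a point with the given disparity (pixels)."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive")
    return focal_px * baseline_m / disparity_px

near = depth_from_disparity(64.0)  # large disparity -> close object
far = depth_from_disparity(8.0)    # small disparity -> distant object
```

As the inverse relation implies, the object with the larger disparity comes out closer to the cameras.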
2.2: History of Stereo Vision
In this section, the most important moments in the history [2] of stereoscopic vision are described.
Stereopsis was first explained by Charles Wheatstone in 1838. "...the mind perceives an object of three dimensions by means of the two dissimilar pictures projected by it on the two retinae..." was his exact definition. He recognized that each eye views the world from a slightly different horizontal position. As a result, each eye views a different image.
Also, objects at different distances appear to have a different horizontal position for each eye (horizontal disparity), leading to the concept of depth. Wheatstone created the illusion of depth from flat pictures that differed only in horizontal disparity. To display his pictures separately to the two eyes, Wheatstone invented the stereoscope.
Figure 2: Wheatstone's Stereoscope
Although Wheatstone was the first to explain and showcase stereoscopic vision, he was not the first to notice it and try to understand it. Leonardo da Vinci had also realized that objects at different distances project images to the eyes that differ in their horizontal positions. Despite his efforts, he concluded that it is impossible for a painter to portray a realistic depiction of depth in a scene on a single canvas. Da Vinci chose for his near object a column with a circular cross section and for his far object a flat wall. His column projects identical images of itself in the two eyes.
Stereoscopy became popular during Victorian times with the invention of the Prism
Stereoscope by David Brewster. Combined with advances in photography, tens of thousands of stereograms were produced.
Figure 3: Brewster's Stereoscope
In 1939 the View Master [3] line was introduced. It is a series of special stereoscopes that are loaded with proprietary discs (called reels) containing a film of stereoscopic scenes. Transition between scenes happens with a switch that rotates the reel. The viewer looks through the two lenses to view the scene. To let light into the film, most models feature two translucent white panels in front of the reel, and the viewer needs to hold the View Master against a light source, although some models with self-illumination were introduced. The View Master is best known as a toy for children and is still available, although it is less popular than it used to be. Mattel Corporation currently owns the rights to its production.
Figure 4: A typical ViewMaster device
In the 1960's Bela Julesz invented the Random Dot Stereogram. Unlike previous stereograms, in which each half image showed recognizable objects, each half image of the first random-dot stereograms showed a square matrix of about 10,000 small dots, with each dot having a 50% probability of being black or white. No recognizable objects could be seen in either half image. The two half images of a random-dot stereogram were essentially identical, except that one had a square area of dots shifted horizontally by one or two dot diameters, giving horizontal disparity. The gap left by the shifting was filled in with new random dots, hiding the shifted square.
Nevertheless, when the two half images were viewed one by each eye, the square area was almost immediately visible as being closer or farther than the background. Julesz called this the Cyclopean image, in the notion that each eye was seeing part of an object that was combined into one in the brain.
In the 1970s Christopher Tyler invented autostereograms: random-dot stereograms that can be viewed without a stereoscope. A famous example is the
Magic Eye pictures, a series of books featuring images which allow people to view
3D images by focusing on 2D patterns.
In 1989 Antonio Medina Puerta demonstrated with photographs that retinal images with no parallax disparity but with different shadows are fused stereoscopically, imparting depth perception to the imaged scene. He named the phenomenon "Shadow Stereopsis". He showed how effective the phenomenon is by taking two photographs of the Moon at different times, and therefore with different shadows, making the Moon appear in 3D stereoscopically, despite the absence of any other stereoscopic cue.
In 1995 Nintendo Corporation introduced the Virtual Boy [4]. It was a tabletop console and a first step into virtual reality devices for consumers. The goal was to
"immerse players into their own private universe", according to Nintendo. The goal was never achieved, though, due to a number of factors: limited technology to keep the cost down, bad design causing eye and neck strain, and a small games catalog led to the commercial failure of the device, which was discontinued only a year later. Nevertheless, it was a first step in the use of stereo vision and virtual reality for entertainment purposes.
Figure 5: Nintendo's Virtual Boy
2.3: Pinhole Camera Model
The pinhole camera model [5] describes the mathematical relationship between the coordinates of a 3D point and its projection onto the image plane of an ideal pinhole camera, where the camera aperture is described as a point and no lenses are used to focus light. The model does not include, for example, geometric distortions or blurring of unfocused objects caused by lenses and finite-sized apertures. It also does not take into account that most practical cameras have only discrete image coordinates. This means that the pinhole camera model can only be used as a first order approximation of the mapping from a 3D scene to a 2D image. Its validity depends on the quality of the camera and, in general, decreases from the center of the image to the edges as lens distortion effects increase.
Figure 6: Pinhole Camera Model
2.4: Camera Resectioning/Parameters
Camera resectioning [6] is the process of estimating the parameters of a pinhole camera model approximating the camera that produced a given photograph or video.
Usually, the pinhole camera parameters are represented in a 3 × 4 matrix called the
camera matrix. There are typically two types of camera parameters, intrinsic and extrinsic.
Intrinsic parameters encompass focal length, image sensor format and principal point.
Those parameters are contained in the intrinsic matrix.
Extrinsic parameters denote the coordinate system transformations from 3D world coordinates to 3D camera coordinates. Equivalently, the extrinsic parameters define the position of the camera center and the camera's heading in world coordinates.
T is the position of the origin of the world coordinate system expressed in coordinates of the camera-centered coordinate system (and NOT the position of the camera, as is often mistaken). C is the position of the camera expressed in world coordinates, and R is a rotation matrix; the two are related by T = −RC.
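The parameters above can be illustrated with a toy projection. All numbers (focal lengths, principal point, camera position, world point) are invented for the example:

```python
import numpy as np

# World point -> camera coordinates via extrinsics (R, T) -> pixel via K.
# Note T = -R @ C: T expresses the world origin in camera coordinates,
# while C is the camera centre in world coordinates.
K = np.array([[700.0,   0.0, 320.0],   # intrinsic matrix: focal lengths
              [  0.0, 700.0, 240.0],   # and principal point (cx, cy)
              [  0.0,   0.0,   1.0]])
R = np.eye(3)                          # camera aligned with world axes
C = np.array([0.1, 0.0, 0.0])          # camera centre in world coordinates
T = -R @ C                             # extrinsic translation

X_world = np.array([0.5, 0.2, 4.0])    # a 3D point in front of the camera
X_cam = R @ X_world + T                # world -> camera coordinates
x_hom = K @ X_cam                      # camera coordinates -> image plane
u, v = x_hom[:2] / x_hom[2]            # homogeneous -> pixel coordinates
```

Stacking K with the extrinsics gives the full 3 × 4 camera matrix P = K [R | T] mentioned above.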
2.5: Epipolar geometry
Epipolar geometry [7] is the geometry of stereo vision. When two cameras view a 3D scene from two distinct positions, there are a number of geometric relations between the 3D points and their projections onto the 2D images that lead to constraints between the image points. These relations are derived based on the assumption that the cameras can be approximated by the pinhole camera model.
The figure below depicts two pinhole cameras looking at point X. In real cameras, the image plane is actually behind the center of projection, and produces an image that is rotated 180 degrees. Here, however, the projection problem is simplified by placing a virtual image plane in front of the center of projection of each camera to produce an unrotated image. OL and OR represent the centers of projection of the two cameras. X represents the point of interest in both cameras. Points xL and xR are the projections of point X onto the image planes.
Each camera captures a 2D image of the 3D world. This conversion from 3D to 2D is referred to as a perspective projection and is described by the pinhole camera model.
It is common to model this projection operation by rays that emanate from the camera, passing through its center of projection. Note that each emanating ray corresponds to a single point in the image.
Figure 7: Epipolar Geometry Illustration
2.6: Fundamental matrix
The fundamental matrix [8] F is a 3×3 matrix which relates corresponding points in stereo images. In epipolar geometry, with homogeneous image coordinates, x and x′, of corresponding points in a stereo image pair, Fx describes a line (an epipolar line) on which the corresponding point x′ on the other image must lie. That means that for all pairs of corresponding points

x′ᵀ F x = 0

holds.
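The epipolar constraint x′ᵀ F x = 0 can be checked numerically. The sketch below uses a skew-symmetric matrix as a stand-in fundamental matrix (the form F takes for a pure translation between two identical cameras); the point coordinates are made up:

```python
import numpy as np

# A stand-in fundamental matrix: skew-symmetric, hence rank 2, as F
# would be for a pure translation between identical cameras.
F = np.array([[ 0.0, -1.0,  2.0],
              [ 1.0,  0.0, -3.0],
              [-2.0,  3.0,  0.0]])

x = np.array([1.0, 2.0, 1.0])          # point in the left image (homogeneous)
l = F @ x                              # its epipolar line in the right image

# Any right-image point lying on the line l satisfies the constraint.
x_prime = np.array([5.0, 2.0, 1.0])    # chosen so that x_prime . l == 0
```

Because x′ lies on the epipolar line l = Fx, the product x′ᵀ F x vanishes, which is exactly the relation stated above.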
2.7: Image rectification
Image rectification [9] is a transformation process used to project two or more images onto a common image plane. This process has several degrees of freedom and there are many strategies for transforming images to the common plane.
It is used in computer stereo vision to simplify the problem of finding matching points between images. Distance to an object is then determined by triangulation based on epipolar geometry. More specifically, binocular disparity relates the depth of an object to its change in position when viewed from a different camera, given that the relative position of each camera is known.
2.8: Disparity map
Disparity [10] refers to the difference in the location of an object in two corresponding
(left and right) images as seen by the left and right eye, which is created due to parallax (the eyes' horizontal separation). The brain uses this disparity to calculate depth information from the two-dimensional images.
In computer vision, the disparity map is an image that depicts how far the objects of the scene are from the viewing source. The depiction is based on intensities.
Brighter objects are closer to the source (they have a larger apparent distance between the left and right image) and darker objects are further away (they have a smaller apparent distance between the left and right view).
Figure 8: A groundtruth disparity map
In short, the disparity of a pixel is equal to the shift value that leads to minimum sum- of-squared-differences for that pixel.
Disparity map calculation is essentially the result of stereo matching. This work revolves around some of its calculation methods.
2.9: Stereo Matching
Stereo matching [11] is used for finding corresponding pixels in a pair of images, which allows 3D reconstruction by triangulation, using the known intrinsic and extrinsic orientation of the cameras. There are two classes of methods [12] for stereo matching, local and global. Local methods attempt to match small regions of one image to another based on intrinsic features of the region. Global methods supplement local methods by considering physical constraints such as surface continuity or base-of-support. Local methods can be further classified by whether they correlate a small area patch among images (called correlation or area based) or match features (called feature based).
Correlation based (or area based) stereo matching considers a certain area, usually on the left image, and tries to find an equally sized area on the right image that is the closest match. That area is called the matching or correlation window. Since the images are rectified, the algorithm searches only horizontally, up to a predefined offset. Such methods produce dense disparity maps. Generally a smaller matching window will give more detail but more noise, and a larger window will produce a smoother disparity map but with less captured detail. Such algorithms are by default fast and memory efficient, so they are usually preferred. However, finding the combination of optimal window size and other algorithm parameters can be challenging and requires a lot of testing.
To find a match for a pixel in the left image, the left window is drawn centered on that pixel. It is then compared to several windows in the right image, beginning with a window at the same location (zero disparity) and moving left in increments of one pixel (increasing disparity by one with each move). Whichever window in the right image gives the lowest cost is said to match the left image. The difference in x coordinates between the center of this match and the center of the left window is the disparity value of the pixel in question on the left image.
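The search procedure just described can be sketched as follows; the window size, disparity range and the synthetic image pair are illustrative only:

```python
import numpy as np

# For a pixel in the left image, compare its window against windows in
# the right image at increasing disparities (moving left one pixel at a
# time) and keep the disparity with the lowest SSD cost.
def disparity_ssd(left, right, row, col, half_win=3, max_disp=15):
    patch_l = left[row - half_win:row + half_win + 1,
                   col - half_win:col + half_win + 1].astype(float)
    best_d, best_cost = 0, float("inf")
    for d in range(max_disp + 1):
        if col - d - half_win < 0:
            break                              # window would leave the image
        patch_r = right[row - half_win:row + half_win + 1,
                        col - d - half_win:col - d + half_win + 1].astype(float)
        cost = np.sum((patch_l - patch_r) ** 2)
        if cost < best_cost:
            best_cost, best_d = cost, d
    return best_d

# Synthetic rectified pair: the right view is the left view shifted by 5
# pixels, so every valid pixel has a true disparity of 5.
rng = np.random.default_rng(0)
left = rng.integers(0, 256, size=(40, 60))
right = np.zeros_like(left)
right[:, :-5] = left[:, 5:]
d = disparity_ssd(left, right, row=20, col=30)
```

Repeating this search for every pixel yields the dense disparity map discussed above.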
Figure 9: Simple window matching illustration
Feature based stereo matching computes the corresponding pixels by using features extracted from the images. Features are usually chosen to be lighting and viewpoint independent. As a result, they compensate for viewpoint changes and camera differences. Techniques used to find the image features include edge, corner, blob and ridge detection. Such features include
• edge elements
• corners
• line and curve segments
• circles and ellipses
• regions defined either as blobs or polygons.
Area based algorithms are simple and efficient in general. Some of them are ideal for real-time stereo matching applications. But, as discussed earlier, it can be challenging to produce robust matching. Feature based methods, on the other hand, can produce fast and robust matching but usually require expensive feature extraction. Area methods are more commonly used by researchers, and this work will focus on them for the most part.
2.10: Matching Cost
At the base of any matching algorithm is the matching cost, which measures the (dis-)similarity of a pair of locations, one in each image. Matching costs can be defined at the pixel level or over a certain area. Common examples are absolute intensity difference, squared intensity difference, filter-bank responses and gradient based measures. Binary matching costs are also possible, based on binary features such as edges.
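A sketch of the two intensity-difference costs named above, both at pixel level and aggregated over a window (the commonly used SAD and SSD costs); the sample windows are invented:

```python
import numpy as np

# Pixelwise matching costs: absolute and squared intensity difference.
def abs_diff(a, b):
    return abs(float(a) - float(b))

def sq_diff(a, b):
    return (float(a) - float(b)) ** 2

# The same costs aggregated over an area (window): SAD and SSD.
def sad(win_a, win_b):
    return float(np.sum(np.abs(win_a.astype(float) - win_b.astype(float))))

def ssd(win_a, win_b):
    return float(np.sum((win_a.astype(float) - win_b.astype(float)) ** 2))

a = np.array([[1, 2], [3, 4]])
b = np.array([[2, 2], [1, 4]])
```

Lower values indicate a better match in all four cases; the squared variant penalizes large deviations more heavily.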
2.11: Normalized Cross Correlation
The higher the normalized cross correlation [13] of two windows, the better they match. Normalized cross correlation is calculated by computing the mean and standard deviation of intensity in each window. Then, the mean intensity is subtracted from each pixel's intensity. Corresponding values of intensity - mean intensity from the left and right windows are multiplied together. These multiplied values are summed over the entire window. Finally, this sum is divided by the number of pixels in either window and divided by each standard deviation.
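The computation described in this paragraph, written out step by step (mean subtraction, elementwise product, summation, normalization by pixel count and both standard deviations); the sample windows are made up:

```python
import numpy as np

def ncc(win_a, win_b):
    """Normalized cross correlation of two equally sized windows."""
    a = win_a.astype(float)
    b = win_b.astype(float)
    a_zero = a - a.mean()          # subtract each window's mean intensity
    b_zero = b - b.mean()
    num = np.sum(a_zero * b_zero)  # multiply corresponding values and sum
    den = a.size * a.std() * b.std()
    return num / den               # divide by pixel count and both std devs

w = np.array([[10.0, 20.0], [30.0, 40.0]])
identical = ncc(w, w)              # perfectly matching windows give 1.0
brighter = ncc(w, 2.0 * w + 5.0)   # NCC is invariant to gain and offset
inverted = ncc(w, -w)              # anti-correlated windows give -1.0
```

The gain-and-offset invariance is what makes NCC robust to brightness differences between the two cameras.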
2.12: Ground truth
Ground truth [14] is a term used in various fields to refer to the absolute truth of something and is used to test the efficiency of an algorithm or a system. Widely used in machine learning, it denotes a set of measurements that are much more accurate than those of the system being tested.
In the case of stereo vision systems, the question is how well they can estimate 3D positions. The ground truth disparity map is composed of the positions given by a laser range finder, which is known to be much more accurate than any camera system.
Practically, it describes the perfect disparity between left and right image and it is compared to the produced disparity map using various metrics.
2.13: Applications of Stereo Vision
Stereo vision is well known to the general public mainly through 3D movies. But that is only a small part of a wide range of applications not only in everyday life, but also in industry, research and even space exploration.
• Robotics: to extract information about the relative position of 3D objects in the
vicinity of autonomous systems, and object recognition, where depth information
allows the system to separate occluding image components. Such robotic
systems are primarily used in industrial applications.
• 3D displays [15] and head mounted displays: to provide stereoscopic imaging
to the human eyes. In such applications, where the goal is depth perception, the
basic requirement is to display offset images that are filtered separately to the
left and right eye.
There are two methods to accomplish that: one where the user wears
glasses to filter the offset images to each eye, and one where no glasses are
required.
In the case where glasses (or filters) are used, there are two types, passive
and active filters. Passive filters do not require power and can be either color
filters or polarization filters. Active shutter filters, as their name suggests,
have active shutters to filter the image and require power.
In the glasses-free case the light source splits the images directionally into the
viewer's eyes. Such displays are called autostereoscopic. Perhaps the most
famous example of such technology is the Nintendo 3DS game console.
Figure 10: Nintendo's 3DS
Finally, there are the head mounted displays, where a separate display is
positioned in front of each eye and the image is projected through lenses that
assist the eyes' focusing. Such devices are used in a plethora of contexts:
military, to provide augmented reality applications; engineering, to provide
stereoscopic views of CAD schematics; and of course entertainment, like 3D
gaming and movies or tours in virtual environments.
• Calculation of contour maps and geometry extraction for 3D building mapping,
mainly from aerial surveys [16]. Those surveys are usually conducted with the
use of unmanned aerial vehicles, or UAVs for short. The large number of
aerial images captured is then converted into geo-referenced 2D high
resolution orthophotos, 3D surface models and point clouds, usually by
automated systems.
• The NASA STEREO project [17], one of the most important and largest-scale
projects ever. STEREO stands for Solar Terrestrial Relations Observatory and in
essence it is a solar observation mission. Two nearly identical spacecraft were
launched in 2006 into orbits around the Sun in a manner that enables
stereoscopic imaging of the Sun.
Figure 11: The NASA STEREO Project
The goal is to study solar phenomena (principally coronal mass ejections:
massive bursts of solar wind, plasma and magnetic fields ejected into space
that can disrupt Earth communications and power networks) on the far side of
the Sun. This practically enables solid forecasts of solar activity through a 360-
degree view of the Sun at all times.
• Driverless cars [18], a hot topic at the time of writing. Driverless cars, as the
name suggests, can drive to their destination without requiring human
intervention. Stereo vision is the way the car "sees" the world in front of it.
Of course, a plethora of sensors is used, including laser, ultrasonic, GPS etc.,
so that a 360-degree "map" of the surrounding world can be formed by the car.
There are quite a few similarities with robot navigation in the process. Of
course, this technology has not yet reached a reliable stage, especially given
the risk of traffic accidents in case of errors or inaccuracy.
2.14: Challenges and difficulties
The correspondence problem refers to the problem of ascertaining which parts of one image correspond to which parts of another image, where differences are due to movement of the camera, the elapse of time, and/or movement of objects in the photos.
Other major pitfalls include reflections and transparency. It is usually very hard for a machine to distinguish whether it is looking at an object or the reflection of that object. Similarly, it is hard for a computer vision system to recognize the existence of transparent objects between the view source and the target scene.
The third pitfall is continuous and textureless regions, where it is very difficult to determine which point on the left image corresponds to which point on the right image. Finally, there can be technical difficulties like sensor noise and camera calibration errors.
Chapter 3: Stereo Algorithms Evaluation Process
This chapter contains a short analysis of previous work related to the topic of this thesis, an analysis of the state-of-the-art evaluation method, and a description of the process followed in the thesis.
3.1: Previous Work
• A Taxonomy and Evaluation of Dense Two-Frame Stereo Correspondence
Algorithms [19] by D. Scharstein and R. Szeliski. It is the state-of-the-art
evaluation and is analyzed in the next section.
• An Experimental Comparison of Stereo Algorithms by R. Szeliski and R.
Zabih [20]. In this work by Szeliski and Zabih there is an effort to compare
experimentally a few stereo vision algorithms. They make use of two stereo
pairs, the well known set from Tsukuba university and another produced by
them (a simple scene with a slanted surface). Their methodology consists of
comparison with ground truth depth maps and the measurement of novel
prediction errors.
• Review of Stereo Matching Algorithms for 3D Vision by L. Nalpantidis, G.
Sirakoulis and A. Gasteratos [21]. In this work there is a theoretical
comparison and summary of various methods. It considers both local and
global methods, computational intelligence techniques and the speed and
accuracy of those. Also, some hardware implementation techniques are
presented.
• Overview of Stereo Matching Research, by R.A. Lane and N.A. Thacker [22].
This is a literature survey of a few area and feature based methods. It includes
a short description of those methods and some conclusions drawn from them. It is a
relatively old paper and part of a long series of stereo vision publications.
3.2: State-of-the-Art Middlebury Evaluation
The state-of-the-art evaluation method for stereo vision algorithms is offered by
Middlebury College. Its creators are Daniel Scharstein and Richard Szeliski, and the evaluation process is documented in their publication titled "A Taxonomy and
Evaluation of Dense Two-Frame Stereo Correspondence Algorithms" [19].
The goal of the creators of this evaluation process was to compare a large number of methods within one common framework. For that reason they have focused on techniques that produce a univalued disparity map.
The evaluation process is very detailed and quite complicated. In essence, there are two error measurements, RMS error and percentage of bad matching pixels, computed in three different image areas.
RMS (root-mean-squared) error (measured in disparity units) between the computed disparity map d_C(x, y) and the ground truth map d_T(x, y) is computed by the following formula:

R = ( (1/N) · Σ_(x,y) |d_C(x, y) − d_T(x, y)|² )^(1/2)

where N is the total number of pixels.
Percentage of bad matching pixels is computed by this formula:

B = (1/N) · Σ_(x,y) ( |d_C(x, y) − d_T(x, y)| > δ_d )

where δ_d is a disparity error tolerance, typically 1.0.
Also the images are segmented into three different areas.
• textureless regions T: regions where the squared horizontal intensity gradient
averaged over a square window of a given size is below a given threshold.
Essentially, these are areas of the scene with little to no texture.
• occluded regions O: regions where the left-to-right disparity lands at a location
with a larger (nearer) disparity. This means that an occluded region is visible
in one of the images and not visible in the other.
• depth discontinuity regions D: regions where neighboring disparities differ by
more than a predefined gap, dilated by a window of a given width. These are
practically the areas of the scene where there is a sudden change in depth
between objects.
These regions were selected to support the analysis of matching results in typical problem areas.
Middlebury College offers an online evaluation tool for computer vision researchers to upload and test their algorithms and compare them against many others.
There are also a few datasets offered in various resolutions for testing. The online evaluation tool utilizes four specific image pairs and compares the user-submitted disparity maps with the ground truth maps for the four pairs. Note that the online evaluation tool is at version two at the time of writing this thesis.
3.3: Thesis Evaluation Process
The first step was to collect all the open source algorithms that can be found implemented on various sources online. They were all tested, and the ones producing unusable results were discarded. The ones that gave meaningful results are analyzed here.
The comparison of the algorithms is partly based on the state-of-the-art evaluation, namely the percentage of bad matching pixels (computed from the absolute differences of the disparity map and the ground truth image matrices; the BadMatchPercent formula from the previous paragraph). The focus is on how well each algorithm estimates the disparity map. Mean elapsed time is measured in all cases, but it is not directly comparable, since two different tools are used and OpenCV is vastly faster than Matlab because it is written in C++. Elapsed time also depends on more factors, such as the parameters of each algorithm, the image size and of course the hardware it runs on. The algorithms presented here were tested on an Intel Celeron G1620 CPU with 4 GB of RAM and an AMD HD6450 GPU.
A pixel-by-pixel subtraction is conducted between the result and the ground truth image matrices. A 30-pixel margin on all sides is excluded to eliminate empty image borders, since some algorithms produce disparity maps with black borders that would lower the score for no reason. Any difference larger than a predefined threshold (traditionally around 1.0) is counted as a bad matching pixel. The threshold is the same for all images so the comparison is fair. Each bad matching pixel found is added to a running sum; in the end the sum is divided by the total number of pixels. The final result is the percentage of bad matching pixels in the disparity map.
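The procedure above can be sketched in a few lines (a minimal Python sketch, not the author's Matlab code; disparity maps are assumed to be 2D lists and the function name is illustrative):

```python
def bad_match_percent(computed, truth, threshold=1.0, margin=30):
    """Fraction of bad matching pixels between a computed disparity map
    and the ground truth, ignoring a border margin on all sides."""
    rows, cols = len(truth), len(truth[0])
    bad = total = 0
    for y in range(margin, rows - margin):
        for x in range(margin, cols - margin):
            total += 1
            # a pixel is "bad" when its disparity error exceeds the threshold
            if abs(computed[y][x] - truth[y][x]) > threshold:
                bad += 1
    return bad / total
```

For the full-size Middlebury images the 30-pixel margin is used as described; smaller margins only make sense for toy inputs.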
The images used are the widely popular stereo datasets from Middlebury College. They contain several right and left views of the same scene, as well as a ground truth image for evaluation. As mentioned before, the Middlebury online evaluation platform uses 4 standard image sets, a total of 8 images (cones, teddy, tsukuba, venus). Those images along with their ground truth will be used here. Note that these four image sets are used in the second version of the online evaluation tool of Middlebury, which is still online at the time of writing. Version three should go online soon after this work is completed and will use different image sets for the evaluation of algorithms.
Figure 12: Cones, teddy, tsukuba and venus left view
The method described above is summarized in a mathematical formula, and the code that implements it was written by the author of this thesis. Matlab was chosen as the comparison platform due to the simplicity of its matrix operations.
Chapter 4: Testing and Comparison of Stereo Algorithms
4.1: Intro
Stereo vision is still a popular topic when it comes to research. It is very active and there is a large number of algorithms being evaluated in the Middlebury platform.
Unfortunately, finding implementations of the various algorithms can be difficult, because communication with the creators is rarely successful and many of them decline to help. Furthermore, the available implementations most of the time do not function as expected and produce unusable results.
The algorithms compared in this work can be found implemented on the internet and their implementations are correct; that is, they give satisfactory results not only on visual examination but also in comparison with the ground truth disparity maps. All the methods described here give a bad matching pixel percentage of less than 50%.
4.2: Common Algorithm Parameters
Each stereo algorithm is unique and features a certain number of inputs and parameters. But there are some common parameters among the algorithms analyzed in this work that also apply to the majority of existing stereo algorithms.
As expected, the input to all the algorithms is the stereo pair, traditionally the left and right views of the scene in that order. Some algorithms accept the image matrices (RGB or grayscale) as input, while others accept plain image files and read the matrices afterwards. The rest of the algorithm inputs are actually the parameters.
The first parameter is the window size, used when window/block matching performs the search for similarities. A smaller window size usually means a more detailed but noisier disparity map. A larger window size gives a smoother disparity map overall, but with less detail captured. Of course this parameter should be an odd number, since there is always a "center" pixel in the matching window.
The second parameter is the disparity range: the minimum and maximum disparity values between which the matching algorithm will search for similarities between the blocks of the image pair. Disparity values outside that range are ignored. The minimum disparity value can be negative. In the case of the Middlebury test images the minimum value is always 0 and the maximum varies depending on the image.
4.3: Semi Global (Block) Matching
4.3.1: Algorithm Overview
Semi-Global (Block) Matching [23] successfully combines concepts of global and local stereo methods for accurate, pixel-wise matching at low runtime. This is probably the most popular algorithm for stereo matching; it has spawned many other algorithms and has been widely used by stereo vision researchers. As is evident from the results in this work, it gives results with a relatively high number of bad matching pixels and is surpassed by other algorithms. Despite that, it is fast and effective for real-time stereo applications, since the number of bad matches in its output is not prohibitive.
The core algorithm considers pairs of images with known intrinsic and extrinsic orientation. The method has been implemented for rectified and unrectified images; in the latter case, epipolar lines are computed and followed explicitly while matching. In this work only rectified images are used (with known epipolar geometry).
The whole method is based on the idea of pixelwise matching of Mutual Information and approximating a global, two-dimensional smoothness constraint by combining many one-dimensional constraints. In a nutshell, the main algorithm has the following processing steps: 1) pixelwise cost calculation, 2) implementation of the smoothness constraint (cost aggregation), 3) disparity computation with sub-pixel accuracy and occlusion detection.
4.3.2: Pixelwise Cost Calculation
In step 1 (pixelwise cost calculation ) the matching cost is calculated for a base image
pixel (the left one usually) from its intensity and the suspected correspondence of the
match image. An important aspect is the size and shape of the area that is considered
for matching. The robustness of matching is increased with large areas.
One way to perform the pixelwise cost calculation is to use the Birchfield-Tomasi subpixel metric. The cost is calculated as the minimum absolute difference of intensities within a range of half a pixel in each direction along the epipolar line.
Another way to calculate the pixelwise cost is based on mutual information
(abbreviated as MI) which is insensitive to recording and illumination changes. It is
defined as the sum of the entropies of the two images minus their joint entropy
according to the following formula:

MI(I1, I2) = H(I1) + H(I2) − H(I1, I2)
H. Hirschmüller in his work favors the Mutual Information approach, contrary to the OpenCV implementation, which uses Birchfield-Tomasi.
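The entropy-based definition above can be illustrated with a short sketch (histogram-based entropy over flattened intensity lists; a simplification of the hierarchical MI computation Hirschmüller actually uses, with illustrative names):

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy (in bits) of a list of hashable values."""
    counts = Counter(values)
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def mutual_information(img1, img2):
    """MI(I1, I2) = H(I1) + H(I2) - H(I1, I2), with the joint entropy
    estimated from co-occurring intensity pairs at the same pixel."""
    joint = list(zip(img1, img2))
    return entropy(img1) + entropy(img2) - entropy(joint)
```

With identical images the MI equals the image entropy; with unrelated intensities it drops toward zero, which is what makes it robust to illumination changes between the two views.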
4.3.3: Aggregation of Costs
Pixelwise cost calculation is generally ambiguous since wrong matches can easily have a lower cost than correct matches due to factors like noise etc. Therefore, an additional constraint is added that supports smoothness by penalizing changes to neighboring disparities.
A global, 2D smoothness constraint is approximated by combining several 1D constraints.
Figure 13: SGBM matching costs aggregation
The matching costs in 1D are aggregated from all eight directions equally as illustrated on the figure above. The aggregated (or smoothed) cost for a pixel p and disparity d is calculated by summing the costs of all 1D minimum cost paths that end in pixel p at disparity d.
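The aggregation along a single path direction can be sketched with the standard SGM recurrence (a simplified, single-direction Python sketch; `costs` is a list of per-pixel cost lists over all disparities, and P1/P2 penalize disparity changes of one level and larger jumps respectively):

```python
def aggregate_path(costs, p1, p2):
    """Aggregate pixelwise matching costs along one 1D path (e.g. a
    scanline, left to right) with the SGM recurrence:
    L(p,d) = C(p,d) + min(L(p-1,d),
                          L(p-1,d-1)+P1, L(p-1,d+1)+P1,
                          min_k L(p-1,k)+P2) - min_k L(p-1,k)"""
    ndisp = len(costs[0])
    agg = [list(costs[0])]                      # first pixel: no predecessor
    for c in costs[1:]:
        prev = agg[-1]
        prev_min = min(prev)
        row = []
        for d in range(ndisp):
            best = prev[d]                      # same disparity: no penalty
            if d > 0:
                best = min(best, prev[d - 1] + p1)
            if d < ndisp - 1:
                best = min(best, prev[d + 1] + p1)
            best = min(best, prev_min + p2)     # larger jump: penalty P2
            row.append(c[d] + best - prev_min)  # subtraction keeps values bounded
        agg.append(row)
    return agg
```

Summing such aggregated costs over all eight directions gives the smoothed cost volume; subtracting the previous pixel's minimum bounds the values without changing which disparity wins.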
4.3.4: Disparity Computation
The disparity image D that corresponds to the reference image I is determined as in local stereo methods by selecting for each pixel p the disparity d that corresponds to the minimum cost.
For sub-pixel estimation, a quadratic curve is fitted through the neighboring costs and the position of the minimum is calculated.
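The parabola fit has a closed form; a small sketch (illustrative, operating on the cost array of a single pixel around its integer winning disparity):

```python
def subpixel_disparity(costs, d):
    """Refine an integer winning disparity d by fitting a parabola
    through the costs at d-1, d, d+1 and returning its minimum."""
    if d <= 0 or d >= len(costs) - 1:
        return float(d)                     # no neighbors to fit against
    c_prev, c, c_next = costs[d - 1], costs[d], costs[d + 1]
    denom = c_prev - 2.0 * c + c_next       # parabola curvature
    if denom == 0:
        return float(d)                     # flat neighborhood, keep d
    return d + (c_prev - c_next) / (2.0 * denom)
```

For a strict local minimum the correction stays within half a disparity step of the integer winner.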
4.3.5: Implementation Details
SGBM is implemented in OpenCV and embedded in the library. It is also included in Matlab since version 2011b. The Matlab implementation did not produce usable results despite extensive experimentation, so only the OpenCV version will be used.
OpenCV uses a modified version of the original Hirschmüller algorithm. Contrary to the original algorithm, which considers 8 directions, this one considers only 5 (single pass). Also, this variation matches blocks rather than individual pixels, hence the Semi-Global Block Matching name. The parameters of this modified version can be tuned so that the algorithm behaves like the original one.
Also, the mutual information cost function is not implemented; a simpler Birchfield-Tomasi sub-pixel metric is used instead. Finally, some pre- and post-processing steps from the Konolige Block Matching implementation are included, for example pre- and post-filtering. This is evident from the few identical parameters between the two algorithms.
The OpenCV SGBM implementation features the common parameters and a few more that are listed here (the OpenCV documentation is insufficient so the explanation is based on experimentation):
• preFilterCap: Clips the output to [-preFilterCap, preFilterCap].
• uniquenessRatio: The computed disparity d* is accepted only if
SAD(d) >= SAD(d*)*(1 + uniquenessRatio/100) for any d != d* +/- 1.
• speckleRange, speckleWindowSize: Parameters of the OpenCV function
filterSpeckles which is used to post process the disparity map. It replaces
blobs of similar disparities (the difference of two adjacent values does not
exceed speckleRange) whose size is less or equal to speckleWindowSize (the
number of pixels forming the blob) by the invalid disparity value.
• disp12MaxDiff: A left-right check is performed. Pixels are matched from the left
to the right image and then from the right back to the left. The disparity value is
accepted only if the two matches differ by at most disp12MaxDiff.
• fullDP: If set to true, the algorithm considers eight directions instead of five
(like the original) but with higher memory consumption.
• P1: Penalty for small disparity changes (of one level) between neighboring pixels.
• P2: Penalty for larger disparity changes.
It should also be noted that the disparity range consists of two parameters, minDisparity and numberofDisparities. The first value is the minimum disparity for the search window. The second shows the maximum difference from the minimum disparity. It works the same way as with the next algorithm, Block Matching.
4.3.6: Testing and Results
The produced disparity maps for Cones, Teddy, Tsukuba and Venus image pairs are the following:
Figure 14: SGBM Results
The percentage of bad matching pixels according to the Sum of Absolute Differences metric is the following:
Cones Teddy Tsukuba Venus Avg time
SGBM 0.4943 0.4977 0.3923 0.4982 ~0.065 sec
4.4: Block matching
4.4.1: Algorithm overview
This method is based on the block matching algorithm. It is mainly used on video frames for motion estimation, but its principles can be applied successfully to stereo matching as well.
The block matching algorithm [24] involves dividing the current frame of a video into 'macro blocks' and comparing each macro-block with the corresponding block and its adjacent neighbors in the previous frame of the video. A vector is created that captures the movement of a macro-block from one location to another in the previous frame. This movement, calculated for all the macro blocks comprising a frame, constitutes the motion estimated in the current frame.
The search area for a good macro-block match is defined by the 'search parameter' p, where p is the number of pixels on all four sides of the corresponding macro-block in the previous frame. The search parameter is a measure of motion: the larger the value of p, the larger the motion that can be captured, but the search also becomes computationally expensive. Usually the macro-block is taken to be 16 pixels in size and the search parameter is set to 7 pixels.
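For stereo matching, the 2D motion search collapses to horizontal shifts along the scanline; a minimal SAD-based block search might look like this (illustrative names and tiny images, not the SVM implementation):

```python
def sad(block_a, block_b):
    """Sum of absolute differences between two equal-sized blocks."""
    return sum(abs(a - b) for row_a, row_b in zip(block_a, block_b)
               for a, b in zip(row_a, row_b))

def best_match(left, right, y, x, block=3, max_disp=4):
    """Disparity of the block anchored at (y, x) in the left image,
    found by exhaustive SAD search over shifts [0, max_disp] along the
    same scanline of the right image."""
    ref = [row[x:x + block] for row in left[y:y + block]]
    best_d, best_cost = 0, float("inf")
    for d in range(min(max_disp, x) + 1):    # shift left by d pixels
        cand = [row[x - d:x - d + block] for row in right[y:y + block]]
        cost = sad(ref, cand)
        if cost < best_cost:
            best_d, best_cost = d, cost
    return best_d
```

This is the brute-force core; the LOG prefilter, interest operator and left/right check described next wrap around it.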
4.4.2: Algorithm Analysis
The tested implementation was submitted to OpenCV library by Kurt Konolige and is partly based on his work Small Vision Systems: Hardware and Implementation [12].
The paper revolves around the Small Vision Module or SVM, a compact, inexpensive real-time device for computing dense stereo range images.
In the case of stereo matching, the adjacent neighbor is the second image of the stereo pair.
Figure 15: Block Matching Algorithm for SVM
The algorithm that is implemented here has the following features:
• Laplacian of Gaussian transform (LOG for short) with L1 norm (absolute difference) correlation.
• Variable disparity search in pixel units.
• Postfiltering with an interest operator and a left/right check.
• x4 range interpolation.
The LOG transform and L1 norm were chosen because they give good quality results and can be optimized on standard instruction sets available on DSPs and microprocessors.
The following images are copied directly from the paper and help in the explanation of the algorithm. The disparity maps are green on the paper but they are converted to grayscale here for uniformity reasons (this work examines grayscale disparity maps).
Figure 16: Block Matching Sample Images
Image (a) shows the grayscale input image. Figure (b) depicts the typical disparity map produced by the algorithm. Brighter areas indicate higher disparities (closer objects) while darker areas indicate lower disparities (further objects). There are 64 possible levels of disparity in total. In figure (b) the highest level is around 40 while the lowest is about 5. It is obvious that there is significant error in the upper left and right corners of the image. That is due to uniform areas without enough texture to determine the disparity.
In figure (c) the interest operator is applied as a post filter. Areas with insufficient texture are rejected and appear black in the produced image. Even after using this filter, some errors still remain in portions of the image with disparity discontinuities, in this case the side of the person's head. Those errors are caused by overlapping the correlation window on areas with very different disparities.
One way to eliminate those errors is by applying left/right check. The left/right check can be implemented efficiently by storing enough information when doing the original disparity correlation. As the author concludes, the combination of interest operator and left/right check has proven to be the most effective at eliminating bad matches. As mentioned by the author, correlation surface checks were not used, since they do not add to the quality of the range image and can be computationally expensive.
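The left/right check itself is easy to sketch for one scanline (a hedged sketch; the INVALID marker and the per-row disparity lists are illustrative, not the SVM storage scheme):

```python
INVALID = -1  # marker for disparities rejected by the check

def left_right_check(disp_left, disp_right, max_diff=1):
    """Invalidate left-image disparities that disagree with the
    disparity found at the matched pixel of the right image."""
    checked = []
    for x, d in enumerate(disp_left):
        xr = x - d                          # matched column in the right image
        if 0 <= xr < len(disp_right) and abs(d - disp_right[xr]) <= max_diff:
            checked.append(d)
        else:
            checked.append(INVALID)         # inconsistent match (e.g. occlusion)
    return checked
```

The same test is what the disp12MaxDiff parameter of the OpenCV implementations controls.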
As mentioned earlier, the algorithm described in the paper was intended to be used with the Small Vision Module, which is a small programmable device with limited resources, so it was designed with storage efficiency in mind.
4.4.3: Implementation Details
This algorithm is implemented in OpenCV and embedded in the library. It is also part of Matlab from version 2011b onwards. Strangely, the Matlab implementation gave no usable results even after extensive experimentation with the parameters (similarly to SGBM), so only the OpenCV version is presented here.
The input and parameters include, of course, the common ones and some more. The OpenCV documentation does not sufficiently explain the parameters, so the analysis here is based mainly on experimentation. Most of them are optional and only the ones used for the testing are analyzed. The disparity range is actually two parameters, the minimum disparity and the number of disparities (minDisparity and numberofDisparities respectively; the final disparity range is [minDisparity, minDisparity + numberofDisparities]). The rest of the parameters are the following:
• preFilterSize: Window size of the prefilter.
• preFilterCap: Clips the output to [-preFilterCap, preFilterCap].
• textureThreshold: Calculates the disparity only at locations where the texture
is larger than or equal to this threshold.
• UniquenessRatio: The computed disparity d* is accepted only if
SAD(d) >= SAD(d*)*(1 + uniquenessRatio/100) for any d != d* +/- 1.
• speckleRange, speckleWindowSize: Parameters of the OpenCV function
filterSpeckles which is used to post process the disparity map. It replaces
blobs of similar disparities (the difference of two adjacent values does not
exceed speckleRange) whose size is less or equal to speckleWindowSize (the
number of pixels forming the blob) by the invalid disparity value.
• disp12MaxDiff: A left-right check is performed. Pixels are matched from the left
to the right image and then from the right back to the left. The disparity value is
accepted only if the two matches differ by at most disp12MaxDiff.
4.4.4: Testing and Results
Produced disparity maps (cones, teddy, tsukuba, venus respectively):
Figure 17: Block Matching Results
The percentage of bad matching pixels according to the Sum of Absolute Differences metric is the following:
Cones Teddy Tsukuba Venus Avg time
BM 0.4450 0.4583 0.3556 0.4707 ~0.028 sec
4.5: Loopy belief propagation
4.5.1: Overview
This method [25] by Ngia Kien Ho focuses on solving the stereo problem using Markov Random Fields and Loopy Belief Propagation. The method is mathematically heavy and quite complicated. The creator offers an extensive analysis of the algorithm as well as an OpenCV implementation on his website.
4.5.2: Markov Random Fields
Markov Random Fields (abbreviated as MRF) are undirected graphical models that can encode spatial dependencies. They consist of nodes and links like all graphical models, but they may also feature cycles/loops. Given a 3x3 image, the stereo problem can be modeled using an MRF as follows:
Figure 18: Markov Random Field illustration
The blue nodes are observed variables and represent pixel intensity values in this work. The pink nodes are the hidden variables and represent the unknown disparity values. The hidden variable values are referred to as labels. The links between the nodes represent a dependency; for example, the center node depends only on the four nodes it is connected to. This rather strong assumption, that each node depends only on the nodes it is connected to, is called the Markov assumption.
4.5.3: MRF Formulation
The stereo problem can be formulated in terms of an MRF as the following energy function:

E(X, Y) = Σi DataCost(xi, yi) + Σi Σj SmoothnessCost(xi, xj)

where Y are the observed nodes, X the hidden nodes, i the pixel index and j ranges over the neighboring nodes of node xi (see the diagram above).
This energy function sums up all the costs at each link given an image Y and a labeling X. The aim is to find a labeling for X that produces the lowest energy; this is essentially the disparity map. The energy function contains two other functions, DataCost and SmoothnessCost.
4.5.4: DataCost
The DataCost function returns the cost/penalty of assigning a label value xi to data yi. Good matches should receive a low cost and bad matches a high cost. Usually, the sum of absolute differences or the sum of squared differences serves as the cost metric. Practically, the function calculates the SAD (or whichever metric is chosen) between blocks (or even single pixels) in the two images of the stereo pair, taking into account the different tested disparity values. (The creator's article includes pseudo code illustrating this step.)
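This step can be sketched as follows (a hedged sketch: single-pixel absolute differences rather than block SAD, and a fixed large cost for out-of-image positions, both simplifications; names are illustrative):

```python
def data_costs(left, right, num_labels):
    """Per-pixel DataCost: for each pixel of the left image and each
    candidate disparity label d, the absolute intensity difference with
    the right-image pixel shifted by d (single-pixel SAD)."""
    costs = []
    for y, row in enumerate(left):
        cost_row = []
        for x, value in enumerate(row):
            per_label = []
            for d in range(num_labels):
                if x - d >= 0:
                    per_label.append(abs(value - right[y][x - d]))
                else:
                    per_label.append(255)   # out of image: maximal cost
            cost_row.append(per_label)
        costs.append(cost_row)
    return costs
```

The search direction on the right image (here a leftward shift) depends on the ordering of the stereo pair, as noted below.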
Naturally, the direction of the matching window search on the right image depends on the order of the stereo pair.
4.5.5: SmoothnessCost
The SmoothnessCost function enforces smooth labeling across adjacent hidden nodes.
To achieve that, a function that penalizes adjacent labels that are different is needed.
The following table shows some commonly used cost functions.
Figure 19: Various cost functions
The Potts model is a binary penalizing function with a single tunable lambda (λ) parameter. This value controls how much smoothing is applied. The linear and quadratic models have an extra parameter K, a truncation value that caps the maximum penalty.
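The three cost functions from the table can be sketched directly (illustrative defaults for λ and K):

```python
def potts(a, b, lam=1.0):
    """Potts model: constant penalty lam whenever the labels differ."""
    return 0.0 if a == b else lam

def truncated_linear(a, b, lam=1.0, k=2.0):
    """Penalty linear in the label difference, capped at k."""
    return lam * min(abs(a - b), k)

def truncated_quadratic(a, b, lam=1.0, k=4.0):
    """Penalty quadratic in the label difference, capped at k."""
    return lam * min((a - b) ** 2, k)
```

The truncation is what keeps large, genuine depth discontinuities from being smoothed away: beyond K, a bigger label jump costs no more.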
As the creator of the method comments, the choice of DataCost and SmoothnessCost functions is vague and should be based on experimentation.
4.5.6: Loopy Belief Propagation main part
When the DataCost and SmoothnessCost functions have been chosen and the parameters tuned, the next step is to minimize the energy function. Trying all possible label combinations (brute force) is computationally intractable, so an exact solution should not be expected. Instead, finding an approximate solution is the viable approach.
The Loopy Belief Propagation (LBP) algorithm was chosen among others (Graph Cuts, ICM etc.) to find an approximate solution for the MRF. The original Belief Propagation algorithm [26] was proposed by Pearl in 1982 for finding exact marginals on trees. Trees are essentially graphs that contain no loops, but as it turned out the same algorithm can successfully be applied to general graphs that do contain loops. The word "loopy" in the name originates from there.
LBP is a message passing algorithm. A node passes a message to an adjacent node only when it has received all incoming messages, excluding the message from the destination node to itself. The following figure illustrates the process:
Figure 20: LBP message passing
Node x1 wants to send a message to x2, so it waits for messages from all other nodes (A, B, C, D) before sending it. As explained earlier, it will not send the message it received from x2 back to x2. Node x1 maintains all possible beliefs about node x2. The choice of using costs/penalties or probabilities depends on the chosen MRF energy formulation.
The first step is always the initialization of the messages. As mentioned earlier, each node has to wait for all incoming messages before sending its own message to the target node. This means that at the start of the algorithm each node would wait forever and receive nothing, so no message could ever be sent. To overcome this problem, all messages are initialized to some constant so that the algorithm can proceed. The initialization value is typically 0 or 1.
The main part of LBP is iterative. By adjusting the respective parameters, the algorithm can run for a chosen number of iterations or until the change in energy drops below a threshold. In each iteration, messages are passed around the MRF. The passing scheme is arbitrary and any sequence is valid (the algorithm's creator chooses right, left, up and down); as mentioned, different sequences will produce different results.
Once the LBP iterations complete, the best label at every pixel can be found by calculating its belief, where msg(k→i, l) is the message sent to node i from neighbor k for label l:

belief(i, l) = DataCost(i, l) + Σ k∈N(i) msg(k→i, l)

In the cost formulation used here, the label with the lowest belief is selected.
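To make the message passing concrete, here is a min-sum sketch on a 1D chain of nodes instead of the full 2D grid (a deliberate simplification: on a chain the sweeps converge quickly, while on a loopy 2D grid the same updates are only approximate; all names are illustrative):

```python
def lbp_chain(data_cost, smooth, iterations=5):
    """Min-sum belief propagation on a 1D chain of nodes.
    data_cost[i][l] is the cost of label l at node i; smooth(a, b)
    penalizes differing neighbor labels. Messages start at zero, then
    right and left sweeps are repeated; the belief at a node is its
    DataCost plus all incoming messages, and the lowest belief wins."""
    n, labels = len(data_cost), len(data_cost[0])
    right = [[0.0] * labels for _ in range(n)]  # message from i-1 into i
    left = [[0.0] * labels for _ in range(n)]   # message from i+1 into i
    for _ in range(iterations):
        for i in range(1, n):                   # right sweep
            for l in range(labels):
                right[i][l] = min(data_cost[i - 1][k] + right[i - 1][k]
                                  + smooth(k, l) for k in range(labels))
        for i in range(n - 2, -1, -1):          # left sweep
            for l in range(labels):
                left[i][l] = min(data_cost[i + 1][k] + left[i + 1][k]
                                 + smooth(k, l) for k in range(labels))
    beliefs = [[data_cost[i][l] + right[i][l] + left[i][l]
                for l in range(labels)] for i in range(n)]
    return [min(range(labels), key=lambda l: b[l]) for b in beliefs]
```

With a strong Potts penalty, a node whose own DataCost mildly prefers a different label is pulled into agreement with its neighbors, which is exactly the smoothing effect described above.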
4.5.7: Implementation Details
As mentioned earlier, there is an OpenCV implementation available on the author's website. It features the common parameters and a few more. The disparity range is called labels, and the window size is controlled by the variable wradius, which accepts even numbers; one is then added to the selected number so that the window size is odd, as it should be. The rest of the parameters are the following:
• BP_ITERATIONS: An integer that defines how many iterations/loops the
algorithm will run for.
• LAMBDA: This value controls how much smoothing is applied in the
SmoothnessCost function.
• SMOOTHNESS_TRUNC: Truncation value for the truncated linear model
that is used in the implementation.
4.5.8: Testing and Results
The produced disparity maps for the Cones, Teddy, Tsukuba and Venus image pairs are the following (note that the algorithm ran for 5 iterations):
Figure 21: Loopy Belief Propagation Results
The percentage of bad matching pixels according to the Sum of Absolute Differences metric is the following (note that the time is measured for 5 iterations):
Cones Teddy Tsukuba Venus Avg time
LoopyBP 0.0953 0.4206 0.0410 0.4203 ~155.88 sec
4.6: Fast stereo matching and disparity estimation
4.6.1: Overview
This method is based on the paper "A Hybrid Algorithm for Disparity Calculation from Sparse Disparity Estimates Based on Stereo Vision" [27].
This excellent work proposes a hybrid method for disparity estimation by combining the existing methods of block based and region based stereo matching. It utilizes image segmentation through K-Means clustering, morphological filtering and connected component analysis, SAD cost function and disparity map reconstruction.
The process is very clearly documented by the authors and will be analyzed here step by step. The following diagram depicts an overview of the whole algorithm.
Figure 22: Fast stereo Matching and Disparity Estimation
4.6.2: Algorithm Analysis
The first step is color conversion from RGB to Lab color. The majority of imaging equipment captures images in RGB format. This format, though, does not properly approximate human vision; the Lab color space was developed to approximate it better. The lightness component, abbreviated L, closely matches human perception of lightness and is widely used by image processing algorithms. This algorithm retains only the L values of the pixels for further processing.
Step two is image segmentation. It is performed on the L values of the left image pixels using a fast implementation of the K-Means algorithm. The image pixels are represented using a one-dimensional feature, namely a vector containing the L value of each pixel. Next, a histogram of the L values is built and used instead of the actual pixel values for the subsequent iterations of the K-Means clustering. A histogram has a smaller, fixed number of bins than the number of actual pixels, so the runtime is significantly reduced.
Step three is segment boundary detection and refinement. Segment boundary detection is achieved by comparing the cluster assignment of each pixel with that of its 8 neighboring pixels. If any of them is found to be different, the pixel is marked as one (belongs to a segment boundary), or else it is marked as zero. Thus, the boundary map is generated from the segmented left image. Since the clustering in step two is based only on the pixels' lightness values there are limitations in the accuracy of the said clustering. Consequently, many pixels can be falsely identified as belonging to segment boundaries. To overcome that, the authors of this work apply two morphological filters to refine the boundary map by removing such noisy pixels.
There are two types of morphological filters used, Fill and Remove. Fill sets isolated interior zero pixels that are surrounded by ones to one. Remove sets a pixel to zero when all four of its connected neighbors are one, leaving only the boundary pixels on. Furthermore, the authors use connected component analysis to remove small artefacts in the boundary map caused by segmentation errors; if disparity were calculated for those artefacts, it would most probably be false. Finally, the smallest connected components, which contribute about 4% of the total number of pixels, are removed.
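The two filters can be sketched as 4-neighborhood operations on a binary boundary map (a hedged sketch following the description above; border pixels are left untouched, and the names are illustrative):

```python
def fill(img):
    """Fill: set interior 0 pixels whose four neighbors are all 1 to 1."""
    h, w = len(img), len(img[0])
    out = [row[:] for row in img]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            if img[y][x] == 0 and (img[y - 1][x] and img[y + 1][x]
                                   and img[y][x - 1] and img[y][x + 1]):
                out[y][x] = 1
    return out

def remove(img):
    """Remove: set interior 1 pixels whose four neighbors are all 1
    to 0, leaving only the boundary pixels on."""
    h, w = len(img), len(img[0])
    out = [row[:] for row in img]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            if img[y][x] == 1 and (img[y - 1][x] and img[y + 1][x]
                                   and img[y][x - 1] and img[y][x + 1]):
                out[y][x] = 0
    return out
```

Both filters read from the original map and write to a copy, so the result does not depend on the scan order.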
Step four is disparity calculation on the boundary map. The well known SAD (Sum of Absolute Differences) cost function is used to determine only the disparities of the boundary pixels, using the L values of the left and right image pixels. A partial disparity map is built from these sparse disparity measurements.
The fifth and final step is disparity map reconstruction from boundaries . The algorithm scans through each row of the partial disparity map and computes the remaining disparities based on the ones that have already been calculated. It operates in two stages:
-Disparity propagation ('fill' stage): In this first stage the disparity map is scanned row-wise, left to right. Whenever two boundary pixels with identical disparity values are encountered, the intermediate pixels of that row (aka the pixels between the boundaries with the same value) are 'filled' with that disparity value. An exception is made near the left and right end of each row. The left and right ends of each row are filled with the disparity value of the nearest border pixel until a boundary pixel is encountered.
-Estimation from known disparities ('peek' stage): In the second stage the algorithm searches for the pixels whose disparity has not been determined yet and estimates it based on the disparity values of their neighboring pixels. When such a pixel is found, the known disparities of its neighbors are stored in an array and the unknown disparity is computed using statistical analysis.
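The 'fill' stage for a single row can be sketched as follows (a hedged sketch; using 0 as the 'unknown' marker is a simplification, since a real implementation must distinguish unknown pixels from a true disparity of 0):

```python
UNKNOWN = 0  # simplification: marks pixels with no disparity yet

def fill_row(row):
    """'Fill' stage for one row of the partial disparity map: whenever
    two boundary pixels carry identical disparity values, the pixels
    between them receive that value; the row ends are filled from the
    nearest boundary pixel."""
    out = row[:]
    known = [x for x, d in enumerate(row) if d != UNKNOWN]
    if not known:
        return out
    for x in range(known[0]):                  # left end of the row
        out[x] = row[known[0]]
    for x in range(known[-1] + 1, len(row)):   # right end of the row
        out[x] = row[known[-1]]
    for a, b in zip(known, known[1:]):         # consecutive boundary pairs
        if row[a] == row[b]:                   # equal disparities: fill between
            for x in range(a + 1, b):
                out[x] = row[a]
    return out
```

Pixels between boundary pixels with differing disparities stay unknown and are handled by the 'peek' stage.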
4.6.3: Implementation Details
The algorithm input parameters except the common ones are:
• K: the number of intensity-based clusters for K-Means clustering in the second step.
• disp_scale: the factor by which the calculated disparities are multiplied. Its value should be such that max_disparity * disp_scale <= 255.
It should be noted that all the parameters except the image pair are optional. A random value will be used if they are not specified and the results might not be optimal.
4.6.4: Testing and Results
The produced disparity maps for Cones, Teddy, Tsukuba and Venus image pairs are the following:
Figure 23 : Fast stereo Matching and Disparity Estimation Results
The percentage of bad matching pixels according to the Sum of Absolute Differences metric is the following:
Cones Teddy Tsukuba Venus Avg time
FSM 0.0770 0.1338 0.0320 0.1202 ~10.14 sec
4.7: Probability-Based Rendering for View Synthesis
4.7.1: Algorithm Overview
The main objective of this work [28] is to synthesize a virtual view, given two reference images, without deterministic correspondences. The first challenge is to construct the matching probability of all probable matching points. The second is to render an intermediate view using the set of all candidate matching points together with their probabilities.
To address the aforementioned challenges, the authors of the paper presented the probability-based rendering (PBR) approach, which robustly reconstructs an intermediate view with the steady-state matching probability (SSMP) density function.
SSMP: In this particular work the matching cost, typically referred to as a cost volume in the correspondence matching literature, is re-defined as the probability of being matched between points, enabling random walk with restart (RWR) to be applied to optimize the matching probability. The RWR uses edge weights between neighboring pixels to enhance the matching probability similar to aggregation methods for local stereo matching.
PBR: The rendering process is re-formulated as an image fusion, so that all probable matching points represented by the SSMP can be considered together. This approach has a couple of significant advantages. First, it suppresses flicker artifacts. Second, the intermediate view is free from the hole-filling problem, since the SSMP considers all positions of probable matching points.
4.7.2: SSMP with RWR
First of all, the SSMP is defined. Two images are assumed, left and right. The probability p measures how likely a point Il(m1, m2) on the left image is to be matched to a point Ir(m1 − d, m2) on the right image at disparity d, or vice versa. Also, the probability is inversely proportional to the matching cost, since a smaller matching cost means a higher matching probability. The above are summarized in the following formulas:
where p0 is an initial matching probability computed from an initial matching cost e0, Z(m) represents a normalization term, m denotes the coordinates (m1, m2) and d denotes the disparity.
The next step is the estimation of the SSMP using RWR. As the authors note, the random walk has been widely used for probabilistic optimization problems. A random walker iteratively transitions to its neighboring points according to an edge weight; in addition, the walker returns to its initial position with a restarting probability a (0 <= a <= 1) at each iteration. A matching probability in the SSMP can be obtained by the RWR in an iterative fashion as follows:
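The update equation is missing from the extracted text; the standard RWR iteration matching the description (restart probability a, edge weights w, four-neighborhood N_m) is:

```latex
p^{t+1}(m,d) = (1-a) \sum_{n \in N_m} w(m,n)\, p^{t}(n,d) + a\, p^{0}(m,d)
```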
where N_m denotes the four-neighborhood of a reference pixel m. Note that the formula reduces to the plain random walk when the restarting probability is zero. With the assumption that neighboring pixels tend to have similar matching probabilities when the range distance between the reference pixel m and its neighboring pixel n is small, an edge weight w(m,n) is computed as w(m,n) ∝ exp(-||I(m) - I(n)||_2^2 / γ), where γ represents the bandwidth parameter, typically set to the intensity variance, and ||.||_2 denotes the l2 norm. A steady-state solution p_s(m,d), referred to as the SSMP in this work, can then be obtained by iteratively updating p_{t+1}(m,d) until p_{t+1}(m,d) = p_t(m,d).
According to the authors, this method presents significant advantages. First of all, it does not require specifying a window size for reliable matching, contrary to conventional methods, thanks to the small number of adjacent neighbors involved. Second, there is no need to specify the number of iterations, since the method yields a non-trivial solution in the steady state. Third, it gives the optimal solution for a given energy functional.
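The iteration described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' Matlab code: the cost-volume input, the bandwidth default, and the wrap-around boundary treatment via `np.roll` are all simplifying assumptions.

```python
import numpy as np

def ssmp_rwr(cost, image, alpha=0.2, gamma=None, tol=1e-6, max_iter=1000):
    """Steady-state matching probability via random walk with restart (sketch).

    cost:  (H, W, D) initial matching-cost volume (illustrative input).
    image: (H, W) grayscale reference image used for the edge weights.
    """
    # Initial probability p0: inversely related to the cost, normalized over
    # the disparity hypotheses so each pixel's probabilities sum to one.
    p0 = np.exp(-cost)
    p0 /= p0.sum(axis=2, keepdims=True)

    if gamma is None:
        gamma = max(float(image.var()), 1e-8)  # bandwidth ~ intensity variance

    # Edge weights w(m,n) = exp(-|I(m)-I(n)|^2 / gamma) to the four
    # neighbours, normalized to sum to one per pixel. np.roll wraps around
    # at the image borders -- acceptable for a sketch.
    shifts = [(-1, 0), (1, 0), (0, -1), (0, 1)]
    w = np.stack([np.exp(-(image - np.roll(image, s, axis=(0, 1))) ** 2 / gamma)
                  for s in shifts])
    w /= w.sum(axis=0, keepdims=True)

    # Iterate p_{t+1} = (1-alpha) * sum_n w(m,n) p_t(n,d) + alpha * p0
    # until the steady state p_{t+1} = p_t is reached (within tol).
    p = p0.copy()
    for _ in range(max_iter):
        agg = np.zeros_like(p)
        for k, s in enumerate(shifts):
            agg += w[k][..., None] * np.roll(p, s, axis=(0, 1))
        p_new = (1.0 - alpha) * agg + alpha * p0
        if np.abs(p_new - p).max() < tol:
            return p_new
        p = p_new
    return p
```

A winner-take-all disparity map then follows as `p.argmax(axis=2)`.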
4.7.3: PBR with SSMP
Now the two reference images and the sets of their corresponding SSMPs are given.
The rendering process is cast as a probabilistic image fusion. The baseline between the left and right cameras is assumed to be normalized to 1, and β denotes the location of a virtual camera, where 0 <= β <= 1. Also, P_l(m,d) and P_r(m,d) encode the matching probability of a pixel on the left and right warped images, I_l^u(m,d) and I_r^u(m,d), respectively, as follows:
where Z_l(m) and Z_r(m) are normalization terms, and <.> represents a rounding operator. The virtual view is then synthesized via an image fusion process. Specifically, a probabilistic average, E_l(I_l(m)) and E_r(I_r(m)), of the two reference images is computed from the corresponding probabilities P_l(m,d) and P_r(m,d) and the textures I_l^u(m,d) and I_r^u(m,d) over the disparity hypotheses d, and the two averages are then blended as follows:
The left and right disparity maps are denoted d_l^w(m) and d_r^w(m) respectively. The sampled points I_l^u(m,d) and I_r^u(m,d) are then converted into functions of m, I_l^u(m) and I_r^u(m) respectively. Furthermore, the matching probability functions P_l(m,d) and P_r(m,d) are simplified into a set of shifted Dirac delta functions as follows:
and
Then, the PBR of the previous equation becomes:
For a given fixed point m*, the PBR synthesizes the intermediate view I_u(m*) from the reference view I_l^u(m*,d) and the probability P_l(m*,d) as follows:
Finally, the PBR is able to handle occlusion and dis-occlusion (hole) regions by assuming that the background texture varies smoothly; the textures of these problematic regions are synthesized in a probabilistic manner.
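The fusion step can be sketched as follows. This is not the paper's implementation: the array shapes and the assumption that the candidate textures have already been warped per disparity hypothesis are illustrative.

```python
import numpy as np

def pbr_blend(texture_l, texture_r, prob_l, prob_r, beta):
    """Blend warped reference textures using matching probabilities (sketch).

    texture_l/r: (H, W, D) candidate textures warped to the virtual view,
                 one slice per disparity hypothesis d (hypothetical inputs).
    prob_l/r:    (H, W, D) matching probabilities, summing to 1 over d.
    beta:        virtual-camera position in [0, 1] along the baseline.
    """
    # Probabilistic average over disparity hypotheses:
    # E[I(m)] = sum_d P(m, d) * I(m, d)
    expect_l = (prob_l * texture_l).sum(axis=2)
    expect_r = (prob_r * texture_r).sum(axis=2)
    # Distance-weighted blending of the two expectations.
    return (1.0 - beta) * expect_l + beta * expect_r
```

Note that when a probability volume degenerates to a Dirac delta at a single disparity, the expectation simply picks the corresponding texture, matching the simplified formulation above.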
4.7.4: Implementation details
The implementation of this algorithm runs in Matlab. As discussed earlier, it does not require the window-size parameter common to the other algorithms. The most important parameter the user has to set is the disparity range. Beyond that, a large number of parameters can be tuned to control various aspects of the algorithm, but none of them need to be changed and all can be left at their default values.
4.7.5: Testing and Results
Figure 24: SSMP Results
The percentage of bad matching pixels according to the Sum of Absolute Differences metric is the following:
Cones Teddy Tsukuba Venus Avg time
SSMP 0.0883 0.1087 0.0379 0.4993 ~231.1 sec
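The bad-matching-pixel percentage reported throughout these tables follows the standard Middlebury-style definition: the fraction of pixels whose disparity error exceeds a threshold. A minimal sketch, with the threshold value and the missing-ground-truth convention as assumptions:

```python
import numpy as np

def bad_pixel_ratio(disp, gt, thresh=1.0):
    """Fraction of pixels whose absolute disparity error exceeds `thresh`.

    Pixels where gt == 0 are treated as missing ground truth and skipped
    (an assumed convention, matching common Middlebury usage).
    """
    valid = gt > 0
    err = np.abs(disp[valid].astype(float) - gt[valid].astype(float))
    return float((err > thresh).mean())
```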
4.8: Results Analysis and Conclusion
Below is a table that summarizes the percentage of bad matching pixels for all the algorithms. Highlighted in blue is the best (lowest) score for each image pair and in red the worst (highest).
OpenCV Cones Teddy Tsukuba Venus Avg. Time
BM 0.4450 0.4583 0.3556 0.4707 ~0.028 sec
SGBM 0.4943 0.4977 0.3923 0.4982 ~0.065 sec
LoopyBP 0.0953 0.4206 0.0410 0.4203 ~155.88 sec
Matlab
SSMP 0.0883 0.1087 0.0379 0.4993 ~231.1 sec
FSM 0.0770 0.1338 0.0320 0.1202 ~10.14 sec
The following diagram gives a visual representation of the performance of the five algorithms. Since the measured quantity is the percentage of bad matching pixels, a lower value indicates better performance and a higher value indicates more matching errors.
Figure 25: Percentage of bad matching pixels (lower value is better)
The most "difficult" image pairs to match appear to be Venus and Teddy: all the algorithms give their highest numbers of bad matching pixels on those two image sets. A possible reason is that these sets exhibit little variation in the scene and greater uniformity in some regions, which can "confuse" an algorithm and produce a high number of candidate matching points. The other two image sets, Cones and Tsukuba, give better scores across all the algorithms. These images exhibit greater variation in the scene, so matching points are easier to identify, with Tsukuba giving the best result for every algorithm.
Overall, the most efficient algorithm of the ones tested here is the Fast Stereo Matching and Disparity Estimation method by G.R.M. Reddy and S. Mukherjee. It gives a small number of bad matching pixels under all circumstances and has a low running time. The SSMP also gives excellent results, but it is highly complicated and exhibits a long running time due to the large number of steps required for its completion. Also, as the Venus result shows, it cannot always handle uniformity in images effectively. The Loopy Belief Propagation also handles uniform scenes poorly, as the results on Teddy and Venus show; additionally, if a higher disparity range is selected, the running time becomes quite high even with a small number of iterations. Finally, the SGBM and BM have many points in common in their OpenCV implementations, with BM giving slightly better scores. Both algorithms performed poorly in the matching process itself, but have a very small runtime; they seem ideal for real-time applications, or wherever speed matters more than robust stereo matching.
Chapter 5: Discussion and future work
Stereo vision is applied, as mentioned earlier, in scientific, industrial, military and even consumer fields. Although it is still considered a gimmick by many people, it is steadily gaining traction and acceptance.
Perhaps the most obvious research field that will employ stereo vision in the near future is virtual reality. Virtual reality applications have existed for a few years, but mostly for educational/entertainment purposes with limited use, mainly virtual tours of rather small 3D environments. Nowadays, with increased computational power, virtual reality can also be used in immersive and interactive applications, like video games.
Several consumer virtual reality devices like Google Cardboard [29] and Oculus Rift [30] have started to reach the market, and more such devices are expected from many manufacturers in the near future.
Figure 26: Current Virtual Reality Devices
Another interesting project, scheduled for commercial release in October 2015, is a device dubbed a virtual reality toy, intended as the greatest redesign of the famous View-Master. The Mattel toy corporation is working with Google on the project, which is largely based on the Google Cardboard. The traditional reels are replaced by plastic cards and a smartphone: the user slides the smartphone inside the headset and scans the cards, and a 3D image based on the theme of the cards is displayed. The trademark switch on the future View-Master is now used to zoom or to focus on objects in the virtual scene.
Another field that has recently started to employ stereo vision is medicine, and more specifically endoscopy. Traditional endoscopes feature a single camera that provides a two-dimensional image of the patient's examined internal organ. Stereoscopic endoscopes feature two cameras that provide three-dimensional imaging, thus allowing a more thorough visual examination by extracting information about the internal surface of the organs.
Figure 27: Stereoscopic Endoscope
Research is also very active on driverless cars, discussed in the first chapter, and is expanding to other vehicles, mainly autonomous drones and Unmanned Ground Vehicles. There are also several space exploration projects that employ stereo vision, such as the innovative planetary landing algorithm proposed by S. Woicke and E. Mooij [31], used to extract planet surface information and safely guide the space vessel to touchdown.
Finally, there are the robotic systems, autonomous or not, that use stereo imaging along with many other sensors. Robots are becoming more efficient and intelligent, and their use is set to expand in the near future into almost every sector imaginable.
References
1. http://en.wikipedia.org/wiki/Computer_stereo_vision.
2. https://en.wikipedia.org/wiki/Stereopsis#History_of_investigations_into_stereopsis.
3. https://en.wikipedia.org/wiki/View-Master#History.
4. https://en.wikipedia.org/wiki/Virtual_Boy.
5. http://en.wikipedia.org/wiki/Pinhole_camera_model.
6. http://en.wikipedia.org/wiki/Camera_resectioning.
7. http://en.wikipedia.org/wiki/Epipolar_geometry.
8. http://en.wikipedia.org/wiki/Fundamental_matrix_%28computer_vision%29.
9. http://en.wikipedia.org/wiki/Image_rectification.
10. http://www.jayrambhia.com/blog/disparity-maps/.
11. http://www.cs.stolaf.edu/wiki/index.php/Stereo_Matching.
12. K. Konolige. Small Vision Systems: Hardware and Implementation. Springer. 1998.
13. https://en.wikipedia.org/wiki/Cross-correlation#Normalized_cross-correlation.
14. https://en.wikipedia.org/wiki/Ground_truth.
15. http://techcrunch.com/2010/06/19/a-guide-to-3d-display-technology-its-principles- methods-and-dangers/.
16. http://www.self.gutenberg.org/articles/Aerial_survey.
17. http://www.nasa.gov/mission_pages/stereo/main/index.html.
18. https://en.wikipedia.org/wiki/Autonomous_car.
19. D. Scharstein, R. Szeliski. A Taxonomy and Evaluation of Dense Two-Frame Stereo Correspondence Algorithms. 2001.
20. R. Szeliski, R. Zabih. An Experimental Comparison of Stereo Algorithms. Springer. 2000.
21. L. Nalpantidis, G. Sirakoulis, A. Gasteratos. Review of Stereo Matching Algorithms for 3D Vision. 2007.
22. R.A. Lane, N.A. Thacker. Overview of Stereo Matching Research. 1998.
23. H. Hirschmuller. Semi-Global Matching - Motivation, Development and Applications. 2011.
24. http://en.wikipedia.org/wiki/Block-matching_algorithm.
25. Nghia Ho. http://nghiaho.com/?page_id=1366#LBP. [Online]
26. https://en.wikipedia.org/wiki/Belief_propagation.
27. S. Mukherjee, G.R.M. Reddy. A Hybrid Algorithm for Disparity Calculation From Sparse Disparity Estimates Based on Stereo Vision. IEEE. 2014.
28. B. Ham, D. Min, C. Oh, M.N. Do, K. Sohn. Probability-Based Rendering for View Synthesis. IEEE. 2014.
29. https://www.google.com/get/cardboard/.
30. https://www.oculus.com.
31. S. Woicke, E. Mooij. A Stereo-Vision Based Hazard-Detection Algorithm for Future Planetary Landers. 2014.