<<

AUTOMATED TEXTURE MAPPING OF LASER BASED

RANGE IMAGES

by

RATTASAK SRISINROONGRUANG, B.S.

A THESIS

IN

COMPUTER SCIENCE

Submitted to the Graduate Faculty of Texas Tech University in Partial Fulfillment of the Requirements for the Degree of

MASTER OF SCIENCE

Approved

Eric D. Sinzinger Chairperson of the Committee

Gopal D. Lakhani

Hector J. Hernandez

Accepted

John Borrelli Dean of the Graduate School

August, 2005

ACKNOWLEDGEMENTS

I would like to thank Mom and Dad for everything they’ve done and had to go through to give their children the best kind of life that they could. I only hope I will be as kind and giving as you throughout my life. I would like to thank Dr. Eric Sinzinger for all the help he has given me during my research. The suggestions and advice made the completion of this work possible.

CONTENTS

ABSTRACT
LIST OF FIGURES
LIST OF TABLES
1 INTRODUCTION
2 RELATED WORK
2.1 Texture Mapping
2.2 Laser Range Data
2.3 Data Registration
2.4 Image Segmentation
3 BACKGROUND
3.1 3D Transformations
3.2 Camera Model
3.3 Texture Mapping Overview
3.4 Texture Mapping Techniques
3.5 Texture Mapping Types
3.6 Texture Mapping Effects
3.7 Aliasing and Filtering
3.8 Image Segmentation
4 METHODOLOGY
4.1 Automated Mesh Alignment
4.2 Stencil Calculation
4.2.1 Translation Alignment
4.2.2 Scale Alignment
4.2.3 Rotation Alignment
4.2.4 Alignment Metric
4.2.5 Field of View Alignment
4.2.6 Combined Transform Alignment
4.3 Texture Coordinate Mapping
4.3.1 Orthographic Projection
4.3.2 Perspective Projection
5 RESULTS
6 CONCLUSION AND FUTURE WORK
6.1 Advantages and Disadvantages
6.2 Future Work and Improvements
6.2.1 Stencil calculations
6.2.2 Extended borders
6.2.3 Lighting and shading
REFERENCES

ABSTRACT

Texture mapping is the process of applying a 2D image onto a 3D planar surface. This requires the generation of a mapping that defines the relationship between the 2D coordinates of the image and the 3D coordinates of the surface. The goal of this research is to provide a method of automatically generating this mapping given a 3D object at arbitrary orientation and a 2D image that may contain unwanted background information. A review of the current methods of texture mapping, image segmentation, and basic 3D viewing transforms is given. An algorithm to compute this alignment given the proper segmentation of the 2D image is then proposed and tested with five different models. The results of the generated alignment and mapping are then discussed, showing the level of accuracy of the final texture mapped model.

LIST OF FIGURES

3.1 Viewing frustum
3.2 Texture Mapping Example
3.3 Segmentation using clustering with color. Reprinted from "Computer Vision: A Modern Approach," by Forsyth and Ponce, Prentice Hall, 2003.
3.4 Segmentation using clustering with color and position. Reprinted from "Computer Vision: A Modern Approach," by Forsyth and Ponce, Prentice Hall, 2003.
5.1 Segmentation Images
5.2 Teapot Alignment Results for δr = 45°
5.3 Teapot Alignment Results for δr = 20°
5.4 Teapot Initial Orientation and Result Error for δr = 20°
5.5 Face Initial Orientation and Result Error for δr = 20°
5.6 Face Alignment Results for δr = 45°
5.7 Face Alignment Results for δr = 20°
5.8 Mechanical Part Initial Orientation and Result Error for δr = 20°
5.9 Mechanical Part Alignment Results for δr = 45°
5.10 Mechanical Part Alignment Results for δr = 20°
5.11 Cessna Initial Orientation and Result Error for δr = 20°
5.12 Cessna Alignment Results for δr = 45°
5.13 Cessna Alignment Results for δr = 20°
5.14 Will Rogers Initial Orientation and Result Error for δr = 20°
5.15 Will Rogers Alignment Results for δr = 45°
5.16 Will Rogers Alignment Results for δr = 20°

LIST OF TABLES

5.1 Model Sizes
5.2 Alignment Calculations with δr = 45°
5.3 Alignment Calculations with δr = 20°

CHAPTER 1 INTRODUCTION

Texture mapping has become an integral component of computer generated scenes, whether in movies, games, or other forms of graphical rendering. Previously, when meshes composed of thousands of polygons were considered complex, the manual assignment of texture coordinates to 3D model coordinates, though time consuming, was a manageable task. With increased processing power and memory, however, 3D models can now be composed of hundreds of thousands of polygons. An automated method of assigning texture coordinates with quick visual feedback would expedite the artistic pipeline.

Texture mapping is the process of applying an image (usually 2D) to a planar surface (usually in 3D space). It can be used to increase the visual interest and complexity of a scene without adding geometric data. The most common component to add is surface color. However, texture mapping is also used to apply light effects, shadow effects, reflective effects, and surface irregularity effects onto a surface. The process of applying and warping a texture on a planar surface is computationally inexpensive compared to the cost of transforming a geometrically complex scene.

This project involves the automated mapping of textures to 3D geometric data, requiring as little user input as possible. The traditional process of applying a non-tiled texture map to a 3D model requires extensive user input to select the proper 'binding' of texture coordinate points to 3D model vertices. Traditionally, the process of creating a non-tiled mapping onto a 3D model is accomplished in two ways. One method requires the user to manually assign each vertex in the model to a specific point in the 2D texture. For increasingly complex 3D models, this process takes a long time. The other method is "reverse-skinning". This involves obtaining the unwrapped "outline" of the 3D model on a 2D texture. The user can then apply the color properties onto the 2D texture as desired, knowing

where each point on the 2D texture is mapped on the 3D model. Both methods require significant user input and can take a long time to accomplish. In the former case, the process does not scale well with increasing geometric complexity in terms of the amount of input needed from the user. In the latter case, working around the 3D model outline usually means a stock piece of texture cannot be applied to the model, as the outline (usually generated automatically by an unwrapping tool) will have an orientation not in alignment with the texture of the object.

If the process of texture mapping can be automated as proposed in this research, the time needed to correctly apply a texture to a 3D model may be significantly reduced and the problems of the two traditional methods of texture mapping discussed earlier can be mitigated. This automated process should scale well with the increased geometric complexity of models in terms of time because the user will not be required to manually set the mapping between texture points and model points for every vertex. The issue with "reverse-skinning" where the orientation of the model outline does not align with the stock texture may also be eliminated. Given a stock texture containing an image of the object whose 3D representation the user wishes to have mapped, this new process of texture mapping would ideally determine the transformation needed to align the 3D model so that its orientation closely matches that present in the texture. From this, the mapping of texture coordinates to 3D model vertices can be automatically computed.

CHAPTER 2 RELATED WORK

2.1 Texture Mapping

Texture mapping is the process where a 2D texture is applied to a 3D object. The process involves the parameterization of a surface and the subsequent application of the texture onto the surface [3]. Texture mapping gained popularity because of its ability to add realism and interest to a scene without the increased geometric complexity that would lead to significantly longer processing times. Such effects include color, reflection, shadow, and surface irregularities. The reflection effect is not a true reflection and requires the use of an environment map that is "wrapped" around the 3D object as rays are cast outward from the object to determine the intersection point with the environment map [3]. Surface irregularities are modeled with the use of a bump map that defines surface normal offsets. These offsets are used during lighting calculations to give the illusion of surface irregularities [2].

Combined transforms of 3D points are facilitated by encoding the transformations as matrix operations. This requires that 3D points be represented as the 4 components of a homogeneous coordinate system so that the 3D points can be manipulated by multiplication with the transformation matrix [15]. With the advent of modern graphics processing units that allow efficient matrix operations, this has the added bonus of allowing the transformations to be done on specialized hardware.

The most common type of texture mapping is perspective texture mapping, a transformation that is related to the perspective transformation used when manipulating the point of view [8]. This type of mapping limits the distortions apparent when using an affine texture mapping transformation.

2.2 Laser Range Data

Laser range scanner devices emit a beam that is transmitted and recaptured to determine the distance to an object. This is most commonly accomplished by determining the elapsed time until the beam is reflected back towards the device. Accuracy depends upon various factors such as target reflectivity, target size, and target orientation. Such range scanners have various uses in the real world. Range scanners are used by the military; they are, for example, integrated into the M1A1 Abrams tank to aid in determining the distance to a target. Range scanners also have uses for reconstructing accurate 3D representations of a real world scene. Discovering the distance to an object also has uses in robotic navigation.

Cloud data points obtained from laser range scanners have been used before to construct a 3D virtual representation of a scene. The data obtained from such scans contain no relationship between the points. Thus, there is no underlying structure present in the data. Delaunay triangulation can be used to create surfaces from the points and provide structure to the data [4]. Another option is to "carve" a volume based upon the range scanner data to produce a volumetric representation instead of only an outer shell [6]. Turk and Levoy employ a method where multiple sets of data are "stitched" together to form a continuous polygonal surface [18]. To obtain a fully enclosed mesh from range scanner data, multiple scans are needed to account for occlusions. To produce a final model, the scans have to first be aligned into a common coordinate system and then merged to eliminate all holes [13]. In the case of the Digital Michelangelo project, color data was stored as an RGB reflectance triplet at each vertex [13]. The use of color data at each vertex instead of textures was possible because of the high resolution used during the range scanning process.

2.3 Data Registration

A popular method for registration of 3D data is the iterative closest point (ICP) algorithm [1]. This algorithm registers a data set P with a model shape M. The data set must be represented as a point set, and the model shape can take any of the following forms: point set, line segment set, implicit curves, parametric curves, triangle sets, implicit surfaces, and parametric surfaces. If the set P is not immediately available as a point set, decomposing its representation to a point set is fairly simple.

The model shape can be represented in a form different from those mentioned previously as long as there is a definition for finding the minimum distance between a 3D point and a point in the model shape M. The ICP algorithm eventually converges to a local minimum. Finding the global minimum requires finding the minimum of all local minimum wells in the search space [1]. This requires a proper partitioning of the registration search space.

Determining an object's position in space is known as pose estimation [17]. Pose estimation can be established for a given 3D data set to match either a 2D image or a 3D range model. The majority of methods to determine the pose require at least three correspondences [10]. One popular method to determine pose in order to map a 3D data set to a 2D image requires the use of Newton's method. This is also done when attempting to fit a model composed of arbitrary curves instead of the more common planar surfaces defined by vertices [14]. This method also requires at least three correspondences between the data set and the 2D image. These may be difficult to find for some data sets, and the method requires that the visible section of 3D data be similar to the visible section in the 2D image. This method also requires that the intrinsic parameters of the camera that generated the 2D image be known. It then proceeds by establishing the correlation between the 3D data set points and how those points would be projected onto a 2D plane to generate an image similar to the one being used for matching.

In the case of pose estimation between a 3D data set and a range image, the dimensionality of the two inputs is the same. At least three correspondences are required between the two inputs. The pose and translation are then computed by solving a least squares problem.

2.4 Image Segmentation

Image segmentation is the process that distinguishes objects from one another in an image. This is done by grouping pixels that belong to the same object. This provides a higher level representation of image data relative to the most basic unit, the pixel. This proves useful for certain applications such as image searching, by allowing the user to search for other images that contain the same object of interest [5].

Image segmentation is currently a very active field of research. The majority of image segmentation algorithms are formed upon the basis of the K-Means algorithm [9]. All that is required is that there is some method to compute a distance between the basic units that a particular segmentation algorithm employs. These units are usually a feature vector containing color, position, or some other property. They are either divisive or agglomerative [9] in nature, as will be explained in the background chapter. Another approach uses the expectation-maximization (EM) algorithm [7] to group together likely "blobs" of pixels that correspond to the same object [5]. That method uses a feature vector with eight parameters at each pixel: three for color, three for texture, and two for position. It is similar to the algorithms based on K-Means in that it also has a target number of segments it attempts to converge towards. Image segmentation has also been applied to the segmentation of MRI data and particularly to the segmentation of brain tissue images [20]. That method also employs the EM algorithm, taking into account the color values and tissue properties.

CHAPTER 3 BACKGROUND

3.1 3D Transformations

The three main transformations used in 3D are translation, scale, and rotation. All three transforms can be embedded within a 4x4 matrix, which in turn can be concatenated with other transform matrices to produce a combined transform matrix. This matrix is then multiplied with a 3D point to produce a new, transformed 3D point. So that translation can be embedded within a matrix, homogeneous coordinates are required to represent 3D points. This results in the addition of a fourth component, the homogeneous component. This is usually set to one for points and zero for vectors. The translation matrix is defined below.

T = \begin{bmatrix} 1 & 0 & 0 & t_x \\ 0 & 1 & 0 & t_y \\ 0 & 0 & 1 & t_z \\ 0 & 0 & 0 & 1 \end{bmatrix}

The scale matrix is defined below.

S = \begin{bmatrix} s_x & 0 & 0 & 0 \\ 0 & s_y & 0 & 0 \\ 0 & 0 & s_z & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}

The rotation matrices for rotation about the x, y, and z axes are respectively defined below.

R_x = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & \cos\theta & -\sin\theta & 0 \\ 0 & \sin\theta & \cos\theta & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}

R_y = \begin{bmatrix} \cos\theta & 0 & \sin\theta & 0 \\ 0 & 1 & 0 & 0 \\ -\sin\theta & 0 & \cos\theta & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}

R_z = \begin{bmatrix} \cos\theta & -\sin\theta & 0 & 0 \\ \sin\theta & \cos\theta & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}

These three matrices can be concatenated.
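As an illustration (an addition here, not part of the thesis code), the following Python/NumPy sketch builds the matrices defined above and concatenates them into a single combined transform applied to a homogeneous 3D point; the function names and the particular transform order are assumptions made for the example.

import numpy as np

def translation(tx, ty, tz):
    # 4x4 translation matrix T as defined above
    T = np.eye(4)
    T[:3, 3] = [tx, ty, tz]
    return T

def scale(sx, sy, sz):
    # 4x4 scale matrix S as defined above
    return np.diag([sx, sy, sz, 1.0])

def rotation_x(theta):
    # 4x4 rotation matrix R_x about the x axis
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[1, 0,  0, 0],
                     [0, c, -s, 0],
                     [0, s,  c, 0],
                     [0, 0,  0, 1]], dtype=float)

# Combined transform: scale, then rotate, then translate (applied right to left).
M = translation(1.0, 2.0, 0.0) @ rotation_x(np.pi / 4) @ scale(2.0, 2.0, 2.0)

p = np.array([1.0, 1.0, 1.0, 1.0])   # homogeneous point, w = 1
print(M @ p)                          # transformed point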

3.2 Camera Model

Figure 3.1: Viewing frustum

The camera model employed in 3D computer graphics is similar to a camera in the real world. The image of the projected 3D object on the 2D screen is modeled after how an image is formed on a sheet of film. A projector is the line that passes through the point to be projected on the 3D object and the center of the camera lens, called the center of projection. The intersection of the projector with the projection plane, the sheet of film, is the position of the 3D point projected onto a 2D plane.

There are two main types of projections used in 3D computer graphics: orthogonal and perspective projections. Orthogonal projections do not take into account the depth of an object when projecting it to a 2D plane. Thus, an object one foot away from the viewer will have the same size as an object one hundred feet away. Because depth is not taken into account, all projectors for orthogonal projections are perpendicular to the projection plane. The projection matrix for an orthogonal projection is defined below.

\begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}

Perspective projections do take into account the distance an object is from the viewer. The viewing volume for perspective projections can be visualized as a four-sided pyramid, with the tip corresponding to the viewer position. The base corresponds to the far clip plane and the near clip plane is the intersection of a plane with the viewing volume at any point between the tip and the base. This plane is parallel to the base of the viewing volume. See Figure 3.1 for an example of a viewing frustum. The projection matrix for a basic perspective projection is defined below.

\begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 1/z_n & 0 \end{bmatrix}

where z_n is the distance to the projection plane and the near clipping plane. However, one would like to convert the viewing frustum to a canonical viewing volume. A canonical viewing volume is a volume that ranges from -1 to 1 along each axis. This provides simpler clipping and reduces the perspective projection to an orthogonal projection. The modified perspective projection matrix is defined below.

\begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & \alpha & \beta \\ 0 & 0 & 1/z_n & 0 \end{bmatrix}

\alpha = \frac{z_f + z_n}{z_f - z_n}

\beta = \frac{2 z_f z_n}{z_f - z_n}

where zf is the distance to the far clipping plane.
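A small sketch (added for illustration, not taken from the thesis code) of building this modified perspective projection matrix with NumPy; z_n and z_f are assumed to be the near and far clipping plane distances.

import numpy as np

def perspective(z_n, z_f):
    # Modified perspective projection matrix with the alpha and beta terms above.
    alpha = (z_f + z_n) / (z_f - z_n)
    beta = 2.0 * z_f * z_n / (z_f - z_n)
    return np.array([[1.0, 0.0, 0.0,       0.0],
                     [0.0, 1.0, 0.0,       0.0],
                     [0.0, 0.0, alpha,     beta],
                     [0.0, 0.0, 1.0 / z_n, 0.0]])

P = perspective(z_n=1.0, z_f=100.0)
point = np.array([0.5, 0.5, 10.0, 1.0])   # homogeneous 3D point
clip = P @ point
print(clip[:3] / clip[3])                  # perspective divide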

3.3 Texture Mapping Overview

Texture mapping is a method to increase the realism of 3D rendered scenes using less computation time than would be required were the geometric complexity of the scene increased to add more detail. Many different types of images can be mapped onto a planar surface to increase the visual interest and detail without having to increase the geometric detail of the object. Some examples are bump maps, light maps, environment maps, shadow maps, and texture maps. Bump mapping is used to disturb the surface normals of a flat planar surface, giving the illusion of dips and bumps. A light map is a pre-computed radiosity map that is applied to a flat planar surface to give the illusion of a more detailed lit surface without having to calculate the Phong model in real time. Since the map is pre-calculated, however, the lighting on the surface is static at run time. An environment map is applied to

a highly reflective object, giving the illusion of the surrounding scene being reflected off the object. A shadow map is computed during runtime and is usually a 'stencil' of the shadowed scene, providing the information needed to calculate the areas occluded from light. Texture mapping, the most common mapping type, is used to apply a texture such as wood or brick to a flat planar surface, increasing the visual interest of the object. See Figure 3.2 for an example of texture mapping. All these mapping techniques share a common trade-off: they trade lower computational cost for higher memory cost.

Figure 3.2: Texture Mapping Example

3.4 Texture Mapping Techniques

The process of mapping a two-dimensional texture to a three-dimensional object can be viewed as a forward process or an inverse process. The forward mapping process maps from the two-dimensional texture map to the three-dimensional surface. The first calculation to perform in this case is the surface parameterization. This means that for a given (u, v) coordinate in texture space specifying a picture element (or pixel), a function calculates the resultant (x, y, z) on the object that the texture map pixel would be applied to. The second calculation is the model and view transformations that project from object space to screen space.

Because the user normally wants to map onto a real-world object, a parametric definition for such an object usually does not exist. For this reason, the inverse mapping process has become more common. In this process, the four corners of the screen space projected pixel are used to calculate the pre-image of that pixel in image space. The pre-image is a non-uniform shape that the pixel in the texture map would map to when projected onto the object in screen space. For inverse mapping, the (u, v) texture coordinates are calculated as functions of the (x, y) screen coordinates. Knowing the shape of the pre-image facilitates anti-aliasing since a weighted average of colors can be taken from the texture map. Point sampling, taking the texture map color at the center of the pre-image, would in this case produce undesirable results.

Another method of texture mapping is two-part texture mapping. The surface parameterization problem is avoided because an intermediate surface that is readily parameterized is used. The texture map is then projected onto this intermediate surface. Since the intermediate surface is three-dimensional, the mapping problem then becomes a simpler problem of finding a correspondence between three-dimensional space objects. Both stages of this process are forward mappings. The first stage, called the S mapping stage, involves the mapping of the two-dimensional texture map to the three-dimensional intermediate object. The second stage, called O mapping, maps from the intermediate object to the target three-dimensional object.

3.5 Texture Mapping Types

There are three types of mappings that can be used: affine, bilinear, and projective [11]. Two properties of affine mappings are the preservation of parallel lines and of equi-spaced points along parallel lines. Affine mappings are linear mappings plus a translation. A linear mapping is any mapping such that the following is true.

T(x + y) = T(x) + T(y)

T(\alpha x) = \alpha T(x)

An affine mapping is parameterized below (for simplicity, in two-dimensional space).

\begin{bmatrix} x & y & 1 \end{bmatrix} = \begin{bmatrix} u & v & 1 \end{bmatrix} \begin{bmatrix} a & d & 0 \\ b & e & 0 \\ c & f & 1 \end{bmatrix}

Bilinear mappings are computed by using linear interpolation. The mapping involves linearly interpolating by a fraction of v along the y-direction and then linearly interpolating by a fraction of u across the planar surface. These mappings preserve horizontal and vertical lines and equi-spaced points along those lines. They do not preserve diagonal lines. Bilinear mappings are parameterized below.

\begin{bmatrix} x & y \end{bmatrix} = \begin{bmatrix} uv & u & v & 1 \end{bmatrix} \begin{bmatrix} a & e \\ b & f \\ c & g \\ d & h \end{bmatrix}

Projective mappings are the final type of mapping. Projective mappings preserve all parallel lines but not equispaced points. Projective mappings are parameterized below.

\begin{bmatrix} x' & y' & w' \end{bmatrix} = \begin{bmatrix} u' & v' & q' \end{bmatrix} \begin{bmatrix} a & d & g \\ b & e & h \\ c & f & i \end{bmatrix}

In current systems, projective mappings are the desired type of mapping because they do not suffer from the warping and distortion that the other two mapping types suffer from. Given the four corners of both the planar surface and the texture map, one can solve for the eight unknowns in the parameterization since i = 1. For k = 0, 1, 2, 3, one for each of the selected corners, there are eight equations for the eight unknowns in the parameterization.

x_k = \frac{a u_k + b v_k + c}{g u_k + h v_k + 1}

y_k = \frac{d u_k + e v_k + f}{g u_k + h v_k + 1}

These 8 equations are represented below.

\begin{bmatrix}
u_0 & v_0 & 1 & 0 & 0 & 0 & -u_0 x_0 & -v_0 x_0 \\
u_1 & v_1 & 1 & 0 & 0 & 0 & -u_1 x_1 & -v_1 x_1 \\
u_2 & v_2 & 1 & 0 & 0 & 0 & -u_2 x_2 & -v_2 x_2 \\
u_3 & v_3 & 1 & 0 & 0 & 0 & -u_3 x_3 & -v_3 x_3 \\
0 & 0 & 0 & u_0 & v_0 & 1 & -u_0 y_0 & -v_0 y_0 \\
0 & 0 & 0 & u_1 & v_1 & 1 & -u_1 y_1 & -v_1 y_1 \\
0 & 0 & 0 & u_2 & v_2 & 1 & -u_2 y_2 & -v_2 y_2 \\
0 & 0 & 0 & u_3 & v_3 & 1 & -u_3 y_3 & -v_3 y_3
\end{bmatrix}
\begin{bmatrix} a \\ b \\ c \\ d \\ e \\ f \\ g \\ h \end{bmatrix}
=
\begin{bmatrix} x_0 \\ x_1 \\ x_2 \\ x_3 \\ y_0 \\ y_1 \\ y_2 \\ y_3 \end{bmatrix}

Since there are 8 equations and 8 unknowns, Gaussian elimination can be used to solve for all 8 unknowns.
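The following sketch (an illustration added here, not thesis code) solves this 8x8 system with NumPy, whose linear solver performs the elimination; the helper name and the example corner values are hypothetical.

import numpy as np

def projective_coefficients(uv, xy):
    # Build the 8x8 system above from four (u, v) corners and their (x, y) targets.
    A = np.zeros((8, 8))
    rhs = np.zeros(8)
    for k in range(4):
        u, v = uv[k]
        x, y = xy[k]
        A[k]     = [u, v, 1, 0, 0, 0, -u * x, -v * x]
        A[k + 4] = [0, 0, 0, u, v, 1, -u * y, -v * y]
        rhs[k], rhs[k + 4] = x, y
    return np.linalg.solve(A, rhs)   # [a, b, c, d, e, f, g, h], with i = 1

coeffs = projective_coefficients(
    uv=[(0, 0), (1, 0), (1, 1), (0, 1)],
    xy=[(10, 10), (90, 20), (80, 95), (5, 80)])
print(coeffs)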

3.6 Texture Mapping Effects

Bump mapping is a method for perturbing the normals of a three-dimensional planar surface so that the illusion of bumps and dips can be simulated using the lighting model in combination with the altered surface normals. This technique was developed by Jim Blinn [19]. Since the original surface is still flat, a silhouette edge will pass evenly across the dip instead of trailing behind the level surfaced edge's silhouette. In essence, a bump map is a height map defining the perturbations that should be applied to the surface normals. Given a disturbance vector D, D is added to the original surface normal to perturb the normal in a particular direction.

N' = N + D

D can be calculated by the following.

D = B_u P - B_v Q

P and Q, along with N, define the three-dimensional coordinate system located on the planar surface. B_u and B_v are the partial derivatives of the bump map at index (u, v).

Environment mapping simulates the reflection of a surrounding scene onto a shiny object [19]. This technique was also first introduced by Blinn [3] [12]. The supposed surrounding environment is stored as a normal texture map and mapped onto the desired object to give the effect of reflectance. This allows the renderer to simulate reflection without going through the costly process of ray-tracing. Since it simulates reflection, the process of environment mapping is a function of the view vector, as opposed to texture mapping. Environment mapping can be viewed as a specific case of two-part mapping. The intermediate object is usually a cube or a sphere. A cube is used when the environment map is used to model a room because of the distortion apparent when using a sphere as the intermediate object. The view vector is reflected around the surface normal of the object and indexed into the surrounding intermediate surface. The resulting intersection is used to calculate the final color.

Shadow mapping can be done in real time, but without hardware support it may be unfeasible with more than one light source. Shadow mapping requires an additional buffer for each light source. First, the scene is rendered and only the depth information relative to the light source is stored into the buffer. Then the scene is actually rendered onto the screen after all shadow buffer information has been obtained. An extra calculation is required when rendering the scene using the normal Z-buffer algorithm. If a point in screen space from the camera's perspective is visible, then it is transformed into the coordinate space relative to the light source and indexed into the shadow buffer. If the z-value of the transformed coordinate is greater than the depth value stored in the shadow buffer, then the point is in shadow because it is being occluded by another point closer to the light source.
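A minimal sketch of the shadow-buffer test just described, written in Python for illustration; light_matrix, shadow_buffer, and the bias term are hypothetical names and are not part of the thesis implementation.

import numpy as np

def in_shadow(point_cam, light_matrix, shadow_buffer, bias=1e-3):
    # Transform the camera-space point into the light's clip space.
    p = light_matrix @ np.append(point_cam, 1.0)
    p = p[:3] / p[3]                               # perspective divide
    # Index into the shadow buffer ([-1, 1] range mapped to buffer pixels).
    u = int((p[0] * 0.5 + 0.5) * (shadow_buffer.shape[1] - 1))
    v = int((p[1] * 0.5 + 0.5) * (shadow_buffer.shape[0] - 1))
    # The point is in shadow if something closer to the light was stored there.
    return p[2] > shadow_buffer[v, u] + bias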

3.7 Aliasing and Filtering

Aliasing refers to artifacts that occur during the texture mapping process. They occur because textures in the natural world are 'analog' and represented by a continuous signal, whereas within a computer the textures are stored as discrete units. One can take the case of one-dimensional signals to discover the cause of aliasing because the theory extends to multiple dimensions. When a continuous signal from the natural world is sampled to be stored digitally, the texture is sampled at periodic intervals. Thus, there will always be a 'gap' of unknown information from the continuous signal that is lost between any two sample points. Attempting to reconstruct the original signal is the root cause of aliasing in texture maps.

Anti-aliasing is a technique used to filter out artifacts that occur when the spatial dimensions of the texture map and the projected planar surface in screen space are far apart. The artifacts result because a pixel of the object in screen space maps to an extremely large (or tiny) area in the texture map space. There are two cases to deal with in anti-aliasing. The first case is where an object is very close to the viewer. This results in the pixel pre-image occupying less than one texture element (or texel). This is called magnification. The other case is compression or minification. This results when the projected object is very small in screen space, causing the pixel pre-image to occupy a large number of texels. To handle these cases ideally, one would like to take the pre-image of the pixel and integrate the color value over that area in the texture map. However, as the pre-image changes shape over the course of the object's surface and integration cannot be done efficiently, approximations have been developed.

There are various methods of filtering. The easiest method is really no filtering at all. This involves taking the closest texel to the sample point and is known as point sampling. Other types of filtering can be categorized into two groups: space invariant filtering and space variant filtering.

Space invariant filtering involves the use of a constant filter shape when determining the sample texels to take into consideration. Thus, this type of filtering does not take into consideration the concept of an accurate pixel pre-image. One such popular space invariant method is called mip-mapping. This method takes into account only the size ratio between the projected object surface and the texture map, but the actual filter shape stays the same [19]. Mip-mapping requires several copies of a texture at varying resolutions. Since this is pre-calculated, non real-time algorithms can be used to produce good minification results for the varying resolution textures. Also, since most of the calculation is done off-line, this allows mip-mapping to be done quickly. Mip-mapping functions by choosing the copy of the texture that most closely matches the spatial resolution (or frequency) of the projected object in screen space. This mip-mapping process can be further refined at an added computational cost by linearly interpolating between two chosen mip-maps to obtain the final color. Mip-mapping can be used to handle magnification by having the original texture resolution be large enough that the problem of the pixel pre-image mapping to less than one texel never occurs. However, in most cases this may not be feasible because of memory constraints.

Space variant filters do take into consideration the pixel pre-image. This causes the shape of the filter to vary across the texture map. Common filter shapes used are the box, triangle, b-spline, and Gaussian. The b-spline filter can be derived from the box filter by convolving the filter with itself. The box filter is defined below.

b_1(x) = \begin{cases} 1 & |x| \le \frac{1}{2} \\ 0 & |x| > \frac{1}{2} \end{cases}

The Gaussian is defined below.

g_{\sigma^2}(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{x^2}{2\sigma^2}}
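For illustration, the two filter kernels defined above can be written directly as functions; this short sketch is an addition for clarity, not thesis code.

import numpy as np

def box_filter(x):
    # b_1(x): 1 inside |x| <= 1/2, 0 outside
    return np.where(np.abs(x) <= 0.5, 1.0, 0.0)

def gaussian_filter(x, sigma):
    # Gaussian with variance sigma^2
    return np.exp(-x**2 / (2.0 * sigma**2)) / (sigma * np.sqrt(2.0 * np.pi))

x = np.linspace(-2.0, 2.0, 9)
print(box_filter(x))
print(gaussian_filter(x, sigma=1.0))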

The three main categories of space variant filtering methods are direct convolution, pre-filtering, and Fourier series. Direct convolution functions by computing a weighted average of the sample texels. One of the most efficient types of direct convolution is the elliptical weighted average (EWA) filter. In this method, pixels are regarded as circles and the filter shape is an ellipse. This is more efficient than the more common curvilinear quadrilateral because the former has five degrees of freedom whereas the latter has eight. Pre-filtering's advantage is that the cost per pixel is a constant value, as opposed to direct convolution where the cost of filtering depends on the size of the filter shape. This is done by using a pyramid data structure in which multiple copies of the texture map are stored at different resolutions. An appropriate level of the pyramid is chosen based upon the filter size so as to minimize the number of samples required to be taken from the texture map. Various filter shapes from the direct convolution methods may be used, but the addition of multiple texture resolutions helps prevent large samples having to be taken from a texture map. This could occur when the object in screen space is near the horizon.

3.8 Image Segmentation

Segmentation involves the recognition of objects within an image at a higher level of representation than the most basic image element, the pixel. This is useful because even moderately high resolution images on modern computers may have a resolution of over 1024x1024, resulting in over a million pixels. Thus, grouping objects together can significantly reduce the search space for future operations that may be performed on the image, which in turn will increase overall performance. The most common method of computing image segmentation is through the clustering of pixels. Clustering involves grouping a set of points together that are deemed similar by some metric. The metric used depends on the particular application of image segmentation. This may involve having similar texture properties, similar color, proximity to other pixels, and other attributes.

The two basic methods of clustering are divisive clustering and agglomerative clustering [9]. Divisive clustering starts from a unified set of data points and iteratively splits the set into the two clusters that yield the largest inter-cluster distance. This results in the two least coherent sets of data being split into separate groups. Agglomerative clustering operates in the reverse direction relative to divisive clustering. Agglomerative clustering begins with each basic element as its own separate cluster. It then iteratively merges the two clusters with the smallest distance, and thus the most coherence, into one cluster. See Algorithms 1 and 2 [9] for algorithmic outlines of their operation.

Algorithm 1 Divisive clustering
  Put all data points in one cluster
  while Clustering threshold unsatisfied do
    Split cluster that produces largest inter-cluster distance
  end while

Algorithm 2 Agglomerative clustering
  Put each data point in a unique cluster
  while Clustering threshold unsatisfied do
    Merge two clusters with the smallest inter-cluster distance
  end while

An example of clustering using color as the metric is given in Figure 3.3 [9]. An example using both color and position is given in Figure 3.4.

An alternative to these methods is to first state an objective function that explicitly defines, based on the chosen metric, how correct the image segmentation is. The i-th cluster has a center located at c_i, and the j-th element of cluster i is represented as x_j. Each element in a cluster is represented as a feature vector consisting of attributes that are relevant to the chosen metric. Optimization of this objective function means that one would like to minimize the distance between elements in a cluster and their cluster center. This can be represented by the following function [9].

\phi(\text{clusters}, \text{data}) = \sum_{i \in \text{clusters}} \left( \sum_{j \in i\text{-th cluster}} (x_j - c_i)^T (x_j - c_i) \right)

Figure 3.3: Segmentation using clustering with color. Reprinted from "Computer Vision: A Modern Approach," by Forsyth and Ponce, Prentice Hall, 2003.

Figure 3.4: Segmentation using clustering with color and position. Reprinted from "Computer Vision: A Modern Approach," by Forsyth and Ponce, Prentice Hall, 2003.

This is known as the K-Means algorithm and iterates between two basic operations.

Algorithm 3 K-Means clustering
  while Clustering threshold unsatisfied do
    Allocate each point in the data set to the nearest cluster center
    Calculate new cluster centers based on the mean of the points allocated to each cluster
  end while

By iteratively applying these two steps, the K-Means algorithm eventually converges to a local minimum. However, a global minimum is not guaranteed and neither are K resultant clusters guaranteed.
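A minimal K-Means sketch following Algorithm 3, assuming each pixel has already been turned into a feature vector (for example R, G, B and optionally x, y); this is an illustrative addition, not the segmentation code used in the thesis.

import numpy as np

def kmeans(features, k, iterations=50):
    rng = np.random.default_rng(0)
    # Start from k randomly chosen feature vectors as cluster centers.
    centers = features[rng.choice(len(features), k, replace=False)].astype(float)
    labels = np.zeros(len(features), dtype=int)
    for _ in range(iterations):
        # Allocate each point to the nearest cluster center.
        dists = np.linalg.norm(features[:, None, :] - centers[None, :, :], axis=2)
        labels = np.argmin(dists, axis=1)
        # Recompute each center as the mean of the points allocated to it.
        for i in range(k):
            if np.any(labels == i):
                centers[i] = features[labels == i].mean(axis=0)
    return labels, centers

# Example: cluster random RGB feature vectors into 3 groups.
pixels = np.random.default_rng(1).random((1000, 3))
labels, centers = kmeans(pixels, k=3)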

CHAPTER 4 METHODOLOGY

Given a data set representing a 3D model and an image containing a representation of that model, the output is a texture mapped 3D model using the color data from the 2D image. The main purpose is to obtain 3D laser range data of an object, take a photograph of the same object, and generate a texture mapped 3D model from this data. However, these methods should also apply to any 3D data set representing an object and an image containing that object.

4.1 Automated Mesh Alignment

Before generating the mapping of 3D model points to 2D image points, the model has to be aligned to overlap the image of the object in the 2D texture as accurately as possible. This consists of applying three transforms to the model: rotation, translation, and scale. The calculation of each transform will first be considered independently of the other two transforms. This means the other two transforms are assumed to already be properly calculated for the mesh. Another parameter that must be accounted for is the focal length of the camera; its method of calculation will also be described. The metric used for quantifying the alignment result will be the overlap percentage and will be described in a later section. Finally, the description of the algorithm combining all transforms will be presented.

4.2 Stencil Calculation

To be able to calculate the required transforms, a stencil outlining both the object of interest in the 2D image and the 2D projection of the 3D model has to be obtained. These will be called the image stencil and the model stencil throughout the rest of this document. The image stencil is obtained by first running the image through an image segmentation algorithm. This will yield a group of segments of varying color. The user may then select the segment that best denotes the object of interest. From this, a binary segmented image representing the stencil is formed. The resolution of the stencil image is the same as the resolution of the original 2D image.

The untextured, flat colored projection of the 3D model is drawn to the video buffer. The video buffer containing the pixel data is then read from video memory. From this, the model stencil is generated. Because the 3D model has to be transformed continuously, the model stencil has to be calculated for every new combination of transforms that is tested.

Calculation of the transforms requires comparing the two stencil images for each transform that is tested. This requires that the resolution of the two images be the same. A scaling is applied to the image stencil to make both the same resolution. Nearest neighbor scaling is used because no gradients are wanted in the stencil images. The image stencil is chosen to be scaled rather than the model stencil because that necessitates only one scaling operation. If the model stencil were chosen, the scaling would have to be done before calculating each transform to be tested. To calculate the proper transforms for alignment, the coordinate frames used in representing the 2D image and 3D model must be the same. This is done by scaling both representations to a canonical coordinate system (-1 to 1 in each axis direction).

4.2.1 Translation Alignment

The first transform to be considered is translation. Translation involves the application of one motion vector to all points in a set. The translation transform is calculated to center the rendered 3D model over the object of interest in the 2D image. First, the center of the projected 3D model in the model stencil is calculated. This is obtained by calculating the mean of all the points within the stencil designating the projected model. The center of the object of interest in the 2D image is calculated by obtaining the mean of all points. Then, the direction of translation to be applied to the 3D model is the difference between the two calculated centers. This difference can be viewed as a direction vector. This direction vector is repeatedly computed until the distance between the two centers is below some threshold determined satisfactory by a particular application. This process is iterative to account for the perspective projection case where the position of a 3D point projected onto the screen varies depending on depth. Note that the depth dimension is not considered during translation. Translating along the z-axis produces the same effect as scaling, which is handled later.
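A short sketch of this translation step, assuming the two stencils are boolean 2D arrays; the helper names are illustrative and not from the thesis implementation.

import numpy as np

def stencil_center(stencil):
    # Mean position of the pixels inside the stencil.
    ys, xs = np.nonzero(stencil)
    return np.array([xs.mean(), ys.mean()])

def translation_step(model_stencil, image_stencil):
    # Direction vector from the model stencil center to the image stencil center;
    # in the full pipeline this is applied and recomputed until the distance is small.
    return stencil_center(image_stencil) - stencil_center(model_stencil)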

4.2.2 Scale Alignment

The next transform to consider is scale. Scaling effectively involves multiplying the component values of each point by the given component scale. For example, scaling the point (3, -4, 8) by the component scales (2, 3, -0.5) gives the scaled point (6, -12, -4). For scale in this particular application, the z component scale is not considered because there is no correlation to depth in the 2D image when using one image as the source. The x and y components are scaled independently. First, the standard deviation of the points in the image stencil is calculated. Then the standard deviation from the model stencil image is calculated. Standard deviation is calculated as defined below.

\sigma = \sqrt{\frac{1}{N} \sum_i (x_i - x_m)^2}

The mean is represented as x_m and is obtained as done in the translation calculation. The data points used to calculate the standard deviation are the pixels that comprise the stencil images. The ratio of the image stencil standard deviation to the model stencil standard deviation is the scale factor to apply to the 3D model.
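A corresponding sketch of the scale step, again treating the stencils as boolean arrays; the per-axis standard deviation ratio gives the x and y scale factors. The names are illustrative, not from the thesis code.

import numpy as np

def stencil_std(stencil):
    # Per-axis standard deviation of the pixel positions inside the stencil.
    ys, xs = np.nonzero(stencil)
    return np.array([xs.std(), ys.std()])

def scale_factors(model_stencil, image_stencil):
    # Ratio of image stencil spread to model stencil spread, per axis (x, y).
    return stencil_std(image_stencil) / stencil_std(model_stencil)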

4.2.3 Rotation Alignment

The final transform to consider is the rotation transform. The rotation transform calculation requires exhaustive checking of rotations about all three axes. The rotation transform calculation consists of two main loops. The first loop generates a list of candidate solutions and the second loop refines the candidate solutions until the best result is achieved.

Initially, during the first pass, the search space for each axis is set to 0° to 360°. If the chosen δr is 60°, then the angles checked during this pass are 0°, 60°, 120°, and so forth. All combinations of rotations during the first pass are inserted into a list which is then sorted in descending order. The list allows several candidate solutions to be evaluated. The partition containing the true solution may initially have a fitness less than an incorrect solution because of the large range of the initial partitioning of the search space. After the first pass, the rotation transform calculation enters the refinement loop.

The refinement loop takes the angle, θ, and the range of the search space, δr. The new range to search is (θ − δr, θ + δr). All possible combinations of angles within the bounds at δr increments are checked for the best level of alignment. The value of θ that gave the best alignment is stored. The range of the search space, δr, is halved and the refinement loop is repeated. To quickly discard a bad candidate solution, a heuristic is checked within the main loop. If δr is sufficiently small and its level of alignment is still below some threshold, the processing for that candidate solution is exited and processing begins for the next candidate solution. This heuristic works because if δr has become small (fine tweaking of alignment is occurring), but the degree of alignment is not converging towards a sufficiently optimal level, one can safely discard that candidate as a possible solution. After a candidate solution has finished being searched, the best overlap percentage from that candidate solution is compared to the overall best result achieved from previous candidate solutions searched. If the current candidate produced a better result, the current candidate is assigned as the overall result. See Algorithm 4 for outlines of the first pass of the rotation transform calculation and Algorithm 5 for outlines of the main processing loop of the rotation transform calculation. The thresholds represent the following: τ1 is the minimum overlap per- centage considered a satisfactory alignment, τ2 is the minimum angle range before moving on to the next candidate solution, τ3 is the heuristic threshold on the angle range, τ4 is the heuristic threshold on the overlap percentage.

Algorithm 4 Rotation alignment first pass
  {Set upper and lower bounds}
  Bl = 0 and Bu = 360
  δr = 60
  {Do first pass}
  for pitch = Bl to Bu by δr do
    for yaw = Bl to Bu by δr do
      for roll = Bl to Bu by δr do
        Calculate fitness
        Store fitness and associated rotation angles
        Continue to next candidate solution
      end for
    end for
  end for
  Sort list in descending order
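A sketch of this first pass in Python, using a hypothetical fitness(pitch, yaw, roll) callback that renders the rotated model, recomputes its stencil, and returns the overlap percentage defined in Section 4.2.4; the callback is an assumption, not part of the thesis code.

def rotation_first_pass(fitness, delta_r=60):
    # Coarse sweep over the rotation search space at delta_r degree increments.
    candidates = []
    for pitch in range(0, 360, delta_r):
        for yaw in range(0, 360, delta_r):
            for roll in range(0, 360, delta_r):
                candidates.append((fitness(pitch, yaw, roll), pitch, yaw, roll))
    # Best candidates first, so the refinement loop examines them in order.
    candidates.sort(reverse=True)
    return candidates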

4.2.4 Alignment Metric

The metric for calculating the degree of alignment or fit is based upon the percentage of overlap between the stencil images. The overlap is calculated as follows. Iterate through every point in the stencil images. The bounds of a stencil are the edges denoting the outline of the object of interest depicted in the stencil. If a point in the model stencil is within the bounds of the stencil and the corresponding point in the image stencil is also within the bounds of its stencil, then the count of overlapping points, c_1, is incremented by one. If at least one of the points in either stencil image is within its stencil bound, then the overall count of points, c_2, is incremented by one. The percentage of overlap is

\text{fitness} = \frac{c_1}{c_2} \times 100

Since the overlap calculation is done for every combination of rotation, translation, and scale transforms, it would be beneficial if there were some way to optimize it. Calculating the overlap for the whole image would cause unnecessary pixel comparisons with the background and other irrelevant data that are not a part of the object of interest. The bounding box that encloses the object of interest in the original 2D image is precomputed and is used to reduce the search space when calculating the overlap between the two stencil images. This results in a good performance increase because of the frequency with which the overlap percentage has to be calculated.
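A sketch of the overlap metric, assuming boolean stencil arrays and a precomputed bounding box (x0, y0, x1, y1) around the object in the image stencil; the function name and signature are illustrative.

import numpy as np

def overlap_fitness(model_stencil, image_stencil, bbox):
    x0, y0, x1, y1 = bbox
    m = model_stencil[y0:y1, x0:x1]
    s = image_stencil[y0:y1, x0:x1]
    c1 = np.count_nonzero(m & s)    # inside both stencils
    c2 = np.count_nonzero(m | s)    # inside at least one stencil
    return 100.0 * c1 / c2 if c2 else 0.0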

4.2.5 Field of View Alignment

In addition to the three main transforms that are calculated, a final parameter is calculated to fine-tune the final projected 3D model to provide the best coverage over its representation in the 2D image. The field of view represents the horizontal and vertical viewing angle of the camera and is related to the focal length. The focal length in the virtual camera model corresponds to the distance to the projection plane. Since the parameters of the camera used to take the picture that contains the object of interest are not available, the focal length is not known. The distance to the projection plane affects the final projection of the 3D model onto the 2D screen. Thus, it would be beneficial if a good approximation of the real camera's focal length were known. Instead of varying the distance to the projection plane, the field of view is varied until a best fit using the already calculated transforms is found. Varying the field of view is equivalent to varying the distance to the projection plane.

The field of view alignment is done by checking through a range, δfov, around the original field of view used when calculating the three rigid transforms. Because the rigid transform calculations generated a close match, the search space for the field of view does not have to be large. The overlap percentage is calculated for the whole range at a step of 1◦ and the best result is the final field of view chosen.
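A sketch of this field-of-view sweep, assuming a hypothetical fitness_at_fov(fov) callback that re-renders the already aligned model at the given field of view and returns its overlap percentage; neither the callback nor the default range is from the thesis code.

def refine_fov(fitness_at_fov, fov0, delta_fov=10):
    # Sweep fov0 - delta_fov .. fov0 + delta_fov in 1 degree steps, keep the best.
    best_fov, best_fit = fov0, -1.0
    for fov in range(fov0 - delta_fov, fov0 + delta_fov + 1):
        fit = fitness_at_fov(fov)
        if fit > best_fit:
            best_fov, best_fit = fov, fit
    return best_fov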

4.2.6 Combined Transform Alignment

The process of aligning a 3D model with its representation in the 2D image will now be discussed. The first operation is to translate the object so that it lies near its representation in the 2D image. This translation is iteratively updated until the distance between the centers of the objects in the stencil images is below some threshold deemed satisfactory for the particular application. Then the scale is calculated from the stencil images as outlined previously. After the initial translation and scale have been calculated to place the projected 3D model in the vicinity of its representation in the 2D image, the rotation transformation is calculated. This process is carried out as outlined in Algorithms 4 and 5. There is the addition of a translate and scale transformation calculation after applying the rotation to the 3D model but before calculating the metric from the stencil images. This calculation is required to appropriately re-position and re-scale the newly rotated model. This is necessary because the initial translation and scale calculations before entering the rotation calculations have put the projected 3D model within the vicinity of its image representation, but the rotation of the initial 3D model most likely will not be correct. Thus, the computed scale factor only placed the scale near the true scale factors because varying rotations of an anisotropic 3D model will produce differing standard deviation values. After obtaining the model stencil of the translated, rotated, and scaled 3D model, the metric calculation is performed.

Along with storing the best current rotation values in the main processing loop of Algorithm 5, the scale and translation used are also recorded. Once the final rotation has been chosen, these stored scale and translation transforms are also applied to produce the final transform aligned 3D model. Algorithm 6 gives an outline of the combined alignment calculations. Some detailed portions of the rotation loops were removed for clarity.

Prior to calculating translation and scale transforms in the main processing loop of Algorithm 5, the model stencil of the whole video frame buffer is required. Using the bounding box around the stencil of the 2D image produces incorrect results because the 3D model center and standard deviation will be incorrect if a large portion of the projected 3D model lies outside this bounding box. When calculating the metric, using the bounding box for both stencil images is possible because the only concern is how well the projected 3D model overlaps its image in the original 2D image. Any portion outside the bounding box is of no concern since the overlap calculation penalizes a candidate for having a non-overlapping pixel from either stencil image.

4.3 Texture Coordinate Mapping

After the 3D model has been aligned with the desired image of the model in the 2D texture, the mapping of texture coordinates to vertex coordinates must be generated. Under the orthographic projection model, the depth value is not taken into account and, thus, one can simply map the x and y components of the (x, y, z) model vertex directly to the (x, y) texture coordinate. If the coordinate systems used internally to represent the 3D model and the 2D image are the same, then no scaling is required. Otherwise, the two coordinate systems must first be transformed to the same scale before calculating the mapping.

4.3.1 Orthographic Projection

In the orthographic case, the 3D points must be transformed by the modeling/viewing matrix. The projection matrix is not required because the projection consists of discarding the depth value in the basic case. Thus, the projection matrix has no effect on the resultant projected 2D coordinates. However, the mapping to window coordinates may be combined into the orthographic projection matrix to provide a streamlined rendering pipeline.

4.3.2 Perspective Projection

In the case of perspective projection, the depth value is taken into account when projecting the final 3D object to the 2D screen. This results in a skew of the x and y coordinates dependent upon the depth. Similar to the case of orthographic projection, the coordinate systems of the 3D model and the 2D image must be the same or first be scaled to the same frame before calculating the mapping. For the perspective case, the 3D points must be transformed first by the modeling/viewing matrix and then by the projection matrix. The projection matrix defines the viewing frustum that takes into account the depth value when projecting the 3D points to the 2D image plane.
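An illustrative sketch of generating texture coordinates for both cases, assuming 4x4 NumPy matrices like those in Sections 3.1 and 3.2 and vertices given as (x, y, z); the remapping from the canonical [-1, 1] range to [0, 1] texture coordinates is an assumption of the example, not a detail taken from the thesis.

import numpy as np

def texture_coordinates(vertices, modelview, projection=None):
    coords = []
    for v in vertices:
        p = modelview @ np.append(v, 1.0)         # modeling/viewing transform
        if projection is not None:                # perspective case
            p = projection @ p
            p = p / p[3]                          # perspective divide
        u = 0.5 * (p[0] + 1.0)                    # [-1, 1] -> [0, 1]
        t = 0.5 * (p[1] + 1.0)
        coords.append((u, t))
    return coords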

Algorithm 5 Rotation alignment refinement loop
  {Refinement loop}
  while best overall fitness ≤ τ1 do
    Select candidate solution to process
    δr = 30
    Calculate upper and lower bounds for rotation search space
    while δr ≥ τ2 do
      for pitch = Bl to Bu by δr do
        for yaw = Bl to Bu by δr do
          for roll = Bl to Bu by δr do
            Calculate fitness
            {Heuristic check}
            if δr ≤ τ3 and fitness ≤ τ4 then
              Move onto next candidate solution at beginning of refinement loop
            end if
            if fitness ≥ best local fitness then
              Update best local fitness and its associated rotation values
            end if
          end for
        end for
      end for
      δr = δr/2
      Calculate upper and lower bounds for rotation search space
    end while
    if best local fitness > best overall fitness then
      Update best overall fitness and its associated rotation values
    end if
    Move onto next candidate solution at beginning of refinement loop
  end while

Algorithm 6 Combined alignment algorithm
  Compute initial translations to bring 3D model within vicinity of its 2D image
  Compute scale to size 3D model close to its 2D image
  {Do first pass of rotation loop}
  for All angles to test do
    Obtain stencil of rotated object
    Compute translation and scale
    Obtain stencil of rotated, translated, and scaled object
    Calculate and store fitness
    Continue on to next candidate solution
  end for
  while best overall fitness ≤ τ1 do
    {Do refinement pass of rotation loop}
    Choose candidate solution
    for All angles to test do
      Obtain stencil of rotated object
      Compute translation and scale
      Obtain stencil of rotated, translated, and scaled object
      Calculate fitness and update best local fitness as necessary
    end for
    Update best overall fitness as necessary
    Continue to next candidate solution
  end while
  Calculate field of view

CHAPTER 5 RESULTS

The results of testing the alignment algorithm are presented next. The alignment algorithm was tested on five different datasets representing various 3D models. All tests were run on a 1.7 GHz Pentium 4 with 256 MB of system memory. The video card was an Nvidia Riva TNT2. The core clock speed of this video card is 150 MHz and it possesses one vertex pipeline and two pixel pipelines. For comparison purposes, a current high-end computer as of this writing is a 4 GHz processor with 2 GB of system memory. The most recent card as of this writing is the Nvidia GeForce 7800. The core clock speed of this card is 430 MHz with eight vertex pipelines and 24 pixel pipelines. OpenGL was the graphics library used for rendering.

The chosen threshold values are described next. For the translation transform, the only threshold that must be determined is the distance between the two calculated object centers in the stencil images. This value was chosen to be 0.1. For the first pass of the rotation transform, the only value to determine is δr, representing the partitioning of the rotation search space. This was chosen to be 45° for one batch of tests and 20° for another run of tests. For the rotation transform, there are τ1, τ2, τ3, and τ4. They represent, respectively, the minimum overlap percentage the application considers satisfactory, the minimum angle range before moving on to the next candidate solution, the heuristic threshold on the angle range, and the heuristic threshold on the overlap percentage. These values were chosen to give a good balance between accuracy and speed, with an emphasis on accuracy. The speed concern was only to choose values that would still create a good alignment but not require traversing through all possible candidate solutions. The values were chosen to be: τ1 = 95.0, τ2 = 1.0, τ3 = 6, and τ4 = 70.0.

There were five models chosen to be aligned. These were a teapot, a human head, a machine part, a Cessna airplane, and the range data of the Will Rogers statue. The segmentations of these images used when calculating the alignment were done manually so as to validate the alignment algorithm. The segmented images used when calculating the alignment are given in Figure 5.1. Table 5.1 lists the models used during testing and the number of points each contains.

Figure 5.1: Segmentation Images

Table 5.1: Model Sizes

Model           Number of Points
Teapot                       620
Face                         689
Machine Part                 163
Cessna                      6795
Will Rogers                46438

The alignment calculations were run for all models, and the percentage of overlap between each pair of stencil images was recorded. Also recorded were the time taken from the beginning of the alignment process to the completion of the first pass of the rotation alignment and the total time taken from the beginning of the alignment process to the completion of the refinement loop. Table 5.2 lists the results of the alignment tests run using an initial δr of 45°, and Table 5.3 lists the results using an initial δr of 20°. One notices that some overlap percentages are below the chosen threshold of 95.0. For these cases, the alignment calculation was run until the list of candidate solutions was exhausted, and the best match found throughout the list was chosen. The first-pass time is measured up to the point just after the list of candidate solutions has been sorted, because the number of operations is equivalent across models up until that point. During the main processing loop of the rotation transform, the number of candidate solutions that are tested depends on the initial orientation of the 3D model and on the model's shape, so the number of candidate solutions tested varies between models.
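For reference, the overlap percentage recorded in the tables can be thought of along the lines of the following NumPy sketch. The exact weighting used by the alignment metric of Section 4.2.4 may differ, so this is only an assumed reading of the measure.

import numpy as np

def overlap_percentage(model_stencil, image_stencil):
    # Both stencils are boolean arrays of equal shape; True marks object pixels.
    object_pixels = np.count_nonzero(image_stencil)
    if object_pixels == 0:
        return 0.0
    overlapped = np.count_nonzero(np.logical_and(model_stencil, image_stencil))
    return 100.0 * overlapped / object_pixels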

Table 5.2: Alignment Calculations with δr = 45°

Model          Pass 1 Time (seconds)   Refinement Time (seconds)   Overlap Percentage
Teapot                         13.95                       16.13               95.342
Face                           16.26                       17.92               98.287
Machine Part                   16.11                       19.13               96.311
Cessna                         48.01                     8388.56               86.924
Will Rogers                   262.87                    58320.43               81.371

Table 5.3: Alignment Calculations with δr = 20°

Model          Pass 1 Time (seconds)   Refinement Time (seconds)   Overlap Percentage
Teapot                        132.27                      137.71               96.103
Face                          180.34                      182.25               98.331
Machine Part                  187.51                      191.34               96.413
Cessna                        584.13                    56520.14               85.442
Will Rogers                  3210.14                           -               80.327

Figure 5.4 depicts the initial orientation of the 3D teapot model before the alignment calculations were performed, along with the error in the texture-mapped result. The first image of Figure 5.2 shows the result of aligning the 3D model with the 2D image of the teapot. The 2D image is based on the same image that was manually segmented; the segmentation image used for aligning the 3D model with the 2D image is exactly what would have been obtained from an ideal segmentation algorithm. The second image of Figure 5.2 shows the result of generating the mapping between the 3D model vertices and the 2D image coordinates. Noticeable in the second image of Figure 5.4 is the consequence of not having a 100% accurate alignment of the 3D model on top of the 2D image using the proposed algorithm: the borders of the model stencil extend beyond the borders of the representation of the teapot in the 2D image, which results in the incorrect inclusion of background on the mapped 3D teapot. The two images in Figure 5.3 show the result of the alignment algorithm and of the mapping produced from the obtained alignment values.

Figures 5.6 and 5.7 show the results of the alignment calculations performed on the 3D face model. Figure 5.5 shows the initial placement of the 3D model and the mapping error after the mapping from vertices to texture coordinates has been generated. Apparent again is the effect of not having an absolutely accurate overlap of the projected 3D model over its representation in the 2D image. Figures 5.9 and 5.10 show the results of the alignment calculations on the machine part model, and Figures 5.12 and 5.13 show the results of running the alignment algorithm on the Cessna model. The initial orientations before alignment and the texture mapping errors for these two models are shown in Figures 5.8 and 5.11, respectively.

Finally, Figures 5.15 and 5.16 show the results of running the alignment algorithm on the range data of the Will Rogers statue. Additional problems arise with range data: holes are present in the rendered 3D model, which affects the metric calculation because each hole penalizes the current candidate solution as a non-overlapped pixel. Technically this is correct, but for the purpose of scoring an alignment it is not semantically correct, because the interior holes do in fact lie over the representation of the object in the 2D image. These holes, particularly in the base, are also the likely cause of the low overlap percentage obtained from running the alignment algorithm on the Will Rogers data. Some of the background ends up visibly mapped to the 3D model in the second image of Figure 5.14.

Figure 5.2: Teapot Alignment Results for δr = 45°

Figure 5.3: Teapot Alignment Results for δr = 20°

Figure 5.4: Teapot Initial Orientation and Result Error for δr = 20°

Figure 5.5: Face Initial Orientation and Result Error for δr = 20°

Figure 5.6: Face Alignment Results for δr = 45°

Figure 5.7: Face Alignment Results for δr = 20°

Figure 5.8: Mechanical Part Initial Orientation and Result Error for δr = 20°

Figure 5.9: Mechanical Part Alignment Results for δr = 45°

Figure 5.10: Mechanical Part Alignment Results for δr = 20°

Figure 5.11: Cessna Initial Orientation and Result Error for δr = 20°

Figure 5.12: Cessna Alignment Results for δr = 45°

Figure 5.13: Cessna Alignment Results for δr = 20°

Given that in all but the Will Rogers test data the segmented image and texture were obtained from an actual projection of the 3D model, the matching results should have been 100%. Although τ1 was set to 95.000, it is still known that 100% was not obtained, because some of the alignment results fell below that threshold. However, even in these cases, the resulting best solution was always found within the top 10% of candidate solutions, implying that the sorting of the list generated from the first pass is performing its intended function. One cause of the imperfect results is that, even though the alignment process selects the best alignment from the current partitioning of the search space, the global optimum may be hidden within a partition that obtains a lower score at the current δr; in that case the perfect alignment is missed completely. Searching the full 360° range for each axis of rotation at 1° increments would avoid this, but it is an extremely slow process.

Figure 5.14: Will Rogers Initial Orientation and Result Error for δr = 20°

Figure 5.15: Will Rogers Alignment Results for δr = 45°

Figure 5.16: Will Rogers Alignment Results for δr = 20°

A possible cause of the imperfect alignment is that the scale retains a "memory" of the previous scale, because the scale calculated at each iteration is multiplied by the previous scale. The "correct" way to calculate the scale would be to render the 3D model at its original scale and compute a new scale from the obtained stencil before calculating the translation and rotation at each iteration. In most cases, however, this produces only a minuscule variation, because the difference in the standard deviation of the stencil images between successive rotations is itself likely to be very small. The translation is not a likely cause of the imperfect alignment: the initial translations bring the center of the 3D model within a very close vicinity of the object's representation in the 2D image, which allows the translation calculated at every iteration of the main processing loop to relocate the 3D model to the correct center. The speed of the algorithm depends significantly on the processing power of the video card. This is evident in the timing results from the large difference in processing time between the Will Rogers data and the three smaller data sets: the teapot, the face, and the machine part.

CHAPTER 6 CONCLUSION AND FUTURE WORK

The texture mapping of 3D objects has become common practice with the advent of increased computing power. The increase in realism that texture mapping provides to a rendered object or scene is substantial. Because of this, many forms of media, including movies, games, and simulations, are expected to include texture maps in their computer-generated scenes. The purpose of this research has been to determine a way to automate the texture mapping process. An algorithm for automatically aligning a 3D object at arbitrary translation, scale, and rotation with its representation in a 2D image has been proposed, tested, and analyzed.

6.1 Advantages and Disadvantages

The advantage of an automated process for texture mapping 3D objects is that it eliminates the need for constant human input, freeing an individual or group to work on other aspects of a project that require art content. An automated method of texture mapping also allows the use of batch jobs to texture map large data sets comprising many 3D objects and their corresponding 2D images; an individual could hand such a data set to a program and return to find a set of texture-mapped objects. The proposed algorithm provides good coverage of the target section of the 2D image by the 3D object.

There are three main disadvantages. First, the extended borders of the proposed alignment result in the inclusion of incorrect portions of the image. Second, even in the instances of a "fast" result, the algorithm is still not rapid enough to be considered interactive; possible methods for increasing the speed are given in the following section. The third disadvantage involves the instances where an overlap percentage greater than the selected threshold is not found, causing the whole candidate list to be traversed. In these cases, the best solution available was found in the first 10% of the sorted list of possible solutions, which implies that it may be justifiable to cut off the search after the first 10% without loss of accuracy, even if a solution with an overlap percentage greater than the selected threshold is not found.

6.2 Future Work and Improvements

Two possible improvements to consider involve the time taken to obtain the stencil images and the inclusion of incorrect regions of the 2D image in the texture mapped model. A future area of work involves the generation of correct lighting and shading values from novel viewpoints.

6.2.1 Stencil calculations

A significant amount of time is spent comparing the two stencil images, and this is the first area to consider for improvement. The vast majority of the pixels in the stencil images are interior pixels of the object of interest. If the silhouette of the projected 3D model were obtained instead, the points of this silhouette could be kept in a list. By silhouette, we mean the outline of the projected 3D model and the outline of its representation in the 2D image. A very efficient method for computing the silhouette is described in [16], which uses a precomputed search tree to obtain the 2D silhouette for any orientation. The tree is constructed once from a high-resolution original mesh and can be used to represent the stencil of the original mesh or even of a lower-resolution mesh. Extraction of these silhouettes is very fast: the time required to extract the silhouette from a 400,000-face 3D model in [16] is 0.0282 seconds, measured on a machine (a 550 MHz Pentium III) that is slow relative to currently available hardware. Computing the edge points for the image stencil can be done once and stored, since that silhouette is static. This approach scales well to large objects and high-resolution images because the interior points, which make up the majority of the object of interest, are not considered when calculating the fitness.

One possible method of determining the best fit from these two lists first requires finding a common point; the lists are then traversed, the difference between each pair of points is recorded, and the average of these differences quantifies the fitness of the alignment. Another possible method involves shooting rays out from the center of the object of interest in the 2D image and recording the intersections of each ray with the texture silhouette and with the model silhouette. The difference between these two intersections is recorded for every ray, and the average again quantifies the fitness of the alignment.
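A minimal sketch of the ray-based measure, assuming both silhouettes are supplied as lists of boundary points and taking the farthest boundary point in each angular bin as the ray intersection; the binning scheme and the number of rays are choices made here for illustration only.

import numpy as np

def ray_silhouette_fitness(center, texture_silhouette, model_silhouette, num_rays=360):
    # Lower return values indicate a closer match between the two silhouettes.
    def radial_profile(points):
        d = np.asarray(points, dtype=float) - np.asarray(center, dtype=float)
        angles = np.arctan2(d[:, 1], d[:, 0])
        radii = np.hypot(d[:, 0], d[:, 1])
        bins = ((angles + np.pi) / (2.0 * np.pi) * num_rays).astype(int) % num_rays
        profile = np.zeros(num_rays)
        np.maximum.at(profile, bins, radii)   # farthest boundary point per ray direction
        return profile

    texture_radii = radial_profile(texture_silhouette)
    model_radii = radial_profile(model_silhouette)
    # Average radial difference over all ray directions quantifies the alignment.
    return float(np.mean(np.abs(texture_radii - model_radii)))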

6.2.2 Extended borders

By extended borders, we mean the regions near the edges of the model stencil that extend beyond the edges of the object's representation in the image stencil. One possible improvement to the extended-borders problem is to exploit the fact that the borders of the obtained model stencil usually correspond to surfaces that are not directly facing the camera. Not generating a mapping for vertices whose normals are angled too far from the camera's view vector may help prevent the incorrect mapping of some border vertices. However, more research is required to still allow the mapping of interior vertices whose normals happen to be nearly orthogonal to the view vector.
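A sketch of the normal-based test described above, assuming per-vertex unit normals and a unit view vector pointing from the surface toward the camera; the cutoff angle is an assumed parameter that would need tuning.

import numpy as np

def mappable_vertex_mask(normals, view_dir, max_angle_deg=75.0):
    # Keep only vertices whose normals lie within max_angle_deg of the view direction.
    cos_limit = np.cos(np.radians(max_angle_deg))
    facing = np.asarray(normals, dtype=float) @ np.asarray(view_dir, dtype=float)
    return facing >= cos_limit   # boolean mask: True = generate a texture coordinate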

6.2.3 Lighting and shading

One area of interest is the lighting and shading of the model. Given multiple images under different lighting conditions, an additive or interpolative technique could possibly be used to approximate how the 3D scene would appear under a new lighting condition. Two variables to approximate in the generated images are angle and intensity. Varying the intensity can simulate a different time of day or a different amount of light striking the 3D model. Varying the angle can simulate shadows moving across the surface of the 3D model as the direction of an incoming directional light source changes (similar to normal mapping). Combining the newly calculated color intensities with the laser range model of an object may produce very realistic renderings.
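The simplest interpolative scheme hinted at above is a per-pixel blend of two registered photographs taken under different lighting; anything beyond this (explicit models of light angle or intensity) would require further work, so the sketch below is only an assumed starting point.

import numpy as np

def blend_lighting(image_a, image_b, t):
    # image_a and image_b are float arrays in [0, 1] with identical shape;
    # t = 0 reproduces image_a, t = 1 reproduces image_b.
    t = float(np.clip(t, 0.0, 1.0))
    return (1.0 - t) * np.asarray(image_a, dtype=float) + t * np.asarray(image_b, dtype=float)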

REFERENCES

[1] Paul Besl and Neil McKay. A method for registration of 3d shapes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 1992.

[2] James Blinn. Simulation of wrinkled surfaces. In SIGGRAPH78. SIGGRAPH, 1978.

[3] James Blinn and Martin Newell. Texture and reflection in computer generated images. Communications of the ACM, 1976.

[4] Jean-Daniel Boissonnat. Geometric structures for three-dimensional shape representation. ACM Transactions on Graphics, 1984.

[5] Chad Carson, Serge Belongie, Hayit Greenspan, and Jitendra Malik. Blobworld: Image segmentation using expectation-maximization and its application to image querying. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2002.

[6] Brian Curless and Marc Levoy. A volumetric method for building complex models from range images. Computer Graphics, 1996.

[7] A. Dempster, N. Laird, and D. Rubin. Maximum likelihood from incomplete data via the em algorithm. J. Royal Statistical Society, 1977.

[8] James Foley and Andries van Dam. Fundamentals of Interactive Computer Graphics. Addison-Wesley Reading, MA, 1982.

[9] David Forsyth and Jean Ponce. Computer Vision: A Modern Approach. Prentice Hall Upper Saddle River, NJ, 2003.

[10] R. Haralick, C. Lee, K. Ottenberg, and M. Nolle. Analysis of solutions of the three point perspective pose estimation problem. In Proceedings of IEEE International Conference on Computer Vision and Pattern Recognition. IEEE, 1991.

[11] Paul Heckbert. Fundamentals of texture mapping and image warping. Master's thesis, University of California at Berkeley, 1989.

[12] Francis Hill. Computer Graphics Using OpenGL. Prentice Hall NY, 2000.

[13] Marc Levoy, Kari Pulli, Brian Curless, Szymon Rusinkiewicz, David Koller, Lucas Pereira, Matt Ginzton, Sean Anderson, James Davis, Jeremy Ginsberg, Jonathan Shade, and Duane Fulk. The digital michelangelo project: 3d scanning of large statues. In SIGGRAPH00. SIGGRAPH, 2000.

[14] D. Lowe. Fitting parametrized three-dimensional models to images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 1991.

[15] Lawrence Roberts. Homogeneous matrix representation and manipulation of n-dimensional constructs. Technical report, Lincoln Laboratory, MIT, 1966.

[16] Pedro Sander, Xianfeng Gu, Steven Gortler, Hugues Hoppe, and John Snyder. Silhouette clipping. In Proceedings of the 27th annual conference on Computer graphics and interactive techniques. SIGGRAPH, 2000.

[17] Emanuele Trucco and Alessandro Verri. Introductory Techniques for 3-D Computer Vision. Prentice Hall Upper Saddle River, NJ, 1998.

[18] Greg Turk and Marc Levoy. Zippered polygon meshes from range images. In SIGGRAPH94. SIGGRAPH, 1994.

[19] Alan Watt and Fabio Policarpo. The Computer Image. Addison-Wesley NY, 1998.

[20] W. Wells, R. Kikinis, W. Grimson, and F. Jolesz. Adaptive segmentation of mri data. International Conference on Computer Vision, Virtual Reality and Robotics in Medicine, 1995.
