COLOR FEATURE INTEGRATION WITH DIRECTIONAL RINGLET INTENSITY

FEATURE TRANSFORM FOR ENHANCED OBJECT TRACKING

Thesis

Submitted to

The School of Engineering of the

UNIVERSITY OF DAYTON

In Partial Fulfillment of the Requirements for

The Degree of

Master of Science in Electrical Engineering

By

Kevin Thomas Geary

Dayton, Ohio

December, 2016

COLOR FEATURE INTEGRATION WITH DIRECTIONAL RINGLET INTENSITY

FEATURE TRANSFORM FOR ENHANCED OBJECT TRACKING

Name: Geary, Kevin Thomas

APPROVED BY:

Vijayan K. Asari, Ph.D.
Advisory Committee Chairman
Professor, Electrical and Computer Engineering

Eric J. Balster, Ph.D.
Committee Member
Associate Professor, Electrical and Computer Engineering

Theus H. Aspiras, Ph.D.
Committee Member
Research Engineer and Adjunct Faculty, Electrical and Computer Engineering

Robert J. Wilkens, Ph.D., P.E.
Associate Dean for Research and Innovation
Professor, School of Engineering

Eddy M. Rojas, Ph.D., M.A., P.E.
Dean, School of Engineering


ABSTRACT

COLOR FEATURE INTEGRATION WITH DIRECTIONAL RINGLET INTENSITY

FEATURE TRANSFORM FOR ENHANCED OBJECT TRACKING

Name: Geary, Kevin Thomas
University of Dayton

Advisor: Dr. Vijayan K. Asari

Object tracking, both in wide area motion imagery (WAMI) and in general use cases, is often subject to many different challenges, such as illumination changes, background variation, rotation, scaling, and object occlusions. As WAMI datasets become more common, so too do color WAMI datasets. When color data is present, it can offer very strong potential features to enhance the capabilities of an object tracker. A novel color histogram-based feature descriptor is proposed in this thesis research to improve the accuracy of object tracking in challenging sequences where color data is available. The use of a three dimensional color histogram is explored, and various color spaces are tested.

It is found to be effective but overly costly in terms of calculation time when comparing reference features to the test features. A reduced, two dimensional histogram is proposed, created from three channel color spaces by removing the intensity/luminosity channel before calculating the histogram. The two dimensional histogram is also evaluated as a feature for object tracking, and it is found that the HSV two dimensional histogram performs significantly better than other histograms, and that the two dimensional histogram performs at a level very near that of the three dimensional histogram while being an order of magnitude less complex in the feature distance calculation.

The proposed color feature descriptor is then integrated with the Directional Ringlet Intensity Feature Transform (DRIFT) object tracker. The two dimensional HSV color histogram is enhanced further by making use of the DRIFT Gaussian ringlets as a mask for the histogram, resulting in a set of weighted histograms as the color feature descriptor. This is calculated alongside the existing DRIFT features of intensity and Kirsch mask edge detection. The distance scores for the color feature and DRIFT features are calculated separately, given the same weight, and then added together to form the final hybrid feature distance score. The combined proposed object tracker, C-DRIFT, is evaluated on both challenging WAMI data sequences and challenging general case tracking sequences that include head, body, object, and vehicle tracking. The evaluation results show that the proposed C-DRIFT algorithm significantly improves on the average accuracy of the DRIFT algorithm. Future work on the integrated algorithm includes integrated scale change handling created from a hybrid of normalized color histograms and existing DRIFT rescaling methods.


TABLE OF CONTENTS

ABSTRACT ……………………………………………………………………………. iii

LIST OF ILLUSTRATIONS …………………………………………………………… vii

LIST OF TABLES …………………………………………………………………… viii

CHAPTER 1: INTRODUCTION………………………………………………………... 1

CHAPTER 2: LITERATURE SURVEY ………………………………………………... 5

2.1 Image Registration Techniques ………………………………………………… 5

2.2 Tracking Algorithm Feature Extraction …………………………………………. 6

2.3 Color Feature Extraction ………………………………………………………… 8

CHAPTER 3: OVERVIEW OF DRIFT ALGORITHM ……………………………...... 10

3.1 DRIFT Tracking Technique …………………………………………………….. 10

3.2 DRIFT Tracking Results ……………………………………………………….. 19

CHAPTER 4: COLOR HISTOGRAM BASED FEATURES FOR TRACKING ….….. 22

4.1 Three Dimensional Color Histogram Features ………………………………… 22

4.2 Two Dimensional Color Histogram Features ………………………………….. 29

4.3 Color Feature Fusion with DRIFT ……………………………………………... 31

CHAPTER 5: OBJECT TRACKING EVALUATIONS …………………..…………… 36

5.1 Datasets and Testing Setup …………………………………………………….. 36

5.2 Testing Strategies and Results …………………………………………………..39


5.3 Discussion …………………………………………………………………... 43

CHAPTER 6: CONCLUSION ………………………….……………………………… 45

REFERENCES …………………………………………………………………………. 48


LIST OF ILLUSTRATIONS

1.1 Comparison of grayscale intensities of colorful cars ………………………………... 3

1.2 Color feature extraction process …………………………………………………….. 4

3.1 DRIFT object tracking method …………………………………………………….. 11

3.2 STTF nonlinear enhancement process ….…………………………………………... 12

3.3 Structure of the DRIFT feature descriptor ……………………..…………………… 14

3.4 Gaussian ring kernel, with rings ρ = 3 ……….………….………………………….. 15

3.5 Kirsch compass kernels …...………………………………………………………… 16

3.6 Object tracking on CLIF and LAIR datasets ………………………………………. 21

4.1 RGB cube histogram ……………………………………………………………….. 23

4.2 Structure of color space testing tracking algorithm ………………………………... 27

4.3 Sample frame from Egtest01 dataset …….…………………………………………. 28

4.4 Heat map of HSV 2D histogram …………………………………………………… 30

4.5 Diagram of Color DRIFT Structure ………………………………………………… 32

4.6 FAST feature comparison between pair of frames ………………………………… 35

5.1 Frames from object tracking sequences corresponding to Table 5.1 ………………. 37

5.2 Frames from object tracking sequences corresponding to Tables 5.2 and 5.3 …….. 38

5.3 Plot of thresholded overlap success for Visual Tracker sets ……………………….. 40

5.4 Plot of thresholded center error success for Visual Tracker sets …………………... 41


LIST OF TABLES

3.1 Object Tracking Frame Detection Accuracy for CLIF and LAIR sets …...... 20

4.1 3D histogram tracking results by color space ……………………………………… 28

4.2 2D histogram tracking results by color space ……………………………………… 31

4.3 Comparison of tracker time per frame for 3D and 2D histograms ………………… 31

5.1 VIVID object tracking overlap …………………………………………………… 39

5.2 Visual Tracker Benchmark object tracking frame detection accuracy …………… 42

5.3 Visual Tracker Benchmark object tracking average center error ………………… 43


CHAPTER 1

INTRODUCTION

In recent years, wide area motion imagery (WAMI) data has become increasingly common, with an abundance of use cases. These uses can range from surveillance related tasks, to search and rescue operations, to traffic pattern analysis. In many use cases, tracking of objects, especially vehicles, is necessary. Because of the large area covered by such imagery, computer vision techniques for automated tracking become increasingly important. However, object tracking in WAMI data is a challenging task due to many factors. These factors can include camera motion, variances in object illumination, object occlusion, changes in object scale, object rotation, and variations in background.

Additionally, the resolution of individual objects in such imagery tends to be very low.

This is because while the WAMI images themselves are very high resolution, the images are captured from high in the air. Even when multiple sensors are deployed in an array to capture the highest possible resolution image of the area, the resolution of objects such as vehicles will remain relatively low. WAMI data can also contain sequences where, due to complex lighting conditions, the contrast of an object can be very similar to that of the background. When any number of these complicating factors is present in the imagery, it becomes much more difficult to accurately track an object, compromising the purpose of the WAMI data.


Many of these challenging conditions are overcome when using the Directional Ringlet Intensity Feature Transform (DRIFT) tracking algorithm [1-2]. The DRIFT algorithm consists of several stages. The tracker is initialized with the location of the target object and the size of its bounding box. Before reference features are calculated, the image goes through a preprocessing step in which intensity illumination and spatial enhancement is applied. The reference features of the object are then created from the grayscale image using intensity features, and edge features are created using a Kirsch mask. These features are then filtered using Gaussian ringlet filters to create the completed model [3]. Tracking then begins by applying the same image enhancement techniques to the incoming frame.

An enhanced sliding window search approach is used in the search area, with the same intensity and edge features being calculated for each candidate region. Once all candidates have been accumulated, the earth mover's distance (EMD) is calculated to determine the distance score between each region and the reference. If the lowest distance is below a threshold, the location of that candidate region is recorded as the object location. If the distance score is below a second threshold, the reference model is updated. Lastly, a Kalman filter is applied to estimate the location of the target object in the next frame. This estimate is used as the center of the search area in the next frame.

Advances in sensor technologies and storage media have allowed for improvements to be made to the wide area images, however. One very notable change is the rise in the availability of color data. Many WAMI tracking algorithms, including DRIFT, only make use of grayscale imagery data. Without the use of color data, it becomes possible for a tracker to confuse two similar-looking objects that are near the same grayscale intensity level, but are actually two distinct colors. This can be seen very clearly in Figure 1.1.


Figure 1.1. Comparison of grayscale intensities of colorful cars. [53]

By integrating color data as a new feature, however, a tracker can both overcome this specific difficulty and become more robust overall.

The main objective of this research is to further develop the DRIFT tracking algorithm into a more robust object tracker. The primary addition to the algorithm is the integration of a color histogram-based feature. The color histogram uses the hue-saturation-value (HSV) image to construct a two-dimensional histogram from the hue and saturation channels of the color model. This color feature has a distance score that is calculated separately from the base DRIFT feature distance score, as it is calculated using the two-dimensional Earth Mover's Distance (EMD) formula. The total distance score for a candidate region is then calculated as the roughly equally weighted sum of the DRIFT feature distance and the color histogram distance. The color histogram feature is also used for handling scale changes, where the normalized color feature distance is compared against the normalized color feature of the same region with different bounding box sizes. Lastly, an image registration technique that makes use of the features from accelerated segment test (FAST) [18, 19] is used to update the Kalman filter object location estimate based on the movement of the camera from frame to frame. Figure 1.2 shows the proposed color feature technique.


Figure 1.2. Color feature extraction process

Chapter 2 consists of a literature survey of related tracking algorithms and associated techniques. Chapter 3 introduces the DRIFT tracking algorithm and presents tracking results from WAMI datasets. Chapter 4 describes the integration of the color feature with the DRIFT algorithm and other changes to the algorithm. Chapter 5 provides tracking results for the combined tracker, as well as discussion of the results. Chapter 6 consists of the conclusion and future works discussion.


CHAPTER 2

LITERATURE SURVEY

A literature survey was conducted on several major areas of interest: image registration techniques, feature extraction for tracking algorithms in general, and color feature based tracking algorithms. Section 2.1 introduces numerous techniques for image registration and the feature extraction methods used. While feature extraction in general is discussed in Section 2.2, emphasis is placed on algorithms that are suitable for use with WAMI datasets. Lastly, Section 2.3 discusses techniques for extracting color features suitable for object tracking.

2.1 Image Registration Techniques

Image registration is the process of overlaying two or more images of the same scene taken at different times, by different sensors, or from different viewpoints, with the goal of geometrically aligning the images [12]. This is typically accomplished by detecting common features within the images, then finding a geometric transformation that transforms the points from the sensed image to the reference image. Therefore, the feature detection technique is the primary area of study for image registration. The Harris Corner Detector [13] uses the eigenvalues of the second moment matrix to find corners within a scene. Combined with edge detection based on local auto-correlation, this allowed for feature detection within an image. This technique is not invariant to scale, however, and was improved first in [14] by using local extrema over different combinations of γ-normalized derivatives to incorporate automatic scale detection. It was further improved in [15] to create similar scale invariant features by making use of the Hessian matrix and the Laplacian to generate the points more quickly, with a speed increase of roughly four times.

The Scale Invariant Feature Transform (SIFT) [16] creates a set of feature descriptors that are invariant to scale changes, translation, and rotation by using an approximation of the

Laplacian of Gaussian created with a Difference of Gaussians filter to increase the computation speed. These features showed success in object recognition even in partial occlusion. The Speeded Up Robust Features (SURF) point descriptor [17] makes use of integral images for image convolutions, a Hessian matrix-based measure, and a distribution-based descriptor to achieve similar level results with an increase in performance over the SIFT features. Lastly, the Features from Accelerated Segment Test

(FAST) [18, 19] technique achieves superior performance and feature quality by combining point-based and edge-based techniques. FAST features, as the name suggests, are able to be computed even faster than SURF techniques. The proposed addition of image registration to the DRIFT tracking algorithm makes use of the FAST features.

2.2 Tracking Algorithm Feature Extraction

There exist many different feature extraction techniques and feature descriptors. The previously mentioned SIFT, SURF, and FAST methods create scale invariant feature descriptors that can be used for object tracking, but they are not suitable for tracking objects that are low resolution within the frame, such as those found in WAMI sets. The Mean-shift [24] method uses center weighted matching from histograms regularized by spatial masking with an isotropic kernel. This allows for resistance to distortion and rotation, but is susceptible to similar intensity regions. Patch matching techniques [25] break the target into multiple semi-overlapping areas, and track these areas separately. This method allows for strong handling of partial occlusions. Particle filter trackers [26] use Bayesian theory to track in non-linear and non-Gaussian systems, though with the drawback of model degradation over long periods of tracking.

Many other methods also make predominant use of the intensity data to form their feature models. The texture descriptor Local Binary Pattern (LBP) [27] method works by using the neighbors of each pixel to create a feature descriptor. A rotation invariant version, Rotation Invariant Local Binary Pattern (RILBP) [28], makes use of the magnitude of the discrete Fourier transform of the histogram to form its descriptor. By doing so it is able to maintain feature quality while also achieving rotation invariance. The commonly-used Histogram of Oriented Gradients (HOG) [20] method uses simple gradient filters to create a gradient image, from which a grid of histograms is created. HOG and its variants see success in pedestrian tracking, but tend not to perform as well under low resolution conditions, such as WAMI data. The Rectangular Grid (RECT) [29] method partitions the image using a rectangular grid for feature extraction. A similar method, the equal distance circular grid (CIRC-ED), partitions the image into equal distance circular rings from a center point. By using circles, the method achieves rotation invariance, and is also able to handle partial occlusions. The equal area circular grid (CIRC-EA) adopts the circular grid technique but partitions the image into rings of equal area rather than equal spacing. Improving on this concept, the Gaussian Ringlet Intensity Distribution (GRID) [3] constructs its features from Gaussian ringlet partitions with an emphasis on the center ring to obtain a stronger feature while maintaining the illumination and rotation invariance.


Lastly, the Directional Ringlet Intensity Feature Transform (DRIFT) [1] takes the GRID intensity feature and fuses it with a Kirsch mask edge detection feature to create a robust feature descriptor. The proposed feature descriptor is a combination of the DRIFT features and a color histogram feature.

2.3 Color Feature Extraction

Various methods exist for extracting features from image color data. Collins and Liu [32] use different linear combinations of the R, G, and B channels to create candidate features. These linear combinations are able to form a variety of common features, such as single channel features, image intensity, or two-channel features. Most other algorithms use some form of a color histogram, sometimes using other color spaces. Swain and Ballard [33] propose three dimensional (3D) RGB histograms, showing that they are a stable representation of an object both in the presence of occlusion and change of view. Their proposed feature is used for image indexing in large databases. Nummiaro et al. [34] use an elliptical search area and a 3D RGB histogram with 8x8x8 bins that uses inverse-square weighting to give less weight to pixels further from the center. Combined with a particle filter, the resulting features achieve success in pedestrian, vehicle, and face tracking.

Takala and Pietikainen [35] propose using a combination of the 3D RGB histogram and a color correlogram in order to take into account where within the object a color is located. Their tracker succeeds in tracking pedestrians in low light conditions. Wang and Yagi [36] create a feature descriptor from histograms in three different color spaces, using linear histograms from the R, G, and B channels of RGB, the hue and saturation channels from HSV, and the r and g channels from the normalized rg color space. This combination of features allows their tracker to succeed in challenging WAMI data scenarios. Xiao et al. [37] use an HSV color histogram combined with a block-sparse representation in order to provide illumination change handling and occlusion handling, as well as handling of object color for increased accuracy. The algorithm's update strategy also achieves success in sequences with complex backgrounds and object morphological changes. The proposed Color DRIFT method uses a somewhat similar approach, combining an HSV histogram with non-color features in order to create a more robust feature descriptor.


CHAPTER 3

OVERVIEW OF DRIFT ALGORITHM

An overview of the directional ringlet intensity feature transform (DRIFT) tracking algorithm [1] is presented in this chapter. First, the DRIFT object tracking technique in general is discussed. Next, an overview of the DRIFT feature extraction technique is presented. This includes details on the intensity feature, the Kirsch mask edge detection feature, and the use of Gaussian ringlets to achieve rotation invariance and superior tracking results. Lastly, a discussion of tracking using the DRIFT algorithm occurs, as well as a presentation of results obtained by the algorithm when used for tracking with WAMI datasets. The tracking results presented are from the Columbus Large Image Format (CLIF) and Large Area Image Recorder (LAIR) datasets. Afterwards, the motivation for a new color feature is laid out.

3.1 DRIFT Tracking Technique

The DRIFT object tracking method can be broken up into five main modules: initialization, search area selection, feature extraction, center point selection, and reference updates. Figure 3.1 provides an overview. The target object’s initial center point and dimensions are provided either by the user, or by a target detection system.


[Figure: object center selection (initial frame) → search area selection → feature extraction → center point selection → Kalman tracker updated → next frame]

Figure 3.1. DRIFT object tracking method [1].

Before the initial and each subsequent frame is used, the image undergoes a preprocessing stage. This stage uses a nonlinear image enhancement function to improve the illumination of the image. Objects that are overexposed or underexposed are able to be made more visible, which in turn allows for better features to be extracted from the affected regions. The function's preprocessing method uses the Self-Tunable Transformation Function (STTF) enhancement algorithm [16-17]. First, if the image is a color image, it is converted to grayscale, because the process is only designed to be used on an intensity image. Then, the inverse sine nonlinear enhancement stage is performed.

This function is given by

\[
I_{enh}(x, y) = \frac{2}{\pi} \sin^{-1}\!\left( I_{norm}(x, y)^{\,q} \right)
\tag{3.1}
\]

where I_norm(x, y) is the normalized intensity image and q is the control parameter, which is obtained by using a multi-scale Gaussian in the pixel neighborhood. This is calculated using


Figure 3.2. STTF nonlinear enhancement process [21]

\[
q =
\begin{cases}
\tan\!\left( \dfrac{\pi}{2.225} I_{Mnorm}(x, y) \right) + 0.162, & I_{Mnorm}(x, y) \geq 0.3 \\[2ex]
\dfrac{1}{60.6} \ln\!\left( \dfrac{1}{0.3} I_{Mnorm}(x, y) \right) + 0.60, & I_{Mnorm}(x, y) < 0.3
\end{cases}
\tag{3.2}
\]

where I_Mnorm(x, y) is the normalized Gaussian mean image. Lastly, the image undergoes Laplacian sharpening and a contrast enhancement algorithm to correct the contrast.
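As a concrete illustration of the preprocessing stage, the following Python/NumPy sketch applies the inverse-sine transfer function of Eqs. (3.1) and (3.2). The single fixed Gaussian scale `sigma` is an illustrative assumption; the actual STTF uses a multi-scale Gaussian neighborhood, and the Laplacian sharpening and contrast correction steps are omitted here.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def sttf_enhance(gray, sigma=15.0):
    """Minimal sketch of the inverse-sine enhancement of Eqs. (3.1)-(3.2).

    `gray` is an 8-bit grayscale frame; `sigma` is an assumed single Gaussian
    scale standing in for the multi-scale neighborhood described in the text.
    """
    i_norm = gray.astype(np.float64) / 255.0       # normalized intensity image
    i_mnorm = gaussian_filter(i_norm, sigma)       # normalized Gaussian mean image

    # Control parameter q, computed per pixel from the Gaussian mean (Eq. 3.2).
    q = np.where(
        i_mnorm >= 0.3,
        np.tan((np.pi / 2.225) * i_mnorm) + 0.162,
        (1.0 / 60.6) * np.log(np.maximum(i_mnorm, 1e-6) / 0.3) + 0.60,
    )

    # Inverse-sine nonlinear transfer function (Eq. 3.1).
    i_enh = (2.0 / np.pi) * np.arcsin(np.clip(i_norm, 0.0, 1.0) ** q)
    return np.clip(i_enh, 0.0, 1.0)
```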

Next the now-enhanced image undergoes search area selection. The basic process of using a sliding window search is improved upon to limit the search area, thus reducing the computation time significantly. This is accomplished by using an intensity-based pixel selection method that removes areas that have large illumination differences when compared to the reference image. This is done by finding the average intensity value of each search area, computed as


\[
I_{avg}(x, y) = \frac{1}{d_x \cdot d_y} \sum_{m = x - \frac{d_x}{2}}^{x + \frac{d_x}{2}} \; \sum_{n = y - \frac{d_y}{2}}^{y + \frac{d_y}{2}} I(m, n)
\tag{3.3}
\]

where I_avg(x, y) is the array of averaged object intensities under test, I(m, n) is the image from the search area, d_x is the area width, and d_y is the area length. The validity of regions within the search area is then determined by comparing the region average intensities to high and low limits that are computed from the reference image as

\[
limit_{high} = \frac{1}{d_x \cdot d_y} \sum_{m=0}^{d_x} \sum_{n=0}^{d_y} I_{ref}(m, n) + l
\tag{3.4}
\]

and

\[
limit_{low} = \frac{1}{d_x \cdot d_y} \sum_{m=0}^{d_x} \sum_{n=0}^{d_y} I_{ref}(m, n) - l
\tag{3.5}
\]

where l is the limiting intensity difference factor. A lower l value will remove more regions, but with an increased risk of removing the best-matched area. Through experimentation, it was found that the l value can be made fairly small, on the order of 15 for an 8 bit image, with minimal accuracy loss [1]. A notable exception is for partially occluded objects. Such objects may require a higher l value in order to not be eliminated as a possible match. Also, if a large global illumination change occurs, this process may eliminate most or all of the search area, resulting in the algorithm falling back on other means of tracking, such as the Kalman tracker.
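A minimal sketch of this candidate pruning is given below; the sliding step of one pixel, the top-left-corner indexing of candidate regions, and the function name are assumptions made for illustration.

```python
import numpy as np

def prune_candidates(search_img, ref_patch, box_w, box_h, l=15):
    """Sketch of the intensity-based region elimination of Eqs. (3.3)-(3.5).

    Returns top-left corners of sliding-window candidates whose mean intensity
    lies within +/- l of the reference mean; l=15 is the value the text quotes
    for 8-bit imagery.
    """
    ref_mean = ref_patch.astype(np.float64).mean()
    limit_high = ref_mean + l                      # Eq. (3.4)
    limit_low = ref_mean - l                       # Eq. (3.5)

    h, w = search_img.shape
    keep = []
    for y in range(0, h - box_h + 1):
        for x in range(0, w - box_w + 1):
            region = search_img[y:y + box_h, x:x + box_w].astype(np.float64)
            if limit_low <= region.mean() <= limit_high:   # Eq. (3.3)
                keep.append((x, y))
    return keep
```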

Once the candidate regions have been determined, features are extracted for each candidate. DRIFT uses two different features, intensity features and Kirsch mask edge features. The structure of the DRIFT feature descriptor is shown in Figure 3.3.


[Figure: input reference frame object → Kirsch kernel filtered images → Gaussian ringlet masks → per-ring histograms → concatenated DRIFT feature descriptor]

Figure 3.3. Structure of the DRIFT feature descriptor [1].

The first segment of DRIFT feature extraction uses intensity-based histograms which are partitioned using Gaussian ringlets. The ring partitions of this method are similar to the rings used in CIRC [23], with the difference being that CIRC uses a discrete filter, where this method uses a Gaussian ring function. Figure 3.4 shows an example of a Gaussian ring kernel with three rings. The center ring of the Gaussian kernel is given by

\[
G_1(x, y) = C \cdot \exp\!\left( -\frac{(x - x_o)^2 + (y - y_o)^2}{\frac{1}{2}(R_i - R_{i-1})^2} \right)
\tag{3.6}
\]

where C is the scaling factor, x_o and y_o are the center point coordinates, and R_i is the radius of ring i. Dividing the largest dimension of the object by the total number of rings, ρ, gives the radius R_i. All subsequent Gaussian rings are given by



Figure 3.4. Gaussian ring kernel, with rings ρ = 3.

\[
G_i(x, y) = C \cdot \exp\!\left( -\frac{\left( \sqrt{(x - x_o)^2 + (y - y_o)^2} - \frac{1}{2}(R_i - R_{i-1}) \right)^2}{\frac{1}{2}(R_i - R_{i-1})^2} \right), \quad i = 2 \ldots \rho.
\tag{3.7}
\]

An intensity histogram is calculated for each of the rings, using the corresponding Gaussian ring as a mask. The histograms are then normalized using

\[
N_i = \sum_{v=1}^{V} H_i(v), \quad i = 1 \ldots \rho
\tag{3.8}
\]

where N_i is the normalization factor for ring i, H_i(v) is the v-th bin of the histogram for ring i, and V is the number of bins in the histogram. Each value of the histogram is then normalized as

\[
H_{intensity,i}(v) = \frac{H_i(v)}{N_i}, \quad v = 1 \ldots V, \quad i = 1 \ldots \rho.
\tag{3.9}
\]

If desired, additional weighting can then be applied to the histograms in order to place additional emphasis on the central ring or rings. The DRIFT algorithm as used for the results evaluation uses 32 histogram bins.


East: [ 5 −3 −3; 5 0 −3; 5 −3 −3]      West: [−3 −3 5; −3 0 5; −3 −3 5]
Northeast: [−3 −3 −3; 5 0 −3; 5 5 −3]   Southwest: [−3 5 5; −3 0 5; −3 −3 −3]
North: [−3 −3 −3; −3 0 −3; 5 5 5]       Northwest: [−3 −3 −3; −3 0 5; −3 5 5]
Southeast: [ 5 5 −3; 5 0 −3; −3 −3 −3]  South: [ 5 5 5; −3 0 −3; −3 −3 −3]

Figure 3.5. Kirsch compass kernels [1]

The second segment of feature extraction applies a Kirsch mask to the grayscale image. The Kirsch operator is a non-linear edge detector [37], allowing edge features to be easily extracted. This operator uses a single kernel mask rotated in eight different directions to find the maximum edge strength in 45 degree increments. Figure 3.5 shows the kernels. Only four kernels, however, were found to be needed in order to create a strong edge feature [1]. By using fewer kernels, the execution time is able to be reduced.

To obtain the edge features, first the frame is filtered with each of the four kernels.

It is necessary to filter either the entire frame or at least a sizable region around the area under test. This is necessary to prevent inaccuracies that could otherwise occur during normalization. The four filtered images then have the Gaussian ringlet masks applied to them before being normalized using (3.9). The four masked histograms are then combined using

\[
H_{kirsch,i}(v) = \sum_{u=1}^{4} H_{u,i}(v), \quad v = 1 \ldots V
\tag{3.10}
\]

where H_{u,i} is the feature histogram for Kirsch mask u and ring i, and H_{kirsch,i}(v) is the combined feature histogram.


In order to create the Kirsch filtered image feature histogram, the filtered image, denoted as I_Kirsch, is thresholded so that all values are in the range [0, 255]. The thresholded image is obtained by using

\[
I_{tr} =
\begin{cases}
0, & I_{Kirsch} \leq -127 \\
I_{Kirsch}, & -127 < I_{Kirsch} < 127 \\
255, & I_{Kirsch} \geq 127
\end{cases}
\tag{3.11}
\]

The threshold is applied so that the filtered images can be used in the same way as the 8-bit intensity images for feature extraction. To prevent excessive information cutoff, the filtered image is calculated as

\[
I_{Kirsch}(x, y) = \sum_{m=-1}^{1} \sum_{n=-1}^{1} Kirsch_i(m, n) \left( \frac{1}{b} I(m + x, n + y) \right)
\tag{3.12}
\]

where Kirsch_i is the Kirsch kernel in the i-th direction and b is a constant used to reduce the maximum and minimum values of I_Kirsch(x, y), the Kirsch filtered image. This constant, b, should be selected to reduce the information cut out by the thresholding step.

Once both the intensity and Kirsch mask feature histograms are completed, they are vertically concatenated, resulting in the final feature descriptor of size 2ρ by 32, assuming the default value of 32 histogram bins is maintained. The completed feature descriptor is as shown in Figure 3.3.
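The following sketch completes the descriptor with the Kirsch edge segment of Eqs. (3.10)-(3.12) and stacks it with the intensity segment from the sketch above. The thesis states that four of the eight kernels suffice but does not say which four, so the subset below and the shift by 127 used to map the thresholded response onto an 8-bit range are assumptions.

```python
import numpy as np
from scipy.ndimage import convolve

# Four of the eight compass kernels from Figure 3.5 (assumed subset).
KIRSCH = [
    np.array([[ 5, -3, -3], [ 5,  0, -3], [ 5, -3, -3]]),   # East
    np.array([[-3, -3, -3], [-3,  0, -3], [ 5,  5,  5]]),   # North
    np.array([[-3, -3, -3], [ 5,  0, -3], [ 5,  5, -3]]),   # Northeast
    np.array([[-3, -3, -3], [-3,  0,  5], [-3,  5,  5]]),   # Northwest
]

def kirsch_ring_feature(gray_patch, rings, bins=32, b=5):
    """Sketch of the Kirsch edge feature of Eqs. (3.10)-(3.12)."""
    combined = np.zeros((len(rings), bins))
    for k in KIRSCH:
        resp = convolve(gray_patch.astype(np.float64) / b, k.astype(np.float64))  # Eq. (3.12)
        resp = np.clip(resp, -127, 127) + 127        # thresholded as in Eq. (3.11), shifted to [0, 254] (assumption)
        idx = np.minimum((resp / 256.0 * bins).astype(int), bins - 1)
        for i, g in enumerate(rings):
            hist = np.zeros(bins)
            np.add.at(hist, idx.ravel(), g.ravel())
            combined[i] += hist / max(hist.sum(), 1e-12)   # normalize per ring, then sum directions, Eq. (3.10)
    return combined                                  # rho x 32 edge feature

# Final DRIFT descriptor (2*rho x 32), concatenating the two segments
# (ring_histograms is the sketch from Section 3.1 above):
# descriptor = np.vstack([ring_histograms(gray_patch, rings), kirsch_ring_feature(gray_patch, rings)])
```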

The complete feature descriptor is calculated for each candidate region, and then compared to the reference descriptor. The classification method used for comparison is the Earth Mover's Distance (EMD) [8]. The raw distance values calculated by the EMD are then weighted so that candidate regions closer to the center of the search region are emphasized. This is done because candidates that are closer to the estimated location are more likely to be the correct object. The candidate region with the lowest weighted distance score is then determined to be the most likely center point. This score is then also used as a confidence measure. If the score is below a confidence threshold, then the region is considered to be the new object center point. However, if the score is above the threshold, a bad match or occlusion is considered to have occurred, and the Kalman filter [38] location estimate will be used instead. This estimate will continue to be used until the score is low enough again that a positive match is considered to have occurred. The occlusion threshold, T_occ, is found as

\[
T_{occ} = \alpha \cdot EMD_{min,avg}
\tag{3.13}
\]

where α is a constant value that determines the valid EMD range for all tracked objects, and EMD_min,avg is the average minimum EMD of matched objects in previous frames.

Lastly, the new center point information is used to update the Kalman filter and the reference object feature descriptor. The updates will only occur, however, if the match's EMD score is below a second, more stringent threshold. This is done to ensure that only strong matches are used to update the reference features. The reference update threshold, T_ref, is defined as

\[
T_{ref} = \beta \cdot EMD_{min,avg}
\tag{3.14}
\]

where β is a constant that determines the valid EMD range for strong matches that will update the reference histogram, with β < α.

If the likely center point's EMD score is below the threshold for a strong match, then the reference object feature histogram is updated. This update is done to ensure that the reference features remain a good descriptor of the object's current appearance. Global or regional lighting changes, as well as any changes in viewpoint, make the update necessary. The update to the reference histogram feature descriptor for the k-th frame, H_ref(k), is calculated as

\[
H_{ref}(k) = \left( \eta \cdot H_{new}(k) + (1 - \eta) \cdot H_{ref}(k - 1) \right) \cdot \tau + H_{orig} \cdot (1 - \tau)
\tag{3.15}
\]

where η and τ are learning rate parameters such that 0 < η, τ < 1, H_new(k) is the histogram feature descriptor of the tracked object in the new frame, and H_orig is the original object feature descriptor used to initialize the tracking.
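A compact sketch of the matching and reference-update logic of Eqs. (3.13)-(3.15) follows; the parameter values in the signature are the ones quoted later in Chapter 5, and the function name is illustrative.

```python
def update_tracker_state(emd_min, emd_min_avg, h_new, h_ref, h_orig,
                         alpha=5.0, beta=3.0, eta=0.2, tau=0.9):
    """Occlusion test and reference-feature update of Eqs. (3.13)-(3.15)."""
    t_occ = alpha * emd_min_avg                    # Eq. (3.13)
    t_ref = beta * emd_min_avg                     # Eq. (3.14)

    matched = emd_min < t_occ                      # otherwise fall back on the Kalman estimate
    if matched and emd_min < t_ref:
        # Strong match: blend new, previous, and original descriptors, Eq. (3.15).
        h_ref = (eta * h_new + (1.0 - eta) * h_ref) * tau + h_orig * (1.0 - tau)
    return matched, h_ref
```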

Additionally, the Kalman filter is updated so that it can predict the location of the target object in the next frame. The Kalman filter tracking method can be written as

\[
x_k = A x_{k-1} + B u_k + w_k
\tag{3.16}
\]

and

\[
z_k = H x_k + v_k
\tag{3.17}
\]

where k is the frame number, w_k and v_k are assumed to be zero mean Gaussian noise, x_k is the state model that includes the position and velocity information, u_k is the control vector, A is the state transition matrix, H is the measurement matrix, B is the control-input matrix, and z_k is the measurement vector for the position.
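The sketch below sets up a constant-velocity instance of Eqs. (3.16)-(3.17) and one predict/update cycle. The state layout and the noise covariances Q and R are illustrative assumptions; the thesis does not specify them.

```python
import numpy as np

def make_constant_velocity_kf(dt=1.0):
    """Constant-velocity Kalman model; state x = [px, py, vx, vy], position measured."""
    A = np.array([[1, 0, dt, 0],
                  [0, 1, 0, dt],
                  [0, 0, 1,  0],
                  [0, 0, 0,  1]], dtype=float)     # state transition matrix
    H = np.array([[1, 0, 0, 0],
                  [0, 1, 0, 0]], dtype=float)      # measurement matrix (position only)
    Q = np.eye(4) * 1e-2                           # process noise covariance (w_k), assumed value
    R = np.eye(2) * 1.0                            # measurement noise covariance (v_k), assumed value
    return A, H, Q, R

def kf_step(x, P, z, A, H, Q, R):
    """One predict/update cycle; z is the measured object center for the frame."""
    x_pred = A @ x                                 # Eq. (3.16) with no control input (B u_k = 0)
    P_pred = A @ P @ A.T + Q
    y = z - H @ x_pred                             # innovation from the measurement model, Eq. (3.17)
    S = H @ P_pred @ H.T + R
    K = P_pred @ H.T @ np.linalg.inv(S)
    return x_pred + K @ y, (np.eye(len(x)) - K @ H) @ P_pred
```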

3.2 DRIFT Tracking Results

The DRIFT tracking algorithm was designed primarily for use in WAMI tracking scenarios, and as such it has been evaluated on two WAMI datasets. These datasets are the Columbus Large Image Format (CLIF) [16] and the Large Area Image Recorder (LAIR) publicly released datasets. The data in these sets is captured at approximately two frames per second, and a total of eight object sequences are used. Challenges from this set include vehicle turns, multiple similar objects in the scene, and very small, low resolution objects. The objects tracked are on the order of 8-10 pixels in size. Each sequence is between 10 and 20 frames in length. The DRIFT search area for this test was fixed at 15 pixels, which was decided by observation of the object movements within the sets. Results are presented using the Frame Detection Accuracy (FDA) metric [51]. The numerical results are all taken from [1], the paper in which DRIFT is first presented in this form.

Table 3.1. Object Tracking Frame Detection Accuracy (%) for CLIF and LAIR sets [1].

Method            CLIF 1   CLIF 2   CLIF 3   CLIF 4   LAIR 1   LAIR 2   LAIR 3   LAIR 4   Average
HOG [20]          40.10    8.05     16.11    11.35    9.94     13.98    5.26     22.99    15.97
LBP [27]          46.45    8.50     5.00     47.30    5.26     17.43    10.09    16.11    19.52
RILBP [28]        58.00    7.55     9.17     8.70     9.36     24.26    5.26     14.37    17.08
RECT [29]         82.40    75.40    61.39    6.60     47.69    73.85    81.08    24.65    56.63
CIRC-EA [30]      65.90    75.25    93.06    68.30    35.61    81.58    77.69    55.90    67.91
CIRC-ED [31]      69.90    69.25    83.06    68.50    39.39    49.75    79.12    68.06    65.88
GRID-EA [3]       18.00    69.35    84.72    60.50    37.23    76.23    76.69    42.57    58.16
GRID-ED [3]       18.00    69.10    81.53    69.70    38.47    73.11    76.77    26.04    56.59
WCIRC-EA [3]      75.35    72.75    81.53    67.60    36.19    75.49    77.51    40.69    65.89
WCIRC-ED [3]      70.00    67.85    81.53    70.65    38.73    79.69    84.43    67.50    70.05
WGRID-EA [3]      73.25    68.80    81.67    69.90    37.75    76.23    73.91    37.78    66.16
WGRID-ED [3]      80.30    68.35    81.53    73.40    38.21    78.37    82.12    85.21    73.44
DRIFT [1]         85.70    68.25    88.06    79.30    61.40    86.51    83.86    85.28    79.80
DRIFT – STTF [1]  82.60    58.10    94.31    71.90    68.81    90.21    88.86    86.32    80.14

Table 3.1 above shows the results for many different tracking algorithms, which were presented in Chapter 2. DRIFT – STTF is the version of DRIFT explained in the previous section, with the image enhancement preprocessing stage. The other version is DRIFT without the preprocessing step.

Figure 3.6. Object tracking on CLIF and LAIR datasets [1]. The top left and bottom right are from CLIF, the others are from LAIR

It can be seen that DRIFT even without the preprocessing step still averages higher than any other algorithm, while DRIFT – STTF performs the best of all.

Though DRIFT is demonstrated to be a strong algorithm, there is still room for improvement. The DRIFT features, intensity and edge detection, are not especially strong when it comes to partial occlusions. Additionally, the use of grayscale-only intensity can be especially limiting in cases where objects with radically different colors have nearly identical intensity images, as demonstrated in Figure 1.1. Both of these limitations can be overcome by the addition of a color feature, as this thesis proposes.


CHAPTER 4

COLOR HISTOGRAM BASED FEATURES FOR TRACKING

In this chapter, the development process of the color histogram feature is discussed, as well as its integration with DRIFT. First, the use of three dimensional color histograms using the red-green-blue (RGB), YCbCr, Lab, and hue-saturation-value (HSV) color models is discussed. Next, the transition to using a two-dimensional color histogram is explained, as well as the benefits of using this feature over a three dimensional histogram. Then, the evaluation of the final feature, the two dimensional color histogram created from the HSV hue and saturation channels, is presented, including tracking results when using this as the sole tracking feature. Lastly, the integration with the DRIFT algorithm discussed in Chapter 3 is presented.

4.1 Three Dimensional Color Histogram Features

A three dimensional histogram, much like a single dimension histogram, is formed by sorting pixels into bins based on their value, and then counting the number of pixels in each bin. Figure 4.1 illustrates a three dimensional RGB color cube, broken into 8 bins per dimension. This means that the histogram this cube represents has a total of 512 bins, laid out in an 8x8x8 grid.
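A minimal sketch of such a histogram is shown below: each pixel's (R, G, B) triple is quantized to 8 levels per channel and dropped into one of the 512 cells. The function name and normalization are illustrative choices.

```python
import numpy as np

def color_histogram_3d(patch_rgb, bins=8):
    """8x8x8 three-dimensional color histogram of an H x W x 3 uint8 region (Figure 4.1)."""
    idx = (patch_rgb.astype(np.int64) * bins) // 256              # per-channel bin index, 0..bins-1
    flat = idx[..., 0] * bins * bins + idx[..., 1] * bins + idx[..., 2]
    hist = np.bincount(flat.ravel(), minlength=bins ** 3).astype(np.float64)
    return hist.reshape(bins, bins, bins) / max(hist.sum(), 1.0)  # normalized 512-bin histogram
```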


Figure 4.1. RGB cube histogram. In this 8x8x8 histogram, values from RGB [0,0,0] to [31,31,31] would be placed into bin {1,1,1}

Color histograms, most commonly RGB histograms, have been used in a variety of tracking applications [4-6] as a strong descriptor of object color. They have also been shown to be a stable representation of objects over change in view, and in differentiation between multiple objects [7]. While RGB remains the most commonly used color space for finding color features, other color spaces can be used as well. For the purposes of this research, several color models were tested. These color models include RGB, YCbCr, Lab, and HSV. These color models were selected for both their three channel representations of color and their relatively well-defined value ranges. In RGB, each channel ranges from 0-255, with integer values only. YCbCr in popular usage will typically also have this range. Lab value ranges are less well-defined, but typically range from 0-100 for the L channel and -128 to +127 for the A and B channels, using continuous values. HSV values are likewise continuous, but on a 0 to 1 range for all three channels.


The equations for conversions from the RGB color space to the various color spaces used are presented below. The conversion to YCbCr is calculated as

\[
\begin{aligned}
Y &= 0 + (0.299 \cdot R) + (0.587 \cdot G) + (0.114 \cdot B) \\
C_b &= 128 - (0.168736 \cdot R) - (0.331264 \cdot G) + (0.5 \cdot B) \\
C_r &= 128 + (0.5 \cdot R) - (0.418688 \cdot G) - (0.081312 \cdot B)
\end{aligned}
\tag{4.1}
\]

where R is the red channel value, G is the green channel value, and B is the blue channel value [53]. The conversion from RGB to HSV is performed as

\[
H' =
\begin{cases}
\dfrac{G - B}{C}, & \text{if } M = R \\[1.5ex]
\dfrac{B - R}{C} + 2, & \text{if } M = G \\[1.5ex]
\dfrac{R - G}{C} + 4, & \text{if } M = B
\end{cases}
\qquad
H = 60^{\circ} \times H'
\]

\[
S =
\begin{cases}
0, & \text{if } V = 0 \\[1ex]
\dfrac{C}{V}, & \text{otherwise}
\end{cases}
\qquad
V = M
\tag{4.2}
\]

where

\[
M = \max(R, G, B), \qquad m = \min(R, G, B), \qquad C = M - m
\]

and again the red, green, and blue channels are denoted as R, G, and B, respectively [54].

Lastly, the conversion to LAB is performed in two steps. First, the RGB is converted to XYZ [55], and then XYZ is converted to LAB [56]. This is done as


\[
\begin{bmatrix} X \\ Y \\ Z \end{bmatrix} =
\frac{1}{0.17697}
\begin{bmatrix}
0.49000 & 0.31000 & 0.20000 \\
0.17697 & 0.81240 & 0.01063 \\
0.00000 & 0.01000 & 0.99000
\end{bmatrix}
\begin{bmatrix} R \\ G \\ B \end{bmatrix}
\tag{4.3}
\]

and

\[
\begin{aligned}
L &= 116 \cdot f\!\left(\frac{Y}{Y_n}\right) - 16 \\
a &= 500 \left[ f\!\left(\frac{X}{X_n}\right) - f\!\left(\frac{Y}{Y_n}\right) \right] \\
b &= 200 \left[ f\!\left(\frac{Y}{Y_n}\right) - f\!\left(\frac{Z}{Z_n}\right) \right]
\end{aligned}
\tag{4.4}
\]

where

\[
f(t) =
\begin{cases}
t^{1/3}, & \text{if } t > \left(\dfrac{6}{29}\right)^{3} \\[1.5ex]
\dfrac{1}{3}\left(\dfrac{29}{6}\right)^{2} t + \dfrac{4}{29}, & \text{otherwise}
\end{cases}
\]

and X_n, Y_n, and Z_n are the CIE XYZ tristimulus values of the reference white point, which in this context are constants defined as X_n = 95.047, Y_n = 100.00, Z_n = 108.883.
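In practice these conversions are available directly in OpenCV, as in the sketch below; note that for 8-bit input OpenCV scales the channels differently from the expressions above (for example, H is stored in [0, 180] and the channel order Y, Cr, Cb is used), so histogram bin ranges must be chosen accordingly.

```python
import cv2

def to_color_spaces(frame_bgr):
    """Convert a BGR frame (as loaded by OpenCV) to the color spaces tested here."""
    ycrcb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2YCrCb)   # Eq. (4.1), channel order Y, Cr, Cb
    hsv = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HSV)       # Eq. (4.2), H in [0, 180] for uint8 input
    lab = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2LAB)       # Eqs. (4.3)-(4.4), scaled to 8-bit ranges
    return ycrcb, hsv, lab
```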

In order for the color histograms to be of use in tracking, it must be possible to compare the histogram feature to a reference feature and determine the feature distance. The distance calculation needs to take into account that the bins of a multidimensional histogram have neighboring bins that are also in multiple dimensions. For single dimension histograms, such as those seen in the DRIFT tracking algorithm [1], the Earth Mover's Distance (EMD) calculation is preferred for determining the distance score. EMD calculates the minimum cost that must be paid to transform one distribution into another [8]. As the name suggests, it can be visualized as a bulldozer moving mounds of dirt in order to match one pattern to another. The EMD, given two histograms P and Q, is defined as [8]:


\[
\text{EMD}(P, Q) = \left( \min_{\{f_{ij}\}} \sum_{i,j} f_{ij} d_{ij} \right) \Big/ \left( \sum_{i,j} f_{ij} \right)
\quad \text{s.t.} \quad f_{ij} \geq 0, \quad \sum_{j} f_{ij} \leq P_i, \quad \sum_{i} f_{ij} \leq Q_j, \quad \sum_{i,j} f_{ij} = \min\!\left( \sum_i P_i, \sum_j Q_j \right)
\tag{4.5}
\]

where {f_ij} denotes the flows, representing the amount transported from the i-th supply to the j-th demand, and d_ij is the ground distance between bin i and bin j [9]. Furthermore, Pele and Werman propose in [10] the EMD-hat, an improvement on EMD that does not require the histograms to be normalized (i.e., the histograms do not have to have the same "amount" in each). The equation for the EMD-hat is

\[
\widehat{\text{EMD}}_{\alpha}(P, Q) = \left( \min_{\{f_{ij}\}} \sum_{i,j} f_{ij} d_{ij} \right) + \left| \sum_i P_i - \sum_j Q_j \right| \cdot \alpha \max_{i,j} d_{ij}
\tag{4.6}
\]

which is subject to the same constraints as the base EMD, and with α ≥ 0.5. This gives additional robustness to the calculation, allowing its use in additional applications.

For the purposes of the three dimensional histograms, the EMD can be extended to operate across multiple histogram dimensions, as done by Pele and Werman in [9]. This allows for the distance metric to be calculated for any dimension of histogram, but at a cost. The complexity of the EMD calculation scales exponentially with the number of dimensions, so the computation time also increases roughly by an order of magnitude for each dimension of the histogram data [11]. Using three dimensional data pushes the limit of acceptable execution speed when used in a tracking application where many candidate regions are checked every frame.
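The sketch below shows one way to evaluate Eq. (4.5) for a histogram of any dimensionality using OpenCV's transportation-problem solver: the histogram is flattened into a signature whose rows carry a bin weight followed by the bin coordinates, and the Euclidean distance between bin coordinates serves as the ground distance. The cost of the underlying flow problem grows with the square of the number of bins, which is the scaling problem described above.

```python
import numpy as np
import cv2

def emd_nd(hist_p, hist_q):
    """Multi-dimensional EMD between two equally shaped histograms via cv2.EMD."""
    def signature(h):
        coords = np.indices(h.shape).reshape(h.ndim, -1).T.astype(np.float32)  # bin coordinates
        weights = h.ravel().astype(np.float32).reshape(-1, 1)                  # bin weights
        return np.hstack([weights, coords])

    emd, _, _ = cv2.EMD(signature(hist_p), signature(hist_q), cv2.DIST_L2)
    return emd
```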

In order to determine which color space model gives the strongest three dimensional color histogram, tracking tests were conducted using color histogram features from each of the color models.


Figure 4.2. Structure of color space testing tracking algorithm.

The tracker used for testing consisted of a Kalman filter for next frame location prediction, a sliding window search area, histograms computed in the color space undergoing test, a three dimensional EMD calculation to find each candidate region's distance from the reference model, and an update to the reference model. Figure 4.2 shows a flow chart of the tracking algorithm's function. The test was conducted on the Egtest01 dataset, which is a part of the PETS2005 dataset produced by the United States Defense Advanced Research Projects Agency (DARPA) under the Video Verification of Identity (VIVID) program. The purpose of the dataset is to provide challenging sets against which an automated tracking algorithm can be tested. The Egtest01 set features a convoy of vehicles on a runway that turns around in a wide circle before accelerating down the tarmac. The car of interest then accelerates further and passes two other vehicles. This set was chosen because it is not overly challenging for even a fairly basic tracker such as the one used in this test. Figure 4.3 shows an example frame from the set.

The metric used for calculating the accuracy of the tracker when using a color histogram in each different color space was the average percent overlap that the tracker's bounding box had with the truth data's bounding box. This was calculated on a per-pixel basis.


Figure 4.3. Sample frame from Egtest01 dataset. The target vehicle has the red bounding box around it.

It is important to note that in all cases the bounding box size from the tracker was kept constant, at 25x25 pixels. The truth data bounding box is not a constant size, so in some cases even if the tracker's bounding box is centered on the truth data's bounding box, the score for that frame is not necessarily one hundred percent. The results for the various color spaces are presented in Table 4.1.

Table 4.1. 3D histogram tracking results by color space

Color Space    RGB      LAB      YCbCr    HSV
Egtest01       0.7956   0.8156   0.1110   0.8197

It can be seen from the tracking test results that HSV produced the most accurate track, though not by a significant margin over the LAB color space. RGB performed well, but still marginally lower. YCbCr scored much lower, because it lost the track fairly early in the sequence, first jumping from one vehicle to another, and then losing track of all vehicles entirely. This is likely because the value range for the YCbCr color channels tended to cluster more strongly in the middle of the range, resulting in most values falling in just a few bins. In such a case, the histogram distances will not be nearly as useful as a case where the data are more spread out, as is the case in the other color spaces. For the other color spaces, however, it can be clearly seen that the color histogram is a very strong feature, allowing the tracker to function accurately even though the color histogram is the only feature being used.

4.2 Two Dimensional Color Histogram Features

While the three dimensional color histograms previously discussed show very promising results as a tracking feature, they have the significant drawback of execution time. Though the histogram itself does not take any significant amount of computation time to produce, comparing the three dimensional histograms does. As discussed above, the EMD distance calculation time scales exponentially with the number of histogram dimensions. While the time for a single calculation is not prohibitive, using the distance calculation with a large number of candidate regions, as is the case with a sliding window search model, quickly accumulates into unacceptably lengthy execution times, even with smaller search radii.

The proposed method for overcoming this performance limitation is to reduce the histogram to two dimensions. Color spaces such as HSV, YCbCr, and LAB store the color information in only two of the channels, with the illumination value being stored on the third channel. Note that while the illumination/intensity channel contains useful feature information, that information is already utilized by the DRIFT algorithm. Because the goal of this research is improvement of the DRIFT algorithm by integrating a color feature, using the intensity data in the color feature as well would be redundant, and would to some extent increase the weight of the intensity data for the overall algorithm.


Figure 4.4. Heat map of HSV 2D histogram

The channel reduction will not work, however, on any color space that stores the color data across more than two channels. This rules out using RGB any further.
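A sketch of the resulting two-dimensional feature is given below, using OpenCV's histogram routine over the hue and saturation channels only; the 8 bins per channel match the configuration used later in C-DRIFT.

```python
import cv2

def hue_sat_histogram(patch_bgr, bins=8):
    """Two-dimensional hue-saturation histogram of a BGR patch (Figure 4.4)."""
    hsv = cv2.cvtColor(patch_bgr, cv2.COLOR_BGR2HSV)
    # Channels 0 (hue, range 0-180 in OpenCV) and 1 (saturation); the value channel is discarded.
    hist = cv2.calcHist([hsv], [0, 1], None, [bins, bins], [0, 180, 0, 256])
    return hist / max(float(hist.sum()), 1.0)
```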

In order to evaluate the viability of a two dimensional color histogram feature, and to allow direct comparison to the three dimensional color histogram feature, the tracking test used in the previous section (outlined in Figure 4.2) was again utilized. The histogram calculation was modified to use the two dimensional color histograms, and thus the EMD distance calculation was also reduced to two dimensions. The tests were conducted the same way, on the Egtest01 dataset produced by DARPA. Table 4.2 shows the feature performance by color space, and Table 4.3 shows the total tracker time per frame, in seconds, for the three dimensional HSV histogram and the two dimensional HSV histogram.


Table 4.2. 2D histogram tracking results by color space

Color Space    LAB      YCbCr    HSV
Egtest01       0.0000   0.0000   0.8077

Table 4.3. Comparison of tracker time per frame for 3D and 2D histograms

            2D Histograms   3D Histograms
Time (s)    0.2052          1.7921

From the results presented in Table 4.2, it can be seen that the only color space that was still able to be used by itself for object tracking when using just its color channels is HSV. YCbCr suffered from the same problems that it had in the three dimensional test, except that without the luminosity channel the problem of tightly packed values was intensified. A similar problem occurred with LAB, where without luminosity the values from the color channels all fell into just a couple of different histogram bins, rendering the feature almost entirely useless. The HSV histograms, however, were still able to be used for tracking, with only a fairly small reduction in accuracy. The three dimensional HSV histogram had an accuracy of 0.8197, while the two dimensional histogram achieved 0.8077. The difference in execution time per frame, as seen in Table 4.3, is very significant, with the two dimensional version running almost nine times faster per frame. The speed increase is due almost entirely to the lowered complexity of the two dimensional EMD calculation.

4.3 Color Feature Fusion with DRIFT

In order to improve the tracking accuracy of the DRIFT algorithm when used with color video, and to create a more robust overall tracker, the proposed two-dimensional color histogram is integrated with the DRIFT algorithm.


Figure 4.5. Diagram of Color DRIFT Structure.

This is done by adding the color histogram as an additional feature that is calculated for each candidate region. Figure 4.5 shows the overall structure of the hybrid tracker. In this diagram, the blocks with rounded edges indicate an input to the algorithm, while the rectangular blocks with lines on the sides indicate a section, rather than a function or calculation. The blue rectangles indicate a block section where calculations are occurring, while the green rectangles indicate that separate calculations are occurring for the grayscale and color features.

The DRIFT algorithm performs the candidate region selection using a sliding window approach, but with certain candidates eliminated based on intensity differences with the reference object, as described in Chapter 3. For each candidate, the RGB color region is converted into the HSV color space. The value channel is then discarded, leaving only the hue and saturation layers. From these two layers, a two-dimensional histogram H_i is created with 8 bins per channel, using the Gaussian rings as a mask, as done with the intensity and Kirsch mask features. The color histogram normalization factor N_i is then found using equation (3.8). Using N_i, the color histogram is normalized to equally weight each ring using

퐻푖(푣) 퐻푐표푙표푟,푖(푣) = , (4.7) 푁푖 which yields the final two-dimensional Gaussian weighted color histogram feature. The histogram is 8x8 bins, for a total of 64 bins per ring, resulting in 64ρ bins total. Additional bins could be used to increase accuracy in certain cases, but the computation time of the distance calculation would increase greatly.

The distance calculation is performed using the two-dimensional Earth Mover’s

Distance, as described in section 4.1. The EMD score is calculated separately for the

DRIFT features and for the color histogram feature. Experimentally, it was determined that the color histogram distance scores returned by the two-dimensional EMD are significantly higher than the scores returned for the DRIFT features. In order to weigh the two distance scores equally for the combined score, the distance scores for both features are normalized. Because the EMD scores are not bound to a range, normalization must be performed using constant values, rather than being based on the maximum scores for each frame. The normalization constants are determined as

푁푐표푙표푟 = max(퐸푀퐷푐표푙표푟) (4.8)

and

푁퐷푅퐼퐹푇 = max(퐸푀퐷퐷푅퐼퐹푇) (4.9)

33 where 퐸푀퐷푐표푙표푟 and 퐸푀퐷퐷푅퐼퐹푇 are the EMD distances for the color feature and DRIFT features, respectively. The normalization constants are used to calculate the overall distance scores for all frames as

퐸푀퐷푐표푙표푟(푛) 퐸푀퐷퐷푅퐼퐹푇(푛) 퐷푐표푚푏푖푛푒푑(푛) = + (4.10) 푁푐표푙표푟 푁퐷푅퐼퐹푇

th where 퐷푐표푚푏푖푛푒푑(푛) is the combined distance score for the n candidate region. The candidate with the lowest combined distance score is then determined to be the likely object center point. The thresholds for determining whether the reference object feature descriptor should be updated remain the same as with the base DRIFT algorithm. The color reference feature descriptor is updated as

퐻푟푒푓,푐표푙(푘) = (휂 ∙ 퐻푛푒푤,푐표푙(푘) + (1 − 휂) ∙ 퐻푟푒푓,푐표푙(푘 − 1)) ∙ 휏 + 퐻표푟푖푔,푐표푙 ∙ (1 − 휏) (4.11)

where 휂 and 휏 are the update parameters, 퐻푛푒푤,푐표푙 is the new color feature descriptor from lowest scoring candidate region, 퐻표푟푖푔,푐표푙 is the object’s original color feature descriptor that it was initialized with, and 퐻푟푒푓,푐표푙 is the reference color feature descriptor.

It is also proposed that an image registration stage be integrated with the DRIFT algorithm. This image registration uses the FAST algorithm [18-19] to find common features between frames, and by comparison of the pixel locations of these features, determine by how many pixels the camera has shifted, if any, from the previous frame.

This calculated shift is then added to the Kalman filter center point estimate for the frame.

By doing this, the estimate for the center point will be more accurate, resulting in more accurate tracking. Figure 4.6 shows the FAST corner features on a pair of frames from the

Visual Test Benchmark [39] dataset.

34

Figure 4.6. FAST feature comparison between pair of frames. Note that only the 100 strongest are displayed in each image.


CHAPTER 5

OBJECT TRACKING EVALUATIONS

In this chapter, the proposed Color DRIFT algorithm is tested by measuring its object tracking ability both in color WAMI data and in general color videos. First, the datasets are discussed, as well as the testing setup. Next, the test strategy is discussed and the results are presented in table form, with graphs to help visualize the results for the second dataset. Lastly, the results are discussed.

5.1 Datasets and Testing Setup

The WAMI data is from the VIVID dataset first introduced in Section 4.1. The parts of the dataset that are used for testing are the Egtest01 and Redteam sequences. The Egtest01 sequence follows a car within a convoy driving on an airstrip. The convoy first turns around in a wide loop, presenting the camera with many different viewpoints of the target vehicle, before accelerating down the airstrip. The target vehicle then breaks away from the center of the convoy, overtaking two of the other vehicles. The Redteam video sequence follows a primarily red vehicle traveling down a dirt road. The camera pans in and out at various points, offering significantly different scale views of the vehicle. Figure 5.1 shows frames from the Egtest01 sequence and the Redteam sequence.


Figure 5.1. Frames from object tracking sequences corresponding to Table 5.1.

The general video tracking sequences are taken from the Visual Tracker Benchmark dataset [39], a set of challenging videos that feature tracking of people, faces, objects, and vehicles. The video sequences include many challenging aspects such as in-plane and out- of-plane rotation, occlusions, large global lighting changes, and non-rigid object deformation. The video sequences are also of long duration, ranging from roughly 250 frames to roughly 1350 frames. Certain sequences from the set were necessarily excluded because they were grayscale-only video, which prevents the use of the color features under test. Figure 5.2 shows frames from a sampling of the sequences.

The setup for both datasets is very similar. While the frame rates vary from sequence to sequence, none of the sequence frame rates are substantially lower than the others, with frame rates roughly in the range of 15 – 25 frames per second.


Figure 5.2. Frames from object tracking sequences corresponding to Tables 5.2 and 5.3.

Because the frame rates are relatively consistent, the same formula was used for calculating the search radius. The search radius is found as

\[
\text{Search Radius} = 0.75 \cdot \max(O_{width}, O_{height})
\tag{5.1}
\]

where O_width is the width of the object being tracked, and O_height is the height of the object. The values of the constants α, β, η, and τ from equations (3.13), (3.14), and (3.15), respectively, are also set to be the same between the two datasets. The value of α, the occlusion threshold constant, is set to 5. The reference feature update constant, β, is set to 3. The values of η and τ, the constants used for defining the reference feature update ratios, are set to 0.2 and 0.9, respectively. The value for b, which limits the loss of data from the Kirsch mask when it is thresholded (3.12), is set to 5. The number of Gaussian rings (ρ), however, does differ between the datasets. For the WAMI sequences, the number of Gaussian rings used is set to 4; for the Visual Tracker Benchmark sequences, the number is set to 8, as the objects are much higher resolution.

5.2 Testing Strategies and Results

For the WAMI datasets, the metric used for evaluation is average overlap as defined in [40], which is described as the percentage of the intersection between the target and the ground truth with respect to the ground truth area. What this amounts to is counting the number of pixels from the ground truth that are captured by the test bounding box, and then dividing by the number of pixels in the ground truth bounding box. This method will heavily favor oversize bounding boxes, and will penalize undersize bounding boxes. Table 5.1 shows the results from these WAMI sequences.

Table 5.1. VIVID Object Tracking Overlap (%)

Tracker             Egtest01   Redteam   Average
Mean Shift+         0.66       0.68      0.67
Variance Ratio+     0.77       0.73      0.75
Peak Diff+          0.62       0.72      0.67
Frag Track+         0.58       0.44      0.51
SemiBoost+          0.63       0.71      0.67
MIL+                0.47       0.64      0.56
Adaptive Tracker+   0.62       0.73      0.68
DRIFT               0.69       0.72      0.71
C-DRIFT             0.78       0.79      0.79

+ Experimental results obtained from [40].
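As a point of reference, the overlap measure from [40] used for Table 5.1 can be computed as in the following sketch, assuming bounding boxes given as (x, y, width, height); the function name is illustrative only.

```python
def gt_area_overlap(track_box, truth_box):
    """Overlap used in Table 5.1: intersection area divided by ground truth area.
    Boxes are (x, y, width, height); note an oversized track box can still score 1.0."""
    tx, ty, tw, th = track_box
    gx, gy, gw, gh = truth_box
    iw = max(0, min(tx + tw, gx + gw) - max(tx, gx))  # intersection width
    ih = max(0, min(ty + th, gy + gh) - max(ty, gy))  # intersection height
    return (iw * ih) / float(gw * gh)
```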

For the Visual Tracker Benchmark (VTB) sequences, two different metrics are used: the Frame Detection Accuracy (FDA) and the average center pixel error. The FDA is defined as


FDA(k) = area(ROI_T(k) ∩ ROI_G(k)) / area(ROI_T(k) ∪ ROI_G(k))    (5.2)

where ROI_T is the tracked bounding box, ROI_G is the ground truth bounding box, and FDA(k) is the Frame Detection Accuracy for frame k. The results are then averaged over all frames to obtain the average frame detection accuracy. The average center pixel error is simply the Euclidean distance between the centers of the tracked bounding box and the ground truth bounding box, averaged over all frames.
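A minimal sketch of both VTB metrics, again assuming (x, y, width, height) bounding boxes and hypothetical function names, could look like this:

```python
import math

def fda(track_box, truth_box):
    """Frame Detection Accuracy, Eq. (5.2): intersection over union of two boxes."""
    tx, ty, tw, th = track_box
    gx, gy, gw, gh = truth_box
    iw = max(0, min(tx + tw, gx + gw) - max(tx, gx))
    ih = max(0, min(ty + th, gy + gh) - max(ty, gy))
    inter = iw * ih
    union = tw * th + gw * gh - inter
    return inter / float(union)

def center_error(track_box, truth_box):
    """Euclidean distance between the two box centers, in pixels."""
    tx, ty, tw, th = track_box
    gx, gy, gw, gh = truth_box
    return math.hypot((tx + tw / 2.0) - (gx + gw / 2.0),
                      (ty + th / 2.0) - (gy + gh / 2.0))
```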

Figures 5.3 and 5.4 show a graphical view of the C-DRIFT results for the Visual Tracker Benchmark sets. Figure 5.3 shows a plot of thresholded overlap success rates, and Figure 5.4 shows a plot of thresholded center error success rates.

Figure 5.3. Plot of thresholded overlap success for Visual Tracker sets.


Figure 5.4. Plot of thresholded center error success for Visual Tracker sets.

The overlap threshold success plot (Figure 5.3) is formed using

Overlap_thresh(k) = mean(Overlap(n) > k)    (5.3)

where Overlap(n) is the overlap of the nth frame, and Overlap_thresh(k) is the thresholded overlap success at k, where k is the applied threshold. In Figure 5.3, the thresholds run along the x-axis. The center pixel error threshold success plot is found similarly, using

CenterError_thresh(k) = mean(CenterError(n) < k)    (5.4)

where CenterError(n) is the center pixel error of the nth frame, and k is the applied center pixel error threshold. These graphs show each sequence from the VTB dataset represented as a separate line. To illustrate how to read these graphs, an example is offered: in Figure 5.3, for a threshold of 0.5, the Basketball sequence achieves a success rate of roughly 0.95. This means that roughly 95 percent of frames in the sequence had an overlap with the truth data of at least 0.5.
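A sketch of how such success curves could be generated from per-frame scores is shown below; the function names and the NumPy-based implementation are assumptions of this example, not the plotting code used for Figures 5.3 and 5.4.

```python
import numpy as np

def overlap_success(per_frame_overlap, thresholds):
    """Eq. (5.3): fraction of frames whose overlap exceeds each threshold k."""
    overlaps = np.asarray(per_frame_overlap, dtype=float)
    return [float(np.mean(overlaps > k)) for k in thresholds]

def center_error_success(per_frame_error, thresholds):
    """Eq. (5.4): fraction of frames whose center error is below each threshold k."""
    errors = np.asarray(per_frame_error, dtype=float)
    return [float(np.mean(errors < k)) for k in thresholds]

# Example: a sequence where 95% of frames exceed an overlap of 0.5 would return
# roughly 0.95 at the k = 0.5 entry of overlap_success(..., np.linspace(0, 1, 11)).
```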


Table 5.2 shows the FDA for the Visual Tracker Benchmark sequences, and Table 5.3 shows the results of the average center pixel error. In these tables, the proposed color histogram feature fusion with DRIFT is referred to as C-DRIFT.

Table 5.2. Visual Tracker Benchmark Object Tracking Frame Detection Accuracy (%)

Sequence       ASLSA-RAW*  ASLSA-HOG*  L1_APG*  CT_DIF*  DRIFT  C-DRIFT
Basketball     0.46        0.27        0.29     0.36     0.65   0.70
Lemming        0.42        0.41        0.4      0.41     0.27   0.69
MountainBike   0.27        0.37        0.4      0.25     0.65   0.67
Shaking        0.1         0.06        0.06     0.33     0.13   0.17
Skating1       0.38        0.35        0.31     0.26     0.31   0.37
Tiger1         0.25        0.24        0.42     0.15     0.34   0.52
Tiger2         0.32        0.21        0.38     0.1      0.34   0.23
Trellis        0.49        0.49        0.28     0.28     0.28   0.38
Average        0.34        0.30        0.32     0.27     0.37   0.47

* Experimental results obtained from [44].

The evaluations for the VIVID dataset of WAMI sequences compare the results from seven different trackers, as well as the base DRIFT tracker and the proposed C-DRIFT tracker.

The Mean Shift tracker [45] uses histograms regularized by spatial masking with an isotropic kernel. The Variance Ratio tracker and the Peak Diff tracker [32] use features formed from linear combinations of the R, G, and B channels of a color image. The Frag Track method [46] uses multiple arbitrary image fragments to form feature descriptors.

SemiBoost [47] uses an on-line semi-supervised boosting method to classify the target object. The Multiple Instance Learning (MIL) tracker [48] uses an online method similar to SemiBoost, but with improvements to help reduce inaccuracies. The Adaptive Tracker [49] extends the mean-shift algorithm by selecting reliable features from color and shape-texture cues according to their descriptive ability. Evaluation results for these trackers are taken from [40].


Table 5.3. Visual Tracker Benchmark Object Tracking Average Center Error (Pixels)

Sequence       ASLSA-RAW*  ASLSA-HOG*  L1_APG*  CT_DIF*  DRIFT  C-DRIFT
Basketball     70.2        245.6       107.2    123.6    10.4   7.4
Lemming        165.5       155.4       138.2    149.3    120    11.9
MountainBike   185.9       155.2       210.1    212.9    11     8.6
Shaking        86.5        86.6        113      30.9     71     94.4
Skating1       14.6        15.6        72.3     87.9     43.9   80.8
Tiger1         71.2        112.9       61.7     85.6     70.8   25.0
Tiger2         61.8        96.2        58.4     83.3     40     63.3
Trellis        31.9        18.8        62.5     47.4     54.8   43.8
Average        85.95       110.79      102.93   102.61   52.74  41.9

* Experimental results obtained from [44].

The evaluations for the Visual Tracker Benchmark sequences compare the results of four different trackers, plus two versions of the DRIFT tracker. The ASLSA-RAW tracker and the ASLSA-HOG tracker [41] use structural local sparse representation for object tracking. The L1_APG tracker [42] uses accelerated proximal gradients. The CT_DIF tracker [43] uses features extracted from multi-scale image feature space with non-adaptive random projections. The DRIFT tracker uses the framework described in Chapter 3.

5.3 Discussion

Results from Table 5.1 show that on the evaluated WAMI sequences, the proposed C-DRIFT achieved the highest tracking overlap on both the Egtest01 and Redteam sequences. The base DRIFT algorithm also achieved high overlap, with performance levels nearly equal to the best of the other trackers for each sequence. While the results for both DRIFT trackers are quite good, the large scale changes present in both sequences posed a significant challenge to the fixed bounding box size of the DRIFT and C-DRIFT trackers. Due to the way that the overlap metric is calculated in [40], having an undersized bounding box when the object scale increases results in a lower score. Had [40] reported results under other evaluation metrics as well, they would likely have been used in this thesis.

Results from Table 5.2 show that the proposed color histogram feature fusion with DRIFT results in significantly higher average tracking accuracy compared to the base DRIFT. DRIFT achieves an average tracking accuracy of 37%, while C-DRIFT achieves a full 10% higher accuracy at 47%. The average center error results in Table 5.3 show that C-DRIFT is also lower by a sizable margin: DRIFT achieves an average center error of 52.74 pixels, versus 41.9 pixels for C-DRIFT. For both metrics, C-DRIFT significantly outperforms the base DRIFT algorithm.


CHAPTER 6

CONCLUSION

A novel color feature extraction method is presented, along with its integration into the DRIFT tracking algorithm. The color feature is designed to improve the ability of the DRIFT algorithm to deal with color imagery and to create an overall more effective tracker, both for WAMI data and general use tracking.

First, the three dimensional RGB color space histogram is presented, and its successful use in related works is discussed. Related works also show the use of histograms from other color spaces, so this idea is explored. A test is done to determine the suitability of different color space histograms for use in tracking. This is done using a basic tracking framework, stripped down from the DRIFT tracker, to test tracking using solely the three dimensional color histogram as a feature. The tracker consists of a sliding window candidate region selector, Earth Mover's Distance (EMD) for feature distance comparison, and a Kalman filter for prediction of the next frame location. The color spaces tested are RGB, YCbCr, LAB, and HSV. It is shown that all except YCbCr achieve considerable success when used as a feature for tracking, though HSV performs the best.

The complexity of the EMD calculation for three dimensional histograms, however, motivates a move to a two dimensional histogram feature, which reduces the calculation time by an order of magnitude. By stripping away the intensity/luminosity channel from the YCbCr, LAB, and HSV color spaces, the two dimensional color histogram feature can be formed. This feature is again tested using the various color spaces in a tracking application, and it is shown that only the HSV color space can be used by itself for two dimensional histogram tracking. It is also shown that the solo tracking results of the two dimensional histogram remain comparable to the three dimensional histogram results.
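As an illustration of the reduced feature, a two dimensional hue-saturation histogram could be computed as in the following sketch; the OpenCV-based implementation and the 16 x 16 bin count are assumptions of this example, not necessarily the settings used in the experiments.

```python
import cv2
import numpy as np

def hs_histogram(patch_bgr, bins=(16, 16)):
    """2D hue-saturation histogram: convert the patch to HSV and drop the
    value (intensity) channel before binning, then normalize to sum to 1."""
    hsv = cv2.cvtColor(patch_bgr, cv2.COLOR_BGR2HSV)
    hist = cv2.calcHist([hsv], [0, 1], None, list(bins), [0, 180, 0, 256])
    return hist / max(hist.sum(), 1e-12)

# Example on a synthetic patch (pure red in BGR order):
patch = np.zeros((32, 32, 3), dtype=np.uint8)
patch[:, :, 2] = 255
print(hs_histogram(patch).shape)  # (16, 16)
```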

Second, the two dimensional color histogram feature is integrated with DRIFT to create a more robust object tracker. The color histogram is integrated as a separate feature, alongside the DRIFT features of intensity and Kirsch edge detection. It uses the Gaussian ringlets already employed by the other DRIFT features as weights for the color histograms in order to generate a robust color feature. The color feature distance is calculated separately from the DRIFT feature distance, and then the two distances are equally weighted and added together to form the final weight for each candidate region. The combined feature tracker is shown on average to outperform the base DRIFT algorithm, and for the sequences from the Visual Tracker Benchmark dataset, it is shown to outperform versions of DRIFT developed to account for scale changes in the video sequence, even though the combined tracker does not handle scale changes.
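The equal-weight fusion of the two distance scores can be illustrated with a short sketch; the function and variable names are hypothetical.

```python
def fused_distance(color_distance, drift_distance):
    """Combine the color histogram distance and the DRIFT feature distance
    with equal weights, as done for each candidate region."""
    return color_distance + drift_distance

# The candidate region with the smallest fused distance to the reference
# features would be selected as the tracked location for the current frame:
# best = min(candidates, key=lambda c: fused_distance(c["color"], c["drift"]))
```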

Future work on this combined tracker includes an improved component for handling scale changes within a tracking sequence. Some scale handling is already integrated into certain versions of DRIFT, but by combining normalized color histograms with the existing scale handling, in much the same way that the features are combined in DRIFT tracking, it could be possible to create a more robust rescaling method. This is especially important for WAMI data, because longer tracking sequences such as Egtest01 and Redteam often include scale and orientation changes. A robust rescaling method could allow for more accurate tracking in both WAMI data and general object tracking.


REFERENCES

[1] E. Krieger, P. Sidike, T. Aspiras and V. K. Asari, "Directional ringlet intensity feature transform for tracking," Image Processing (ICIP), 2015 IEEE International Conference on, Quebec City, QC, 2015, pp. 3871-3875.

[2] E. Krieger, P. Sidike, T. Aspiras and V. K. Asari, "Vehicle tracking under occlusion conditions using directional ringlet intensity feature transform," 2015 National Aerospace and Electronics Conference (NAECON), Dayton, OH, 2015, pp. 70-74.

[3] T. Aspiras, V. K. Asari and J. Vasquez, "Gaussian ringlet intensity distribution (GRID) features for rotation-invariant object detection in wide area motion imagery," IEEE International Conference on Image Processing, 2014.

[4] G. R. Bradski, "Computer video face tracking for use in a perceptual user interface," Intel Technology Journal, 2, 1998.

[5] R. Collins, Y. Liu, and M. Leordeanu, "Online selection of discriminative tracking features," IEEE Transactions on Pattern Analysis and Machine Intelligence, 27:1631-1643, 2005.

[6] Tao Yang, Quan Pan, Jing Li and S. Z. Li, "Real-time multiple objects tracking with occlusion handling in dynamic scenes," 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05), 2005, pp. 970-975, vol. 1.

[7] M. J. Swain and D. H. Ballard, "Indexing via color histograms," Computer Vision, 1990. Proceedings, Third International Conference on, Osaka, 1990, pp. 390-393.

[8] Y. Rubner, C. Tomasi, and L. J. Guibas, "The earth mover's distance as a metric for image retrieval," IJCV, 2000.

[9] O. Pele and M. Werman, "Fast and robust Earth Mover's Distances," 2009 IEEE 12th International Conference on Computer Vision, Kyoto, 2009, pp. 460-467.

[10] O. Pele and M. Werman, "A linear time histogram metric for improved SIFT matching," in ECCV, 2008.

[11] S. Shirdhonkar and D. W. Jacobs, "Approximate earth mover's distance in linear time," Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on, Anchorage, AK, 2008, pp. 1-8.

[12] B. Zitova and J. Flusser, "Image registration methods: A survey," Image Vis. Comput., vol. 21, no. 11, pp. 977-1000, Oct. 2003.

[13] C. Harris and M. Stephens, "A combined corner and edge detector," in Proceedings of the Alvey Vision Conference, 1988, pp. 147-151.

[14] T. Lindeberg, "Feature detection with automatic scale selection," IJCV, 30(2), 1998, pp. 79-116.

[15] K. Mikolajczyk and C. Schmid, "Indexing based on scale invariant interest points," in ICCV, vol. 1, 2001, pp. 525-531.

[16] D. Lowe, "Object recognition from local scale-invariant features," in ICCV, 1999.

[17] H. Bay et al., "Speeded-Up Robust Features (SURF)," Computer Vision and Image Understanding, 110(3), pp. 346-359, 2008.

[18] E. Rosten and T. Drummond, "Fusing points and lines for high performance tracking," in Proceedings of the International Conference on Computer Vision, pp. 1508-1511, 2005.

[19] E. Rosten and T. Drummond, "Machine learning for high-speed corner detection," in Proceedings of the European Conference on Computer Vision, pp. 430-443, 2006.

[20] N. Dalal and B. Triggs, "Histograms of oriented gradients for human detection," 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05), San Diego, CA, USA, 2005, pp. 886-893, vol. 1.

[21] S. Arigela and V. K. Asari, "Self-tunable transformation function for enhancement of high contrast color images," J. Electron. Imaging, vol. 22(2), pp. 023010, 2013.

[22] E. Krieger, V. K. Asari and S. Arigela, "Color image enhancement of low resolution images captured in extreme lighting," Proc. SPIE 9120, pp. 91200Q, 2014.

[23] A. Soetedjo and K. Yamada, "Traffic sign classification using ring partitioned method," IEICE Trans. Fundam. Electron. Commun. Comput. Sci., E88-A(9), pp. 2419-2426, 2005.

[24] D. Comaniciu, V. Ramesh and P. Meer, "Kernel-based object tracking," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 25, no. 5, pp. 564-577, May 2003.

[25] E. Maggio and A. Cavallaro, "Multi-part target representation for color tracking," IEEE International Conference on Image Processing 2005, 2005, pp. I-729-32.

[26] M. Isard and A. Blake, "CONDENSATION - Conditional density propagation for visual tracking," International Journal of Computer Vision, 29(1): 5-28, 1998.

[27] T. Ojala, M. Pietikainen and T. Maenpaa, "Multiresolution gray-scale and rotation invariant texture classification with local binary patterns," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 7, pp. 971-987, Jul 2002.

[28] T. Ahonen, J. Matas, C. He, and M. Pietikäinen, "Rotation invariant image description with local binary pattern histogram Fourier features," in Image Analysis, pp. 61-70, 2009.

[29] A. Mathew and V. K. Asari, "Local histogram based descriptor for tracking in wide area imagery," Lecture Notes in Computer Science, Springer-Verlag Berlin Heidelberg, Wireless Networks and Computational Intelligence (Communications in Computer and Information Science series), vol. 292, pp. 119-128, 2012.

[30] A. Soetedjo and K. Yamada, "Traffic sign classification using ring partitioned method," IEICE Trans. Fundam. Electron. Commun. Comput. Sci., E88-A(9), pp. 2419-2426, 2005.

[31] Z. Tang, X. Zhang, and S. Zhang, "Robust perceptual image hashing based on ring partition and NMF," IEEE Transactions on Knowledge and Data Engineering, vol. 26, pp. 711-724, 2014.

[32] R. T. Collins and Y. Liu, "On-line selection of discriminative tracking features," Computer Vision, 2003. Proceedings. Ninth IEEE International Conference on, Nice, France, 2003, pp. 346-352, vol. 1.

[33] M. Swain and D. Ballard, "Color indexing," International Journal of Computer Vision, 7(1), 1991, pp. 11-32.

[34] K. Nummiaro, E. Koller-Meier, and L. Van Gool, "Color features for tracking non-rigid objects," Chin. J. Autom., vol. 29, no. 3, pp. 345-355, May 2003.

[35] V. Takala and M. Pietikainen, "Multi-Object Tracking Using Color, Texture and Motion," 2007 IEEE Conference on Computer Vision and Pattern Recognition, Minneapolis, MN, 2007, pp. 1-7.

[36] J. Wang and Y. Yagi, "Integrating Color and Shape-Texture Features for Adaptive Real-Time Object Tracking," IEEE Transactions on Image Processing, vol. 17, no. 2, pp. 235-240, Feb. 2008.

[37] B. H. Shekar, K. Raghurama Holla, and M. Sharmila Kumari, "KID: Kirsch Directional Features Based Image Descriptor," Pattern Recog. and Machine Int., Lecture Notes in Comp. Sci., vol. 8251, pp. 327-334, 2013.

[38] R. E. Kalman, "A New Approach to Linear Filtering and Prediction Problems," Journal of Basic Engineering, 82(1), pp. 35-45, 1960.

[39] Y. Wu, J. Lim, and M. Yang, "Online Object Tracking: A Benchmark," IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2411-2418, 2013.

[40] M. Siam and M. Elhelw, "Enhanced Target Tracking in UAV Imagery with P-N Learning and Structural Constraints," Computer Vision Workshops (ICCVW), 2013 IEEE International Conference on, Sydney, NSW, 2013, pp. 586-593.

[41] X. Jia, H. Lu, and M. H. Yang, "Visual tracking via adaptive structural local sparse appearance model," IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1822-1829, 2012.

[42] C. Bao, Y. Wu, H. Ling, and H. Ji, "Real time robust L1 tracker using accelerated proximal gradient approach," IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1830-1837, 2012.

[43] K. Zhang, L. Zhang, and M. H. Yang, "Real-time compressive tracking," European Conference on Computer Vision (ECCV), pp. 864-877, 2012.

[44] L. Wang, T. Liu, G. Wang, K. L. Chan and Q. Yang, "Video Tracking Using Learned Hierarchical Features," IEEE Transactions on Image Processing, vol. 24, no. 4, pp. 1424-1435, April 2015.

[45] D. Comaniciu, V. Ramesh, and P. Meer, "Kernel-based object tracking," IEEE Trans. Pattern Anal. Mach. Intell., 25(5), pp. 564-577, 2003.

[46] A. Adam, E. Rivlin, and I. Shimshoni, "Robust fragments-based tracking using the integral histogram," IEEE Conference on Computer Vision and Pattern Recognition (CVPR), vol. 1, pp. 798-805, 2006.

[47] H. Grabner, C. Leistner, and H. Bischof, "Semi-supervised on-line boosting for robust tracking," European Conference on Computer Vision (ECCV), 2008.

[48] B. Babenko, M.-H. Yang, and S. Belongie, "Visual Tracking with Online Multiple Instance Learning," IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009.

[49] J. Wang and Y. Yagi, "Integrating Color and Shape-Texture Features for Adaptive Real-Time Object Tracking," IEEE Transactions on Image Processing, 17(2), pp. 235-240, 2008.

[50] CLIF data from AFRL, ID: HAAA08E09D, https://www.sdms.afrl.af.mil/index.php?collection=clif2007.

[51] V. Manohar, P. Soundararajan, D. Goldgof, R. Kasturi and J. Garofolo, "Performance Evaluation of Object Detection and Tracking in Video," in Proc. of the Seventh Asian Conference on Computer Vision, pp. 151-161, 2006.

[52] Original color image from website "beyond.ca", http://www.beyond.ca/i/popular-car-colors.jpg.

[53] "T.871: Information technology – Digital compression and coding of continuous-tone still images: JPEG File Interchange Format (JFIF)," ITU-T, September 11, 2012.

[54] A. R. Smith, "Color Gamut Transform Pairs," Comput. Graph., vol. 12, pp. 12-19, 1978.

[55] H. Fairman, M. Brill and H. Hemmendinger, "How the CIE 1931 Color-Matching Functions were Derived from the Wright-Guild Data," Color Research and Application, 22(1): 11-23, February 1997.

[56] J. Schanda, "Colorimetry," Wiley-Interscience, p. 61.