Monocular Golf Ball Tracking and Size Estimation

Master’s thesis in Computer Science
Computer Vision and Active Perception Lab

MARKUS PETTERSSON

Supervisors: Mårten Björkman & Joakim Hugmark
Examiner: Stefan Carlsson

TRITA xxx yyyy-nn

Abstract

In order to estimate the distance between a moving camera and a golf ball, the projective size of the ball in images is measured. Two methods are evaluated based on their ability to detect golf balls in TV broadcasting images: the circular Hough transform and a blob detector based on scale-space theory. A tracking algorithm is then proposed where, given a known ball position, a search window is used, adapting to the ball speed in the image plane, the ball size and the previous window size. Various pre-processing methods are also discussed, including de-interlacing and an approximate background subtraction method. Experiments show that none of these pre-processing methods significantly improves the ball detection performance, although tendencies of improvement are noticed in some test sequences. The tracking algorithm is evaluated using sequences from TV broadcasts of golf competitions. The algorithm performs well under certain circumstances, where the projective size of the ball is the key to accurate measurements of the distance between ball and camera. The noisy input data, however, introduces noise also in the size estimations, resulting in noisy distance estimations. Additionally, the test sequences contain no ground truth for the actual distance between ball and camera, and the accuracy of the method could therefore only be determined in terms of noise analysis.

Referat

Monokulär trackning och storleksmätning av golfbollar

För att kunna uppskatta avståndet mellan en rörlig kamera och en golfboll så uppskattas den projektiva storleken av bollen i bilder. Två metoders förmåga att upptäcka golfbollar i bilder från TV-sändningar utvärderas: cirkulär Hough-transform och en blobdetektor baserad på skalrumsteori. En trackningsalgoritm presenteras, som givet en initial bollposition anpassar ett sökfönster efter bollens rörelse i bildplanet, bollstorlek och föregående fönsterstorlek. Vidare diskuteras också olika förbehandlingstekniker, däribland avflätning och en approximativ metod för bakgrundssubtraktion. Tester visar att ingen av dessa förbättrar resultaten nämnvärt, dock sker en viss förbättring i vissa fall. Trackningsalgoritmen utvärderas baserat på sekvenser från TV-sändningar av golftävlingar. Algoritmen presterar väl i vissa förhållanden, där den avgörande faktorn för goda avståndsuppskattningar är bollens storlek. Vidare ger brus i indatan också brus i avståndet. För den testdata som används finns inget facit för det verkliga avståndet mellan boll och kamera, varvid metodens noggrannhet bara kan analyseras genom att studera bruset.

Acknowledgements

I would like to thank my academic supervisor Mårten Björkman, who despite a massive workload invested both time and effort in my thesis project, and Joakim Hugmark, for daily discussions, support and feedback. I would also like to thank Daniel, Joakim, Fredrik, Ludvig, Dennis and Jonas, for showing interest in me and the project, and my examiner Stefan Carlsson for his feedback. Beyond the people mentioned above, there are of course many more who have helped me throughout this project, especially friends and family. I can’t mention you all, but thanks a million for your support, encouragement and love.

Stockholm, June 2014

Markus Pettersson

Contents

1 Introduction
    1.1 Problem description
    1.2 Previous works
        1.2.1 Soccer
        1.2.2 Other sports
        1.2.3 Fruit and traffic signs
    1.3 Contributions of this work
    1.4 Report outline

2 Theory
    2.1 Scale-space based blob detection
        2.1.1 Faster Gaussian convolution
    2.2 Circular Hough transform
    2.3 Interlacing
    2.4 Kalman filters

3 Method
    3.1 Input data
        3.1.1 Interlacing
        3.1.2 Frame types and video semantics
        3.1.3 Sharpening filter
    3.2 Window selection
        3.2.1 Predict ball position
        3.2.2 Predict ball radius
        3.2.3 Window size
    3.3 Peak detection
        3.3.1 Pre-processing
        3.3.2 Circular Hough transform
        3.3.3 Scale-space blob detection
        3.3.4 Tracking specifics
    3.4 Size estimation
    3.5 Analysis of expected accuracy

4 Experiments
    4.1 Test data analysis
    4.2 Initialization
    4.3 Choice of deinterlacing technique
    4.4 Tracking
    4.5 Distance estimation
    4.6 Computational efficiency

5 Conclusions
    5.1 Future work

Bibliography

Appendices

A Test data
    A.1 Test sequences without focal length data
    A.2 Test sequences with focal length data

B Scale-space blob size

Chapter 1

Introduction

In sports, measurements of physical properties are important for a wide range of issues. Did the ball cross the line or not? How far was the discus thrown? How fast did the downhill skier go? Such measurements have traditionally been made manually, but automating them is getting easier as technology evolves. In particular, measuring balls is a well-studied area, since balls are essential in many different sports.

Historically, the techniques for automatic, or at least technology-aided, measurements of ball speed and position have been dominated by radar approaches like Spike et al. [45] and Dilz [15], whereas more recent methods use laser scanners [18]. However, improved graphics processing hardware and the digital camera revolution have created openings to do ball tracking with computer vision techniques. There are several advantages to using cameras as sensors instead of radar or laser. For instance, cameras are typically cheaper, and it is much easier to visualize measurement problems with images than with other types of sensor data.

A common method in computer vision to track balls is to use stereo vision, where two cameras with a known distance between them are used. If the ball’s position in images from the two cameras is determined, then its 3D position can be computed using the pinhole camera model and the theory of epipolar lines [3, p. 24].

The tracking problem immediately gets harder when only one camera can be used, since triangulation with epipolar lines is no longer possible. One alternative is then to keep track of the camera rotation, translation¹ and scaling², whereafter the size of the ball is measured in the image. With known image sensor parameters, ball size and focal length, the distance between the ball and the camera can then be computed with basic trigonometry. Hence, the position of the ball in 3D space can also be determined. The fact that a golf ball has a well-specified size³ enables this kind of approach.

In this report, a novel algorithm for automatic detection and size estimation

¹ Translation: how the camera is moved forwards, backwards or sideways.
² Scaling: depends on how much the view is zoomed in or out.
³ The rules of golf [42] state a minimum ball size of 42.67 mm in diameter. It is allowed to use a larger ball, but that is rare due to golf ball aerodynamics.

of golf balls in video sequences is proposed. Based on this and previous work by Hugmark [25], where data from an IMU⁴ is stabilized with image features, a complete positioning system can be built. Such a system can be used for determining the distance of a drive⁵, as well as for shot analysis or graphics generation; neither of these aspects is, however, covered in this report.

1.1 Problem description

The task of detecting the ball and measuring its size may seem easy, but, as will be seen in this section, this is not the case. This description will mainly focus on multiple peaks and noise, which are more general. Issues more specific to the video data used in this project are described in Section 3.1. Such issues include motion blur, interlacing and camera filters.

To determine the size of the ball, the first step is to determine the location of the ball within the image. The difficulty of this task varies from sequence to sequence, but typically gets easier when the ball occupies a larger part of the image (see Figure 1.1). Naturally, since an image has a limited resolution (every pixel corresponds to a certain physical size), the possibilities for both detection and accurate size estimation increase with a larger projective size⁶.


Figure 1.1: Sample images with smaller (a) and larger (b) parts of the image occupied by the ball.

Both of the sample frames in Figure 1.1 are still easy cases, since the scene, except for the ball, mainly consists of grass and trees. When other obstacles of similar shape and size exist in the scene, it is much harder to determine the ball position. The ball in Figure 1.2 is almost invisible when in front of the house in the background. Even when the background is only grass, there might be ball-like objects in the lawn, like in Figure 1.3 where the imprints in the lawn are really hard to distinguish from the ball.

⁴ An IMU (Inertial Measurement Unit) measures orientation, velocity and acceleration in three dimensions using a set of accelerometers and gyroscopes.
⁵ A drive is a (usually long) shot starting from the tee.
⁶ Projective size: the size of the object in an image, measured in pixels.


Figure 1.2: Background objects similar to the ball can make the detection much harder. The image above highlights the ball position; the ball is almost invisible due to a white detail on the house in the background.

Figure 1.3: It is much harder to determine the ball position when the scene contains multiple objects similar to the ball. For this image, it is hard to distinguish the ball (highlighted) from the imprints in the lawn.

Nevertheless, even with a large and sharp image it is not always trivial to determine which pixels are ball pixels and which are background. One of the easiest ways to segment out a white object from a darker background in a gray-scale image is to do gray-level thresholding [39]. This means that the histogram⁷ of the image is analyzed in order to find a proper threshold. Based on the assumption that the ball is the brightest object in the image, all pixels with intensity higher than a certain threshold value can be considered ball pixels.

The crucial step here is of course to find a proper threshold value. In many cases, it is not easy to select this threshold even manually, and it is therefore even harder to do automatically. In Figure 1.4, two different thresholds are used, but neither gives a perfect segmentation. There are methods to optimize the choice of threshold; it can for instance be aided by the fact that the object is compact (it contains no holes

⁷ Histogram: a diagram showing the distribution of pixel intensities in an image. The top right plot of Figure 1.4 is an example of a histogram.

or similar), like the work by Das et al. [10], but this still doesn’t guarantee proper segmentation.
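As a concrete illustration, the following MATLAB sketch performs basic gray-level thresholding. It is a minimal example, not the thesis implementation; it assumes a gray-scale image I scaled to [0, 1] and uses Otsu’s method (graythresh) in place of a manually chosen threshold.

    % Minimal gray-level thresholding sketch (assumes I is gray-scale, in [0,1]).
    level = graythresh(I);     % automatic threshold from the histogram (Otsu)
    ballMask = I > level;      % pixels above the threshold: candidate ball pixels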

Figure 1.4: Histogram thresholding. The upper row shows the input image (left) and a histogram of image intensities with 100 bins (right). The second row shows the result of a threshold of 0.6 (left) and 0.9 (right).

1.2 Previous works

The works most obviously related to the golf ball problem are other ball-tracking approaches, mainly in soccer. Additionally, work in related fields such as traffic sign detection and even measurement of fruit is described below.

1.2.1 Soccer

Kim et al. [28] made a monocular (one camera) physics-based solution where the height of soccer players serves as a reference to estimate the ball height. The height data was fitted to a trajectory model which approximated the 3D position of the ball.


D’Orazio et al. published several papers on soccer ball tracking [17, 30, 35, etc.]. In [30], images are preprocessed with both wavelet and independent component analysis. The output vectors are fed into a neural classifier to detect soccer balls. Another approach [17] uses a circular Hough transform to find candidate regions where the ball might be. The candidates are then fed into a neural classifier trained with a set of images with varying occlusions and shadowing. Both these methods work well in practice, but require a large set of samples. Their described tests report good performance in certain outdoor environments, but the robustness cannot be guaranteed since the test data set might not contain all possible situations. For instance, the performance is measured on video data from both sunny days and sunny evenings, but not for cloudy or rainy days. Additionally, Yu et al. [51] point out that D’Orazio et al. [17] use their own camera rather than broadcast video.

Mazzeo et al. [37] review the previous work from the same research group. Additionally, a new method combining the circular Hough transform with a scale-space analysis is also described. The final histogram-based classifier performs well on the black-and-white ball that is used, but is sensitive to strong illumination variations. The ball candidates are selected from moving blobs based on a comparison between the current frame and a background model, which simplifies the problem but also makes the solution unusable for a moving camera.

Gong et al. [21] first detect the grass and white lines of the pitch. All remaining regions containing a certain amount of white pixels are given a circularity measurement between 0 and 1, and many of the candidates are filtered away by thresholding the circularity values. Finally, the candidate is compared with the closest neighborhood in the successive frame, and if a similar ball shape is found there, the ball position is considered to be found. Besides this approach, most other approaches do not exploit color information.

1.2.2 Other sports

Within baseball, a few attempts to do ball tracking have been made. Lin and Chang [31] use an "off-the-shelf" handheld static camera to estimate the speed of a baseball. The speed is calculated from the shutter speed and the estimated blur caused by the ball. To detect the ball within the image, it is assumed that the ball is travelling in parallel with the horizontal image scan lines. The blur caused by noise and minor movement of the camera will then be easy to distinguish from the blur caused by the ball, as the ball blur will consist of horizontal ramp edges. A circle fitting algorithm based on random sampling is then used to determine the size of the ball, and as a consequence, also speed and 3D position. The assumption of a horizontal travel direction (which of course easily could have been changed to any specified direction) makes it unsuitable for this project.

In volleyball, ball tracking is harder because of increased player occlusions due to the high player density in the court. Volleyball tracking techniques therefore must rely on a physics model to a much larger extent than for soccer [5, 6]. Chen et al. [7] refine the trajectory estimation with sound highlights such as referee whistles

and smashing sounds. Such sound highlights typically indicate discontinuities in the ball trajectories, which therefore can be taken into account. Similar methods are used for a wide range of ball sports, such as cricket [29], basketball [4] and tennis [46].

1.2.3 Fruit and traffic signs

Circular objects are not limited to balls only; detection and measurement of circular objects have been used for a wide range of applications. One of these is agriculture, another is driver-aid systems.

Rakun et al. [43] propose a three-step algorithm to detect fruit in a plantation, which uses color image segmentation followed by a texture analysis and a final 3D reconstruction. The reported results are impressive for cases where a fruit occupies a large part of the image, but no results are given for smaller objects. Qingbing et al. [41] use a much more controlled environment by placing a red plate behind green grapes, which makes segmentation by histogram thresholding easy. Such a controlled environment is however not realistic to assume outdoors.

To detect circular traffic signs, Huang et al. [24] refine a symmetry detector described by Loy and Zelinsky [34]. An accumulator space is built based on the gradient magnitude and gradient direction of each pixel. The peaks in the accumulator space indicate centers of circle arcs. A Support Vector Machine model based on color histograms then classifies the sign candidates. Another traffic sign detection system [40] analyzes the contours of various sign types using a curvature scale-space representation, with reference to Mokhtarian [38]. This method, which is an extension of the ordinary scale-space theory introduced by Witkin [49] (see Section 2.1), can be used to recognize complex-shaped objects, but such complexity is not needed for basic-shaped objects such as balls.

1.3 Contributions of this work

As seen above, many different approaches have been used for detection and tracking of circular objects. Most of these approaches use some kind of additional information to reduce the complexity of the problem.

• In most sports, there are other obstacles in the scene that can be used for determining things like ball height [28], camera parameters [50] etc.

• Many solutions (for instance Marco et al. [35]) use a static camera with a much higher frame rate than what is used in TV broadcasting.

• The ball is larger and often more textured in sports such as soccer and volleyball than it is in golf.

• The pitch usually has a well-defined size and is planar.


• For indoor sports, the illumination is much more controlled and therefore produces fewer unexpected artifacts and optical phenomena.

All of these factors reduce the complexity of the problem, but for golf all of them must be handled, since all holes⁸ have different curvature, illumination, surrounding appearance etc. As discussed later, the semantics of the input video data are also very diverse, ranging from a large sharp ball to a small blurred one. In this report, I try to address all of the issues mentioned above, to come up with a robust, well-performing algorithm tweaked for golf. However, I hope that this work can be extended to more applications, both in sports and in other areas.

1.4 Report outline

The rest of the report is organized as follows. In Chapter 2, a theoretical ground for the rest of the report is given, including descriptions of scale-space theory, the circular Hough transform, interlacing and Kalman filters. After that, Chapter 3 describes the proposed algorithm in detail, combined with a qualitative analysis of the test data. The expected accuracy of the method is also analyzed. Chapter 4 gives data on the performance of the method, for frame-to-frame tracking, size estimation and single-frame peak detection. Finally, the results are summarized in Chapter 5, where suggestions for how this work could be extended are also given.

⁸ A hole is a part of a golf course; a course typically has 18 holes.


Chapter 2

Theory

This chapter covers the theory that will be considered and used later in the report. It is assumed that the reader is familiar with basic image processing concepts such as convolution and Gaussian kernels. Gonzalez and Woods [22] provide an excellent introduction to these concepts.

The purpose of the tracking system is to detect the ball and measure its projective size. Two methods suitable for this are therefore described: blob detection based on scale-space theory and the circular Hough transform. However, the accuracy of these methods is limited, especially when searching over a large range of projective sizes. Therefore, Section 2.4 introduces the concept of Kalman filters, which later will be used for stabilizing the distance estimations between the ball and the camera. Additionally, a few methods for a more detailed measurement of the ball size given a ball position are described, and their feasibility is discussed. Finally, Section 2.3 defines the problem of interlacing, which arises when the horizontal pixel lines are not sampled simultaneously and the ball is moving significantly with respect to the image frame.

2.1 Scale-space based blob detection

The idea behind scale-space theory is that since the scale of a scene - meaning the level of detail at which the image should be interpreted - cannot be known in advance, a sound image representation is to consider all possible scales simultaneously. More precisely, this means that the image is blurred with larger and larger variances. In this continuum of scales, all objects will at certain scales first become more distinct and then get "blurred away" and disappear. By detecting peaks in the scale-space representation, the most distinct objects in a scene can be identified, regardless of size. This idea will be formalized below. For an overview of the history of scale-space theory, please refer to Johansen [26].

The scale-space representation $L : \mathbb{R}^2 \times \mathbb{R}_+ \to \mathbb{R}$ of a two-dimensional signal


$I : \mathbb{R}^2 \to \mathbb{R}$ is defined as [33, 49]

$$L(x, y; t) = \int_{(\xi,\eta)\in\mathbb{R}^2} I(x-\xi,\, y-\eta)\, g(\xi, \eta; t)\, d\xi\, d\eta = I(x, y) * g(x, y, t)$$

where $g$ is a Gaussian smoothing kernel defined as

$$g(x, y; t) = \frac{1}{2\pi t}\, e^{-(x^2+y^2)/2t}$$

This means, loosely speaking, that the representation is the set of results of convolving the image with Gaussian kernels of varying variance t. The reason for using Gaussians is that they can be shown not to introduce any artifacts as the scale goes from fine to coarse [20]. Additionally, within the class of linear transformations, Gaussian kernels are the only ones able to create a scale-space representation [32].

Further on, the set of scale-space derivatives $(L_x, L_y, L_{xx}, L_{yy}, L_{xy})$ can be used to define a variety of operations such as edge detection and blob detection [33]. A scale-space derivative is the derivative of an image at a specified scale t. That is, an initial convolution with a Gaussian of scale t is followed by a convolution that approximates a derivative. As we will see later, only the second-order derivatives are needed for blob detection; approximations thereof are given in Figure 2.1. A very useful property of convolution is that it is a linear operator, which means that the convolution between the Gaussian kernel and the derivative operator can be performed before the result is convolved with the input image [22]. This can increase efficiency, since the Gaussian derivative kernels can be computed in advance.

(a) [−1  2  −1]        (b) [−1  2  −1]ᵀ

Figure 2.1: Second order derivative approximations in (a) x direction and (b) y direction.

In most cases¹, the ball will be a region of almost white pixels - a bright blob. As Lindeberg [32] describes, such blobs can be detected in a scale-space representation. To do this, consider the second order derivatives of an image in the x and y directions respectively, and denote these derivatives $L_{xx}$ and $L_{yy}$. Now, finding a peak in intensity would be analogous to finding the maximum values in a gray-scale image, but there might be many patches with very high brightness (in particular, golf scenes may include audience with white clothes etc.). Instead, the most distinct peaks in the scale-space representation are what is of interest here, and a sharp peak will result in a local minimum in the second derivative. With the sum $\nabla^2 L = L_{xx} + L_{yy}$, a

¹ A presentation of the most distinct ball tracking situations is provided in Section 3.1.2.

two-dimensional minimum can be retrieved. This sum will from now on be referred to as the Laplacian response. To find the variance t which maximizes the Laplacian response (and which has a direct correlation to object size), the scale-normalized Laplacian

$$\nabla^2_{\mathrm{norm}} L = t \cdot (L_{xx} + L_{yy}) \qquad (2.1.1)$$

can be used [33]. Without this normalization, all peaks would appear at lower variances, since the intensity of the sharpest peaks at low variances has been "spread out" at higher ones. By looking for minima in the scale-normalized Laplacian, the brightest blobs can be detected in both space and scale [33]. Properties of the Difference of Gaussians approximation to the Laplacian (described below) make an efficient implementation of this blob detector possible.
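To make the procedure concrete, the following MATLAB sketch computes the scale-normalized Laplacian response over a small set of variances and picks the strongest bright-blob response. It is a minimal illustration under assumptions made here (I is a gray-scale double image; the set of variances and the use of imgaussfilt are choices for this example), not the thesis implementation.

    % Sketch: scale-normalized Laplacian blob detection (eq. 2.1.1).
    variances = [2 4 8 16 32];                  % variances t to search
    response = zeros([size(I) numel(variances)]);
    lapKernel = [0 1 0; 1 -4 1; 0 1 0];         % discrete Laplacian approximation
    for s = 1:numel(variances)
        t = variances(s);
        L = imgaussfilt(I, sqrt(t));            % scale-space representation at scale t
        response(:,:,s) = t * conv2(L, lapKernel, 'same');  % scale normalization
    end
    % A bright blob gives a strong minimum in the normalized Laplacian.
    [~, idx] = min(response(:));
    [row, col, s] = ind2sub(size(response), idx);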

2.1.1 Faster Gaussian convolution

The computation of the Laplacian response at multiple scales has a high computational cost. It is therefore desirable to make it faster, enabling usage in real-time applications. As it turns out, the Laplacian response can be computed more efficiently with the use of a kernel known as the Difference of Gaussians (DoG). Comparing a Laplacian with the difference between two Gaussians of different variance, it is obvious that they are similar (as shown in Figure 2.2). Hence, it is possible to replace the Laplacian of Gaussians with this approximation and still retrieve an equivalent result [9]. By once again exploiting the linearity of convolution, the Laplacian response can be computed based on the scale-space representation [2].


Figure 2.2: 3D plots comparing a Difference of Gaussians approximation of the Laplacian (left) with an actual Laplacian operator (right).

An important property of a Gaussian kernel is that it is separable. This means that instead of convolving an image with a full Gaussian, the convolution can be separated as follows.

$$I * G = I * (G_x * G_y) = (I * G_x) * G_y \qquad (2.1.2)$$

In terms of computational cost, this reduces the cost of convolution with a Gaussian of size $N \times N$ from $O(N^2)$ to $O(2N)$.


The approximation of the Laplacian is related to the Difference of Gaussians as described below [14].

$$(k-1)\,\sigma^2 \nabla^2 L \approx \tilde{L} = \hat{L}(x, y, k\sigma) - \hat{L}(x, y, \sigma)$$

where $\nabla^2 L$ is the Laplacian response, $\tilde{L}$ the Difference of Gaussians approximation, and $\hat{L}(x, y, \sigma) = G(x, y, \sigma) * I(x, y)$. We can now rewrite eq. 2.1.1 so that the scale-normalized Laplacian is approximated as below.

$$\nabla^2_{\mathrm{norm}} L(x, y, \sigma) = \sigma^2 \nabla^2 L(x, y, \sigma) \approx \frac{\hat{L}(x, y, k\sigma) - \hat{L}(x, y, \sigma)}{k - 1} \qquad (2.1.3)$$

Finally, including the result of (2.1.2), the scale-space representation at a certain scale t can be computed as

$$\nabla^2_{\mathrm{norm}} L(t) = \frac{(I * G_x(k\sqrt{t})) * G_y(k\sqrt{t}) - (I * G_x(\sqrt{t})) * G_y(\sqrt{t})}{k - 1}$$

This means that the Difference of Gaussians approximation can be used to detect bright blobs; the constant $(k-1)$ can be neglected since the actual value of a peak is irrelevant. The value of k is set so that it approximates the Laplacian as well as possible. Crowley and Riff [9] determined this factor to be k = 1.7, which is also the value used in Figure 2.2.
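As an illustration of eqs. (2.1.2) and (2.1.3), the following MATLAB sketch computes the DoG response with separable 1-D convolutions. It is a sketch under assumptions made here (kernel truncation at three standard deviations; the function name dogResponse is invented for illustration), not the thesis implementation.

    % Sketch: DoG approximation of the scale-normalized Laplacian at variance t.
    function response = dogResponse(I, t, k)
        s1 = sqrt(t); s2 = k * sqrt(t);
        g1 = fspecial('gaussian', [1, 2*ceil(3*s1)+1], s1);  % 1-D Gaussian, std s1
        g2 = fspecial('gaussian', [1, 2*ceil(3*s2)+1], s2);  % 1-D Gaussian, std s2
        % Separable convolution: rows first, then columns (eq. 2.1.2).
        L1 = conv2(conv2(I, g1, 'same'), g1', 'same');
        L2 = conv2(conv2(I, g2, 'same'), g2', 'same');
        response = L2 - L1;  % the constant (k - 1) is neglected, as in the text
    end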

2.2 Circular Hough transform

Since the outline of a ball is circular, at least in most cases, a sound approach is to look for circular shapes in an image. The idea of the circular Hough transform is to transform pixels with a certain property, such as high gradient magnitude, into a three-dimensional circle space defined by circle positions (x, y) and radii r. Peaks in this circle space indicate possible circles or circle arcs. As mentioned in Section 1.2.1, this method has commonly been used with good results, even for partly occluded balls [17, 35].

There are many parametrizations of the circular Hough transform; the one below was presented by Chen and Chung [8]. For a pixel (a, b) to lie on a circle of radius r centered at (x, y), the following holds:

$$(x - a)^2 + (y - b)^2 = r^2 \qquad (2.2.1)$$

This means that a point (a, b) might, for each considered radius r, come from a circle arbitrarily centered at any (x, y) fulfilling (2.2.1). Based on this, a so-called accumulator space can be generated; this space is a 3-dimensional matrix storing the votes for each possible circle center (two dimensions) and the radius of the circle (third dimension). If only one radius is considered, the accumulator space is generated so that for each point of interest (a, b), each pair (x, y) fulfilling eq. 2.2.1 gets a vote.


This gives a 2-dimensional space for each circle radius, and therefore a 3-dimensional space when a collection of radii is considered. Geometrically, this means that for a given pixel (a, b), all possible circles can be represented by a cone-shaped surface in the (a, b, r) parametrized accumulator space. After all pixels have been processed into this space, potential circles will be represented by peak values in the (a, b, r) space.

If the circle size is unknown, the parameter space becomes large, which gives a higher computational cost. There are several methods to make it faster. Davies [11] notes that one can use the gradient direction to find the direction to the circle center. That is, since the change in intensity is largest in the direction perpendicular to the circle contour, this direction will also have the highest gradient. By exploiting this direction, a line covering all possible circle radii can be drawn. If the gradient direction is accurate, this line will pass through the true circle center, and intersections of such lines will then indicate circle center candidates. Johansson et al. [27] note that by using the double gradient angle, the duality between the two possible circle centers for an edge pixel is removed and the search space hence becomes smaller. It is also possible to define a 2D filter kernel which the image is convolved with, which gives a significantly lower computational cost [1]. This is the method used in this work, see Section 3.3. It combines the idea of gradient direction with a phase coding approach for detection of circles of multiple radii in a single filter operation, a combination which gives higher noise tolerance and size precision than when the methods are used separately.
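The basic voting scheme of eq. (2.2.1) can be illustrated with the following MATLAB sketch for a single radius. It is a simplified illustration (the function name and the number of sampled directions are chosen here), not the filter-based method [1] actually used in this work.

    % Sketch: circular Hough voting for one radius r.
    % edges is assumed to be a logical edge map, e.g. from edge(I, 'canny').
    function acc = houghCircle(edges, r)
        [rows, cols] = size(edges);
        acc = zeros(rows, cols);           % accumulator over circle centers
        [b, a] = find(edges);              % edge pixel coordinates (row, column)
        theta = linspace(0, 2*pi, 64);     % sampled directions around each pixel
        for i = 1:numel(a)
            % Candidate centers lie on a circle of radius r around (a, b).
            x = round(a(i) + r * cos(theta));
            y = round(b(i) + r * sin(theta));
            ok = x >= 1 & x <= cols & y >= 1 & y <= rows;
            idx = sub2ind([rows, cols], y(ok), x(ok));
            for j = 1:numel(idx)
                acc(idx(j)) = acc(idx(j)) + 1;  % one vote per candidate center
            end
        end
    end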

2.3 Interlacing

In TV broadcasting, it is desirable to have a high frame rate without using too much bandwidth. For this reason, a common technique is so-called interlacing. This means that only every second scan line (horizontal pixel line) is updated in each frame, or, in other words, that each pixel is updated only in every second frame. This makes it possible to double the frame rate without needing higher bandwidth [16].

When an object, like a golf ball, is moving fast in the image, interlacing becomes a problem. Since pixels in the two sets of scan lines (odd and even) in the same image are sampled at different times, it will seem as if the image is multi-exposed (see Figure 3.2b). Obviously, some kind of preprocessing is needed to remove this phenomenon, which otherwise may confuse the peak detection functionality. An overview of such de-interlacing techniques can be found in [13]. As will be seen later, the choice of de-interlacing technique is not crucial for the performance of the system; therefore only the most basic techniques are presented in this section.

The most naive solution to de-interlace an image is to simply duplicate one of the sets, so that

$$p'(i) = \begin{cases} p(i) & \text{if } i \text{ is even} \\ p(i+1) & \text{if } i \text{ is odd} \end{cases} \qquad (2.3.1)$$

if the set of even lines is used, and analogously

$$p'(i) = \begin{cases} p(i-1) & \text{if } i \text{ is even} \\ p(i) & \text{if } i \text{ is odd} \end{cases} \qquad (2.3.2)$$

for odd lines. Here, p(i) denotes the ith line of the input image p, and p' denotes the de-interlaced image. Although this technique ignores half of the available data, it is fast and it roughly preserves the size of objects.

A somewhat more advanced solution is to replace every second line with the mean value of the upper and lower rows. This means that

$$p'(i) = \begin{cases} p(i) & \text{if } i \text{ is even} \\ \frac{p(i+1) + p(i-1)}{2} & \text{if } i \text{ is odd} \end{cases} \qquad (2.3.3)$$

with the same notation as above. Just like (2.3.1-2.3.2), (2.3.3) can also be applied the other way around, so that the odd scan lines are kept.
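A minimal MATLAB sketch of the two techniques, assuming a gray-scale frame img and the row parity used in eqs. (2.3.1) and (2.3.3) (the function name and argument convention are chosen here for illustration):

    % Sketch: basic de-interlacing, keeping the even scan lines.
    function out = deinterlace(img, method)
        out = img;
        for i = 3:2:size(img, 1) - 1   % odd scan lines (borders kept for brevity)
            switch method
                case 'repeat'          % line repetition, eq. (2.3.1)
                    out(i, :) = img(i + 1, :);
                case 'mean'            % mean fill, eq. (2.3.3)
                    out(i, :) = (img(i - 1, :) + img(i + 1, :)) / 2;
            end
        end
    end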

2.4 Kalman filters

The purpose of the size measurement is to determine the distance between the ball and the camera. Since the size measurement is under the influence of noise (for instance, sharpening filters make the edge of the ball rather ambiguous), there is a need to filter the measurements to get a more robust distance estimation by exploiting measurements from multiple frames. A widely used approach for this is the Kalman filter. Kalman filters are not used in the implementation for this project, but are mentioned in the discussions. A very general overview is therefore given here; a more theoretical overview can be found in [48]. For a thorough mathematical derivation of the Kalman filter, please refer to [47].

The Kalman filter keeps a belief of a state over time. At each time t, this belief is represented by the mean µ_t and covariance Σ_t, where µ_t is the most probable value of the state given the measurements. The covariance Σ_t represents the uncertainty of the belief; for very noisy measurements it will be larger than for measurements with lower noise [47].

The mean and covariance are updated in each timestep, based on a set of parameters. Amongst these are covariance matrices describing the noise of the measurements and of the system model. If the inputs to the model are available (for instance, the control signal to the motors controlling the position of a robot), the input data can also be integrated to improve the belief [47].
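As a minimal illustration of the predict-and-update cycle (not used in the implementation, as noted above), the following MATLAB sketch filters a sequence of noisy scalar measurements, for instance distance estimates. The constant-state model and the noise variances q and r are assumptions made for this example.

    % Sketch: 1-D Kalman filter for a scalar state with noisy measurements z.
    function [mu, sigma] = kalman1d(z, q, r)
        n = numel(z);
        mu = zeros(1, n); sigma = zeros(1, n);
        mu(1) = z(1); sigma(1) = r;       % initialize belief from first sample
        for t = 2:n
            s = sigma(t-1) + q;           % predict: uncertainty grows over time
            K = s / (s + r);              % Kalman gain
            mu(t) = mu(t-1) + K * (z(t) - mu(t-1));  % update with measurement
            sigma(t) = (1 - K) * s;       % reduced uncertainty after the update
        end
    end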

Chapter 3

Method

In this chapter, the proposed algorithm is presented in detail. An overview is given in Algorithm 1. The chapter begins with a description of the input data, followed by details of how the tracking window, which is used to increase both the efficiency and the performance of the method, is selected. The actual tracking, based on scale-space blob detection and the circular Hough transform, is then described in Section 3.3. Section 3.4 discusses the need for additional size measurements. Finally, the accuracy of the suggested distance estimation method is analyzed in Section 3.5.

Data: A set of consecutive frames from a movie stream
Result: An approximation of the ball size in each frame

initialization;
while an initial ball position isn't found do
    look for circles in the first frame using the circular Hough transform;
    if a significant peak wasn't found then
        discard the first frame;
    end
end
set initial window size;
for each frame do
    set minimum and maximum radius;
    compute ball movement in the last two frames;
    predict ball position from movement;
    set window size;
    duplicate odd or even scan lines;
    look for balls in the specified window;
end

Algorithm 1: Overview of the tracking algorithm


Figure 3.1: The figure shows an interlaced image of a golf ball with a relatively high vertical speed. The output images from three different de-interlacing techniques, indicated by the plot titles, show significant similarity.

3.1 Input data

For the experiments in Chapter 4, a handful of color video recordings from the US Masters tournament in 2013 were used. Images of size 1920x1080 pixels were captured from the original videos at a frame rate of 30 frames per second. Here, a few characteristics of the input data are given; a more detailed description of the image format is postponed to Section 4.1.

3.1.1 Interlacing

The concept of interlacing was introduced in Section 2.3, where it was also mentioned that more sophisticated methods for deinterlacing exist. The test data in this project is significantly interlaced, and hence deinterlacing is needed. As it turns out, however, the choice of deinterlacing technique is not that crucial. When applied to a heavily interlaced golf ball image, inspection of the output from some of the most commonly used methods (as shown in Figure 3.1) shows that the results differ very little. Therefore, only the basic approaches described in Section 2.3 are used in this project.


3.1.2 Frame types and video semantics

In the material, there are essentially four different types of frames, which pose different kinds of challenges for the tracking solution. Fortunately, since the main purpose of the method is to determine the distance of a golf shot, the most crucial part is to determine the ball size at the end of the flight. Since the ball speed is low or even zero at the very end, it is easy for the camera operator to produce an image with a large and centered ball, thus making the ball searching process much easier.

(a) Flying ball (b) Motion-blurred ball

(c) Large and sharp ball

Figure 3.2: Varying image conditions make the ball identification and tracking problem harder. The figure shows a few of these conditions.

The context where the method is to be used is tracking the ball in video sequences starting with the ball flying in the air and ending with the ball lying still on the ground (usually grass). This is challenging, since there are several phases in the sequence requiring varying tweaks of the methods.

1. A flying ball typically has the sky as background, which usually is brighter than the ball. Thus, it is insufficient to only look for bright blobs or circles (see Figure 3.2a).

2. In the transition between a high flying ball and the first bounce (after which the background typically is trees, grass and/or audience), the background is extremely motion blurred, and typically the ball is too. This motion blur causes artifacts which can confuse the tracking model (like in Figure 3.2b).

3. With the first ground contacts, the ball bounces, which makes its motion discontinuous and hard to predict.


(a) Sharpened image (b) Gradient image

Figure 3.3: Example of how a sharpening filter changes the appearance of the ball. Note the correspondence between the original image and the gradient image.

4. As the ball slows down, the motion blur gradually disappears for the ball since the camera movement follows it, but the background might be slightly blurred. Due to the lower ball speed, the ball is usually centered, at least horizontally, and relatively large (Figure 3.2c).

Of the problems above, (1) is dealt with by the peak selection functionality (see Section 3.3). (2) and (3) can be solved with a proper position and search space prediction, which is described in Section 3.2. When (1-3) have been taken into account, the system should have no problem handling (4).

3.1.3 Sharpening filter

Sharpening filters are typically used in TV broadcasting to enhance the visual appearance. Unfortunately, this often causes problems for image analysis, which also is the case in this project. The sharpening filter which is applied to the entire set of test data introduces a dark area around the contours of the ball (as in Figure 3.3a and Figure 1.2). This phenomenon actually eases the gradient-based circle detection (Hough transform, see Section 3.3), since the gradient image gets more distinct in the transition from dark contour to white ball. However, it makes it harder to detect the actual ball size, since the dark area might occupy some of the actual ball pixels.

3.2 Window selection

As will be described in Section 3.3, two methods are used for detecting peaks: the circular Hough transform and scale-space blob detection. A drawback with both methods is that neither by definition includes classification. That is, the methods can detect candidates where the probability of a ball is high, but can by no means identify whether the candidate actually is a ball or not without relying completely on threshold values. In this work, this problem is overcome by requiring a solid identification in the initialization step and then only looking at a small window in the next frame. The assumption that the ball should be in a specific region reduces the importance of good classification.


3.2.1 Predict ball position

A first step in selecting a window is to predict the ball position in the next frame. In the optimal case, where the two previous frames both were tracked successfully with ball positions $\vec{x}_{i-2}$ and $\vec{x}_{i-1}$, this is done by computing the difference in the x and y directions and assuming constant ball speed.

$$\Delta\vec{x}_i = \vec{x}_{i-1} - \vec{x}_{i-2} \qquad (3.2.1)$$

If $\vec{x}_{i-2}$ is unknown, which means that the ball was not tracked in the second to last frame, (3.2.1) cannot be computed. For simplicity, zero speed is then assumed.

$$\Delta\vec{x}_i = \vec{0} \qquad (3.2.2)$$

The resulting predicted position $\hat{\vec{x}}_i$ is then computed as

$$\hat{\vec{x}}_i = \vec{x}_{i-1} + \Delta\vec{x}_i \qquad (3.2.3)$$

If the ball was not tracked in the last frame, two hypotheses are weighted together: either the ball was missed in the last frame due to noise, or the ball is lost and the tracking therefore needs to be re-captured. The weighting function is given below, where $\vec{x}^c$ is the image center and $\hat{\vec{x}}_{i-1}$ is the position prediction from the last frame.

$$\Delta\vec{x}_i = k_{\mathrm{speed}} \cdot \Delta\vec{x}_{i-1} + k_{\mathrm{center}} \cdot (\vec{x}^c - \hat{\vec{x}}_{i-1}) \qquad (3.2.4)$$

Here, the first term attempts to predict the ball position, while the second term pushes the position towards the center of the image. The constants $k_{\mathrm{speed}}$ and $k_{\mathrm{center}}$ should typically be set within the interval [0, 1]. The position estimate is based on the position prediction for the last frame, as follows.

$$\hat{\vec{x}}_i = \hat{\vec{x}}_{i-1} + \Delta\vec{x}_i \qquad (3.2.5)$$
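The prediction logic of eqs. (3.2.1)-(3.2.5) can be summarized in the following MATLAB sketch. The variable names (x_prev, xhat_prev, etc.) are invented here; empty matrices mark frames where the ball was not tracked.

    % Sketch of the ball position prediction (eqs. 3.2.1-3.2.5).
    if ~isempty(x_prev) && ~isempty(x_pprev)
        dx = x_prev - x_pprev;                  % constant speed assumption (3.2.1)
        xhat = x_prev + dx;                     % predicted position (3.2.3)
    elseif ~isempty(x_prev)
        dx = [0; 0];                            % zero speed assumed (3.2.2)
        xhat = x_prev + dx;
    else
        % Ball lost: weight the old velocity against a pull towards the center.
        dx = k_speed * dx_prev + k_center * (xc - xhat_prev);  % (3.2.4)
        xhat = xhat_prev + dx;                                 % (3.2.5)
    end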

3.2.2 Predict ball radius

Next, the expected ball radius is computed. If the ball was tracked in the last frame with radius $r_{i-1}$, it is assumed that $r_i \in [r_{i-1} - k_{r_{\mathrm{offset}}},\ r_{i-1} + k_{r_{\mathrm{offset}}}]$, where $r_i$ is the radius in the frame to be analyzed. If $r_{i-1}$ is unknown, the search interval for the radius is expanded to $[k_{r_{\min}} \cdot r_{\min},\ k_{r_{\max}} \cdot r_{\max}]$, where $r_{\min}$ and $r_{\max}$ are the radius limits used for the previous frame. This means that if the ball is not tracked for several frames, the search interval grows larger and larger. In order not to look for balls of unrealistic size, the parameters $k_{r_{\mathrm{lowest}}}$ and $k_{r_{\mathrm{highest}}}$ specify the minimum and maximum allowed radii.


3.2.3 Window size

Finally, the window size is set. This is done in a control theory inspired manner, where the optimal window size is computed in a first step. After that, the window size is set based on the last window size and the difference between it and the optimal window size. If the ball was found in the last frame, the "optimal" window size $s_{\mathrm{opt}}$ is computed. If ball data from the previous frame was also captured, the optimal window size is as follows.

$$s_{\mathrm{opt}} = k_{s_{\mathrm{base}}} + k_{s_{\mathrm{speed}}} |\Delta\vec{x}_i| + r_{\max} \qquad (3.2.6)$$

Otherwise,

$$s_{\mathrm{opt}} = k_{s_{\mathrm{base}}} + k_{s_{\mathrm{avgspeed}}} + r_{\max} \qquad (3.2.7)$$

where $k_{s_{\mathrm{base}}}$ is a constant margin around the ball and $k_{s_{\mathrm{avgspeed}}}$ is the typical ball movement. The control-inspired window size $s_i$ is then calculated as below, based on the previous window size $s_{i-1}$.

$$s_i = s_{i-1} + k_{s_{\mathrm{gain}}} (s_{\mathrm{opt}} - s_{i-1}) \qquad (3.2.8)$$

If the ball was not found in the last frame, the window size is increased regardless of ball size (since it is unknown), as below.

$$\vec{s}_i = \vec{s}_{i-1} + \vec{k}_s \qquad (3.2.9)$$

where $\vec{s}_i = [s_i^x, s_i^y]^T$ is the width and height of the window, and $\vec{k}_s = [k_{s_x}, k_{s_y}]^T$ are constants. Note that the constants $k_{s_x}$ and $k_{s_y}$ do not have to be equal. In fact, letting $k_{s_y} > k_{s_x}$ is probably desirable to improve the tracking when the ball bounces, where the ball movement is discontinuous and therefore harder to predict.
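A compact MATLAB sketch of the window size update (eqs. 3.2.6-3.2.9), with the variable and constant names invented here for illustration:

    % Sketch of the control-inspired window size update.
    if ball_found
        if speed_known
            s_opt = k_s_base + k_s_speed * norm(dx) + r_max;  % (3.2.6)
        else
            s_opt = k_s_base + k_s_avgspeed + r_max;          % (3.2.7)
        end
        s = s_prev + k_s_gain * (s_opt - s_prev);             % (3.2.8)
    else
        s = s_prev + k_s_grow;   % grow the window while the ball is lost (3.2.9)
    end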

3.3 Peak detection

The core of the tracking method is to detect the ball or, since no method is completely error-free, to detect ball-like patches in an image. As seen in Section 1.2, a very common approach for ball detection is to use the circular Hough transform. Since the ball, when it is not in the sky, is brighter than its background, a blob detection technique based on scale-space theory has also been implemented and evaluated. The window prediction method, as described in the previous section, reduces the need for proper ball classification. Some kind of classification is, however, necessary, to be able to identify erroneous tracking. For both methods, such a classification takes the form of a threshold value which a peak has to exceed.

3.3.1 Pre-processing

To optimize the peak detection, the input image has to be prepared accordingly. Beyond deinterlacing (described in Section 3.1.1), two more preprocessing techniques were considered throughout this project.


As described in Section 3.1.2 and highlighted in Figure 3.2, the appearance of the ball is different depending on the background. A sound approach is therefore to look for a white blob/ball/circle when on the ground, and a dark one when the sky is the background. To do this automatically, a background subtraction operation is used, where the new image I' is computed as follows.

$$I'(x, y) = |I(x, y) - \mathrm{mean}(I)| \qquad (3.3.1)$$

This results in an image where the brightest pixels are those differing the most from the average intensity of the image; examples are given in Figure 3.4.

(a) Input (b) Output

(c) Input (d) Output

Figure 3.4: Background subtraction. The input image (a and c) is converted with (3.3.1), so that the pixels with the largest difference to the mean intensity become brightest. As shown in the figure, this approach produces a ball that is brighter than its background both when (a-b) the background is brighter than the ball and when (c-d) the ball is brighter than its background. Note: the first input image (a) is just a part of an image; the full image is very similar to Figure 3.2a.

Since the circular Hough transform is not dependent on solid shapes for its identification, but rather on the existence of full or partial circles, another preprocessing technique is possible. By its definition, the transform already uses the gradient magnitude as input data. However, if the gradient magnitude operation is applied twice, the circles get even more significant.


Formally, let $G_I = \mathrm{grad}(I)$ denote the gradient magnitude of the image I. $G_I$ is what a typical circular Hough transform is based on. If we instead use $G_{II} = \mathrm{grad}(G_I)$, the input image gets even clearer.
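Both preprocessing steps are one-liners in MATLAB; the sketch below assumes I is a gray-scale double image and uses imgradient from the Image Processing Toolbox.

    % Sketch of the two preprocessing techniques from this section.
    Ibg = abs(I - mean(I(:)));  % approximate background subtraction, eq. (3.3.1)
    GI  = imgradient(I);        % gradient magnitude
    GII = imgradient(GI);       % "double gradient" input for the Hough transform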

3.3.2 Circular Hough transform

For the circular Hough transform, the MATLAB function imfindcircles [36] was used. This function is based on the work by Atherton and Kerbyson [1], where the accumulator space is built by applying a filter with a kernel specified by a radius interval. The built-in phase coding method was used, which is described in detail in Atherton and Kerbyson [1].

The built-in sensitivity parameter is a threshold value $k_{cht_S} \in [0, 1]$, where a higher value gives more peaks. Due to limited documentation about its functionality, the choice of $k_{cht_S}$ was unfortunately quite ad hoc, but it is clearly stated in Chapter 4 which values were used.
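A typical call, with the radius interval from Section 3.2.2 (the exact arguments used in the thesis implementation are not given in the text, so this is an assumed usage):

    % Assumed usage of imfindcircles with the phase coding method.
    [centers, radii, strengths] = imfindcircles(GII, [r_min r_max], ...
        'Method', 'PhaseCode', 'Sensitivity', k_cht);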

3.3.3 Scale-space blob detection

In contrast to the circular Hough transform, the scale-space detector only operates on a discrete set of radii. Given the interval $[r_{\min}, r_{\max}]$, the following radii were used:

$$\left[\, r_{\min},\; r_{\min} + 0.3\Delta r,\; r_{\min} + 0.45\Delta r,\; r_{\min} + 0.55\Delta r,\; r_{\min} + 0.7\Delta r,\; r_{\min} + \Delta r = r_{\max} \,\right]$$

where $\Delta r = r_{\max} - r_{\min}$. Note that if the ball size does not change between two frames, then $r_{\min} + 0.5\Delta r$ is the most likely radius and should therefore arguably be included. However, since the size measurement is subject to noise, the choice not to include $r_{\min} + 0.5\Delta r$ potentially increases the accuracy by not letting the radius approximation stay at the same, potentially erroneous, level. Further on, since $\Delta r$ is constant as long as all frames are tracked (according to Section 3.2.2), an estimated radius differing $0.05\Delta r$ from the true radius would in the next frame get adjusted to the true value. The choice of radii does therefore not introduce any systematic errors; instead it increases the resolution of the radius detection. Using a higher number of radii would of course increase the accuracy, but also increase the computational cost accordingly.

The observant reader may have noticed that the previous discussions concerning scale-space only have mentioned variance, not radius. There is of course a relation between radius and variance, though not as direct as for the circular Hough transform. The theory states that if σ describes the size of the blob, then the peak scale $\hat{\sigma}$ is proportional to the blob size. That is,

$$\hat{\sigma} = C\sigma \qquad (3.3.2)$$

[32]. The constant C remains unknown, however, and might vary depending on the intensity profiles in the blob neighborhood. This means that scale-space by definition cannot give an accurate size estimation, only an approximation with much lower accuracy than the circular Hough transform.


By numeric simulation (see Appendix B), the following approximative relationship between circle radius r and scale-space variance t was determined.

$$r \approx 2\sqrt{t} \iff t \approx \frac{r^2}{4} \qquad (3.3.3)$$

Note that this relationship is only approximative, as the constant C depends on the intensities in the actual blob and in its neighborhood. The test procedure is described in detail in Appendix B.

From a mathematical point of view, the scale-space representation is a 3rd order tensor, which is equivalent to a 3-dimensional array. To only consider local maxima in the scale-space, the entire scale-space tensor is dilated with a 3-by-3-by-3 tensor consisting of ones, except for the very center element which is a 0. This idea is described by Davies [12, p. 188]. The dilation ensures that the considered points are local maxima both within the current scale and compared to the previous and next scales.

Even with dilation, the peaks will usually be numerous. To always pick the strongest peak and consider it the true ball is problematic, because then even an empty image containing only some noise will be classified as containing a ball. To solve this, a certain significance of the peak, compared to the other peaks, is required. This significance is, as for imfindcircles, controlled by a sensitivity parameter $k_{scale_S}$. Given a set of peaks $p_1 \ldots p_N$, a peak $p_i$ is let through only if

$$p_i > \frac{100}{N k_{scale_S}} \sum_{j=1}^{N} p_j$$

where the constant 100 is used solely to reduce the number of decimals in $k_{scale_S}$. In words: a peak has to have a value significantly higher than the mean peak value.
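The dilation-based maxima extraction and the significance test can be sketched in MATLAB as follows, for a scale-space response tensor R of size rows x cols x scales (the tensor name and the variable k_scale are chosen here for illustration):

    % Sketch: local maxima in the scale-space tensor plus significance test.
    se = ones(3, 3, 3); se(2, 2, 2) = 0;  % 3x3x3 neighborhood, center excluded
    localmax = R > imdilate(R, se);       % strict local maxima in space and scale
    peaks = R(localmax);
    N = numel(peaks);
    % Keep only peaks significantly above the mean peak value.
    threshold = (100 / (N * k_scale)) * sum(peaks);
    significant = localmax & (R > threshold);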

3.3.4 Tracking specifics

It might be the case that no peak is detected within the specified window, even though the window contains the ball. If so, additional attempts at detecting the ball can be made. These are stated in the list below, including the abbreviations (in italics) later used when the tracking performance is analyzed.

• Change preprocessing. If no peaks are found when using one set of scan lines for deinterlacing, an extra attempt can be made using the other set of lines (odd if even lines were used etc.) (Interlacing). The preprocessing techniques described in Section 3.3.1 can also be used, that is, the background subtraction method (Background) and/or computing the gradient magnitude twice (Double gradient).

• Higher sensitivity. That no peaks are found can be caused by too low sensitivity parameters ($k_{scale_S}$ or $k_{cht_S}$). By increasing them, the number of peaks might increase (Threshold).


For the case when multiple peaks are found, both the circular Hough transform and the scale-space detector give a peak value which can be used to find the strongest one. However, due to noise etc., it is not certain that the strongest peak is the ball. For this case, there are a few possible methods as well.

• Centered peak. Assuming that the predicted ball position is accurate, a peak close to this position may be a better candidate than one close to the window border. In particular, if the strongest peak also is the most centered one, that peak is probably the correct one (Center).

• Double peak detection. If one peak detection method cannot produce a single peak, the other one might (Scale-space, Hough transform). Beyond that, a heuristic is used where the strongest peak from "the other" method is computed. The distance between that peak and each of the multiple peaks is then computed. If the peak closest to the "extra peak" also is the most centered one, this peak is considered the ball position (Scale-space center, Hough center).

If neither of these actions succeeds in finding a unique peak, the tracking for the current frame fails.

3.4 Size estimation

Both the scale-space detector and the circular Hough transform provide a size estimate of the detected circle(s), which depending on parameters can be more or less accurate. A natural step would be to do a more detailed size estimation within the candidate region, for instance with gray-level segmentation. However, as described in a previous chapter, it is nontrivial to find a good threshold value from a noisy histogram. Other techniques, such as active contour models, could also have been used, but their computational cost could limit the possibilities for real-time tracking. Additionally, as described in Section 3.1.3, the dark contour introduced by the sharpening filter might influence which pixels are considered to belong to the edge of the ball, thus introducing significant noise.

Since a more detailed size estimation, which possibly could reduce the expected error to sub-pixel level, both increases the computational cost and makes a negligible contribution to the size estimation compared to the noise, no such methods will be used. What typically would have been used in practice is a Kalman filter, which keeps track of the probability distribution of the estimated distance based on noisy size estimations. However, in order not to bias the experiments, not even such a solution will be considered here.

3.5 Analysis of expected accuracy

Due to the sampled nature of an image (an image is not a continuous signal, but consists of pixels), the accuracy is finite. This is discussed here. The argumentation

24 3.5. ANALYSIS OF EXPECTED ACCURACY only considers the X-Z plane, which means that the ball height or sensor height not is taken into consideration, but that does not pose a limitation since the ball is spherical. Let R denote the constant physical radius of a golf ball, d the distance between the camera and the ball, w the x-directional ball radius in the image and f the camera focal length measured in pixels. This gives the following expression for the distance.

$$\frac{R}{d} = \frac{w}{f} \iff d = \frac{Rf}{w}$$

Now, let $e_w$ denote the error in the radius estimation. This yields

$$d(e_w) = \frac{Rf}{w + e_w}$$

and hence the distance error $e_d$

$$e_d(e_w) = |d(0) - d(e_w)| = \left| \frac{Rf}{w} - \frac{Rf}{w + e_w} \right| = \left| \frac{Rf\, e_w}{w(w + e_w)} \right| \qquad (3.5.1)$$

If $e_w = \hat{w} - w$, where $\hat{w}$ is the estimated ball radius, the assumption that $\hat{w} > 0$ gives that $\hat{w} = w + \hat{w} - w = w + e_w > 0$. (3.5.1) can therefore be rewritten as

$$e_d = \frac{Rf}{w(w + e_w)}\, |e_w|$$

As it turns out, the relative error, denoted $q_e$, is independent of distance and focal length, as shown below.

$$q_e(w, e_w) = \frac{|d(0) - d(e_w)|}{d(0)} = \frac{e_d}{d(0)} = \frac{\dfrac{Rf}{w(w + e_w)}\, |e_w|}{\dfrac{Rf}{w}} = \frac{|e_w|}{w + e_w}$$

A typical ball radius is w = 25 px; allowing $e_w = 1$ px, this gives

$$q_e(25, 1) = \frac{1}{25 + 1} \approx 4\,\%$$

A plot of the distance error against the true ball radius, for an error of 1 pixel, is provided in Figure 3.5.
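To make the numbers concrete, the following MATLAB snippet evaluates the distance estimate and the relative error; the focal length value is an assumption chosen only for illustration.

    % Worked example of the distance estimate and its sensitivity.
    R = 42.67e-3 / 2;   % physical ball radius in meters (rules of golf [42])
    f = 5000;           % focal length in pixels (assumed value)
    w = 25;             % measured projective ball radius in pixels
    d = R * f / w;      % distance estimate, here about 4.3 m
    q = 1 / (w + 1);    % relative distance error for a 1 px radius error, ~4 %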



Figure 3.5: Distance error (y axis, in percent) as a function of the true ball radius (x axis, in pixels), for a 1 pixel error between estimated and true ball radius.

Chapter 4

Experiments

Since two methods for peak detection have been used, the performance of both of them is presented in this chapter. Beyond the choice of method, there are also numerous possibilities for tweaking the method, including the weighting factors for the position and size of the tracking window, the different preprocessing methods related to interlacing, and the threshold values for peak selection. The window parameters were held constant during the experiments, to put more attention on the peak detection performance and the choice of preprocessing. The results of the experiments are given in this chapter, whereafter they are summarized in Chapter 5.

The small quantity of test cases makes it in many cases infeasible to present quantitative numbers on the overall performance. In those cases, data is given for each individual test case.

4.1 Test data analysis

Two sets of data were used. As described in Section 3.1, the first data set consists of a few sequences captured during the 2013 US Masters tournament. The video files are encoded with the lossy DNxHD codec, with an image size of 1920x1080 pixels and a frame rate of 29.97 frames/second. The individual frames were captured with FFmpeg [19]. Additionally, a data set from the 2014 Arnold Palmer Invitational competition in Bay Hill, Florida, was also used. This set consists of raw 1920x1080 bitmap images combined with the focal length of each frame, at the same frame rate as above. With a known sensor size corresponding to 5 µm/pixel, the distance between the camera and the ball can be estimated. Still, this data contains no ground truth about the actual distance between ball and camera. How this lack of information is treated is described in Section 4.5, where the accuracy of the distance estimations is discussed. The set of videos is characterized by:

• Various weather conditions


• Various camera filters

• Different zoom behavior, varying from a very zoomed-in and centered ball to a small and occasionally occluded ball

• Audience both present and absent

There are of course additional, more subtle, variations between the videos than those mentioned here. Furthermore, the test data does not contain all combinations of the bullets above; more test cases would be needed to give any general results.
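As mentioned above, the per-frame focal length together with the 5 µm/pixel sensor size allows the distance to be estimated from the projective ball radius via d = Rf/w (Section 3.5). The following is a minimal sketch of that conversion, under stated assumptions: the variable names are illustrative, and R = 21.335 mm is the regulation golf ball radius (half of the 42.67 mm minimum diameter [42]).

```matlab
% Sketch: distance from projective ball size, d = R*f / w, with the focal
% length converted from mm to pixels using the 5 um/pixel sensor pitch.
R     = 21.335e-3;               % golf ball radius (m)
pitch = 5e-6;                    % sensor size per pixel (m/pixel)
f_mm  = 100;                     % example focal length for one frame (mm)
w     = 5;                       % estimated projective ball radius (pixels)
f_px  = (f_mm * 1e-3) / pitch;   % focal length in pixels
d     = R * f_px / w;            % estimated camera-ball distance (m)
fprintf('Estimated distance: %.1f m\n', d);
```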

4.2 Initialization

For initialization of the tracking, as well as for recovery when the ball is lost, it is important that the ball can be correctly identified in as many full frames as possible (which is different from ball detection in the tracking mode, where the search space is significantly smaller). Both of the peak detection methods can be used for initialization; data from both of them is therefore presented. Neither scale-space based blob detection nor the circular Hough transform has any built-in classification; instead they produce a set of peaks which can then be filtered with a threshold value. To make the statistical measurements independent of such a threshold value, the strongest peak was considered the answer from the method regardless of peak value. In that way, an answer was forced from the method, leaving no possibility for "no-ball" answers. This is possible since the ball is at least partially visible in all frames of the test sequences. The benchmarked sequences from the first data set were used, preprocessed with line repetition deinterlacing using Eq. (2.3.2). A ball radius between 5 and 30 pixels was assumed, and the distance between the strongest peak and the true ball position was measured, using the sensitivity values kchts = 0.98 and kscales = 1. Table 4.1 shows the rate of frames where the strongest peak was found within 10 pixels of the manually classified ball position. The results show that the double gradient magnitude preprocessing is necessary, since the rates for the circular Hough transform are significantly higher when double gradient magnitude was used. Concerning background subtraction, no clear improvement can be concluded for the circular Hough transform. Nor are the results for background subtraction with scale-space entirely consistent, but especially the rates for Set 1.1 and Set 1.5 are significantly higher when using background subtraction. Therefore, background subtraction should be used in combination with scale-space blob detection.
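The forced-answer evaluation described above can be illustrated with MATLAB's built-in circular Hough transform, imfindcircles [36]. This is a hedged sketch only, not the thesis implementation (which adds the double gradient magnitude preprocessing); the frame img and the manually classified ground-truth position gt are assumed inputs.

```matlab
% Sketch: forced strongest-peak initialization for one frame.
% A detection counts as correct if the strongest peak lies within
% 10 px of the manually classified ball position gt (1x2, [x y]).
[centers, ~, metric] = imfindcircles(img, [5 30], ...
    'Sensitivity', 0.98);              % kchts = 0.98, radii 5-30 px
if isempty(centers)
    correct = false;                   % should not occur with forced answers
else
    [~, idx] = max(metric);            % strongest peak, regardless of value
    correct  = norm(centers(idx, :) - gt) <= 10;
end
```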

4.3 Choice of deinterlacing technique

Line repetition (see Eq. (2.3.2)) was used for the initialization experiments above. In Section 3.1.1 it was concluded that various deinterlacing techniques produce similar output.


Method                                              Set 1.1   Set 1.2   Set 1.3   Set 1.4   Set 1.5
Pure circular Hough transform                         2.2 %    0.32 %    31.7 %    0.51 %    58.3 %
Circular Hough transform with double magnitude       59.1 %    36.7 %    40.3 %    10.3 %    80.3 %
Circular Hough transform with background
  subtraction and double magnitude                   57.7 %    39.0 %    42.3 %     9.5 %    81.4 %
Scale-space blob detector                            38.7 %    40.0 %    34.0 %     8.7 %    67.6 %
Scale-space w. background subtraction                54.9 %    38.3 %    38.3 %     5.6 %    82.5 %

Table 4.1: Initialization results. The table shows the rate of images where the strongest peak for each method was within a 10 px range of the manually classified ball position. The test sets are described in detail in Appendix A.

Hence, the choice of deinterlacing technique should not be relevant. To verify this, the initialization procedure was repeated for Set 2 (on which most methods performed well) and Set 4 (where the classification rates were quite low) with the mean fill method (see Eq. (2.3.3)). Arguably, the peak selection method that should be most sensitive to the choice of deinterlacing technique is the circular Hough transform, where a "softer" deinterlacing such as mean fill would give a more circular ball, which would then be easier to detect. Therefore, only the circular Hough transform was used in the comparison, with and without double gradient magnitude. The results, in Table 4.2, show no significant difference between the two deinterlacing techniques. The performance of the techniques is almost equal, and the simpler line repetition method is therefore a reasonable choice.

Method                                    Line rep.   Mean fill
Test set 2, double gradient magnitude      36.74 %     39.62 %
Test set 4, double gradient magnitude      10.25 %     10.0 %
Test set 4, clean                           0.51 %      0.26 %

Table 4.2: Comparison between using line repetition and mean fill for deinterlacing.
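For reference, the two deinterlacing techniques compared above are simple enough to sketch in a few lines. The sketch follows the idea of Eqs. (2.3.2) and (2.3.3): line repetition copies the known neighboring scan line into the missing one, and mean fill averages the two neighboring known lines. The exact equations are given in Section 2.3; the indexing convention below (odd lines valid) is an assumption for illustration.

```matlab
% Sketch: fill in the missing (here: even-indexed) scan lines of a field.
% Assumes a double-valued image 'img' where only the odd lines are valid.
lineRep  = img;
meanFill = img;
for y = 2:2:size(img, 1) - 1
    lineRep(y, :)  = img(y - 1, :);                       % line repetition
    meanFill(y, :) = (img(y - 1, :) + img(y + 1, :)) / 2; % mean fill
end
```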

4.4 Tracking

As a second test, the performance of the tracking is analyzed. Since the circular Hough transform was shown in the previous section to outperform scale-space in global peak detection, the circular Hough transform is used as the primary peak detection method. Scale-space is however used in the additional tracking methods described in Section 3.3.4, both for finding a unique peak where the Hough transform fails, and for the scale-space center heuristic.


The purpose of the tracking is to track the ball position in as many frames as possible, in order to determine the ball size. Additionally, it should be able to predict the ball position even if the tracking fails for one or a few frames. With this in mind, it is clear that the tracking heavily depends on the initialization, which makes it nontrivial to define a proper evaluation metric for the tracking. Therefore, both a qualitative and a quantitative evaluation is presented. The performance for each of the ten test sequences is discussed briefly, combined with statistics for the parts of the sequences that were successfully tracked. The tracking statistics give data on how often certain methods (see Tracking specifics, Section 3.3.4) were used for locating the ball. From this data, conclusions can be drawn on their respective performance. The sequences with focal data are also used in Section 4.5. Throughout the description of the tracking algorithm, many parameters were introduced. The parameter values that were used to produce the results in this section are as follows:

• krmin = 0.9, krmax = 1.1 - specify how the radius search interval is expanded when the ball was not found in the previous frame

• kroffset = 2 - search interval offset for the ball radius (the search interval is [ri−1 − kroffset, ri−1 + kroffset] if the radius ri−1 is available from the previous frame)

• krlowest = 2, krhighest = 60 - radius saturation values

• ksbase = 20 - base size of the window

• ksspeed = 1.5 - window size factor depending on estimated speed

• ksavgspeed = 8 - assumed speed for untracked ball

• ksgain = 0.5 - weight factor between optimal and previous window size

• ksy = 10, ksx = 5 - added to the previous window size if the ball was not tracked in the previous frame

Additionally, the sensitivity value for the circular Hough transform was set to kchts = 0.92 as default, and kchts = 0.95 as the higher sensitivity value. These values are close, but nevertheless make a big difference in the number of peaks detected. For the scale-space peak detection, kscales = 0.05 and kscales = 0.1 were used for single-peak and centered-peak detection respectively. For a more detailed description of the parameters, please refer to Section 3.2.
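For readability, the parameter values above can be collected in a single configuration structure. This is an illustrative sketch only; the field names are chosen here and do not necessarily match the thesis code.

```matlab
% Tracking parameters used for the experiments in this section
% (values from the list above; field names are illustrative).
params.krmin      = 0.9;  params.krmax     = 1.1;  % radius interval expansion
params.kroffset   = 2;                             % radius search offset (px)
params.krlowest   = 2;    params.krhighest = 60;   % radius saturation (px)
params.ksbase     = 20;                            % base window size
params.ksspeed    = 1.5;                           % window size vs. speed
params.ksavgspeed = 8;                             % assumed speed, untracked ball
params.ksgain     = 0.5;                           % optimal vs. previous window
params.ksy        = 10;   params.ksx       = 5;    % growth when ball untracked
params.kchts      = [0.92 0.95];                   % CHT sensitivity (default, high)
params.kscales    = [0.05 0.1];                    % scale-space thresholds
```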

Test set 1.1 The first 170 frames, whereof 120 have the sky in the background, were classified correctly. A few frames after the first bounce, however, the ball is both small and motion-blurred (see Figure 4.1), and the ball was therefore lost. After a manual restart of the tracking (including re-initialization) 20 frames earlier, even the problematic section was passed without losing track of the ball. The tracking statistics for the second half of the sequence are presented in Table 4.3.


Figure 4.1: The ball was not identified in any of the first three images during tracking of Set 1.1, and the ball was therefore lost starting from the fourth frame of this figure.

Test set 1.2 Just like for Test set 1.1, the first frames are sky images, with excellent tracking. As the ball lands, it is heavily motion-blurred. Additionally, the ball is partly in shadow, in front of a sunlit audience. This combination confuses the tracker, which as a result tracks humans instead of the ball. Re-initialization does not work either, due to the weather conditions. See Figure 4.2 for a sample image.

Test set 1.3 In this test set, the ball was tracked initially, but due to motion blur and partial occlusion, the ball was lost after 60 frames. However, since the tracking window is forced to the center of the image, the ball was re-captured and tracked during the last 210 frames.

Test set 1.4 The tracking worked fine for 40 frames, whereafter the ball was lost due to a very small ball size (Figure 4.3). During the rest of the sequence, the ball size remains < 5 pixels; other obstacles in the scene (players, a house etc.) prevented re-initialization from succeeding.


Figure 4.2: The ball is barely visible when it is both motion-blurred and shaded, while the audience is sunlit. Frame from Test set 1.2.

Figure 4.3: The ball is very small and indistinct compared to the obstacles in the background. Frame from Test set 1.4.


Figure 4.4: Frame from Test set 2.1, where the ball is smaller than many of the markings in the grass.

Test set 1.5 Initialized in the sky, this tracking worked well until the first bounce, where motion blur and discontinuous movement caused the ball to be lost after 115 frames. However, 50 frames later the ball was re-captured due to the window movement towards the image center, whereafter the ball was tracked until the end of the sequence.

Test set 2.1 The ball was tracked during the first 115 frames. Then, bright marks in the grass (see Figure 4.4) confused the tracking. Re-initialization failed until the last 50 frames, which were correctly tracked.

Test set 2.2 The entire sequence was tracked, except for a few single frames.

Test set 2.3 Except for 25 frames in the middle of the sequence, the entire sequence was tracked. The data in Table 4.3 includes these frames, counted as not classified.

Test set 2.4 The first 25 frames were tracked. After that, re-initializations enabled tracking of 10-25 frames at a time in a few subsequences, with failures for similar reasons as for Test set 2.1. After re-initialization at frame 160, the rest of the frames could be tracked.

Test set 2.5 The entire sequence was tracked, except for three single frames.

The results for the entire evaluation procedure of the tracking algorithm are presented in Table 4.3.


Test set          Classified  Instant  Interlace  Thresh.  Backgr.  Scale-sp.  Center  S-s cent.
1.1 (pt 2)           93.2 %    84.8 %     0.4 %     0.4 %    0.0 %     0.0 %     6.8 %    0.8 %
1.2                  39.6 %    39.0 %     0.0 %     0.0 %    0.0 %     0.0 %     0.0 %    0.0 %
1.3 (full)           64.9 %    28.9 %     7.7 %    15.4 %    5.1 %     0.3 %     5.4 %    2.0 %
1.3 (first 30)      100.0 %    46.7 %     0.0 %    46.7 %    6.7 %     0.0 %     0.0 %    0.0 %
1.5 (full)           84.2 %    47.5 %     8.8 %    22.9 %    0.8 %     0.0 %     3.4 %    0.8 %
1.5 (first 115)      88.7 %    73.0 %     0.0 %    10.4 %    0.9 %     0.0 %     3.5 %    0.9 %
2.1 (full)           53.3 %    34.7 %     6.0 %     8.7 %    0.0 %     0.0 %     2.7 %    1.3 %
2.1 (first 110)      99.1 %    70.9 %     9.1 %    10.9 %    0.0 %     0.0 %     4.5 %    3.6 %
2.1 (last 50)        92.0 %    46.0 %    16.0 %    26.0 %    0.0 %     0.0 %     4.0 %    0.0 %
2.2                  97.7 %    60.3 %     8.0 %    16.0 %    3.0 %     1.0 %     8.7 %    0.7 %
2.3                  80.7 %    65.4 %     0.0 %     9.2 %    1.7 %     0.0 %     3.7 %    0.7 %
2.4 (full)           53.1 %    33.4 %     5.6 %     6.2 %    2.1 %     0.0 %     5.9 %    0.0 %
2.4 (last 182)       99.5 %    62.6 %    10.4 %    11.5 %    3.8 %     0.0 %    11.0 %    0.0 %
2.5                  99.0 %    85.7 %     1.6 %     7.3 %    0.0 %     0.0 %     4.1 %    0.3 %

Table 4.3: Tracking statistics. The table shows statistics for each test set, or for subsets of these sets in cases where the tracking was inconsistent through the sequence. The rates describe in how many frames the ball was correctly tracked (Classified), and which method was used to achieve this tracking. Methods (detailed descriptions are given in Section 3.3.4): circular Hough transform with double gradient (Instant), deinterlacing based on the other set of scan lines (Interlace), higher sensitivity (Thresh.), background subtraction (Backgr.), single scale-space peak (Scale-sp.), centered peak (Center) and centered scale-space (S-s cent.). For the single scale-space peak method, if scale-space blob detection gives a single peak, that peak is considered to be the ball.

The performance varies considerably across the test sequences, but a few tendencies should be noted. First, for almost all sequences, over 80% of the frames were classified correctly. Note however that this rate is lower for sequences where the ball was lost, for instance Test set 1.2. Note also that Test set 1.4 was not included at all, since the tracking failed as soon as the background changed from sky to trees and grass. Furthermore, the sensitivity value for the circular Hough transform is of high importance, since about 10% of the frames are detected with the higher sensitivity. However, this does not mean that the sensitivity should be set high initially, since multiple peaks in this method are treated with heuristics based on peak position relative to the predicted ball position. Since the predicted ball position by definition is only an approximation, the accuracy of the heuristic is also limited. Among the preprocessing techniques, both alternative deinterlacing and the background subtraction method were useful for many frames. In particular, this is the case for the second test batch (Test set 2.1-2.5), which also is subject to interlacing

to a larger extent than the first batch. The low rates for single scale-space peak classification ("Scale-sp.") might have several reasons. Early in the sequences, where the ball is large and sharp, it is usually captured earlier by one of the Hough transform methods. Another explanation is that the circular Hough transform in fact performs better than scale-space blob detection (see Section 4.2); the chances that scale-space catches something that the Hough transform missed are therefore small. Finally, the two methods for treating multiple peaks are used in some cases. In particular, using the strongest peak if it also is centered turns out to work well. The scale-space centering method was barely used, most likely because the first centering method catches many of the cases that scale-space centering would otherwise have caught.

4.5 Distance estimation

The core of the project is to make accurate distance estimations. As mentioned above, no test data with ground truth is available, which makes it problematic to evaluate the accuracy. Consequently, there is no way to measure the actual error. Nevertheless, since images by their nature contain noise (for instance, an image is sampled, which means that the accuracy is limited), there will be an error in the distance estimations. Even without a ground truth, the characteristics of the error can be studied. For the error analysis, two kinds of measurements are used. The first is to use sequences where the ball is lying completely still, but with the focal length gradually changing. In such sequences, the estimated distance should stay unchanged regardless of focal length. As a second measurement, quadratic regression is used to create a synthetic ground truth for sequences where the ball is moving. That is, a second-order curve is fitted to the distance estimations for the five previous and five future frames, after which the residual for the current frame is computed. This residual gives an indication of the accuracy of the distance estimations. The residual for timestep i can be written as

\[
e_{d,i} = \left| d_i - \hat{d}(d_{i-5}, \ldots, d_{i-1}, d_{i+1}, \ldots, d_{i+5}) \right| \tag{4.5.1}
\]

where $d_i$ is the distance estimation for frame i and $\hat{d}$ is the quadratic regression model. Obviously, this residual cannot detect any systematic bias. However, if such a bias exists in the radius measurement (so that the estimated radius always is too small or too large), it will show up in the data from the static ball sequences. Other kinds of bias, such as an erroneous focal length, can of course still affect the results, but such bias is not relevant for the evaluation of the proposed algorithm.
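The residual in Eq. (4.5.1) is straightforward to compute with a least-squares fit. The sketch below is illustrative only; it assumes a vector d of per-frame distance estimations and evaluates the residual for one interior frame i.

```matlab
% Sketch: leave-one-out quadratic regression residual, Eq. (4.5.1).
% 'd' is a row vector of distance estimations; 'i' an interior frame index.
idx  = [i-5:i-1, i+1:i+5];          % five previous and five future frames
p    = polyfit(idx, d(idx), 2);     % second-order fit, current frame excluded
ed_i = abs(d(i) - polyval(p, i));   % residual for frame i
```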


For both of the measurements, only data that actually was correctly classified is considered. In practice, the peak from each frame was manually classified as 'correct' or 'incorrect'; only the correct frames are included in the provided data.

Static ball performance The results of the distance estimations for a static ball are presented in Figure 4.5. Since only three test sequences were available (Test sets 2.6-2.8), the static ball performance should only be considered a sanity check. Apparently, the estimated distance stays on a more or less constant level throughout the sequences. By definition, the accuracy of the method is limited. According to Section 3.5, for a ball radius of 5 pixels, a 1 pixel error gives a distance error of 17%, corresponding to 17 meters at a 100 meter distance. For Figure 4.5a, the standard deviation lies within this approximated error, and the distance estimation therefore has to be considered accurate for the specific input. Figures 4.5b and 4.5c also verify that a larger ball radius reduces the error, and furthermore, the standard deviations here also lie within the approximated error for a 1 pixel error in the projective size measurement. Apart from the increasing noise for smaller ball radii, there is a tendency of overestimation, at least in Figures 4.5b and 4.5c, as the camera zooms out. Assuming that the focal length of the camera is accurate, the overestimated distances are caused by underestimated radii. This error is probably caused by the sharpening filter (see Section 3.1.3), and is hence outside the control of the proposed method. In practice, this static error has to be examined and compensated for.

[Figure 4.5 consists of three panels, each showing estimated distance (m), estimated ball radius (px) and focal length (mm) per frame.]

(a) Mean: 105.97 m, std: 11.91 m. (b) Mean: 83.41 m, std: 4.34 m. (c) Mean: 110.18 m, std: 16.70 m.

Figure 4.5: The figure shows the estimated distance, the estimated radius and the focal length for each frame in test sequences where the ball is lying still on grass. Note the correlation between distance estimation smoothness and ball radius: the larger the radius, the smoother the distance estimations. Additionally, as the camera zooms out, a tendency for overestimation is indicated in (b) and (c).

Moving ball performance As mentioned above, the lack of ground truth is a problem, since only the error behavior, not its actual value, can be analyzed. The results in Figure 4.6 once again confirm the correlation between ball radius and consistency in the distance estimations. This indicates that for sequences where the projective size of the ball is larger than in the sequences presented here, the accuracy can be expected to be high. A more detailed examination of the plots in Figure 4.6 reveals that the mean estimated distance is about 100 meters for most of the sequences. An explanation is that the sequences are captured from the same hole (which can be verified by looking at the sequences), and as the camera translation is zero, the distance between ball and camera should be similar between sequences. Still, since the estimated radii also are quite uniformly distributed among the sequences (around 5 pixels), it cannot be concluded whether the real distances actually are that uniform, or if this is caused by a systematic error.

[Figure 4.6 consists of six panels, each showing estimated distance (m) and estimated ball radius (px) per frame.]

(a) 2.1 (1-110). Mean error: 24.5 m, std. dev.: 26.3 m. (b) 2.2. Mean error: 8.7 m, std. dev.: 8.9 m. (c) 2.3 (1-110). Mean error: 12.7 m, std. dev.: 22.6 m. (d) 2.3 (151-275). Mean error: 10.3 m, std. dev.: 8.5 m. (e) 2.4 (161-342). Mean error: 14.0 m, std. dev.: 17.8 m. (f) 2.5. Mean error: 17.1 m, std. dev.: 20.1 m.

Figure 4.6: Distance estimations with estimated radius for sequences where the ball is flying/rolling. The mean and standard deviation of the error are based on a second-order regression model (see Eq. (4.5.1)). For some sequences, only a subset of the frames is used, since the rest of the sequence either was not tracked or contained a more or less static ball. For the latter case, the performance for the static ball sequences gives a clearer picture.


4.6 Computational efficiency

When it comes to performance in terms of computational efficiency, there are multiple factors that have to be taken into consideration. To begin with, a MATLAB implementation, as in this project, is typically much slower than an equivalent C++ implementation would be. Furthermore, if the functions with high computational cost were implemented on the GPU, which certainly would be the case for an industry implementation, the computational time would decrease significantly. For the circular Hough transform, Ruiz et al. [44] report impressive improvements, where the GPU implementation is between 25 and 358 times faster than a CPU implementation, depending on the input image content. Heymann et al. [23] compare a SIFT implementation1 on the CPU with a corresponding GPU implementation and report up to 10 times faster computation. All computations were made on a laptop with an Intel Core i5-3317U processor operating at 1.7 GHz, and 6.00 GB RAM. The MATLAB scripts were run with MATLAB R2013a, version 8.1.0.604.

Initialization For initialization, statistics are presented in Table 4.4.

                      a)     b)     c)     d)     e)
Mean                 3.04   4.13   4.09   3.90   3.92
Standard deviation   2.72   2.88   2.57   0.36   0.40
Max                  10.6   12.7   15.4   6.65   8.89

Table 4.4: Statistics over computational time (in seconds) for initialization with 1920x1080 input images. Methods: a) circular Hough transform; b) circular Hough transform with double gradient magnitude; c) circular Hough transform with double gradient magnitude and background subtraction; d) scale-space blob detection; e) scale-space blob detection with background subtraction.

The statistics for the computational time of the initialization (Table 4.4) indicate that it could be run in real time, at least at 10 frames per second, assuming that the results by Ruiz et al. [44] and Heymann et al. [23] are applicable also in this context.
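As a rough, illustrative calculation (assuming the lower bound of the GPU speedups cited above transfers directly, which is by no means guaranteed), the mean initialization time of method b) would become
\[
\frac{4.13\ \text{s}}{25} \approx 0.17\ \text{s/frame} \approx 6\ \text{frames/s},
\]
and any additional gain from moving from MATLAB to C++ would push this towards or beyond the 10 frames per second mentioned above.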

Tracking Since the tracking heavily depends on the window size, the computational time varies with each frame. Nevertheless, it is unusual that the window size exceeds 100x100 pixels, which is only about 0.5 % of the whole image area. As the computational time depends on the image area (typically O(n^2) in the window side length n), a real-time tracking solution similar to the one proposed in this report is definitely possible to implement at typical broadcasting frame rates.

1 SIFT: Scale-Invariant Feature Transform, an object recognition algorithm based on scale-space theory.

Chapter 5

Conclusions

In this work, it has been investigated how scale-space theory and the circular Hough transform can be used to detect golf balls in a window which adapts to the ball's movement and size in 2D. By including the focal length for each frame, estimations of the distance between the ball and the camera were made and analyzed. Since no ground truth for the actual distances was available, the accuracy of the model could only be estimated approximately. The tests verify the theoretical statement that a larger ball gives smaller errors from frame to frame; the projective ball size is therefore the single most important factor for high accuracy. For initialization, i.e. a global search for golf balls throughout the image, both scale-space blob detection and the circular Hough transform have reasonable, and in most cases comparable, performance. In general, the circular Hough transform performs slightly better, but also requires longer processing time. The proposed tracking solution is not yet mature enough to handle arbitrary input sequences. For instance, motion blur is a common reason that the tracking fails. Also, the heuristics used to solve the problems with multiple peaks need to be studied in more detail. The number of parameters also needs to be reduced in order to ease optimization and tweaking. Nevertheless, the tracking algorithm is a proof of concept which can be used to achieve real-time tracking and size estimation. One obvious drawback of both scale-space and the circular Hough transform is that neither can be used for classification of its candidates. The means used to treat this problem here are different thresholding methods, letting a peak through only if it is significant enough. The vagueness of "significant enough" points out the lack of classification; a method based on these techniques will always depend on proper threshold parameters. The usefulness of the algorithm of course varies with the application. As mentioned in the introduction, inertial data from the camera can be integrated with the distance estimations from this project to create a complete 3D positioning system. Such a system would for instance most likely not be accurate enough for rendering drive1 graphics due to the high noise. However, assuming that the rotation, scaling

1 A drive is a (usually long) shot starting from the tee.

and translation data for the camera are accurate, the noise lies almost entirely in the direction parallel to the camera axis. This means that for determining quantities where the camera axis direction is of little or no importance, the proposed system could by all means perform well. For instance, measuring the length of a drive from a given tee position would probably be quite accurate if the camera is placed perpendicular to the ball flight direction.

5.1 Future work

To build a complete system for robust tracking and size estimation of golf balls during flight, camera movement must be taken into consideration. Given inertial data from the camera, a complete 3D positioning system makes it possible to use a physics model for the ball flight. Combined with a Kalman filter, such a system can produce accurate estimations of the ball position in each frame. This would make the tracking much more robust, since the tracking window could be set based on the Kalman filter prediction, which is a more rigorous solution than the one described in this work. Furthermore, the tracking window solution can be extended with a probability density function where each peak is weighted based on the distance between the estimated ball position and the position indicated by the peak. A heuristic for this was used in this project (the "Centered peak" method, see Section 3.3.4), but physical modeling of the ball's 3D movement would be needed for such a solution to reach higher accuracy. If data for the camera movement is not available, techniques such as optical flow or tracking of SIFT features can be exploited to estimate the camera movement. With the assumption that everything but the ball is static in the scene, the SIFT features can be used to estimate a homography between two consecutive images and thereby make an accurate prediction of the ball position. Optical flow can estimate the 2D movement of the entire frame, which can improve the 2D ball position predictions.
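A hedged sketch of the feature-based prediction idea is given below, using standard MATLAB Computer Vision Toolbox calls. SURF features are used in place of SIFT (which is not built into MATLAB); the pipeline is only an illustration of the suggested future work, not something implemented in this thesis, and frame1, frame2 and ballPos are assumed inputs.

```matlab
% Sketch: predict the ball position in the next frame via a homography
% estimated from background features (assumes a mostly static scene).
g1 = rgb2gray(frame1);  g2 = rgb2gray(frame2);
pts1 = detectSURFFeatures(g1);  pts2 = detectSURFFeatures(g2);
[f1, vpts1] = extractFeatures(g1, pts1);
[f2, vpts2] = extractFeatures(g2, pts2);
pairs = matchFeatures(f1, f2);
tform = estimateGeometricTransform( ...
    vpts1(pairs(:, 1)), vpts2(pairs(:, 2)), 'projective');   % RANSAC-based
predictedBallPos = transformPointsForward(tform, ballPos);   % ballPos = [x y]
```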

Bibliography

[1] T.J. Atherton and D.J. Kerbyson. Size invariant circle detection. Image and Vision Computing, 17(11):795–803, 1999.

[2] P.J. Burt and E.H. Adelson. The Laplacian pyramid as a compact image code. IEEE Transactions on Communications, 31(4):532–540, April 1983.

[3] S. Carlsson. Geometric Computing in Image Analysis and Visualization. March 2007. URL http://www.nada.kth.se/~stefanc/gc_lec_notes.pdf.

[4] B. Chakraborty and S. Meher. Real-time position estimation and tracking of a basketball. In 2012 IEEE International Conference on Signal Processing, Computing and Control, pages 1–6, March 2012.

[5] B. Chakraborty and S. Meher. A trajectory-based ball detection and tracking system with applications to shot-type identification in volleyball videos. In International Conference on Signal Processing and Communications (SPCOM), pages 1–5, July 2012.

[6] H.-T. Chen, H.-S. Chen, and S.-Y. Lee. Physics-based ball tracking in volleyball videos with its applications to set type recognition and action detection. In IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2007, volume 1, pages 1097–1100, April 2007.

[7] H.-T. Chen, W.-J. Tsai, S.-Y. Lee, and J.-Y. Yu. Ball tracking and 3D trajectory approximation with applications to tactics analysis from single-camera volleyball sequences.

[8] T.-C. Chen and K.-L. Chung. An efficient randomized algorithm for detecting circles. Computer Vision and Image Understanding, 83(2):172–191, 2001.

[9] J. Crowley and O. Riff. Fast computation of scale normalised Gaussian receptive fields. In Scale Space, pages 584–598, 2003.

[10] P. Das, O. Veksler, V. Zavadsky, and Y. Boykov. Semiautomatic segmentation with compact shape prior. Image and Vision Computing, 27(1):206–219, 2009.

[11] E.R. Davies. A modified Hough scheme for general circle location. Pattern Recognition Letters, 7(1):37–43, January 1988.


[12] E.R. Davies. Computer and machine vision: theory, algorithms, practicalities. Waltham, Mass.: Elsevier, fourth edition, 2012.

[13] G. De Haan and E.B. Bellers. Deinterlacing - an overview. Proceedings of the IEEE, 86(9):1839–1857, 1998.

[14] K. G. Derpanis. Outline of the relationship between the difference-of-Gaussian and Laplacian-of-Gaussian, Sept. 2006.

[15] A.E. Dilz. Miniature sports radar speed measuring device, January 26 1999. URL http://www.google.com/patents/US5864061. US Patent 5,864,061.

[16] J. Dong and K.N. Ngan. Real-time de-interlacing for H.264-coded HD videos. IEEE Transactions on Circuits and Systems for Video Technology, 20(8):1144–1149, Aug 2010.

[17] T. D’Orazio, C. Guaragnella, M. Leo, and A. Distante. A new algorithm for ball recognition using circle Hough transform and neural classifier. Pattern Recognition, 37(3):393–408, March 2004.

[18] R. Feys and K.D. Brown. Golf ball aiming device, September 13 2011. URL http://www.google.com/patents/US8016689. US Patent 8,016,689.

[19] FFmpeg. FFmpeg, 2014. URL http://www.ffmpeg.org/.

[20] J.M. Geusebroek, R. van den Boomgaard, A.W.M. Smeulders, and A. Dev. Color and scale: The spatial structure of color images. In 6th European Conference on Computer Vision, number 1, 2000. Dublin, Ireland.

[21] Y. Gong, L. T. Sin, C. H. Chuan, H. Zhang, and M. Sakauchi. Automatic parsing of TV soccer programs. In Proceedings of the International Conference on Multimedia Computing and Systems, pages 167–174, May 1995.

[22] R. C. Gonzalez and R. E. Woods. Digital Image Processing. Prentice Hall, 3 edition, 2008.

[23] S. Heymann, K. Müller, A. Smolic, B. Fröhlich, and T. Wiegand. SIFT implementation and optimization for general-purpose GPU. In Proceedings of the international conference in Central Europe on computer graphics, visualization and computer vision, volume 144, 2007.

[24] Y.S. Huang, M.Y. Fu, and H.B. Ma. A combined method for traffic sign detection and classification. In Chinese Conference on Pattern Recognition, 2010, pages 1–5. IEEE, 2010.

[25] J. Hugmark. Inertial-aided EKF-based structure from motion for robust real- time augmented reality. Master’s thesis, KTH Computer Science and Commu- nication, June 2013.

[26] P. Johansen. On the classification of toppoints in scale space. Journal of Mathematical Imaging and Vision, 4(1):57–67, 1994. ISSN 0924-9907.

[27] B. Johansson, H. Knutsson, and G. Granlund. Detecting rotational symmetries using normalized convolution. In 15th International Conference on Pattern Recognition, volume 3, pages 496–500, 2000.

[28] T. Kim, Y. Seo, and K.-S. Hong. Physics-based 3D position analysis of a soccer ball from monocular image sequences. In Sixth International Conference on Computer Vision, 1998, pages 721–726, Jan. 1998.

[29] M. Lazarescu, S. Venkatesh, and G. West. Classifying and learning cricket shots using camera motion. In N. Foo, editor, Advanced Topics in Artificial Intelligence, volume 1747 of Lecture Notes in Computer Science, pages 13–23. Springer Berlin Heidelberg, 1999. ISBN 978-3-540-66822-0.

[30] M. Leo, T. D'Orazio, and A. Distante. Feature extraction for automatic ball recognition: comparison between wavelet and ICA preprocessing. ISPA 2003. Proceedings of the 3rd International Symposium on Image and Signal Processing and Analysis, 2:587–592, Sept. 2003.

[31] H.-Y. Lin and C.-H. Chang. Automatic speed measurements of spherical objects using an off-the-shelf digital camera. In Proceedings of the 2005 IEEE International Conference on Mechatronics, pages 66–71, 2005.

[32] T. Lindeberg. Feature detection with automatic scale selection. International Journal of Computer Vision, 30(2):79–116, 1998.

[33] T. Lindeberg. Scale-space. In B. Wah, editor, Encyclopedia of Computer Science and Engineering, volume IV, pages 2495–2504. John Wiley and Sons, Hoboken, New Jersey, 2009.

[34] G. Loy and A. Zelinsky. Fast radial symmetry for detecting points of interest. IEEE Transactions on Pattern Analysis and Machine Intelligence, 25(8), 2003.

[35] T. De Marco, M. Leo, and C. Distante. Soccer ball detection with isophotes curvature analysis. Technical report, Istituto Nazionale di Ottica, Arnesano, Italy, 2013.

[36] MathWorks. Find circles using circular Hough transform - MATLAB imfindcircles, 2014. URL http://www.mathworks.se/help/images/ref/imfindcircles.html.

[37] P.L. Mazzeo, M. Leo, P. Spagnolo, and M. Nitti. Soccer ball detection by comparing different feature extraction methodologies. Advances in Artificial Intelligence, 2012, 2012.

[38] F. Mokhtarian. Silhouette-based occluded object recognition through curvature scale space. Machine Vision and Applications, 10(3):87–97, 1997.


[39] N.R. Pal and S.K. Pal. A review on image segmentation techniques. Pattern Recognition, 26(9):1277–1294, 1993.

[40] C.F. Paulo and P.L. Correia. Traffic sign recognition based on pictogram contours. In Proceedings of the 9th International Workshop on Image Analysis for Multimedia Interactive Services, pages 67–70, 2008.

[41] Z. Qingbing, L. Chengliang, M. Yubin, F. Shengwei, and W. Shiping. A machine vision system for continuous field measurement of grape fruit diameter. volume 2, pages 1064–1068. ITA, 2008.

[42] R&A. Rules of golf, 2012. URL http://www.randa.org/en/Rules-and-Amateur-Status/Rules-of-Golf.aspx#/rules/. Visited 2014-02-21.

[43] J. Rakun, D. Stajnko, and D. Zazula. Detecting fruits in natural scenes by using spatial-frequency based texture analysis and multiview geometry. Computers and Electronics in Agriculture, 76(1):80–88, 2011.

[44] A. Ruiz, N. Guil, and M. Ujaldon. Recognition of circular patterns on GPUs: Performance analysis and contributions. Journal of Parallel and Distributed Computing, 68(10):1329–1338, 2008.

[45] D.L. Spike, D.C. Talbot, and J.L. Witler. Golfing apparatus, March 3 1992. URL http://www.google.com/patents/US5092602. US Patent 5,092,602.

[46] K. Teachabarikiti, T.H. Chalidabhongse, and A. Thammano. Players tracking and ball detection for an automatic tennis video annotation. In 11th International Conference on Control Automation Robotics Vision, pages 2461–2494, Dec 2010.

[47] S. Thrun, W. Burgard, and D. Fox. Probabilistic Robotics. MIT Press, 2005.

[48] G. Welch and G. Bishop. An introduction to the Kalman filter, Sept. 1997. Dept. of Computer Science, University of North Carolina at Chapel Hill.

[49] A.P. Witkin. Scale-space filtering: A new approach to multi-scale description. In Acoustics, Speech, and Signal Processing, IEEE International Conference on ICASSP '84, volume 9, pages 150–153, March 1984.

[50] A. Yamada, Y. Shirai, and J. Miura. Tracking players and a ball in video image sequence and estimating camera parameters for 3D interpretation of soccer games. In 16th International Conference on Pattern Recognition, volume 1, pages 303–306, 2002.

[51] X. Yu, H.W. Leong, C. Xu, and Q. Tian. Trajectory-based ball detection and tracking in broadcast soccer video. IEEE Transactions on Multimedia, 8:1164–1178, Dec 2006.

Appendix A

Test data

This appendix contains detailed descriptions of the test data semantics.

A.1 Test sequences without focal length data

1. Set 1.1: Initialized in the sky, long roll after first ground contact. Sunlight, audience in the background. Ball radius approx. 10 px. 300 frames.

2. Set 1.2: Initialized in the sky. Quite short flight, the ball traveling almost towards the camera. The ball stops in the shadow of a tree, with the audience in the background (in the sun). 350 frames.

3. Set 1.3: Initialized in the sky. The ball is partly occluded (behind a tree). The ball is large and sharp at the end of the sequence (approx. 40 px radius). Cloudy, no audience. 360 frames.

4. Set 1.4: Initialized during flight with trees behind. Cloudy. The sequence ends with a view of the entire scene, where the ball is only 3 px in diameter and many other objects are present (players, bunkers). 390 frames.

5. Set 1.5: Initialized in the sky. The ball is visible during the entire sequence. Other conditions similar to Set 1.4. 354 frames.

A.2 Test sequences with focal length data

In general, these test sequences have a higher contrast (due to camera settings) than the sequences in Section A.1. Additionally, the perspective is different: these sequences are captured from a tower, about 10 meters above the golf course. This makes markings in the lawn (footprints etc.) more visible.

1. Set 2.1: Initialized with a dark background (vegetation), medium-long roll after the first touch with the fairway. Sunlight, no audience visible. Lots of markings (footprints etc.) in the fairway lawn. Ball radius about 15 px initially, gradually decreasing to about 8 px after landing. 300 frames.

2. Set 2.2: Initialized in the sky, the flight is in front of both vegetation and audience, before landing at fairway. Most of the ball flight is in sunlight, audience is shadowed. Ball radius varying between about 2 and 10 pixels. 300 frames

3. Set 2.3: Very similar to 2.2, except for initialization which is earlier and therefore with houses and trees in the background. Sunlit audience and ball. 275 frames

4. Set 2.4: Initialization and the entire air flight with trees in the background. The ball lands in sunlight. No audience. Ball radius varying between 3 and about 10 pixels. 342 frames.

5. Set 2.5: Similar to 2.2, except for the background, which contains houses, trees and audience, but never sky. 315 frames.

6. Set 2.6: Static ball lying on fairway. Ball radius about 5 pixels. 150 frames

7. Set 2.7: Static ball lying on fairway. Ball radius about 10 pixels. 160 frames

8. Set 2.8: Static ball lying on fairway. Ball radius about 8 pixels. 169 frames

Appendix B

Scale-space blob size

According to scale-space theory, the relationship between a scale-space peak scale $\sigma$ and the size $\hat{\sigma}$ of the detected blob is as follows:

\[
\hat{\sigma} = C\sigma \tag{B.0.1}
\]

[32]. In this appendix, the constant C is determined numerically. Synthetic black-and-white images containing a circle in the center were generated with radii from 5 to 30 pixels. The scale-space representation of each image was then computed, in order to find the peak variance for each image. The square roots of these variances were then plotted against the known circle radii. The result is given in Figure B.1. With linear regression, the following relationship was determined:
\[
\sqrt{t} \approx 0.42r - 0.39 \quad \Leftrightarrow \quad r \approx 2.3\sqrt{t} + 0.92
\]
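A minimal sketch of such a numerical procedure is given below, using a scale-normalized Laplacian-of-Gaussian as the blob detector and a brute-force scale search. It assumes the Image Processing Toolbox; the thesis's own scale-space implementation may differ in detail, so the fitted constant need not match Figure B.1 exactly.

```matlab
% Sketch: numerically relate circle radius to the scale of the strongest
% scale-normalized LoG response at the circle center.
radii = 5:30;
peakSigma = zeros(size(radii));
for i = 1:numel(radii)
    r = radii(i);
    [X, Y] = meshgrid(-50:50);
    img = double(X.^2 + Y.^2 <= r^2);          % white circle, black background
    best = -Inf;
    for s = 1:0.25:25
        h = fspecial('log', 2*ceil(3*s) + 1, s);
        resp = (s^2) * imfilter(img, h, 'replicate');  % scale-normalized LoG
        if abs(resp(51, 51)) > best                    % response at the center
            best = abs(resp(51, 51));
            peakSigma(i) = s;
        end
    end
end
p = polyfit(radii, peakSigma, 1)   % linear fit; the constant depends on the detector
```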

Different intensity profiles might however give different constants C [32]. Therefore, an additional simulation was made where, in contrast to the first test where the image intensities were either 0 or 1, the circle intensity was 500 and the background intensity 200. The result of this simulation is presented in Figure B.2. The linear regression analysis resulted in a quite different relationship, namely
\[
\sqrt{t} \approx 0.72r - 3.43 \quad \Leftrightarrow \quad r \approx 1.39\sqrt{t} + 4.76
\]

Obviously, the relation between blob size and scale-space variance does vary with the background intensities, and the scale-space variance can hence not be considered an accurate estimation of the blob size. However, the approximation
\[
r \approx 2\sqrt{t}
\]
provides a rough rule of thumb for this relationship, and can be used as an indication of the blob size.



Figure B.1: Plot between ball radius (x axis) and Gaussian kernel standard deviation (y axis) for synthetic images with black background (intensity = 0) and white foreground (intensity = 1). The red line is a first-order linear regression of the radius and standard deviation values.


Figure B.2: Plot between ball radius (x axis) and Gaussian kernel standard deviation (y axis) for synthetic images with a circle intensity of 500 and background intensity of 200. The red line is a first-order linear regression of the radius and standard deviation values.
