TrackNet: A Deep Learning Network for Tracking High-speed and Tiny Objects in Sports Applications
Yu-Chuan Huang, I-No Liao, Ching-Hsuan Chen, Tsì-Uí İk∗, Wen-Chih Peng
Department of Computer Science, College of Computer Science
National Chiao Tung University
1001 University Road, Hsinchu City 30010, Taiwan
∗Email: [email protected]

arXiv:1907.03698v1 [cs.LG] 8 Jul 2019

Abstract—Ball trajectory data are among the most fundamental and useful information in the evaluation of players' performance and the analysis of game strategies. Although vision-based object tracking techniques have been developed to analyze sport competition videos, it is still challenging to recognize and position a high-speed and tiny ball accurately. In this paper, we develop a deep learning network, called TrackNet, to track the tennis ball from broadcast videos in which the ball images are small, blurry, and sometimes with afterimage tracks or even invisible. The proposed heatmap-based deep learning network is trained not only to recognize the ball image from a single frame but also to learn flying patterns from consecutive frames. TrackNet takes images with the size of 640 × 360 to generate a detection heatmap from either a single frame or several consecutive frames to position the ball and can achieve high precision even on public domain videos. The network is evaluated on the video of the men's singles final at the 2017 Summer Universiade, which is available on YouTube. The precision, recall, and F1-measure of TrackNet reach 99.7%, 97.3%, and 98.5%, respectively. To prevent overfitting, 9 additional videos are partially labeled together with a subset from the previous dataset to implement 10-fold cross validation, and the precision, recall, and F1-measure are 95.3%, 75.7%, and 84.3%, respectively. A conventional image processing algorithm is also implemented to compare with TrackNet. Our experiments indicate that TrackNet outperforms the conventional method by a large margin and achieves exceptional ball tracking performance. The dataset and demo video are available at https://nol.cs.nctu.edu.tw/ndo3je6av9/.

Index Terms—Deep Learning, neural networks, tiny object tracking, heatmap, tennis, badminton

I. INTRODUCTION

Video, considered as logs of visual sensors, contains a large amount of information. Information extraction from videos has become a hot research topic in the areas of image processing and deep learning. In sports analysis and athlete training, videos are helpful in post-game review and tactical analysis. In professional sports, high-end cameras have been used to record high resolution and high frame rate videos, combined with image processing for referee assistance or data collection. However, this solution requires enormous resources and is not affordable for individuals or amateurs. Developing a low-cost solution for data acquisition from broadcast videos will be significant for massive sports data collection.

Ball trajectory data are among the most fundamental and useful information for game analysis. However, for some sports such as tennis, badminton, baseball, etc., the ball is not only small but also may fly as fast as several hundred kilometers per hour, resulting in tiny and blurry images. That makes the ball tracking task more challenging than in other sports. In this paper, we design a heatmap-based deep learning network, called TrackNet, to precisely position the ball in tennis and badminton on broadcast videos or videos recorded by consumer devices such as smartphones. TrackNet overcomes the issues of blurry and remnant images and can even detect an occluded ball by learning its trajectory patterns. The proposed network can be applied to other ball-based sports and help both amateurs and professional teams collect data with a moderate budget.

Conventional image recognition is usually based on the object's appearance features such as shape, color, size, etc., or statistical features such as HOG, SIFT, etc. Due to the relatively long shutter time of consumer or prosumer cameras, images of high-speed objects are prone to suffer from afterimage or blur issues, resulting in poor image recognition accuracy. The performance of ball tracking can be improved by pairing candidates from frame to frame according to trajectory models to find the most probable one [1]. In addition, a classical technique in image processing to improve image quality is fusing multiple low-quality images. Based on the above observations, instead of using rule-based techniques, we propose to adopt a deep learning network to recognize the shape of the ball and learn the trajectory patterns by applying multiple consecutive frames to solve the mentioned issues.

Object classification and detection are two of the earliest studies in deep learning. VGG-16 [2] is one of the most popular networks for feature map encoding. To detect and classify multiple objects in an image, the R-CNN family [3] [4] [5] structurally examines the picture in two stages. It first selects many areas that may contain interesting objects, called Regions of Interest (RoIs), and then applies object detection and classification techniques on these regions. However, its performance cannot fulfill the needs of real-time applications. To speed up detection, the YOLO family [6] develops a one-stage end-to-end approach to detect objects in a limited search space, significantly reducing the computing time. The streamlined version of Tiny YOLO can even run on the Raspberry Pi. Compared to the block-based algorithms, Fully Convolutional Networks (FCN) perform pixel-wise classification. To compensate for the size reduction of the feature map during the encoding process, upsampling and DeconvNet [7] are often used to decode the feature map, generating a data array of the original size.

In this paper, a deep learning network, called TrackNet, is proposed to realize a precise trajectory tracking network. First, VGG-16 is adopted to generate the feature map. Different from other deep learning networks, TrackNet can take multiple consecutive frames as input. In this way, TrackNet learns not only the features of the ball but also the characteristics of ball trajectories to enhance its capability of object recognition and positioning. Since images are downsampled and encoded by pooling layers, the network follows the upsampling mechanism of FCN to generate the heatmap for object detection. Finally, the position of the target object is calculated based on the heatmap generated by the deep learning network. To meet the characteristics of tennis and badminton games, our calculation and evaluation are based on the assumption that there is at most one ball on the court.

To evaluate the proposed network, we have labeled 20,844 frames from the broadcast of the men's singles final at the 2017 Summer Universiade. To assess the performance of the proposed consecutive input frames technique, both single-frame and multiple-frame versions of TrackNet are implemented. Along with the conventional image recognition algorithm [1], a comprehensive comparison among different models is performed. Experiments indicate that the proposed TrackNet outperforms the conventional image recognition algorithm and effectively locates the fast-moving tennis ball from broadcast sport competition videos. Moreover, to prevent the notorious overfitting issue that happens frequently in deep learning solutions, additional data from 9 tennis games on different courts are added to the training dataset, including grass court, red clay court, hard court, etc. Additionally, to explore the model extensibility, badminton tracking by TrackNet is evaluated. We have labeled 18,242 frames from the video of 2018 Indonesia Open Final - TAI Tzu Ying vs CHEN YuFei. Although the badminton shuttlecock travels much faster than a tennis ball, our experimental results exhibit a decent performance.

The critical contribution of TrackNet comes from its capability of precisely tracking fast-moving and tiny objects by learning the dynamic behavior of the trajectory. In the tennis tracking application, 10-fold cross validation results in a performance of 95.3% precision, 75.7% recall, and 84.3% F1-measure. Such capability shows great potential in expanding the variety of computer vision applications. The rest of the paper is organized as follows. Section II provides an introduction to the related research and the convolutional neural network. Section III introduces the datasets used in this paper. Section IV elaborates on the proposed deep learning network and Gaussian heatmap techniques. Section V provides experimental results and performance evaluation. Finally, Section VI concludes this paper.

II. RELATED WORKS

In recent years, the analysis of player performance and game tactics based on the trajectory data of balls and players has received more and more attention [8] [9] [10] [11]. Many tracking algorithms and systems have been developed to compute and collect the trajectory data. Current commercial solutions mainly rely on high resolution and high frame rate video, resulting in high hardware investment. For example, the Hawk-Eye system [12] has been extensively used in professional competitions to calculate ball trajectories and assist the referee in clarifying controversial calls through 3D visual depictions. Nonetheless, the system has to deploy high-end cameras with dedicated operators at selected locations and angles.

sampling, and deconvolution/up-sampling. A softmax layer is usually used as the output layer. For example, the widely used VGG-16 [2] mainly consists of convolutional, maximum pooling, and ReLU layers. Conceptually, front-end layers learn to identify simple geometric features, and back-end layers are trained to identify object features.

In CNNs, each layer is a W × H × D data array. W, H, and D denote the width, height, and depth of the data array, respectively. The convolution operation slides a filter with a kernel of size w × h × D across the W × H range, with the stride parameter s set to 1 in many applications. To avoid information loss near the boundary or maintain the size of the output data array, columns and rows of the data array can be padded with zeros by setting the