FSD-10: A Dataset for Competitive Sports Content Analysis

Shenglan Liu, Member, IEEE, Xiang Liu, Gao Huang, Lin Feng, Lianyu Hu, Dong Jiang, Aibin Zhang, Yang Liu, Hong Qiao, Fellow, IEEE

Abstract—Action recognition is an important and challenging problem in video analysis. Although the past decade has witnessed progress in action recognition with the development of deep learning, such progress has been slow in competitive sports content analysis. To promote research on action recognition from competitive sports video clips, we introduce a Figure Skating Dataset (FSD-10) for fine-grained sports content analysis. To this end, we collect 1484 clips from the worldwide figure skating championships in 2017-2018, which consist of 10 different actions in men/ladies programs. Each clip is recorded at 30 frames per second with a resolution of 1080 × 720. These clips are then annotated by experts with the action type, grade of execution, skater info, etc. To build a baseline for action recognition in figure skating, we evaluate state-of-the-art action recognition methods on FSD-10. Motivated by the idea that domain knowledge is of great concern in the sports field, we propose a keyframe based temporal segment network (KTSN) for classification and achieve remarkable performance. Experimental results demonstrate that FSD-10 is an ideal dataset for benchmarking action recognition algorithms, as it requires accurately extracting action motions rather than action poses. We hope FSD-10, which is designed to have a large collection of fine-grained actions, can serve as a new challenge to develop more robust and advanced action recognition models.

I. INTRODUCTION

Due to the popularity of media-sharing platforms (e.g. YouTube), sports content analysis (SCA[31]) has become an important research topic in computer vision [39][10][30]. A vast amount of sports videos are piled up in computer storage, which are potential resources for deep learning. In recent years, many enterprises (e.g. Bloomberg, SAP) have focused on SCA[31]. In SCA, datasets are required to reflect the characteristics of competitive sports, which is a guarantee for training deep learning models. Generally, competitive sports content is a series of diversified, highly professional and ultimate actions. Unfortunately, existing trending human motion datasets (e.g. HMDB51[18], UCF50[34]) and action datasets of human sports (e.g. MIT Olympic sports[28], Nevada Olympic sports[25]) are not quite representative of the richness and complexity of competitive sports. The limitations of current datasets could be summarised as follows: (1) Current tasks of human action analysis are insufficient, being limited to the action types and contents of videos. (2) In most classification tasks, the discriminant of an action largely depends on both static human pose and background. (3) Video segmentation, an important task of video understanding (including classification and assessment in SCA), is rarely discussed in current datasets.

To address the above issues, this paper proposes a figure skating dataset called FSD-10. FSD-10 consists of 1484 figure skating videos with 10 different actions manually labeled. These skating videos are segmented from around 80 hours of competitions of worldwide figure skating championships1 in 2017-2018. FSD-10 videos range from 3 seconds to 30 seconds, and the camera moves to keep the skater in focus so that he (she) appears in each frame during the process of the actions. Compared with existing datasets, our proposed dataset has several appealing properties. First, actions in FSD-10 are complex in content and fast in speed. For instance, the complex 2-loop-Axel jump in Fig. 1 is finished in only about 2s. It is worth noting that the jump type heavily depends on the take-off process, which is a hard-to-capture moment. Second, the actions of FSD-10 originate from figure skating competitions, which are consistent in type and sports environment (only the skater and the ice around). These two aspects make it difficult for a machine learning model to conclude the action type from a single pose or the background. Third, FSD-10 can be applied to standardized action quality assessment (AQA) tasks. The AQA of FSD-10 actions depends on the base value (BV) and the grade of execution (GOE), which increases the difficulty of AQA by involving both action qualities and the rules of the International Skating Union (ISU)2. Last, FSD-10 can be extended to more tasks than the mentioned datasets.

Along with the introduction of FSD-10, we introduce a key frame indicator called human pose scatter (HPS). Based on HPS, we adopt key frame sampling to improve current video classification methods and evaluate these methods on FSD-10.

Shenglan Liu, Xiang Liu, Lin Feng, Lianyu Hu, Dong Jiang, Aibin Zhang, and Yang Liu are with the Faculty of Electronic Information and Electrical Engineering, Dalian University of Technology, Dalian, Liaoning, 116024 China. Email: ({liusl, hly19961121, s201461147, 18247666912, dlut liuyang}@mail.dlut.edu.cn, [email protected], [email protected]). Gao Huang is with Tsinghua University, Beijing, China. Email: [email protected]. H. Qiao is with the State Key Laboratory of Management and Control for Complex Systems, Institute of Automation, Chinese Academy of Sciences, Beijing 100190 China. E-mail: [email protected].

This work is supported in part by the National Key Research and Development Program of China (2017YFB1300200), the National Natural Science Foundation of China (No. 61672130, 61602082, 91648205, 31871106), the National Key Scientific Instrument and Equipment Development Project (No. 61627808), the Development of Science and Technology of Guangdong Province Special Fund Project Grants (No. 2016B090910001), and the LiaoNing Revitalization Talents Program (No. XLYC1086006).

1 ISU Grand Prix of Figure Skating, European Figure Skating Championships, Four Continents Figure Skating Championships (https://www.isu.org/figure-skating)
2 An action appearing in the second half of a program earns an extra 10% of its BV score compared with the same action in the first half. This means that the same action may get a different BV score because of this rule.

[Fig. 1 panels: Warm up (ζHPS = 82.89), Take off (ζHPS = 187.15), Spin in the air (ζHPS = 61.88), Landing (ζHPS = 152.63)]

Fig. 1. An overview of the figure skating dataset. It is illustrated that warm up, take off, spinning in the air and landing are the 4 steps of an Axel jump. ζHPS indicates the human pose scatter of a gesture, which is introduced in Sec. IV-B.

TABLE I
A SUMMARY OF EXISTING ACTION DATASETS

Dataset                    Year  Actions  Clips     Classification  Score  Temporal Segmentation
UCF Sports[29]             2009  9        14-35     √               -      -
Hollywood2[21]             2009  12       61-278    √               -      -
UCF50[34]                  2010  50       min. 100  √               -      -
HMDB51[18]                 2011  51       min. 101  √               -      -
MIT Olympic sports[28]     2015  2        150-159   √               √      -
Nevada Olympic sports[25]  2017  3        150-370   √               √      -
AQA-7[24]                  2018  7        83-370    √               √      -
FSD-10                     2019  10       91-233    √               √      √

Furthermore, experimental results validate that key frame sampling is an important approach to improve the performance of frame-based models on FSD-10, which is consistent with human cognition rules in figure skating. The main contributions of this paper can be summarised as follows.

• To the best of our knowledge, FSD-10 is the first challenging dataset with high-speed, motion-focused and complex sport actions in the competitive sports field, and it introduces multiple tasks for computer vision, such as (fine-grained) classification, long/short temporal human motion segmentation, action quality assessment, and program content assessment in time to music.

• To set a baseline for future achievements, we also benchmark state-of-the-art sport classification methods on FSD-10. Besides, key frame sampling is proposed to capture the pivotal action details in competitive sports, which achieves better performance than state-of-the-art methods on FSD-10.

In addition, compared to current datasets, we hope FSD-10 will be a standard dataset which promotes research on background-independent action recognition, allowing models to be more robust to unexpected events and to generalize to novel situations and tasks.

II. RELATED WORKS

Current action datasets can be divided into human action datasets (HAD) and human professional sports datasets (HPSD), which are listed in Tab. I. HAD consists of different kinds of broadly defined human motion clips, such as haircut and long jump. For example, UCF101[34] and HMDB[18] are typical cases of HAD, which have greatly promoted the progress of video classification techniques. In contrast, HPSD is a series of competitive sports actions, which is proposed for action quality assessment (AQA) tasks besides classification. MIT Olympic sports[28] and Nevada Olympic sports[25] are examples of HPSD, which are derived from Olympic competitions.

Classification in HPSD is important to attract people's attention and to highlight athletes' performance, and even to assist referees [16][46][1]. The significant difference between video classification and image classification is the logical connection relationship in video clips [2], which involves motion besides the form (content) of frames. Therefore, the human action classification task should be implemented based on temporal human information without background rather than on static action frame information. H. Kuehne et al. [18] mentioned that the recognition rate from static joint locations alone on the UCF sports [29] dataset is above 98%, so that the use of joint kinematics is not even necessary there. This result obviously violates the principle of the video classification task.

[Fig. 2 panels, by row: ChComboSpin4, 3Axel; FlyCamelSpin4, 3Flip; ChoreoSequence1, 3Loop; StepSequence3, 3Lutz; 2Axel, 3Lutz_3Toeloop]

Fig. 2. 10 figure skating actions included in FSD-10, each shown with 5 frames. The color of the frame borders specifies the action category (Spin, Sequence, Jump) and the labels specify the action types.

Thus, both motion and pose are indispensable modalities in human action classification [7].

Besides classification, AQA is an important but unique task in HPSD. Some AQA datasets have been proposed, such as the MIT Olympic sports dataset[28], the Nevada Olympic sports dataset[25] and AQA-7[24], which predict the action performance of diving, gymnastic vault, etc. In this field, Parmar et al. [26] adopt C3D features combined with regression with the help of SVR[6] (LSTM[12]), which obtains satisfactory results. The development direction of HPSD is clearly given by the above works, while AQA still remains to be developed as the material of these datasets is still inadequate. It can be concluded that the functions of datasets gradually expand with the development of video techniques [17][9][27][43][32]. However, temporal segmentation related datasets have been lacking in recent years. Our proposed FSD-10 is a response to these issues. As far as we know, FSD-10 is the first dataset which gathers task types including action classification, AQA and temporal segmentation.

III. FIGURE SKATING DATASET

In this section, we describe details regarding the setup and protocol followed to capture our dataset. Then, we discuss the temporal segmentation and assessment tasks of FSD-10 and its future extensions.

A. Video Capture

In FSD-10, all the videos are captured from the World Figure Skating Championships in 2017-2018, originating from high-definition recordings on YouTube. The source video materials have large ranges of fps (frames per second) and image size, and are standardized to 30 fps and 1080 × 720 respectively to ensure the consistency of the dataset. As for the content, the programs of figure skating can be divided into four types: men short program, ladies short program, and men and ladies free skating. During the skating, the camera moves with the skater to ensure that the current action is located at the center of the video. Besides, the auditorium and the ice ground are collectively referred to as the background, which is similar and consistent across all videos.
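To make this standardization step concrete, the following is a minimal sketch of such a pass; the paper does not describe its tooling, so the ffmpeg invocation, function name and file names here are our own illustrative choices.

```python
# Minimal sketch of the clip standardization described above: re-encode
# heterogeneous source videos to 30 fps at 1080x720. This is one plausible
# implementation, not the authors' pipeline; file names are placeholders.
import subprocess

def standardize(src: str, dst: str, fps: int = 30, width: int = 1080, height: int = 720) -> None:
    """Re-encode `src` to a fixed frame rate and resolution."""
    subprocess.run(
        ["ffmpeg", "-y", "-i", src,
         "-vf", f"scale={width}:{height}",  # resize every frame
         "-r", str(fps),                    # resample to the target frame rate
         dst],
        check=True,
    )

standardize("raw_clip.mp4", "fsd10_clip.mp4")
```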

Fig. 3. Number of clips per action class. The horizontal axis indicates the action types while the vertical axis indicates the number of clips.

B. Segmentation and Annotation

To meet the demands of deep learning and other machine learning models, the video materials are manually separated into action clips by undergraduate students. Each clip ranges from 3s to 30s; the principle of the segmentation is to completely preserve the action content of the video, in other words, warm-up and ending actions are also included. The action type, base value (BV), grade of execution (GOE), skater name, skater gender, skater age, coach and music are recorded for each clip. A detailed record example of one clip is shown in Tab. II. Unfortunately, the distribution of actions is out of tune, as there are 197 different kinds of actions in total and the number of instances per class ranges from 1 to 223. To avoid insufficient training, our FSD-10 is composed of the top 10 actions (as shown in Fig. 2, Fig. 3). Generally speaking, these selected actions fall into three categories: jump, spin and sequence. In detail, ChComboSpin4, FlyCamelSpin4, ChoreoSequence1, StepSequence3, 2Axel, 3Loop, 3Flip, 3Axel, 3Lutz and 3Lutz_3Toeloop are included.

TABLE II
A SINGLE CLIP EXAMPLE OF FSD-10

Attribute       Value
Type            2Axel
BV              3.3/3.63
GOE             0.58
Skater name     Starr ANDREWS
Skater gender   F
Skater age      16
Coach           Derrick Delmore, Peter Kongkasem
Music           Fever performed by Beyonce

C. Action Discriminant Analysis

Compared with other sports, figure skating actions are hard to distinguish by type. The actions in figure skating look like tweedledum and tweedledee, which makes them difficult to tell apart in a classification task. (1) It is difficult to judge the action type from the human pose in a certain frame or a few frames. For instance, the Loop jump and the Lutz jump are similar in the warm-up period; it is impossible to distinguish them before the take-off process. (2) The difference between some actions lies in a subtle pose in a few fixed frames. For example, the Flip jump and the Lutz jump are both picking-ice actions, which both take off from the left leg while picking the ice with the right toe. The only subtle difference between the two types of jump is that the Flip jump takes off from the inner edge while the Lutz jump takes off from the outer edge. (3) The spin direction is uncertain. The rotational direction of most skaters is anticlockwise while that of others is the opposite (e.g. Denise Biellmann). These factors make the classification task of FSD-10 a challenging problem.

D. Temporal Segmentation and Assessment

• Action Quality Assessment (AQA). AQA emerges as an imperative and challenging issue in figure skating, and is used to evaluate the performance level of skaters in skating programs by the BV score plus the GOE score. As shown in Sec. III-B, BV and GOE are included in our dataset, and mainly depend on the action type and the skater's action performance respectively. The key points of AQA are summarised as follows (a scoring sketch is given after this list). (1) BV depends on the action type and the degree of action difficulty. Besides, a 10% bonus BV score is appended in the second half of a program. (2) During the aerial process, an insufficient number of turns causes a GOE deduction while raising the hands earns a GOE bonus. (3) Hand support, turning over, parallel feet and tripping during the landing process cause a GOE deduction.

• Action Temporal Segmentation (ATS). ATS is also a significant issue of FSD-10. A standard action consists of certain stages. For example, during the process of a jump action, four stages are commonly included: preparation, take-off, spinning in the air and landing. Separating these steps is a key technical issue for AQA and further related works.

• Long Sequence Segmentation and Assessment (LSSA). Another important issue is LSSA, which differs from the above tasks in its sensitivity to action order. Action order and relations need to be considered in LSSA, which is summarised as follows. (1) The LSSA score is the total of all the single action scores, each calculated similarly to AQA. (2) The second appearance of the same action causes a fail case3.

3 An action appearing twice receives a zero score on its second appearance, whether it is successful or not.
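As a concrete reading of point (1) above, the following minimal sketch scores a single action from its BV and GOE; the function and argument names are our own, and the real ISU rules involve many more adjustments.

```python
# Minimal sketch of the AQA scoring scheme described above: an action's score
# is its base value (BV), with a 10% BV bonus in the second half of a program,
# plus its grade of execution (GOE). Illustrative only; the full ISU rules
# contain many more deductions and bonuses.
def action_score(bv: float, goe: float, in_second_half: bool) -> float:
    """Score one action: (possibly boosted) BV plus GOE."""
    if in_second_half:
        bv *= 1.10  # 10% BV bonus after the halfway point of the program
    return bv + goe

# The 2Axel example from Tab. II: BV 3.3 (3.63 in the second half), GOE 0.58.
print(action_score(3.3, 0.58, in_second_half=False))  # 3.88
print(action_score(3.3, 0.58, in_second_half=True))   # 4.21
```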

E. Future Extension

Our proposed FSD-10 is sensitive to domain knowledge and abundant in practical tasks. Besides the tasks in Sec. III-D, FSD-10 can conveniently be extended to other fields. For example, pair skating is also a noteworthy program in figure skating. In pair skating, one skater throwing the other is a classical item, and the action in the air can be matched to a corresponding jump action in men/ladies programs. Transferring single skating models to pair skating is an interesting open problem, which could be regarded as a kind of transfer learning[23][45] or meta learning[19][14]. Besides, action reasoning[8][38] is an interesting issue. For example, it is straightforward to conclude a 3Lutz-3Toeloop jump if a single 3Lutz jump and a 3Toeloop jump have been recognized. In addition, coordination between the action and other sports elements, such as program content assessment by rhythmical actions in time to music, is also a focal issue. Fortunately, the corresponding music of the actions is included in FSD-10. Whether the ups and downs of the music are related to the action rhythm remains to be explored, which is an exciting cross-data-modality learning task.

IV. KEYFRAME BASED TEMPORAL SEGMENT NETWORK (KTSN)

In this section, we give a detailed description of our keyframe based temporal segment network. Specifically, we first discuss the motivation of key frame sampling in Sec. IV-A. Then, the sampling method for key frames is proposed in Sec. IV-B and IV-C. Finally, the network structure of KTSN is introduced in detail in Sec. IV-D.

A. Motivation of Key Frame Sampling

To solve the problem of long-range temporal modeling, a sampling method called segment based sampling was proposed by temporal segment networks (TSN)[40]. The main idea of segment based sampling is that "densely recorded videos are unnecessary for content which changes slowly". The method is essentially a kind of sparse and global sampling, which significantly improves recognition performance on long-range videos. But an intractable problem is that it is unsuitable for fast changing videos, especially for tasks like figure skating, which is complex in content and fast in speed. As illustrated in Sec. III-C, actions of figure skating are similar in most frames, and the only difference lies in certain frames. Therefore, for the sampling problem, we propose a method that samples key frames by domain knowledge and samples the other frames by segment based sampling. The focus is then on how to sample key frames within segment based sampling.

This problem motivates us to explore an indicator for sampling by domain knowledge in figure skating. For example, by observing the process of a jump, we find that the arms and legs move following specific rules during the process. Thus, we are encouraged to define an indicator using anatomical keypoints to measure this kind of rule, which is called Human Pose Scatter and is illustrated in the next section.

B. Human Pose Scatter (HPS)

Before key frame sampling, we introduce HPS as preparation. HPS is an indicator of the scatter of the arms and legs relative to the body. The main positions of the human body (left eye, right eye, etc.) are marked by the open-pose algorithm, as shown in Fig. 4. Next, HPS is introduced in the rest of this section.

[Fig. 4 sketch: keypoints x_1, ..., x_n, their mean x̄, and the principal axis U]

Fig. 4. The sketch map of human pose scatter. An 18-anatomical-keypoint recognition algorithm called open-pose[4][3][33][41] is adopted to get the spatial position and anatomical structure of the human in one frame.

As shown in Fig. 4, consider a frame matrix of the corresponding anatomical structure, X = [x_1, ..., x_n] ∈ R^{D×n} (in this paper D = 2 and n = 18), in which each row corresponds to a particular feature while each column corresponds to a human anatomical keypoint, and let x̄ be the column mean value of X. Then, X̂ can be defined as X̂ = [x_1 − x̄, x_2 − x̄, ..., x_n − x̄]. Motivated by principal component analysis (PCA[42]), HPS can be defined by Eq. 1. The projection matrix U (U_{D×d} = [u_1, u_2, ..., u_d], satisfying U^T U = I, which makes I − UU^T idempotent) can be calculated by eigenvalue decomposition (EVD) of X̂X̂^T.

\zeta_{HPS} = \sum_{i=1}^{n} \left\| (x_i - \bar{x}) - UU^T(x_i - \bar{x}) \right\|_2^2    (1)

Actually, Eq. 1 is only a formal expression of ζHPS and can easily be rewritten as Eq. 2 to simplify the calculation. In Eq. 2, Tr(X̂X̂^T) and Tr(U^T X̂X̂^T U) = \sum_{j=1}^{d} \lambda_j^4 have already been calculated during the EVD process. Therefore, Eq. 2 offers an efficient approach to the calculation of ζHPS.

\zeta_{HPS} = \sum_{i=1}^{n} \left\| (I - UU^T)(x_i - \bar{x}) \right\|_2^2
           = \sum_{i=1}^{n} (x_i - \bar{x})^T (I - UU^T)(x_i - \bar{x})    (2)
           = \mathrm{Tr}(\hat{X}\hat{X}^T) - \sum_{j=1}^{d} \lambda_j

ζHPS is the sum of the variance away from the principal axis of the human anatomical keypoints, which reflects the degree of human pose stretching. Therefore, ζHPS is of tremendous help in finding the key frames of sequences. For example, during the process of a jump, the ζHPS of the spin-in-the-air stage is lower than that of the take-off stage, and an extreme point maps to a turning point corresponding to certain actions. This can be seen in Fig. 1.

4 λ_j is the j-th eigenvalue of the covariance matrix X̂X̂^T; in our paper d = 1.
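To make Eq. 1 and Eq. 2 concrete, here is a minimal NumPy sketch of ζHPS under the stated setting (D = 2, n = 18, d = 1); it is our illustrative reading of the formulas, not the authors' released code, and it checks the trace form of Eq. 2 against the direct form of Eq. 1.

```python
# Minimal NumPy sketch of human pose scatter (HPS) as defined in Eq. 1 / Eq. 2,
# assuming an array of n = 18 open-pose keypoints in D = 2 dimensions.
import numpy as np

def hps(keypoints: np.ndarray, d: int = 1) -> float:
    """keypoints: (D, n) matrix X of anatomical keypoints; returns zeta_HPS."""
    x_bar = keypoints.mean(axis=1, keepdims=True)  # column mean of X
    x_hat = keypoints - x_bar                      # centered matrix X_hat
    cov = x_hat @ x_hat.T                          # X_hat X_hat^T (D x D)
    eigvals = np.linalg.eigvalsh(cov)              # eigenvalues, ascending
    # Eq. 2: zeta_HPS = Tr(X_hat X_hat^T) - sum of the d largest eigenvalues
    return float(np.trace(cov) - eigvals[-d:].sum())

# Sanity check against the direct form (Eq. 1) with the principal axis U.
X = np.random.rand(2, 18)
Xh = X - X.mean(axis=1, keepdims=True)
_, vecs = np.linalg.eigh(Xh @ Xh.T)
U = vecs[:, -1:]                                   # top eigenvector, d = 1
direct = np.sum(np.linalg.norm((np.eye(2) - U @ U.T) @ Xh, axis=0) ** 2)
assert np.isclose(hps(X), direct)
```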

C. Key Frame Sampling

[Fig. 5 sketch: the ζHPS curve over time t, with the extreme frame O_{j*} marked as "the specific key frame" and its neighbours G_{j*,j*−δ}, ..., G_{j*,p}, ..., G_{j*,j*+δ} as key frames]

Fig. 5. The sketch map of key frame sampling. A 2Axel jump is performed by Alina Ilnazovna Zagitova, the same action as in Fig. 1. t represents the time variable.

Key frame sampling is an important issue in video analysis of figure skating, and the key frames should contain most of the discriminant information of a video. In the figure skating task, the fast changing motion frames are distinctly important for jump actions. As illustrated in Sec. IV-B, this kind of fast changing motion can be quantified as ζHPS using anatomical keypoints. Therefore, in a T-frame clip of a video, the extreme frame O_{j*}, which satisfies j* = argmax_j ζ^j_HPS, is defined as "the specific key frame", where δ indicates the near-frame radius of O_{j*} and j ∈ {δ+1, ..., T−δ}. In addition, the δ frames on either side of O_{j*}, G_{j*} = {G_{j*,j*−δ}, ..., G_{j*,p}, ..., G_{j*,j*+δ}}, are defined as "key frames", where p ∈ {j*−δ, ..., j*+δ} \ {j*} (see Fig. 5). In a multi-action clip, we obtain O^l_{j*}, where l = 1, 2, ..., L, and L indicates the number of extreme frames. Then, key frame sampling can be described as follows. First, find the specific key frame(s) in the action sequence. Second, sample the video from the specific key frame(s) and their nearby frames using an approach similar to TSN. In one batch, TSN samples a random frame, while key frame sampling additionally draws a random frame G^l_{j*,p} from G^l_{j*} ∪ O^l_{j*}.
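The following is a minimal sketch of this sampling rule for the single-extreme case (L = 1); the function name and the equal-duration segment splitting are our assumptions, the latter mirroring TSN-style segment based sampling.

```python
# Minimal sketch of key frame sampling: find the frame with maximal HPS
# (the "specific key frame") and draw one random frame from it and its
# delta-neighbourhood; the remaining frames come from ordinary segment
# based sampling. Illustrative only; assumes T > 2 * delta and L = 1.
import random

def sample_frames(hps_curve: list[float], k: int, delta: int) -> list[int]:
    """Return k segment-sampled frame indices plus one key frame index."""
    T = len(hps_curve)
    # One random frame per equal-duration segment (segment based sampling).
    bounds = [round(i * T / k) for i in range(k + 1)]
    samples = [random.randrange(bounds[i], max(bounds[i] + 1, bounds[i + 1]))
               for i in range(k)]
    # j* = argmax_j HPS(j), restricted so the delta-window stays in the clip.
    j_star = max(range(delta, T - delta), key=lambda j: hps_curve[j])
    # Draw one frame from the window G ∪ O around the specific key frame.
    samples.append(random.randint(j_star - delta, j_star + delta))
    return samples

# e.g. a 90-frame clip, k = 6 segments, near-frame radius delta = 3
print(sample_frames([random.random() for _ in range(90)], k=6, delta=3))
```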

D. Framework Structure

An effective and efficient video-level framework, coined keyframe based Temporal Segment Network (KTSN, Fig. 6), is designed with the help of key frame sampling. Compared with a single frame or short frame pieces, KTSN works on short snippets sampled from the videos. To make these short snippets representative of the entire video as well as to keep the computational consumption steerable, the video is first divided into several segments. Then, sampling is carried out in each segment. As illustrated in Sec. IV-A, our sampling methods are divided into two parts, namely segment based sampling[40] and key frame sampling. With this sampling method, each piece contributes its snippet-level prediction of the action classes, and these snippet-level predictions are aggregated into video-level scores by a consensus function. The video-level prediction is more accurate than any single snippet-level prediction because it captures details of the whole video. During training, the model is iteratively updated at the video level as follows.

We divide a video into k segments V_video = {α_1, α_2, ..., α_k} of equal duration. One frame T_k is randomly selected from its corresponding segment α_k. Similarly, one snippet G^l_{j*,p} is randomly selected from its corresponding G^l_{j*} ∪ O^l_{j*}, which were introduced in Sec. IV-B and IV-C. The sequence of snippets (T_1, T_2, ..., T_k and G^1_{j*,p}, G^2_{j*,p}, ..., G^L_{j*,p}) makes up the keyframe based temporal segment network model as follows:

\mathrm{KTSN}(T_1, T_2, \cdots, T_k, G^1_{j^*,p}, G^2_{j^*,p}, \cdots, G^L_{j^*,p}) =
\mathcal{P}\big(\mathcal{G}(\mathcal{F}(T_1; W), \mathcal{F}(T_2; W), \cdots, \mathcal{F}(T_k; W)),\ \mathcal{G}(\mathcal{F}(G^1_{j^*,p}; W), \mathcal{F}(G^2_{j^*,p}; W), \cdots, \mathcal{F}(G^L_{j^*,p}; W))\big)    (3)

In Eq. 3, F(T_k; W) (or F(G^l_{j*,p}; W)) is the function representing a ConvNet[20] with model parameters W, which operates on the snippet T_k (G^l_{j*,p}) and produces a classification hypothesis. The prediction for the whole video is given by an aggregation function G, which combines the outputs of all individual snippets. Based on the value of G, P works out the classification results. In this paper, P is the softmax function[22].
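A minimal sketch of the consensus in Eq. 3 follows; the paper does not pin down G at this point, so we assume the average consensus commonly used by TSN, and fusing the two branches by summing their scores before the softmax P is our own simplification.

```python
# Minimal NumPy sketch of the video-level consensus in Eq. 3: per-snippet
# class scores F(.;W) are aggregated by G (here: averaging, as in TSN) and
# P (softmax) yields the video-level prediction.
import numpy as np

def softmax(z: np.ndarray) -> np.ndarray:
    e = np.exp(z - z.max())
    return e / e.sum()

def ktsn_predict(segment_scores: np.ndarray, keyframe_scores: np.ndarray) -> np.ndarray:
    """segment_scores: (k, C) snippet logits; keyframe_scores: (L, C) logits."""
    g_segments = segment_scores.mean(axis=0)    # G over segment snippets
    g_keyframes = keyframe_scores.mean(axis=0)  # G over key frame snippets
    return softmax(g_segments + g_keyframes)    # P fuses the two consensuses

# e.g. k = 6 segment snippets and L = 1 key frame snippet over C = 10 classes
rng = np.random.default_rng(0)
probs = ktsn_predict(rng.normal(size=(6, 10)), rng.normal(size=(1, 10)))
print(probs.argmax(), probs.sum())  # predicted class; probabilities sum to 1
```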

[Fig. 6 diagram: video sequence → segment based sampling (T_1, ..., T_k) and key frame sampling (G^1_{j*,p}) → shared ConvNets F(·; W) over RGB and optical flow inputs → aggregation G → prediction P; legend: convolution, pooling and softmax layers]

Fig. 6. Keyframe based temporal segment network. One input video is divided into k segments (here the k = 3 case is shown) and a short snippet is randomly selected from each segment. Meanwhile, key frames are selected from the video (here L = 1). The snippets are represented by modalities such as RGB frames and optical flow. These snippets are then fused by an aggregation function G. Last, P combines the outputs and makes video-level predictions. The ConvNets on all snippets share parameters.

V. EXPERIMENTS

In order to provide a benchmark for our FSD-10 dataset, we evaluate various approaches under three different modalities: RGB, optical flow and anatomical keypoints (skeleton). We also conduct experiments on cross-dataset validation. The following describes the details of our experiments and results.

A. Evaluations of Methods on FSD-10

State-of-the-art human action classification methods are evaluated on FSD-10 in this section. In order to make the experiments representative of the action recognition field, 2D methods and 3D methods are selected respectively. TSN (or KTSN) is selected as a 2D method, which is a typical two-stream structure. As for 3D methods, two-stream inflated 3D ConvNet (I3D) with Inception-v1[36], Inception-v3[37], Resnet-50 and Resnet-101[11] are involved for comparison. Besides, anatomical keypoint related methods (st-gcn[44], densenet[15]) are tested on our dataset for reference. As listed in Tab. III, our proposed KTSN achieves state-of-the-art results on FSD-10. The KTSN method is raised as a baseline for this high-speed dataset.

TABLE III
PERFORMANCE OF BASELINE CLASSIFICATION MODELS EVALUATED ON FSD-10. THE SHORT NAMES OF THESE METHODS ARE TWO-STREAM INFLATED 3D CONVNET[5] (INCEPTION-V1[36], INCEPTION-V3[37], RESNET-50[11] AND RESNET-101[11]), SPATIAL TEMPORAL GRAPH CONVOLUTIONAL NETWORKS (ST-GCN[44]), DENSENET WITH ANATOMICAL KEYPOINT NETWORKS[15], TEMPORAL SEGMENT NETWORK[40] AND KEYFRAME BASED TEMPORAL SEGMENT NETWORK RESPECTIVELY. DENSENET+ DENOTES DENSENET WITH CENTRALIZATION. BESIDES, TSN(7) REPRESENTS TSN(k=7) AND KTSN(6,1) REPRESENTS KTSN(k=6, L=1) CORRESPONDING TO SEC. IV-D, SIMILARLY HEREINAFTER.

Method            RGB     Optical flow  Skeleton  Fusion
Inception-v1[36]  43.765  52.706        -         55.052
Inception-v3[37]  50.118  57.245        -         59.624
Resnet-50[11]     62.550  58.824        -         64.941
Resnet-101[11]    78.823  -             61.549    79.851
St-gcn[44]        -       -             72.429    -
DenseNet[15]      -       -             74.353    -
DenseNet+[15]     -       -             79.765    -
TSN(7)[40]        53.176  75.165        -         76.000
KTSN(6, 1)        56.941  76.941        -         77.412
TSN(9)[40]        59.294  80.235        -         80.235
KTSN(8, 1)        59.529  80.471        -         80.471
TSN(11)[40]       59.294  82.118        -         82.118
KTSN(10, 1)       63.294  82.588        -         82.588

B. Evaluations between TSN and KTSN

The experimental results in Tab. III show that key frame sampling is effective in sparse sampling. As illustrated above, the key frame is crucial for competitive sports classification tasks. Especially for jump actions, it is vital to choose the take-off frames, which enhances the discriminant of different jump actions. An example of key frame sampling for jump actions is shown in Fig. 5. By catching these key frames instead of sampling randomly, KTSN achieves better classification performance than TSN[40].

C. Cross-dataset Validations

To verify the motion characteristics of sports videos, cross-dataset validation is conducted on both the classical dataset UCF101 and FSD-10. RGB and optical flow features are two typical inputs for action recognition. Generally speaking, the RGB feature based method places emphasis on human pose while the optical flow based method attaches importance to the transformation between actions. Tab. IV illustrates that the performance difference between RGB and optical flow on FSD-10 is greater than on UCF101, which indicates that motion is more important than static human pose in FSD-10. Besides, in the sampling stage, FSD-10 is more sensitive to fine-grained segmentation, as the performance rises by 5.4% compared with 2.5% on UCF101 (3-segment sampling to 9-segment sampling). In FSD-10, actions are sensitive to their motion rather than their pose, which is consistent with the original intention of the dataset.

TABLE IV
RESULTS OF CROSS-DATASET GENERALIZATION. UCF101 AND FSD-10 ARE TRAINED WITH TSN(k=3) AND TSN(k=9) RESPECTIVELY.

Dataset  Method  RGB     Optical flow  Fusion
UCF101   TSN(3)  85.673  85.170        92.466
         TSN(9)  86.217  89.713        94.909
FSD-10   TSN(3)  48.941  75.765        76.765
         TSN(9)  59.294  82.118        82.118

VI. CONCLUSION

In this paper, we build an action dataset for competitive sports analysis, which is characterised by high action speed and complex action content. We find that motion is more valuable than form (content and background) in this task. Therefore, compared with other related datasets, our dataset focuses on the action itself rather than the background. Our dataset creates many interesting tasks, such as fine-grained action classification, action quality assessment and action temporal segmentation. Moreover, the dataset could even be extended to pair skating via transfer learning and to coordination between action and music via cross-data-modality learning.

We also present a baseline for action classification on FSD-10. Note that domain knowledge is significant for the competitive action classification task. Motivated by this idea, we adopt keyframe based temporal segment sampling. It is illustrated that action sampling should focus more on motion-related frames. The experimental results demonstrate that the key frame is effective for skating tasks.

In the future, more diversified action types and tasks are intended to be included in our dataset, and more interesting action recognition methods are likely to be proposed with the help of FSD-10. We believe that our dataset will promote the development of action analysis and related research topics.

Fig. 7. The confusion matrix results of FSD-10.

SUPPLEMENTARY MATERIAL

1. Training of KTSN

Architectures. The Inception network, chosen for KTSN, is a milestone CNN classifier. The main characteristic of the Inception network is the inception module, which is designed to process features at different dimensionalities (such as the 1×1, 3×3 and 5×5 convolution kernels) in parallel. The result is a better representation than that of single-layer networks, excelling at local topological structure. Specifically, Inception-v3 is selected in the experiments as follows.

Input. Our model is based on the spatial and temporal levels, which correspond to RGB frames[35] and optical flow[13] respectively. RGB is an industrial color standard, combining red, green and blue into chromatic frames, and is also called the spatial flow. The optical flow mode derives from calculating the differences between adjacent frames, which indicates the temporal flow.

2. Confusion Matrix of Actions

It is significant to involve domain knowledge in competitive sports content analysis. The confusion relationships between actions were illustrated by examples in Sec. III-C, and this part is a validation, as shown in Fig. 7. First, the skating spins obtain the best classification results; this is because a deep network can judge the skating spin of a video precisely with a few isolated frames. Second, the skating jumps are the most difficult to distinguish. It is worth noting that up to 24% of Lutz jump actions are identified as Flip jump actions.
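To illustrate the optical flow input described above, here is a minimal sketch that extracts dense flow between adjacent frames; TSN-style pipelines commonly use TV-L1 flow, so the OpenCV Farneback method and the file name below are our stand-in assumptions, not the paper's stated choice.

```python
# Minimal sketch of the optical flow input: per-pixel motion between adjacent
# frames, whose x/y components can be fed to the temporal stream as channels.
import cv2

def flow_between(prev_gray, next_gray):
    """Dense optical flow (H, W, 2): x and y displacement per pixel."""
    return cv2.calcOpticalFlowFarneback(
        prev_gray, next_gray, None,
        pyr_scale=0.5, levels=3, winsize=15,
        iterations=3, poly_n=5, poly_sigma=1.2, flags=0)

cap = cv2.VideoCapture("fsd10_clip.mp4")  # placeholder file name
ok, prev = cap.read()
prev = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)
while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    flow = flow_between(prev, gray)  # flow[..., 0] / flow[..., 1]: x / y flow
    prev = gray
cap.release()
```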
REFERENCES

[1] Sami Abu-El-Haija, Nisarg Kothari, Joonseok Lee, Paul Natsev, George Toderici, Balakrishnan Varadarajan, and Sudheendra Vijayanarasimhan. Youtube-8m: A large-scale video classification benchmark. arXiv preprint arXiv:1609.08675, 2016.
[2] Alan C Bovik. Handbook of image and video processing. Academic Press, 2010.
[3] Zhe Cao, Gines Hidalgo, Tomas Simon, Shih-En Wei, and Yaser Sheikh. OpenPose: realtime multi-person 2D pose estimation using Part Affinity Fields. arXiv preprint arXiv:1812.08008, 2018.
[4] Zhe Cao, Tomas Simon, Shih-En Wei, and Yaser Sheikh. Realtime multi-person 2D pose estimation using part affinity fields. In CVPR, 2017.
[5] Joao Carreira and Andrew Zisserman. Quo vadis, action recognition? A new model and the Kinetics dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6299–6308, 2017.
[6] Harris Drucker, Christopher JC Burges, Linda Kaufman, Alex J Smola, and Vladimir Vapnik. Support vector regression machines. In Advances in Neural Information Processing Systems, pages 155–161, 1997.
[7] Hany El-Ghaish, Mohamed E Hussein, Amin Shoukry, and Rikio Onai. Human action recognition based on integrating body pose, part shape, and motion. IEEE Access, 6:49040–49055, 2018.
[8] Matthew L Ginsberg and David E Smith. Reasoning about action I: A possible worlds approach. Artificial Intelligence, 35(2):165–195, 1988.
[9] Chunhui Gu, Chen Sun, David A Ross, Carl Vondrick, Caroline Pantofaru, Yeqing Li, Sudheendra Vijayanarasimhan, George Toderici, Susanna Ricco, Rahul Sukthankar, et al. AVA: A video dataset of spatio-temporally localized atomic visual actions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6047–6056, 2018.
[10] Joachim Gudmundsson and Michael Horton. Spatio-temporal analysis of team sports. ACM Computing Surveys (CSUR), 50(2):22, 2017.
[11] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
[12] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
[13] Berthold KP Horn and Brian G Schunck. Determining optical flow. Artificial Intelligence, 17(1-3):185–203, 1981.
[14] Kyle Hsu, Sergey Levine, and Chelsea Finn. Unsupervised learning via meta-learning. arXiv preprint arXiv:1810.02334, 2018.
[15] Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q Weinberger. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4700–4708, 2017.
[16] Andrej Karpathy, George Toderici, Sanketh Shetty, Thomas Leung, Rahul Sukthankar, and Li Fei-Fei. Large-scale video classification with convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1725–1732, 2014.
[17] Will Kay, Joao Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijayanarasimhan, Fabio Viola, Tim Green, Trevor Back, Paul Natsev, et al. The Kinetics human action video dataset. arXiv preprint arXiv:1705.06950, 2017.

[18] Hildegard Kuehne, Hueihan Jhuang, Estíbaliz Garrote, Tomaso Poggio, and Thomas Serre. HMDB: a large video database for human motion recognition. In 2011 International Conference on Computer Vision, pages 2556–2563. IEEE, 2011.
[19] Kwonjoon Lee, Subhransu Maji, Avinash Ravichandran, and Stefano Soatto. Meta-learning with differentiable convex optimization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 10657–10665, 2019.
[20] Shih-Chung B Lo, Heang-Ping Chan, Jyh-Shyan Lin, Huai Li, Matthew T Freedman, and Seong K Mun. Artificial convolution neural network for medical image pattern recognition. Neural Networks, 8(7-8):1201–1214, 1995.
[21] Marcin Marszałek, Ivan Laptev, and Cordelia Schmid. Actions in context. In CVPR 2009 - IEEE Conference on Computer Vision & Pattern Recognition, pages 2929–2936. IEEE Computer Society, 2009.
[22] Roland Memisevic, Christopher Zach, Marc Pollefeys, and Geoffrey E Hinton. Gated softmax classification. In Advances in Neural Information Processing Systems, pages 1603–1611, 2010.
[23] Maxime Oquab, Leon Bottou, Ivan Laptev, and Josef Sivic. Learning and transferring mid-level image representations using convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1717–1724, 2014.
[24] Paritosh Parmar and Brendan Morris. Action quality assessment across multiple actions. In 2019 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 1468–1476. IEEE, 2019.
[25] Paritosh Parmar and Brendan Tran Morris. What and how well you performed? A multitask learning approach to action quality assessment. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 304–313, 2019.
[26] Paritosh Parmar and Brendan Tran Morris. Learning to score Olympic events. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 20–28, 2017.
[27] Federico Perazzi, Jordi Pont-Tuset, Brian McWilliams, Luc Van Gool, Markus Gross, and Alexander Sorkine-Hornung. A benchmark dataset and evaluation methodology for video object segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 724–732, 2016.
[28] Hamed Pirsiavash, Carl Vondrick, and Antonio Torralba. Assessing the quality of actions. In European Conference on Computer Vision, pages 556–571. Springer, 2014.
[29] Mikel D Rodriguez, Javed Ahmed, and Mubarak Shah. Action MACH: a spatio-temporal maximum average correlation height filter for action recognition. In CVPR, volume 1, page 6, 2008.
[30] Seyed Morteza Safdarnejad, Xiaoming Liu, Lalita Udpa, Brooks Andrus, John Wood, and Dean Craven. Sports videos in the wild (SVW): A video dataset for sports analysis. In 2015 11th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition (FG), volume 1, pages 1–7. IEEE, 2015.
[31] Huang-Chia Shih. A survey of content-aware video analysis for sports. IEEE Transactions on Circuits and Systems for Video Technology, 28(5):1212–1231, 2017.
[32] Leonid Sigal and Michael J Black. HumanEva: Synchronized video and motion capture dataset for evaluation of articulated human motion. Brown University TR, 120, 2006.
[33] Tomas Simon, Hanbyul Joo, Iain Matthews, and Yaser Sheikh. Hand keypoint detection in single images using multiview bootstrapping. In CVPR, 2017.
[34] Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. UCF101: A dataset of 101 human action classes from videos in the wild. arXiv preprint arXiv:1212.0402, 2012.
[35] Sabine Süsstrunk, Robert Buckley, and Steve Swen. Standard RGB color spaces. In Color and Imaging Conference, volume 1999, pages 127–134. Society for Imaging Science and Technology, 1999.
[36] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1–9, 2015.
[37] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2818–2826, 2016.
[38] Mary Tuck and David Riley. The theory of reasoned action: A decision theory of crime. In The Reasoning Criminal, pages 156–169. Routledge, 2017.
[39] Jacco Van Sterkenburg, Annelies Knoppers, and Sonja De Leeuw. Race, ethnicity, and content analysis of the sports media: A critical reflection. Media, Culture & Society, 32(5):819–839, 2010.
[40] Limin Wang, Yuanjun Xiong, Zhe Wang, Yu Qiao, Dahua Lin, Xiaoou Tang, and Luc Van Gool. Temporal segment networks for action recognition in videos. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018.
[41] Shih-En Wei, Varun Ramakrishna, Takeo Kanade, and Yaser Sheikh. Convolutional pose machines. In CVPR, 2016.
[42] Svante Wold, Kim Esbensen, and Paul Geladi. Principal component analysis. Chemometrics and Intelligent Laboratory Systems, 2(1-3):37–52, 1987.
[43] Jun Xu, Tao Mei, Ting Yao, and Yong Rui. MSR-VTT: A large video description dataset for bridging video and language. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5288–5296, 2016.
[44] Sijie Yan, Yuanjun Xiong, and Dahua Lin. Spatial temporal graph convolutional networks for skeleton-based action recognition. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
[45] Jason Yosinski, Jeff Clune, Yoshua Bengio, and Hod Lipson. How transferable are features in deep neural networks? In Advances in Neural Information Processing Systems, pages 3320–3328, 2014.
[46] Joe Yue-Hei Ng, Matthew Hausknecht, Sudheendra Vijayanarasimhan, Oriol Vinyals, Rajat Monga, and George Toderici. Beyond short snippets: Deep networks for video classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4694–4702, 2015.