Automatic Classification of Tennis Video for High-level Content-based Retrieval‡

G. Sudhir, John C. M. Lee, and Anil K. Jain†

Technical Report HKUST-CS
August

Department of Computer Science
The Hong Kong University of Science and Technology
Clear Water Bay, Kowloon, Hong Kong

† Department of Computer Science
Michigan State University
East Lansing, MI, USA

‡ This research is supported in part by the Sino Software Research Centre of The Hong Kong University of Science and Technology under grant SSRC-EG, and by an RGC Direct Allocation Grant under grant DAG-EG.
Email: {sudhir, cmlee}@cs.ust.hk, jain@cps.msu.edu

Abstract

This report presents our techniques and results on the automatic analysis of videos for the purpose of high-level classification of tennis video segments to facilitate content-based retrieval. Our approach is based on the generation of an image model, valid under perspective projection, for the lines in a tennis video. We first derive this model by making use of knowledge about the dimensions of a tennis court, the typical camera geometry used when capturing tennis videos, and the connectivity information about the court lines. Based on this model, we develop a court-line detection algorithm and a robust player-tracking algorithm to track the tennis players over the image sequence. In order to select only those video segments which contain the tennis court from an input raw footage of tennis video, we also propose a color-based tennis court clip selection method. The automatically extracted tennis court lines and the players' location information form crucial measurements for the high-level reasoning module, which analyzes the relative positions of the tennis players with respect to the court lines and the net in order to map the measurements to high-level event semantics like baseline-rallies, volleying and net-games, passing-shots, etc. Results on real tennis video data are presented, demonstrating the validity and performance of the approach.

1 Introduction

Automatic classification and retrieval of video information based on content is a very challenging research area. Recently, there have been many research efforts addressing various relevant issues (see recent surveys for an overview). The most difficult question faced at the outset is: what does content mean? Or, more specifically, how should one characterize the visual content present in video, and how should one extract it for the purpose of generating useful high-level annotations to facilitate content-based indexing and retrieval of video segments from digital video libraries? It is generally accepted that content is too subjective to be characterized completely, as it often depends on the objects, background, context, domain, etc., in the video. This is one of the main reasons why the problem of content-based access is still largely unsolved. On the other hand, advances in video and storage technology have enabled (i) representation of visual data in digital form for archiving and browsing, (ii) efficient storage using compression algorithms, and (iii) fast retrieval with random-access capability using storage media like CD-ROMs, Laser Discs, Video CDs, and the latest DVD-Video and DVD-ROM systems.

The recent literature contains a number of approaches to characterize visual content based on color, texture, shape, and motion. While these approaches have the merit of being applicable to generic image and video data, they also have a major limitation: they can characterize only low-level information. End users will almost always like to interact at a high level when accessing or retrieving images and video segments from a database, and the importance of useful high-level annotations cannot be over-emphasized. At the same time, manual generation of high-level annotations is quite cumbersome, time consuming, and expensive. Hence, there is a dire need for algorithms that are able to automatically infer high-level content using the low-level information extracted from the data. While this is extremely difficult for images and video in general, limiting the analyses to specific domains can help overcome many hurdles in the automatic generation of high-level semantic annotations relevant to those specific domains. This report presents a case study wherein domain-specific knowledge and constraints are exploited to accomplish the difficult task of automatic annotation of video segments with high-level semantics to enable content-based retrieval.

There have been some recent attempts to exploit contextual or domain knowledge for the automatic generation of high-level annotations for video segments. Most of them attempt to relate or map the low-level information measured from the image and video data to high-level content pertinent to specific domains. Zhang et al. present a method to parse and index news video based on a syntactic, model-based approach. While they parse the news video into anchor-person shots and news story shots using some image processing techniques, and succeed in automatically creating a more detailed description for a structured news video, they do not attempt to describe the content of the news video, for the obvious reason that it falls into the domain of generic video and hence is far too difficult a problem. Gong et al. address the problem of automatic parsing of TV soccer programs. They make use of the standard layout of a soccer field (domain knowledge) and partition it into nine categories, such as mid-field and penalty area. They then apply various image processing techniques, like edge detection and line identification, on TV images for the automatic determination of the camera location. A similar attempt has been made by Yow et al., wherein they analyze a video library of soccer games. They report special techniques for automatically detecting and extracting soccer highlights by analyzing image content, and for presenting the detected shots by panoramic reconstruction of selected events. A context-based technique for the automatic extraction of embedded captions is reported by Yeo and Liu. They apply their method directly to MPEG compressed video and extract the visual content which is highlighted using the extracted captions. Saur et al. present a method for the automatic analysis and annotation of basketball video. They use the low-level information available directly from MPEG compressed video of basketball, as well as prior knowledge of basketball video structure, to provide high-level content analysis like close-up views, fast breaks, steals, etc.

In this report, we present our approach and results for the automatic analysis of tennis video for the purpose of high-level classification of tennis video segments to facilitate content-based retrieval. We exploit the available domain knowledge about tennis video and demonstrate that it is possible to generate meaningful semantic annotations applicable to the domain. Our goal is to automate the assignment of useful, high-level annotations like baseline-rallies, passing-shots, net-games, and serve-and-volley games to the pertinent segments of the video. We have chosen these annotations because they form some of the most common play events in a tennis video. Video is an important source in the sport of tennis, particularly in the areas of teaching, coaching, and learning how to play the game. Videos are used to teach the rules of the game, analyze and summarize tennis matches and tournaments, demonstrate proper playing techniques, document the history of the sport, and profile current and past champions. (Videotapes are the dominant video sources today. However, with the advent of digital compression and storage technology, and the associated advantages of random and network-based access capabilities, we foresee that digital video will play a dominant role in the future.) Annotating raw, unstructured tennis video with these high-level content annotations can help professional tennis players and coaches

to retrieve video segments in a more meaningful manner from a digital library of tennis video. For example, a player who wants to improve/develop his serve-and-volley abilities would only be interested in retrieving and studying a variety of serve-and-volley video segments. Or, a tennis coach giving a visual demonstration of passing techniques and/or strategies to players is mainly interested in a collection of passing-shot video segments as part of his video instructional material. If the above-mentioned high-level annotations are available for indexing a digital library of tennis video, then players or a coach can get content-based access to the relevant video segments. In this report, we present our techniques to achieve this objective. Experimental results on real tennis video data are presented to demonstrate the merit of our approach.

The rest of the report is organized as follows. In Section 2, we present an overview of our system. We present a color-based algorithm in Section 3 to select video clips containing the tennis court only from the input raw footage of tennis video for further analysis. In Section 4, we derive a quantitative model for a tennis court which is valid in the image domain under the assumption of a perspective projection model for the camera. Based on this model, Section 5 presents our tennis court-line detection algorithm. In Section 6, we present our player-tracking algorithm to track the two tennis players over the image sequence, and in Section 7 we describe briefly our method of mapping the low-level positional information about the tennis court-lines and the locations of the two tennis players to high-level tennis-play events in the video segments. Finally, we present some of the high-level annotation results we have obtained and conclude the report in the last section.

2 Overview of the System

Fig. 1 shows the block diagram of our system. The modules shown within the dotted lines form the core part of our system and are the main subject of this report. Our system analyses the input tennis video at different levels. First, the system processes an input raw footage of tennis video and partitions the footage into shots containing continuous sequences of the tennis court, using a color-based shot selection approach. Then, the video segments containing the tennis court are input to the court-line detection module, which detects the tennis court-lines in the individual frames. This information, along with the raw video data present in the video segment, is then used in the detection and tracking of the two players in the video segment by the player-tracking module. The high-level reasoning module analyses the outputs of these two modules for inferring the different tennis-play events in the video segment. These analyses result in meaningful high-level semantic annotations for the segments of the input video. After the analyses, the video segments are organized in an information management system, with the high-level annotations providing indices for content-based retrieval. This information management system also takes other useful textual data as inputs, such as the type of the court (clay or carpet or hard or grass), the names of the players, the tournament type, the year, etc. The end users interact with this information system through a GUI for providing queries to the system. Typical queries are: (i) Retrieve serve-and-volley clips containing John McEnroe; (ii) Retrieve passing-shot clips containing Ivan Lendl on a carpet court; (iii) Retrieve baseline-rally clips containing Andre Agassi on a hard court; etc. The system responds by searching the library and retrieves the matching video clips for display; they are shown as small icons on the icon pad of the GUI, and they can be played on the display pad of the GUI by the user through a mouse-click. It should be noted that while the whole process of annotation is done off-line, the end-user interactions with the system are in fact on-line, with near real-time retrievals.

As mentioned above, the four modules shown within dotted lines in Fig. 1 form the core part of the system, and more details about our techniques and algorithms concerning the four modules are presented in this report. In the following section, we present details about the color-based shot partitioning approach we have used.

3 Color-based Selection of Court Segments

In this section, we present our technique for the selection of video segments containing the tennis court from an input raw footage of tennis video. This is necessary since the further analyses presented in the next sections are done on video clips containing the tennis court only.

For this purpose, we propose a color-based algorithm to automatically indicate whether a particular image frame belongs to a tennis court scene or not. As mentioned in Section 2, a tennis court belongs to one of four different classes: (i) carpet, (ii) clay, (iii) hard, and (iv) grass courts. These classes can be distinguished by their different color properties. (A typical tennis video contains many other scenes interleaved between the tennis court scenes, such as those of spectators, a player, or an umpire; this is especially so in between the points. These scenes need to be segmented out from the input raw footage of tennis video.)

However, each class can have multiple representative colors as its members. Based on a statistical analysis using many example frames, we summarize the color properties of the four classes in Table 1. Using this information, we apply the following algorithm to select the relevant clips containing the tennis court.

Table 1: Standard colors of the different types of tennis court.

    Class of Court | Mean colors (R, G, B) of example members | Class Threshold | Fraction Threshold
    Carpet         |                                          |                 |
    Clay           |                                          |                 |
    Hard           |                                          |                 |
    Grass          |                                          |                 |

Court segment selection algorithm

For each image frame of the input tennis video, do the following:

1. Place a window at the center of the image frame.

2. Compute the color with maximum frequency (max_color) from the color-frequency distribution in the window.

3. Compute the Euclidean distances of max_color from each class of courts. The distance of a color from a court class is the minimum of the distances of the color from the members of the class.

4. Let court_class be the class of court corresponding to the minimum of the Euclidean distances, and let court_color be the corresponding member color.

5. If the Euclidean distance between max_color and court_color is more than the Class Threshold (see Table 1) corresponding to court_class, then classify the image as not a tennis court frame and stop. Otherwise, continue.

6. Initialize court_points = 0 and total_points = 0.

7. For each pixel in the window:
   (i) Compute the Euclidean distance of the color of the pixel from court_color.
   (ii) If the distance is not more than the Fraction Threshold (see Table 1) corresponding to court_class, then court_points = court_points + 1.
   (iii) total_points = total_points + 1.

8. Compute court_fraction = court_points / total_points.

9. If court_fraction is more than a preset fraction, then classify the input image frame as a tennis court frame belonging to court_class; otherwise, not. Stop.

After the application of the algorithm, the contiguous segments of similarly classified frames are selected as tennis video clips and used for further analysis for high-level classification (see Fig.). Typical results of the application of the above tennis court clip selection algorithm are given in Fig.

Remark: Note that the color of a carpet court can vary widely in general. We have used some of the standard colors in the above table. However, for any carpet court with a non-standard color, an example image frame from the input tennis video needs to be given as a training set for accurate classification using the above algorithm.
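The court segment selection algorithm above can be sketched in code as follows. This is a minimal illustration for RGB frames: the per-class mean colors and thresholds below are made-up placeholders standing in for the numeric entries of Table 1, and the window size and court-fraction cutoff are likewise illustrative.

```python
import numpy as np

# Hypothetical class data: the real mean colors and thresholds come from
# Table 1 (statistical analysis of example frames); these are placeholders.
COURT_CLASSES = {
    "clay":  {"means": [(168, 92, 60)], "class_thr": 40.0, "frac_thr": 50.0},
    "grass": {"means": [(70, 120, 60)], "class_thr": 40.0, "frac_thr": 50.0},
}

def classify_court_frame(frame, window=64, min_fraction=0.5):
    """Return the court class of an RGB frame, or None for a non-court frame."""
    h, w, _ = frame.shape
    win = frame[h//2 - window//2 : h//2 + window//2,
                w//2 - window//2 : w//2 + window//2].reshape(-1, 3)

    # Step 2: dominant (maximum-frequency) color in the central window.
    colors, counts = np.unique(win, axis=0, return_counts=True)
    max_color = colors[np.argmax(counts)].astype(float)

    # Steps 3-4: nearest class = minimum distance over its member colors.
    best = None
    for name, cls in COURT_CLASSES.items():
        dists = [np.linalg.norm(max_color - np.array(m, float))
                 for m in cls["means"]]
        i = int(np.argmin(dists))
        if best is None or dists[i] < best[1]:
            best = (name, dists[i], np.array(cls["means"][i], float), cls)
    name, dist, court_color, cls = best

    # Step 5: reject if the dominant color is too far from every class.
    if dist > cls["class_thr"]:
        return None

    # Steps 6-8: fraction of window pixels close to the court color.
    pixel_dists = np.linalg.norm(win.astype(float) - court_color, axis=1)
    court_fraction = np.mean(pixel_dists <= cls["frac_thr"])

    # Step 9: accept only if enough of the window matches the court color.
    return name if court_fraction > min_fraction else None
```

A frame whose central window is dominated by a color close to one of the class members is accepted and labeled with that class; anything else is rejected as a non-court frame.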

The following section presents in detail a quantitative model which we have developed for a tennis court under the perspective projection assumption for the camera. This model is used in the development of the court-line detection algorithm in Section 5.

4 A Model for the Tennis Court

In this section, we derive a quantitative model for a tennis court in the image domain under the assumption of perspective camera projection. For this, we exploit the form-based and camera-geometry-based constraints described below.

We make the following assumptions:

A1. The dimensions and connectivity (form) of the tennis court-lines are known (see Fig. for a complete specification of the tennis court dimensions; for convenience, only the singles court is shown in the figure).

A2. The camera geometry is shown in Fig., with the equations for imaging given by equations (1) and (2) below.

According to the camera geometry shown in Fig., the tennis court can be viewed as a symmetrically located 2D object in a frontal plane, i.e., a plane perpendicular to the viewing direction of the camera (see the dashed lines in Fig.), which is tilted away by an unknown angle θ about the X-axis in the object-centered coordinate system, and then perspectively projected onto the image plane of the camera using the camera-centered coordinate system. Also note that the object-centered coordinate system is modeled simply as a translation of the camera-centered coordinate system by an unknown amount D along the viewing direction of the camera (the Y-axis, also the Y'-axis). Mathematically, the complete imaging mechanism can be described by the following equations:

    X' = X
    Y' = Y cos θ - Z sin θ + D                                        (1)
    Z' = Y sin θ + Z cos θ

and

    x = f X' / Y' = f X / (Y cos θ - Z sin θ + D)
                                                                      (2)
    y = f Z' / Y' = f (Y sin θ + Z cos θ) / (Y cos θ - Z sin θ + D)

where (X, Y, Z)^t is the coordinate vector of a point on the court in the object-centered coordinate system, θ is the tilt angle about the X-axis, (X', Y', Z')^t is the coordinate vector of the point on the court in the camera-centered coordinate system, f is the focal length of the camera, and (x, y)^t is the perspective projection vector of the point in the image plane. Typical frames of a tennis video, shown in Figs., confirm the validity of the camera geometry shown in Fig.
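As a numerical illustration, the imaging mechanism above can be transcribed directly into code; the tilt angle, distance, and focal length passed in are arbitrary illustrative values, since the model itself never requires them to be known.

```python
import math

def project_court_point(X, Y, Z, theta, D, f):
    """Project an object-centered court point (X, Y, Z) into the image,
    following the imaging equations above: a tilt by theta about the X-axis,
    a translation by D along the viewing (Y) direction, then perspective
    projection with focal length f."""
    # Object-centered -> camera-centered coordinates (tilt + translation).
    Xc = X
    Yc = Y * math.cos(theta) - Z * math.sin(theta) + D
    Zc = Y * math.sin(theta) + Z * math.cos(theta)
    # Perspective projection onto the image plane.
    return f * Xc / Yc, f * Zc / Yc
```

One sanity check consistent with the geometry: two court points placed symmetrically about the X-axis project to image points that are mirror images about the image's vertical axis.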

While we use the knowledge of the tennis court dimensions and assume a typical imaging geometry, we do not make any assumptions about the knowledge of the tilt angle θ, the distance D, or the camera focal length f. However, we make the following assumption about feature extraction:

A3. It is possible to extract, from an image of a tennis court, the three projected segments of the court-lines corresponding to the segments P P, P P, and P P (see Fig.).

In the rest of the report, we refer to these three segments as the basic segments. We now present the following result.

Lemma 1: Under the assumptions A1, A2, and A3, it is possible to reconstruct the perspective projections of all the other points on the tennis court-lines, especially for the points P through P. Thus, it is possible to completely recover the tennis court-lines in the image domain.

Proof: We prove the lemma by showing that it is possible to reconstruct the perspective projection of the point P on the image; a similar argument can be given for all the other points. Let L1 be the length of the perspective projection of the segment P P, and let L2 be the length of the perspective projection of the segment P P. From assumption A3, it is possible to measure L1 and L2. Also, from assumptions A1 and A2, it can be shown (see Appendix A) that L1 and L2 are related through the quantity

    F = sin θ / D ,

so that F can be estimated from the measured lengths L1 and L2. Now, let B1 be the image-domain vector (i.e., the perspective projection) corresponding to the vector P P, and let B2 be the image-domain vector corresponding to the vector P P. Similarly, let B3 be the image-domain vector corresponding to the vector P P. Then, from assumptions A1 and A2, it can be shown (see Appendix B) that B3 can be expressed as a linear combination of B1 and B2 as

    B3 = α B1 + β B2 ,

where the coefficients α and β are functions of the parameter F alone. Since the value of F can be estimated as described above, B3 can be recovered, and hence the perspective projection of the point P can be reconstructed on the image. Expressions similar to the linear combination above can be derived for all the other points on the tennis court-lines, wherein the values of α and β change depending on each point. The values of α and β can be determined from the knowledge of the measurable parameter F (see Table 2). This completes the proof of the lemma.

From the above lemma and the information about the form (connectivity) of a tennis court, it is possible to recover the tennis court-lines completely in the video frames. The advantages of using the proposed mathematical model are twofold:

1. It enables the recovery of the projected tennis court-lines in the video frames with the help of a minimal number of measurements and of the associated heuristics needed for making assumption A3 hold.

2. The complete tennis court frame can be recovered even in those parts where the court-lines are occluded and/or are very difficult to extract. This gives very important information about the location of all the segments of the tennis court-lines in a given image. This helps in an accurate determination of the relative position of the players with respect to the tennis court-lines, which is crucial in the high-level reasoning steps for generating semantic annotations for the tennis video.

Table 2 lists the values of α and β for the perspective projections of the tennis court points P through P (see Fig.), for expressing them as linear combinations of the basis vectors B1 and B2.

Table 2: Values of α and β for the points on the tennis court-lines shown in Fig. The parameter F can be measured from the lengths L1 and L2. Each entry of the table gives, for one court point, its coefficients α and β as simple rational functions of F.

    Point | α | β

Remark: In the above analysis, for convenience, we have chosen the segments P P and P P for the basis vectors B1 and B2, respectively. It should be noted that any other two segments of the tennis court-lines could be used for the same purpose, as long as they span the 2D image domain.
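In code, the reconstruction step of the lemma amounts to a 2D linear combination of the two measured basis vectors. The α and β values below are placeholders: in practice they come from Table 2 as functions of the measured parameter F, and the pixel coordinates are invented for illustration.

```python
import numpy as np

def reconstruct_point(origin, B1, B2, alpha, beta):
    """Perspective projection of a court point reconstructed as
    B3 = alpha * B1 + beta * B2, offset from the common origin of the
    two measured basis segments (all quantities in image coordinates)."""
    return (np.asarray(origin, float)
            + alpha * np.asarray(B1, float)
            + beta * np.asarray(B2, float))

# Hypothetical measured quantities (image-domain pixels):
origin = (320.0, 400.0)   # projection of the basis segments' common endpoint
B1 = (150.0, -5.0)        # image-domain vector of the first basic segment
B2 = (-10.0, -120.0)      # image-domain vector of the second basic segment
corner = reconstruct_point(origin, B1, B2, alpha=0.5, beta=1.0)
```

The point of the model is precisely that, once F (and hence α and β for every court point) is known, each hidden or occluded court point follows from this one vector sum.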

Based on the model derived in this section, we present our tennis court-line detection algorithm in the following section.

5 Tennis Court-line Detection Algorithm

The purpose of the tennis court-line detection algorithm is to extract the court-lines in a given tennis video frame. This algorithm has two parts: (i) extraction of the three basic court-line segments in the image domain, and (ii) reconstruction of the complete tennis court-lines according to the mathematical model described in the previous section.

Extraction of the basic line segments

Extraction of the basic line segments is necessary to make assumption A3 hold. For this, we first describe briefly the straight-line detection algorithm that we have used. The following straight-line detection algorithm takes as inputs (i) a starting point and (ii) a line-growing direction. It also uses the following heuristics: (a) the color of the court-line segments is the brightest in the image, and (b) the points on a court-line segment are continuous. Note that these heuristics are valid for good quality tennis video.

Straight-line detection algorithm

1. Go to the given starting point in the image.

2. Consider a pixel strip of image intensities perpendicular to the given line-growing direction (we have fixed the length of the strip to a few pixels on each side of the line).

3. Compute the mean and std (standard deviation) of the image intensities in the strip. Note the locations of the pixels whose intensities exceed the threshold Th (the mean plus a fixed multiple of std). Call them LinePoints.

4. Find the middle point of the LinePoints in the pixel strip. Update the given starting point to the point which is one pixel beyond the middle point along the given line-growing direction. If the updated starting point is not an 8-neighbor of the previous starting point, STOP (this check is done only for the points other than the given starting point), consider that the line segment has ended, and go to step 6. Otherwise, go to step 5.

5. Using this updated starting point, repeat steps 2-4.

6. List all the LinePoints detected during the above steps. Fit a straight line through the line points in the least-square-error sense and compute the parameters (equation) of the line segment.
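A compact sketch of this line-growing procedure on a grayscale image is given below. The strip half-length, the threshold multiple of the standard deviation, and the tolerance used for the 8-neighbour test are illustrative choices of ours; the report's exact constants are not reproduced here.

```python
import numpy as np

def detect_line(img, start, direction, strip_half=10, k=2.0, max_steps=1000):
    """Grow a bright line from `start` along `direction` (a unit step such as
    (1, 0)), following the strip/threshold/middle-point procedure in the text,
    then fit y = slope * x + intercept to the collected LinePoints."""
    h, w = img.shape
    d = np.asarray(direction, float)
    n = np.array([-d[1], d[0]])               # strip direction (perpendicular)
    p = np.asarray(start, float)              # current (x, y) starting point
    line_points = []
    for _ in range(max_steps):
        # Step 2: pixel strip perpendicular to the line-growing direction.
        offs = np.arange(-strip_half, strip_half + 1)
        pts = np.round(p + offs[:, None] * n).astype(int)
        pts = pts[(pts[:, 0] >= 0) & (pts[:, 0] < w) &
                  (pts[:, 1] >= 0) & (pts[:, 1] < h)]
        if len(pts) == 0:
            break
        vals = img[pts[:, 1], pts[:, 0]]
        # Step 3: LinePoints = pixels brighter than mean + k * std.
        bright = pts[vals > vals.mean() + k * vals.std()]
        if len(bright) == 0:
            break                             # the line segment has ended
        # Step 4: middle LinePoint, advanced one pixel along the direction.
        new_p = bright[len(bright) // 2] + d
        if line_points and np.max(np.abs(new_p - p)) > 1.5:
            break                             # not an 8-neighbour: segment ended
        line_points.extend(bright.tolist())
        p = new_p
    # Step 6: least-squares straight-line fit through the LinePoints.
    lp = np.asarray(line_points, float)
    A = np.c_[lp[:, 0], np.ones(len(lp))]
    slope, intercept = np.linalg.lstsq(A, lp[:, 1], rcond=None)[0]
    return slope, intercept, lp
```

Note that the y = f(x) parameterization degenerates for a purely vertical segment; an implementation following the report would fit x = f(y) for the (near-)vertical court-line segments instead.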

We apply the above-mentioned straight-line detection algorithm to detect four different straight-line segments as follows:

(a) Initialize the starting search point slightly below the image center. (This heuristic is used only for the first frame; see the next subsection.)

(b) From this point, apply the straight-line detection algorithm to detect the vertical line segment downwards, till the point which is the perspective projection of the point P (see Fig.) is reached (the straight-line detection algorithm stops at this point). Consider the end point of this segment as the starting point of the next line segment.

(c) Apply the straight-line detection algorithm to detect the horizontal line segment rightwards, till the point which is the perspective projection of the point P (see Fig.) is reached. Consider the end point of this segment as the starting point of the next line segment.

(d) Apply the straight-line detection algorithm to detect the vertical line segment downwards, till the point which is the perspective projection of the point P (see Fig.) is reached. Consider the end point of this segment as the starting point of the next line segment.

(e) Apply the straight-line detection algorithm to detect the horizontal line segment leftwards, till the line detection algorithm stops.

(f) Using the equations of the four straight-line segments detected in steps (b)-(e), compute the locations of the perspective projections of the points P, P, P, and P by considering the intersections of the appropriate straight-line segments.

Note that we compute the locations of the end points corresponding to the basic line segments using the equations of the four straight-line segments. Unlike direct corner detection methods, which are troubled by the localization-versus-detection uncertainties, using the intersections of straight-line segments to detect corner points leads to very accurate localization as well. This, in turn, increases the accuracy of the basis vectors B1 and B2 and helps reliable reconstruction of the complete tennis court-lines in the image domain using the mathematical model, as described in the next subsection.
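Computing a corner as the intersection of two fitted lines reduces to a 2x2 linear solve. A minimal sketch follows; representing each line as a*x + b*y = c is our choice here, not necessarily the report's.

```python
import numpy as np

def line_intersection(l1, l2):
    """Intersection of two lines, each given as a tuple (a, b, c)
    representing the line a*x + b*y = c. Raises LinAlgError for
    parallel lines (singular system)."""
    A = np.array([l1[:2], l2[:2]], float)
    c = np.array([l1[2], l2[2]], float)
    return np.linalg.solve(A, c)

# Example: a vertical line x = 50 and a horizontal line y = 30
# intersect at the (sub-pixel-capable) corner (50, 30).
corner = line_intersection((1.0, 0.0, 50.0), (0.0, 1.0, 30.0))
```

Because the two line equations are each fitted to many strip samples in the least-square-error sense, the solved intersection is far less noisy than any single detected corner pixel, which is the localization advantage noted above.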

Reconstruction of the complete tennis court

After the detection of the three basic segments of the tennis court in the image, we reconstruct the points P through P according to the model derived in the previous section (see Table 2). Then we use the knowledge about the form (connectivity) of the tennis court-line segments to reconstruct the complete tennis court in the image domain. Except for the first frame, we also re-initialize the center of the tennis court-lines in the image based on the reconstructed tennis court-lines in the previous image; this is used as the starting point in step (a) above. Only for the first frame do we choose the starting point slightly below the center of the image in step (a) above.

Fig. shows a typical result of the application of our line detection algorithm to real tennis video frames. Without any loss of generality, we have reconstructed the singles court only in all the images. Fig. shows the result of tracking the court-lines over a sequence of tennis video frames. Note that the camera sways slightly to the right and left during the play in this typical tennis video, thus creating slight deviations from the assumed camera geometry. However, the court-line detection algorithm has successfully tracked the court-line segments during the play. This shows the robustness of the proposed model-based approach for court-line detection.

6 Tennis Player Tracking Algorithm

In this section, we present our player-tracking algorithm. For the purpose of tracking the players over the image sequence, the initial locations of the two players in the image domain have to be located. For this, we use the following motion-based detection algorithm.

Player detection algorithm

1. Track the tennis court-lines over the first few frames and determine if there is any camera motion present during the period. The intersection points of the tennis court-line segments are used to estimate a 6-parameter affine 2D motion model in the least-square-error sense, and the six model parameters are analysed to check if there is any genuine camera motion.

2. Select the first frame and the next frame (e.g., the 2nd or 3rd frame).

3. If there is camera motion detected in step 1, then:
   (i) Estimate a 2D affine motion model for the motion between the two frames.
   (ii) Generate a motion-predicted frame from the first frame for the next frame (the 2nd or 3rd frame).
   (iii) Generate a residue frame by subtracting the motion-predicted frame from the next frame.
   Otherwise:
   (i) Generate a residue frame by subtracting the first frame from the next frame.

4. Threshold and binarize the residue frame, and smooth it to eliminate the noise (we employ an iterative smoothing algorithm).

5. Compute the focus-of-interest regions for locating each player by using the court-line information extracted for the next frame, and generate two appropriate search windows (see the Remark below).

6. Extract the largest connected component in the smoothed residue frame inside each of the two window locations, and compute the centroids of the two connected components; these two centroids represent the locations of the two players in the next frame.

Remark: In step 5 above, we mentioned appropriate search windows. These search windows are rectangular areas used to limit the connected component analysis to within those windows only in the smoothed residue image. Thus, they serve as focus-of-interest regions. These search windows are defined adaptively, depending on the video frame being considered. For the initial video frame, the width of the search windows is determined by the reconstructed baselines of the tennis court in the image domain, and the search windows are centrally located at the baselines, with a fixed height value for the search windows. For all other frames, we choose the locations of the search windows based on the latest player-tracking results available. This way, the windows are adaptively defined, and the player detection is done within the windows only.
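The static-camera branch of the detection procedure (steps 3-6) can be sketched as follows for grayscale frames. The difference threshold, the use of 4-connectivity, and the omission of the iterative smoothing step are simplifications of ours.

```python
import numpy as np
from collections import deque

def detect_player(prev_frame, next_frame, window, threshold=30):
    """Locate a moving player inside a search window (x0, y0, x1, y1) as the
    centroid of the largest connected component of the thresholded frame
    difference. A static camera is assumed; with camera motion, the residue
    is computed against a motion-predicted frame instead."""
    x0, y0, x1, y1 = window
    residue = np.abs(prev_frame[y0:y1, x0:x1].astype(int) -
                     next_frame[y0:y1, x0:x1].astype(int))
    mask = residue > threshold

    # Largest 4-connected component of the binary residue (BFS labelling).
    best, seen = [], np.zeros_like(mask, dtype=bool)
    h, w = mask.shape
    for sy in range(h):
        for sx in range(w):
            if mask[sy, sx] and not seen[sy, sx]:
                comp, q = [], deque([(sy, sx)])
                seen[sy, sx] = True
                while q:
                    cy, cx = q.popleft()
                    comp.append((cy, cx))
                    for ny, nx in ((cy-1, cx), (cy+1, cx),
                                   (cy, cx-1), (cy, cx+1)):
                        if (0 <= ny < h and 0 <= nx < w
                                and mask[ny, nx] and not seen[ny, nx]):
                            seen[ny, nx] = True
                            q.append((ny, nx))
                if len(comp) > len(best):
                    best = comp
    if not best:
        return None
    cy = sum(p[0] for p in best) / len(best)
    cx = sum(p[1] for p in best) / len(best)
    return (x0 + cx, y0 + cy)   # centroid in full-frame (x, y) coordinates
```

Running this once per search window yields the two centroids used to initialize the Top Player and Bottom Player templates described next.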

Fig. shows a typical result of using the above algorithm for the motion-based detection of the player locations. After the detection of the locations of the two players in the initial frame, we generate templates for the two players and call them Top Player (TP) and Bottom Player (BP), since the TP appears near the top edge and the BP near the bottom edge in an image. The templates are simply square windows of fixed sizes, placed centrally at the locations of the two players (we use a larger window size for the BP, since the apparent size of the BP is usually larger than that of the TP), which contain the image data (pixels) from the video frame. Since the two templates are centrally placed at the detected locations of the players, they contain the image data corresponding to the two players and hence form crucial inputs to the player-tracking algorithm.

After the template generation, we use the following algorithm for tracking the two players over the rest of the image sequence. We describe the algorithm for a template T, which can be either that of the BP or that of the TP, extracted from a current frame and being tracked in the following frames.

Player tracking algorithm

1. Let T be the template, of a fixed size, centered at location (p, q) in the current frame C. Let F be the following frame, and let N be the next frame, i.e., the frame following F.

2. Generate a binary image H containing only the reconstructed tennis court-lines of F, using the court-line detection algorithm given in Section 5:

       H(i, j) = 0  if a tennis court-line passes through (i, j),
       H(i, j) = 1  otherwise.

3. Set Max_value to a very large value.

4. For each pixel location (i, j) in a b x b window B around (p, q) in F, do:

   (a) Compute the match_value between T and the similar-sized template in F located at (p + i, q + j), using

       match_value = Σ_(u,v) | T(u, v) - F(p + i + u, q + j + v) | · H(p + u, q + v)

   (b) If Max_value > match_value, do:
       Max_value = match_value,
       min_p = p + i and min_q = q + j.

5. The matching player location in F is (min_p, min_q). Update the location of the player using (p, q) = (min_p, min_q). Update the contents of the template T from the data in frame F at the updated location (p, q).

6. Update the current frame and the following frame using C = F and F = N.

7. Repeat steps 2-6 till there is no next frame N.

Note that the match value computed in step (a) above is the net absolute difference value between the windows, and the algorithm used above is basically a full-search minimum absolute difference (MAD) algorithm, used widely in block-based motion estimation. However, we have modified the basic algorithm to exploit the available information about the tennis court-lines in the current frame, by using a binary weighting factor H. The basic purpose of this weighting factor is to discard those points of T which contain pixels corresponding to the static object, the tennis court. Thus the template represents the player information more appropriately, leading to more accurate results for the template-matching algorithm. Also note that, after the match is found, the algorithm updates the template T so that it represents the current information available about the player. This is also very important for accurate performance of the tracking algorithm, because the player information keeps changing dynamically as the play progresses over the image sequence; if such updating of the template is not done, the modified MAD algorithm may easily give an incorrect match, thereby leading to errors while tracking the players. The results reported later represent the typical performance of the player-tracking algorithm given above.
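The masked MAD matching of step (a) can be sketched as follows; this is a minimal illustration, assuming grayscale frames stored as NumPy arrays, a court-line mask H with 0 on court-line pixels and 1 elsewhere, and a search window that stays inside the frame (border handling omitted). All names are illustrative.

```python
import numpy as np

def masked_mad_match(T, F, H, p, q, b):
    """Full-search minimum-absolute-difference match for template T
    (centered at (p, q) in the current frame) inside a square search
    window of radius b in frame F.  H is a binary court-line mask:
    0 on court-line pixels, 1 elsewhere, so pixels of T that lie on
    the static court lines are discarded from the difference sum."""
    h, w = T.shape
    hh, hw = h // 2, w // 2
    # Mask taken around the current location (p, q), as in step (a).
    mask = H[p - hh:p + hh + 1, q - hw:q + hw + 1]
    best_val, best_loc = np.inf, (p, q)
    for di in range(-b, b + 1):
        for dj in range(-b, b + 1):
            cand = F[p + di - hh:p + di + hh + 1,
                     q + dj - hw:q + dj + hw + 1]
            val = np.sum(np.abs(T.astype(int) - cand.astype(int)) * mask)
            if val < best_val:
                best_val, best_loc = val, (p + di, q + dj)
    return best_loc
```

After the best location is returned, the tracker re-centres the template there and refreshes its contents from F before moving on to the next frame, as in the update steps of the algorithm.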

High-level Reasoning Module

In this section we briefly describe the analysis done in the high-level reasoning module. This module takes the information about the tennis court-lines extracted by the court-line detection module and the information about the player positions over the image sequence extracted by the player-tracking module. These two measurements form the crucial inputs to the high-level reasoning module, where they are mapped to high-level semantic interpretations relevant to the tennis-play events in the video segment. For this, simple logical decisions are made based on the values of these measurements; more specifically, the decisions are based on the relative positions of the player locations with respect to the tennis court-lines. For example, if the positional information extracted by tracking the two players during the video segment confirms that they play essentially near their respective baselines (see Fig.), then the high-level reasoning module maps the set of measurements to the baseline-rallies play event and annotates the video segment accordingly. The table below summarizes this mapping of low-level measurements (positional information) to high-level content about tennis-play events. In the table we use the following abbreviations (see Fig.):

BL: Baseline
SL: Service line (horizontal)
NN: Near the Net
BLC: Center of baseline
SLC: Center of service line

The following section presents some of the results we have obtained on real tennis video segments using the proposed automatic classification approach.

Table: Mapping low-level positional information to high-level tennis-play events

No.  TP Initial  TP Final  BP Initial  BP Final  High-level Annotation
1    BL          BL        BL          BL        Baseline-rallies
2    BL          NN        BL          BL        Passing-shot
3    BL          BL        BL          NN        Passing-shot
4    BL          BL        BLC         SLC       Serve-and-Volley
5    BLC         SLC       BL          BL        Serve-and-Volley
6    SL          NN        SL          NN        Net-game
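The mapping in the table above amounts to a lookup on the quantized initial and final locations of the two players. A minimal sketch in Python, with the court-region labels as plain strings; the rule set mirrors the table rows, while the function and variable names are illustrative:

```python
# Court-region labels: BL = baseline, SL = service line (horizontal),
# NN = near the net, BLC = center of baseline, SLC = center of service line.
RULES = {
    # (TP initial, TP final, BP initial, BP final) -> annotation
    ("BL", "BL", "BL", "BL"): "Baseline-rallies",
    ("BL", "NN", "BL", "BL"): "Passing-shot",
    ("BL", "BL", "BL", "NN"): "Passing-shot",
    ("BL", "BL", "BLC", "SLC"): "Serve-and-Volley",
    ("BLC", "SLC", "BL", "BL"): "Serve-and-Volley",
    ("SL", "NN", "SL", "NN"): "Net-game",
}

def annotate(tp_initial, tp_final, bp_initial, bp_final):
    """Map quantized player positions to a high-level play annotation."""
    return RULES.get((tp_initial, tp_final, bp_initial, bp_final), "Unknown")
```

Measurements that match no rule fall through to an "Unknown" annotation rather than being forced into one of the six events.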

Results

We have implemented the proposed tennis video classification approach and have obtained excellent results. In our implementation, we partition an input tennis video segment into chunks of consecutive frames and process each chunk similarly. For each chunk of video frames we do the following:

1. Detect the tennis court-lines in all the frames of the chunk, using the court-line detection method described earlier.

2. Detect the locations of the two tennis players in the second or third frame of the chunk, by applying the motion-based strategy described earlier.

3. Track the two players over the rest of the video frames of the chunk, using the player-tracking algorithm given above.
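The three per-chunk steps above can be sketched as a small driver loop. This is only an illustration of the control flow: the processing functions are passed in as parameters because their implementations live in the earlier modules, and all names and the chunk length are placeholders rather than values from the report.

```python
def process_segment(frames, chunk_len, detect_court_lines, detect_players, track_players):
    """Partition a tennis video segment into chunks and run the three
    per-chunk steps: court-line detection on every frame, player
    detection on an early frame, then tracking over the chunk.
    Assumes every chunk contains at least two frames."""
    results = []
    for start in range(0, len(frames), chunk_len):
        chunk = frames[start:start + chunk_len]
        court_lines = [detect_court_lines(f) for f in chunk]   # step 1
        tp, bp = detect_players(chunk[1])                      # step 2 (second frame)
        tracks = track_players(chunk, tp, bp, court_lines)     # step 3
        results.append((court_lines, tracks))
    return results
```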

In what follows, we present in detail the results of using our automatic tennis video analysis approach on two tennis video segments containing different high-level content. We believe these results demonstrate the merit of the model-based approach proposed here.

A Baseline-rallies Video Clip

In this subsection we present the results we have obtained for a video segment containing baseline rallies between two tennis players. The video segment consists of four chunks of consecutive frames. In the video segment the two players play essentially from their respective baselines, and no camera motion is detected. Fig. shows some of the video frames for a visual depiction of the tennis-play event in the video segment. The results of tracking the players during the play are shown separately in Fig. The results of applying our technique are summarized pictorially in Fig., which shows both the detected tennis court-lines and the tracks of the two tennis players as the match progresses over the video segment.

A Passing-shot Video Clip

In this subsection we present the results we have obtained for a video segment containing a passing-shot situation between two tennis players. The video segment consists of three chunks of consecutive frames. In the video segment the two players play initially from their respective baselines; as the match progresses, the bottom player (BP) rushes towards the net while the top player (TP) stays at the far-end baseline. There is no camera motion in the video segment. Fig. shows some of the video frames for a visual depiction of the tennis-play event in the video segment. The results of tracking the players during the segment are shown in Fig. A pictorial summary of the results of applying our technique is shown in Fig., which shows both the detected tennis court-lines and the tracks of the two tennis players as the match progresses over the video segment.

Summary and Conclusion

In this report we have presented a case wherein domain-specific knowledge and constraints are exploited to accomplish the very difficult task of automatically annotating video segments with high-level semantics. In particular, we have presented our research techniques and results on the automatic analysis of tennis video for the purpose of high-level classification of tennis video segments to facilitate content-based retrieval. Our approach is based on a quantified model for the tennis court in the image domain, under the assumption of perspective projection for the camera. We derive this model by making use of a priori knowledge about the actual dimensions of a tennis court as well as the typical camera geometry used while shooting a tennis video; we have used only information about the general camera imaging geometry, and we do not use any particular information about the camera parameters or object distances. We also make use of the connectivity information about the tennis court-lines (i.e., the information about the form of the object), and have demonstrated the applicability of the model on real tennis video frames by developing a tennis court-line detection algorithm and showing the results of applying the algorithm to real tennis court images. We have also presented a robust tennis-player tracking algorithm to track the two tennis players over the image sequence. The automatically extracted tennis court and player location information form the crucial measurements for the high-level reasoning, where we analyze the relative positions of the two players with respect to the tennis court-lines and the tennis net, so as to map the measurements to high-level event semantics like baseline-rallies, passing-shot, serve-and-volleying, net-games, etc. Results on real tennis video data are presented to demonstrate the merit of the approach.

We believe that similar exploitation of domain-specific constraints in other applications can be highly useful in overcoming the very difficult task of automatically annotating real video segments with high-level content pertinent to those specific domains. We feel that a fine blend of domain-specific techniques and general-purpose algorithms would go a long way towards achieving the lofty goal of providing content-based access to video.

Acknowledgement

We thankfully acknowledge the support of the Sino Software Research Centre of The Hong Kong University of Science and Technology for carrying out this research under the grant SSRCEG, and the support of the Research Grants Council under the grant DAGEG.

Appendix

A  Derivation of Eq.

From Fig. and the preceding equations, it follows that the perspective projections p1, p2, p3 and p4 of the points P1, P2, P3 and P4 (see Fig.), respectively, are given by

    p_i = ( f X_i / (Y_i sin(theta) + D),  f Y_i cos(theta) / (Y_i sin(theta) + D) ),   i = 1, 2, 3, 4,

where (X_i, Y_i) are the ground-plane coordinates of P_i, f is the focal length, theta is the camera tilt angle and D is the camera distance. The endpoints P1, P2 of the first segment share a common depth Y_a, and the endpoints P3, P4 of the second segment share a common depth Y_b. Using the above expressions, we now have the lengths L1 and L2 of the two segments p1 p2 and p3 p4, respectively, in the image domain given by

    L1 = f (X_2 - X_1) / (Y_a sin(theta) + D)   and   L2 = f (X_4 - X_3) / (Y_b sin(theta) + D).

Hence Eq. follows.
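As a numeric sanity check of the length-ratio derivation above, the following sketch evaluates the tilted-camera projection model directly; the values chosen for f, theta, D and the court coordinates are purely illustrative and are not taken from the report.

```python
import math

def project(X, Y, f, theta, D):
    """Perspective projection of a ground-plane point (X, Y) for a camera
    tilted by theta at distance D (assumed model of the derivation)."""
    denom = Y * math.sin(theta) + D
    return (f * X / denom, f * Y * math.cos(theta) / denom)

f, theta, D = 1.0, math.radians(30), 10.0
X1, X2 = -13.5, 13.5          # endpoints of a court line (half-width in feet)
Y1, Y2 = 5.0, 83.0            # near and far depths (illustrative)

# Image-domain lengths of two equal, parallel ground segments.
L1 = project(X2, Y1, f, theta, D)[0] - project(X1, Y1, f, theta, D)[0]
L2 = project(X2, Y2, f, theta, D)[0] - project(X1, Y2, f, theta, D)[0]

ratio = L1 / L2
expected = (Y2 * math.sin(theta) + D) / (Y1 * math.sin(theta) + D)
```

Under this model, ratio equals expected exactly: the image lengths scale inversely with (Y sin(theta) + D), which is what the derivation asserts.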

B  Derivation of Eq.

Using the expressions for the points p1, p2 and p3 from Eq., the two basis vectors B1 and B2 are given by

    B1 = p2 - p1   and   B2 = p3 - p1,

with each component obtained from the projection formula above; here F denotes the factor sin(theta)/D appearing in these components. Now consider the perspective projection p5 of the point P5 (see Fig.), which is given by the same projection formula, and form the vector p5 - p1. Since B1 and B2 span the 2-D image-domain space, p5 - p1 can always be expressed as a linear combination of B1 and B2 as follows:

    p5 - p1 = alpha B1 + beta B2.

Substituting for B1, B2 and p5 - p1, we obtain two simultaneous equations in alpha and beta; solving them yields the values of alpha and beta in terms of F, as desired.

References

[1] S.W. Smoliar and H.J. Zhang. Content-based Video Indexing and Retrieval. IEEE Multimedia.
[2] A. Del Bimbo, E. Vicario and D. Zingoni. Sequence Retrieval by Contents through Spatio-Temporal Indexing. In Proc. IEEE Symposium on Visual Languages.
[3] Y. Tonomura, A. Akutsu, Y. Taniguchi and G. Suzuki. Structured Video Computing. IEEE Multimedia, Fall.
[4] R. Jain and A. Hampapur. Metadata in Video Databases. ACM SIGMOD Record.
[5] Hong-Jiang Zhang, Atreyi Kankanhalli and S.W. Smoliar. Automatic Partitioning of Full-motion Video. Multimedia Systems.
[6] John C. M. Lee, Q. Li and W. Xiong. VIMS: A Video Information Manipulation System. Multimedia Tools and Applications, Special issue on Representation and Retrieval of Visual Media in Multimedia Systems, Jan.
[7] E. Ardizzone and M. La Cascia. Automatic Video Database Indexing and Retrieval. Multimedia Tools and Applications, Special issue on Representation and Retrieval of Visual Media in Multimedia Systems, Jan.
[8] Gulrukh Ahanger and Thomas D.C. Little. A Survey of Technologies for Parsing and Indexing Digital Video. Journal of Visual Communication and Image Representation.
[9] Michael J. Swain and Dana H. Ballard. Color Indexing. International Journal of Computer Vision.
[10] W. Niblack, R. Barber, W. Equitz, M. Flickner, E. Glasman, D. Petkovic, P. Yanker, C. Faloutsos and G. Taubin. The QBIC Project: Query Images by Content using Color, Texture and Shape. In SPIE Proc. Storage and Retrieval for Image and Video Databases.
[11] A. Pentland, R.W. Picard and S. Sclaroff. Photobook: Tools for Content-based Manipulation of Image Databases. In SPIE Proc. Storage and Retrieval for Image and Video Databases II.
[12] G. Sudhir and John C. M. Lee. Video Annotation by Motion Interpretation using Optical Flow Streams. Journal of Visual Communication and Image Representation.
[13] A. Akutsu et al. Video Indexing using Motion Vectors. In SPIE Proc. Visual Communication and Image Processing.
[14] W. Xiong, John C. M. Lee and Rui-Hua Ma. Automatic Video Data Structuring through Shot Partitioning and Key-Frame Selection. Machine Vision and Applications, Special issue on Storage and Retrieval for Still Image and Video Databases, to appear. Technical Report HKUST-CS.
[15] Hong-Jiang Zhang, Shuang Yeo Tan, Stephen W. Smoliar and Gong Yihong. Automatic Parsing and Indexing of News Video. Multimedia Systems.
[16] Yihong Gong, Lim Teck Sin, Chua Hock Chuan, Hong-Jiang Zhang and Masao Sakauchi. Automatic Parsing of TV Soccer Programs. In Proc. Intl. Conf. on Multimedia Computing and Systems, May.
[17] Drew D. Saur, Yap-Peng Tan, Sanjeev R. Kulkarni and Peter J. Ramadge. Automated Analysis and Annotation of Basketball Video. In Storage and Retrieval for Image and Video Databases V, SPIE, Feb.
[18] D. Yow, Boon-Lock Yeo, M. Yeung and B. Liu. Analysis and Presentation of Soccer Highlights from Digital Video. In Second Asian Conference on Computer Vision (ACCV).
[19] Boon-Lock Yeo and B. Liu. Visual Content Highlighting via Automatic Extraction of Embedded Captions on MPEG Compressed Video. In Proc. of SPIE Conf. on Digital Video Compression: Algorithms and Technologies, Feb.
[20] A. Lumpkin. A Guide to the Literature of Tennis. Greenwood Press, Connecticut, USA.
[21] Dennis J. Phillips. The Tennis Source Book. The Scarecrow Press Inc., USA.
[22] H. Wang and M. Brady. Corner Detection with Subpixel Accuracy. Technical report, Robotics Research Group, Department of Engineering Science, University of Oxford.
[23] F. Dufaux and F. Moscheni. Motion Estimation Techniques for Digital TV: A Review and a New Contribution. Proceedings of the IEEE, June.

[Block diagram: raw footage of tennis video -> color-based shot selection -> video segments containing tennis court -> court-line detection module and player-tracking module -> high-level reasoning module -> high-level annotations; the annotations feed a video information management system (tennis video database, high-level indices, useful textual indices, textual data) with a graphical user interface for queries and retrievals.]

Figure: Block diagram of our tennis video analysis system.

[Plot: court-class indication value (0-7) vs. frame number (0-8000), distinguishing Non-Court, Carpet-Court, Clay-Court, Grass-Court and Hard-Court segments.]

Figure: Results of applying our tennis court clip selection algorithm on input video from a VHS tape titled "SuperStars of Women's Tennis" (c) Vestron Video. This tape contains clips of all types of tennis courts, thus making it most suitable for testing our color-based tennis court clip selection algorithm. The results have been manually verified to be correct.

[Diagram: singles court, 78 ft long and 27 ft wide (13.5 ft half-width), with the service lines 21 ft from the net and corner points marked P0 to P13.]

Figure: Tennis court dimensions. Without any loss of generality, only the singles-court dimensions are shown. Note the corners marked P0 to P13, which are referred to in the report.

[Diagram: camera coordinate frame (X', Y', Z') tilted relative to the ground frame (X, Y, Z) by the tilt angle, at distance D from the tennis court.]

Figure: Typical camera geometry used while shooting tennis video.

Figure: The tennis court-lines detected in some tennis video frames, shown superimposed in blue on the images. The images in the top row are from the video of one match, while those in the bottom row are from the video of a carpet-court match. Note the differences in the parameters of the camera geometry; nevertheless, the same tennis court-line detection algorithm, which does not depend on any particular parameters of the camera geometry, has been applied to all the frames. Court-line segments have been accurately detected in all the frames.

Figure: The tennis court-lines tracked over the image sequence of a tennis video segment, shown superimposed in different colors. The order of frames is from blue to green to magenta to red.

Figure: Typical result of applying the player detection algorithm (right) to the residue frame (left). The two square windows in the right image show the templates corresponding to the two players, located centrally at their respective detected locations.

[Diagram: for each half of the court, the baseline, center of baseline, service line (horizontal) and center of service line, with the net in the middle.]

Figure: Names of different line segments on a tennis court.

Figure: Selected frames from the Baseline-rallies video segment, to visually depict the tennis-play event. The four rows of images correspond to the four successive chunks of frames, in a top-down order; each row shows selected frames from the corresponding chunk. The frames are shown at a scale slightly less than half their original size.

Figure: The templates of the players as they are tracked over the consecutive frames of the Baseline-rallies video segment. Shown on the left are the templates of the Top Player (TP); those of the Bottom Player (BP) are shown on the right. The four rows of images correspond to the four successive chunks of frames, in a top-down order; for each player, each row shows templates etched out from selected frames of the corresponding chunk. The TP templates are smaller than the BP templates; the templates shown here are at actual scale.

Figure: Pictorial presentation of the results of automatic analysis of the Baseline-rallies video segment. The tennis court-lines detected in the image domain are reconstructed and shown as black lines above. The tennis court-lines detected from only one frame are shown, since there is no camera motion in the video segment (this is detected automatically by the analysis described earlier). The results of tracking the two tennis players over the image sequence are shown in color: their initial positions are shown in red and, as the play progresses and their locations change, the color changes gradually towards blue, with their final locations shown in blue. Note that the tracks of the players stay near their respective baselines only. The high-level reasoning module rightly classifies this set of measurements as belonging to the Baseline-rallies tennis-play event.

Figure: Some selected frames from the Passing-shot video segment, to visually depict the tennis-play event. The three rows of images correspond to the three successive chunks of frames, in a top-down order; each row shows selected frames from the corresponding chunk. The frames are shown at a scale slightly less than half their original size.

Figure: The templates of the players as they are tracked over the consecutive frames of the Passing-shot video segment. Shown on the left are the templates of the Top Player (TP); those of the Bottom Player (BP) are shown on the right. The three rows of images correspond to the three successive chunks of frames, in a top-down order; for each player, each row shows templates etched out from selected frames of the corresponding chunk. The templates are shown at actual scale.

Figure: Pictorial presentation of the results of automatic analysis of the Passing-shot video segment. The tennis court-lines detected in the image domain are reconstructed and shown as black lines above. The tennis court-lines detected from only one frame are shown, since there is no camera motion in the video segment. The results of tracking the two players over the video segment are shown in color: their initial positions are shown in red and, as the play progresses and their locations change, the color changes gradually towards blue, with their final locations shown in blue. Note that the track of the bottom player (BP) moves towards the tennis net during the play, while that of the top player (TP) stays essentially at the far-end baseline. The high-level reasoning module rightly classifies this set of measurements as belonging to the Passing-shot tennis-play event.