SIGNATURE SYSTEM FOR VIDEO IDENTIFICATION

by

Sebastian Possos Medellin

A Thesis Submitted to the Faculty of

The College of Engineering and Computer Science

in Partial Fulfillment of the Requirements for the Degree of

Master of Science

Florida Atlantic University

Boca Raton, Florida

August 2010

ACKNOWLEDGMENTS

I would like to thank my wife Adriana, because without her I would not have been able to reach this far and keep going. I also thank Dr. Hari Kalva, who showed me a new world of knowledge, exciting and full of challenges; his guidance and friendship made this work possible. Thanks also to Dr. Maria Petrie, who gave me the opportunity to be at this excellent institution, and to Dr. Oge Marques and Dr. Bassem Alhalabi for their help and support as members of my Thesis Committee.

Finally, I would like to thank my family, friends and colleagues, who were always present, not only during this time but always, giving me their support and lending me a hand whenever I needed it.


ABSTRACT

Author: Sebastian Possos Medellin

Title: Signature system for video identification

Institution: Florida Atlantic University

Thesis Advisor: Dr. Hari Kalva

Degree: Master of Science

Year: 2010

Video signature techniques based on tomography images address the problem of video identification. These methods rely on temporal segmentation and sampling strategies to build and determine the unique elements that form the signature. In this thesis an extension of these methods is presented: first, a new feature extraction method, derived from the previously proposed sampling pattern, is implemented and tested, resulting in a highly distinctive set of signature elements; second, a robust temporal video segmentation system replaces the original method so that shot changes are determined more accurately. Under an exhaustive set of tests the system was able to achieve 99.58% recall, 100% precision and 99.35% prediction precision.


SIGNATURE SYSTEM FOR VIDEO IDENTIFICATION

TABLES ...... viii

FIGURES ...... ix

EQUATIONS ...... xi

Chapter 1 INTRODUCTION ...... 1

1.1 Overview and Motivation ...... 1

1.2 Problem Statement and Objective ...... 2

1.3 Main Contributions ...... 4

1.4 Overview of the Thesis ...... 5

Chapter 2 BACKGROUND AND RELATED WORK ...... 6

2.1 Introduction...... 6

2.2 Video Identification ...... 6

2.2.1 Digital Watermarks ...... 7

2.2.2 Video content based identification ...... 7

2.2.3 Signature based identification ...... 9

2.3 Video Tomography ...... 17


2.4 Advertisement auditing systems ...... 20

Chapter 3 PROPOSED SOLUTION ...... 22

3.1 Introduction...... 22

3.2 General Description ...... 22

3.3 Signature design ...... 23

3.3.1 First Video Signature version ...... 25

3.3.2 Second Video Signature version ...... 37

Chapter 4 IMPLEMENTATION ...... 42

4.1 Introduction...... 42

4.2 Signature extraction ...... 43

4.2.1 FFMPEG libraries ...... 48

4.2.2 Intel IPP 6.0 ...... 49

4.3 Shot Change Detector flow chart...... 50

Chapter 5 EXPERIMENTS AND RESULTS ...... 51

5.1 Introduction...... 51

5.2 MPEG Experiments ...... 51

5.2.1 Data set...... 52

5.2.2 Procedure for testing ...... 53

5.2.3 Uniqueness test ...... 53

5.2.4 Robustness test ...... 60


5.3 Content Auditing ...... 72

5.3.1 Experiment setup ...... 72

5.3.2 Algorithm ...... 73

5.3.3 Results ...... 73

5.3.4 Summary ...... 75

Chapter 6 CONCLUSIONS AND FUTURE WORK ...... 76

6.1 Conclusions ...... 76

6.2 Future Work ...... 77

BIBLIOGRAPHY ...... 79


TABLES

Table 1. Uniqueness Test Performance Summary ...... 57

Table 2. Uniqueness Test Time complexity Performance ...... 57

Table 3. Modified Uniqueness Test Performance Summary ...... 58

Table 4. Uniqueness test Performance Comparison ...... 59

Table 5. Video Modifications and Levels for Robustness Assessment ...... 60

Table 6. Content audit: Signature comparison results for signatures of Type 1...... 74

Table 7. Content audit: Signature comparison results for signatures of Type 2...... 74


FIGURES

Figure 1. Region pair distribution signature extraction – NEC Corp. proposal ...... 10

Figure 2. Neighborhood and pixel relationship – Mitsubishi Electric proposal ...... 11

Figure 3. Multi-resolution signature structure – Mitsubishi Electric proposal ...... 13

Figure 4. Temporal hierarchical structure – University of Brescia proposal ...... 14

Figure 5. Saliency image generation – Peking University proposal for MPEG ...... 15

Figure 6. Saliency based signature extraction – Peking University proposal ...... 15

Figure 7. Mitsubishi Electric – NEC Corp. video signature bitstream ...... 17

Figure 8. Video Tomography Extraction ...... 18

Figure 9. Video tomography extraction for 1 of 6 components ...... 24

Figure 10. Tomography line pattern ...... 24

Figure 11. Tomography, edge and composite images ...... 27

Figure 12. Sub-signature extraction: Level changes along lines ...... 28

Figure 13. Sub-signature extraction: block or area assignment ...... 28

Figure 14. Tomography image with a shot change present ...... 29

Figure 15. Shot Signature Process ...... 30

Figure 16. Frame signature extraction ...... 32

Figure 17. Resulting image after Row sum procedure ...... 39

Figure 18. Resulting image after Column sum procedure ...... 40

Figure 19. Resulting image after texture energy measurement ...... 40

Figure 20. Signature extraction process and external tools used ...... 43

Figure 21. Signature generation Flow chart ...... 48

Figure 22. Shot Change Detector flow chart ...... 50

Figure 23. Query construction for independence tests ...... 55

Figure 24. Success ratios for Text/Logo overlay test ...... 62

Figure 25. Success ratios for Sever compression test...... 63

Figure 26. Success ratios for Resolution reduction test ...... 63

Figure 27. Success ratios for Frame rate reduction test ...... 64

Figure 28. Success ratios for Capturing on Camera test ...... 64

Figure 29. Success ratios for Analog VCR Recording test ...... 65

Figure 30. Success ratios for Color to monochrome conversion test ...... 65

Figure 31. Success ratios for Brightness change test ...... 66

Figure 32. Success ratios for Interlaced / Progressive conversion test ...... 66

Figure 33. Result comparison for Text/Logo overlay test...... 67

Figure 34. Result comparison for severe compression test...... 68

Figure 35. Result comparison for Resolution reduction test...... 68

Figure 36. Result comparison for Frame rate reduction test...... 69

Figure 37. Result comparison for Capturing on Camera test...... 69

Figure 38. Result comparison for Analog VCR Recording test...... 70

Figure 39. Result comparison for Color to monochrome conversion test...... 70

Figure 40. Result comparison for Brightness change test...... 71

Figure 41. Result comparison for Interlaced / Progressive conversion test ...... 71


EQUATIONS

Equation 1. First descriptor element – Mitsubishi Electric ...... 11

Equation 2. Second descriptor element – Mitsubishi Electric ...... 11

Equation 3. Third descriptor element – Mitsubishi Electric ...... 11

Equation 4. Fourth descriptor element – Mitsubishi Electric ...... 12

Equation 5. Binary conversion function – Mitsubishi Electric and NEC Corp ...... 17

Equation 6. Euclidean Distance formula ...... 19

Equation 7. Similarity function – ETRI Institute ...... 21

Equation 8. Frame difference function (Pixel wise) ...... 34

Equation 9. Summation of histogram differences ...... 35

Equation 10. Row sum equation ...... 39

Equation 11. Column sum equation ...... 39

Equation 12. Texture energy ...... 40

Equation 13. Six tap filter for pixel interpolation ...... 45

Equation 14. High contrast five tap decimation filter ...... 45


Chapter 1 INTRODUCTION

1.1 Overview and Motivation

Current technological advances allow people to copy and distribute video content with almost no specialized knowledge. Today, with just the click of a button, any person is able to grab video from any source and post it on the Internet; a simple search in Google or Yahoo retrieves several tutorials and tools, at little or no cost, explaining how to create, edit or duplicate multimedia content.

The World Wide Web also offers the means for publishing and distributing this type of content. The perfect example is YouTube: its statistics from March 2008 indicate that 78.3 million videos were available for anyone to see, 150 to 200 thousand videos were uploaded every day, and on average 13 hours of video content were uploaded every minute. Now, in 2010, the popularity of YouTube is so great that this last statistic has increased by 85%, with a total of 24 hours of content being uploaded every minute [1].

With that huge amount of data, the entertainment and IT industries are facing new challenges: on one side, the hosting and bandwidth consumption related to the constant availability of this huge amount of content; on the other, how to control and monitor the uploaded content, and more specifically how to determine whether an uploaded video is under copyright protection. An automatic copyright enforcement system is therefore highly needed, since the time required for a person to watch all the video content posted on the Internet would be longer than several human lives. For the same reason, the complexity of the system should be low for both the signature extraction and the comparison process, without compromising the uniqueness and robustness properties of the signature.

Alternatively, monitoring systems play a crucial role in the broadcasting operations of any network. They provide a means to continuously capture the transmitted programming, details, schedules and events. All this information, when stored as logs or databases, permits the analysis of performance and quality of service, and the retrieval of any content that was made available at any particular time. Advertisers and the advertisement industry have become increasingly interested in these types of applications, where they can follow all details of the contracted air time.

1.2 Problem Statement and Objective

The proposed solution continues the work done by Gustavo Leon in his thesis "Content Identification using Video Tomography" [3], where a new technique for signature generation using tomography images was designed; the present work builds on it through three new aspects:

• A new pattern for sample extraction: the idea behind this new pattern is to increase the consistency level of the signatures by always extracting sample lines of the same size. In the previous method, composing horizontal and vertical tomography images required cropping the horizontal image so it would match the vertical one; now, because all the extracted lines are the same size, no cropping is needed, resulting in an enhanced uniqueness property.

• Improved temporal segmentation: to enhance dependability, the temporal segmentation of the video cannot be done arbitrarily; if the portions to be compared are not synchronized, the extracted features will not be similar, leading to false identifications. The solution is to implement a shot detection algorithm that allows the video segments to be synchronized accurately, increasing the correlation between them.

• Frame normalization: most video hosting sites reduce the video frame size in order to reduce the amount of data that has to be stored, because storage is a concern. Because of this transformation, the signature confidence can be affected. To solve this, a normalized frame size is proposed: all video analysis is done on a 360 x 240 frame. The resizing is performed with a decimation filter or an interpolation filter depending on the scenario. If the video image is larger than the proposed normalized size, a decimation filter is used, which not only reduces the image but also frees the video from aliasing effects that could introduce undesired noise into the extracted signature. If the video is smaller, an interpolation filter is needed to upsample the frame. Here the issues are different, because the upsampling procedure introduces its own artifacts; to enhance reliability, this filter has to increase the contrast of the new image, augmenting the stability of the features and ensuring consistent feature extraction.

As an extension of this system, an implementation of a real-life application is tested: an automatic advertisement tracking and auditing system based on video signatures.

1.3 Main Contributions

The following are the main contributions of this work:

• This system was developed as a response to the MPEG Call for Proposals on Video Signature technology.

• Implementation of a new signature design, increasing its robustness and dependability.

• Implementation of a shot detection or temporal segmentation module, which allows true synchronization between the signatures of compared videos.

1.4 Overview of the Thesis

The remaining chapters of this thesis are structured as follows: Chapter 2 provides background information on video identification and advertisement auditing; Chapter 3 describes in detail the proposed solution, covering the main elements of the technique and the tools used to increase system performance; Chapter 4 presents the implementation of the system and describes the algorithms used in this approach; Chapter 5 contains the experiments and results based on the implementation; and Chapter 6 presents conclusions and possibilities for future work.


Chapter 2 BACKGROUND AND RELATED WORK

2.1 Introduction

This chapter presents a general review of video identification systems, the techniques they are based on and their components, as well as an overview of what current advertisement audit systems consist of.

2.2 Video Identification

There are two main ways to identify a video [2]:

1. Based on a digital watermark

2. Based on the video content

A further development of video content based identification has been the creation of signatures; the signature method looks for specific elements in the image in order to extract a unique descriptor.


2.2.1 Digital Watermarks

Digital watermarking based approaches rely on an embedded watermark that can be extracted at any time in order to determine the video source. As exposed in [4], watermarking technology followed the idea of "if you can't see it, and if it is not removed by common processing, then it must be secure". Watermarks were first proposed as a solution for identification and tamper detection in video and images by G. Doer et al. [5]. However, they are not usually designed to identify unique clips from the same video source.

The biggest drawbacks of this approach are the need to embed a robust watermark in the video source and the fact that a large collection of "un-watermarked" files already exists.

2.2.2 Video content based identification

Content based identification, on the other hand, uses the content of the video to compute a unique signature based on various video features. Content based video identification system surveys are presented by X. Fang et al. [6] and by J. Law-To et al. [7].

A proposal for copy detection in streaming videos is presented by Y. Yan et al. [8], where a video sequence similarity measure is used, which is a composite of the fingerprints extracted for individual frames. Partial decoding of the incoming video is performed and the DC coefficients of key frames are used to extract and compute frame features.

[9] and [10] are also based on key frame analysis: [9] proposes a clustering technique where the authors take key frames for each cluster of the query video and perform a key frame based search for similar regions in the target videos. [10], on the other hand, uses local features; it extracts key frames to match against a database and then matches the local spatio-temporal features between the videos.

As can be inferred from the above, many content based video identification methods use video signatures computed from features extracted from individual frames, but these frame based solutions are complex and add a large amount of overhead, especially for long duration videos, as they require feature extraction and comparison on a per-frame basis.

Additionally, they are characterized by the use of key frames for temporal synchronization and subsequent video identification; however, determining key frames either relies on the underlying compression algorithms or requires additional computation.

Several video identification system surveys are presented in [6] and [7].


2.2.3 Signature based identification

Signature identification [12] is based on the extraction of specific elements or fingerprints from media items, which ultimately allow each item to be uniquely identified. The main idea behind these signatures is that extracting the features that uniquely describe the content makes it possible to identify it even if the media has been modified by a wide range of common editing operations.

One big advantage that the signature descriptors have is that they are passive means of identification; there is no need to alter the original content, in contrast to a watermark which must be added actively.

2.2.3.1 MPEG Call for Proposals on Video Signature technology.

Five candidates submitted proposals to the MPEG Call for Proposals. Among them, an earlier version of this thesis work was presented, which will be explained in section 2.3. But first, the timeline of the Call for Proposals is reviewed next, along with the four rival proposals.

Timeline

• Updated Call for Proposals on Video Signature Tools issued: 2008.10.17

• Registration of Interest open: 2008.07.25

• Complete dataset distribution to registered participants by post: 2008.10.24

• Registration of Interest closed: 2008.11.30

• Submission deadline: 2009.01.26

• Evaluation of Responses: 2009.01.31

2.2.3.2 Video signature based on feature difference between various pairs of sub-regions (NEC Corp).

In this proposal, a set of sub-regions is defined, the mean value is calculated for each sub-region, and a relationship is then defined between pairs of regions, following the distribution presented in Figure 1.

Figure 1. Region pair distribution signature extraction – NEC Corp. proposal


2.2.3.3 Video Signature based on Multi-resolution decomposition (Mitsubishi Electric R&D Centre Europe).

For this approach a frame is divided into 2x2 neighborhoods; in each 2x2 neighborhood, the relationship between the descriptor elements {a', b', c', d'} and the pixel values {a, b, c, d} of the neighborhood is given by equations (1)-(4).

Figure 2. Neighborhood and pixel relationship – Mitsubishi Electric proposal

a' = (a + b + c + d) / 4

Equation 1. First descriptor element – Mitsubishi Electric

b' = (a - b) / 2

Equation 2. Second descriptor element – Mitsubishi Electric

c' = (b - d) / 2

Equation 3. Third descriptor element – Mitsubishi Electric


d' = (d - c) / 2

Equation 4. Fourth descriptor element – Mitsubishi Electric

This same method is then applied at four scales, 2x2, 4x4, 8x8 and 16x16, to obtain a certain degree of resolution independence; each element is then binarized by keeping its most significant bits.

From each resolution stage a word is formed by projecting the descriptor to a lower-dimensional space, resulting in multiple words that form the signature.
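Purely as an illustration of the per-scale decomposition, the following C++ sketch applies Equations (1)-(4), in the averaged/differenced form reconstructed above, to every 2x2 neighborhood of a grayscale frame; the buffer layout, the function name and the exact signs of the difference terms are assumptions, not the proposal's reference code.

#include <cstddef>
#include <cstdint>
#include <vector>

// Descriptor of one 2x2 neighborhood: average plus three difference terms.
struct Quad { int aP, bP, cP, dP; };

// 'img' holds 8-bit luminance, row-major; width and height are assumed even.
std::vector<Quad> decompose2x2(const std::vector<uint8_t>& img,
                               std::size_t width, std::size_t height) {
    std::vector<Quad> out;
    out.reserve((width / 2) * (height / 2));
    for (std::size_t y = 0; y + 1 < height; y += 2) {
        for (std::size_t x = 0; x + 1 < width; x += 2) {
            int a = img[y * width + x];
            int b = img[y * width + x + 1];
            int c = img[(y + 1) * width + x];
            int d = img[(y + 1) * width + x + 1];
            Quad q;
            q.aP = (a + b + c + d) / 4;  // Equation 1: block average
            q.bP = (a - b) / 2;          // Equation 2 (sign assumed)
            q.cP = (b - d) / 2;          // Equation 3 (sign assumed)
            q.dP = (d - c) / 2;          // Equation 4 (sign assumed)
            out.push_back(q);
        }
    }
    return out;
}

Feeding the resulting a' plane back into the same routine would produce the coarser 4x4, 8x8 and 16x16 scales mentioned above.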


Figure 3. Multi-resolution signature structure – Mitsubishi Electric proposal


2.2.3.4 Hierarchical video signature description using existing MPEG-7 features and motion activity (University of Brescia)

Based on MPEG-7 descriptors, this approach collects the information obtained by extracting the following descriptors: Dominant Color, Color Layout, Motion Activity Map (MAM), and Direction of Motion Activity (DMA). The video sequence is then divided into three levels of hierarchy, each with a specific temporal structure, and from each segment in a layer a set of descriptors is extracted; the obtained values are then used to generate the signature.

Figure 4. Temporal hierarchical structure – University of Brescia proposal

2.2.3.5 Video signature based on saliency map (Peking University).

This approach follows the human attention model to extract a unique set of values to generate the signature.

Figure 5. Saliency image generation – Peking University proposal for MPEG

Figure 6. Saliency based signature extraction – Peking University proposal


2.2.3.6 Mitsubishi Electric R&D Centre Europe and NEC Corp joint proposal

At the second meeting for the presentation of the MPEG-7 video signature tools, Mitsubishi and NEC joined forces to come up with a proposal that takes the advantages of both systems and achieves a more robust and dependable result [13]:

The frame signature is extracted from each decoded frame of a video stream. The signature is made up of a 340-dimensional vector of ternary values {+1, 0, -1} that describes relations and inter-relations between pixel regions in the frames. Each dimension can be characterized as a zero, first or second order operator. The ternary values of dimensions #1-12 are calculated by quantizing the average intensity (luminance) value of the associated sub-region, and the ternary values of dimensions #13-340 are calculated by quantizing the differences between the average intensities of the associated two sub-regions. Thus, the frame signature is composed of a ternary value vector with 12 average element dimensions and 328 difference element dimensions. The 328 difference element dimensions can be further categorized by their patterns of sub-regions into 7 different pattern-types.

That signature is used to form Q n-bit words, with Q = 5 and n = 5. To form each word, a small, ordered set of elements in the signature is concatenated. Thus, the process of word formation is a projection from a Φ-dimensional space to a Ψ-dimensional space, with Ψ << Φ. For two video frames, the distance between two corresponding words, i.e. the bit patterns of the ordered selected corresponding coefficients, is an approximation of the distance of the full frame descriptors. All possible combinations of every possible value of the ordered coefficients that make up a word form the vocabulary for that word.

Figure 7. Mitsubishi Electric – NEC Corp. video signature bitstream

For a sequence of video frames the frequency of occurrence is calculated for the different words in each of the five vocabularies to obtain the bag of words. More specifically, from each frame Q = 5 words w_a, a = 1, ..., Q, are extracted, each corresponding to one vocabulary. Then, for each vocabulary a, a histogram h_a of the words found in the frame sequence is computed. Such a histogram shows the frequency with which words appear in the frame sequence and is referred to as a bag-of-words. Then, each bag-of-words (histogram) is binarised according to Equation 5:

h_a(i) = 1 if h_a(i) >= 1; h_a(i) = 0 otherwise

Equation 5. Binary conversion function – Mitsubishi Electric and NEC Corp
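As a small, hedged illustration of the bag-of-words binarisation (assuming, per the reconstruction of Equation 5 above, that a bin becomes 1 whenever its word occurs at least once in the frame sequence), a C++ fragment could look as follows; the vocabulary size is simply whatever the count vector holds.

#include <cstddef>
#include <cstdint>
#include <vector>

// counts[w] = number of occurrences of word w in the frame sequence
// for one vocabulary; returns the binarised bag-of-words.
std::vector<uint8_t> binariseBagOfWords(const std::vector<uint32_t>& counts) {
    std::vector<uint8_t> bits(counts.size());
    for (std::size_t w = 0; w < counts.size(); ++w)
        bits[w] = (counts[w] >= 1) ? 1 : 0;  // Equation 5, as reconstructed
    return bits;
}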

2.3 Video Tomography


Video tomography was first presented as a way to extract lens zoom, camera pan and camera tilt information using modified motion analysis [11], by introducing tomographic techniques into a motion estimation algorithm. The images generated through this method resemble the flow patterns of ridges in human fingerprints, hence the idea of using them as an identification method.

In video tomography based content identification, a single line is extracted from each frame of the Y component of a video. These single lines are stacked to form a new tomography image [2], as illustrated by Figure 8.

Figure 8. Video Tomography Extraction

This image is then processed through a Canny edge detector [15], which extracts the edges of each grayscale section.


Tomography images generated from different scan patterns, vertical and horizontal on one side (similar to what is shown in Figure 8), and diagonal left and diagonal right (diagonals connecting the upper right corner and the lower left corner and vice versa) on the other, are superposed using an OR operator. If a mismatch in the tomography image dimensions is encountered, only the centered common area is used.

The number of level changes (edges) in these two composites is then counted along 8 specific vertical and 8 specific horizontal lines evenly distributed over the edge tomography, producing 16 counts on the horizontal-vertical composite and another 16 edge counts on the diagonal composite, forming a 32 short-integer signature for each shot. The signature size is always 64 bytes irrespective of the number of frames in the shot.

Matching of two such signatures is achieved by finding the minimum Euclidean distance between the points according to Equation 6:

d(S, S') = sqrt( Σ_i (s_i - s'_i)^2 ), i = 1, ..., 32

Equation 6. Euclidean Distance formula
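To make the matching step concrete, the following C++ sketch computes the Euclidean distance of Equation 6 between two 32-element shot signatures and scans a database for the closest entry; the container types, the linear scan and the assumption of a non-empty database are illustrative choices, not the implementation of [2].

#include <array>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

using ShotSignature = std::array<uint16_t, 32>;  // 32 short integers = 64 bytes

// Equation 6: Euclidean distance between two shot signatures.
double signatureDistance(const ShotSignature& a, const ShotSignature& b) {
    double sum = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        double d = static_cast<double>(a[i]) - static_cast<double>(b[i]);
        sum += d * d;
    }
    return std::sqrt(sum);
}

// Linear scan for the closest database entry; assumes a non-empty database.
std::size_t bestMatch(const ShotSignature& query,
                      const std::vector<ShotSignature>& database) {
    std::size_t best = 0;
    double bestDist = signatureDistance(query, database[0]);
    for (std::size_t i = 1; i < database.size(); ++i) {
        double d = signatureDistance(query, database[i]);
        if (d < bestDist) { bestDist = d; best = i; }
    }
    return best;
}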

The system developed in [2] has two types of signatures: one extracts sample lines from each frame and generates a signature per shot; the second only uses the first 90 frames of the shot to generate the signatures, considering that a shot is defined as a sequence of consecutive images that belong to the same camera. The resulting signatures are then used to build the database for comparison.

The dataset of videos used to test this system consisted of four well known videos (Shrek, Shrek 2, Pirates of the Caribbean, and a broadcast football game). From each video the first 50k frames were used for signature extraction. This specific dataset was selected because of its notoriety and diverse content (animation, movies and sports), also taking into account that two of the videos share content, in this case Shrek and Shrek 2.

In the results, the system demonstrated the power of the tomography based signatures by achieving 100% recall with 97% precision for both types of signatures.

2.4 Advertisement auditing systems

Monitoring of media usage can be relevant in several situations. An advertising company or its client may want to confirm that advertisements have been played correctly by broadcasters, according to the contracted air time. Also, an advertiser, apart from monitoring its own commercials, may want to measure the competitors' marketing activity and take relevant actions based on that information.

A patent exists [16] that proposes a solution for this task; more specifically, a server with web capabilities is loaded with software that allows the server to record all the events that occurred in a broadcast transmission. To do so, the software relies on the metadata assumed to exist, populating the database with the metadata details.

A similar development for content auditing is presented in [17]. This solution uses three different features to distinguish each commercial from general broadcast content and other ads.

The considered features are scene change, edge pattern histogram and the MPEG-7 dominant color descriptor. Then, to determine if a predefined commercial is present in the broadcast signal, the system first extracts the scene change information through a simple pixel difference algorithm and locates candidates that fit the shot length; for those candidates it compares the edge pattern histogram and dominant color features against those of the query video. The similarity function is Equation 7.

S = w_1 * | DC_C - DC_q | + w_2 * | EH_C - EH_q |

Equation 7. Similarity function – ETRI Institute

To test the system performance, a database of features was constructed from 260 TV commercials; the system was then fed with video streams containing 91 commercials, each 4 minutes in duration. Table 1 shows the performance results for this system.

Table 1. Query results - ETRI

Length of Query video    Success    False Positive    False Negative    Processing Time
364 minutes              87.51%     9.27%             3.22%             175 minutes


Chapter 3 PROPOSED SOLUTION

3.1 Introduction

This chapter contains the information related to the proposed solution to generate robust and unique video signatures based on tomography. A detailed description of the blocks used to implement the system will be given, as well as the flow charts for the main system and supporting functions.

3.2 General Description

The proposed solution follows the requirements presented in MPEG's Call for Proposals [12], which are: uniqueness, robustness, independence, fast matching, fast extraction, compactness, non-alteration, self-containment, and coding independence. Of these, the first three requirements affect the functionality of a video signature and the remaining requirements affect the implementation of video identification systems. Uniqueness and independence are related attributes and are necessary for broad applicability. Robustness, on the other hand, affects the operating conditions for video identification (i.e. resolution changes, brightness changes, etc.).

To address the specific conditions required in the MPEG Call for Proposals [12], the following developments were designed taking into account the contest timeline presented in section 2.2.3.1.

The proposed approach to video signatures is based on video tomography. The signature has two components: a shot or segment level signature which is globally unique, and a frame signature that is locally unique. The combination of shot and frame signatures is used to accurately identify video clips.

In order to accurately locate shot changes, a Shot Change Detector was developed using a Motion Estimation engine to enhance the differences between frames that belong to different shots, and the similarities between frames that belong to the same shot.

3.3 Signature design

For this system, the video tomography technique works on the Y component of the video content: a single line is extracted from each frame and is then sequentially appended to create a new tomography image. Figure 9 shows the process of generating a tomography image. This image is then processed through a Canny edge detector, which extracts the edges to reveal patterns in the spatio-temporal domain.


Figure 9. Video tomography extraction for 1 of 6 components

This tomography generation was modified slightly for our design. First, the video is scaled to a resolution of 360x240. Then, three composite tomography images are generated from six different sample patterns, as shown in Figure 10: two upper diagonals (from the upper corners to the middle), two lower diagonals (from the middle to the lower corners) and two regular diagonals (joining opposite corners). Both diagonals of each set are superimposed and a composite signature image is created using the OR operation.

Figure 10. Tomography line pattern

The tomographic edge images extracted from these six patterns have a complex structure reminiscent of fingerprints, as shown in Figure 11. Originally the idea was to exploit tools from fingerprint analysis to extract the features that would build the signatures. Fingerprint analysis uses a combination of ridge endings and ridge bifurcations to match fingerprints [18]. Ridges and bifurcations in tomographic images are formed when lines representing spatio-temporal changes intersect.

From this analysis a first signature version was developed to address the first milestone in the Call for Proposals [12].

3.3.1 First Video Signature version

As mentioned before, the development focused on finding a way in which the tomography images would create patterns similar to a fingerprint. One simple way of accomplishing this is to combine tomographic images created from different scan patterns. Nine different tomographic image combinations were created to determine which would give the most unique information:

1. Star pattern: Using OR operations, image pairs (1,2), (3,4) and (5,6) are combined, resulting in 3 different composite images.

2. Large left diagonal pattern: Tomography image 1 is selected without modification.

3. Large right diagonal pattern: Tomography image 2 is selected without modification.

4. Top left diagonal pattern: Tomography image 3 is selected without modification.

5. Top right diagonal pattern: Tomography image 4 is selected without modification.

6. Bottom left diagonal pattern: Tomography image 5 is selected without modification.

7. Bottom right diagonal pattern: Tomography image 6 is selected without modification.

8. Left diagonal pattern: Tomography images 1, 3 and 6 are selected separately without modification.

9. Right diagonal pattern: Tomography images 2, 4 and 5 are selected separately without modification.

The three composite images created for the star pattern combination are depicted in Figure 11. As can be observed there, the composite images are visually as complex as a fingerprint.


Figure 11. Tomography, edge and composite images

Before well-known fingerprint analysis was applied, simpler metrics inspired by the minutiae in fingerprint analysis were developed. The key constraint here is the ability to extract the features from exactly the same position in the composite image irrespective of the distortion a clip may suffer due to compression and other transformations. One metric used was the number of level changes at discrete points or in specific areas of the composite images, i.e., the black-to-white transitions representing the number of edges were counted. The second one was the number of white pixels in particular areas of the image.

The level changes were measured along horizontal and vertical lines at predetermined points in the composite images, as seen in Figure 12, where eight horizontal and eight vertical positions are used; or along all the horizontal or vertical pixels in specific areas of the image, as seen in Figure 13, where the image was divided into 16 blocks or areas.

The white pixel count is computed over the same 16 areas.


Figure 12. Sub-signature extraction: Level changes along lines

Figure 13. Sub-signature extraction: block or area assignment

The number of areas or lines selected for the metric count determines the complexity and length of a signature. This count can be as high as half the width of the edge image (360) and is stored as an 8-bit integer. This count aids in the construction of a 48-byte signature, which can be built either by applying one metric to several different scan pattern image combinations (like the star pattern) or by applying several metrics to one scan pattern (like the large left diagonal). The signature size is always 48 bytes irrespective of the number of frames in a clip.
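The two metrics can be pictured with the short C++ sketch below, which counts black-to-white level changes along one row of a binary edge image and white pixels inside one of the 16 blocks; the 0/255 pixel convention and the block geometry are assumptions made only for illustration.

#include <cstddef>
#include <cstdint>
#include <vector>

// Binary edge image: 0 = black, 255 = white, row-major, 'width' pixels per row.
// Counts level changes (edge crossings) along one horizontal line.
int levelChangesAlongRow(const std::vector<uint8_t>& edges,
                         std::size_t width, std::size_t row) {
    int changes = 0;
    for (std::size_t x = 1; x < width; ++x)
        if (edges[row * width + x] != edges[row * width + x - 1]) ++changes;
    return changes;
}

// Counts white pixels inside one rectangular block (one of the 16 areas).
int whitePixelsInBlock(const std::vector<uint8_t>& edges, std::size_t width,
                       std::size_t x0, std::size_t y0,
                       std::size_t blockW, std::size_t blockH) {
    int count = 0;
    for (std::size_t y = y0; y < y0 + blockH; ++y)
        for (std::size_t x = x0; x < x0 + blockW; ++x)
            if (edges[y * width + x] == 255) ++count;
    return count;
}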

3.3.1.1 Shot Tomography signature

The signature extraction differs from [2]: there, the signature extraction was done for the whole video, whereas here the video is segmented into a series of clips or shots with a maximum duration of one second. The segmentation procedure is done through a Shot Change Detector, which increases the uniqueness of the signature by determining the starting frame of sequences that belong to the same camera; this translates into a signature made of highly correlated information. Figure 14 shows a tomography image of a long sequence in which a shot change is present; this change is marked by a continuous horizontal line that breaks the behavioral pattern.

Figure 14. Tomography image with a shot change present


Each segment’s maximum length is one second.

Through the use of this strategy, the location for a partial video would have a maximum error of ± one second.

One special characteristic obtained by segmenting the video based on the shot changes is that it allows the synchronization of the signatures at a common place, so that when a query request is placed the search does not need to be exhaustive but can focus first on determining candidates at the shot locations.

Figure 15 shows the shot signature generation process.

Figure 15. Shot Signature Process

A total of 12 different types of signatures were generated using this procedure. These signatures are described in Table 2.


Table 2. Signature types

Signature    Pattern                  Number of images    Line Count    Area Count    White Count
1            Star Pattern             3                                 YES
2            Star Pattern             3                                               YES
3            Star Pattern             3                   HORIZONTAL
4            Star Pattern             3                   VERTICAL
5            Large right diagonal     1                   H & V                       YES
6            Large left diagonal      1                   H & V                       YES
7            Top right diagonal       1                   H & V                       YES
8            Top left diagonal        1                   H & V                       YES
9            Bottom right diagonal    1                   H & V                       YES
10           Bottom left diagonal     1                   H & V                       YES
11           Right diagonals          3                                 YES
12           Left diagonals           3                                 YES

3.3.1.2 Frame Tomography signature

The shot signatures described in the previous section can be used to locate all shot signatures in the database that closely match the query video. To identify the precise location of the query video in the shot, a local signature is necessary. This local signature can also be used to detect shot boundaries in order to speed up the shot identification process.


The frame tomography signature is extracted from the same tomography images (pre-composition images) used in shot signature generation. Since these tomography images are generated using a specific pattern, each shot and frame signature can easily be extracted as the analysis progresses through the video, requiring only one complete video analysis.

After retrieving the tomography lines for a frame, the frame tomography signature is obtained by dividing these lines into 4 segments and counting the edges within each of them. The result is a 24-dimensional, 24-byte signature.

Figure 16. Frame signature extraction
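A minimal sketch of this step, under the assumption that the six edge-detected tomography lines of a frame are available as binary (0/255) vectors: each line is split into four equal segments and the level changes per segment are counted, yielding the 24 signature elements.

#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

// Splits one binary (0/255) tomography line into 4 segments and counts the
// level changes inside each segment.
std::array<uint8_t, 4> segmentEdgeCounts(const std::vector<uint8_t>& line) {
    std::array<uint8_t, 4> counts = {0, 0, 0, 0};
    std::size_t segLen = line.size() / 4;
    for (std::size_t s = 0; s < 4; ++s) {
        std::size_t begin = s * segLen;
        std::size_t end = (s == 3) ? line.size() : begin + segLen;
        for (std::size_t x = begin + 1; x < end; ++x)
            if (line[x] != line[x - 1]) ++counts[s];
    }
    return counts;
}

// Frame signature: 6 lines x 4 segments = 24 one-byte elements.
std::array<uint8_t, 24> frameSignature(
        const std::array<std::vector<uint8_t>, 6>& lines) {
    std::array<uint8_t, 24> sig = {};
    for (std::size_t l = 0; l < 6; ++l) {
        std::array<uint8_t, 4> c = segmentEdgeCounts(lines[l]);
        for (std::size_t s = 0; s < 4; ++s) sig[l * 4 + s] = c[s];
    }
    return sig;
}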


3.3.1.3 Signature complexity

Generating the signatures for a video clip has relatively low complexity. Most of the procedures used to generate the signature are very simple, except for the Canny edge detection. On a 2.4 GHz Intel Core 2 PC, the system took about 65 milliseconds to generate a video signature for a 180 frame video clip. The complexity is independent of the video resolution, since the extracted tomography images do not depend on it. At 30 frames per second, the processor usage of the signature generation is negligible, and it can be implemented in a standard video player without sacrificing playback performance. The extra computational expense of generating the frame signature is also negligible.

Shot-level signatures (48 bytes) are generated for every second or less of video, depending on the shot change detection, and frame signatures (24 bytes) are generated for every frame. The size of the signatures is therefore 48 + 24*30 bytes per second (6.144 Kbps for a 30 fps video). Thus, the proposed approach produces a compact signature.

3.3.1.4 Additional tools used for signature generation

As described in section 3.3.1.1, a shot detection mechanism was developed to achieve a better correlation inside the signature, resulting in an enhanced level of uniqueness. This is possible because the spatio-temporal information inside each shot is consistent with the motion, texture and color present in that camera; taking into account that a shot is defined as a collection of consecutive images that belong to the same camera, the extracted signature will accurately represent the shot's spatio-temporal behavior.

Shot change detection

Among the basic methods for shot detection, pixel difference and histogram techniques are worth mentioning. Both of them rely on finding differences between frames to judge whether a shot boundary has been found.

Pixel difference

The pixel difference methods are one of the most straightforward approaches to find shot changes. They calculate a value which represents the overall change in pixel intensities in the frame [19].

D(t) = (1 / (M * N)) * Σ_x Σ_y | p(x, y, t) - p(x, y, t-1) |

Equation 8. Frame difference function (Pixel wise)

[20] makes use of the sum of absolute pixel intensity differences between two frames as the frame difference. [21] and [22], on the other hand, calculate the number of pixels that change their value by more than a predefined threshold.

In any case, the total sum of these pixel differences is compared against a scene change threshold to determine whether a shot change has been found.

The downside of these techniques is their sensitivity to noise and camera motion, as well as to lighting changes such as flashes [19].
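A direct, simplified reading of Equation 8 in C++; the frame layout and the threshold value are assumptions for illustration only.

#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <vector>

// Mean absolute luminance difference between two frames of equal size.
double frameDifference(const std::vector<uint8_t>& cur,
                       const std::vector<uint8_t>& prev) {
    long long sum = 0;
    for (std::size_t i = 0; i < cur.size(); ++i)
        sum += std::abs(static_cast<int>(cur[i]) - static_cast<int>(prev[i]));
    return static_cast<double>(sum) / static_cast<double>(cur.size());
}

// A purely hypothetical static threshold; exceeding it flags a candidate cut.
bool isCutCandidate(double diff) { return diff > 30.0; }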

Histogram

A histogram shows the distribution of pixel values in a frame. The most straightforward method for histogram comparison is the simple summation of the pairwise bin differences [19].

D(t) = Σ_i | H_t(i) - H_{t-1}(i) |

Equation 9. Summation of histogram differences

Examples of these can be found in [20] and [22]. Linear combinations of the bin-wise differences with different weights can also be used if some gray levels or colors are considered of greater importance [19].

Histogram techniques are very robust against noise and object movement. However, the histogram describes only the distribution of color or gray scale values; it does not take into account any spatial information of the image. In the same manner, small but visually significant image regions may not produce strong peaks in the histograms and hence can be neglected in the frame comparison [19].
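For comparison, the histogram criterion of Equation 9 can be sketched as follows, assuming 256-bin luminance histograms; bin weights and thresholds are omitted.

#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

using Histogram = std::array<uint32_t, 256>;  // 256-bin luminance histogram

Histogram buildHistogram(const std::vector<uint8_t>& frame) {
    Histogram h = {};                  // zero-initialised bins
    for (uint8_t p : frame) ++h[p];
    return h;
}

// Equation 9: sum of pairwise bin differences between consecutive frames.
long long histogramDifference(const Histogram& a, const Histogram& b) {
    long long sum = 0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        long long d = static_cast<long long>(a[i]) - static_cast<long long>(b[i]);
        sum += (d < 0) ? -d : d;
    }
    return sum;
}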

A third method was implemented, considering the tomography analysis for camera work presented in [11]. As mentioned in section 2.3, through the analysis of tomography images it is possible to locate shot changes; those changes are represented by a straight line that crosses the tomography image horizontally. From this analysis the crater method was conceived.

Crater

The crater method uses the frame signatures to find the frame that indicates a shot change. To do this, the Euclidean distance between the current frame signature and the previous one is measured, as well as between the signatures of the current and the following frame. For a current frame position to be declared a shot change, both measurements have to be above a certain threshold, with the first distance being larger than the second one.

This method proved to be fairly accurate, but when the video has been modified by a luminance change or new data appears, as is the case when an object emerges from behind another, the system detects false scene changes.

The name crater was derived from the waveform generated by the differential comparison between consecutive measured distances, which shows a large depression where the scene change is present.
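A simplified C++ sketch of the crater test, assuming 24-element frame signatures and a single hypothetical threshold; the empirically tuned thresholds and the additional checks of the actual implementation are not reproduced here.

#include <array>
#include <cmath>
#include <cstddef>
#include <cstdint>

using FrameSig = std::array<uint8_t, 24>;  // 24-element frame signature

double frameSigDistance(const FrameSig& a, const FrameSig& b) {
    double sum = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        double d = static_cast<double>(a[i]) - static_cast<double>(b[i]);
        sum += d * d;
    }
    return std::sqrt(sum);
}

// Crater test for the current frame: both neighbouring distances must exceed
// the threshold, and the distance to the previous frame must be the larger one.
bool isShotChange(const FrameSig& prev, const FrameSig& cur,
                  const FrameSig& next, double threshold) {
    double dPrev = frameSigDistance(prev, cur);
    double dNext = frameSigDistance(cur, next);
    return dPrev > threshold && dNext > threshold && dPrev > dNext;
}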


3.3.2 Second Video Signature version

After the MPEG contest's Evaluation of Responses stage had passed, a new target for the system was set. The goal was to explore and implement a real-life application, in this case a Media Usage Monitoring system, that considered the lessons learned from the MPEG competition.

For this second system version, the main signature extraction technique was kept intact and the work focused on upgrading and enhancing the performance of the different tools used to determine critical features. In order to achieve this, video coding techniques were adopted, considering that the technology involved has already solved specific feature extraction in a robust and dependable way.

3.3.2.1 Additional tools used for signature generation

Preprocessing tools have been an important element of the signatures' dependability and, considering the previous signature development, new techniques have been introduced into these tools to achieve that goal.

Motion Estimation based Scene Change Detector

For this solution, a new technique for scene change detection was developed. Taking into account previous methods [25], pixel difference scene change detection was selected as the main strategy; it is useful, but it needs some adjustments. When a video with a great amount of motion is analyzed by a typical shot change detector, several false detections appear, because the information inside is changing so fast that the differences between frames are bigger than any static threshold could anticipate. But if the comparison between frames is done against a motion compensated image generated from the previous frame, the differences are reduced significantly when the analyzed frames belong to the same shot. If the images belong to different shots, motion compensation is not going to be able to do much for the previous image and the resulting frame difference will be large in any case. For that purpose a full search motion estimator is used, configured with an 8x8 block size, a search range of 32 pixels and quarter-pixel accuracy. The specified motion range is kept relatively small so as not to severely increase the calculation complexity of the whole system.
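The idea can be sketched as follows: for each 8x8 block of the current frame, a full search over a window of the previous frame keeps the smallest SAD, and the average of these best residuals replaces the plain frame difference. This is an integer-pel simplification written only for illustration; the implemented detector uses a 32-pixel range with quarter-pixel accuracy.

#include <climits>
#include <cstdint>
#include <cstdlib>
#include <vector>

// SAD between the 8x8 block of 'cur' at (bx, by) and the block of 'prev' at (px, py).
static long sad8x8(const std::vector<uint8_t>& cur, const std::vector<uint8_t>& prev,
                   int width, int bx, int by, int px, int py) {
    long s = 0;
    for (int y = 0; y < 8; ++y)
        for (int x = 0; x < 8; ++x)
            s += std::abs(static_cast<int>(cur[(by + y) * width + bx + x]) -
                          static_cast<int>(prev[(py + y) * width + px + x]));
    return s;
}

// Average motion-compensated residual per pixel; assumes frames of at least 8x8.
double motionCompensatedDifference(const std::vector<uint8_t>& cur,
                                   const std::vector<uint8_t>& prev,
                                   int width, int height, int range) {
    long long total = 0;
    int blocks = 0;
    for (int by = 0; by + 8 <= height; by += 8) {
        for (int bx = 0; bx + 8 <= width; bx += 8) {
            long best = LONG_MAX;
            for (int dy = -range; dy <= range; ++dy) {
                for (int dx = -range; dx <= range; ++dx) {
                    int px = bx + dx, py = by + dy;
                    if (px < 0 || py < 0 || px + 8 > width || py + 8 > height)
                        continue;                 // stay inside the previous frame
                    long s = sad8x8(cur, prev, width, bx, by, px, py);
                    if (s < best) best = s;
                }
            }
            total += best;
            ++blocks;
        }
    }
    return static_cast<double>(total) / (blocks * 64.0);
}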

Row sum, Column sum technique

Now, a second problem arises for pixel difference and histogram based scene change detectors: how to overcome a flash or sudden brightness changes. One feature that is almost invariant with the amount of light is the texture. To measure texture a simple procedure has been used: based on the method presented in [24], a variation of the texture energy measurement was defined as RsCs, or Row sum, Column sum. For this approach the image is divided into blocks of 4x4 pixels, and an average of the differences between the rows and columns of the same block is computed. Then, by calculating the magnitude of the resulting vector, using the Rs value and the Cs value as vector coordinates, a good enough texture measurement is obtained.

Rs = (1 / (m * n)) * Σ_x Σ_y | p(x, y) - p(x, y-1) |

Equation 10. Row sum equation

In Equation 10, m and n are the number of columns and rows in the image and p is the pixel value in the Y component of the image, located at position (x, y).

Figure 17. Resulting image after Row sum procedure

Cs = (1 / (m * n)) * Σ_x Σ_y | p(x, y) - p(x-1, y) |

Equation 11. Column sum equation


Figure 18. Resulting image after Column sum procedure

TE = sqrt( Rs^2 + Cs^2 )

Equation 12. Texture energy

Figure 19. Resulting image after texture energy measurement
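A sketch of the RsCs measurement for one 4x4 block, following Equations 10-12 as reconstructed above; the normalisation constant and border handling are assumptions.

#include <cmath>
#include <cstdint>
#include <cstdlib>
#include <vector>

// Texture energy of the 4x4 block whose top-left corner is (x0, y0).
// Rs averages row-to-row differences, Cs averages column-to-column differences.
double blockTextureEnergy(const std::vector<uint8_t>& img, int width,
                          int x0, int y0) {
    double rs = 0.0, cs = 0.0;
    for (int y = 0; y < 4; ++y) {
        for (int x = 0; x < 4; ++x) {
            int p = img[(y0 + y) * width + x0 + x];
            if (y > 0)
                rs += std::abs(p - static_cast<int>(img[(y0 + y - 1) * width + x0 + x]));
            if (x > 0)
                cs += std::abs(p - static_cast<int>(img[(y0 + y) * width + x0 + x - 1]));
        }
    }
    rs /= 12.0;                              // 12 row differences per block
    cs /= 12.0;                              // 12 column differences per block
    return std::sqrt(rs * rs + cs * cs);     // Equation 12: vector magnitude
}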

This procedure is done for both the motion compensated and the current image. To determine the existence of a shot change, the overall frame difference value obtained with Equation 8 between the current frame and the motion compensated previous frame is compared against a predefined threshold, which was obtained empirically. If the calculated value is greater than the threshold, the frame is marked as a candidate for a shot change.

If a candidate is found, a second check is done, but this time the frame difference is computed between the texture energy images of the motion compensated and current frames. If the resulting value is also greater than an empirically found threshold, then a shot change is confirmed. This last procedure is very helpful for filtering out several types of editing and special effects, like flashes and fades.


Chapter 4 IMPLEMENTATION

4.1 Introduction

The most relevant information about system implementation aspects is included in this chapter. The software used and detailed explanations for the implementation of the different parts of the algorithm are also provided.

The algorithm implementation was done in C and C++; some extra functionality was also written in Assembly language. As mentioned before, the Intel IPP 6.0 and FFMPEG libraries were also used for the development of this solution.

The solution consists of an executable which receives the video file as input and then, based on the file name, generates the main shot/segment signature file and the frame signature file.


Figure 20. Signature extraction process and external tools used

4.2 Signature extraction

When the application starts to run, the first task is the extraction of the video details, such as frame size, total number of frames and frame rate, among others. To do that, a function called initializeFFmpeg, built on the FFMPEG libraries, receives all the empty structures and the video file and, from the decoding of the video details, fills all the required variables.

With the variables initialized, the next step is to allocate memory for the remaining steps. These allocations are in general trivial, but for the IPP functions that use specialized MMX and SSE instructions the following calls are necessary:

• ippiMalloc_8u_C1: Returns an unsigned character array with an extended size, greater than the original image. This memory space receives the unmodified image to be used as source; the call also returns the step size of the new (padded) width.

• ippiMalloc_16s_C1: Returns a signed short array of the same size as the previous allocation. The purpose of this structure is to allow calculations at pixel level without running into problems by going out of the unsigned char range.

With memory allocated, the system goes to a while loop, which checks if all frames in the video have been analyzed through the FFMPEG function av_read_frame. This function locates the nearest reference frames to the current frame and prepares the environment in case decoding is needed.

While there are frames to be analyzed, the system grabs the current frame by using the decode function, also from FFMPEG, which returns a pointer to the frame. Once the frame is ready, the application determines whether the image needs resizing. For that case there are two functions, upsample and downsample.


For upsampling, a six-tap filter similar to the one used by H.264 [28] is applied. For this particular method the system first upscales along the horizontal axis and then along the vertical axis.

p'(x, y) = ( p(x-2, y) - 5*p(x-1, y) + 20*p(x, y) + 20*p(x+1, y) - 5*p(x+2, y) + p(x+3, y) ) / 32

Equation 13. Six tap filter for pixel interpolation

In Equation 13 the interpolation filter is presented, where x and y denote the coordinates of the original reference pixel.

For downsampling, a high contrast five-tap filter is used for image decimation.

p'(x, y) = ( -3*p(x-3, y) + 0*p(x-2, y) + 9*p(x-1, y) + 16*p(x, y) + 9*p(x+1, y) + 0*p(x+2, y) - 3*p(x+3, y) ) / 32

Equation 14. High contrast five tap decimation filter

It is important to notice that in the decimation filter the second reference pixel to both sides of the center has been multiplied by 0, which means it is not used.
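For illustration, a horizontal pass of the six-tap interpolation of Equation 13 might look like the sketch below; the border replication, the rounding offset and the clamping are assumptions, and the real implementation interleaves the interpolated samples with the original ones.

#include <algorithm>
#include <cstdint>
#include <vector>

// Interpolates the half-pel sample between positions x and x+1 of one row
// of original luminance samples, using the six-tap kernel of Equation 13.
uint8_t sixTapHalfPel(const std::vector<uint8_t>& row, int x) {
    auto at = [&](int i) {                     // replicate pixels at the borders
        i = std::max(0, std::min(i, static_cast<int>(row.size()) - 1));
        return static_cast<int>(row[i]);
    };
    int v = at(x - 2) - 5 * at(x - 1) + 20 * at(x) +
            20 * at(x + 1) - 5 * at(x + 2) + at(x + 3);
    v = (v + 16) / 32;                         // divide by 32 with rounding
    return static_cast<uint8_t>(std::max(0, std::min(255, v)));
}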

After the image has been resized or normalized to 360 x 240, the shot change procedure, outlined in Figure 22, is called; it returns a flag indicating whether the current frame is a shot change. If it is not a shot change, or the tomography image covers less than one second of video, the application extracts the sample lines from the current image and appends each line to the corresponding tomography image.

To extract the lines, a function called pattern was created. The main purpose of this function is to generate the pixel positions for each line in the star pattern; then, when a line is being extracted, the positions have already been calculated and the values can be copied quickly and accurately.
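The role of the pattern function can be illustrated with a sketch that precomputes, once, the (x, y) positions of the diagonal joining the top-left and bottom-right corners for a 360x240 frame; the exact sampling of the six star-pattern lines in the implementation may differ, so this is an assumption-laden example.

#include <utility>
#include <vector>

// Precomputes the pixel positions of the diagonal joining the top-left and
// bottom-right corners of a width x height frame, one sample per column.
// Requires width >= 2.
std::vector<std::pair<int, int>> diagonalPositions(int width, int height) {
    std::vector<std::pair<int, int>> pos;
    pos.reserve(width);
    for (int x = 0; x < width; ++x) {
        int y = (x * (height - 1)) / (width - 1);
        pos.push_back(std::make_pair(x, y));
    }
    return pos;
}

// Extracting a sample line then reduces to indexed reads, e.g.:
// for each (x, y) in positions: line.push_back(frame[y * width + x]);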

If it is a shot change, or the appended tomography image corresponds to one second of video, then the tomography images are ready to be processed by the Canny edge detector. To run the Canny function, a Sobel operator [29] must first be run in the vertical and horizontal directions, and the resulting processed images are then fed into the Canny edge detector. For the Sobel operator, the IPP functions are ippiFilterSobelNegVertBorder and ippiFilterSobelHorizBorder.

The next step is to run the Canny function. To do that the function ippiCanny_16s8u_C1R is called, which receives as input the Sobel images and the original image.

If the reason for the canny processing was that a shot change was found, then the shot change flag is activated, which is then used for index generation and header information in the signature file.


After the information has been collected, it is sent to the file writer, where the information is binarized.

The shot signature file has the following structure:

Version: 1 char.

Frame rate: 1 double.

Number of shot signatures in the file: 1 unsigned integer.

Number of hard cuts: 1 integer.

List of shot positions for the hard cuts: 1 integer times the number of hard cuts detected.

Free space (not used): 2 unsigned chars

The following part repeats for the total amount of shots in the signature.

Frame start: 1 Integer.

Shot length: 1 Integer.

48 elements in the signature: 1 unsigned short each.

The frame signature has the following structure:

24 elements for each frame: 1 unsigned char.
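The layout above can be summarised with the following C++ structures. They are a descriptive aid only; alignment, padding and the exact on-disk integer widths are assumptions derived from the field list.

#include <cstdint>
#include <vector>

// Header of the shot signature file, mirroring the field list above.
struct ShotFileHeader {
    char version;                       // Version: 1 char
    double frameRate;                   // Frame rate: 1 double
    uint32_t numShotSignatures;         // Number of shot signatures in the file
    int32_t numHardCuts;                // Number of hard cuts
    std::vector<int32_t> hardCutShots;  // Shot positions of the hard cuts
    uint8_t reserved[2];                // Free space (not used)
};

// One record per shot, repeated for every shot in the file.
struct ShotRecord {
    int32_t frameStart;                 // Frame start
    int32_t shotLength;                 // Shot length
    uint16_t elements[48];              // 48 signature elements
};

// Frame signature stream: 24 unsigned chars per frame.
struct FrameRecord {
    uint8_t elements[24];
};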


Figure 21. Signature generation Flow chart

4.2.1 FFMPEG libraries

FFMPEG [26] is free, open source software for video and audio manipulation; through it, it is possible to decode, encode and grab audio and video files and/or streams. It is licensed under the LGPL and GPL. The main purpose of using this software and set of libraries is to allow the system to extract signatures from any type of video, regardless of the format.

FFMPEG libraries used for this development are:

• Libavutil – contains general purpose functions and data structures, such as a random number generator and mathematical routines.

• Libavcodec - contains decoders and encoders for audio and video codecs.

• Libavformat - contains demuxers and muxers for multimedia container formats.

4.2.2 Intel IPP 6.0 Canny Edge detector

The Canny edge procedure is the most computationally expensive function in the whole solution; for that reason the Intel Integrated Performance Primitives were used, whose OpenCV optimized functions include Canny edge detection. These functions make use of all the special features available on Intel processors to increase performance where possible.


4.3 Shot Change Detector flow chart.

Figure 22. Shot Change Detector flow chart


Chapter 5 EXPERIMENTS AND RESULTS

5.1 Introduction

In this chapter the experiments that allow the evaluation of the proposed signatures are described, together with their results.

The signatures were evaluated under two specific scenarios: the first one, a set of MPEG tests that assess their performance with respect to two characteristics, namely the signature's uniqueness and its robustness.

The second scenario involves the signature’s performance evaluation under a plausible use case: an ad (commercial) audit.

5.2 MPEG Experiments

In MPEG's "Updated Call for Proposals on Video Signature Tools" [12] the evaluation mechanism for video signatures is well defined for a variety of signature characteristics. For this work in particular, the signature's uniqueness and its robustness are of interest.


5.2.1 Data set

A database of 1883 original clips, with varied content such as sports, news, film and soap operas, was released for the development and testing of the proposals. All these videos were collected, and copyright permissions were received, in the context of MPEG-7 development. This database can be obtained by request from the MPEG group.

The data set is divided in the following types:

• Captured: VCR & Camera recaptured clips.

• Dummy Clip: Clips used for creating the partial matching data.

• Modified: Contains framerate reduction data for 4 and 5 fps.

• Set A: The video files used to create the independent data.

• Masks: Masks used for the logo overlay modification.

After the original content was received and checked for errors, the generation of the unmodified query clips and the transformed clip videos started.

5.2.1.1 Software

To generate the different query clips, the following software was used by all the participants for the Call for Proposals:


• Procoder 3.

• Cyberlink Video/SP Decoder (PDVD7) or (PDVD8).

• AC-3 audio decoder included in K-Lite codec megapack.

• AviSynth 2.5.

• PrepareIndependenceTest – A set of programs and scripts developed by the MPEG group for the automatic clip generation.

5.2.2 Procedure for testing

To carry out the test, three Dell Optiplex GX745 machines, each with a 2.66 GHz Intel Core 2 6700 CPU and 3 GB of RAM, were used. The signatures from the original clips were distributed evenly across the machines, while the complete set of query clip signatures was copied fully onto all the machines, allowing the parallel processing of the whole database of query clips.

5.2.3 Uniqueness test

The independence tests are designed to verify that a video signature can uniquely identify videos. Independence is verified by comparing all possible clip pairs in a database of videos. If a match is detected between an unrelated pair it’s marked as a false positive.


Two different matching scenarios were defined: direct content matching and partial content matching. Direct content matching is the case in which the whole segment of the query clip matches a part of the original clip; the algorithm is required to output the start point of the matched segment in the original clip.

In partial content matching only a part of the query clip matches a certain segment of the original clip, meaning that a query clip can contain additional content not present in the original clip; the algorithm is given only the minimum duration of the segment to be matched and should output the start point and the end point of the matched segment in both the original clip and the query clip.

5.2.3.1 Experiment Setup

For this test's data creation, the specific instructions defined in [14] had to be followed: a set of AviSynth scripts was provided and, with the help of Procoder, a database of 1883 three-minute video clips of Internet content was processed, dividing each clip into six 30-second segments. In addition, three query videos were created using the first 2, 5, and 10 seconds of each 30-second segment. Three additional queries were defined by inserting these 2-, 5-, and 10-second segments into a 30-second video that is not in the database. For the experiments reported here, 1640 original clips were randomly selected out of the 1883-clip original database to constitute a new subset database, in order to reduce preparation time. From the 2 s, 5 s and 10 s query clips, 54 clips were selected. The total number of clip pairs compared for this test case is 1640 × 54 = 88'560.

Figure 23. Query construction for independence tests

5.2.3.2 Algorithm

The steps in the algorithm of video identification for this experiment were the following:

1. Generate shot and frame signatures and shot patterns for the videos in the

database

2. For each query, generate shot pattern, video signatures, and frame signatures

3. Using shot patterns and video signatures, search the database for candidate videos

4. For each candidate shot, use frame signatures to localize the frame matches.

Accurate comparison is accomplished by using the Euclidean distance between the two given signatures. This approach allows fast queries, as the shot patterns and shot level


signatures can localize video matches. The frame signature can then be used to identify

and match the frames in the query using the average distance and standard deviation of the frame set.
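
As an illustration, a minimal C++ sketch of this comparison step is given below; the vector layout, the container types, and the threshold value are illustrative assumptions rather than the exact data structures of the implementation.

    #include <cmath>
    #include <cstddef>
    #include <vector>

    // Euclidean distance between two signature vectors of equal length.
    double signatureDistance(const std::vector<double>& a, const std::vector<double>& b)
    {
        double sum = 0.0;
        for (std::size_t i = 0; i < a.size() && i < b.size(); ++i) {
            const double d = a[i] - b[i];
            sum += d * d;
        }
        return std::sqrt(sum);
    }

    // Two signatures are considered a match when their distance falls below an
    // empirically chosen threshold (the default value here is only a placeholder).
    bool signaturesMatch(const std::vector<double>& query,
                         const std::vector<double>& candidate,
                         double threshold = 10.0)
    {
        return signatureDistance(query, candidate) < threshold;
    }

The same distance is applied hierarchically: first to shot patterns and shot-level signatures to narrow the candidate set, then to frame signatures to localize the match.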

5.2.3.3 Metrics and Results

The performance of the algorithm was measured following the metrics required by the Call for Proposals. The key performance indicators are:

1. Recall

2. Precision

3. Prediction precision: for the clips correctly identified as dependent, the prediction precision is the ratio of clips localized within one second of the ground truth to the total number of clips.

The elements in the table are:

1. Total Instances: Total number of signature pairs compared.

2. Independent Clip pairs: Total number of signature pairs that are not related.

3. Identified as Independent: Number of pairs that the system determined as

independent.

4. True Positive: From those pairs identified as independent, it indicates how many

are truly independent.

5. False Positive: From those pairs identified as independent, it indicates how many


were falsely identified as independent.

6. False Negative: From the remaining pairs that were not identified as independent,

it indicates how many were falsely identified as dependent.

7. True Negative: From the remaining pairs that were not identified as independent,

it indicates how many were correctly identified as dependent.

8. Identified as dependent: Total number of pairs identified as related, i.e., the compared clips are associated.

Table 3. Uniqueness Test Performance Summary

Description                     2 Sec      5 Sec      10 Sec
Total Instances                 88'560     88'560     88'560
Independent clip pairs          88'506     88'506     88'506
Identified as Independent       87'716     87'521     85'146
True Positive (tp)              87'713     87'518     85'144
False Positive (fp)             3          3          2
False Negative (fn)             739        934        3'308
True Negative (tn)              51         51         52
Identified as dependent         790        985        3'360
Recall: tp/(tp+fn)              99.16%     98.94%     96.26%
Precision: tp/(tp+fp)           100%       100%       100%
Prediction Precision            94.44%     94.44%     96.30%
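
For example, the reported recall and precision for the 2-second queries follow directly from the counts in the table:

    Recall    = tp / (tp + fn) = 87'713 / (87'713 + 739) = 87'713 / 88'452 ≈ 99.16%
    Precision = tp / (tp + fp) = 87'713 / (87'713 + 3)   = 87'713 / 87'716 ≈ 100%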

Additionally, the algorithm was evaluated in terms of its speed, specifically the time required for a query clip to complete steps 2 through 4 of the algorithm.

Table 4. Uniqueness Test Time complexity Performance

Step   Description                                   Value
2      Shot pattern and signature generation         156 ms
3      Database search                               15 ms
4      Frame level comparison                        185 ms
       Final result for database narrowing search    13 ms

5.2.3.4 Performance Improvement

To determine whether the second-generation signature design was able to outperform the original one, a scenario similar to the one presented at the beginning of this section was used, with the following modifications:

• The shot detection mechanism was improved as specified in section 0.0.0.0.

• The master database consisted of 350 videos of 3 minutes each, and the query database had 100 videos; each query video belonged to one video from the master database. An exhaustive search was then performed by comparing each query signature against the master database.

• A comparison threshold determines whether there is a match and, if so, at which position. A tolerance of ±1 second is allowed for the correct time location of the query sequence within its matching master video (this decision is sketched after the list).
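
A minimal C++ sketch of this exhaustive search and the one-second tolerance check is shown below; the Entry structure, the flat-vector database layout, and the names used are illustrative assumptions, and the distance is the same Euclidean measure described in section 5.2.3.2.

    #include <cmath>
    #include <cstddef>
    #include <vector>

    // Hypothetical entry of the master database: a video identifier, the time
    // position (in seconds) of the shot within that video, and its signature.
    struct Entry {
        int videoId;
        double timeSec;
        std::vector<double> signature;
    };

    // Exhaustive search: find the database entry closest to the query signature.
    // A match is declared only if the smallest distance falls below the threshold.
    const Entry* bestMatch(const std::vector<double>& query,
                           const std::vector<Entry>& master,
                           double threshold)
    {
        const Entry* best = nullptr;
        double bestDist = 0.0;
        for (const Entry& e : master) {
            double sum = 0.0;  // Euclidean distance, as in section 5.2.3.2
            for (std::size_t i = 0; i < query.size() && i < e.signature.size(); ++i) {
                const double d = query[i] - e.signature[i];
                sum += d * d;
            }
            const double dist = std::sqrt(sum);
            if (best == nullptr || dist < bestDist) {
                bestDist = dist;
                best = &e;
            }
        }
        return (best != nullptr && bestDist < threshold) ? best : nullptr;
    }

    // A declared match is counted as correctly located when the reported time is
    // within +/- 1 second of the ground-truth position.
    bool correctlyLocated(double foundSec, double truthSec)
    {
        return std::fabs(foundSec - truthSec) <= 1.0;
    }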

The performance results can be viewed in Table 5.

Table 5. Modified Uniqueness Test Performance Summary

Description                     2 Sec      5 Sec      10 Sec
Total instances                 35'000     35'000     35'000
Independent clip pairs          34'900     34'900     34'900
Identified as Independent       34'806     34'750     34'422
True Positive (tp)              34'805     34'748     34'421
False Positive (fp)             1          2          1
False Negative (fn)             5          52         379
True Negative (tn)              99         98         99
Identified as dependent         94         150        478
Recall: tp/(tp+fn)              99.99%     99.85%     98.91%
Precision: tp/(tp+fp)           100%       100%       100%
Prediction Precision            99.60%     98.68%     99.79%

The performance comparison with the original signature choice is summarized in Table

6.

Table 6. Uniqueness test Performance Comparison

                          2 Sec      5 Sec      10 Sec
Recall
  Original                99.16%     98.94%     96.26%
  New                     99.99%     99.85%     98.91%
Prediction Precision
  Original                94.44%     94.44%     96.30%
  New                     99.60%     98.68%     99.79%

5.2.3.5 Summary

The results show that the proposed signatures exhibit the strong independence required of

video signatures, as specified by the independence criteria defined in the MPEG Video

Signatures CFP. The algorithm has low complexity and can be integrated into real-time

copy detection systems.

An additional performance gain can be achieved with the implementation of the second version of the signature generation, yielding an average improvement of 1.46% in recall and 4.29% in prediction precision.


5.2.4 Robustness test

In addition to uniqueness, [12] requires the signatures to be robust to all common editing operations. These include:

• Text/logo overlay

• Severe compression

• Resolution reduction

• Frame-rate reduction

• Capturing on camera

• Analog VCR recording & recapturing

• Color to monochrome conversion

• Brightness change

• Interlaced / Progressive conversion

The modifications, and the levels at which each one is applied in the test, are detailed in Table 7.

Table 7. Video Modifications and Levels for Robustness Assessment

Modification                                        Coding    Heavy      Medium     Light
Text/logo overlay *                                 MPEG-2    30%        20%        10%
Severe compression (at CIF resolution)              AVC       64 kbps    256 kbps   512 kbps
Resolution reduction (from SD)                      MPEG-2    N/A        QCIF       CIF
Frame-rate reduction (from 30/25 fps)               AVC       4 fps♣     5 fps      15 fps
Capturing on camera (at SD resolution) **           MPEG-2    10%        5%         0%
Analog VCR recording & recapturing
  (100% of image captured) ***                      MPEG-2    3 times    2 times    1 time
Color to monochrome conversion                      MPEG-2    N/A        N/A        I = 0.299R + 0.587G + 0.114B
Brightness change ▼                                 MPEG-2    +36        -18        +9
Interlaced / Progressive conversion                 MPEG-2    N/A        N/A        PIP, IP

*   Percentage of text/logo area.
**  Percentage of extra background area.
*** Number of times of analog->digital capture (e.g. 3 times = d->a->d->a->d->a->d).
▼   The values are clamped to the range [0-255].
♣   4 fps is chosen because it involves non-integral rate conversion and is therefore significantly harder than 5 fps.

5.2.4.1 Experiment Setup

In the robustness test, 545 video clips representing various types of content (film, news, documentary, cartoons, sport, home video, etc.) were selected. The duration of each clip is 3 minutes. From each clip, 3 segments of duration 2 seconds, 5 seconds and 10 seconds are selected. Each of these segments is then combined with other material to form a total of 3 new combined segments of 30 seconds duration. The format (image resolution, frame rate, interlaced/progressive) of the combined segments is determined by the format of the selected segment (i.e. that of duration 2, 5 or 10 seconds). As a result, 6 new short clips of durations 2 seconds, 5 seconds, 10 seconds, and 3×30 seconds are derived, each corresponding to the 6 query scenarios shown in Figure 23. Each of these short clips is subjected to the modifications listed in Table 7 to create query clips.


5.2.4.2 Algorithm

The same procedure used for the Uniqueness test was applied in the Robustness test.

5.2.4.3 Metrics and Results

The same metrics used for the Uniqueness test were used here; the results obtained are presented below.

[Chart: Text/Logo overlay; success rate (%) by test type (Direct2-Partial10) for the Heavy, Medium, and Light levels]

Figure 24. Success ratios for Text/Logo overlay test


[Chart: Severe compression; success rate (%) by test type for the Heavy, Medium, and Light levels]

Figure 25. Success ratios for Severe compression test.

[Chart: Resolution reduction; success rate (%) by test type for the Medium and Light levels]

Figure 26. Success ratios for Resolution reduction test


[Chart: Frame rate reduction; success rate (%) by test type for the Heavy, Medium, and Light levels]

Figure 27. Success ratios for Frame rate reduction test

[Chart: Capturing on camera; success rate (%) by test type for the Heavy, Medium, and Light levels]

Figure 28. Success ratios for Capturing on Camera test


[Chart: Analog VCR recording; success rate (%) by test type for the Heavy, Medium, and Light levels]

Figure 29. Success ratios for Analog VCR Recording test

[Chart: Color to monochrome conversion; success rate (%) by test type for the Light level]

Figure 30. Success ratios for Color to monochrome conversion test


[Chart: Brightness change; success rate (%) by test type for the Heavy, Medium, and Light levels]

Figure 31. Success ratios for Brightness change test

[Chart: Interlaced / Progressive conversion; success rate (%) by test type for the Light level]

Figure 32. Success ratios for Interlaced / Progressive conversion test


5.2.4.4 Mitsubishi and NEC joint proposal result comparison

As part of the performance analysis of the system, the results were compared with those obtained by the other participants in the CFP, in particular with the results from the group that won the competition. From Figure 33 to Figure 41, a comparison between both systems shows that the proposed solution using tomography-based signatures is close to the performance obtained by the Mitsubishi-NEC proposal. Most scenarios are relatively close; some others, such as severe compression, are not optimal, but for text/logo overlay and capturing on camera the improvement is notable.

[Chart: Text/Logo overlay; success rate (%) by test type, FAU vs. N & M]

Figure 33. Result comparison for Text/Logo overlay test.


[Chart: Severe compression; success rate (%) by test type, FAU vs. N & M]

Figure 34. Result comparison for severe compression test.

[Chart: Resolution reduction; success rate (%) by test type, FAU vs. N & M]

Figure 35. Result comparison for Resolution reduction test.


[Chart: Frame rate reduction; success rate (%) by test type, FAU vs. N & M]

Figure 36. Result comparison for Frame rate reduction test.

[Chart: Capturing on camera; success rate (%) by test type, FAU vs. N & M]

Figure 37. Result comparison for Capturing on Camera test.


[Chart: Analog VCR recording; success rate (%) by test type, FAU vs. N & M]

Figure 38. Result comparison for Analog VCR Recording test.

[Chart: Color to monochrome conversion; success rate (%) by test type, FAU vs. N & M]

Figure 39. Result comparison for Color to monochrome conversion test.


[Chart: Brightness change; success rate (%) by test type, FAU vs. N & M]

Figure 40. Result comparison for Brightness change test.

[Chart: Interlaced / Progressive conversion; success rate (%) by test type, FAU vs. N & M]

Figure 41. Result comparison for Interlaced / Progressive conversion test


5.2.4.5 Summary

In the robustness test it is clear that the system lacks some strength against frame rate reduction; that is why it does not perform as well in the severe compression and frame rate reduction tests. On the text/logo overlay test, however, where information is hidden by a mask, the system is able to cope because it captures more spread-out information and does not depend on small areas that are very susceptible to this type of transformation.

5.3 Content Auditing

The practical application of this system to a current industry challenge is analyzed in the field of ad auditing.

5.3.1 Experiment setup

During this test, 15 different advertisement videos were extracted from an HD 24-hour video recording of the FOX channel broadcast on May 23. Signature generation was performed on the original video and on the 15 commercials, followed by the matching procedure.


5.3.2 Algorithm

For the comparison procedure, the Euclidean distance was measured between the first shot signature of the selected commercial and each shot detected after a hard cut in the 24-hour video. Each commercial transition is very well defined, and a very clear hard cut is detected before the beginning of the advertisement video; this makes it possible to reduce the number of comparisons and increase the speed of the proposed solution.

5.3.3 Results

The results of this content audit were extracted for two signature types:

• Type 1: the count of transitions (black to white) along 16 specific positions of the tomography shot image.

• Type 2: extracted by dividing the image into 16 blocks (a 4-by-4 grid) and then counting, for each block, the total number of horizontal transitions within the whole block (a sketch of this extraction follows the list).


Table 8. Content audit: Signature comparison results for signatures of Type 1.

Commercial index    # of appearances    # of correct detections    # of false detections
Commercial 1        4                   4                          0
Commercial 2        1                   1                          0
Commercial 3        1                   1                          0
Commercial 4        1                   1                          0
Commercial 5        2                   1                          1
Commercial 6        1                   1                          0
Commercial 8        1                   1                          0
Commercial 9        1                   1                          0
Commercial 13       5                   5                          0
Commercial 14       1                   1                          0
Commercial 15       1                   1                          0
Commercial 16       1                   1                          0
Commercial 17       7                   7                          0
Commercial 18       1                   1                          0
Commercial 20       1                   0                          1

Table 9. Content audit: Signature comparison results for signatures of Type 2.

Commercial index    # of appearances    # of correct detections    # of false detections
Commercial 1        4                   4                          0
Commercial 2        1                   1                          0
Commercial 3        1                   1                          0
Commercial 4        2                   2                          0
Commercial 5        1                   0                          1
Commercial 6        1                   1                          0
Commercial 8        1                   1                          0
Commercial 9        1                   1                          0
Commercial 13       5                   5                          0
Commercial 14       1                   1                          0
Commercial 15       1                   1                          0
Commercial 16       1                   1                          0
Commercial 17       7                   6                          1
Commercial 18       1                   1                          0
Commercial 20       1                   1                          0

During the procedure, the average time per shot comparison was measured at 28 µs.


5.3.4 Summary

As seen in Table 8 and Table 9, this method proved to be well suited for auditing advertisement content in real time. This procedure could be extended to monitor diverse broadcast content such as movies, music videos, and other prerecorded material.

Because the database for this type of signature is very small (5 kB per commercial), it can also be integrated into set-top boxes and updated as needed without using large amounts of memory inside the system. The comparison method can be optimized further by measuring the duration of each detected shot and comparing it with the duration stored in the database; if the durations differ too much, the shot can be discarded immediately.
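
A sketch of such a duration pre-filter is given below; the half-second tolerance is only an assumed example value, not a measured parameter of the system.

    #include <cmath>

    // Skip the signature comparison entirely when the duration of the detected
    // shot differs too much from the duration of the reference shot stored in
    // the database (the tolerance value is an assumed example).
    bool worthComparing(double detectedShotSec, double referenceShotSec,
                        double toleranceSec = 0.5)
    {
        return std::fabs(detectedShotSec - referenceShotSec) <= toleranceSec;
    }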


Chapter 6 CONCLUSIONS AND FUTURE WORK

6.1 Conclusions

This thesis presented a system for video identification based on enhanced tomography signatures, exemplified with a real-life application of advertisement auditing.

• The proposed system is capable of:

o Determining whether a video sequence of at least 2 seconds in duration belongs to another video, with a one-second tolerance in the time position.

o Remaining robust against several types of transformations.

o Generating a small signature that is portable and easy to integrate into hardware systems.

• Modular software design, which allows the exploration of new, simpler, and faster processes and algorithms.

• The next-generation signature incorporates several techniques used in video coding, making it even more robust; among these are:

o Shot detection, for spatiotemporal coherence inside shot/segment signatures.

o Motion estimation, for superior robustness in temporal segmentation.

• This implementation was a response to MPEG's call for video signature technologies and participated in the MPEG video signature competition.

o The system proved to be a viable solution for video signatures, meeting the requirements of the uniqueness test and almost all of the robustness requirements.

o The system was able to outperform the winner of the contest in categories such as text/logo overlay and capturing on camera.

6.2 Future Work

• During the development of this solution, roadblocks were encountered, such as the massive amount of collected data, which is hard to manage and control. The implementation of a database that collects and manages the different types of signatures and stores the calculations and generated reports is necessary.

• The implementation of a frame rate normalization technique to make the system

robust to severe compression codecs or frame rate changes.

• Because of the matrix-like type of data being managed by the solution, a CUDA

implementation would be worth exploring.

• The use of MMX and/or SSE2 to SSE4 instructions to make the system even faster, bringing the solution into the era of parallel computation.

• Explore the possibility of migrating the system to a cloud environment, where signatures could be grouped by type of content and then distributed across the cloud so that the performance and reporting systems can be enhanced.


BIBLIOGRAPHY

[1] YouTube fact sheet, online information at http://www.youtube.com/t/fact_sheet,

last access on 06/16/2010.

[2] H. Kalva, G. Leon, S. Yellamraju and B. Furht, “Video Identification Using Video Tomography”, Florida Atlantic University, April 2008.

[3] G. Leon, “Content Identification using video tomography”, M.Sc. Thesis, College

of Engineering and Computer Science, Florida Atlantic University, August 2008.

[4] T. Kalker, “Considerations on Watermarking Security”, IEEE Fourth Workshop

on Multimedia Signal Processing, 2001.

[5] G. Doerr and J.L. Dugelay, “A guide tour of video watermarking,” Signal

Processing: Image Communication, Volume 18, Issue 4, April 2003, Pages 263-

282.

[6] X. Fang, Q. Sun, and Q. Tian, “Content-based video identification: a survey,”

Proceedings of the Information Technology: Research and Education, 2003.

ITRE2003. pp. 50-54.

[7] J. Law-To, L. Chen, A. Joly, I. Laptev, O. Buisson, V. Gouet-Brunet, N.

Boujemaa, and F. Stentiford, “Video copy detection: a comparative study,” In

Proceedings of the 6th ACM international Conference on Image and Video

Retrieval, CIVR '07, pp. 371-378.

[8] Y. Yan, B.C.Ooi, and A. Zhou, “Continuous Content-Based Copy Detection over

Streaming Videos,” 24th IEEE International Conference on Data Engineering

(ICDE), 2008.

[9] N. Guil, J.M. Gonzalez-Linares, J.R. Cozar, and E.L. Zapata, “A Clustering

Technique for Video Copy Detection,” Pattern Recognition and Image Analysis,

LNCS, Vol. 4477/2007, pp. 451-458.

[10] G. Singh, M. Puri, J. Lubin, and H. Sawhney, “Content-Based Matching of Videos Using Local Spatio-temporal Fingerprints,” ACCV

2007, LNCS vol. 4844/2007, Nov. 2007, pp. 414-423.

[11] A. Akutsu and Y. Tonomura, “Video tomography: An efficient method for camera

work extraction and motion analysis,” Proceedings of the 2nd international

Conference on Multimedia, ACM Multimedia 94, pp. 349-356, 1994.

[12] MPEG Video Subgroup, “Updated Call for Proposals on Video Signature Tools,”

MPEG2008/N10155, October 2008, Busan, KR.

[13] P. Brasnett, K. Iwamoto, S. Paschalakis, R. Oami, M. Bober, “MPEG-7 Video

Signature Tools – Proposal”, Mitsubishi Electric R&D Centre Europe, NEC Corp,

April 2009.

[14] MPEG Video Subgroup, "Experimental Data Preparation Guide",

MPEG2008/N10155, October 2008.

[15] J. Canny, “A computational approach to edge detection.” Pattern Analysis and

Machine Intelligence, IEEE Transactions on, Nov. 1986, PAMI-8(6):679–698.

[16] R. Rowland. “Advertisement Airing Audit System and Associated Methods”.

Patent IPC8 Class: AH04H2014FI. October 2008.


[17] S. H. Lee, W. Y. Yoo, Y. Yoon, “Real-Time Monitoring System for TV

Commercials Using Video Features” LNCS vol. 4161/2006, pp. 81-89.

[18] R.M. Bolle, A.W. Senior, N.K. Ratha, and S.Pankanti, “Fingerprint minutiae: A

constructive definition,” LNCS, vol. 2359/2002, pp. 58–66.

[19] J. Korpi-Anttila, “Automatic color enhancement and scene change detection of

digital video”, Licentiate Thesis, Department of Automation and Systems

Technology, Helsinki University of Technology, November 2002.

[20] A. Nagasaka, Y.Tanaka, "Automatic Video Indexing and Full-Video Search for

Object Appearances", Visual Database Systems, II, Elsevier Science Publishers,

1992, pp. 113 – 127.

[21] K. Otsuji, Y. Tonomura, Y. Ohba, "Video Browsing Using Brightness Data",

Visual Communications and Image Processing, SPIE 1606, 1991, pp. 980 – 989.

[22] H.J. Zhang, A. Kankanhalli, S.W. Smoliar, "Automatic Partitioning of Full-motion

Video", Multimedia Systems, Vol. 1, 1993, pp. 10 – 28.

[23] F. Dufaux, F. Moscheni,”Motion Estimation Techniques for Digital TV: A

Review and a New Contribution”, Proceedings of the IEEE, vol 83. June 1995.

[24] M. Mushrif, S. Sengupta, A.K. Ray, “Texture Classification Using a Novel, Soft-

Set Theory Based Classification Algorithm”, LNCS, vol 3851/2006, pp 246-254.

[25] A. F. Smeaton, P. Over, A. R. Doherty, “Video Shot Boundary Detection: Seven

Years of TRECVid Activity”, Computer Vision and Image Understanding, vol

114, April 2010, pp 411-418.

[26] F. Bellard, “FFMPEG”, http://www.ffmpeg.org/

[27] Intel Corporation, “Intel® Integrated Performance Primitives for Intel®


Architecture”, Reference Manual, vol 2: Image and Video processing, March

2009.

[28] K. Turkowski, “Filters for Common Resampling Tasks”, Apple Computer, April 1990.

[29] R. Jain, R. Kasturi, B. G. Schunck, “Chapter 5. Edge Detection”, Machine Vision,

McGraw-Hill, 1995.
