SIGNATURE SYSTEM FOR VIDEO IDENTIFICATION
by
Sebastian Possos Medellin
A Thesis Submitted to the Faculty of
The College of Engineering and Computer Science
in Partial Fulfillment of the Requirements for the Degree of
Master of Science
Florida Atlantic University
Boca Raton, Florida
August 2010
ACKNOWLEDGMENTS
I would like to thank my wife Adriana, because without her I would not have been able to reach this far and keep going. Thanks also to Dr. Hari Kalva, who showed me a new world of knowledge, exciting and full of challenges; his guidance and friendship made this work possible. To Dr. Maria Petrie, who gave me the opportunity to be at this excellent institution, and to Dr. Oge Marques and Dr. Bassem Alhalabi for their help and support as members of my Thesis Committee.
Finally, I would like to thank my family, friends and colleagues, who were always present, not only during this time but always, giving me their support and lending me a hand when I needed it.
ABSTRACT
Author: Sebastian Possos Medellin
Title: Signature system for video identification
Institution: Florida Atlantic University
Thesis Advisor: Dr. Hari Kalva
Degree: Master of Science
Year: 2010
Video signature techniques based on tomography images address the problem of video identification. This method relies on temporal segmentation and sampling strategies to build and determine the unique elements that will form the signature. In this thesis an extension of these methods is presented. First, a new feature extraction method, derived from the previously proposed sampling pattern, is implemented and tested, resulting in a highly distinctive set of signature elements. Second, a robust temporal video segmentation system replaces the original method, determining shot changes more accurately. Under a very exhaustive set of tests the system achieved 99.58% recall, 100% precision and 99.35% prediction precision.
SIGNATURE SYSTEM FOR VIDEO IDENTIFICATION
TABLES ...... viii
FIGURES ...... ix
EQUATIONS ...... xi
Chapter 1 INTRODUCTION ...... 1
1.1 Overview and Motivation ...... 1
1.2 Problem Statement and Objective ...... 2
1.3 Main Contributions ...... 4
1.4 Overview of the Thesis ...... 5
Chapter 2 BACKGROUND AND RELATED WORK ...... 6
2.1 Introduction...... 6
2.2 Video Identification ...... 6
2.2.1 Digital Watermarks ...... 7
2.2.2 Video content based identification ...... 7
2.2.3 Signature based identification ...... 9
2.3 Video Tomography ...... 17
2.4 Advertisement auditing systems ...... 20
Chapter 3 PROPOSED SOLUTION ...... 22
3.1 Introduction...... 22
3.2 General Description ...... 22
3.3 Signature design ...... 23
3.3.1 First Video Signature version ...... 25
3.3.2 Second Video Signature version ...... 37
Chapter 4 IMPLEMENTATION ...... 42
4.1 Introduction...... 42
4.2 Signature extraction ...... 43
4.2.1 FFMPEG libraries ...... 48
4.2.2 Intel IPP 6.0 Canny Edge detector ...... 49
4.3 Shot Change Detector flow chart...... 50
Chapter 5 EXPERIMENTS AND RESULTS ...... 51
5.1 Introduction...... 51
5.2 MPEG Experiments ...... 51
5.2.1 Data set...... 52
5.2.2 Procedure for testing ...... 53
5.2.3 Uniqueness test ...... 53
5.2.4 Robustness test ...... 60
5.3 Content Auditing ...... 72
5.3.1 Experiment setup ...... 72
5.3.2 Algorithm ...... 73
5.3.3 Results ...... 73
5.3.4 Summary ...... 75
Chapter 6 CONCLUSIONS AND FUTURE WORK ...... 76
6.1 Conclusions ...... 76
6.2 Future Work ...... 77
BIBLIOGRAPHY ...... 79
TABLES
Table 1. Uniqueness Test Performance Summary ...... 57
Table 2. Uniqueness Test Time complexity Performance ...... 57
Table 3. Modified Uniqueness Test Performance Summary ...... 58
Table 4. Uniqueness test Performance Comparison ...... 59
Table 5. Video Modifications and Levels for Robustness Assessment ...... 60
Table 6. Content audit: Signature comparison results for signatures of Type 1...... 74
Table 7. Content audit: Signature comparison results for signatures of Type 2...... 74
FIGURES
Figure 1. Region pair distribution signature extraction – NEC Corp. proposal ...... 10
Figure 2. Neighborhood and pixel relationship – Mitsubishi Electric proposal ...... 11
Figure 3. Multi-resolution signature structure – Mitsubishi Electric proposal ...... 13
Figure 4. Temporal hierarchical structure – University of Brescia proposal ...... 14
Figure 5. Saliency image generation – Peking University proposal for MPEG ...... 15
Figure 6. Saliency based signature extraction – Peking University proposal ...... 15
Figure 7. Mitsubishi Electric – NEC Corp. video signature bitstream ...... 17
Figure 8. Video Tomography Extraction ...... 18
Figure 9. Video tomography extraction for 1 of 6 components ...... 24
Figure 10. Tomography line pattern ...... 24
Figure 11. Tomography, edge and composite images ...... 27
Figure 12. Sub-signature extraction: Level changes along lines ...... 28
Figure 13. Sub-signature extraction: block or area assignment ...... 28
Figure 14. Tomography image with a shot change present ...... 29
Figure 15. Shot Signature Process ...... 30
Figure 16. Frame signature extraction ...... 32
Figure 17. Resulting image after Row sum procedure ...... 39
Figure 18. Resulting image after Column sum procedure ...... 40
Figure 19. Resulting image after texture energy measurement ...... 40
Figure 20. Signature extraction process and external tools used ...... 43
Figure 21. Signature generation Flow chart ...... 48
Figure 22. Shot Change Detector flow chart ...... 50
Figure 23. Query construction for independence tests ...... 55
Figure 24. Success ratios for Text/Logo overlay test ...... 62
Figure 25. Success ratios for Severe compression test...... 63
Figure 26. Success ratios for Resolution reduction test ...... 63
Figure 27. Success ratios for Frame rate reduction test ...... 64
Figure 28. Success ratios for Capturing on Camera test ...... 64
Figure 29. Success ratios for Analog VCR Recording test ...... 65
Figure 30. Success ratios for Color to monochrome conversion test ...... 65
Figure 31. Success ratios for Brightness change test ...... 66
Figure 32. Success ratios for Interlaced / Progressive conversion test ...... 66
Figure 33. Result comparison for Text/Logo overlay test...... 67
Figure 34. Result comparison for severe compression test...... 68
Figure 35. Result comparison for Resolution reduction test...... 68
Figure 36. Result comparison for Frame rate reduction test...... 69
Figure 37. Result comparison for Capturing on Camera test...... 69
Figure 38. Result comparison for Analog VCR Recording test...... 70
Figure 39. Result comparison for Color to monochrome conversion test...... 70
Figure 40. Result comparison for Brightness change test...... 71
Figure 41. Result comparison for Interlaced / Progressive conversion test ...... 71
EQUATIONS
Equation 1. First descriptor element – Mitsubishi Electric ...... 11
Equation 2. Second descriptor element – Mitsubishi Electric ...... 11
Equation 3. Third descriptor element – Mitsubishi Electric ...... 11
Equation 4. Fourth descriptor element – Mitsubishi Electric ...... 12
Equation 5. Binary conversion function – Mitsubishi Electric and NEC Corp ...... 17
Equation 6. Euclidean Distance formula ...... 19
Equation 7. Similarity function – ETRI Institute ...... 21
Equation 8. Frame difference function (Pixel wise) ...... 34
Equation 9. Summation of histogram differences ...... 35
Equation 10. Row sum equation ...... 39
Equation 11. Column sum equation ...... 39
Equation 12. Texture energy ...... 40
Equation 13. Six tap filter for pixel interpolation ...... 45
Equation 14. High contrast five tap decimation filter ...... 45
Chapter 1 INTRODUCTION
1.1 Overview and Motivation
Current technological advances allow people to copy and distribute video content with almost no specialized knowledge. Today, with just the click of a button, any person is able to grab video from any source and post it on the Internet; a simple search on Google or Yahoo will retrieve several tutorials and tools, at little or no cost, explaining how to create, edit or duplicate multimedia content.
The World Wide Web also offers the means for publishing and distributing this type of content. The perfect example is YouTube: its statistics from March 2008 show that 78.3 million videos were available for anyone to see, 150 to 200 thousand videos were uploaded every day, and on average 13 hours of video content were uploaded every minute. Now in 2010 the popularity of YouTube is so great that this last statistic has increased by 85%, with a total of 24 hours of content being uploaded every minute [1].
But with that huge amount of data, the entertainment and IT industries are facing new challenges. On one side is the hosting and bandwidth consumption related to the constant availability of this huge amount of content; on the other is how to control and monitor the uploaded content, more specifically how to determine whether an uploaded item is under copyright protection. Hence an automatic copyright enforcement system is highly needed, taking into account that the time required for a person to watch all the video content posted on the Internet would be longer than several human lifetimes. For the same reason, the complexity of the system should be low for both the signature extraction and the comparison process, without compromising the uniqueness and robustness properties of the signature.
Alternatively, monitoring systems play a crucial role in the broadcasting operations of any network. They provide a means to continuously capture the transmitted programming, details, schedules and events. All this information, when stored as logs or databases, permits the analysis of performance and quality of service, and the retrieval of any content that was made available at any particular time. Advertisers and the advertisement industry have become more interested in these types of applications, through which they can follow all details of the contracted air time.
1.2 Problem Statement and Objective
The proposed solution continues the work done by Gustavo Leon in his thesis “Content Identification using Video Tomography” [3], where a new technique for signature generation was designed using tomography images, but is based on three new aspects:
• A new pattern for sample extraction. The idea behind this new pattern is to increase the consistency level of the signatures by always extracting sample lines of the same size. In the previous method, to compose the horizontal and vertical tomography images it was necessary to crop the horizontal image so it would match the vertical one. Now, because all the extracted lines are the same size, no cropping is needed, resulting in an enhanced uniqueness property.
• Improved temporal segmentation. To enhance dependability, the temporal segmentation of the video cannot be done arbitrarily: if the portions to be compared are not synchronized, the extracted features will not be similar, leading to a false identification. The solution here is to implement a shot detection algorithm that synchronizes the video segments accurately, increasing the correlation between them.
• Frame normalization. Most video hosting sites reduce the video frame size to lower the amount of data that has to be stored, because storage is a problem. But because of this transformation, the signature confidence can be affected. To solve this, a normalized frame size is proposed: all video analysis is done at a 360 x 240 frame size. The resizing is done with either a decimation filter or an interpolation filter, depending on the scenario. If the video image is bigger than the proposed normalized size, a decimation filter is used, which not only performs the reduction but also frees the video of aliasing effects that could introduce undesired noise into the extracted signature. In case the video is smaller, an interpolation filter is needed to upsample the frame. Here the issues are different, because the upsampling procedure introduces different artifacts into the image; to enhance reliability, this filter has to increase the contrast of the new image, with the purpose of augmenting the edge detection stability and ensuring a consistent feature extraction.
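As a rough sketch of the normalization step just described, the following Python code resizes a grayscale frame to the 360 x 240 working size. The 3x3 averaging filter and the bilinear resampler are simple stand-ins, used only for illustration, for the decimation and interpolation filters actually employed (Equations 13 and 14); in particular, the contrast-boosting behavior of the real interpolation filter is not modeled here.

```python
import numpy as np

TARGET_W, TARGET_H = 360, 240  # normalized frame size used throughout the thesis

def low_pass(frame):
    """Simple 3x3 averaging filter: a stand-in anti-aliasing step
    applied before decimation to suppress high frequencies."""
    padded = np.pad(frame, 1, mode="edge")
    out = np.zeros(frame.shape, dtype=np.float64)
    for dy in (0, 1, 2):
        for dx in (0, 1, 2):
            out += padded[dy:dy + frame.shape[0], dx:dx + frame.shape[1]]
    return out / 9.0

def resize_bilinear(frame, w, h):
    """Bilinear resampling; serves both for decimation (after low-pass
    filtering) and for interpolation of frames smaller than the target."""
    src_h, src_w = frame.shape
    ys = np.linspace(0, src_h - 1, h)
    xs = np.linspace(0, src_w - 1, w)
    y0 = np.floor(ys).astype(int); y1 = np.minimum(y0 + 1, src_h - 1)
    x0 = np.floor(xs).astype(int); x1 = np.minimum(x0 + 1, src_w - 1)
    wy = (ys - y0)[:, None]; wx = (xs - x0)[None, :]
    f = frame.astype(np.float64)
    top = f[np.ix_(y0, x0)] * (1 - wx) + f[np.ix_(y0, x1)] * wx
    bot = f[np.ix_(y1, x0)] * (1 - wx) + f[np.ix_(y1, x1)] * wx
    return top * (1 - wy) + bot * wy

def normalize_frame(frame):
    """Resize a grayscale frame to 360x240, low-pass filtering first
    when shrinking to avoid aliasing in the extracted signature."""
    h, w = frame.shape
    if w > TARGET_W or h > TARGET_H:
        frame = low_pass(frame)  # decimation path: anti-alias before shrinking
    return resize_bilinear(frame, TARGET_W, TARGET_H)
```

The key design point is the branch in `normalize_frame`: shrinking without the preceding low-pass step would fold high-frequency detail into the resampled image as aliasing noise, which would then contaminate the edge-based signature.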
As an extension of this system, an implementation of a real-life application is tested: an automatic advertisement tracking and auditing system based on video signatures.
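The shot detection idea outlined in section 1.2 can be sketched as a histogram-difference detector. The following Python fragment is a hedged illustration only: the bin count and the decision threshold are illustrative assumptions, not the values used in the actual implementation described in Chapter 4.

```python
import numpy as np

def gray_histogram(frame, bins=64):
    """Normalized gray-level histogram of one frame (illustrative bin count)."""
    hist, _ = np.histogram(frame, bins=bins, range=(0, 256))
    return hist / hist.sum()

def detect_shot_changes(frames, threshold=0.5):
    """Flag a shot change whenever the summed absolute difference between
    consecutive frame histograms exceeds a fixed threshold (an assumed
    decision rule). Returns the index of the first frame of each new shot."""
    cuts = []
    prev = gray_histogram(frames[0])
    for i in range(1, len(frames)):
        cur = gray_histogram(frames[i])
        if np.abs(cur - prev).sum() > threshold:
            cuts.append(i)
        prev = cur
    return cuts
```

Because histograms discard spatial layout, this rule tolerates motion within a shot while still reacting to abrupt content changes, which is what makes it useful for synchronizing the segments whose signatures are later compared.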
1.3 Main Contributions
The following are the main contributions of this work:
• This system was a response to the MPEG Call for Proposals on Video Signature technology.
• Implementation of a new signature design, increasing its robustness and dependability.
• Implementation of a shot detection (temporal segmentation) module, which allows real synchronization between the signatures of compared videos.
1.4 Overview of the Thesis
The remaining chapters of this thesis are structured as follows: Chapter 2 provides background information on video identification and advertisement auditing; Chapter 3 describes the proposed solution in detail, covering the technique's main elements and the tools used to increase system performance; Chapter 4 presents the implementation of the system and describes the algorithms used in this approach; Chapter 5 contains the experiments and results based on the implementation; and Chapter 6 presents conclusions and possibilities for future work.
Chapter 2 BACKGROUND AND RELATED WORK
2.1 Introduction
In this chapter a general review of video identification systems is presented, covering the techniques they are based on and their components. Current advertisement audit systems are also reviewed.
2.2 Video Identification
There are two main ways to identify a video [2]:
1. Based on a digital watermark
2. Based on the video content
A further development of video content identification is the creation of signatures: the signature method looks for specific elements in the image from which to extract a unique descriptor.
2.2.1 Digital Watermarks
Digital watermarking based approaches rely on an embedded watermark that can be extracted at any time in order to determine the video source. As exposed in [4], watermarking technology followed the idea of “if you can’t see it, and if it is not removed by common processing, then it must be secure”. Watermarks were first proposed as a solution for identification and tamper detection in video and images by G. Doer et al. [5]. However, they are not usually designed to identify unique clips from the same video source.
The biggest drawbacks of this approach are the need to embed a robust watermark in the video source and the fact that a large collection of “un-watermarked” files already exists.
2.2.2 Video content based identification
Content based identification, on the other hand, uses the content of the video to compute a unique signature based on various video features. Surveys of content based video identification systems are presented by X. Fang et al. [6] and by J. Law-To et al. [7].
A proposal for copy detection in streaming videos is presented by Y. Yan et al. [8], where a video sequence similarity measure is used, composed of the frame fingerprints extracted for individual frames. Partial decoding of the incoming video is performed, and the DC coefficients of key frames are used to extract and compute frame features.
References [9] and [10] are also based on key frame analysis. [9] proposes a clustering technique where the authors take key frames from each cluster of the query video and perform a key frame based search for similar regions in the target videos. [10], on the other hand, uses local features: it extracts key frames to match against a database and then matches the local spatial-temporal features of the videos.
As can be inferred from the above, many content based video identification methods use video signatures computed from features extracted from individual frames. These frame based solutions are complex and add a large amount of overhead, especially for long duration videos, as they require feature extraction and comparison on a per-frame basis.
Additionally, they are characterized by the use of key frames for temporal synchronization and subsequent video identification. But determining key frames either relies on the underlying compression algorithms or requires additional computation.
2.2.3 Signature based identification
Signature identification [12] is a system based on the extraction of specific elements, or fingerprints, from media items, which ultimately allow each item to be uniquely identified. The main idea behind these signatures is that extracting the features that uniquely describe the content makes it possible to identify that content even after the media has been modified by a wide range of common editing operations.
One big advantage of signature descriptors is that they are a passive means of identification: there is no need to alter the original content, in contrast to a watermark, which must be added actively.
2.2.3.1 MPEG Call for Proposals on Video Signature technology.
Five candidates submitted proposals to the MPEG Call for Proposals. Among them, an earlier version of this thesis work was presented, which will be explained in section 2.3.
But first, the timeline of the Call for Proposals is given next, along with the four rival proposals.
Timeline
• Registration of Interest open: 2008.07.25
• Updated Call for Proposals on Video Signature Tools issued: 2008.10.17
• Complete dataset distribution to registered participants by post: 2008.10.24
• Registration of Interest closed: 2008.11.30
• Submission deadline: 2009.01.26
• Evaluation of Responses: 2009.01.31
2.2.3.2 Video signature based on feature difference between various
pairs of sub-regions (NEC Corp).
In this proposal, a set of sub-regions is defined; the mean value is calculated for each sub-region, and a relationship is then defined between pairs of regions, following the distribution presented in Figure 1.
Figure 1. Region pair distribution signature extraction – NEC Corp. proposal
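As a rough illustration of this scheme in Python, the sketch below computes region means and binarizes the pairwise differences. The 4x4 grid and the example pairs are placeholders chosen for clarity; the actual NEC proposal defines its own region shapes and pair distribution, shown in Figure 1.

```python
import numpy as np

def region_means(frame, grid=(4, 4)):
    """Mean luma of each cell in a grid partition of the frame.
    The 4x4 grid is an illustrative stand-in for NEC's sub-regions."""
    h, w = frame.shape
    gh, gw = grid
    means = []
    for r in range(gh):
        for c in range(gw):
            cell = frame[r * h // gh:(r + 1) * h // gh,
                         c * w // gw:(c + 1) * w // gw]
            means.append(cell.mean())
    return means

def pairwise_signature(frame, pairs):
    """One signature element per region pair: 1 if the first region's mean
    exceeds the second's, else 0 (a simplified binarization rule)."""
    m = region_means(frame)
    return [1 if m[a] > m[b] else 0 for a, b in pairs]
```

Comparing region means in pairs, rather than storing the means themselves, makes the resulting descriptor insensitive to global brightness shifts, since both regions of a pair move together under such changes.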
2.2.3.3 Video Signature based on Multi-resolution decomposition
(Mitsubishi Electric R&D Centre Europe).
For this approach a frame is divided into 2x2 neighborhoods; on each 2x2 neighborhood, the relationship between the descriptor elements {a’, b’, c’, d’} and the pixel values {a, b, c, d} of the neighborhood is given by equations (1)-(4).
Figure 2. Neighborhood and pixel relationship – Mitsubishi Electric proposal