Automatic Detection of Commercial Blocks in Broadcast TV Content

Alexandre Ferreira Gomes

Thesis to obtain the Master of Science Degree in Electrical and Computer Engineering

Supervisors: Prof. Maria Paula dos Santos Queluz Rodrigues and Prof. Fernando Manuel Bernardo Pereira

Examination Committee
Chairperson: Prof. José Eduardo Charters Ribeiro da Cunha Sanguino
Supervisor: Prof. Maria Paula dos Santos Queluz Rodrigues
Members of the Committee: Prof. João Magalhães

November 2016

Acknowledgements

First of all, I would like to thank my father Mário, my mother Carmen and my sisters Filipa and Inês for supporting me in every moment of this journey and for always doing everything to help me. Anything I say is not enough to express my sense of debt to you. At the same level, a special word to Sara Mendes, whose grace and love made everything easier and make me improve every single day. A special thanks to professors Paula Queluz and Fernando Pereira, for all the availability and guidance, and for helping me to better understand the importance of having a critical view of every situation – valuable lessons that will be useful throughout my life; and to André Alexandre and Luis Nunes for the company and for sharing some tears with me! I would also like to thank my uncle Rodrigo and my grandmother Isabel for the precious late meals and all the breakfasts; my grandparents Silvino and Deolinda for helping me grow up as a better person; my uncles Pedro and Tânia and my beautiful cousins Afonso and Xavier for the amazing and relaxing Saturday afternoons at their home; my cousin Ricardo for helping me out in a very sensitive moment; and Lurdes, for her good mood. Finally, but not least at all, a heartfelt thanks to all my good friends who have been with me in the past few years and will surely remain in the next few decades: Alexandre Gabriel, Ricardo Sousa, Tiago Sebastião, Miguel Ramos, Pedro Gama, Guilherme Gil, João Melo, João Silva, David Oliveira, Ricardo Joaquinito and Carlos Silva. Also, a special word to João Brogueira. To my favourite Civil Engineering guys: Beatriz Loura, Filipe Vale, Mariana Antunes, João Rafael, and also Ana Santos for her importance in some crucial moments. Finally, a warm hug to André Antunes, Gonçalo Vieira, Joana Freitas and Bárbara Santos.


Abstract

As the global economy evolves, companies need to improve their marketing solutions in order to get some advantage over competitors; TV advertising commercials have emerged as a major tool for achieving this goal. From the video content point of view, TV commercials have some specific characteristics, as they all aim to capture the viewers' attention. Naturally, it is also these characteristics that make it possible to automatically detect advertising content and eventually skip it. Commercials are always packed and broadcast together in so-called commercial blocks, containing a given number of individual commercials. Moreover, their structure depends not only on the country and its relevant legislation, but also on the specific broadcaster, according to its advertising strategy and style. Motivated by the solutions proposed along the last few years for TV commercials detection, this Thesis presents an overview of the available state-of-the-art - notably to understand its current weaknesses - and proposes a new and effective solution. The proposed method for TV commercials detection is based on the presence or absence, on the screen, of a TV channel logo, which is a specific type of Digital on-Screen Graphic (DoG), as this logo is never present in commercial blocks. After segmenting the video using a shot change detector, the resulting video shots are analyzed in terms of color and shape, to conclude on the existence, or not, of DoGs in the video content. A DoGs Database containing the DoGs acquired over time is built and continuously updated. A systematic control of the DoGs Database is performed to conclude about the nature of each DoG and to classify each video segment as Regular Program or Commercial Block. For the video dataset used, which resulted from recordings of three different Portuguese TV channels, a minimum accuracy of 93.9% on commercials detection was achieved; furthermore, the measured and reported processing time suggests that the proposed solution could enable real-time (i.e., while recording) detection of commercial blocks.

Keywords: TV advertising; commercial blocks; shot detection; Digital on-Screen Graphics; logos detection; video processing.


Resumo

À medida que a economia global se desenvolve, as empresas têm a necessidade de melhorar as suas soluções de marketing de modo a obter alguma vantagem sobre a concorrência; neste âmbito, os anúncios televisivos têm emergido como uma ferramenta essencial para atingir este objetivo. Do ponto de vista do conteúdo, os anúncios publicitários têm algumas características específicas para que possam captar a atenção dos telespectadores. Naturalmente, são também essas características que permitem detetar automaticamente o conteúdo comercial. Os anúncios publicitários são habitualmente combinados e transmitidos pelos operadores televisivos em blocos comerciais que contêm um conjunto de anúncios sucessivos a diferentes marcas e entidades. A estrutura e o modo como a publicidade é transmitida em televisão dependem não apenas do país e da legislação em vigor, mas também do operador específico e da sua estratégia e abordagem à questão da publicidade. Motivado pelas soluções propostas ao longo dos anos, nesta Tese apresenta-se o estado-da-arte na área da deteção de blocos publicitários, analisando-se as debilidades dos métodos existentes, e propõe-se uma solução nova e eficaz. A solução proposta é baseada na presença (ou ausência), no ecrã, do logo de um canal televisivo, já que este nunca está presente em blocos publicitários; este logo é um caso particular de DoG – Digital on-Screen Graphic. Após segmentar o vídeo a analisar, utilizando um detetor de mudança de shots, os segmentos vídeo resultantes são analisados em termos de forma e cor, de modo a concluir-se sobre a existência, ou não, de DoGs no conteúdo vídeo. Neste contexto, é construída uma base de dados de DoGs cujo objetivo é armazenar os DoGs adquiridos ao longo do tempo e que é continuamente atualizada à medida que a análise do vídeo avança. É também realizado um controlo sistemático da informação que está na base de dados de DoGs de modo a que se conclua sobre a natureza de cada DoG. Finalmente, classifica-se cada segmento de vídeo previamente fragmentado como Programa ou Bloco Comercial, tendo em conta a classificação atribuída aos DoGs. Para o conjunto de vídeos de teste utilizado, e que resultou de gravações de três canais de televisão portugueses, obteve-se uma exatidão mínima de 93,9% na deteção de tramas pertencentes a blocos comerciais; adicionalmente, o tempo de processamento medido sugere que a solução proposta permitirá a deteção de segmentos comerciais em tempo real (isto é, durante a gravação).

Palavras-chave: Publicidade em TV; blocos comerciais; deteção de shots; Digital on-Screen Graphics; deteção de logos; processamento de vídeo.


Table of Contents

Acknowledgements ...... i
Abstract ...... ii
Resumo ...... iii
Table of Contents ...... iv
Index of Figures ...... vii
Index of Tables ...... ix
List of Acronyms ...... x
Chapter 1 - Context and Objectives ...... 1
1.1 Motivation ...... 1
1.2 Objectives ...... 2
1.3 Main Contributions ...... 3
1.4 Thesis Outline ...... 3
Chapter 2 - TV Commercials: Legal Framework and Characterization ...... 4
2.1 Legal Framework ...... 4
2.1.1 Advertising Legal Framework in the European Union ...... 4
2.1.2 Legal Framework for Advertising in Portugal ...... 4
2.2 Typical Structure of a Commercial Block ...... 5
2.3 Intrinsic Characteristics ...... 6
2.3.1 High Scene Cut Rates ...... 6
2.3.2 Text Presence ...... 6
2.3.3 Audio Jingles ...... 7
2.3.4 Audio Level ...... 7
2.4 Extrinsic Characteristics ...... 7
2.4.1 Commercial Block Separator ...... 7
2.4.2 TV Channel Logo ...... 8
2.4.3 Black Frames ...... 8
2.4.4 Time Duration ...... 9
2.4.5 Commercials Repetition ...... 9
Chapter 3 - Overview of TV Commercials Detection Schemes ...... 10
3.1 Knowledge-based Detection ...... 10
3.1.1 The First Steps - Black Frames and Silence ...... 10
3.1.2 Going Deeper – Cut Rates ...... 12
3.1.3 Motion Analysis ...... 14
3.1.4 Logo Detection ...... 14
3.1.5 Audio Analysis ...... 16


3.1.6 Text Detection ...... 17
3.1.7 Still Images Detection ...... 19
3.2 Repetition-based Detection ...... 19
3.2.1 Lienhart et al. (1997) ...... 20
3.2.2 J. M. Gauch and A. Shivadas (2005) ...... 21
3.2.3 Li et al. (2008) ...... 22
Chapter 4 - Proposed Solution: Architecture and Algorithms ...... 24
4.1 Learning about Commercials and Logos with Real TV Content ...... 24
4.2 Characterizing TV Channel Logos ...... 27
4.3 Proposed System Architecture ...... 29
4.3.1 Designing the System ...... 29
4.3.2 Architecture Walkthrough ...... 30
4.4 Shot Change Detection and Segmentation ...... 32
4.4.1 Luminance Histogram Operations ...... 33
• Luminance Frame Histogram Computation ...... 33
• Luminance Histogram Distance Computation ...... 33
4.4.2 Adaptive Threshold Computation ...... 33
4.4.3 Hard Cut Detection Decision ...... 34
4.4.4 Forced Segmentation ...... 34
4.5 DoG Acquisition Algorithm ...... 34
4.5.1 Video Segment Edges & Color Analysis ...... 34
4.5.2 DoG Detection ...... 42
4.6 DoGs Database Updating & DoG Type Decision ...... 46
4.6.1 DoGs Matching ...... 48
4.6.2 DoGs Insertion in DoGs Database ...... 49
4.6.3 Database Update & Management ...... 50
4.6.3.1 Basic Solution Rationale ...... 50
4.6.3.2 Advanced Solution Rationale ...... 51
4.7 Video Segment Classification ...... 53
Chapter 5 – Performance Evaluation ...... 55
5.1 Test Material ...... 55
5.1.1 Shot Change Detection Assessment Dataset ...... 55
5.1.2 DoG Acquisition Assessment Dataset ...... 56
5.1.3 Global Solution for Detecting Commercials Assessment Dataset ...... 58
5.2 Performance Assessment Methodology and Metrics ...... 59
5.2.1 Shot Change Detection Assessment ...... 60
5.2.2 DoG Acquisition Assessment ...... 60


5.2.3 Global Solution for Detecting Commercials Assessment ...... 61
5.3 Results and Analysis ...... 62
5.3.1 Shot Change Detection Assessment Experiment ...... 62
5.3.2 DoG Acquisition Algorithm Assessment ...... 64
5.3.3 Global Solution Assessment ...... 68
Chapter 6 - Summary and Future Work ...... 72
6.1 Summary ...... 72
6.2 Future Work ...... 74
Appendix A - DoGs Detection – Example of the Complete Process ...... 75
1. Video test sequence characterization ...... 75
2. Screenshots extracted from each shot ...... 75
3. Key Frames Edge Fusion step ...... 77
4. SPMs Intersection step ...... 78
5. Color Map of the detected DoG ...... 78
6. Heat Map ...... 78
Appendix B - SPMs Intersection step results ...... 79
Bibliography ...... 80


Index of Figures

Figure 1.1 - Some well-known commercials produced for the Super Bowl ...... 1
Figure 2.1 - Typical structure of a commercial block in the Portuguese TV channels ...... 5
Figure 2.2 - Examples of (A) fading, fade-out first, then fade-in [15]; (B) dissolving [16] ...... 6
Figure 2.3 - Example of a TV commercial with text in different places and with different fonts ...... 7
Figure 2.4 - Example of RTP (Portuguese public television) initial commercial block separator ...... 8
Figure 2.5 - TV channel logo present in the top left corner ...... 8
Figure 3.1 - Conditions imposed by Sadlier et al. to detect commercial blocks by using BF/SF series (image from [23]) ...... 11
Figure 3.2 - Comparison of the Cuts per Minute metric for a commercial block and a movie [13] ...... 12
Figure 3.3 - Comparison of the Cuts per Minute metric for a commercial block and a newscast [11] ...... 12
Figure 3.4 - The time averaged gradient $\bar{S}_L$ (first row); binary mask obtained after morphological processing (second row) ...... 15
Figure 3.5 - Binary image obtained after step 2.c (top right corner); result (bottom right corner) ...... 18
Figure 3.6 - The result after step 2.e (see the text boxes in the left image) ...... 18
Figure 3.7 - Examples of FMPI, each line representing a different type [12] ...... 19
Figure 3.8 - Discount Tire Co.'s "Thank you!" commercial ...... 20
Figure 3.9 - Confidence Level Assignment ...... 23
Figure 4.1 - (a) Lamborghini's TV commercial screenshot, where a flash of the brand logo can be seen in the lower left corner (extracted from https://www.youtube.com/watch?v=Xd0Ok-MkqoE); (b) screenshot extracted from a Portuguese TV commercial: the commercial brand logo appears in the upper right corner during the whole commercial ...... 25
Figure 4.2 - Screenshot from a Portuguese TV series: in the upper left and upper right corners, the TV channel logo and the TV series logo, respectively ...... 26
Figure 4.3 - Screenshot from a Portuguese TV news program: in the upper left corner, the TV channel logo; in the lower right corner, the news program logo ("Bom Dia Portugal"), the current time ("06:38") and live traffic information ...... 26
Figure 4.4 - Screenshot from a Portuguese broadcaster self-promotion commercial: the program logo is placed in the upper right corner during the whole self-promotion ...... 26
Figure 4.5 - Difficult logo examples: (a) the TV channel logo in the upper left corner is over a highly textured zone, making it hard to detect; (b) the TV channel logo in the upper left corner is over the sky, making it almost impossible to detect as its color is quite similar to the background color ...... 28
Figure 4.6 - Difficult logo example: in the upper left corner, the TV channel logo contains a dark shadow surrounding a colored graphical object ...... 28
Figure 4.7 - Example of texture variations in a logo along time, notably in the central letter "I" and in the colored regions ...... 28
Figure 4.8 - Example of a dynamic logo in terms of shape, with snowflakes constantly falling on the logo ...... 29
Figure 4.9 - RTP1 logos: the old and the new ...... 29
Figure 4.10 - Global System architecture ...... 31
Figure 4.11 - Shot Change Detection and Segmentation module flowchart ...... 32
Figure 4.12 - Schematic representation of the Luminance Histogram Distance Computation between consecutive frames ...... 33
Figure 4.13 - Video Segment Edges & Color Analysis module flowchart ...... 35


Figure 4.14 - (a) SL = 53; NKF = ceil(0.1 × SL) = 6; KFs indexes: 1, 11, 21, 31, 41, 51. (b) SL = 76; NKF = ceil(0.1 × SL) = 8; KFs indexes: 1, 11, 21, 31, 41, 51, 61, 71. (c) SL = 110; NKF = 10; KFs indexes: 1, 12, 23, 34, 45, 56, 67, 78, 89, 100 ...... 36
Figure 4.15 - DoG Areas Definition: the boxes signalized as ULC, URC, LLC and LRC ...... 37
Figure 4.16 - Output for the video frame in Figure 4.15 ...... 38
Figure 4.17 - Screenshot extracted from the video segment "rtp1_demo1" [ref: https://www.dropbox.com/s/9rlzlcsajgad9dz/rtp1_demo1.mp4?dl=0], with a TV channel logo in the upper left corner ...... 38
Figure 4.18 - KFs edges maps obtained for the video sequence "rtp1_demo1" ...... 39
Figure 4.19 - Key Frames Edges Fusion output for the ULC region (video sequence "rtp1_demo1") ...... 40
Figure 4.20 - Key Frames Edges Fusion output for the ULC region after dilation (video sequence "rtp1_demo1") ...... 40

Figure 4.21 - Heat map representing the mean chrominances (Cr and Cb) variance ...... 41
Figure 4.22 - Static Pixels Map (SPM) for the video sequence used as example ...... 42
Figure 4.23 - DoG Detection module flowchart ...... 43

Figure 4.24 - Result of SPMs Intersection for various values of Nseg: (a) Nseg = 4; (b) Nseg = 5; (c) Nseg = 6 ...... 43
Figure 4.25 - DoG Presence Verification: (a) DoG in DDB; (b) SPMs Intersection map under verification; (c) final result after the DoG Presence Verification step, containing the pixels classified as Similar in this stage ...... 45
Figure 4.26 - DoGs Database Updating & Logo Type Decision module flowchart ...... 47
Figure 4.27 - Video Segment Classification module flowchart ...... 54
Figure 4.28 - Output structure that should be corrected ...... 54
Figure 5.1 - Structure of the "sicNotGS" test sequence ...... 58
Figure 5.2 - Structure of the "tviGS" test sequence ...... 58
Figure 5.3 - Structure of the "rtp1GS" test sequence ...... 59
Figure 5.4 - Strong brightness change in consecutive frames, visible in the upper right corner ...... 63
Figure 5.5 - Examples of situations where the DoGA algorithm failed with false negatives ...... 66
Figure 5.6 - Sequence of frames (in different shots) showing failures (notably false negatives in the lower left and lower right corners); an "RTP1" logo shape transition (in the stripes part) can also be observed in the upper left corner ...... 67
Figure 5.7 - Example of a situation where the algorithm fails by detecting false positives in the lower corners, due to a highly textured background ...... 68
Figure A.1 - Screenshot extracted from the first shot of the "rtp1_example" video test sequence ...... 75
Figure A.2 - Screenshot extracted from the second shot of the "rtp1_example" video test sequence ...... 76
Figure A.3 - Screenshot extracted from the third shot of the "rtp1_example" video test sequence ...... 76
Figure A.4 - Screenshot extracted from the fourth shot of the "rtp1_example" video test sequence ...... 76
Figure A.5 - Screenshot extracted from the fifth shot of the "rtp1_example" video test sequence ...... 76
Figure A.6 - Key Frames Edges Fusion step output obtained for each shot of the "rtp1_example" video test sequence ...... 77
Figure A.7 - Key Frames edges maps that represent the fifth shot of the sequence ...... 77
Figure A.8 - SPMs Intersection output for the sequence "rtp1_example" ...... 78
Figure A.9 - Color map of the detected DoG ...... 78
Figure A.10 - Heat map corresponding to the Key Frames Edge Fusion map of shot 5 ...... 78
Figure B.1 - Some results from the SPMs Intersection step for each TV channel tested in the DoG Acquisition Algorithm Assessment ...... 79


Index of Tables

Table 2.1 - Comparison between EU and Portuguese rules for TV commercials broadcasting ...... 5
Table 4.1 - Commonly observed characteristics in TV logos ...... 27
Table 4.2 - Three TV logo examples and their characterization ...... 27
Table 5.1 - Shot Change Detection test sequences characterization ...... 55
Table 5.2 - Colors used in logo types classification ...... 56
Table 5.3 - Logos used to test the DoGA module ...... 56
Table 5.4 - DoGA's test sequences characterization ...... 57
Table 5.5 - Global Solution test sequences characterization ...... 58
Table 5.6 - DoGA's test sequences characterization ...... 61
Table 5.7 - Parameters for the SCD assessment experiment ...... 62
Table 5.8 - SCD module performance results ...... 62
Table 5.9 - Ratio between SCD run time and duration of each video sequence ...... 64
Table 5.10 - Parameters for the DoGA assessment experiment ...... 64
Table 5.11 - DoGA module: performance results ...... 64
Table 5.12 - Ratio between DoGA run time and duration of each video sequence ...... 68
Table 5.13 - GS module: performance results for the sicNotGS video test sequence ...... 69
Table 5.14 - GS module: performance results for the tviGS video test sequence ...... 69
Table 5.15 - GS module: performance results for the rtpGS video test sequence ...... 70
Table 5.16 - Ratio between GS run time and duration of each video sequence ...... 71


List of Acronyms

Acc Accuracy
ASCI Audio Scene Change Indicator
BF Black Frame
Cb Blue Chrominance
CC Connected Component
CCV Color Coherence Vector
CM Center of Mass
Cr Red Chrominance
CRS Confidence Based Recognition System
DA DoG Area
DDB DoGs Database
DoG Digital on-Screen Graphic
DoGA DoG Acquisition Algorithm
DPV DoGs Presence Verification
EBC Edge-Based Contrast
ECR Edge Change Ratio
EPR Energy Peak Rate
ESR Edge Stable Ratio
EU European Union
F1 F1-Score
FMPI Frames Marked with Product Information
FN False Negative
FNR False Negative Rate
FP False Positive
FPR False Positive Rate
GS Global Solution for Commercials Detection
HMM Hidden Markov Model
HSI Hue, Saturation, Intensity
KF Key Frames
LLC Lower Left Corner
LRC Lower Right Corner
MB Macroblock
MFCC Mel-Frequency Cepstral Coefficients
MGD Maximum Gradient Difference
MPEG Moving Picture Experts Group
MVL Motion Vector Length
Pre Precision
RTP Rádio e Televisão Portuguesa (Portuguese public television)
RGB Red, Green, Blue
RMS Root Mean Square
RO Repeating Objects


SCD Shot Change Detection and Segmentation
SF Silent Frame
SIC Sociedade Independente de Comunicação (Portuguese private television)
SL Video Segment Length
SPM Static Pixels Map
SVM Support Vector Machine
TN True Negative
TNR True Negative Rate
TP True Positive
TPR True Positive Rate
TVI Televisão Independente (Portuguese private television)
ULC Upper Left Corner
URC Upper Right Corner
TV Television
ZCR Zero Crossing Rates


Chapter 1 Context and Objectives

This chapter provides the scope and objectives of this Thesis, and presents the motivation and the global context inspiring the problem to be solved. Finally, the Thesis’ main contributions and organization are also outlined.

1.1 Motivation

As the global economy evolves, companies need to improve their marketing solutions in order to get some advantage over competitors; TV advertising commercials have emerged as an essential tool for achieving this goal. This solution has been used almost since the start of television broadcasting and is a business in constant change and progress. Television is an important publicity space for companies, especially the most powerful ones, and the visibility achieved by using this remarkable communication medium is something most companies fight for. To give an idea, the toughest and most expensive struggle for an advertisement time slot in the world happens annually at the Super Bowl, the final of the American Football championship. Super Bowl XLIX (in 2015) was the most watched program in American TV history: on average, each moment was seen by about 115 million people, and the average cost of a 30-second commercial was $4.5 million [1] (see Figure 1.1). Obviously, the companies expect a huge income later; it is precisely this expected long-term return that explains why, nowadays, so many commercials can be seen on TV, with more and more colors, sounds, catchphrases and complex actions. As there are no free lunches, this is also a service of extreme importance for the broadcasters' business model, especially for private TV stations, considering the huge associated income. Typically, this is a win-win situation, creating synergies between those who pay with an eye on the future and those who receive good money for providing a service that finances the TV programs. Since the advertising business has grown to multi-million dollars/euros, some legislation had to be created to avoid and penalize potential abuses and to guarantee a certain quality of the service provided by the broadcasters to the viewers. It should also not be forgotten that the creation of an advertisement has a critical artistic component, notwithstanding its fundamental marketing objective. In fact, while users would like to skip most commercials, and they do it more and more with specific technology, there are also commercials that became extremely popular due to their artistic quality and impact.

Figure 1.1 – Some well-known commercials produced for the Super Bowl.

The advent of digital TV and the progress on computationally efficient image and video processing tools allowed the development of automatic methods and applications to detect, localize and identify/recognize specific commercials in TV transmissions. Curiously, there are two different faces of the same coin struggling here. On one side, as mentioned above, are the advertisers,

who want to check whether their contracts with the broadcasters have been fulfilled, i.e., guaranteeing clauses like "which", "when" and "how many times" some commercials shall be broadcast. On the other side are the viewers, who typically wish to eliminate the transmitted commercials from their recorded TV programs or even from real-time programs. This conflict has a critical role in the whole advertisement business due to its impact on the TV business model, particularly when considering how the automatic detection and identification of TV commercials is used. Besides, in an imposed neutral position are the regulators, whose work is to guarantee the compliance of all stakeholders with the law and to manage potential conflicts.

From the video content point of view, TV commercials are a special type of content, with some rather particular characteristics that aim to capture the viewers' attention. Naturally, it is also these characteristics that make it possible to automatically detect advertising content. These characteristics can be considered intrinsic - if associated with the advertising content itself, such as high scene cut rates, louder audio volume and a large amount of text in different positions of the scene - or extrinsic - if external to the advertising content itself, such as the inclusion of black frames between consecutive commercials, their duration (usually a multiple of a certain time interval), and the absence of the broadcaster logo. Commercials are always packed and broadcast together in so-called commercial blocks, each containing a certain number of individual commercials. Moreover, their structure depends not only on the country and its relevant legislation, but also on the specific broadcaster, according to its advertising strategy and style.

In recent years, several solutions for the automatic detection and identification of TV commercials have been proposed in the literature. Currently, there are several commercial solutions available to detect, skip and/or permanently remove commercial blocks from real-time broadcast and recorded video streams. There are nowadays not only software tools created by independent developers, but also well-known tools created by big companies [2], such as Windows Media Center [3] (a DVR and media player created by Microsoft), SageTV [4] and MythTV [5] (both open source programs), which allow the detection of commercial blocks (after installing third-party add-ons) and the generation of an output file (in text format) with the events whose detection was required – the so-called "Events Report". Feeding some other programs – e.g., MEncoder – with these text output files allows deleting the commercial segments, thus creating advertising-free content.

1.2 Objectives

As mentioned before, much work has already been developed on the automatic detection and identification of TV commercials, the main topic of this dissertation. However, most of the published work is rather dated, or cannot be directly applied to many of the existing TV broadcasters (including the Portuguese ones), because some of the assumed commercial characteristics are no longer valid; in fact, some of the advertising characteristics change over time due to marketing, aesthetic and legal reasons. In this context, the main objective of this Thesis is to design, implement and assess an improved solution for the detection of commercials when operating on current TV broadcasting content, with emphasis on the Portuguese TV stations, private and public. This shall be done by implementing a mechanism that correctly detects the beginning and the end of each commercial block present in the provided TV content and generates an events report identifying all the detected commercials and the respective occurrence times.


1.3 Main Contributions

The main contributions of this Thesis are related to the way the proposed algorithm deals with the Digital on-Screen Graphics (DoGs) broadcast by TV channels, not only in terms of detection but also by being able to distinguish whether or not they are TV channel logos. The proposed algorithm presents good results in detecting static logos, logos with dynamic texture and also semi-transparent logos. It is shown that other difficult cases, such as logos having low contrast with the background, may also be correctly detected; finally, non-static logos may, in specific conditions, also be identified as such. In addition, this Thesis proposes two methods to distinguish a TV channel logo from other types of logos; namely, it is shown how to distinguish DoGs used by some companies in their advertisements from DoGs used by broadcasters in some specific programs. An optimized way to manage the database with the different DoGs that are broadcast and detected along the time is also proposed, allowing the categorization of each one of them and the distinction of Regular Programs from Commercial Blocks.

1.4 Thesis Outline

This Thesis is organized in six chapters, with this first one introducing the work in terms of context, motivation and main objectives. Chapter 2 presents a review of the most relevant concepts in the technological context associated with this Thesis: first, the legal framework for TV advertisement in Europe and Portugal is summarized; next, the usual structure of a TV commercial block is presented; finally, the typical characteristics of a TV commercial – both intrinsic and extrinsic – are described. Chapter 3 briefly reviews the state-of-the-art on techniques for the automatic detection of TV commercials; besides presenting the basic technical approach followed by each selected method, their main strengths and limitations are also identified. Chapter 4 presents the solution proposed and implemented for TV commercials detection, notably its overall architecture and the details of all the processing modules. Chapter 5 describes the performance assessment methodology, the test conditions and the metrics used to evaluate the proposed algorithm's performance; the assessment results are also presented and discussed. Chapter 6 concludes this Thesis with a summary and suggestions for future work.


Chapter 2 TV Commercials: Legal Framework and Characterization

In this chapter, several concepts related to the topic of this Thesis are introduced and discussed. First, the legal framework used to regulate the broadcasting of TV commercials in Europe, and more specifically in Portugal, is addressed. Next, the typical structure of a commercial block is presented, since it is a critical element for the following work. Finally, two different types of commercial characteristics - intrinsic and extrinsic - which are instrumental for the correct detection of individual commercials within a commercial block, are described.

2.1 Legal Framework

In this section, some relevant legal issues arising from the European Union and Portuguese legislation are presented.

2.1.1 Advertising Legal Framework in the European Union

On 11 December 2007, the European Parliament and the Council of the European Union (EU) amended the previous Council Directive 89/552/EEC to establish a Directive [6] concerning TV broadcasting activities. This document defines a set of rules and laws which must be implemented and adopted by each EU Member State (MS). Later, in 2010, a new Directive [7] concerning the provision of audiovisual media services gathered the original version and its amendments in one single version; it is this last document that is now in effect. The information relevant for the purposes of this Thesis is limited to Articles 19 and 23 of [7]. Article 19 establishes that "Television advertising and teleshopping shall be readily recognizable and distinguishable from editorial content (…)" and that "television advertising (…) shall be kept quite distinct from other parts of the program by optical and/or acoustic and/or spatial means." Clearly, this first article requires the insertion of some video and/or audio elements which make it obvious to the viewers that specific TV content has advertising purposes. Naturally, these elements may also be exploited for the automatic detection of commercials. Article 23 establishes that "The proportion of television advertising spots and teleshopping spots within a given clock hour shall not exceed 20 %.", which limits the total advertisement time per hour.

2.1.2 Legal Framework for Advertising in Portugal

In 2011, a whole new TV regulation, well known as "Lei da Televisão" [8], was developed in Portugal, since the previous legislation no longer matched the technical evolution that had occurred over the years, and [7] had imposed some new standards and rules. In particular, the above mentioned Articles 19 and 23 from [7] were translated and included in [8] as Article 40-A and Article 40 - Point 1, respectively. However, the Portuguese law introduces some specific rules, extending the European Community directives. For example, Article 40-A states that a commercial block must be clearly identified and distinguishable from the remaining programs using optical and acoustic means, with the word "Publicidade" (the Portuguese word for "Advertisement") in the initial separator; this text element is an innovation over the directives

defined in [7]. Another relevant novelty, compared to the directives in [7], is the requirement established by the Portuguese law that the insertion of TV commercials should not increase the audio volume, compared to the remaining programs. This restriction has a particular importance for this work, as analyzing the background sound, particularly if music, and checking the volume difference between different content elements, e.g., a commercial and a regular program, are procedures often implemented to detect the commercial block. Table 2.1 summarizes the relevant Portuguese rules, notably in comparison to their European counterparts.

Table 2.1 - Comparison between EU and Portuguese rules for TV commercials broadcasting.

Maximum share of commercial time: EU – 20% per hour; Portugal – 20% per hour.
Commercial block identification: EU – clearly distinguishable by optical and/or acoustic and/or spatial means; Portugal – clearly distinguishable by optical and/or acoustic and/or spatial means, with the word "Publicidade" appearing in the initial separator.
Sound volume difference between commercials and regular programs: EU – not specified; Portugal – the volume must be the same.

2.2 Typical Structure of a Commercial Block

The typical structure of a commercial block in Portugal contains five main elements, as presented in Figure 2.1:
• Initial commercial block separator, notably containing the word "Publicidade".
• Commercials.
• Broadcaster self-promotion (advertising referring to the broadcaster's own products, services, programs or channels).
• Institutional commercials (promotional messages aimed at creating an image, building goodwill or advocating the philosophy of an organization), specific to the public TV channels.
• Final commercial block separator.

Figure 2.1 – Typical structure of commercial block in the Portuguese TV channels.

By direct observation, it was concluded that no regular pattern is used for the insertion of the various elements in the commercial block structure, i.e., they may appear in any order and may be repeated. As mentioned above, any Portuguese commercial block must start and end with a visual and acoustic separator, with the word "Publicidade" included in the initial one. These separators are transition sequences between the regular transmission and the commercial block itself; in these separators, the colors used are generally associated with the TV channel, but they can also change according to the time of the year (e.g., at Christmas); their maximum duration is five seconds (most are shorter than three seconds). It was also noticed that, for the Portuguese TV channels, there is a clear predominance of commercials with a time duration


multiple of five seconds, which may be explained by the way TV stations sell their time, as the price tables of the Portuguese TV channels are commonly based on five-second slots. Nearly 100% of the commercials last between 5 and 60 seconds; the exceptions have a duration of 120 seconds. As in many other countries, the TV channel's logo is removed during the commercial blocks, and inserted again after the final commercial block separator.

2.3 Intrinsic Characteristics

The intrinsic characteristics are those specifically related to the process of making a commercial, notably its content elements, in which several advertising and marketing techniques are applied. Some features used to attract the viewers' attention can be analyzed and used to detect the presence of commercials; for this, well-defined mathematical features with a high power to distinguish commercials from regular programs are measured.

2.3.1 High Scene Cut Rates

The use of high scene cut rates in commercials is a well-known method adopted to stimulate the viewers' attention. A motionless, stationary commercial will hardly ensure as much awareness as a commercial full of action and multiple transitions, colors and movements. Video transitions can be divided into three main types:
i. Hard cuts - instantaneous shot transitions.
ii. Fades - a visual effect in which an image is replaced by a uniform area (fade-out), or vice-versa (fade-in).
iii. Dissolves - a gradual transition from one shot to another.
The difference between fades and dissolves is illustrated in Figure 2.2. Although different types of transitions require different detection processes, it is their rate that most impacts the viewer's attention, due to their higher predominance compared to regular programs. As this is an important criterion to define the commercial boundaries, this characteristic has already been exploited in several works [13][14][25]–[28] to improve the commercial block detection rate. Subsection 3.1.2 presents some of the techniques used to measure this feature.

Figure 2.2 - Examples of (A) fading, fade-out first, then fade-in [15]; (B) Dissolving [16].

2.3.2 Text Presence

As the advertisers' main goal is to place the product they are selling in the consumers' minds, providing key information clearly and in a short time is essential; thus, the number of frames with text is much higher than in regular programs (see Figure 2.3). Besides the frequent presence of text in this kind of video content, another relevant characteristic is that it may appear in any part of the displayed image – and even move around it – which can be exploited for detection purposes with appropriate tools. While regular programs may also include a significant amount of text, notably translation captions, their position and font are constant over time. Therefore, these are factors to take into account when developing an algorithm to detect the presence of commercials. Several algorithms have been proposed to accurately detect text in images and video frames [17]–[20].

Figure 2.3 - Example of a TV commercial with text in different places and with different fonts.

2.3.3 Audio Jingles

Most commercials contain background music. To make the commercials more interesting and attractive, brands not only complement the commercials with music to set a specific atmosphere, but also use jingles specifically created for a particular product or for the brand itself (some of them very famous in specific countries). Several commercial detection solutions have been developed considering this additional information, essentially using features to distinguish audio and speech [12][24][25][27].

2.3.4 Audio Level

Another common solution to raise the viewer's attention is to increase the audio volume during the commercial blocks. This characteristic has also been used in previous works to help detect the commercial boundaries [11]. However, as mentioned in Section 2.1 (see also Table 2.1), the Portuguese legal framework does not allow increasing the audio volume during commercials (and the same or a similar rule is applied in other countries), which means that this feature cannot be exploited for the detection of commercials in the Portuguese TV channels. A real advantage of the audio level is related to the evidence that the delimiting black frames (from now on referred to as BFs) are usually accompanied by silence [23]; this association is a potentially relevant clue to determine the limits of each commercial [24]. Subsection 3.1.1 presents a solution exploiting this combination of features.

2.4 Extrinsic Characteristics

The extrinsic characteristics of a commercial are those not related to the commercial's message and content, nor to the advertising techniques themselves. These characteristics are normally related to the structure and composition of the commercial block, e.g., temporal duration and insertion of black frames. Some of the most relevant extrinsic characteristics are described in the following.

2.4.1 Commercial Block Separator

A commercial block separator is an audiovisual sequence that introduces or finalizes a commercial block. It has a short duration, it is used to flag the beginning and the end of a commercial block, and its use is mandatory according to the European directives. The colors used are usually related to the station's logo colors, and the initial separator may contain the word "Publicidade", which is mandatory in Portugal (see Figure 2.4). The initial and final separators are

not necessarily equal, but they are typically quite similar, both in terms of audio and visual content. This feature is naturally an important hint for developing tools that identify the limits of commercial blocks.

Figure 2.4 - Example of RTP (Portuguese public television) initial commercial block separator.

2.4.2 TV Channel Logo

The presence or absence of the channel logo in one of the screen corners is another important feature for the detection of the commercial block boundaries [25]–[28]. In fact, the channel logo is typically suppressed during the commercial block, making it easier to detect its temporal limits. This is a widespread characteristic, observed not only in the Portuguese TV channels but also in the USA and most European countries.

Figure 2.5 - TV Channel Logo present in the top left corner.

TV channel logos (see an example in Figure 2.5) may be divided in three categories: opaque, transparent and animated [28]. Intuitively, it seems obvious that opaque logos are the easiest to detect, since their pixel values remain constant along time. By comparing the detected logo mask with some previously known channel logos, it should be easy to detect the presence or absence of a logo. However, this approach reveals some problems. First, it is very sensitive to the logo position; for example, even small misalignments of about 1-2 pixels, frequent due to synchronization issues, cause errors in the process. Also, some apparently opaque logos are actually transparent, a fact that may lead to low detection performance. Another relevant issue is the logo change that some TV stations make in special events or seasons, like Christmas and New Year's Eve. A well-designed logo detection system should take these variations into account and should be prepared to adapt itself and to learn new logos, building and developing a database able to support the detection with no performance loss. Some methods designed to overcome these problems will be reviewed in Subsection 3.1.4; a simple shift-tolerant matching strategy is sketched below.
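As a simple illustration of how such small misalignments can be tolerated, the following Python sketch compares a detected logo mask against a stored one over all shifts of up to a few pixels and keeps the best overlap; the function name, the intersection-over-union score and the search radius are illustrative assumptions, not a method taken from the cited works.

```python
import numpy as np

def best_overlap(detected: np.ndarray, stored: np.ndarray, r: int = 2) -> float:
    """Best intersection-over-union between two binary logo masks,
    searching over all shifts of up to `r` pixels (borders wrap)."""
    best = 0.0
    for dy in range(-r, r + 1):
        for dx in range(-r, r + 1):
            shifted = np.roll(np.roll(stored, dy, axis=0), dx, axis=1)
            union = (detected | shifted).sum()
            if union:
                best = max(best, (detected & shifted).sum() / union)
    return best  # close to 1.0 when the masks match up to a small shift
```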

2.4.3 Black Frames

Black frames (BF), also known as dark monochromatic frames, are a classical hint used to detect the limits of a single commercial [13][14][24], as this type of frame is commonly inserted at the beginning and at the end of each commercial to clearly flag it. Although this detection method is still in use, not every broadcaster uses BFs to delimit the commercial blocks [21], and other parts of the video content may also contain BFs, particularly in the middle of some commercials. Methods designed using this feature will be reviewed in Subsection 3.1.1.

2.4.4 Time Duration

The time duration of a commercial is difficult to guess a priori, as multiple values are adopted. However, even though most commercials in the Portuguese TV channels have a time duration in the range of 5 to 60 seconds (with a particular incidence of 15, 30 and 45 second commercials), there is a common characteristic in the individual commercial time duration: it is usually a multiple of five seconds. This may be an important characteristic to take into account when detecting the boundaries of the various commercials. It is also possible to obtain, by direct observation, statistics of the commercial duration, e.g., the average duration of a commercial block and of a single commercial and their variance, which may help when setting thresholds and selection criteria [24].

2.4.5 Commercials Repetition

A single TV commercial may be broadcast several times in a single commercial block, during a day, a week or a month, and there are some special cases of popular commercials broadcast over the years. This means that, in a broadcast video stream with enough temporal duration, any TV commercial is inevitably repeated; this has led to some technical approaches to the identification of TV commercials based on this feature [29]–[34]. Section 3.2 reviews several repetition-based detection schemes exploiting this characteristic.


Chapter 3 Overview of TV Commercials Detection Schemes

In this chapter, the most relevant solutions in the literature targeting the detection of TV commercials are presented. There are two main different approaches, knowledge-based detection and repetition-based detection, which will be presented and discussed in the following.

3.1 Knowledge-based Detection

The knowledge-based schemes for the detection of TV commercials are those based on a priori knowledge of specific characteristics, either intrinsic or extrinsic, e.g., the insertion of black frames or a higher amount of text. In practice, these methods tend to use both intrinsic and extrinsic characteristics simultaneously; several combinations of characteristics have been exploited with appropriately designed and tested algorithms, as presented in the following.

3.1.1 The First Steps - Black Frames and Silence

In 1997, Lienhart et al. [13] released one of the most important works in the area of the detection and recognition of TV commercials. For the first time, an algorithm with this particular goal was created and tested with promising results, making it a reference work even nowadays. The algorithm makes a set of assumptions based on observations from German TV and considers two types of features: the directly measurable ones – including the average duration of the commercial blocks/individual commercials and the presence of BFs separating each commercial (i.e., extrinsic characteristics) – and the indirectly measurable ones – related to the human perception of motion and energy in a commercial (i.e., intrinsic characteristics). The most intuitive starting point chosen in [13] is the detection of dark monochrome frames, thus exploiting the evident hint they bring. The algorithm proceeds as follows:
1. Feature Extraction - This step considers two phases:
a. First, the standard intensity deviation, $\sigma_F$, of the pixels of each frame is computed, as this deviation should be zero for a perfect monochrome frame; to provide some resilience to noise, a small threshold, $t_{MC\sigma}$, is adopted.
b. Second, the average intensity, $\mu_F$, of the pixels of each frame is computed; to detect the black frames, a threshold $t_{MC\mu}$ is used.

$$\sigma_F = \sqrt{\frac{1}{N}\sum_{n=1}^{N}\left(I_n - \mu_F\right)^2} \quad (3.1) \qquad \text{and} \qquad \mu_F = \frac{1}{N}\sum_{n=1}^{N} I_n \quad (3.2)$$

2. Frame Classification – Then, each frame is classified according to the rule expressed by (3.3):

$$FrameClass(\sigma_F, \mu_F) = \begin{cases} \text{dark monochrome frame} & \text{if } (\sigma_F \le t_{MC\sigma}) \wedge (\mu_F \le t_{MC\mu}) \\ \text{non-dark monochrome frame} & \text{if } (\sigma_F \le t_{MC\sigma}) \wedge (\mu_F > t_{MC\mu}) \\ \text{frame} & \text{in any other case} \end{cases} \quad (3.3)$$

In the formulas above, $I_n$ represents the intensity of the n-th pixel and $N$ is the total number of pixels in each frame. A minimal sketch of this classification rule is given below.
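The following minimal Python sketch expresses the rule (3.1)–(3.3), assuming 8-bit grayscale frames given as NumPy arrays; the threshold values are illustrative placeholders, not the values tuned in [13].

```python
import numpy as np

# Illustrative thresholds (the values used in [13] are not reproduced here).
T_MC_SIGMA = 10.0  # tolerance on the standard intensity deviation (noise resilience)
T_MC_MU = 40.0     # maximum mean intensity for a "dark" frame

def classify_frame(frame: np.ndarray) -> str:
    """Classify a frame according to rule (3.3)."""
    mu = frame.mean()    # average intensity, eq. (3.2)
    sigma = frame.std()  # standard intensity deviation, eq. (3.1)
    if sigma <= T_MC_SIGMA and mu <= T_MC_MU:
        return "dark monochrome frame"
    if sigma <= T_MC_SIGMA:
        return "non-dark monochrome frame"
    return "frame"
```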


Lienhart et al. tested this algorithm with seven different German TV video sequences, obviously including the respective commercial blocks; they concluded that no commercial block was missed and that 99.98% of the BF sequences identified as potentially belonging to commercial blocks were indeed part of a commercial block. However, about 15% of the overall commercial blocks length was missed, notably commercial block introductions, broadcaster self-promotions, previews and the first and last commercials, because these elements were not separated by BFs; this shows that this method alone is not enough to achieve a high performance result. Later, in 2002, Sadlier et al. [24][35] developed a different method to detect BFs. This solution includes two main stages, the first associated with BF detection and the second corresponding to silence detection, as follows:
1. BF Detection – this operation is divided in three steps:
a. Frame Division - An MPEG-1 video frame is divided into slices, which are subdivided into macroblocks (MBs). Assuming a chrominance subsampling of 4:2:0, each MB contains 6 blocks of 8x8 samples transformed by a 2D-DCT, two of which provide chrominance information while the remaining four blocks provide luminance information.
b. Information Filtering - The four luminance blocks (also referred to as Y-blocks) of each MB provide the most relevant information, since each includes a DC coefficient representing the mean luminance intensity. The AC coefficients are ignored as they correspond to higher frequencies.
c. Decision Making - A threshold is applied to decide whether a frame is black or not. This threshold was obtained by trial and error examination of various commercial blocks.
2. Silence Detection - To complement and improve the solution, Sadlier et al. also designed a silent-frame (SF) detection method. This method is motivated by the observation that the frames separating commercials are not only black but also tend to be silent (mute, in the limit), which leads to the conclusion that the simultaneous occurrence of a BF and an SF is a strong indicator of the presence of a commercial block. The proposed SF detection method consists of the following steps:
a. TV audio signal conversion to .mp2.
b. Audio Power Level Computation – For each frame time period, the Audio Power Level (APL) is computed through the sum of the scale factors of the audio samples. A threshold to classify a frame as silent or not was empirically defined.
A set of conditions, established after empirical observation of several commercial blocks, is defined to minimize the number of false positives (i.e., frames that, although not part of commercials, are identified as such) (see Figure 3.1):
i. At least six consecutive BF/SF frames are required to consider them a potential separation between individual advertisements.
ii. The distance between two consecutive BF/SF series must be within a time window of 90 seconds.
iii. At least 4 consecutive series of BF/SF are required to consider a commercial block as such.
A sketch combining these conditions is given after Figure 3.1.

Figure 3.1 - Conditions imposed by Sadlier et al. to detect commercial blocks by using BF/SF series (image from [23]).
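The three conditions above amount to a simple temporal grouping of BF/SF runs. The following hedged Python sketch illustrates one possible combination, assuming per-frame boolean flags and a known frame rate; the function name and the exact grouping logic are illustrative assumptions, not the implementation of [24][35].

```python
def find_commercial_blocks(is_black, is_silent, fps=25.0,
                           min_run=6, max_gap_s=90.0, min_series=4):
    """Group BF/SF runs into candidate commercial blocks (conditions i-iii)."""
    # Condition i: runs of at least `min_run` consecutive BF/SF frames.
    flags = [b and s for b, s in zip(is_black, is_silent)]
    runs, start = [], None
    for i, f in enumerate(flags + [False]):  # trailing sentinel closes the last run
        if f and start is None:
            start = i
        elif not f and start is not None:
            if i - start >= min_run:
                runs.append(start)
            start = None
    # Conditions ii and iii: consecutive runs closer than `max_gap_s` seconds
    # form a series; a series with at least `min_series` runs is a block.
    blocks, group = [], []
    for r in runs:
        if group and (r - group[-1]) / fps > max_gap_s:
            if len(group) >= min_series:
                blocks.append((group[0], group[-1]))
            group = []
        group.append(r)
    if len(group) >= min_series:
        blocks.append((group[0], group[-1]))
    return blocks  # start frames of the first and last BF/SF runs of each block
```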


Using this method, Sadlier et al. got a 100% precision¹, which means that not a single second was falsely identified as commercial content. However, a recall² of 89.3% was obtained, meaning that not all the commercial time was identified as such. These results are quite acceptable for the TV channels which adopt this BF/SF flagging system. However, according to [24], for the channels not adopting this flagging scheme the results are not interesting at all, since the basic assumption is not fulfilled. Considering the low associated complexity, it is usually a good starting solution to detect the BF/SF pairs, naturally when the corresponding assumption is valid.

3.1.2 Going Deeper – Cut Rates

As stated in Subsection 2.3.1, using high cut rate scenes is an effective method to catch the viewer's attention. Figures 3.2 and 3.3 show how the action level (in terms of cuts per minute) differs between commercials and regular broadcasting, highlighting the importance of the methods presented in this subsection. In terms of computational cost, although the complexity is higher than for BF/SF detection, it is still not a major burden.

Figure 3.2 – Comparison of the Cuts per Minute metric for a commercial block and a movie [13].

Figure 3.3 - Comparison of the Cuts per Minute metric for a commercial block and a newscast [11].

Lienhart et al. [13] improved the detection performance of their first solution by also detecting the presence of hard cuts and fades, common in commercials. An effective way to detect hard cuts is by thresholding the difference between the color histograms of successive frames [36]. The process proceeds as follows:
1. Frame Color Histogram Computation - A 64-bin color histogram is computed for each frame; only the two most significant bits of each color component are considered, and the histogram is normalized by the number of pixels in the frame.

¹ Precision = 100 × (length of commercial blocks spotted − number of seconds missed) / (length of commercial blocks spotted − number of seconds missed + number of seconds falsely identified).
² Recall = 100 × (length of commercial blocks spotted − number of seconds missed) / (length of commercial blocks spotted).

2. Color Histogram Differences Computation - The color histogram differences between each two consecutive frames is calculated; if this difference exceeds a certain threshold, , a shot boundary is detected. 3. Hard Cut Type Discrimination – A new threshold, , is applied. This step 𝑡𝑡𝐻𝐻𝐻𝐻𝐻𝐻𝑑𝑑𝑑𝑑𝑑𝑑𝑑𝑑 establishes the difference between a weak hard cut and a strong hard cut. The idea behind 𝑡𝑡𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆−𝐻𝐻𝐻𝐻𝐻𝐻𝐻𝐻𝐻𝐻𝐻𝐻𝐻𝐻 it is that since commercial blocks contains a set of non-related commercials, strong hard cuts between them are expected. Thus, not only commercials blocks can be more easily identified with , but it may also be a clue for detecting individual commercials themselves. 𝑡𝑡𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆−𝐻𝐻𝐻𝐻𝐻𝐻𝐻𝐻𝐻𝐻𝐻𝐻𝐻𝐻 This differentiated concept for each kind of cut is used in other works. For example in [11], Chen et al. propose a similar algorithm, with just a slight variation: the comparison is made between the (f-1)-th and the (f+1)-th frames and not between consecutive frames. Being the color histogram of frame f and M the total number of histogram bins, the average histogram 𝐻𝐻𝑓𝑓 difference, D(f-1,f+1), is obtained by (3.4): ( ) ( ) =1 +1 1 (3.4) ( 1, +1) = 𝑁𝑁 . ∑𝑖𝑖 �𝐻𝐻𝑓𝑓 𝑖𝑖 −𝐻𝐻𝑓𝑓− 𝑖𝑖 � In [9] Colombo et al. designed𝐷𝐷 𝑓𝑓− a𝑓𝑓 different way to𝑁𝑁 detect video hard cuts (in this case no distinction between different kinds of cuts is done). The process is the following: 1. Frame Division - Each frame is divided into 9 segments, and each segment is represented by its HSI (hue, saturation, intensity) color histogram. Actually, only H and S are taken into account to reduce the influence of the lightning conditions. 2. Frames Differences – The cut detection is performed by analyzing the differences between two consecutive frames. For this, the volume, , of the differences between the j-th segment histogram ( , ) of two successive frames, i 𝑗𝑗and i+1 is obtained: 𝑣𝑣𝑖𝑖

ℋ 𝐻𝐻 𝑆𝑆 = ( , ) ( , ) . (3.5) 𝑗𝑗 𝑗𝑗 𝑗𝑗 𝑖𝑖 𝑖𝑖 𝑖𝑖+1 3. Thresholding - A cut𝑣𝑣 in frame� ��𝐻𝐻 i is 𝐻𝐻detected𝑆𝑆 − 𝐻𝐻 if the𝐻𝐻 average𝑆𝑆 � 𝑑𝑑𝑑𝑑 𝑑𝑑value𝑑𝑑 of for its nine segments is above a pre-defined threshold. 𝑗𝑗 𝑣𝑣𝑖𝑖 In [13], Lienhart et al. also considers fades in determining the video cut frequency. Since a fade is a gradual shot transition from (fade-in) or to (fade-out) a monochrome frame, the following conditions are verified: i. Either the first or the last transition frame presents a standard luminance intensity deviation, , close to zero, while the other end point exhibits a larger . ii. Between the first and last frames, presents a monotone increasing/decreasing behavior. 𝜎𝜎𝐹𝐹 𝜎𝜎𝐹𝐹 Additionally, a set of conditions and𝐹𝐹 thresholds for some parameters were defined, e.g.: (i) 𝜎𝜎 the minimum amount of frames to consider a sequence as a potential fading event (Lienhart et al. defined it as 10); (ii) either in the first or last frame the value of has to represent a monochrome frame. Lienhart et al. also concluded that for commercial blocks the fade rate was 𝜎𝜎𝐼𝐼 0.5 fades per minute, while for films the fade rate obtained was 0.02 fades per minute; these results legitimize the use of this feature. In [37] Feng and Neumann uses the edge-based contrast (EBC), designed to detect shot boundaries delimited by dissolve, the most difficult video transition case to deal with. The algorithm is based on an edge map for which EBC is computed to assess how strong the edges in each frame are. Also for detecting dissolves, a complex and original method based on corner statistics was suggested by Colombo et al. [9]. The assessment of the individual metrics presented so far is difficult since they are typically integrated in algorithms with other metrics and tested with different video contents. In [13], some

In [13], some results are presented based on the "Cuts per Minute" feature: a false positive detection rate of 0.09% and a detection of about 96.14% of the total commercials in the video are reported.
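To make the histogram-based approach concrete, the following minimal sketch implements the frame comparison of (3.4) and the two-threshold cut discrimination; it assumes pre-computed per-frame histograms as numpy arrays, and the names t_hardcut and t_strong are illustrative stand-ins for $t_{HardCut}$ and $t_{Strong-HardCut}$, whose values are not specified here:

```python
import numpy as np

def average_histogram_difference(h_prev, h_next):
    # Equation (3.4): average absolute bin-wise difference; N = number of bins
    diff = np.abs(h_next.astype(np.float64) - h_prev.astype(np.float64))
    return diff.sum() / len(h_prev)

def detect_hard_cuts(histograms, t_hardcut, t_strong):
    # Compare frames f-1 and f+1 (the Chen et al. variation) and label each
    # detected boundary as a weak or a strong hard cut
    cuts = []
    for f in range(1, len(histograms) - 1):
        d = average_histogram_difference(histograms[f - 1], histograms[f + 1])
        if d > t_strong:
            cuts.append((f, "strong"))
        elif d > t_hardcut:
            cuts.append((f, "weak"))
    return cuts
```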

3.1.3 Motion Analysis

Some authors have referred to the study of motion within a shot as a reliable parameter to assess its action level. A commercial can be distinguished from other video content by comparing not only the cut rate (as stated in Subsections 2.3.1 and 3.1.2) but also the action level within each shot, which is generally higher for advertising content. By associating these two criteria, higher confidence levels can be achieved when classifying content as commercial or non-commercial. For motion analysis, several features have been proposed. The most referred to in the literature is the Edge Change Ratio (ECR), proposed by Zabih et al. [38]. In [13], Lienhart et al. compute the ECR as follows: let $\sigma_n$ be the number of edge pixels in frame n, $X_n^{in}$ the number of entering edge pixels in frame n and $X_{n-1}^{out}$ the number of exiting edge pixels in frame n-1. Then the ECR is computed as:

$ECR_n = \max\left( \frac{X_n^{in}}{\sigma_n}, \frac{X_{n-1}^{out}}{\sigma_{n-1}} \right)$   (3.6)

This metric expresses structural changes in the shot, notably by measuring the dynamics exhibited within each video shot. This feature is conceived to analyze moving objects and fast camera operations. In [37], Feng and Neumann also use the ECR, but another ratio, the edge stable ratio (ESR), is considered as well. The ESR is the ratio between the number of preserved edge pixels and the total number of edge pixels in adjacent frames. If $X_f$ and $X_{f-1}$ are the sets of edge pixels in frames f and f-1, respectively, the ESR is given by:

$ESR = \frac{\left| X_{f-1} \cap X_f \right|}{\left| X_{f-1} \cup X_f \right|}$   (3.7)

A complementary ratio, called rhythm, $r(f_1, f_2)$, is proposed by Colombo et al. between two frames $f_1$ and $f_2$, where #cuts and #dissolves are measured in the same interval:

$r(f_1, f_2) = \frac{\#cuts + \#dissolves}{f_2 - f_1 + 1}$   (3.8)

Another well-known feature related to action is fast object movement, which can be assessed by the so-called Motion Vector Length (MVL) [13]. The method is similar to a motion compensation algorithm used by MPEG encoders, called "Exhaustive Search Method", and proceeds as follows:
1. Frame Division – Each single frame of the video is divided into macroblocks (MBs) of 16×16 pixels.
2. Frames Comparison – A matching operation between each MB, mb(x, y), of a frame and each possible position of the MB in the next frame, mb'(x', y'), within an area of p×p pixels around the original location, is performed.
3. MVL Generation – The expected result is an MVL whose length is the distance between the positions of a block in two consecutive frames. As commercials usually contain much more bustling content than other types of video, higher MVL values are expected in TV commercials.
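As an illustration, the ECR of (3.6) can be approximated with OpenCV as sketched below; the Canny thresholds and the dilation radius are illustrative choices, not values taken from [13] or [38]:

```python
import cv2
import numpy as np

def ecr(frame_prev_gray, frame_curr_gray, dilate_radius=2):
    # Edge maps of both frames (Canny thresholds are illustrative)
    e_prev = cv2.Canny(frame_prev_gray, 100, 200) > 0
    e_curr = cv2.Canny(frame_curr_gray, 100, 200) > 0
    sigma_prev, sigma_curr = e_prev.sum(), e_curr.sum()
    if sigma_prev == 0 or sigma_curr == 0:
        return 0.0
    # Dilate each map so that slightly shifted edges are not counted as changes
    k = np.ones((2 * dilate_radius + 1, 2 * dilate_radius + 1), np.uint8)
    d_prev = cv2.dilate(e_prev.astype(np.uint8), k) > 0
    d_curr = cv2.dilate(e_curr.astype(np.uint8), k) > 0
    x_in = np.logical_and(e_curr, ~d_prev).sum()   # entering edge pixels, X_n^in
    x_out = np.logical_and(e_prev, ~d_curr).sum()  # exiting edge pixels, X_{n-1}^out
    # Equation (3.6)
    return max(x_in / sigma_curr, x_out / sigma_prev)
```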

3.1.4 Logo Detection

The absence of TV channel logos during the commercial blocks is another characteristic that can be exploited for the automatic detection of commercials, and several methods have been developed with interesting results. In [25], Glasberg et al. propose a Static-Area Descriptor for logo detection. The algorithm proceeds as follows:


1. Logo Zone Identification – The four areas in the corners of the frames, where the TV logos are usually displayed, are scanned. The brightness values of these areas in the first frame (of a processing window with N frames) are saved.
2. Frames Comparison – For each consecutive frame, the current brightness values are compared with the saved ones, and the darkest values of the search areas are kept.
3. Thresholding – When the processing window ends, the final (darkest) brightness values are compared with a threshold and a binary image is generated. If the average brightness value, $\mu_B$, is greater than zero in at least one of the corners, static pixels are detected and a high probability of logo presence exists.
The tests performed by Glasberg et al. show that, out of 98 commercials, 90 non-static areas (i.e., no corner presenting static pixels) were detected. As the remaining 8 commercials had a static company logo, they were not identified as commercial material with this method alone; however, this shows that the logo detection algorithm works when the underlying assumption is valid.
In [28], Albiol et al. propose the following Logo Mask Extraction scheme, based on the idea that a logo exists if there is, in the image, an area with stable contours:
1. Shot Detection and Key Frame Extraction – After detecting the video shot limits, one frame (per shot) is extracted.
2. Logo Zone Detection – One area per frame corner is defined as a potential "logo zone".
3. $S_i$ Computing – Let i be the number of shots processed; the gradient image, $G_i$, of frame $f_i$ is obtained and the gradient temporal average, $S_i$, is computed as:

$S_i = \frac{i-1}{i} S_{i-1} + \frac{1}{i} G_i$   (3.9)

4. Thresholding – A binary mask is obtained by thresholding $S_i$.
5. Filtering – A morphological filtering is applied to the binary mask, in order to fill holes and to reduce spurious pixels; a new processed binary mask, $L_i$, is obtained (see Figure 3.4).


Figure 3.4 – The time averaged gradient, $S_n$ (first row); binary mask, $L_i$, obtained after morphological processing (second row).

Then, two decision operations are applied:
i. If the size of $L_i$ is not large enough to contain a logo, the search is reset.
ii. If the mask remains stable for three minutes, it is considered that a correct mask extraction was performed; otherwise, the search is reset.
The shot labeling (commercial against non-commercial content) of this detection system is modeled using a Hidden Markov Model (HMM) [39], which is a recurrent tool in this research field. For example, in [26], different tests were performed and the best result (an accuracy³ of 98.4%) was obtained using two HMMs (one modeling 'commercial' and the other 'non-commercial' situations) and two visual descriptors: logo detection (which extracts the information of logo occurrence and of separating blocks at the same time) and the already referred hard cut frequency.
The methods described above do not consider the possibility of animated TV logos; Esen et al. [40] and Mikhail et al. [41] were, however, aware of those cases where the TV logo presents motion.

In [40], Esen et al. proposed a method to detect animated logos in real time; the idea is to handle all the frames belonging to the animated logo and process them as a unity. There is a training stage, where the boundaries of the animated logo from each frame are placed in a single set; during detection, a voting-based decision scheme is performed in order to determine the presence of the trained logo, and a set of refinement steps is applied, including the incorporation of negative clues and of a delay for eliminating false positives. Values above 97% are reported for the F1-score measure. In [41], Mikhail et al. propose an interesting approach to detect and remove the TV channel logos from the screen and to replace them with a texture as similar as possible to the actual background; different methods for detecting static and dynamic logos are presented and assessed. Ozay et al. [42] propose a method to automatically detect TV logos and to recognize them, by comparison with a logos database. The logo detection method proposed in this Thesis is based on the usual edge analysis approach, similar to the one performed by Albiol et al. [28], described above.

³ $Accuracy = \frac{\text{Nb. of samples correctly identified}}{\text{Nb. of samples correctly identified} + \text{Nb. of samples missed}} \times 100$, computed over the length of the commercial block samples.
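To illustrate the edge-based approach of Albiol et al., the following sketch folds the gradient magnitude of each new key frame into the running average of (3.9) and derives a candidate mask; the gradient operator (Sobel), the threshold value and the morphological cleaning are illustrative choices:

```python
import cv2
import numpy as np

def update_logo_mask(running_avg, key_frame_gray, i, grad_th=40.0):
    # Gradient magnitude G_i of the i-th key frame (Sobel is one possible operator)
    gx = cv2.Sobel(key_frame_gray, cv2.CV_32F, 1, 0)
    gy = cv2.Sobel(key_frame_gray, cv2.CV_32F, 0, 1)
    g_i = cv2.magnitude(gx, gy)
    # Gradient temporal average S_i, equation (3.9)
    running_avg = ((i - 1) * running_avg + g_i) / i
    # Threshold S_i and clean the binary mask morphologically (mask L_i)
    mask = (running_avg > grad_th).astype(np.uint8) * 255
    k = np.ones((3, 3), np.uint8)
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, k)   # remove spurious pixels
    mask = cv2.morphologyEx(mask, cv2.MORPH_CLOSE, k)  # fill small holes
    return running_avg, mask
```

A caller would initialize running_avg as a float32 array of zeros with the corner-area size and increment i for each extracted key frame, resetting both whenever the decision operations above reset the search.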

3.1.5 Audio Analysis

Since background music is a recurrent characteristic of TV commercials, its analysis appears as one more means to achieve good detection performance. Abrupt transitions in speech are also a useful characteristic, as are the silent moments in the transition between two commercials. Thus, some methods to distinguish and signal these different moments have been proposed. In [43], Panagiotakis and Tziritas propose a speech-music discrimination solution based on computing the root mean square (RMS) and zero crossing rate (ZCR) of audio samples. The algorithm partitions the audio data into segments, classifying each one as speech or music. Its results are promising (92% of music and 97% of speech were correctly identified as such), making this a very popular and highly cited work (e.g., in [11]). In [12], Duan et al. propose an Audio Scene Change Indicator (ASCI). An audio scene can be modeled as a collection of sound sources, some of which dominate the others; an audio scene change (ASC) occurs when there is a change of the dominant sources in the sound. This definition makes it hard to find an ASC transition pattern, because large amounts of samples are required and class labeling is a subjective process. Alternatively, the examination of a distance metric between audio features in two temporal windows is suggested, producing a metric-based quantitative indicator:
1. HMM Training – HMMs are used to train two models describing two dynamic patterns: ASC and Non-ASC.
2. Decision Taking – The model with the highest posterior probability assigns each segment a classification.
An overall accuracy of 87.9% for ASC and Non-ASC has been achieved. In [22], Lu et al. proposed to use machine learning tools, notably a Support Vector Machine (SVM) [35][36], as the audio classification tool. An SVM learns an optimal separating hyperplane that minimizes the likelihood of misclassification, and it can be divided into linear-SVM and kernel-SVM. The second type is recommended for audio data, whose feature distribution complexity can imply some overlapping of audio classes. In this study, each segment is identified as one out of five classes: speech, pure-speech, non-pure speech, environment sound and music. This is achieved by combining two types of features:
i. Mel-Frequency Cepstral Coefficients (MFCCs) – Commonly used in speech recognition systems and audio classification systems;
ii. Perceptual Features.


These two classes of features are normalized and concatenated into a single feature vector. As perceptual features, the selected options may include:
i. Speech/Music Detection – ZCR, short-time energy (the total spectrum power of a frame), sub-band power distribution, brightness and bandwidth – one value for each frame;
ii. Speech/Music/Environment Sounds Differentiation – Spectrum flux (defined as the average variation of the spectrum between two adjacent frames);
iii. Music/Environment Separation – Band periodicity and noise frame ratio (defined as the ratio of noise frames in a given audio clip; a frame is considered a noise frame if the maximum local peak of its normalized correlation function is lower than a pre-set threshold).
The reported average accuracy was 96.36% for speech/non-speech distinction, 94.67% for music/background sound and 89.64% for pure speech/non-pure speech. The results of this work were compared with two other methods: KNN (K-Nearest Neighbor) – used in [46] as a pre-classifier for speech and non-speech discrimination – and GMM (Gaussian Mixture Model); the conclusion is that SVM presents better results for different test sets and testing unit lengths, with the best results obtained when using a testing unit of one second. Some other methods rely only on audio volume measurements, like the Energy Peak Rate (EPR) [21], whose basic assumption is that TV channels increase the audio volume during commercial blocks. As mentioned before, this solution cannot be applied in countries where the audio volume must be the same in both regular programs and commercial blocks.
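As an illustration of the lowest-level audio features mentioned above, the sketch below computes per-frame RMS and ZCR for a mono signal; the frame length of 1024 samples is an illustrative choice, not a value taken from [43]:

```python
import numpy as np

def rms_and_zcr(samples, frame_len=1024):
    # Per-frame root mean square (RMS) and zero crossing rate (ZCR),
    # the two features used for speech/music discrimination in [43]
    n_frames = len(samples) // frame_len
    rms, zcr = [], []
    for k in range(n_frames):
        frame = samples[k * frame_len:(k + 1) * frame_len].astype(np.float64)
        rms.append(np.sqrt(np.mean(frame ** 2)))
        # Count sign changes between consecutive samples
        zcr.append(np.count_nonzero(np.diff(np.sign(frame))) / frame_len)
    return np.array(rms), np.array(zcr)
```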

3.1.6 Text Detection

As stated in Subsection 2.3.2, several text detection methods have been proposed in the literature, boosted by different application areas and including commercials detection. In 2009, Meng et al. [47] proposed an algorithm based on the Maximum Gradient Difference (MGD). The authors claim their algorithm has a fast execution time and is able to deal with difficult cases, such as scenes with highly textured background and scenes with small text. The procedure uses the luminance component and is applied as follows:
1. Maximum Gradient Difference Computation – This step is divided into the following operations:
a. Horizontal Luminance Gradient Computation – A filter with mask [-1, 1] is convolved with the image.
b. MGD Calculation – The difference between the maximum and minimum gradient values (briefly, MGD) is computed at each pixel location, using a one-dimensional window of size n (n = 21 in this work), applied in both the horizontal and vertical directions (resulting, respectively, in a horizontal and in a vertical MGD). This operation exploits the fact that text regions have both large positive and large negative gradients in a local region, due to the even distribution of character strokes, which leads to locally large MGD values.
c. Thresholding – Each horizontal and vertical MGD value is thresholded in order to get a binary image (see Figure 3.5).
2. Dilation and Erosion – These two morphological operations are applied to the binary mask that results from step 1. Dilation is used to pick up potentially missed text pixels from adjacent pixels, to form continuous text segments; erosion eliminates the non-text noisy background pixels within the text segments.
3. Text Blocks Detection – Two methods are used (see Figure 3.6):
a. Flood Fill – This algorithm, which determines the area connected to a given node, is implemented to detect the text blocks; it is a stack-based recursive flood fill, with four directions.


b. Projection – Vertical projection is applied to merge adjacent lines into row segments; the same procedure is applied horizontally, to get column segments.
4. Text Filtering
a. Merging Neighbor Blocks – When the horizontal distance between different text blocks is less than some specified value, the two blocks are merged.
b. Geometric Analysis – The existing blocks are filtered based on their geometry, so that non-text noisy artifacts are discarded.
To evaluate the success of the text detection process, 100 color video images with spread text were used; a recall of 82.21% and a precision of 96.44% are reported. When applying the system to commercials, text detection is performed at intervals of 10 frames. This method was tested together with a "shot change frequency" feature, so it is difficult to say accurately how important each component is in the final result. However, Meng et al. reported better results when applying both features than when applying "shot change frequency" alone, which means that text detection can be an important help in an integrated algorithm.
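Step 1 of the MGD procedure can be sketched as follows, using scipy's 1-D running max/min filters; the threshold value is illustrative, and combining the horizontal and vertical MGD maps with a logical AND is one possible reading of the thresholding step:

```python
import numpy as np
from scipy.ndimage import maximum_filter1d, minimum_filter1d

def mgd_text_mask(luma, n=21, thr=120.0):
    # Step 1.a: horizontal gradient via convolution with the mask [-1, 1]
    luma = luma.astype(np.float32)
    grad = np.zeros_like(luma)
    grad[:, 1:] = luma[:, 1:] - luma[:, :-1]
    # Step 1.b: max-minus-min of the gradient in a 1-D window of size n,
    # applied horizontally and vertically
    mgd_h = maximum_filter1d(grad, n, axis=1) - minimum_filter1d(grad, n, axis=1)
    mgd_v = maximum_filter1d(grad, n, axis=0) - minimum_filter1d(grad, n, axis=0)
    # Step 1.c: thresholding to a binary image (AND of both directions here)
    return ((mgd_h > thr) & (mgd_v > thr)).astype(np.uint8)
```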

Figure 3.5 – Binary image obtained after the thresholding step (1.c) (top right corner); edge detection result (bottom right corner).

Figure 3.6 – The result after the text blocks detection step (see the text boxes in the left image).

In [17], Dimitrova et al. proposed a video classification system based on HMMs, using text and face detection and tracking. Face detection is not particularly important for commercials detection but, as this work intended to distinguish four different kinds of content (news, commercial, sitcom and soap content), an integrated approach was chosen. The text tracking process is built on the assumptions that text may survive cuts (and therefore the tracking should not be restricted to within each shot) and that it may move around the screen. To address these issues, text tracking is accomplished in two independent steps:
a. Frame-by-frame Text Detection – The text detection method referred to above is used on a frame-by-frame basis, and then a model is constructed for each detected text box. The model includes the center location, height and width of the text.
b. Trajectory Extraction – Models over adjacent frames are compared, so that each trajectory contains a series of text models over consecutive frames. This step includes:


i. Initialization – The models on the first frame containing text are considered the "leading" models of the initial set of text trajectories.
ii. Matching – Each model in each frame is compared with the models of the previous frame. If a good match is found, i.e., if the center position, height and width are similar, it is appended to the trajectory of the corresponding previous model; otherwise, the text box is considered to be independent of the previous ones, a new trajectory is created and the model is considered a new "leading" one.
The next phase consists in training and classifying four HMM models, one for each kind of considered content. The overall accuracy of this work (classifying the four kinds of content) is 85.2%, using 26 video clips (one minute long each) applied to both the training and testing sets.

3.1.7 Still Images Detection

In [12], Duan et al. suggest considering Frames Marked with Product Information (FMPI) as a component to take into account when detecting commercial blocks. This solution exploits the evidence that most commercials end with so-called "still shots", corresponding to images with text and references to the product. FMPIs are used to describe images containing visual information about some advertised product or service. As shown in Figure 3.7, three different types of FMPIs can be found: in the first line are the simplest ones, generally with a single color background and with some text referring to a company, a particular product or a phone number; in the examples of the second line, the product is highlighted in the foreground; in the third line, live video, graphics and a strong presence of text are evident. In the proposed method, a sequence of frames is considered an FMPI shot if at least one FMPI frame is identified in the shot; it is used as an indicator of a probable commercial block boundary, due to its usual presence at the end of commercials, as referred above. An FMPI recognizer is constructed by combining texture, edge and color features. This method presents a recall of 88.25% and a precision of 91.00% for the 1046 FMPIs and 2987 Non-FMPIs used in the tests.

Figure 3.7 – Examples of FMPI, each line representing a different type [12].

3.2. Repetition-based Detection

The repetition-based detection methods rely on the fact that each TV commercial is an individual video stream piece that is (or may be) repeated several times along a certain period of time. As stated in Subsection 2.4.5, the same commercial can be transmitted more than once on a single commercial block, a couple of times per day or just once a week, depending on the advertiser’s goals. The durability can vary a lot too: some commercials are on air just for some

days (e.g., to advertise a specific event), some others for months (e.g., to advertise a product), and there are cases of commercials broadcasted over decades: in 2005, the Guinness Book of Records awarded Discount Tire Co. for the world's longest-running TV commercial, first aired in the USA in 1975⁴ (see Figure 3.8).

⁴ http://www.guinnessworldrecords.com/world-records/longest-running-tv-commercial

Figure 3.8 – Discount Tire Co.’s “Thank you!” commercial.

Repetition-based methods are unsupervised and more generic than the knowledge-based ones, which represents an important advantage; however, the associated computational burden is also higher. In order to reduce that cost, repetition-based methods are typically merged with some of the techniques relying on intrinsic and extrinsic characteristics presented in Section 3.1; some works use knowledge-based methods as a pre-filter to define the areas where the repetition-based methods are then applied, thus avoiding heavy operations over the whole stream. In other studies, commercials' specific characteristics are used at the end of the process, to classify the detected repeated material as either commercial or non-commercial. Repetition-based methods can be divided into two categories: those which use databases (also known as libraries in the literature) with known commercials built beforehand (with the ability to learn new commercials that appear later) and those which need no previously constructed database (thus building a database over time). This section is divided into three subsections, each one presenting one work.

3.2.1. Lienhart et al. (1997)

In [13], Lienhart et al. establish a set of concepts that are essential to understand the fundamentals of repetition-based methods, like the concept of fingerprint. A commercial's fingerprint is a representation of a set of relevant features, computed per frame. Fingerprints are defined by:
i. Character – The representation of the value of each feature.
ii. Alphabet – The domain of possible characters.
iii. String – A sequence of characters.
To be effective, a feature used to obtain a fingerprint should meet the following basic requirements:
i. It should be tolerant to slight differences between two fingerprints of the same commercial, broadcasted at different times and by different stations.
ii. Its computation should be simple and fast.
iii. It should have a strong discriminative power.
In order to meet these three specifications, Lienhart et al. chose, as fingerprint, the color coherence vector (CCV). The CCV is similar to the color histogram, but a little more complex and discerning: the major difference is that the CCV not only counts the number of pixels of each color, but also differentiates pixels of the same color depending on the size of the color region they belong to. To do this, a pixel is considered either coherent or incoherent: it is coherent if it belongs to a region (frequently referred to as connected component, CC) larger than a threshold, $t_{CCV}$; otherwise, it is incoherent.


The CCV uses only the two most significant bits of each RGB color component. Finally, for each color i, $\alpha_i$ is the number of coherent pixels and $\beta_i$ is the number of incoherent pixels. A CCV is defined as follows:

$\langle (\alpha_1, \beta_1), \ldots, (\alpha_n, \beta_n) \rangle$   (3.10)

Given a query string A of length P and a longer subject string B of length Q (in the database), the fingerprint matching algorithm is performed according to the following sequence of steps:
1. Query String Fingerprint Computation – A sliding window with a length of L seconds is run over the video, stepping forward from shot to shot; each time, the CCV fingerprint of the window is calculated (note that for the commercials in the database this operation is done beforehand).
2. Approximate Substring Matching – The substring of B that aligns with A with minimal substitutions, deletions and insertions of characters is found in this step. This is done by comparing the window fingerprint obtained in the previous step with the first L+S seconds of each commercial's fingerprint stored in the database.
3. Minimal Distance – The minimal distance between A and B is defined as the minimal number of substitutions, deletions and insertions of characters transforming A into B, in the temporal windows defined in the previous steps.
4. Similarity Classification – Fingerprint sequences A and B are considered identical if the minimal distance between those sequences does not exceed a threshold, $t_{similarity}$, and the ratio between the lengths, P/Q, does not exceed 90%.
5. Decision Making – If two identical sequences are found, the windows are temporally expanded to the whole length of the candidate fingerprint and the two new sequences are compared. If a commercial is found, the sliding window jumps to the end of the recognized commercial; otherwise, it jumps to the next shot.
The experiments were based on 200 commercial spots from several German TV channels and from different transmission times, which provides a diversified set of conditions. The results were very promising: all commercials were recognized and none was missed or falsely detected; in terms of position determination, the difference between the real and the detected locations was only five frames on average. The processing time for each matching operation was, on average, 90% of the real-time commercial video duration, meaning that the whole process could be performed in real time. In this work, Lienhart et al. also proposed a hybrid method: a feature-based approach, using BF/SF and cut rates, is first applied to reduce the number of potential commercial blocks; next, a recognition-based approach is used to identify the exact borders of each individual commercial. A last experiment was conducted by Lienhart et al. in order to find unknown spots automatically. An important and limiting condition is imposed: the database must contain almost every possible individual commercial. Thus, if a new commercial is broadcasted, there is a high probability of it being surrounded by known spots (if not in its first appearance, sooner or later it will happen). Accordingly, the algorithm must find the end of the previous commercial and the beginning of the next one, allowing the exact boundaries of the new commercial to be determined and the new commercial to be inserted into the database.
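A direct (unoptimized) sketch of the CCV of (3.10) is given below; the quantization to the two most significant RGB bits follows the text, while the coherence threshold value and the 8-connectivity are illustrative choices:

```python
import cv2
import numpy as np

def color_coherence_vector(frame_bgr, t_ccv=300):
    # Quantize colors to the two most significant bits of each component (64 colors)
    q = frame_bgr >> 6
    color_idx = (q[:, :, 0] << 4) | (q[:, :, 1] << 2) | q[:, :, 2]
    alphas = np.zeros(64, dtype=np.int64)  # coherent pixels per color
    betas = np.zeros(64, dtype=np.int64)   # incoherent pixels per color
    for c in range(64):
        binary = (color_idx == c).astype(np.uint8)
        n_cc, _, stats, _ = cv2.connectedComponentsWithStats(binary)
        for cc in range(1, n_cc):          # label 0 is the background
            size = stats[cc, cv2.CC_STAT_AREA]
            if size >= t_ccv:              # coherent: CC larger than t_CCV
                alphas[c] += size
            else:
                betas[c] += size
    return list(zip(alphas, betas))        # <(α1, β1), ..., (αn, βn)>
```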

3.2.2 J. M. Gauch and A. Shivadas (2005)

In [48], Gauch and Shivadas propose a procedure to detect commercials without a database constructed beforehand. Three main phases are used to reach the goal: temporal segmentation, repeated video sequence detection and feature-based video sequence classification (either as commercial or non-commercial material):
1. Temporal Segmentation – The monitored video signal is partitioned into individual shots. Temporal variations of the first, second and third color moments (mean, standard deviation and skewness) of the RGB histograms are used to detect cuts, fades and dissolves, yielding shots between 1 and 10 seconds long.


2. Repeated Sequence Detection – This fingerprinting process is divided into the following components:
a. Video Frame Hashing – A hash value, $H(t)$, is calculated for each frame t, by quantizing a color moment vector, $V_k(t)$, to Q levels. Being T the size of the hash table and $S_k = Q^k$ the relative weights for each color moment,

$H(t) = \left( \sum_{k=1}^{9} V_k(t) \cdot S_k \right) \bmod T$   (3.11)

The shot number and $V_k(t)$ are stored for each frame. When a new shot is added to the hash table, a hash table lookup is done to identify video sequences in the archive which are potentially similar to the input video.
b. Video Sequence Filtering – A filtering process (based on the number of frames with the same hash values, the relative lengths of the shots and the mean color moment vectors of each shot) is applied to the potentially similar shots in order to verify their resemblance.
c. Repeated Sequence Validation – The added shot is aligned and once again compared to all the similar shots that remain after the previous step. The corresponding color moment vectors are normalized and, finally, in order to determine if two clips match, the minimum value of their difference is obtained and compared with a predefined threshold.
d. Shots Fusion – Adjacent shots that are repeated in the same order somewhere in the archive are merged (since they represent an identified sequence). By combining multiple 2-3 second shots, repeated sequences of typically 10-60 seconds can be obtained.
3. Video Sequence Classification – This step assigns labels to the retrieved clips, based on their content. Three classification algorithms were tested: KNN, Weighted Binary Classifier Voting and Nearest Centroid Classification, with the first one achieving the best results. Five features, selected from each repeated video sequence and involving temporal and color characteristics, were used to test the algorithms; a relative weight was assigned to each feature, based on its classification accuracy.
The algorithm presented above was tested with 72 hours of TV programming; 575 repeated video clips were manually labeled as either commercials or non-commercials. The best results were obtained with KNN, which achieved an F-measure⁵ of 0.95 for commercial detection and 0.89 for non-commercial detection, making an overall quality measure of 0.92.

⁵ $F = 2 \times \frac{Precision \times Recall}{Precision + Recall}$
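The frame hashing of (3.11) can be sketched as follows; the quantization of the nine color moments and the constants q and table_size are illustrative choices, not values fixed by [48]:

```python
import numpy as np

def frame_hash(frame_rgb, q=16, table_size=100003):
    # Nine color moments: mean, standard deviation and skewness of each channel
    moments = []
    for ch in range(3):
        x = frame_rgb[:, :, ch].astype(np.float64).ravel()
        mu, sigma = x.mean(), x.std()
        skew = ((x - mu) ** 3).mean() / (sigma ** 3 + 1e-9)
        moments += [mu / 255.0, sigma / 255.0, np.tanh(skew)]  # squashed to ~[0, 1]
    # Quantize each moment to q levels (negative skew values clip to level 0)
    v = np.clip((np.array(moments) * q).astype(np.int64), 0, q - 1)
    weights = q ** np.arange(9, dtype=np.int64)   # one possible choice of S_k
    return int((v * weights).sum() % table_size)  # equation (3.11)
```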

3.2.3 Li et al. (2008)

In [34], Li et al. developed a "Confidence Based Recognition System" (CRS) for TV commercial extraction. It combines the ideas of the feature-based approach with those of the recognition-based approach. A new concept, the confidence level, is introduced and computed for each commercial candidate. CRS aims to detect both known and unknown commercials: for the first case, a known commercial library is constructed beforehand; for the second case, a dynamic buffer is built in order to manage possible new commercials. CRS captures the digital TV streams from TV stations and partitions them into small video segments, each piece being a potential commercial. The verification is performed by comparing the potential commercial with the known commercials database, or by checking its occurrences according to the information stored in the buffer. The process occurs as follows:
1. Fingerprint Construction – Each frame of the video stream is partitioned into $W = W_x \times W_y$ windows. Each window is ranked according to its computed average intensity.


2. Video Segmentation – Three commercial features, each one representing a candidate set, are used in order to reduce the number of potential commercial blocks: BFs (feature A), Restricted Temporal Length (feature B) and Scene Change Ratio (feature C). A confidence level is defined for each commercial block candidate, depending on its relationship with the referred features: if a candidate commercial belongs to all of these three sets, its confidence level is 3; if it is in two of them, its confidence level is 2; and so on (see Figure 3.9). The video stream is then segmented according to the confidence level of each stream block.

Figure 3.9 – Confidence Level Assignment.

3. Scene Break Segmentation – The video streams obtained in the previous step are segmented into series of small video pieces, based on scene break changes obtained by comparing the RGB histograms of consecutive frames.
4. Recognizing Known Commercials – The following operations are executed:
a. An array, $\lambda$, of the lengths of the current scene segments is constructed.
b. The length of the fingerprint of each scene in the database is checked against the length of each item in the array $\lambda$ until either:
i. A possible match is found – in this case, the confidence of the commercial block candidate is increased by one; when it reaches a certain threshold, the candidate is judged as a real commercial block and the process ends.
ii. The search over the database ends without a good match having been found – a potential unknown commercial is identified.
5. Recognizing Unknown Commercials
a. A buffer with a fixed size is first created, to maintain the historical blocks that are potential commercials.
b. Given a series of incoming scenes, the new fingerprint of each block is calculated first and then compared with those in the buffer.
c. When the new fingerprint matches a stored fingerprint, an unknown commercial is found and marked; otherwise, the new fingerprint is written into the buffer.
d. The repeated adjacent blocks are merged in order; by combining multiple blocks and checking the confidence level of the candidate block to which they belong, the repeated unknown commercial is found.
e. The merged block (i.e., the newly found commercial) is moved from the buffer to the known commercials database.
To evaluate the effectiveness of CRS, two 48-hour digital TV streams were captured; 5938 unique spots, varying from 5 to 180 seconds, were manually cut and inserted into the known commercials database. For the two 48-hour clips, recalls of 99.2% and 99.5% and precisions of 96.8% and 97.4% are reported.
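The confidence level assignment of Figure 3.9 amounts to counting, for each candidate, how many of the three feature sets it belongs to; a minimal sketch, assuming each set is represented as a Python set of candidate identifiers:

```python
def confidence_level(candidate_id, set_a, set_b, set_c):
    # One point per feature set (black frames, restricted temporal length,
    # scene change ratio) that contains the candidate: levels 0 to 3
    return sum(candidate_id in s for s in (set_a, set_b, set_c))
```

With this representation, confidence_level(c, A, B, C) returns 3 for a candidate belonging to all three sets, matching the assignment shown in Figure 3.9.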


Chapter 4 Proposed Solution: Architecture and Algorithms

This chapter describes the solution designed and implemented to detect commercial blocks in TV content, using only the visual data; its global architecture as well as the functional and algorithmic descriptions of its main building blocks are presented and the rationale behind them is highlighted.

4.1 Learning about Commercials and Logos with Real TV Content

In Chapter 3, several solutions proposed in the literature for the detection of commercial blocks in TV content were reviewed. To get a good insight into the current characteristics of commercial blocks, several video segments (from different channels and with different content) were recorded and inspected by the author of this Thesis. The exhaustive analysis of this content revealed that some of the assumptions about TV commercials accepted in the past are no longer reasonable. Some examples are:
1. Black Frames and Silences (see Subsection 3.1.1) – These factors have no impact nowadays, as transitions between two commercials are now typically abrupt, with a single hard cut and without any BFs or silence between commercials.
2. High Cut Rates (see Subsection 3.1.2) – Although this is still a relevant characteristic, there are more and more commercials with a movie-like filming style, i.e., with long shots and soft transitions.
3. Audio Analysis (see Subsection 3.1.5) – Even though the sound level difference between regular programs and some commercials can be noticed at times, this is not too common. Also, during the development of this Thesis, a new Directive was issued by the Portuguese Media Regulator (ERC) [49] stating that, from June 1, 2016, loudness standardization should be applied during broadcast transmissions. In practice, this rule limits the sound level variation between a regular program and a commercial block. Thus, exploiting this characteristic for commercials detection will no longer be a valid approach.
In Subsection 3.1.4, some commercials detection solutions based on logo detection, available in the literature, were reviewed. However, these solutions are in general too simplistic, as they do not consider the fact that not all logos observed on the TV screen are TV channel logos; in fact, nowadays, other logos can be found in the screen corners, with many different purposes. As people's needs and demands are continuously changing and both broadcasters and advertisers have to adapt themselves to the audience, there is nowadays a recurrently used graphical element known as "Digital on-Screen Graphic" (DoG), also named "bug" in some countries. DoGs are typically placed "in or around a given corner of the viewable area of the video for the entirety of the program" and were created as a way to brand some TV shows with an identity [48][49]; however, their application has been largely extended and DoGs are now also used for commercial purposes and news services. In the following, four common situations in television involving these DoG elements are presented, with the purpose of highlighting how the problem to be addressed may, in practice, become more complicated due to the increased graphical freedom given to the content creators:


1. Commercial brand logo placed on the screen – In the last few years, some studies [52] found that explicit and overly large brand images tend to make people zap away, which is undesirable for advertisers. Thus, advertisers started incorporating the brand logo in their commercials in two common ways, which are shown in Figure 4.1:

(a) (b)
Figure 4.1 – (a) Lamborghini TV commercial screenshot, where a flash of the brand logo can be seen in the lower left corner [53]; (b) Screenshot extracted from a Portuguese TV commercial: the commercial brand logo appears in the upper right corner during the whole commercial.

a. Repeating brief images of a real brand logo – A real logo (not a video-overlaid DoG) appears for some time on real objects which are part of the commercial. A typical example of this approach is the advertising of cars, where the brand logo usually appears only in quick glimpses. Figure 4.1 (a) shows a screenshot extracted from a 2013 Lamborghini car commercial; during this commercial, the brand logo can be seen several times, but always for less than one second each time.
b. Commercial brand logo placed in the corners – The commercial brand logo is located in one of the screen corners. This solution is not too annoying to the viewer, who probably will not zap just because of the DoG presence; however, its presence may still be effective in terms of brand perception and recognition, which is the major goal of the advertiser. Figure 4.1 (b) shows a screenshot of a Portuguese TV commercial for a Portuguese company of wellness products, called "Well's"; the commercial brand logo is present in the upper right corner during the whole commercial. This example invalidates the assumption that a logo in the screen corners always implies that a regular program is on air. Still, commercial brand logos cannot be dissociated from the type of video segments they belong to; thus, by exploiting some well-known TV commercials features, it is possible to conclude about the nature – TV channel logo or commercial brand logo – of a given detected DoG. Examples of relevant features for this analysis are: (i) a TV commercial has a maximum time duration; (ii) no TV commercial is broadcasted more than twice consecutively; (iii) typically, a TV commercial presents a higher shot cut rate than any other content; and (iv) the fraction of time a commercial brand logo is on air is much lower than for a TV channel logo.
2. Program/Series branding – Nowadays, many regular programs, such as TV series, dramas, sitcoms and even news programs, may contain their own program brand logo, which is usually the program name with stylized letters in one of the screen corners (see Figure 4.2).


Figure 4.2 – Screenshot from a Portuguese TV series: in the upper left and upper right corners, the TV channel logo and the TV series logo, respectively.

3. Additional graphical information - In some shows, notably quizzes and news programs, special static areas may also appear in the screen corners, e.g. showing possible answers, current time, live traffic/weather updates, horizontal bars with headlines and sign language areas (see Figure 4.3).

Figure 4.3 – Screenshot from a Portuguese TV news program: in the upper left corner, the TV channel logo; in the lower right corner, the news program logo ("Bom Dia Portugal"), the current time ("06:38") and live traffic information.

4. Broadcasters’ self-promotions - Sometimes, when advertising their own programs, broadcasters incorporate the program brand logo in the transmitted video, which may cause a problem similar to the existence of a commercial brand logo during a commercial block; an example of this situation is shown in Figure 4.4.

Figure 4.4 – Screenshot from a Portuguese broadcaster self-promotion commercial: the program logo is placed in the upper right corner during the whole self-promotion.


In the context of this Thesis, a DoG can be classified into two possible categories: (i) TV channel logo, when it corresponds to an actual TV channel logo; and (ii) non-TV channel logo, which refers to all other types of DoGs.

4.2 Characterizing TV Channel Logos

TV channel logos may differ in terms of opacity, stillness, shape, animation, color and location. All these possible characterization dimensions are summarized in Table 4.1. Table 4.2 includes three logo examples and their characterization in terms of the identified features, to demonstrate how diverse the combinations of logo characteristics can be.

Table 4.1 – Commonly observed characteristics in TV logos.
Opacity: Opaque; Semitransparent; Transparent
Stillness (Shape): Static; Dynamic
Stillness (Texture): Static; Dynamic
Shape: Number(s); Letter(s); Polygonal; Circular; Irregular; …
# Colors: Single; Multiple
Location: Upper Left Corner; Upper Right Corner; Lower Left Corner; Lower Right Corner

Table 4.2 – Three TV logo examples and their characterization.
SIC Radical – Opacity: Semitransparent; Stillness (Shape): Static; Stillness (Texture): Static; Shape: "SIC" + "RAD" + "ICAL"; # Colors: Single; Location: Upper Left Corner.
A Bola TV – Opacity: Opaque; Stillness (Shape): Static; Stillness (Texture): Dynamic; Shape: Irregular, circular form + "A" + "BOLA" + "tv"; # Colors: Multiple; Location: Upper Right Corner.
Fashion TV – Opacity: Opaque; Stillness (Shape): Dynamic; Stillness (Texture): Dynamic; Shape: Irregular, "f" + "fashiontv"; # Colors: Single; Location: Upper Left Corner.

The enormous diversity of logos hinders the development of a single solution to detect all of them. In fact, even opaque and static logos, which are the easiest to detect, can be hard to find when the background is very textured for a long time (see Figure 4.5 (a)), or when the contrast between the logo and the background is not sufficient to allow their discrimination, even visually by a human (see Figure 4.5 (b)).


(a) (b)
Figure 4.5 – Difficult logo examples: (a) The TV channel logo in the upper left corner is over a highly textured zone, making it hard to detect; (b) The TV channel logo in the upper left corner is over the sky, making it almost impossible to detect as its color is quite similar to the background color.

Among the selected characteristics, the most critical ones are opacity and stillness; if a logo is not totally opaque, the problems presented in Figure 4.5 (a) and (b) are even more critical, as the difficulty of discriminating the logo from the background increases. As an example, Figure 4.6 shows a screenshot of a Portuguese private TV channel, TVI, whose logo contains a dark shadow surrounding a colored graphical element, making it difficult to correctly identify what does and what does not belong to the logo itself. With respect to stillness – which can be analyzed in terms of shape and texture – the challenge is to develop a method which is robust to texture changes and simultaneously able to correctly identify a logo with dynamic shape. Figure 4.5 shows a good example of a logo with dynamic shape (note that the logo horizontal lines are white in (a) and blue in (b), with the transition between (a) and (b) produced through a short rotation effect). Figure 4.7 shows four screenshots of the logo of the same Portuguese TV channel, SIC, which is a good example of texture changes; in particular, there are noticeable shade variations, especially in the central letter "I" and in the orange and red regions. Figure 4.8 shows another set of four screenshots of the same TV channel logo, SIC, illustrating a logo that is dynamic in terms of shape – besides the change of colors (the originals are those in Figure 4.7), there are snowflakes constantly falling on it, as this logo was broadcasted during Christmas time.

Figure 4.6 – Difficult logo example: in the upper left corner, the TV channel logo contains a dark shadow surrounding a colored graphical object.

Figure 4.7 – Example of texture variations in a logo along time, notably in the central letter “I” and in the red and orange regions.


Figure 4.8 – Example of a logo that is dynamic in terms of shape, with snowflakes constantly falling on the logo.

As has been made clear, TV logo detection is not a trivial task. However, even for the most problematic cases, the amount of possible visual effects is not infinite. In the case of semi-transparent logos, even if the edges or the colors are not well-defined, they exist and remain constant over time. Regarding dynamic logos, it has been observed that they typically have a given periodicity in their movements and that the motion is confined to a limited area. Therefore, although there are several hurdles to overcome, some characteristics may be exploited to successfully address as many situations as possible. The previously presented logo examples are also useful to highlight another relevant issue: TV channels change their logos more often than one may expect. These changes may be temporary (e.g., during special events, national holidays, specific moments of the year or some days before and after the channel's anniversary) or permanent (e.g., for rebranding purposes). A good example is offered by SIC, whose regular logo, shown in Figure 4.7, has not changed in the past few years; however, during the 2015 Christmas season, the broadcasted TV channel logo was the one shown in Figure 4.8. But there are also permanent changes, like the one that occurred with RTP1 (a public Portuguese TV channel) during the time of this Thesis. The screenshots in Figure 4.2 and Figure 4.3 were both extracted from recordings of this channel, despite the fact that the TV channel logos (in the upper left corner) are not the same; in fact, a definitive logo design change (shown in Figure 4.9) has occurred.

Figure 4.9 – RTP1 logos: the old and the new.

4.3 Proposed System Architecture

This section presents a high-level description of the proposed algorithm for commercial blocks detection, together with the motivation for the main design options.

4.3.1 Designing the System
The good design and high performance of the targeted commercial blocks detection solution depend on the appropriate consideration of the main characteristics of the commercial blocks. Naturally, complexity is also always an issue and thus the efficacy-complexity trade-off plays a major role in the design. In the following, the main design considerations are presented:
1. Video segments processing – The first important observation is that, if a video is divided into small temporal segments (i.e., sets of frames or frame windows) where all the frames are consistent in terms of spatial content, then those frames should all be classified in the same way (i.e., all belonging, or not, to a commercial block); in particular, those temporal segments may coincide with "video shots", where a video shot may be defined as a sequence of frames running for an uninterrupted time period and filmed by the same camera. However, as shots in video content such as movies and interviews tend to be longer than in any other content, notably commercial blocks, it may be useful to divide each video shot into smaller video segments. This design approach has the following benefits: (i) too long shots may cause memory problems, due to the amount of data to be processed; and (ii) longer shots increase the risk of misclassifications, as they may result


from a shot change misdetection; in this case, the strategy of segmenting long shots into smaller segments allows higher success rates to be achieved, as at least a part of the frames may be correctly classified.
2. DoG presence diagnostic – Each video segment passes through a set of procedures aiming to detect potential DoGs. As explained in Section 4.1, the diversity of DoG types makes it hard to develop a generic method, well adapted to every DoG type. Still, the designed solution is not focused on any specific type of DoG; it actually intends to cover all the possible cases, no matter what the color, opacity, shape or stillness characteristics are. The rationale behind the designed solution takes into account the following simple assumptions about a DoG:
• it is present in one of the screen corners;
• it is quasi-stable in terms of edges (including inner and outer limits) and color.
A low-level analysis is performed for each video segment in order to reach a conclusion about the existence of a DoG (or more) on the screen.
3. DoGs Database – A problem to consider is related to the fact that, during a long period of TV broadcasting, many DoGs may occur, corresponding to TV channel logos and non-TV channel logos (the need to keep some of the non-TV channel logos in the database will be explained in Section 4.6). This observation highlights a major issue to take into account when designing an efficient TV channel logo detection solution: the need to store, organize and manage the detected DoGs and also to differentiate them. These are key issues whose solutions will critically impact the algorithm's performance, notably in terms of robustness and processing time. With this target in mind, the creation of a "DoGs Database" (DDB), which is empty when the algorithm starts running, seems to be mandatory; naturally, an efficient model to manage and update this database is also critically needed. The DoGs Database shall be continuously updated and checked in order to improve and speed up the whole commercial blocks detection process. The DoGs Database shall also contain all the relevant information about each DoG it stores. This information includes not only color and edge data, but also timing information (e.g., the time that a DoG was on air) and the DoG classification, i.e., TV channel logo or non-TV channel logo.
4. TV channel logos versus non-TV channel logos – To distinguish TV channel logos from non-TV channel logos (in particular, commercial brand logos), the following assumptions (based on viewing inspection) can be made (a possible way to combine them is sketched after this list):
• the maximum duration of a commercial is two minutes;
• the maximum number of consecutive times a commercial is broadcast is two;
• the maximum share of commercials per hour imposed in the EU is twelve minutes (see Chapter 2);
• in the long term (e.g., some hours of broadcasting from the same TV channel), the time a TV channel logo is on the screen is much longer than for a non-TV channel logo.
The classification process ends by attributing one of the following labels to each segment, Commercial Block or Regular Program, depending on whether a commercial block was or was not detected.
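As a hedged sketch (not the actual decision logic of the DoGs Database management, which is detailed in Section 4.6), the assumptions above could be combined into rules such as the following; the variable names and the on-air fraction threshold are illustrative:

```python
def classify_dog(longest_on_air_s, consecutive_repeats, on_air_fraction,
                 fraction_th=0.5):
    # A DoG on air for more than two minutes at a time, or repeated more
    # than twice consecutively, cannot belong to a single commercial
    if longest_on_air_s > 120 or consecutive_repeats > 2:
        return "TV channel logo"
    # In the long term, a TV channel logo dominates the broadcast time
    if on_air_fraction > fraction_th:
        return "TV channel logo"
    return "non-TV channel logo"
```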

4.3.2 Architecture Walkthrough
The proposed Global System (GS) is shown in Figure 4.10; the overall process includes the following main steps, detailed in Sections 4.4 to 4.7:
1. Shot Change Detection and Segmentation (SCD) – The input video is fragmented into video segments (or time windows) characterized by similar spatial content (i.e., shots), to


be analyzed in the subsequent modules; when a shot is considered too long, it may be divided into smaller segments.
2. DoG Acquisition Algorithm (DoGA) – Once a shot is detected, its edge and color characteristics are analyzed in order to detect and characterize possible static areas (DoGs), following two main procedures:
A. Video Segment Edges & Color Analysis – Each video segment is processed in terms of edges and color, but only in the screen regions that are likely to contain DoGs, i.e., the four screen corners.
B. DoG Detection – The possible existence of a DoG in the video segment is evaluated; if a DoG is detected, all the relevant information about it is extracted in order to characterize it.
3. DoGs Database Updating & DoG Type Decision – When a DoG is detected, some procedures and comparisons with the DoGs Database (which is empty when the application is launched) are performed, so that the application can distinguish between TV channel logos and non-TV channel logos and correctly classify each segment as commercial or regular program. The DoGs Database updating/management is a key process, as it is responsible for classifying a DoG as TV channel logo or non-TV channel logo, which is the major decision to be taken.
4. Video Segment Classification – The final classification of the segment under analysis, as Commercial Block or Regular Program, is taken based on the type of the detected DoG; a report with the classification results is continuously produced.

Figure 4.10 – Global System architecture.

Along the algorithm description, several parameters and thresholds are introduced; their values were experimentally obtained and will be presented in Chapter 5 when describing the performance assessment conditions.


4.4 Shot Change Detection and Segmentation
Figure 4.11 presents the flowchart of the module responsible for the video shot detection and segmentation. To avoid too long shots (which may happen in some cases, because they are indeed too long or due to the misdetection of some transitions), an additional segmentation step is introduced; accordingly, the output of the Shot Change Detection algorithm is generically referred to as a video segment. In Subsection 2.3.1, several types of video shot transitions were presented and, in Subsection 3.1.2, two solutions for hard cut detection were described. Two observations were essential to the design of the algorithm responsible for detecting shot changes: (i) hard cuts are the most common type of video shot transition; (ii) transitions from Regular Programs to Commercial Blocks (and vice-versa) are typically implemented using hard cuts. Thus, detecting this type of transitions becomes a priority. In this Thesis, the $YC_bC_r$ color space was used for representing the video data. This option is justified not only by the fact that video content is typically represented in this format, but also by the need to separate the luminance and color components, as these components are used in different stages of the global solution: the luminance is used in the SCD module and the chrominances are used in operations related to the DoGA. The implemented hard cut detection algorithm follows a histogram-based approach, as this allows a good trade-off between detection accuracy and computational complexity [54][55]; a distance metric is applied to the luminance histograms of consecutive frames and compared to a threshold, to decide about the possible existence of a hard cut. Two types of thresholds may be adopted: fixed (i.e., the same threshold is applied during the whole video) or adaptive (i.e., the threshold changes dynamically, according to the video content). While a fixed threshold is easier to implement, it is also too rigid, as it does not take into account the specificities of different video contents. In this Thesis, an adaptive threshold (adaptTh) is used, computed for each time window of WF frames.

Figure 4.11 – Shot Change Detection and Segmentation module flowchart.

The Shot Change Detection algorithm includes several steps which are described in the next subsections.


4.4.1 Luminance Histogram Operations The Luminance Histogram Operations are two simple procedures applied over the input luminance frames, which allow obtaining data that is essential to conclude about the existence of hard cuts in the analyzed video. These operations are the Luminance Frame Histogram Computation and the Luminance Histogram Distance Computation, explained in the following.

• Luminance Frame Histogram Computation The Luminance Frame Histogram Computation procedure computes the luminance histogram of each input video frame.

• Luminance Histogram Distance Computation
The Luminance Histogram Distance Computation is applied to each pair of consecutive frames, i and i-1, in order to evaluate how similar they are. Several metrics may be used to compute the distance between two histograms; in this Thesis, the Chi-Square distance was chosen, since it has proven to be a reliable and simple metric, as demonstrated by its popularity as a histogram comparison method in several platforms and libraries (namely in the OpenCV library, used along this Thesis).

The Chi-Square distance between two histograms, H1 and H2, is computed as:

$D(H_1, H_2) = \sum_{i=1}^{n} \frac{(H_1(i) - H_2(i))^2}{H_1(i) + H_2(i)}$   (4.1)

where n is the number of histogram bins and $H_k(i)$ is the value corresponding to bin i in histogram $H_k$. Lower distances represent better matches than higher distances – a perfect match results in a Chi-Square distance of 0; Figure 4.12 shows a graphical representation of this process.

Figure 4.12 - Schematic representation of the Luminance Histogram Distance Computation between consecutive frames.
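A minimal sketch of these two operations with OpenCV and numpy, assuming BGR input frames, is given below; (4.1) is implemented directly to keep the formula explicit (OpenCV's compareHist offers related, but not identical, Chi-Square variants):

```python
import cv2
import numpy as np

def luminance_histogram(frame_bgr, bins=256):
    # Luminance Frame Histogram Computation: histogram of the Y channel
    ycrcb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2YCrCb)
    return cv2.calcHist([ycrcb], [0], None, [bins], [0, 256]).ravel()

def chi_square_distance(h1, h2, eps=1e-10):
    # Equation (4.1); eps avoids division by zero in empty bins
    return float((((h1 - h2) ** 2) / (h1 + h2 + eps)).sum())
```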

4.4.2 Adaptive Threshold Computation
As already mentioned, the Adaptive Threshold Computation is a key step, as its output controls the detection of the hard cuts along the video sequence. For each temporal window comprising WF frames, the following procedure (based on the method proposed in [56]) takes place:

1. Chi-Square Mean Value Computation – The mean value of the Chi-Square distances between consecutive frames, inside the current window, is computed. Note that, for a window with WF frames, there are only WF-1 Chi-Square distance scores to take into account (see Figure 4.12). Thus, the Chi-Square Mean Value results from:

$\mu_D = \frac{\sum_{i=2}^{WF} D(H_i, H_{i-1})}{WF - 1}$   (4.2)

2. Adaptive Threshold Definition – After the Chi-Square Mean Value Computation, the adaptive threshold (adaptTh) is obtained as:


$adaptTh = \mu_D \times ThWin$   (4.3)

In (4.3), ThWin is a constant that depends on $\mu_D$ according to:

$ThWin = \begin{cases} ThL, & \text{if } lowLim \le \mu_D < highLim \\ ThH, & \text{if } \mu_D \ge highLim \end{cases}$   (4.4)

as frame windows with different Chi-Square Mean values have specific characteristics and represent different content that must be dealt with differently. That is also why the algorithm ignores the windows where $\mu_D < lowLim$, considering that in those cases the video content is very likely the same. During the experiments, it was observed that frame windows with low $\mu_D$ were too sensitive to transmission/recording errors and to small brightness changes, which could result in several hard cut false positives. On the other hand, in the windows with the highest $\mu_D$ values, the hard cuts could be identified with thresholds lower than for the cases with intermediate $\mu_D$ values. This is the reason why there are two thresholds ($ThL$ and $ThH$), applied to two different ranges of $\mu_D$.

4.4.3 Hard Cut Detection Decision
Once the adaptive threshold has been computed, the final step is to compare the Chi-Square distance, $D(H_i, H_{i-1})$, of each pair of consecutive frames inside the window under analysis with adaptTh, to decide about the existence of a hard cut, according to:

f(i) = { Hard Cut Detected, if D(Hi, Hi−1) > adaptTh; No cut, else }    (4.5)

4.4.4 Forced Segmentation
In parallel with the hard cuts detection procedure, a mechanism is running to avoid processing segments considered too long. The procedure is very simple: it only checks how many frames have occurred since the position (LastEnd) of the last hard cut (or of the last end of segment); comparing that value with the maximum acceptable length for each video segment, it may force a new segment to be created; equation (4.6) summarizes this decision ("i" is the current frame position):

SegmentationDecision(i) = { End Segment, if (i − LastEnd) > MaxShotLen; Do nothing, else }    (4.6)
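The following Python sketch puts equations (4.2) to (4.6) together for one temporal window; all the numeric values (the window size, lowLim, highLim, TW1, TW2 and MaxShotLen) are illustrative placeholders, as the Thesis does not fix them at this point.

import numpy as np

WF = 30                              # frames per temporal window (assumed)
LOW_LIM, HIGH_LIM = 0.05, 1.0        # lowLim / highLim of (4.4), placeholders
TW1, TW2 = 6.0, 3.0                  # the two window thresholds, placeholders
MAX_SHOT_LEN = 500                   # MaxShotLen of (4.6), placeholder

def segment_window(distances, last_end, first_index):
    """distances: the WF-1 Chi-Square distances inside the current window;
    returns the detected cut positions and the updated LastEnd."""
    cuts = []
    mu_d = float(np.mean(distances))                  # equation (4.2)
    th_win = TW1 if mu_d < HIGH_LIM else TW2          # equation (4.4)
    adapt_th = mu_d * th_win                          # equation (4.3)
    for k, d in enumerate(distances):
        i = first_index + k + 1                       # frame index of pair (i-1, i)
        is_cut = mu_d >= LOW_LIM and d > adapt_th     # (4.5); quiet windows ignored
        too_long = (i - last_end) > MAX_SHOT_LEN      # forced segmentation (4.6)
        if is_cut or too_long:
            cuts.append(i)
            last_end = i
    return cuts, last_end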

4.5 DoG Acquisition Algorithm

The DoG Acquisition algorithm comprises all the algorithmic operations performed with the aim of detecting and characterizing DoGs in the TV content. Those tasks are associated with two main functions: Video Segment Edges & Color Analysis (described in Subsection 4.5.1) and DoG Detection (described in Subsection 4.5.2).

4.5.1 Video Segment Edges & Color Analysis
Figure 4.13 presents the flowchart of the module responsible for processing each video segment, considering its edges and color characteristics. Once the first and last frames of a new segment are identified, the procedures described in Subsections 4.5.1.1 to 4.5.1.5 take place.


4.5.1.1 Key Frames Extraction
In the Key Frames Extraction step, some frames are selected and extracted as segment representative frames; in the sequel, these frames will be referred to as "key frames" (KF). Intuitively, the DoGs could be detected by performing a frame-by-frame comparison along the video segment; however, this approach presents two main drawbacks:
1. Computational Cost – this process would be computationally very expensive.
2. Redundancy – as mentioned in Section 4.3, the video is segmented because all the frames in the same segment should belong to the same content type, either Commercial Block or Regular Program; accordingly, they should be classified in the same class, making it redundant to process all the frames of each segment.

Figure 4.13 – Video Segment Edges & Color Analysis module flowchart.

To restrain the computational cost without compromising the performance, the number of selected key frames (NKF) should be limited; in the proposed algorithm, NKF depends on the video segment length (SL), measured in frames, according to the following expression:

NKF = { ceil(PerKF × SL), if SL < SegLenTh; K, if SL ≥ SegLenTh }    (4.7)

Expression (4.7) contains three parameters, SegLenTh, K and PerKF, which are introduced below:
1. SegLenTh – This parameter sets the boundary between what is considered a short and a long segment.
2. PerKF – It sets the percentage of the total number of frames of the short segments that are kept. Higher values of PerKF are useful for the situations where the DoG is difficult to distinguish from the background, as the probability of finding the DoG's edges increases. However, for other situations, such as highly textured segments, a large PerKF may generate too many edges to analyze, rendering the Key Frames Extraction process useless (as it would not be selective); Subsection 4.5.1.4 details this idea.
3. K – This parameter sets the total number of frames that are kept for the long segments. The observations about the variation of PerKF are also applicable to K. To ensure continuity between the branches of (4.7), K = ceil(PerKF × SegLenTh).

Once NKF is obtained, NKF KFs are selected; the algorithm always picks the first frame of the video segment and then selects the remaining frames so that all the chosen frames are equidistant. This selection process aims at a balanced video segment representation, trying to capture the video content variations that may happen inside a segment. Figure 4.14 presents three examples of the KFs selection process for three different SL values, with SegLenTh = 100, PerKF = 10% and K = 10.
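A small Python sketch of the selection rule (4.7) is given next, written so that it reproduces the indexes shown in Figure 4.14 below; the parameter values follow that example, and the sampling step used here is one possible reading of the equidistance criterion.

import math

SEG_LEN_TH, PER_KF, K = 100, 0.10, 10   # K = ceil(PER_KF * SEG_LEN_TH)

def key_frame_indexes(sl):
    """Return the 1-based indexes of the NKF equidistant key frames."""
    if sl < SEG_LEN_TH:
        n_kf = math.ceil(PER_KF * sl)   # short segment branch of (4.7)
        step = round(1 / PER_KF)        # every 10th frame for PerKF = 10%
    else:
        n_kf = K                        # long segment branch of (4.7)
        step = sl // K                  # spread K frames over the whole segment
    return [1 + k * step for k in range(n_kf)]

print(key_frame_indexes(53))            # [1, 11, 21, 31, 41, 51], as in Figure 4.14 (a)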

Figure 4.14 – Examples of the KFs selection process:
(a) SL = 53; NKF = ceil(0.1 × 53) = 6; KFs indexes: 1, 11, 21, 31, 41, 51;
(b) SL = 76; NKF = ceil(0.1 × 76) = 8; KFs indexes: 1, 11, 21, 31, 41, 51, 61, 71;
(c) SL = 110; NKF = 10; KFs indexes: 1, 12, 23, 34, 45, 56, 67, 78, 89, 100.

4.5.1.2 Key Frames Edges Detection
In the Key Frames Edges Detection step, an edge detection algorithm is applied to predefined spatial areas (where DoGs are likely to be present) of each KF; the predefined areas (DoG Area, DA) to analyze are defined by two parameter groups:
1. DoG Area Size (DASize)
   a. HeightDA – number of lines;
   b. WidthDA – number of columns.
2. DoG Area Position (DAPosition) – the DoG area may be positioned at the four corners of the frame, notably:
   a. Upper Left Corner (ULC);
   b. Upper Right Corner (URC);
   c. Lower Left Corner (LLC);
   d. Lower Right Corner (LRC).
Both HeightDA and WidthDA were determined based on a previous analysis of several recorded video segments. To adapt to different frame resolutions, these parameters are defined as a percentage of the frame vertical and horizontal resolutions (number of pixels), according to:

HeightDA = 0.25 × FrameHeight    (4.8)
WidthDA = 0.20 × FrameWidth    (4.9)

In (4.8) and (4.9), FrameHeight and FrameWidth are, respectively, the number of pixels in the vertical and horizontal directions of a frame. Figure 4.15 depicts the DAs for a frame extracted from a Portuguese TV news program.


Figure 4.15 – DoG Areas definition: the white boxes labeled ULC, URC, LLC and LRC.

To extract the edges within each DA, the Canny edge detector [57] is used; this detector is one of the most used edge detectors in video analysis, due to its robustness to noise and its edge detection accuracy; it involves the following main steps:
1. Noise Filtering – a Gaussian filter is applied to the image to smooth it and to reduce the (eventual) noise.
2. Image Gradient Computation – the image gradient norm is computed using a differential operator.
3. Non-maximum Suppression – an edge thinning process is applied so that the "strongest" edges (i.e., those edges with the largest gradient) are preserved and the remaining ones are removed.
4. Hysteresis – to reduce the edge pixels resulting from noise, two thresholds, upper_threshold and lower_threshold, are applied to the output of the previous step. The final decision is taken according to the following conditions, where pixel_gradient(i,j) represents the image gradient at pixel position (i,j):
   a. If pixel_gradient(i,j) ≥ upper_threshold, the pixel is considered an "edge";
   b. If pixel_gradient(i,j) < lower_threshold, the pixel is considered a "no-edge";
   c. If upper_threshold > pixel_gradient(i,j) ≥ lower_threshold, the pixel is considered an "edge" only if it is 8-connected with a pixel that verifies condition a).
As an example, Figure 4.16 shows the detected edges for the video frame in Figure 4.15, where the pixels considered as "edges" are represented in white.


Figure 4.16 - Canny Edge Detector output for the video frame in Figure 4.15.
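As a hedged sketch of this step, the code below crops the four DAs according to (4.8) and (4.9) and runs OpenCV's Canny detector on each; the hysteresis thresholds used here are illustrative, as the Thesis does not list its exact values at this point.

import cv2

def dog_areas(gray):
    """Return the four DoG Areas of one grayscale key frame."""
    h, w = gray.shape
    dh, dw = int(0.25 * h), int(0.20 * w)        # equations (4.8) and (4.9)
    return {"ULC": gray[:dh, :dw],     "URC": gray[:dh, w - dw:],
            "LLC": gray[h - dh:, :dw], "LRC": gray[h - dh:, w - dw:]}

def corner_edges(gray, low=100, high=200):
    """Binary edge map (255 = edge) for each corner of one key frame."""
    return {c: cv2.Canny(roi, low, high) for c, roi in dog_areas(gray).items()}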

For a better understanding of the Key Frames Edges Detection output, consider a video segment containing the frame shown in Figure 4.17. Observing this figure, it is hard to distinguish part of the TV channel logo from the background. This problem often occurs with logos without well defined contours (e.g., no black contours are used to delimit the logo) and with a color similar to the background color. The worst case occurs when the logo is plain white, because many video contents, in particular those resulting from outdoor event transmissions and movies, are likely to include sky, clouds or other light elements that hinder the logo detection – this is the situation shown in Figure 4.17.

Figure 4.17 – Screenshot extracted from the video segment “rtp1_demo1” (see [58]), with a TV channel logo in the upper left corner.

For the considered segment, and using NKF = 5 (the segment length is 43), the edge detector output for the five key frames, and for the ULC area, is shown in Figure 4.18, together with the corresponding video frames.


Figure 4.18 – KFs edges maps obtained for the video sequence "rtp1_demo1": (a) KF index = 1; (b) KF index = 11; (c) KF index = 21; (d) KF index = 31; (e) KF index = 41.

In Figure 4.18, none of the images contains a complete representation of the TV channel logo ("RTP 1") due to the lack of contrast between the logo and the background. The edges map in Figure 4.18 (c) is the most complete one, while in Figure 4.18 (e) the logo is almost unrecognizable. This evidence justifies the need for the next step of the algorithm.

4.5.1.3 Key Frames Edges Fusion
The Key Frames Edges Fusion step consists in combining, for each considered DA, all relevant KF edge maps, by applying a logical 'OR' operation over the corresponding edges maps. The goal of this step is to work around the problem visible in Figure 4.18, i.e., the incomplete detection of the logo edges due to the similarity between the color of the logo and the color of the background. By performing a logical 'OR' between all edges maps obtained for each video segment (one for each KF), it is expected to obtain a single, fused edges map containing sufficient edges to define the logo present in the segment. This operation has the drawback that the fused map also contains many edges that do not belong to the actual logo; part of these useless edges will be removed afterwards using the procedures explained in Subsections 4.5.1.4 and 4.5.1.5. Figure 4.19 presents the final output of the Key Frames Edges Fusion for the example considered in the previous section.
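The fusion itself reduces to a per-pixel 'OR' accumulation, as in the following sketch (names are illustrative):

import numpy as np

def fuse_edge_maps(edge_maps):
    """edge_maps: list of binary (0/255) Canny outputs for one DA of one segment."""
    fused = np.zeros_like(edge_maps[0])
    for m in edge_maps:
        fused = np.bitwise_or(fused, m)   # logical 'OR' accumulation over the KFs
    return fused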


Figure 4.19 – Key Frames Edges Fusion output for the ULC region (video sequence "rtp1_demo1").

4.5.1.4 Color Analysis of Edge Pixels
The Color Analysis of Edge Pixels is the last step before concluding about the existence, or not, of relevant static pixels in the DAs of the video segment under analysis. The goal of this step is to refine the edges map obtained in the Key Frames Edges Fusion step, which very likely contains many pixels not belonging to the eventual DoG, notably due to the OR fusion. Before explaining the adopted approach, it is important to keep in mind that, as edges occur in areas with fast color transitions, they are not optimal points to define the DoG colors; moreover, the edges themselves may not belong to the actual DoG (e.g., for DoGs without well defined limits, or for the cases where the fast color transition that defines the edge is caught outside the DoG). Thus, the color analysis is performed not only for the edge pixels themselves, but also for their neighborhood. The neighborhood of the edge pixels is obtained after dilation (a well-known morphological operation) is applied to the output generated by the Key Frames Edges Fusion step (see Figure 4.20).

Figure 4.20 - Key Frames Edges Fusion output for the ULC region after dilation (video sequence “rtp1_demo1”).

Finally, the variances of the two chrominance components are computed for every pixel belonging to the dilated edge map, using all the segment key frames; a pair of chrominance variances, Crvar(i, j) and Cbvar(i, j), is obtained for each analyzed pixel, where (i, j) are the spatial coordinates of the pixel. The luminance component is not included in this analysis for two reasons: to reduce the computational cost, and because semi-transparent DoGs would be penalized in this analysis; in fact, it was observed that, for that type of content, the luminance component values vary more than the chrominance components values. Figure 4.21 shows a heat map that allows a better understanding of the importance of this step; it represents the mean value of the chrominance variances for each pixel, computed according to:

MeanChrominanceVariance(i, j) = [ Crvar(i, j) + Cbvar(i, j) ] / 2    (4.10)

The red pixels are those with a lower mean chrominances variance (which means that red pixels belong to edges with more constant chrominance values); the mean chrominances variance value increases for the orange and yellow pixels, while the blue ones are those with the highest variance; the chrominances variance was not computed for the pixels in black, as they do not belong to the edges map for this segment.

Figure 4.21 – Heat map representing the mean chrominances (Cr and Cb) variance.
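A possible sketch of this step, assuming a 3×3 dilation kernel (the Thesis does not specify the kernel size): the fused edge map is dilated, the Cr/Cb variances are computed across the key frames, and the per-pixel mean of (4.10) is obtained for the masked pixels.

import cv2
import numpy as np

def chrominance_variances(fused_edges, key_frames_bgr):
    """Return Crvar, Cbvar, the dilated mask and the mean variance of (4.10)."""
    kernel = np.ones((3, 3), np.uint8)                  # assumed kernel size
    mask = cv2.dilate(fused_edges, kernel) > 0          # edge pixels + neighborhood
    ycc = [cv2.cvtColor(f, cv2.COLOR_BGR2YCrCb) for f in key_frames_bgr]
    cr = np.stack([f[:, :, 1].astype(np.float64) for f in ycc])
    cb = np.stack([f[:, :, 2].astype(np.float64) for f in ycc])
    cr_var, cb_var = cr.var(axis=0), cb.var(axis=0)     # variance over the KFs
    mean_var = np.where(mask, (cr_var + cb_var) / 2, np.nan)   # equation (4.10)
    return cr_var, cb_var, mask, mean_var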

4.5.1.5 Video Segment Static Pixels Detection
To finish the Video Segment Edges & Color Analysis process, a final decision about the state of each pixel in the DAs, as "static" or "non-static", is taken according to:

PixelState = { Static, if Crvar < ThChrom and Cbvar < ThChrom; Non-Static, else }    (4.11)

where ThChrom is the threshold to which Crvar and Cbvar are compared. Whenever both chrominance variance values are lower than the threshold, the pixel is classified as Static. After a PixelState has been assigned to all the pixels under scrutiny, only the information of those classified as Static is preserved and a Static Pixels Map (SPM) is obtained as the final output of the Video Segment Edges & Color Analysis module. Figure 4.22 depicts the resulting SPM for the example that has been successively considered in this section.


Figure 4.22 – Static Pixels Map (SPM) for the video sequence used as example.

Appendix A illustrates the output for the main processing steps of the proposed algorithm, based on a TV recording extracted from the Portuguese public TV channel “RTP1”.

4.5.2 DoG Detection
Figure 4.23 presents the flowchart of the DoG Detection module, whose main function is to decide if the static areas identified in the previous stage may be considered as DoGs. With this target, this module receives a set of SPMs (each one corresponding to a temporal segment) and generates as output a decision about the presence of a DoG in the corner under scrutiny.

4.5.2.1 SPMs Intersection
As already stated along this Thesis, the characteristic that distinguishes DoGs from any other casual static area is the longer time they remain on the screen and their presence in several video shots. A DoG detection algorithm based on a single shot would not be efficient, as it would detect many static areas that actually do not belong to any DoG - e.g., a static background - leading to many false positives (i.e., detected DoGs which are not true DoGs). To avoid this problem, a minimum number of video segments, Nseg, must be analyzed before concluding if a static area can be considered a DoG.

In the SPMs Intersection step, a logical 'AND' operation is applied to the most recent Nseg SPMs, which implies that the first decision about the presence of a DoG only occurs after the Nseg-th segment is processed; from then on, every time a new temporal segment is processed, a new decision is taken using the last Nseg segments. The higher the value of Nseg, the more reliable the algorithm is in terms of static pixels detection, as shown in Figure 4.24; however, if Nseg is too large, the risk of discarding relevant information and damaging the DoG itself increases. Comparing the results of Figure 4.24 with Figure 4.22, it is clear that this stage "cleans" the previous results, as it removes undesirable pixels that do not belong to the actual DoG.
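A minimal sketch of the SPMs Intersection, assuming boolean SPMs and keeping only the last Nseg maps (variable names are assumptions):

from collections import deque
import numpy as np

N_SEG = 5                                 # Nseg, illustrative value
recent_spms = deque(maxlen=N_SEG)         # one boolean SPM per processed segment

def intersect_spms(new_spm):
    """Return the 'AND' of the last Nseg SPMs, or None until enough history exists."""
    recent_spms.append(new_spm)
    if len(recent_spms) < N_SEG:
        return None                       # first decision only after the Nseg-th segment
    return np.logical_and.reduce(list(recent_spms))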


Figure 4.23 – DoG Detection module flowchart.

Figure 4.24 – Result of SPMs Intersection for various values of Nseg: (a) Nseg = 4; (b) Nseg = 5; (c) Nseg = 6.

The amount of persistent static pixels, PersistPix (the percentage of the DoG Area, DASize, occupied by static pixels common to the last Nseg segments), is first compared to a threshold, MinPixTh, in order to discard too small static areas without relevant meaning; if PersistPix is below or equal to this first threshold, the corner under analysis is classified as not containing a DoG; if the PersistPix value is above the second threshold level, StatPixTh, the corner under analysis is classified as containing a DoG:

LogoPresenceDecision1 = { No DoG detected, if PersistPix ≤ MinPixTh; DoG detected, if PersistPix ≥ StatPixTh; To be checked, else }    (4.12)

In summary, each corner in a video segment may be, at this stage, in one out of three possible conditions:

1. No DoG detected status – The processing continues at the Video Segment Classification module (see Section 4.7);
2. DoG detected status – The processing continues at the DoGs Database Updating & DoG Type Decision module (see Section 4.6);
3. To be checked status – This status is assigned when the output of the SPMs Intersection contains an area that is not large enough to be considered as a DoG, but is not negligible either. The subsequent processing stages then depend on the current situation of the DoGs database (DDB): if the DDB is not empty, the processing continues with the DoG Presence Verification module, a mechanism which tries to recover some of those DoGs that are hard to find and were not immediately identified as such in the SPMs Intersection block (see Subsection 4.5.2.2); on the other hand, if the DDB is empty, the algorithm assigns a No DoG detected status to the video segment, for the corner under analysis, and the processing continues at the Video Segment Classification module. This is a conservative approach, as it only tries to further process 'unclear' DoGs if there are already DoGs in the DDB to be taken as guidance.
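The three-way decision of (4.12), including the conservative handling of an empty DDB, can be condensed as in the following sketch (the threshold values are assumptions):

MIN_PIX_TH, STAT_PIX_TH = 0.01, 0.05      # assumed fractions of the DA size

def spm_decision(persist_pix, ddb_empty):
    """persist_pix: fraction of the DA covered by persistent static pixels."""
    if persist_pix <= MIN_PIX_TH:
        return "no_dog"
    if persist_pix >= STAT_PIX_TH:
        return "dog_detected"
    # 'to be checked' areas are only verified against a non-empty DDB
    return "no_dog" if ddb_empty else "to_be_checked"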

Appendix B presents a set of results obtained after the SPMs Intersection step is applied to some of the video test sequences used in Chapter 5.

4.5.2.2 DoG Presence Verification
In the previous subsection, several DoG detection situations have been described to highlight the associated difficulties. The DoG Presence Verification (DPV) module is proposed with the goal of guaranteeing that no frame corner containing a DoG goes undetected, as long as it contains a number of pixels large enough to be worth inspecting. The idea is to compare the DoGs under analysis with those already in the DDB (and associated to the same corner) to evaluate if some similarity is found. The steps involved in this comparison are:

• Edges comparison – The goal of this step is to find the edge pixels in common between the binary mask resulting from the previous step and the binary masks of the DoGs already in the DDB. For this, a logical 'AND' operation between the SPMs Intersection map and the edges map of each DoG in the DDB is performed.
• Color comparison between common edge pixels – In this step, the chrominance components of the edge pixels that are common to the SPMs Intersection map and to the DoGs in the DDB are compared; in this comparison, each common edge pixel is classified as Similar or Not Similar according to:

PixelSimilarity = { Similar, if CrabsDiff < ThColor and CbabsDiff < ThColor; Not Similar, else }    (4.13)

where

CrabsDiff = abs[ Cr(SPMs Intersection) − Cr(DoG in DDB) ]    (4.14)

and

CbabsDiff = abs[ Cb(SPMs Intersection) − Cb(DoG in DDB) ]    (4.15)

Figure 4.25 illustrates the results obtained after this step. The pixels classified as Similar are kept and used in the next stage of the algorithm.

Figure 4.25 – DoG Presence Verification: (a) DoG in the DDB; (b) SPMs Intersection map under verification; (c) final result after the DoG Presence Verification step, containing the pixels classified as Similar.

4.5.2.3 DoG Presence Verification Decision
At this point, the percentage of pixels of each DoG in the DDB that are similar to the output of the previous step (PerSimPix) is computed and compared to a given threshold (SimPixTh), allowing a new decision on the presence of a DoG on the screen (for the DoG static areas for which it was previously impossible to take a decision):

LogoPresenceDecision2 = { DoG detected, if PerSimPix ≥ SimPixTh; No DoG detected, if PerSimPix < SimPixTh }    (4.16)

If PerSimPix reaches the defined threshold, the static pixels in the frame corner under analysis are identified with the DoG (already in the DDB) that they matched; otherwise, the corner is classified as not containing that DoG and the algorithm repeats the procedure for the next DoG in the DDB. This cycle is repeated until one of the following stop conditions occurs:
1. A match between the SPMs Intersection map and one of the DDB entries happens;
2. All DoGs in the DDB have been tested with no matches registered, and thus the corner is classified as not containing a DoG.


If a DoG is identified in any of the frame corners (meaning that there is already a similar DoG in the DDB), the processing continues in the DoGs Database Updating & DoG Type Decision stage.

4.6 DoGs Database Updating & DoG Type Decision

The DoGs Database is a structure designed to keep all the necessary information about the DoGs detected over time. As this is a key element of the global system proposed in this Thesis, it demands a clear and robust management, since the data it contains is critical for the final classification of each video segment. The DDB is organized in two areas, according to the DoG type:
1. TV channel logos – All TV channel logos are stored in this DDB area.
2. Non-TV channel logos – All DoGs that do not represent a TV channel logo are stored in this DDB area. Although all these DoGs have the same implication in the context of the implemented solution (they are not TV channel logos), they may still be divided into three different groups according to their particular characteristics:
   A. Commercial brand logos – These DoGs are used by the advertisers during some commercials, and usually represent the advertiser's brand logo (as shown in Figure 4.1 (b)). These DoGs have the following specific characteristics: (i) they are broadcast, at most, during four consecutive minutes (considering the remarks made in Subsection 4.3.1, i.e., the maximum duration of each commercial is two minutes and, at most, a commercial is broadcast two consecutive times); (ii) a commercial brand logo appears alone (i.e., without other DoGs at the same time).
   B. Program/series logos – These DoGs are used by the broadcasters to identify specific programs or series (see Figure 4.2 and Figure 4.4). The following characteristics of these DoGs are relevant: (i) when they appear during the program itself (and thus at the same time as a TV channel logo), they have a typical duration higher than four minutes; (ii) when they are broadcast during a self-promotion, they appear alone and have a typical duration lower than four minutes.
   C. Other DoGs – This group includes DoGs with varying purposes, e.g., current time, live traffic/weather updates, horizontal bars with headlines, "Live" signs, sports scores and sign language areas (see Figure 4.3). These DoGs tend to be broadcast at the same time as a TV channel logo and have a typical duration higher than four minutes.

The non-TV channel logos area is subdivided in four sub-areas, each one corresponding to a different frame corner. This division results from the observation that this type of DoG is usually broadcast in a specific corner, making it useless to check its presence in all the corners; naturally, this approach saves processing time. On the other hand, it was observed that, in given circumstances, the same TV channel logo may be broadcast in different corners; thus, it is not useful to separate these DoGs using the same criterion. Figure 4.26 presents the DoGs Database Updating & DoG Type Decision module flowchart, which is responsible for all the DDB management tasks. When a DoG is detected in a frame corner, all the information about it is handled in this stage.


Figure 4.26 – DoGs Database Updating & Logo Type Decision module flowchart.

Two different solutions to execute the tasks associated to DoGs Database Updating and DoG Type Decision are suggested: the Basic Solution (implemented and assessed in Chapter 5) and the Advanced Solution (only described conceptually). The main differences between the two solutions are reflected not only in the amount of data that is kept for each DoG, but also in the procedures related to the DoGs Insertion in DoGs Database (see Subsection 4.6.2) and DoGs Type Decision (see Subsection 4.6.3.4). In more practical terms, the Advanced Solution was designed to respond not only to legal issues but also to some of the observations made in the beginning of this Chapter - some examples of conditions included in the rationale behind the Advanced Solution are the legal limitation of twelve minutes of advertising per broadcaster, per hour; the distinction established between the different types of non-TV channel logos; and the observations related to the simultaneous presence on the screen of different types of DoGs, among others. The detail introduced by this last solution is expected to add value compared to the Basic Solution, as it takes into account more aspects related to the actual commercials characteristics; the way it was conceived allows not only saving memory, but also having a higher confidence in the obtained results (Subsections 4.6.2.2 and 4.6.3.2 clarify this). However, the Advanced Solution is only described conceptually because it was settled and defined only when the Thesis was already in a very advanced state and it was no longer possible to implement the whole new DDB management system. Also, building and reproducing the conditions allowing to properly test and assess the new system - which would imply creating and classifying new video test sequences - would be a very slow process that could not be completed in the remaining time. Both the Basic and Advanced Solutions use the following data to characterize each DoG:

1. DoG type – Primary labeling information, which distinguishes a TV channel logo from a non-TV channel logo; these labels, once attributed, are definitive; there is also a temporary state, corresponding to undefined logos, which may evolve to one of the other two states (see Subsection 4.6.3).
2. Chrominance components per pixel – Average value of each chrominance component over the video segments where the DoG was detected, for each pixel belonging to the DoG's edges.
3. Center of mass – Coordinates (iCM, jCM) of the DoG's center of mass, which are used to characterize the spatial distribution of the DoG pixels.
4. Date of last detection – Date corresponding to the last detection of the DoG. This date is regularly checked and stored in the Database Update & Management module (see Subsection 4.6.3), as it is useful to know when a DoG has not appeared for a long time and may be removed from the DDB; it is also used to serialize the logos in the DDB.
5. Time persistence – Percentage of time the DoG has been on the air, which is essential information to determine the DoG type; this measure is defined as:

TimePersistence = (Time of DoG on Air) / TimeWin    (4.17)

where TimeWin is the time window over which the DoG persistence is evaluated.

In addition, the Advanced Solution also requires the following information about each DoG:

6. Maximum consecutive duration – Maximum consecutive time the DoG was broadcast, TMaxConsec. This information is useful to distinguish TV channel logos from non-TV channel logos (see Subsection 4.6.3.4).
7. Flag "alone_appearance" – Flag used to register if a given DoG has already appeared alone (i.e., with no other DoGs in the same video segment); it is useful when determining the DoG type, as a DoG only appears alone in three circumstances: if it is a TV channel logo, a commercial brand logo or a program/series brand logo.

An important remark to make here is that the Time Persistence referred above shall be computed for the same video flow, i.e., for the same TV channel.
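One possible shape for a DDB entry holding the data listed above is sketched next; the field names are assumptions, and the two last fields are only required by the Advanced Solution.

from dataclasses import dataclass
from datetime import datetime
from typing import Optional, Tuple
import numpy as np

@dataclass
class DoGEntry:
    dog_type: str                                # "tv_logo", "non_tv_logo" or "undefined"
    mean_cr: np.ndarray                          # average Cr per edge pixel
    mean_cb: np.ndarray                          # average Cb per edge pixel
    center_of_mass: Tuple[float, float]          # (iCM, jCM)
    last_detection: datetime                     # date of last detection/occurrence
    time_persistence: float                      # equation (4.17)
    max_consec_duration: Optional[float] = None  # TMaxConsec (Advanced Solution only)
    alone_appearance: bool = False               # "alone_appearance" flag (Advanced Solution only)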

4.6.1 DoGs Matching
The DoGs Matching (DM) module compares the DoG acquired by the DoG Acquisition algorithm with the DoGs in the DDB, to decide if the DoG is already known and stored. This process is applied in both the Basic and Advanced Solutions. The DM module is similar to the DPV module, as their goals are alike – to associate a DoG in a corner to one of the DoGs in the DDB. Therefore, DoGs identified in the DPV module are not processed again in DM; in those cases, the processing continues in the Database Update & Management module (see Subsection 4.6.3). The remaining corners with detected DoGs are processed here. Compared to DPV, which includes an 'AND' operation and a color comparison (see Subsection 4.5.2.2 for a complete description of the process), an additional condition is introduced here to increase the robustness of the matches found at this stage. This new condition (which could not be used in DPV because, as said before, the DoGs that go through that step are not complete or well defined) depends on the distance between the center of mass (CM) of the detected DoG and that of the DoG in the DDB, a measure identified as CMDist and computed as follows:

CMDist = √[ (iCM,acquired − iCM,DDB)² + (jCM,acquired − jCM,DDB)² ]    (4.18)

where (iCM,acquired, jCM,acquired) are the coordinates of the acquired DoG's center of mass, and (iCM,DDB, jCM,DDB) are the coordinates of the center of mass of the DoG in the DDB being compared with the acquired DoG. Using CMMax as an empiric threshold for the maximum acceptable value of CMDist, each matching decision is made as follows:

MatchingDecision1 = { Match, if PerMatPix ≥ MatchTh and CMDist < CMMax; Does not match, else }    (4.19)

where PerMatPix, analogously to PerSimPix in Subsection 4.5.2.2, is the percentage of pixels of the DoG in the DDB that are similar to the pixels of the DoG under analysis; this value is compared to a threshold, MatchTh. As done in DPV (see Subsection 4.5.2.3), the search in the DDB is executed from the most recently occurred DoGs to those which appeared earlier. This matching process is repeated until one of the following stop conditions occurs:
1. A match between the detected DoG and one of the DoGs in the DDB happens;
2. All DoGs in the DDB have been tested (with no matches registered).
If the DoG under evaluation matches an entry in the DDB, the process continues in the Database Update & Management module (see Subsection 4.6.3); on the contrary, if the DoG does not match any entry in the DDB, it is classified as New Logo and the process continues with the DoGs Insertion in DoGs Database module (see Subsection 4.6.2).
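Reusing the DoGEntry sketch above, the matching test of (4.18) and (4.19) may be condensed as follows; per_mat_pix stands for the edge/color comparison result of Subsection 4.5.2.2 and the threshold values are assumptions.

import math

MATCH_TH, CM_MAX = 0.8, 10.0     # MatchTh and CMMax, assumed values

def dogs_match(acquired, ddb_entry, per_mat_pix):
    """per_mat_pix: fraction of the DDB DoG's pixels judged similar to the acquired DoG."""
    (ia, ja) = acquired.center_of_mass
    (ib, jb) = ddb_entry.center_of_mass
    cm_dist = math.hypot(ia - ib, ja - jb)                 # equation (4.18)
    return per_mat_pix >= MATCH_TH and cm_dist < CM_MAX    # equation (4.19)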

4.6.2 DoGs Insertion in DoGs Database
The DoGs Insertion in the DoGs Database depends on several conditions that must be verified first. Those conditions are based on the DoGs detection results for all corners (after analyzing each video segment) and they differ for the Basic and Advanced Solutions (see Subsections 4.6.2.1 and 4.6.2.2). Inserting a DoG in the DDB consists in creating a new slot in the DDB containing all the information about the DoG just detected. Each slot contains all the characteristics specified at the beginning of this section, according to the implemented solution. The way the common characteristics are computed and saved at this point is explained next:
1. DoG type – The first time a DoG is added to the DDB, this parameter is defined according to the conditions stated below (see Subsections 4.6.2.1 and 4.6.2.2). When the DoG type is defined as a TV channel logo, its type does not change anymore; on the other hand, undefined logos may be updated to TV channel logo or non-TV channel logo in the DoG's later appearances.
2. Average value of chrominance components per pixel – Includes the average value of the chrominance components (Cr and Cb) over the video segments where the DoG was detected, for each pixel belonging to position (i, j) of the DoG binary map; this parameter never changes.
3. Center of mass – Includes the coordinates (iCM, jCM) of the DoG's center of mass, which is helpful information when matching the SPMs Intersection maps with the DoGs in the DDB (see Subsection 4.6.1); this parameter never changes.
4. Date of last occurrence – The first value assigned to this field is the ending date of the most recent segment containing this DoG; this parameter is updated each time the DoG appears.
5. Time persistence – The value assigned to this field is obtained by applying expression (4.17).
The conditions applied in each solution, in order to properly insert each DoG's data in the DDB, are presented next.

4.6.2.1 Basic Solution Conditions for DoGs Insertion in DDB
In this simpler approach, every newly detected DoG is considered potentially relevant and thus its actual type (TV channel logo or non-TV channel logo) has to be verified in later steps, depending on its time persistence; thus, if a given DoG is detected for the first time, it is added to the DDB and classified as an undefined logo.


4.6.2.2 Advanced Solution Conditions for DoGs Insertion in DDB
Taking into account the characteristics of each detected DoG, the following verifications are made:
1. If a given DoG is detected at the same time as another DoG already known and classified as a TV channel logo, it is ignored (i.e., it is not inserted in the DDB), as only one TV channel logo may exist in the same frame; the detected DoG is thus certainly of the "Other DoGs" type.
2. If a given DoG is detected at the same time as other DoGs, but none of them is classified as a TV channel logo, it is added to the DDB as an undefined logo.
3. If a given DoG is detected alone (i.e., no other DoGs were identified in the same video segment) for a consecutive time lower than four minutes, it is classified as an undefined logo and the value "1" is assigned to the flag "alone_appearance".
4. If a given DoG is detected alone (i.e., no other DoGs were identified in the same video segment) for a consecutive time higher than four minutes, it is classified as an undefined logo.
The additional DoG parameters specific to this solution, presented at the beginning of this section, are inserted in the DDB as follows:
A. Maximum consecutive duration – The maximum consecutive time the DoG was broadcast, TMaxConsec, is computed and saved. This parameter may be updated in later DoG appearances, whenever the maximum consecutive duration of the DoG increases.
B. Flag "alone_appearance" – If condition 3 stated above is verified, the value assigned to this parameter is "1"; otherwise, it is "0".

4.6.3 Database Update & Management
The Database Update & Management module is responsible for updating the information about all the DoGs in the DDB. Whenever a known DoG (i.e., a DoG already stored in the DDB) is detected, the maximum consecutive duration, date of last occurrence and time persistence are reevaluated. Taking the new information into account, several operations shall be executed in order to keep the DDB updated. Also, as already mentioned, the amount of different DoGs detected in the long term is enormous; thus, a main goal when building an intelligent and efficient DDB is to keep only the DoGs that actually represent logos. In order to restrain the computational cost, the amount of non-TV channel logos in the DDB is limited to a maximum capacity defined by MaxDDBSize; MaxDDBSize shall be defined taking into account the computational cost associated to the comparisons that have to be made before adding a new DoG to the DDB, in both the DoG Presence Verification and DoGs Matching modules. The following subsections describe the rationale behind the DDB building method and the corresponding operations; the Basic and Advanced Solutions are differentiated whenever necessary.

4.6.3.1 Basic Solution Rationale
In the context of this solution, every DoG is seen as a potential TV channel logo, which is the reason why, following this simple approach, all detected DoGs are first added to the DDB as undefined logos. The principle followed to later classify a DoG in the DDB as a TV channel logo or a non-TV channel logo is its time persistence, as the examination of real TV content allows concluding that, in the long term, a TV channel logo is on air for much longer than any other type of DoG. The major risk associated to this approach is related to the definition of the thresholds associated to the time persistence evaluation, which may lead to the potentially poor assignment of the TV channel logo classification to DoGs that are, for instance, "program/series logos" (see the beginning of Subsection 4.6.3). If this happens, the effect on the final results may be an incorrect classification when "program/series logos" appear in commercials, as happens in some cases (see Section 4.1).

4.6.3.2 Advanced Solution Rationale
Taking into account what was stated at the beginning of Subsection 4.6.3, the DoGs considered important to keep in the DDB are those identified at the beginning of Section 4.6 as TV channel logos and, among the non-TV channel logos, the "commercial brand logos" and the "program/series logos". The reason why the "Other DoGs" should not be kept in the DDB (and they are not) is that they do not represent specific instances (a TV channel, a program or a brand), are circumstantial to the content and may vary a lot even during a single program; leaving them in the DDB would increase not only the required memory but also the computational cost of the comparison operations made in the DoG Presence Verification and DoGs Matching modules. On the other hand, "commercial brand logos" and "program/series logos" should be kept in the DDB because they appear regularly on TV and may have a real impact on each video segment classification - this is due to the way the final classification of each video segment is determined (see Section 4.7).

4.6.3.3 DoGs Serialization in DDB
In order to organize all the acquired DoGs in a logical order, the date of last occurrence was chosen as the best parameter. The DDB works as a queue whose first element is the DoG that appeared most recently, while the last element is the oldest one. This criterion was adopted to make the search processing more efficient, as the DoG most likely to appear at any moment is the one that has been broadcast most recently. By doing this, it is expected to reduce the computational cost and run time of the stages that imply comparison operations at the pixel level (e.g., DoG Presence Verification and DoGs Matching), because the access to the most recently broadcast DoGs has priority. This procedure is common to both the Basic and Advanced Solutions.

4.6.3.4 DoGs Type Decision
As explained in Subsection 4.6.2, DoGs are initially inserted in the DDB with different possible DoG type labels, according to the applied solution (Basic or Advanced) and the corresponding conditions. However, the first decision may not be definitive and, when a DoG already in the DDB is detected again, its status may be reviewed. The following paragraphs establish the conditions and procedures applied to each situation, which also depend on the adopted solution.

A. Basic Solution Conditions for DoGs Type Decision
After the DoG has been on air for a duration T_OnAir worth analyzing (i.e., above T_Analysis), its time persistence, TimePersist, along the broadcast video is evaluated and compared to a given threshold, ThPersist. If TimePersist is above ThPersist, the DoG is classified as a TV channel logo; otherwise, it is classified as a non-TV channel logo. The following expression clarifies this procedure:

DoGType = { TV channel logo, if T_OnAir ≥ T_Analysis and TimePersist > ThPersist; Non-TV channel logo, else }    (4.20)

Once a DoG is classified as a TV channel logo or a non-TV channel logo, this classification is definitive and the DoG type is not inspected anymore.
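A small sketch of this decision follows; T_Analysis and ThPersist are placeholders, as the Thesis does not fix their values here.

T_ANALYSIS, TH_PERSIST = 3600.0, 0.5    # assumed: one hour of air time, 50% persistence

def decide_type(t_on_air, time_persist):
    """Basic Solution type decision of equation (4.20); times in seconds."""
    if t_on_air < T_ANALYSIS:
        return "undefined"              # not enough evidence yet
    return "tv_logo" if time_persist > TH_PERSIST else "non_tv_logo"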

B. Advanced Solution Conditions for DoGs Type Decision
Although the TV channel logo status is definitive and may not be changed (the same happens with non-TV channel logos), an undefined logo may evolve in three different ways: it may be concluded that it is a TV channel logo or a non-TV channel logo, and it may also be deleted from the DDB, as follows:
1. If a DoG already classified (in the DDB) as an undefined logo appears at the same time as a DoG classified as a TV channel logo, its "alone_appearance" flag is checked:
   a. If the flag value is "0", the DoG is deleted from the DDB, as it is probably a DoG of the type described as "Other DoGs" (see the beginning of this subsection).
   b. If the flag value is "1", it is considered a series/program brand logo and thus it is classified as a non-TV channel logo.
2. If a DoG already classified (in the DDB) as an undefined logo does not occur at the same time as a TV channel logo, the following checks take place:
   a. If the DoG appears alone:
      i. If TConsecMax is higher than four minutes, it is classified as a TV channel logo;
      ii. If TConsecMax is lower than four minutes and the flag "alone_appearance" value is "1", it is probably a commercial brand logo and the DoG is classified as a non-TV channel logo;
      iii. If TConsecMax is lower than four minutes and the flag "alone_appearance" value is "0", no modification is made and the DoG keeps its undefined logo classification. After two such checks, if no conclusion about changing the classification of this DoG is reached, the DoG is deleted from the DDB.
   b. If the DoG appears at the same time as other undefined logos, each one is analyzed individually:
      i. If the DoG's persistence is higher than a given threshold, PersTVlogo, and TConsecMax is higher than four minutes, it is classified as a TV channel logo; when other DoGs are also present, those with the "alone_appearance" value equal to "1" are kept (as they may be commercial brand logos) and the others are deleted;
      ii. If the previous condition is not verified, no decision is taken. After two such checks, if no conclusion about changing the DoG classification is reached, the DoG is deleted from the DDB.

4.6.3.5 DoGs Removal from DDB
The Database Update & Management module is also responsible for removing from the DDB the DoGs that no longer need to be stored. At this point, the goal is to remove from the DDB old and uncommon TV channel logos and non-TV channel logos, based on the date of the last occurrence of each DoG. In the first case, this operation is justified by the observation that, sometimes, broadcasters change their TV channel logo and, when this happens, the change is usually definitive. Thus, if a TV channel logo does not appear for a given recording time period (e.g., longer than one month, to avoid temporary changes like those that occur at Christmas time), it should be removed from the DDB. For non-TV channel logos, removing DoGs is also a matter of performance; as stated along this Thesis, this second type of DoG is critical, because there are many different situations where they may occur, and a mechanism to limit their presence in the DDB is required. Following the rationale used for the DoGs serialization, it is easy to realize that an endless database could generate an unbearable amount of data to process, notably associated to the pixel-level operations, making it very inefficient. Also, a non-negligible part of the detected DoGs occurs in very specific situations that may never be broadcast again (e.g., a particular interview or sports game), making it safe to delete some of them. In the following, two conditions to remove non-TV channel logos from the DDB are proposed: the first one (A) depends on the DDB size and the second one (B) depends on the date of each DoG's last appearance.


A. Decision based on the size of the DDB – When the value assigned to MaxDDBSize is reached, the DoG in the last position of the DDB queue is removed.
B. Decision based on the last appearance of each DoG – An efficient way to detect a DoG that is not broadcast with an acceptable regularity is to check the date of its last appearance (LastApp) and compare it to the current date (CurrDate). The following expression summarizes this decision:

RemovalDecision = { Yes, if CurrDate − LastApp > MaxTempDist; No, else }    (4.21)

If the difference between CurrDate and LastApp is above the maximum acceptable temporal distance to the last appearance, MaxTempDist, the DoG is removed from the DDB. This procedure is useful for both the Basic and Advanced Solutions and may be applied in both cases.
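Both removal rules can be sketched as follows, again reusing the DoGEntry structure; the one-month window and the size cap are illustrative values.

from datetime import datetime, timedelta

MAX_TEMP_DIST = timedelta(days=30)   # MaxTempDist: about one month (assumed)
MAX_DDB_SIZE = 50                    # MaxDDBSize (assumed)

def prune_ddb(ddb, now=None):
    """ddb: list of DoGEntry, newest first (the serialization order of 4.6.3.3)."""
    now = now or datetime.now()
    kept = [e for e in ddb
            if now - e.last_detection <= MAX_TEMP_DIST]    # rule B, equation (4.21)
    return kept[:MAX_DDB_SIZE]                             # rule A: drop the queue tail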

4.7 Video Segment Classification

The Video Segment Classification is the final decision stage and determines the final output of the proposed global solution: a segment is classified either as a Commercial Block or as a Regular Program. Figure 4.27 presents the flowchart designed for this last stage. The following situations are considered when classifying a video segment:
1. No logo detected – If no DoG is detected in any corner, the video segment is classified as belonging to a Commercial Block.
2. TV channel logo detected – When a segment is associated to a TV channel logo, the segment is classified as a Regular Program.
3. Undefined logo detected – When a segment is associated to an undefined logo, the segment is classified as a Regular Program. This decision may not be obvious, but the goal is not to lose potentially relevant content, assuming the user wants to see the regular programs: as an undefined logo may later be classified as a TV channel logo, this classification guarantees that no regular content (i.e., non-commercial content) is wrongly classified as commercial.
4. Non-TV channel logo(s) detected – When all detected DoGs are non-TV channel logos, the segment is classified as a Commercial Block.
After this step, the most recent video segment has completed its processing. A report with the video segment classification is updated and the algorithm proceeds to the next video segments. In order to improve the performance of the Video Segment Classification module (and thus the results of the whole algorithm), a simple additional procedure aiming to correct false negative classifications is suggested; it consists in a filtering mechanism based on the classification assigned to the neighboring video segments of each segment – see Figure 4.28.


Figure 4.27 – Video Segment Classification module flowchart.

Figure 4.28 – Output structure that should be corrected.

In Figure 4.28, each vertical line represents a video segment. The previous Nprev and the following Nfoll segments (taking as reference the segment classified as Commercial Block) are Regular Program; considering that the existence of a commercial block composed of a single video segment makes no sense (and given that this type of situation was never observed in the recorded video material), the video segment identified as Commercial Block is probably a false negative, and thus it should be reclassified as Regular Program. Considering a real-time classification application, the inclusion of this reclassification procedure would require introducing a delay of Nfoll video segments, to guarantee the availability of the necessary information to take a reclassification decision. For offline applications, the reclassification mechanism could use the same delay strategy, or wait until the full video sequence is processed and then sweep the whole sequence, reclassifying the segments considered incorrect, according to the previous conditions.
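A possible sketch of this filtering mechanism, operating over the per-segment labels; the Nprev and Nfoll values are assumptions.

N_PREV, N_FOLL = 2, 2    # assumed neighborhood sizes

def reclassify(labels):
    """labels: per-segment list with 'RP' (Regular Program) or 'CB' (Commercial Block)."""
    out = list(labels)
    for i, lab in enumerate(labels):
        prev = labels[max(0, i - N_PREV):i]
        foll = labels[i + 1:i + 1 + N_FOLL]
        if lab == "CB" and prev and foll and all(x == "RP" for x in prev + foll):
            out[i] = "RP"    # an isolated Commercial Block segment is very likely an error
    return out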


Chapter 5 Performance Evaluation

In this chapter, the performance of the novel commercials detection solution proposed in Chapter 4 is assessed under appropriate test conditions. An exhaustive evaluation methodology has been designed to obtain reliable and representative results, as well as clear and relevant conclusions. Tests were performed with two main modules of the proposed solution, notably the Shot Change Detection (SCD) algorithm and the DoG Acquisition (DoGA) algorithm, and also with the Global Solution (GS) for Commercials Detection. As the video shot is considered the basic unit for classification purposes, it is important to evaluate how well the SCD algorithm executes its task, as errors at this early point of the overall process may have a big impact later. The functions associated to the DoG Acquisition algorithm should also be subject to specific scrutiny, as they constitute a critical step of the global process; in particular, the DoGA behavior for different types of logos should be evaluated. Finally, the whole GS, which includes not only the SCD and DoGA modules but also the whole DoGs database management system, is assessed. Objective statistical measures are computed, presented and discussed for the three evaluation experiments. All tests were performed by running the implemented algorithm on a computer with a 64-bit Operating System, an Intel® Core™ i7-4770 CPU @ 3.40 GHz processor and 16 GB of RAM.

5.1 Test Material
This section describes the test material used to assess the performance in the three different experiments mentioned above.

5.1.1 Shot Change Detection Assessment Dataset
To assess how well the SCD performs in detecting hard cuts, eight video sequences have been created, where each sequence is a composition of recordings from a single TV channel. The videos have a spatial resolution of 1920×1080 pixels and include several kinds of content, including movies, sports, cartoons and news programs. The videos mostly include hard cuts, and only this type of shot transition is taken into account in the assessment. Table 5.1 characterizes the video sequences used to perform the experiment, which are available and may be downloaded from [59].


Table 5.1 – Shot Change Detection test sequences characterization.

Sequence     Nº of Frames   Time (s)   Hard Cuts
biggsSCD         4581          183         49
hwdSCD           3933          157         55
sicNotSCD        3113          124         28
aBolaSCD        12901          516         71
sicSCD           6746          269         43
qSCD             9633          385         95
sicRadSCD        6386          255         97
rtp1SCD          8787          351         72
TOTAL           56080         2240        510

5.1.2 DoG Acquisition Assessment Dataset
To verify how the DoGA module performs in detecting DoGs, eight video sequences have been created, each one being a composition of recordings from a single TV channel. The TV channels were selected in order to ensure a high diversity of DoGs; Table 5.2 presents the color code created to distinguish the different types of DoGs and Table 5.3 presents and describes the actual TV logos considered.

Table 5.2 – Color code used in the logo types classification.

Opaque & Static
Opaque & Dynamic (Texture)
Semitransparent & Static
Semitransparent & Dynamic (Shape)

Table 5.3 – Logos used to test the DoGA module.

TV Channel    | Sequence    | Logo Description
Biggs         | biggsDoGA   | Opaque; static in terms of shape and texture; white and ; well defined limits.
Hollywood     | hwdDoGA     | Opaque; blue and white; static in terms of shape and texture; word "Hollywood" with not well defined limits (may be confused with the background).
SIC Notícias  | sicNotDoGA  | Opaque; static in terms of shape and texture; not well defined limits; white bottom with the current time.
A Bola        | aBolaDoGA   | Opaque; dynamic in terms of shape (shades regularly passing on the image); yellow, red and white; well defined limits; contains the current time on the bottom.
SIC           | sicDoGA     | Opaque; dynamic in terms of shape (shades always passing on the image); well defined limits except around the letter "I", the right and upper sides of "S" and the left and upper sides of "C"; contains the current time on the bottom.
Canal Q       | qDoGA       | Semitransparent; static in terms of shape and texture; green, and purple; well defined limits.
SIC Radical   | sicRadDoGA  | Semitransparent; static in terms of shape and texture; white; not well defined limits (may be confused with the background).
RTP1          | rtp1DoGA    | Semitransparent; dynamic in terms of shape (the left form regularly rolls on itself, turning blue, and then returns to its original form); not well defined limits.


The video sequences and their content are characterized in the following (they can be accessed in [60]):
1. Sequence "biggsDoGA" – Cartoon broadcast by "Biggs", a Portuguese TV channel targeted at children; the only corner containing DoGs is the upper left one, where the "Biggs" channel logo is exhibited.
2. Sequence "hwdDoGA" – Assembly of scenes extracted from two different movies, broadcast by "Hollywood", an Iberian TV movies channel; it contains two DoGs: in the upper left corner, the logo of the TV station; in the upper right corner, a DoG announcing a movie that will be broadcast later.
3. Sequence "sicNotDoGA" – Two news programs recorded on different days, broadcast by "SIC Notícias", a Portuguese TV news channel. As in most news programs, DoGs can be found not only in the corner where the TV channel logo is located (in this case, the upper left corner) but also in the other corners.
4. Sequence "aBolaDoGA" – Several video segments extracted from different types of programs, namely a program announcement, an animated cartoon, part of an interview and its final credits, and the highlights of a football game; all those segments were broadcast by "A Bola TV", a Portuguese sports TV channel. Besides the TV channel logo, several DoGs are used in the interview and in the football game highlights (e.g., presenting the game's current time and score).
5. Sequence "sicDoGA" – Extracted from a news program, it contains three news reports, including a street report; it was broadcast by "SIC", a Portuguese private TV channel.
6. Sequence "qDoGA" – Talk-show broadcast by "Canal Q", a Portuguese comedy TV channel. It contains the TV channel logo in the upper left corner and a DoG in the upper right corner, with information to the viewer on how to record the program in the set-top box.
7. Sequence "sicRadDoGA" – Part of a wrestling show and also part of a reality TV series ("Shark Tank"), broadcast by "SIC Radical", a Portuguese entertainment TV channel. It contains the TV channel logo in the upper left corner during all the shots, and a DoG in the lower right corner during part of the wrestling show.
8. Sequence "rtp1DoGA" – Part of a news show broadcast by "RTP1", a Portuguese public TV channel. The TV channel logo has many characteristics that are challenging for the designed DoGA algorithm (see Table 5.3). This is the longest test sequence and also the one with the most difficult content.
Table 5.4 characterizes the sequences presented above in terms of duration and total number of analyzed corners.

Table 5.4 – DoGA’s test sequences characterization.

Sequence      Shots   Time (s)   Total Number of Frame Corners
biggsDoGA       62       183            248
hwdDoGA         57       157            228
sicNotDoGA      42       242            168
aBolaDoGA       70       445            280
sicDoGA         44       286            176
qDoGA           54       338            216
sicRadDoGA     104       328            416
rtp1DoGA       106       596            424
TOTAL          539      2575           2156


5.1.3 Global Solution for Detecting Commercials Assessment Dataset
This subsection presents the sequences used to assess the performance of the proposed global solution for detecting commercials, including both the SCD and DoGA modules as well as the other modules presented in Chapter 4; as mentioned previously, for the DoGs database management, only the Basic Solution was considered. Three video sequences were created (and may be accessed in [61]) for this test experiment, each one composed of recordings from a single TV channel, containing regular program content and commercials (among these, some with and some without a commercial brand logo on the screen). Table 5.5 presents a summary of each video's characteristics.

Table 5.5 – Global Solution test sequences characterization.

Sequence    Shots   Frames   Time (s)   Regular Program Blocks   Commercial Blocks
sicNotGS      87      6824      273               4                      3
tviGS         65      4215      168               3                      2
rtpGS        181     17708      708               4                      3
TOTAL        333     28747     1149              11                      8

The video sequences and their content are characterized in the following:
1. Sequence "sicNotGS" – Composition of four blocks of different regular programs, interspersed with three commercial blocks, extracted from the Portuguese TV channel "SIC Notícias" (see Figure 5.1). The content of each block is defined next:
• Regular Programs 1, 2, 3 and 4 – Parts of programs extracted from "SIC Notícias".
• Commercial Block 1 – Commercial of "Well's", which broadcasts its commercial brand logo on the upper right corner of the screen during the whole advertising time.
• Commercial Block 2 – Commercial with no commercial brand logo included.
• Commercial Block 3 – Repetition of the "Well's" commercial also chosen for Commercial Block 1. This duplication is intended to test if the algorithm is able to detect the commercial as such, as it is supposed to, once the time persistence and the DoG type corresponding to this non-TV channel logo are verified in the Database Update & Management step.

Figure 5.1 – Structure of the “sicNotGS” test sequence.

2. Sequence “tviGS” – Composition of three blocks of different regular programs, interspersed with two commercial blocks, extracted from the Portuguese TV channel “TVI” (see Figure 5.2). Each block’s content is defined next:
• Regular Programs 1, 2 and 3 – Parts of programs extracted from “TVI”.
• Commercial Block 1 – Concatenation of parts of three different commercials, the last one being a commercial for “Danonino” yogurts, whose brand logo is displayed in the lower right corner of the screen during the whole advertising time.
• Commercial Block 2 – Repetition of the “Danonino” commercial used at the end of Commercial Block 1. As before, it is intended to verify if the algorithm is able to detect the commercial as such.

Figure 5.2 - Structure of "tviGS" test sequence.


3. Sequence “rtpGS” – Composition of four blocks of different regular programs extracted from the Portuguese public broadcaster “RTP1”, interspersed with three commercial blocks (see Figure 5.3):
• Regular Programs 1, 2, 3 and 4 – Parts of programs extracted from “RTP1”.
• Commercial Block 1 – Commercial constituting the transition between two programs.
• Commercial Block 2 – Concatenation of parts of four different commercials; two of them are commercials of the companies “ehs” and “ALC”, which contain the commercial logo on the screen (in the lower left and upper left corners, respectively). In the case of the “ehs” commercial, the brand logo is visually located near the lower left corner but is not covered by the region analyzed by the algorithm, showing that a well-defined region on the screen corners is important to exclude some potential TV channel logo candidates.
• Commercial Block 3 – Repetition of the “ALC” commercial used at the end of Commercial Block 2. As before, it is intended to verify if the algorithm is able to detect the commercial as such.

Figure 5.3 - Structure of the “rtp1GS” test sequence.

5.2 Performance Assessment Methodology and Metrics

The objective of each assessment experiment is to evaluate how well each algorithm under assessment performs the task it targets, which implies the definition of appropriate objective performance metrics. For binary classifiers such as those used in this work, the following statistical metrics are typically used:

1. True Positive Rate (TPR, also known as Recall) – Proportion of positives that are identified as such against the total number of actual positive conditions:

$$TPR = \frac{TP}{TP + FN} \qquad (5.1)$$

2. True Negative Rate (TNR, also known as Specificity) – Proportion of negatives that are correctly identified as such against the total number of actual negative conditions:

$$TNR = \frac{TN}{TN + FP} \qquad (5.2)$$

3. False Positive Rate (FPR) – Proportion of negative events wrongly classified as positive against the total number of actual negative conditions:

$$FPR = \frac{FP}{TN + FP} \qquad (5.3)$$

4. False Negative Rate (FNR) – Proportion of positive events wrongly classified as negative against the total number of actual positive conditions:

$$FNR = \frac{FN}{FN + TP} \qquad (5.4)$$

5. Precision (Prec) – Proportion of true positives against all positive results:

$$Prec = \frac{TP}{TP + FP} \qquad (5.5)$$

6. F1-Score (F1) – Harmonic mean of true positive rate and precision:

$$F1 = 2 \cdot \frac{TPR \cdot Prec}{TPR + Prec} \qquad (5.6)$$

7. Accuracy (Acc) – Proportion of true results (true positives and true negatives) against the total number of cases:

$$Acc = \frac{TP + TN}{TP + TN + FP + FN} \qquad (5.7)$$

As the goal of each module - SCD, DoGA and GS - varies, the concepts of TP, FP, TN and FN are also different for each situation. The following subsections clarify the definitions used for these terms in each case.
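Since these metrics are used throughout the following subsections, a minimal computation sketch is given below; it is illustrative code, not the thesis implementation, and all names are assumptions:

import math

def classification_metrics(tp, tn, fp, fn):
    """Compute the Eq. (5.1)-(5.7) metrics from raw confusion counts."""
    tpr  = tp / (tp + fn) if (tp + fn) else 0.0   # Recall, Eq. (5.1)
    tnr  = tn / (tn + fp) if (tn + fp) else 0.0   # Specificity, Eq. (5.2)
    fpr  = fp / (tn + fp) if (tn + fp) else 0.0   # Eq. (5.3)
    fnr  = fn / (fn + tp) if (fn + tp) else 0.0   # Eq. (5.4)
    prec = tp / (tp + fp) if (tp + fp) else 0.0   # Eq. (5.5)
    f1   = 2 * tpr * prec / (tpr + prec) if (tpr + prec) else 0.0  # Eq. (5.6)
    acc  = (tp + tn) / (tp + tn + fp + fn)        # Eq. (5.7)
    return {"TPR": tpr, "TNR": tnr, "FPR": fpr, "FNR": fnr,
            "Prec": prec, "F1": f1, "Acc": acc}

# Example with the "biggsSCD" row of Table 5.8: reproduces
# TPR = 91,8%, FPR = 0,13%, FNR = 8,2%, Prec = 88,2%, F1 = 90,0%.
print(classification_metrics(tp=45, tn=4526, fp=6, fn=4))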

5.2.1 Shot Change Detection Assessment

All the video sequence frames were manually classified, frame-by-frame, as corresponding or not to a hard cut. As the goal of the SCD module is to detect hard cuts, the first frame after a hard cut (i.e., the first frame of each new shot) is considered as a "positive" event. The SCD output is then compared with the manually created ground truth, produced beforehand, and each frame is labeled according to the following definitions:
• True Positive (TP) – a video frame correctly identified as being a hard cut;
• True Negative (TN) – a video frame correctly identified as not being a hard cut;
• False Positive (FP) – a video frame wrongly identified as being a hard cut;
• False Negative (FN) – a video frame wrongly identified as not being a hard cut.
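As an illustration of this frame-level scoring, the following sketch compares detected cut frames against the ground truth; the names are hypothetical and an exact frame-index match is assumed:

def score_scd(detected, ground_truth, n_frames):
    """Frame-level confusion counts for hard cut detection; each set holds
    the indices of frames labeled as the first frame of a new shot."""
    tp = len(detected & ground_truth)   # cuts found at the right frame
    fp = len(detected - ground_truth)   # detected cuts with no real cut
    fn = len(ground_truth - detected)   # real cuts that were missed
    tn = n_frames - tp - fp - fn        # all remaining frames
    return {"TP": tp, "TN": tn, "FP": fp, "FN": fn}

# Toy example: 3 real cuts, 3 detections, one of them wrong.
print(score_scd({10, 25, 40}, {10, 25, 60}, n_frames=100))
# -> {'TP': 2, 'TN': 96, 'FP': 1, 'FN': 1}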

5.2.2 DoG Acquisition Assessment

Each video test sequence was manually classified from two different perspectives:
I. Shot changes – Each sequence was labeled in terms of shot transitions, comprising hard cuts, fade-ins, fade-outs, dissolves and wipes.
II. DoGs per shot – Each image corner was labeled as containing or not DoGs, during each shot.

As stated in Chapter 4, the DoGA module was designed to detect DoGs on the screen corners, which include not only TV channel logos but also commercial logos, current time, live traffic/weather updates, horizontal bars with headlines and sign language areas; thus, any DoG on a screen corner during a whole video shot is considered as a "positive" event. The DoGA module runs over the test sequences and assigns, to each shot/corner pair, a binary classification signaling the detection, or not, of a DoG. The DoGA output is then compared with the ground truth, produced beforehand, and each shot/corner pair is labeled according to the following definitions:
• True Positive (TP) – A corner correctly identified as containing a DoG;
• True Negative (TN) – A corner correctly identified as not containing a DoG;
• False Positive (FP) – A corner wrongly identified as containing a DoG;
• False Negative (FN) – A corner wrongly identified as not containing a DoG.

The results are analyzed taking into account two perspectives:
1. Type I analysis – It only considers sequences with a TV channel logo and it only takes into account the inspection of the corner where it is known that the TV channel logo is located. This allows an analysis focused on the specific characteristics of each type of TV logo, as defined in Subsection 5.1.3.
2. Type II analysis – It takes into account all the screen corners, no matter if they contain a TV channel logo or not. This allows an analysis over all the possible DoG cases, regardless of their specific characteristics.
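A possible way to derive the two analysis types from the same per-(shot, corner) labels is sketched below; the data, names and corner encoding are illustrative assumptions, not the thesis code:

def confusion(gt, pred, corners=None):
    """Confusion counts over (shot, corner) pairs; restricting `corners`
    to the logo corner gives the Type I analysis, None gives Type II."""
    c = {"TP": 0, "TN": 0, "FP": 0, "FN": 0}
    for (shot, corner), truth in gt.items():
        if corners is not None and corner not in corners:
            continue
        p = pred[(shot, corner)]
        if truth and p:
            c["TP"] += 1
        elif not truth and not p:
            c["TN"] += 1
        elif p:
            c["FP"] += 1
        else:
            c["FN"] += 1
    return c

# Toy ground truth / output for two shots ("UL" = upper left corner, etc.)
gt   = {(0, "UL"): True, (0, "LR"): False, (1, "UL"): True, (1, "LR"): True}
pred = {(0, "UL"): True, (0, "LR"): True, (1, "UL"): False, (1, "LR"): True}
print(confusion(gt, pred, corners={"UL"}))  # Type I: logo corner only
print(confusion(gt, pred))                  # Type II: all corners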


Table 5.6 characterizes each sequence in terms of duration and defines the amount of positive events for both analysis types mentioned above (I and II); it also contains the total number of frame corners analyzed in each sequence, whose value is four times the number of shots (four corners per shot). For instance, in sequence “hwdDoGA”, among a total of 224 analyzed screen corners, 56 screen corners contain the TV channel logo and 112 contain some type of DoG (including both TV channel and non-TV channel logos).

Table 5.6 – DoGA’s test sequences characterization.

Sequence      Total Number of Analyzed   Ground Truth
              Screen Corners             Type I   Type II
biggsDoGA             196                  49        49
hwdDoGA               224                  56       112
sicNotDoGA            168                  42        81
aBolaDoGA             280                  70       180
sicDoGA               176                  44       130
qDoGA                 216                  54       108
sicRadDoGA            416                 104       120
rtp1DoGA              424                 106       286
TOTAL                2100                 525      1066

5.2.3 Global Solution for Detecting Commercials Assessment

The best way to test the proposed solution would be to run it for several hours, days or even weeks, allowing the algorithm to face all possible kinds of content, shot transitions and DoGs used by a regular TV broadcaster. However, this would require an enormous amount of recording, storage and labeling to obtain controlled results with a ground truth, which was not possible due to the limited time available for this thesis.

Each sequence was manually labeled with ground truth for the following characteristics:
I. Shot changes – Each sequence was labeled in terms of shot transitions, comprising hard cuts, fade-ins, fade-outs, dissolves and wipes.
II. Content type identification – Each video shot resulting from the previous labeling was identified as belonging to a regular program or a commercial block.

As the very final goal of this Thesis is to distinguish, in each video, regular programs from commercial blocks, a binary classification is implied. The algorithm runs over the test video sequences and assigns, to each frame, a Regular Program or a Commercial Block classification. Then, two approaches (Type A and Type B analyses) are followed in order to consider each classification event as “positive” or “negative”, and the produced output is compared to the ground truth produced beforehand:
1. Type A analysis – Considering an application whose main goal is to detect Regular Programs, a video segment is considered as a “positive” event when it corresponds to a Regular Program. In this case:
• True Positive (TP) – A video frame correctly identified as Regular Program;
• True Negative (TN) – A video frame correctly identified as Commercial Block;
• False Positive (FP) – A video frame wrongly identified as Regular Program;
• False Negative (FN) – A video frame wrongly identified as Commercial Block.
2. Type B analysis – Considering an application whose main goal is to detect Commercial Blocks, a video segment is considered as a “positive” event when it corresponds to a Commercial Block. In this case:
• True Positive (TP) – A video frame correctly identified as Commercial Block;
• True Negative (TN) – A video frame correctly identified as Regular Program;


• False Positive (FP) – A video frame wrongly identified as Commercial Block;
• False Negative (FN) – A video frame wrongly identified as Regular Program.

Note that the Type B counts follow from the Type A counts by simply swapping the positive class (see the sketch below).

To understand the impact of the SCD module, the global system algorithm is run twice for each test sequence: in the first run, the algorithm takes into account the ground truth defined for the shot changes; this is because, even though only hard cut detection is implemented, a real system would be optimized if all types of shot transitions were detected. In the second run, the detection results of the implemented SCD are used.
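The following minimal sketch (illustrative, not from the thesis code) makes the Type A/Type B relation explicit; it is consistent with, e.g., Tables 5.14 and 5.15, where TP/TN and FP/FN swap between the two analyses:

def swap_positive_class(counts):
    """Turn Type A counts (positive = Regular Program) into Type B counts
    (positive = Commercial Block): TP <-> TN and FP <-> FN swap."""
    return {"TP": counts["TN"], "TN": counts["TP"],
            "FP": counts["FN"], "FN": counts["FP"]}

# Ground-truth row of Table 5.14 -> ground-truth row of Table 5.15:
print(swap_positive_class({"TP": 5597, "TN": 1097, "FP": 130, "FN": 0}))
# -> {'TP': 1097, 'TN': 5597, 'FP': 0, 'FN': 130}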

5.3 Results and Analysis

This section presents and discusses the results of all the performed assessment experiments.

5.3.1 Shot Change Detection Assessment Experiment

In this subsection, the results obtained for the SCD assessment experiment are presented (see Table 5.8) and analyzed; the algorithm parameters used in the tests are shown in Table 5.7.

Table 5.7 – Parameters for the SCD assessment experiment.

SCD Parameters
  WF          14
  lowWind     5000
  highWind    70000
  TpA         8
  TpB         6

Histogram Parameters
  # Bins       12
  Normalized?  No

In general, taking into account the context of this Thesis, a FN occurrence is more worrisome than a FP occurrence (i.e., it is more problematic to miss a hard cut than to detect an inexistent one), because a FN may hide the boundary between a Regular Program and a Commercial Block and, in that case, it is likely to cause the misclassification of the corresponding video segment. Thus, in this context, an algorithm to detect hard cuts should be oriented to produce as few FNs as possible.

Observing the results, the following comments are relevant:

1. Regarding false positives:
a. The metrics related to the existence of FPs are FPR and Prec. Taking into account the global average results, one can see that FPR = 0,08% and Prec = 92,4%, which are satisfactory values. As stated before, FPs are not likely to have a relevant negative impact on the final classification performance; however, more FPs still imply more video segments to analyze and thus more run time, which has a negative impact on the overall performance of the proposed solution.

Table 5.8 – SCD module performance results.

Sequence      Number of  Hard Cuts      TP     TN    FP  FN   TPR    FPR    FNR     Prec    F1
              Frames     Ground Truth
“biggsSCD”      4581         49          45   4526    6   4  91,8%  0,13%   8,2%   88,2%  90,0%
“hwdSCD”        3933         55          47   3878    0   8  85,5%  0,00%  14,5%  100,0%  92,2%
“sicNotSCD”     3113         28          27   3084    1   1  96,4%  0,03%   3,6%   96,4%  96,4%
“aBolaSCD”     12901         71          65  12815   15   6  91,5%  0,12%   8,5%   81,3%  86,1%
“sicSCD”        6746         43          40   6701    2   3  93,0%  0,03%   7,0%   95,2%  94,1%
“qSCD”          9633         95          80   9537    1  15  84,2%  0,01%  15,8%   98,8%  90,9%
“sicRadSCD”     6386         97          90   6386    9   7  92,8%  0,14%   7,2%   90,9%  91,8%
“rtp1SCD”       8787         72          62   8707    8  10  86,1%  0,09%  13,9%   88,6%  87,3%
Total          56080        510         456  55528   42  54  90,2%  0,08%  10,6%   92,4%  91,1%

b. Most FPs result from sudden and strong changes in the brightness conditions. This effect is shown in Figure 5.4, where a spotlight lights up in the upper right corner between consecutive frames, causing a significant difference in the luminance histograms that originates the false detection of a hard cut.

(a) (b) Figure 5.4 – Strong brightness change in consecutive frames, visible in the upper right corner.

Additional conditions should be designed to prevent these false events and thus improve the FP-related metrics.

2. Regarding false negatives:
a. The metrics related to the effect of FNs are FNR and TPR; considering the global averages, these metrics reached FNR = 10,6% and TPR = 90,2%. FNR is here the most critical metric to analyze, as its value implies that about one out of ten hard cuts is not detected; when that happens in a transition between Regular Programs and Commercial Blocks, the risk of misclassification of the video segments at that boundary increases. In some sequences (e.g., “hwdSCD”, “qSCD” and “rtp1SCD”), FNR presents percentages that may be considered too high, having in consideration the potential harm caused by FNs; reducing this metric, even at the cost of increasing FPR, is recommended. In the context of the results above, and comparing the values of FPR to FNR, one should take into account the fact that the amount of positive events (49) is much lower than the amount of negative events (4532). Looking at the formula used to calculate FPR, one can conclude that a FP event is easily “hidden” among the large amount of TNs; on the other hand, each FN has a big impact on FNR, as the number of FNs has the same order of magnitude as the number of TPs.
b. Most FNs result from particular cases where, despite the differences between the frames, the histogram does not vary enough in the context of the frame window used, and thus the adopted threshold is not suited to that situation.
3. SCD run time – Table 5.9 shows the ratio between the SCD run time and the actual duration of each sequence. Considering the tested video sequences as a whole, the achieved run time ratio is lower than one third, which indicates that this module may run about three times faster than real time.

Table 5.9 – Ratio between SCD run time and duration of each video sequence.

Sequence     Duration (s)  Run Time (s)  Run Time Ratio
biggsSCD         183            57           31,1%
hwdSCD           157            49           31,2%
sicNotSCD        124            38           30,6%
aBolaSCD         516           157           30,4%
sicSCD           269            81           30,1%
qSCD             385           115           29,9%
sicRadSCD        255            78           30,6%
rtp1SCD          351           108           30,8%
AVERAGE          280           85,4          30,6%

5.3.2 DoG Acquisition Algorithm Assessment

The algorithm parameters used to perform the tests are shown in Table 5.10; the results obtained for the DoGA assessment experiment are presented in Table 5.11.

Table 5.10 – Parameters for the DoGA assessment experiment.

Key Frames Selection parameters
  PerKF          10%
  SegmentLenTh   100
  K              10

Canny edge detector parameters (OpenCV’s Canny function)
  Threshold1     25
  Threshold2     75
  Aperture Size  3

Chrominance components variance threshold
  ThChrom        10
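For reference, the edge detection call with the Table 5.10 parameters would look like the following sketch; the input file name and variable names are illustrative:

import cv2

frame = cv2.imread("keyframe.png")  # any BGR key frame would do
gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
# Canny edge detector with the thresholds and aperture from Table 5.10
edges = cv2.Canny(gray, threshold1=25, threshold2=75, apertureSize=3)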

Note that all the test sequences have the TV channel logo present; accordingly, the concepts of TN and FP do not apply to the lines corresponding to the Type I analysis and thus the dependent metrics - TNR, Prec, F1 and Acc - are not defined for those cases.

Table 5.11 – DoGA module: performance results.


Sequence       Type  Number of   TP    TN   FP  FN   TPR     TNR     Prec     F1      Acc
                     Corners
“biggsDoGA”     I       49       49    -    -    0  100,0%    -       -        -       -
                II     196       49   147   0    0  100,0%  100,0%  100,0%  100,0%  100,0%
“hwdDoGA”       I       56       56    -    -    0  100,0%    -       -        -       -
                II     224      112   106   6    0  100,0%   94,6%   94,9%   97,4%   97,3%
“sicNotDoGA”    I       42       42    -    -    0  100,0%    -       -        -       -
                II     168       76    87   0    5   93,8%  100,0%  100,0%   96,8%   97,0%
“aBolaDoGA”     I       70       70    -    -    0  100,0%    -       -        -       -
                II     280      175    96   4    5   93,8%  100,0%  100,0%   96,8%   97,0%
“sicDoGA”       I       44       44    -    -    0  100,0%    -       -        -       -
                II     176      130    38   8    0  100,0%   82,6%   94,2%   97,0%   95,5%
“qDoGA”         I       54       54    -    -    0  100,0%    -       -        -       -
                II     216      108    91  17    0  100,0%   84,3%   86,4%   92,7%   92,1%
“sicRadDoGA”    I      104      104    -    -    0  100,0%    -       -        -       -
                II     416      120   296   0    0  100,0%  100,0%  100,0%  100,0%  100,0%
“rtp1DoGA”      I      106      100    -    -    6   94,3%    -       -        -       -
                II     424      280   127  11    6   97,9%   92,0%   96,2%   97,1%   96,0%
Average         I      525      519    -    -    6   98,9%    -       -        -       -
                II    2100     1050   988  46   16   98,5%   95,6%   95,8%   97,1%   97,0%

Observing Table 5.11, the following comments are relevant:

1. Regarding the TV channel logo detection performance:
a. All logos included in the categories “Opaque & Static”, “Opaque & Dynamic (Texture)” and “Semitransparent & Static” were always correctly detected in all the sequences, demonstrating the robustness of the proposed algorithm.
b. As expected, there is no performance difference between static logos and logos that are dynamic in terms of texture, because only the detected edges are considered when analyzing the logo color variations. This is due, in the first place, to the fact that texture changes are usually not constant in time, i.e., they are intermittent. Second, they are not very abrupt, and the Canny edge detector’s parameters were selected to obtain a sensitivity level that avoids detecting those specific texture change events. So, these types of logos are effectively equivalent for the proposed algorithm.
c. The “Canal Q” and “SIC Radical” logos, characterized as semitransparent, were also detected in every shot. Transparency turns out not to be a problem per se because, as explained, the main issue to deal with is the existence of edges. Taking into account the characteristics of these specific logos, it is worthwhile to note:
i. The DoGA module was already expected to perform well on the “qDoGA” sequence, because at least the inner logo edges (with the letters “canal” and “Q”) should be


recognized, as the contrast between the colors of those letters and the surrounding color is clear. Even considering the possibility of a background with the same color as the outer purple window, the detection was still expected to happen due to the logo’s inner edges.
ii. The “SIC Radical” logo could be more problematic than the previous one, as it is white (or very close to it), which could result in a low color contrast with light backgrounds (e.g., light sky images); however, since this logo is larger than typical TV logos, it is more likely that some parts of its area are detected.
d. The poorest performance occurred with the video sequence “rtp1DoGA”, which is the only sequence without TPR = 100%. All the characteristics referred in Section 5.1 about the logo and the sequence itself (namely its size and content, chosen so that the algorithm could face the worst cases) justify the results. Figure 5.5 shows situations where the DoGA algorithm was not able to detect the “RTP1” channel logo. These examples deserve the following comments:
i. For the example on the left, the limits of the logo were not correctly detected due to the highly textured background (in fact, they are barely distinguishable visually); for the example on the right, it is also really difficult to visually identify the horizontal stripes of the logo, while for the “RTP1” characters the contrast between the logo and the background simply does not exist during the whole shot.
ii. Short shots (i.e., shots with less than 100 frames) may be penalized by the small amount of key frames retained for further analysis. As an example, for a video segment with fifty frames (corresponding to two seconds), only five key frames are retained; if the logo is not distinguishable from the background in this duration, the segment may be classified as not containing any logo. This means that a more adaptive key frame selection process should be implemented to improve the DoGA results (a minimal sketch of the current selection rule is given after Figure 5.5).

Figure 5.5 – Examples of situations where the DoGA algorithm failed with false negatives.
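The following sketch illustrates the key frame selection rule implied by the Table 5.10 parameters and the fifty-frame example above; the exact rule and the uniform spacing are assumptions, not the thesis implementation:

import math

def select_key_frames(n_frames, per_kf=0.10, seg_len_th=100, k=10):
    """Assumed rule: long shots (>= seg_len_th frames) keep K key frames,
    while shorter shots keep PerKF of their frames; uniform spacing assumed."""
    n_kf = k if n_frames >= seg_len_th else max(1, math.ceil(per_kf * n_frames))
    step = n_frames / n_kf
    return [int(i * step) for i in range(n_kf)]  # frame indices inside the shot

print(select_key_frames(50))  # -> [0, 10, 20, 30, 40]: 5 KFs for a 2 s shot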

2. Regarding the DoGs detection performance:
a. The events considered as positives include all the kinds of DoGs referred in Chapter 4, namely current time, live traffic/weather updates, horizontal bars with headlines and also sign language areas.
b. As the main goal of the DoGA algorithm is to detect TV channel logos, it is not worrisome to have a large amount of FNs, as long as they do not correspond to actual TV channel logos.
c. For the sequences “sicNotDoGA” and “rtp1DoGA”, some FNs occurred. Some of them are justified by a mismatch in the area considered as “corner”: the a priori classification took into account what a common viewer would consider a corner, which may not coincide with the exact regions analyzed by the algorithm; or, when it coincides, the amount of detected edges does not reach the minimum threshold defined in the algorithm as making the area worth analyzing, a fact that is not predictable when preparing the testing material. Also, all these cases


correspond to situations where the DoG is not an actual TV channel logo, which minimizes the impact of this issue. The lower right corner in Figure 5.6 (b) shows an example of this effect, as only a small part of the DoG with the name of the program (“Bom Dia Portugal”) and the current time is actually inside the corner area analyzed by the algorithm. As the woman doing sign language is constantly moving and there is running text in the blue bar, the DoG is naturally not detected.
d. Some other FN cases are justified by transitions like the one shown in the lower left corner of Figure 5.6 (a). The blue bar contains running text on its left: in Figure 5.6 (a), the text is “País” (which means “Country”); in Figure 5.6 (b), a changing effect is occurring; and in Figure 5.6 (c), the subject is “Economia” (which means “Economy”). When the SPMs Intersection step occurs (see Subsection 4.5.2), the result may not include an amount of edges large enough to detect the DoG as such.

(a) (b)

(c)
Figure 5.6 – Sequence of frames (in different shots) showing failures, notably false negatives in the lower left and lower right corners; a “RTP1” logo shape transition (in the stripes part) can also be observed in the upper left corner.
e. For the sequences “aBolaDoGA” and “qDoGA”, some FPs occurred. Figure 5.7 summarizes these cases: highly textured backgrounds cause an enormous amount of detected edges and, if a sufficient number of them remains constant, in terms of color characteristics, along the segments, then a FP may happen. This is an expected consequence of the proposed approach, and this video segment was selected to expose this weakness.


Figure 5.7 – Example of a situation where the algorithm fails by detecting false positives in the lower corners, due to highly textured background.

3. All the global metrics used to assess the quality of this approach (TPR, TNR, F1 and Acc) have average values above 95%.
4. DoGA run time – Table 5.12 shows the ratio between the DoGA run time and the actual duration of the sequences; on average, this ratio is 49,3%, which indicates that this task may be performed near real time.

Table 5.12 – Ratio between DoGA run time and duration of each video sequence.

Sequence     Duration (s)  Run Time (s)  Run Time Ratio
biggsDoGA        183            91           49,7%
hwdDoGA          157            92           58,6%
sicNotDoGA       242            98           40,5%
aBolaDoGA        445           182           40,9%
sicDoGA          286           141           49,3%
qDoGA            338           147           43,5%
sicRadDoGA       328           181           55,2%
rtp1DoGA         596           339           56,9%
AVERAGE         321,9         158,9          49,3%

5.3.3 Global Solution Assessment

In this subsection, the results obtained for the global solution (GS) assessment experiment are presented and analyzed; Table 5.13 shows the parameters used to test the algorithm.

Table 5.13 - Parameters for the GS assessment experiment.

Key Frames Selection parameters
  MinPixTh     10%
  StaticPixTh  20%
  SimPixTh     25%
  CMMax        15
  TAnalysis    1000 frames
  ThPersist    30%

1. Sequence “sicNotGS” – Table 5.14 shows the results obtained for this video test sequence, according to the Type A analysis. The results presented in the table’s first line used the shot cuts ground truth; in the second line, the shot cuts were detected using the developed SCD algorithm.


Table 5.14 - GS module: performance results for the sicNotGS video test sequence following the Type A analysis.

Shots         Total    TP    TN    FP   FN    TPR     TNR    Prec    F1     Acc
Detection     Frames
Ground Truth   6824   5597  1097  130    0   100,0%  89,4%  97,7%  98,9%  98,1%
SCD            6824   5598  1072  154    0   100,0%  87,4%  97,3%  98,6%  97,7%

The following conclusions are now in order:

• The difference between using the Ground Truth and the SCD algorithm is marginal. In fact, the values obtained for the assessment metrics show that, despite the number of video shots identified when using the SCD being different from the one defined in the Ground Truth, the effect of that difference is negligible.
• The greatest amount of FP occurrences is related to the incorrect association between static areas in a commercial and the DoGs in the DDB, showing that the DoGs Matching mechanism is not perfect and should be improved.
• All frames identified as being Regular Program were correctly identified as such, leading to TPR = 100%.
• The TNR is the only metric below the 90% mark, due to the false positive events referred above.
• Prec, F1 and Acc present values above 95%, confirming the potential of the proposed solution.

Table 5.15 presents the results obtained for the same video test sequence, considering the Type B analysis.

Table 5.15 - GS module: performance results for the sicNotGS video test sequence following the Type B analysis.

Shots         Total    TP    TN    FP   FN   TPR    TNR     Prec     F1     Acc
Detection     Frames
Ground Truth   6824   1097  5597    0  130  89,4%  100,0%  100,0%  94,4%  98,1%
SCD            6824   1072  5598    0  154  87,4%  100,0%  100,0%  93,3%  97,7%

About the table above:

• As stated for the Type A analysis, the difference between using the Ground Truth and the SCD is negligible (in the worst case, TPR, the difference is 2%).
• The results shown above are representative of the potential of the solution to detect commercial blocks - TNR, Prec and Acc present values equal to, or near, 100%.
• The F1 metric is penalized by the effect of TPR, which is the only metric below 90%. This is due to the misclassification of some commercials, reflecting the already mentioned weakness of the DoGs Matching mechanism.

2. Sequence “tviGS” – Table 5.16 shows the results obtained for this video test sequence, considering the Type A analysis.

Table 5.16 - GS module: performance results for the tviGS video test sequence following the Type A analysis.

Segments      Total    TP    TN    FP   FN   TPR    TNR    Prec    F1     Acc
Detection     Frames
Ground Truth   4215   2410  1597  117   91  96,4%  93,2%  95,4%  95,9%  95,1%
SCD            4215   2794  1164  192   65  97,7%  85,8%  93,6%  95,6%  93,9%

The following conclusions can be drawn:


• There is a difference of 7,4% between using the Ground Truth and the SCD algorithm in the specific case of the TNR metric. This fact is mainly due to the gap in the amount of events classified as TN in each run.
• The majority of FP occurrences is related to the incorrect association between static areas in a commercial and the DoGs in the DDB, reinforcing the belief that there is room to improve the DoGs Matching operation. The same mechanism is most likely responsible for the high number of FN cases.
• The Prec, F1 and Acc values are penalized by the number of FPs and FNs reported and explained above.

Table 5.17 presents the performance results for the “tviGS” video test sequence, considering the Type B analysis.

Table 5.17 - GS module: performance results for the tviGS video test sequence following the Type B analysis.

Segments      Total    TP    TN    FP   FN   TPR    TNR    Prec    F1     Acc
Detection     Frames
Ground Truth   4215   1597  2410   91  117  93,2%  96,4%  94,6%  93,9%  95,1%
SCD            4215   1164  2794   65  192  85,8%  97,7%  94,7%  90,1%  93,9%

About the table above:

• The same 7,4% difference between using the Ground Truth and the SCD, identified in Table 5.16, is now reflected in the TPR metric.
• Prec, F1 and Acc present values above 93,9% in almost all cases; the only exception is F1 when using the SCD as video shot detector, which is a consequence of the lower TPR result.

3. Sequence “rtpGS” – Table 5.18 shows the results obtained for this video test sequence, regarding the Type A analysis.

Table 5.18 - GS module: performance results for the rtpGS video test sequence following the Type A analysis.

Segments      Total     TP     TN    FP   FN    TPR     TNR    Prec    F1     Acc
Detection     Frames
Ground Truth  17708   11040   6281  387    0   100,0%  94,2%  96,6%  98,3%  97,8%
SCD           17708   11040   6137  531    0   100,0%  92,0%  95,4%  97,7%  97,0%

The following conclusions can be made:

• As for the “sicNotGS” video sequence, the difference between using the Ground Truth and the SCD algorithm is negligible.
• Also as before, the majority of FP occurrences is related to the incorrect association between static areas in a commercial and the DoGs in the DDB, a problem already reported for the previous sequences.
• The number of FN events is zero, meaning that the TV channel logo was correctly identified every time it occurred, contributing to the final correct classification.

Table 5.19 presents the performance results obtained for the same video test sequence, taking into account the Type B analysis.


Table 5.19 - GS module: performance results for the rtpGS video test sequence following the Type B analysis.

Segments      Total     TP     TN     FP   FN   TPR    TNR     Prec     F1     Acc
Detection     Frames
Ground Truth  17708    6281  11040    0   387  94,2%  100,0%  100,0%  97,0%  97,8%
SCD           17708    6137  11040    0   531  92,0%  100,0%  100,0%  95,9%  97,0%

The following comments about the table above are in order:

• The difference between using the Ground Truth and the SCD is not significant.
• The TNR, Prec and Acc metrics, with results above 95,9%, are promising outcomes.
• The amount of FNs is likely a consequence of the DoGs Matching mechanism, as pointed out for the previous sequences.

4. Run time ratio – Table 5.20 shows the run time ratio, i.e., the ratio between run time and the actual duration of each sequence, and the average value obtained taking into account the three sequences tested.

Table 5.20 - Ratio between GS run time and duration of each video sequence.

Sequence   Shots Definition  Duration (s)  Run Time (s)  Run Time Ratio
sicNotGS   Ground Truth          273           168           61,5%
           SCD                   273           141           51,6%
tviGS      Ground Truth          168            99           58,9%
           SCD                   168           112           66,7%
rtpGS      Ground Truth          708           498           70,3%
           SCD                   708           516           72,3%
AVERAGE    Ground Truth          383           255           66,6%
           SCD                   383          256,3          66,9%

There are two main conclusions that can be drawn from the run time analysis. First, on average, analyzing each sequence takes about two thirds of its duration. Second, it is interesting to observe that there is no difference, concerning run time, between the tests using the ground truth and those using the SCD. The main point that could make a difference in this analysis is the number of shots to analyze in each case, which is always larger when using the ground truth (as it includes all types of shot transitions, whereas the SCD only detects hard cuts); the results show no effect due to this issue.


Chapter 6 Summary and Future Work

This final chapter presents a summary of the developed work and some suggestions for future work.

6.1 Summary

The main objective of this Thesis was to design a tool capable of detecting commercial blocks in broadcast TV content.

Chapter 1 serves as the introduction to this Thesis, notably by providing important contextual information about the importance of TV advertising commercials as a tool to efficiently achieve companies’ most relevant marketing goals. The amount of money involved in advertising is enormous, notably when associated with extremely popular events, such as the Olympic Games or the Super Bowl in the USA. Advertising has a major impact, first of all, on the broadcasters’ budget, as the revenue from advertisers is crucial for their finances, especially for private channels; moreover, the investing companies themselves expect high sales revenues as a result of their marketing actions. On the other hand, a significant number of viewers would like to eliminate the transmitted commercials from their TV programs. Other stakeholders are the regulators, i.e., the entities responsible for controlling the duration of commercial blocks, whose work is greatly simplified by automatic tools targeting their control variables.

Chapter 2 starts by presenting the current legal framework, not only in Portugal but also in the EU, as it implies a set of advertising conditions that must be respected by the broadcasters. The typical structure of a commercial block is then introduced and, finally, as TV advertising content has several specific characteristics, the relevant features are identified and described. There are two main types of characteristics: i) intrinsic characteristics, related to the video content itself, such as high scene cut rates, text presence, audio jingles and audio level; and ii) extrinsic characteristics, associated with external factors, such as the presence, or not, of the TV channel logo on the screen during the commercial, the potential existence of black frames separating different commercials, and commercial block separators included by the broadcaster, among others.

Chapter 3 reviews several TV commercial detection schemes, classified according to the approach they follow: i) knowledge-based detection methods, which rely on the intrinsic and/or extrinsic characteristics; and ii) repetition-based detection methods, which rely on the fact that each TV commercial may be repeated several times over time. The main conclusion regarding the differences between these two approaches is that knowledge-based detection methods may become outdated more easily than repetition-based detection methods; in fact, they rely heavily on the specific characteristics of advertising content and thus depend on the strategies that both broadcasters and advertisers decide to adopt. On the other hand, repetition-based methods are much more computationally expensive, which implies that the performance boost may not always justify the complexity gap between the two types of approaches.


Chapter 4 presents the proposed solution, which takes into account the observations made on real broadcast TV content and the constraints assumed to be important, namely the current legal framework and the fact that some intrinsic and extrinsic characteristics are no longer reliable (even if they were used in the past). Analyzing the real content, it was concluded that TV logo detection methods have not been fully explored, as the solutions related to this extrinsic characteristic tend to be too simplistic. In fact, they do not consider that current TV content is full of Digital on-Screen Graphics (DoGs), making it impossible to guarantee, even for the same broadcaster, (i) that the TV channel logo is always in the same corner and (ii) what each DoG really represents (e.g., TV channel logo, commercial logo or other). In this context, a solution based on DoGs acquisition and analysis is proposed; this solution adopts a DoGs classification mechanism and mainly targets the detection of TV channel logos and their distinction from other types of DoGs.

Chapter 5 presents the performance assessment of the proposed solution, divided into three main areas: Shot Change Detection assessment, DoGs Acquisition assessment and, finally, the complete Global Solution assessment. After describing the test material, the assessment methodology and the relevant parameters, the actual performance is presented and analyzed, notably by underlining the special cases for each sequence and for each assessment experiment. Both Shot Change Detection and DoGs Detection present good results, showing the effectiveness of the developed solutions in overcoming the problems they intend to address. The Global Solution results are also satisfying, although more tests should be performed to obtain a more solid and exhaustive assessment, allowing conclusions with higher confidence. It is worth recalling that it is very difficult to have good and complete test material covering all possible and relevant situations, as TV content is highly dynamic and there is always something new every day.

Three main strengths can be identified in the solution proposed in this Thesis. First, the approach followed to detect DoGs (even those with poorly defined boundaries and likely to be confused with a similar background) is not computationally expensive, and the results show the validity of the process. Second, to the knowledge of this Thesis’ author, this is the first work that deals with the TV logo detection problem as a particular case of the more generic DoGs detection problem. Third, both the Basic and Advanced Solutions for the DoGs Database Management are based on real and recent observations about the way broadcasters and advertisers are using DoGs. It is also important to note that the proposed solution may be implemented in real time (as the run time of the global solution is about one third lower than the actual media time), which is mostly due to the fact that the DoGs acquisition process only depends, at most, on a tenth of the total number of frames analyzed. Also, the proposed system does not need any previously built DoGs database, which is an important advantage.
On the other hand, some processes have been identified as having potential for improvement; e.g., the fact that the DoGs acquisition algorithm depends on static pixels makes it hard to properly detect very dynamic logos, although this type of TV channel logo is not common at all. Another negative point in terms of DoGs acquisition is its sensitivity to pixel-level variations in the position of the logo on the screen, even though the dilation operation performed in the Color Analysis of Edge Pixels step is able to prevent some of those cases.

It is very difficult to compare the obtained results with the state-of-the-art work, as there are no standardized and well-established datasets and procedures for authors to assess their work; in fact, there is a diversity of datasets, test conditions and metrics used to assess the quality and performance of each solution. For instance, in Albiol et al. [28] the performance is measured at the second level, while some works (e.g., Glasberg et al. [25]) make the assessment at the commercial level, and some others, this Thesis included, are assessed at the frame level,

which is much more strict. On another level, a comparison with works like the one from Sadlier et al. [24], reporting a precision of 100% and an accuracy of about 92%, would not be fair, as it assumes the existence of black frames between commercials, an assumption that is no longer valid. Nevertheless, Albiol et al. [28] and Glasberg et al. [25] report, as a source of error in their experiments, the existence of logos in some of the commercials used to test their solutions; as widely explained, the solution presented in this Thesis seeks to overcome this problem. The comparison with the repetition-based methods presented in Chapter 3 is not fair either, as they rely on different assumptions; still, the results obtained in this Thesis are slightly better than those of Lienhart et al. [13] (with the proposed solution being significantly faster), similar to those obtained by Gauch and Shivadas in [48] but, on average, below the values achieved by Li et al. in [34], which are near a perfect score.

6.2 Future Work

Despite the good results obtained, showing that the proposed solution could be used as the algorithmic core of a commercial block detection application, some issues deserve more attention to further improve the performance, namely:
• improve the hard cut detection algorithm to reduce the amount of FNs;
• extend the cut detection algorithm to also consider soft transitions, such as fade-ins, fade-outs, dissolves and other soft video transitions;
• improve the key frames selection criterion used in the DoGs Acquisition process;
• improve the DAA for the cases of purely dynamic TV channel logos, e.g., by considering tracking techniques;
• improve the comparison mechanism between the acquired DoGs and the DoGs in the DDB, as the current proposal is too rigid and may fail to match DoGs that are similar but whose intersection is not considered good enough;
• record more video data, and assess the Advanced Solution proposed for the DB management, as it is more powerful than the Basic Solution.

All these matters are relevant points to consider, as their implementation could bring better results in future applications.


Appendix A DoGs Detection – Example of the complete process

This appendix includes a demonstration of a video test sequence going through the proposed DoGs detection algorithm, in order to provide better insight into this process.

1. Video test sequence characterization

The video sequence is called “rtp1_example” and can be accessed in [62]; it was extracted from the Portuguese public TV channel “RTP1”, whose TV channel logo is referred to in Chapters 4 and 5. Table A.1 characterizes this video sequence in terms of number of shots and shot length, the latter measured in number of frames and in seconds.

Table A.1 - Characterization of "rtp1_example" video test sequence.

Segment   # Frames   Time (s)
Shot 1        46        1,8
Shot 2       159        6,4
Shot 3       140        5,6
Shot 4       125        5,0
Shot 5        27        1,1
TOTAL        497       19,9

2. Screenshots extracted from each shot

Figure A.1 to Figure A.5 are screenshots extracted from each video shot (each shot corresponding to one video segment) of the video test sequence under analysis; they are shown in order to provide context for the results presented in the next sections.

Figure A.1 - Screenshot extracted from the first shot of “rtp1_example” video test sequence.


Figure A.2 - Screenshot extracted from the second shot of “rtp1_example” video test sequence.

Figure A.3 - Screenshot extracted from the third shot of “rtp1_example” video test sequence.

Figure A.4 - Screenshot extracted from the fourth shot of “rtp1_example” video test sequence.

Figure A.5 - Screenshot extracted from the fifth shot of “rtp1_example” video test sequence.

It is important to notice that, in Figure A.5, the dynamic part of the TV channel logo is changing its shape, through a process that started at the end of the previous shot (represented in Figure A.4).


3. Key Frames Edge Fusion step

Key Frames (KF) Edge Fusion is the step responsible for performing the ‘OR’ operation between all the KF edge maps inside each video segment (in this case, each video segment corresponds to a video shot); Figure A.6 shows the output obtained when applying this process to the considered video test sequence (a minimal code sketch of this operation is given at the end of this section).

(a) Shot 1 (b) Shot 2 (c) Shot 3

(d) Shot 4 (e) Shot 5 Figure A.6 - Key Frames Edges Fusion step output obtained for each shot of the video sequence “rtp1_example”.

It is possible to see that the TV channel logo is better defined in shots 1, 2 and 3 than in shots 4 and 5. This is due to the fact that, as referred before, the shape change of the dynamic part of the logo occurs in the transition between those shots; in the specific case of shot 5, the camera movement must also be considered, making the edge detection output of this shot the most confusing among the five shots. Figure A.7 presents the KF edge maps of shot 5, which helps to better understand Figure A.6 (e).

(a) KF 1 (b) KF 2 (c) KF 3 Figure A.7 - Key Frames edges maps that represent the fifth shot of the sequence.
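As referenced above, a minimal sketch of the fusion operation follows; the array sizes and data are placeholders, not the thesis implementation:

import numpy as np

# `edge_maps` stands for the binary Canny maps of one shot's key frames;
# random placeholders are used here instead of real frames.
edge_maps = [np.random.rand(576, 720) > 0.9 for _ in range(5)]
fused = np.logical_or.reduce(edge_maps)  # one fused edge map per shot (Figure A.6)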


4. SPMs Intersection step

Figure A.8 presents the resulting intersection of the SPMs obtained for the “RTP1” TV channel logo along the analyzed video test sequence; this is the final edge map obtained before classifying it as a “New DoG Detected”.

Figure A.8 - SPMs Intersection output for the sequence "rtp1_example".
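A minimal sketch of this intersection, under the same placeholder assumptions as the previous one:

import numpy as np

# `spms` stands for the stable pixel maps of the same corner obtained in
# consecutive shots; random placeholders replace the real maps here.
spms = [np.random.rand(120, 220) > 0.5 for _ in range(5)]
persistent = np.logical_and.reduce(spms)  # edges surviving all shots (Figure A.8)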

5. Color Map of the detected DoG

Figure A.9 presents the color map, in YCrCb, that is saved as a reference for the pixels included in the SPMs intersection output shown in Figure A.8. As only the chrominance components are kept in the DDB, a fixed luminance value, Y = 134, was assigned to those pixels in order to generate this image.

Figure A.9 - Color map of the detected DoG.
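For illustration, a possible way to regenerate such an image from stored chrominance planes is sketched below; all array contents and sizes are placeholders, not values from the DDB:

import cv2
import numpy as np

h, w = 120, 220                            # illustrative corner size
cr = np.full((h, w), 150, dtype=np.uint8)  # placeholder Cr plane
cb = np.full((h, w), 100, dtype=np.uint8)  # placeholder Cb plane
y  = np.full((h, w), 134, dtype=np.uint8)  # fixed luminance used in Figure A.9
bgr = cv2.cvtColor(cv2.merge([y, cr, cb]), cv2.COLOR_YCrCb2BGR)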

6. Heat Map

Figure A.10 shows a heat map representing the chrominance variance of the pixels belonging to the Key Frames Edge Fusion map of shot 5.

Figure A.10 – Heat map corresponding to the Key Frames Edge Fusion map of shot 5.


Appendix B SPMs Intersection step results

Figure B.1 presents the “best case” binary maps resulting from the SPMs Intersection operation for the sequences used in the DoG Acquisition Algorithm Assessment.

(a) Biggs (b) Hollywood

(c) SIC Notícias (d) A Bola

(e) Canal Q (f) SIC

(g) SIC Radical (h) RTP1
Figure B.1 – Some results from the SPMs Intersection step, for each TV channel tested in the DoG Acquisition Algorithm Assessment.


Bibliography

[1] “Super Bowl advertising.” [Online]. Available: https://en.wikipedia.org/wiki/Super_Bowl_advertising. [Accessed: 18-Sep-2015].

[2] “Wikipedia - Commercial skipping.” [Online]. Available: https://en.wikipedia.org/wiki/Commercial_skipping. [Accessed: 18-Sep-2015].

[3] “Windows Media Center.” [Online]. Available: http://windows.microsoft.com/pt-pt/windows/products/windows-media-center.

[4] “SageTV.” [Online]. Available: http://sagetv.com/. [Accessed: 13-Sep-2015].

[5] “MythTV.” [Online]. Available: https://www.mythtv.org/. [Accessed: 13-Sep-2015].

[6] “Directive 2007/65/EC of the European Parliament and of the Council of 11 December 2007 amending Council Directive 89/552/EC (...) concerning the pursuit of television broadcasting activities.” [Online]. Available: http://eur-lex.europa.eu/legal-content/EN/ALL/?uri=CELEX%3A32007L0065. [Accessed: 14-Oct-2016].

[7] “Directive 2010/13/EU of the European Parliament and of the Council of 10 March 2010 on the coordination of certain provisions laid down by law, regulation or administrative action in Member States concerning the provision of audiovisual media services.” [Online]. Available: http://eur-lex.europa.eu/legal-content/EN/ALL/?uri=CELEX%3A32010L0013. [Accessed: 14-Oct-2016].

[8] Decreto do Presidente da República no 45/2011 de 11 de Abril, Pub. L. No. Diário da República, 1a Série – No 71 (2011), 2011.

[9] C. Colombo, A. Del Bimbo, and P. Pala, “Retrieval of commercials by semantic content: The semiotic perspective,” Multimed. Tools Appl., vol. 13, no. 1, pp. 93–118, 2001.

[10] X. S. Hua, L. Lu, and H. J. Zhang, “Robust learning-based TV commercial detection,” IEEE Int. Conf. Multimed. Expo, pp. 149–152, 2005.

[11] J. Chen, J. Yeh, W. Chu, J. Kuo, and J. Wu, “Improvement of commercial boundary detection using audiovisual features,” Adv. Multimed. Inf. Process., vol. 3767, pp. 776–786, 2005.

[12] L. Duan, J. Wang, Y. Zheng, J. S. Jin, H. Lu, and C. Xu, “Segmentation, categorization, and identification of commercial clips from TV streams using multimodal analysis,” in Proceedings of the 14th annual ACM international conference on Multimedia, pp. 201–210. 2006.

[13] R. Lienhart, C. Kuhmunch, and W. Effelsberg, “On the detection and recognition of television commercials,” in Proceedings of IEEE International Conference on Multimedia Computing and Systems, pp. 509–516, 1997.

[14] A. G. Hauptmann and M. J. Witbrock, “Story segmentation and detection of commercials in broadcast news video,” in Proceedings IEEE International Forum on Research and Technology Advances in Digital Libraries, 1998.

[15] “‘The Scientist’ videoclip, by Coldplay.” [Online]. Available: https://abigailelsdena2.wordpress.com/2012/09/02/coldplay-the-scientist/. [Accessed: 26-Aug-2015].


[16] “Introduction to Multi-track editing.” [Online]. Available: http://en.flossmanuals.net/how-to-use-video-editing-software/multitrack-editing/. [Accessed: 26-Aug-2015].

[17] N. Dimitrova, L. Agnihotri, and G. Wei, “Video classification based on HMM using text and faces,” in EUSIPCO, 2000.

[18] D. Chen, J. M. Odobez, and H. Bourlard, “Text detection and recognition in images and video frames,” Pattern Recognit., vol. 37, no. 3, pp. 595–608, 2004.

[19] U. Gargi, D. Crandall, S. Antani, T. Gandhi, R. Keener, and R. Kasturi, “A system for automatic text detection in video,” Proc. Fifth Int. Conf. Doc. Anal. Recognition. ICDAR ’99 (Cat. No.PR00318), 1999.

[20] C. Wolf and J. M. Jolion, “Model based text detection in images and videos: a learning approach,” pp. 1–24, 2004.

[21] N. Venkatesh and M. Girish Chandra, “Commercial break detection and content based video retrieval,” Lect. Notes Electr. Eng., vol. 68, no. 96, pp. 481–493, 2010.

[22] L. Lu, H. J. Zhang, and S. Z. Li, “Content-based audio classification and segmentation by using Support Vector Machines,” Multimed. Syst., vol. 8, no. 6, pp. 482–492, 2003.

[23] B. Satterwhite and O. Marques, “Automatic detection of TV commercials,” IEEE Potentials, vol. 23, no. 2, pp. 9–12, 2004.

[24] D. A. Sadlier, S. Marlow, N. O’Connor, and N. Murphy, “Automatic TV advertisement detection from MPEG bitstream,” in International Conference on Enterprise Information Systems, vol. 35, no. 12, pp. 2719–2726, 2002.

[25] R. Glasberg, C. Tas, and T. Sikora, “Recognizing commercials in real-time using three visual descriptors and a decision-tree,” IEEE Int. Conf. Multimed. Expo, pp. 1481–1484, 2006.

[26] R. Glasberg, S. Schmiedeke, M. Mocigemba, and T. Sikora, “New real-time approaches for video- genre-classification using high-level descriptors and a set of classifiers,” Proc. - IEEE Int. Conf. Semant. Comput. 2008, ICSC 2008, pp. 120–127, 2008.

[27] K. Schoffmann, M. Lux, and L. Boeszoermenyi, “A novel approach for fast and accurate commercial detection in H. 264/AVC bit streams based on logo identification,” Proc. 15th Int. Multimed. Model. Conf., pp. 119–127, 2009.

[28] A. Albiol, M. J. Fullà, A. Albiol, and L. Torres, “Detection of TV commercials,” in IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 3, pp. 541–544, 2004.

[29] M. Covell, S. Baluja, and M. Fink, “Advertisement detection and replacement using acoustic and visual repetition,” 2006 IEEE 8th Work. Multimed. Signal Process. MMSP 2006, pp. 461–466, 2007.

[30] S. A. Berrani, G. Manson, and P. Lechat, “A non-supervised approach for repeated sequence detection in TV broadcast streams,” Signal Process. Image Commun., vol. 23, no. 7, pp. 525–537, 2008.

[31] I. Döhring and R. Lienhart, “Mining TV Broadcasts 24/7 for recurring video sequences,” Stud. Comput. Intell., vol. 287, pp. 327–356, 2010.

[32] C. Herley, “ARGOS: Automatically extracting repeating objects from multimedia streams,” IEEE Trans. Multimed., vol. 8, no. 1, pp. 115–129, 2006.

[33] X. Wu and S. Satoh, “Ultrahigh-speed TV commercial detection, extraction, and matching,” IEEE Trans. Circuits Syst. Video Technol., vol. 23, no. 6, pp. 1054–1069, 2013.


[34] Y. Li, D. Zhang, X. Zhou, and J. S. Jin, “A confidence based recognition system for TV commercial extraction,” Conf. Res. Pract. Inf. Technol. Ser., vol. 75, pp. 57–64, 2008.

[35] S. Marlow, D. A. Sadlier, K. McGeough, N. O’Connor, and N. Murphy, “Audio and video processing for automatic TV advertisement detection,” in Irish Signals and Systems Conference, pp. 25–27, 2005.

[36] J. S. Boreczky and L. A. Rowe, “Comparison of video shot boundary detection techniques,” J. Electron. Imaging, vol. 5, no. 2, p. 122, 1996.

[37] Z. Feng and J. Neumann, “Real time commercial detection in videos,” Comcast Labs, 2013.

[38] R. Zabih, J. Miller, and K. Mai, “A feature-based algorithm for detecting and classifying scene breaks,” in Proceedings of the third ACM international conference on Multimedia, vol. 95, pp. 189– 200, 1995.

[39] L. R. Rabiner and B. H. Juang, “An introduction to hidden Markov models,” IEEE ASSP Mag., vol. 3, no. 1, pp. 4–16, 1986.

[40] E. Esen, M. Soysal, T. K. Ateş, A. Saracoǧlu, and A. A. Alatan, “A fast method for animated TV logo detection,” in 2008 International Workshop on Content-Based Multimedia Indexing (CBMI 2008), pp. 236–241, 2008.

[41] E. Mikhail and D. Vatolin, “Automatic logo removal for semitransparent and animated logos,” in Proceedings of GraphiCon, 2011.

[42] N. Özay and B. Sankur, “Automatic TV logo detection and classification in broadcast videos,” in European Signal Processing Conference (EUSIPCO), pp. 839–843, 2009.

[43] C. Panagiotakis and G. Tziritas, “A speech/music discriminator based on RMS and zero- crossings,” IEEE Trans. Multimed., vol. 7, no. 1, pp. 155–166, 2005.

[44] C. Cortes and V. Vapnik, “Support-vector networks,” Machine Learning, vol. 20, no. 3, pp. 273–297, 1995.

[45] V. N. Vapnik, “An overview of statistical learning theory,” IEEE Trans. Neural Netw., vol. 10, no. 5, pp. 988–999, 1999.

[46] H. Jiang, H. J. Zhang, and T. Lin, “Video segmentation with the support of audio segmentation and classification,” in IEEE International Conference on Multimedia and Expo, 2000.

[47] M. Li, C. Yong, W. Min, and L. Yuanxing, “TV commercial detection based on shot change and text extraction”, in Proceedings of the 2009 2nd International Congress on Image and Signal Processing, CISP’09, 2009.

[48] J. M. Gauch and A. Shivadas, “Identification of new commercials using repeated video sequence detection,” in IEEE International Conference on Image Processing 2005, vol. 3, pp. II–1252, 2005.

[49] Jornal “Público” Online. [Online]. Available: https://www.publico.pt/tecnologia/noticia/a-partir-de-junho-a-publicidade-na-tv-deixa-de-subir-o-volume-1725736. [Accessed: 08-Oct-2016].

[50] “Branding with Bugs.” [Online]. Available: http://www.videomaker.com/article/c3/14602-branding-with-bugs. [Accessed: 08-Oct-2016].

[51] “Digital on-Screen Graphics.” [Online]. Available: https://en.wikipedia.org/wiki/Digital_on-screen_graphic.

[52] “Improving Brand Recognition in TV Ads,” 2010. [Online]. Available: http://hbswk.hbs.edu/item/improving-brand-recognition-in-tv-ads. [Accessed: 08-Oct-2016].


[53] “Lamborghini Commercial.” [Online]. Available: https://www.youtube.com/watch?v=Xd0Ok-MkqoE. [Accessed: 08-Oct-2016].

[54] R. Lienhart, “Comparison of automatic shot boundary detection algorithms,” in Proc. of Storage and Retrieval for Image and Video Databases, 1998.

[55] R. B. Allen., P. England, and A. Dailianas, “Comparison of automatic video segmentation algorithms,” in SPIE Photonics West, 1996.

[56] Q. Zhang and R. L. Canosa, “A Comparison of Histogram Distance Metrics for Content-Based Image Retrieval,” in Proc. of Imaging and Multimedia Analytics in a Web and Mobile World, 2014.

[57] J. Canny, “A Computational Approach to Edge Detection,” IEEE Trans. Pattern Anal. Mach. Intell., vol. PAMI-8, no. 6, pp. 679–698, 1986.

[58] “‘rtp1_demo1’ video sequence.” [Online]. Available: https://www.dropbox.com/s/9rlzlcsajgad9dz/rtp1_demo1.mp4?dl=0. [Accessed: 08-Oct-2016].

[59] “Shot Change Detection Assessment Dataset.” [Online]. Available: https://www.dropbox.com/sh/64mbtob2zgby3j6/AACAD-wHwCffLoFyw3zoEKoka?dl=0. [Accessed: 08-Oct-2016].

[60] “DoG Acquisition Assessment Dataset.” [Online]. Available: https://www.dropbox.com/sh/2hnhjyb9ld55mk4/AAAZuucfALVyyZYSV-sj8ZE8a?dl=0. [Accessed: 08-Oct-2016].

[61] “Global Solution for Detecting Commercials Assessment Dataset.” [Online]. Available: https://www.dropbox.com/sh/urmly4b8yjm1tgk/AACk0CI6NopxelKglA-momw7a?dl=0. [Accessed: 08-Oct-2016].

[62] “‘rtp1_example’ video test sequence.” [Online]. Available: https://www.dropbox.com/s/mvtco2rd63v9chp/rtp1_example.mp4?dl=0. [Accessed: 08-Oct-2016].

83