3. Visual Attention

Brunel UNIVERSITY WEST LONDON INTELLIGENT IMAGE CROPPING AND SCALING A thesis submitted for the degree of Doctor of Philosophy by Joerg Deigmoeller School of Engineering and Design th 24 January 2011 1 Abstract Nowadays, there exist a huge number of end devices with different screen properties for watching television content, which is either broadcasted or transmitted over the internet. To allow best viewing conditions on each of these devices, different image formats have to be provided by the broadcaster. Producing content for every single format is, however, not applicable by the broadcaster as it is much too laborious and costly. The most obvious solution for providing multiple image formats is to produce one high- resolution format and prepare formats of lower resolution from this. One possibility to do this is to simply scale video images to the resolution of the target image format. Two significant drawbacks are the loss of image details through downscaling and possibly unused image areas due to letter- or pillarboxes. A preferable solution is to find the contextual most important region in the high-resolution format at first and crop this area with an aspect ratio of the target image format afterwards. On the other hand, defining the contextual most important region manually is very time consuming. Trying to apply that to live productions would be nearly impossible. Therefore, some approaches exist that automatically define cropping areas. To do so, they extract visual features, like moving areas in a video, and define regions of interest (ROIs) based on those. ROIs are finally used to define an enclosing cropping area. The extraction of features is done without any knowledge about the type of content. Hence, these approaches are not able to distinguish between features that might be important in a given context and those that are not. The work presented within this thesis tackles the problem of extracting visual features based on prior knowledge about the content. Such knowledge is fed into the system in form of metadata that is available from TV production environments. Based on the extracted features, ROIs are then defined and filtered dependent on the analysed content. As proof-of-concept, this application finally adapts SDTV (Standard Definition Television) sports productions automatically to image formats with lower resolution through intelligent cropping and scaling. If no content information is available, the system can still be applied on any type of content through a default mode. The presented approach is based on the principle of a plug-in system. Each plug-in represents a method for analysing video content information, either on a low level by extracting image features or on a higher level by processing extracted ROIs. The 2 combination of plug-ins is determined by the incoming descriptive production metadata and hence can be adapted to each type of sport individually. The application has been comprehensively evaluated by comparing the results of the system against alternative cropping methods. This evaluation utilised videos which were manually cropped by a professional video editor, statically cropped videos and simply scaled, non-cropped videos. In addition to and apart from purely subjective evaluations, the gaze positions of subjects watching sports videos have been measured and compared to the regions of interest positions extracted by the system. 3 Acknowledgment First and foremost, in memory of a great person and colleague, I would like to thank my advisor at IRT, Dipl.-Ing. Gerhard Stoll. He always believed in me and gave me great support, even in busy times. Special thanks to my supervisor at Brunel University, Dr Takebumi Itagaki, for his valuable guidance and advice. I wish additionally to thank Dipl.-Inf. Norbert Just, Dipl.-Ing. Oliver Bartholmes, Dipl.- Ing. Tim Weiß and all those people who helped along the way by contributing to this work. Special thanks go to Dipl.-Ing. Ralf Neudel and Dipl.-Inf. Matthias Laabs for the great time in and outside of IRT, as well as for their support in various ways. Last but not least, thanks to my friends and family for always encouraging me. 4 Table of Contents 1. Introduction .............................................................................................................. 16 1.1 Motivation ........................................................................................................ 16 1.2 Outline .............................................................................................................. 17 2. Background on Computer Vision ............................................................................ 19 2.1 Image Formation in Digital Television ............................................................ 19 2.1.1 Studio colour encoding ............................................................................. 20 2.1.2 Studio image resolutions ........................................................................... 21 2.1.3 Pixel aspect ratios ...................................................................................... 22 2.2 Image Enhancement ......................................................................................... 22 2.2.1 Point processing ........................................................................................ 23 2.2.2 Linear filters .............................................................................................. 24 2.2.3 Non-linear filters ....................................................................................... 25 2.3 Geometric Operations and Interpolation .......................................................... 28 2.3.1 Affine transformations .............................................................................. 28 2.3.2 Interpolation operations ............................................................................ 30 2.4 2D Fourier Transform....................................................................................... 33 2.5 Motion Estimation ............................................................................................ 35 2.5.1 Local motion estimation ............................................................................ 35 2.5.2 Global motion estimation .......................................................................... 40 2.6 Segmentation .................................................................................................... 47 2.6.1 Thresholding ............................................................................................. 47 2.6.2 Region labelling with flood fill ................................................................. 48 2.6.3 Edge linking .............................................................................................. 48 2.6.4 Clustering .................................................................................................. 50 2.7 Classification .................................................................................................... 52 2.7.1 Bayes classifier ......................................................................................... 52 5 2.7.2 Nearest neighbour classifier and linear classifiers .................................... 55 2.8 Summary .......................................................................................................... 56 3. Visual Attention ....................................................................................................... 57 3.1 Bottom-Up and Top-Down Attention .............................................................. 57 3.2 Visual Search .................................................................................................... 58 3.3 Feature-Integration Theory of Attention (FIT) ................................................. 59 3.3.1 Feature maps ............................................................................................. 60 3.3.2 Object detection ........................................................................................ 61 3.4 Guided Search .................................................................................................. 61 3.5 Visual Attention Systems ................................................................................. 61 3.5.1 Visual-attention system by Itti, Koch and Niebur ..................................... 61 3.5.2 Spectral Residual ....................................................................................... 65 3.6 Summary .......................................................................................................... 67 4. Image Composition .................................................................................................. 68 4.1 Composition Makes Order Out of Confusion .................................................. 68 4.2 Positioning Objects in an Image ....................................................................... 69 4.3 Depth of Field and Motion Blur ....................................................................... 70 4.4 Composition Guidelines for European Public Broadcasters ............................ 70 4.4.1 Safe areas .................................................................................................. 70 4.4.2 Scanned image areas ................................................................................. 71 4.5 Summary .......................................................................................................... 73 5. State of the Art ROI Extraction

3. Visual Attention

241E1SB/00 Philips LCD Monitor with Smarttouch Controls

190V1SB/00 Philips LCD Widescreen Monitor

WFM7200 Multiformat Multistandard Waveform Monitor

Panasonic BT-LH1710 Operating Instructions

Camera Basics, Principles and Techniques –MCD 401 VU Topic No

Computational Complexity Optimization on H.264 Scalable/Multiview Video Coding

Video Data Processing Method and Device for Display Adaptive Video Playback

Digital Image Prep for Displays

(12) United States Patent (10) Patent No.: US 7,257,776 B2 Bailey Et Al

The Art of Sound Reproduction the Art of Sound Reproduction

Property Name Definition / Semantics Ebucore Ebucore Implementation

(Tech 3264) to Ebu-Tt Subtitle Files