Providing Video Annotations in Multimedia Containers for Visualization and Research

Julius Schöning, Patrick Faion, Gunther Heidemann, Ulf Krumnack
Institute of Cognitive Science, Osnabrück University, Germany
{juschoening, pfaion, gheidema, krumnack}@uos.de

Abstract

There is an ever increasing amount of video data sets which comprise additional metadata, such as object labels, tagged events, or gaze data. Unfortunately, metadata are usually stored in separate files in custom-made data formats, which reduces accessibility even for experts and makes the data inaccessible for non-experts. Consequently, we still lack interfaces for many common use cases, such as visualization, streaming, data analysis, machine learning, high-level understanding, and semantic web integration. To bridge this gap, we want to promote the use of existing multimedia container formats to establish a standardized method of incorporating content and metadata. This will facilitate visualization in standard multimedia players, streaming via the Internet, and easy use without conversion, as shown in the attached demonstration video and files. In two prototype implementations, we embed object labels, gaze data from eye tracking, and the corresponding video into a single multimedia container and visualize these data using a media player. Based on these prototypes, we discuss the benefit of our approach as a possible standard. Finally, we argue for the inclusion of MPEG-7 in multimedia containers as a further improvement.

1. Introduction

Metadata of video files are still commonly stored in separate files using custom file formats. Metadata such as object annotations and segmentation information, as well as high-level information like gaze trajectories or behavior data of subjects, are stored next to the video in a single or multiple files.
The data structures of these files are mostly customized, sometimes unique, and they are stored in a diversity of formats, e.g., plain text, XML, MATLAB format, or binary. Accessing and visualizing these data requires special tools. For the general audience, the use of these data is quite impossible.

So, why not encapsulate these additional data in a standard container format? This has become common practice for storing text plus metadata, e.g., in the PDF container. State of the art video containers like the open container format [38] (OGG), MPEG-4 [12], or the Matroška container format [16] (MKV) can encapsulate video and metadata such that they can be interpreted as a single file by standard multimedia players and can be streamed via the Internet. By this means, the accessibility of video metadata will be increased substantially¹.

We argue that video data sets should be stored in a multimedia container, carrying all metadata like tags, labels, object descriptions, online links, and even complex data such as gaze trajectories of multiple subjects and other video related data. Within prototype implementations using multimedia containers, we present solutions that contain all metadata in a single file, e.g., for research [14, 30, 20, 24], entertainment [9] and education supplements [15, 23], or communication aids for disabled people [10]. The multimedia container already supports a variety of data formats, enabling visualization of annotations, gaze points, etc. in standard media players or slightly modified versions of them. We want to establish a standard format that facilitates application in various fields, ranging from annotation for deep learning in computer vision over highlighting objects in movies for visually impaired people to creating auditory displays for blind people. A standard format will boost accessibility and shareability of metadata.

Our contribution focuses on object annotation metadata but tries not to neglect the variety of possible other metadata. The paper starts with a general section about annotation, followed by an extensive review of available data formats for video annotation, metadata description, timed data formats, and multimedia containers. Based on a discussion on suitable data representations for scientific metadata, we present two prototype implementations for providing annotations, gaze data of subjects, and the corresponding video file in a single multimedia container. After comparing the advantages and drawbacks of our prototypes, we summarize the possible impact and the opportunities a standardized container will provide.

¹ For a demonstration video, data sets, source code, and supplementary material, see https://ikw.uos.de/%7Ecv/publications/WACV17

© 2017 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works. DOI: 10.1109/WACV.2017.78

2. Annotation

Annotation is the process of adding notes to, e.g., a text, a diagram [29], or a video. For videos, such "notes" usually describe image or action related entities, such as areas of interest, object locations and classes, or space-time volumes of certain actions. But annotations become ever more complex, now including data of subjects watching the video, like gaze movements [14], brain waves, and emotions [22]. All of these data are referred to as metadata. In principle, there is no restriction to the kind and extent of recorded information. However, to allow the exchange and reuse of metadata, and to make annotations available to a larger audience, compatibility with established standards would be desirable. But most current annotation tools [7, 35, 25, 18, 27] and data sets [6, 13] do not take care of this problem; thus, the metadata are inaccessible to other tools and to media players.

For high-level video understanding and semantic web applications, we are particularly interested in annotations where objects are tagged, e.g., "red car shown in frames 10, 23–30", and labeled, e.g., "red car is represented by pixels x300 y400, x301 y400, ..., x340 y600 in frames 10 and 25". To obtain data sets where objects are labeled or tagged pixel-wise, today one still has to download zipped archives of several GBs and write or adapt a customized software tool for interpretation — maybe only to find out whether the data set really fits the requirements! So a streamable data format for video annotation would simplify the utilization of metadata immensely.

Evidently, the major question in designing a metadata format is this: What metadata are required? While we cannot give an answer in general, we are able to show prototypes of annotation formats that already cover a variety of generic tasks, are streamable, and provide an out of the box visualization.

3. Data Formats

Today's annotation tools [7, 35, 25, 18, 27] are used to create and store video annotations. But hardly any of them uses data formats or methods that support the streaming of video data together with metadata. Nevertheless, many streamable formats exist in the domains of DVD, Blu-ray, or video compression, which might be capable of carrying metadata next to the video to fit the necessary requirements.

3.1. Video annotation formats

The Video Performance Evaluation Resource (ViPER) is a software system designed primarily for analyzing annotated videos compared to a given ground truth. This software system can also be used for the process of object annotation. For efficient annotation, automatically generated 2D propagations of already annotated objects can be used to speed up the manual annotation process. Though ViPER was already released in 2001, it is still quite popular because of its XML-based file format for storing annotation information, which is organized in so-called descriptors. A ViPER descriptor is a record describing some element of the video with a unique identifier and a timespan during which it is valid. The schema of the XML is specified as XSD. The XSD description facilitates designing interfaces for a straightforward usage of ViPER's annotations in the context of other applications. Unfortunately, this format only supports labeling and tracking of objects and is stored separately from the corresponding video footage. Further, we noticed that the currently provided XML samples on the ViPER website are not valid with regard to the provided XSD descriptions.

Other annotation tools [3, 5] mostly use customized XML schemes, plain text, resource description frameworks (RDF) [32], or MPEG-7—MPEG-7 will be discussed later. All these manual or partially automated annotation tools have one feature in common: the annotations are stored alongside the video. So far, no tool, not even the tools using MPEG-7, considers the integration of its metadata into multimedia containers. This might be caused by the fact that video labeling is a challenging task by itself [35, 25], so the storing and representation of the created metadata are not the primary focus of these works.
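For illustration, a ViPER descriptor for the "red car" example above could look roughly like the following sketch; the element and attribute names are simplified here and are not taken verbatim from the official XSD, which remains the authoritative schema.

    <object name="Car" id="1" framespan="10:25">
      <attribute name="Location">
        <data:bbox framespan="10:10" x="300" y="400" width="40" height="200"/>
        <data:bbox framespan="25:25" x="320" y="410" width="40" height="200"/>
      </attribute>
    </object>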
3.2. Metadata Formats

The content of video material is hard to access by machines, so the value of annotations for the management and processing of videos can hardly be overestimated. However, the lack of standardization makes it hard to combine or exchange such data, or to provide search across different repositories.

Standards for metadata abound and have become popular with the rise of the semantic web. The RDF standard [32] provides a general format to make statements about resources, i.e., virtually everything that can be given a unique name. It comes with well-defined formal semantics and allows a distributed representation. However, it provides only a very limited predefined vocabulary, requiring applications to extend the language for specific domains by specifying schemes. By now, several such schemes exist, but to our knowledge, no standard for the description of video material has evolved. Videos feature a temporal and spatial structure, distinguishing them from most other types of data and requiring a special metadata framework.
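To make the missing vocabulary concrete, the following Turtle sketch states that a video resource shows a red car between frames 10 and 30; the ex: terms are invented for illustration and are exactly the kind of domain-specific scheme an application would have to define itself.

    @prefix ex: <http://example.org/video-annotation#> .

    <http://example.org/videos/traffic.mkv>
        ex:hasAnnotation [
            ex:label      "red car" ;
            ex:startFrame 10 ;
            ex:endFrame   30
        ] .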

Figure 1. MPEG-7 supports temporal interpolation. (a) shows interpolation of 25 stored data points using five connected linear functions and two connected quadratic functions; cf. Figure 6 in [11]. In (b), spatio-temporal interpolation between regions in a video sequence is shown; cf. Figure 23 in [11].

The Continuous Media Markup Language (CMML) [21] is a format for annotating time-continuous data with different types of information. It was developed to improve the integration of multimedia content into the world wide web, providing markup facilities for timed objects in a way similar to what HTML provides for text documents. With CMML, temporal intervals (clips) can be marked and described using a predefined vocabulary, allowing textual descriptions, hyperlinks, images (e.g., a key frame), and not further specified metadata in the form of attribute-value pairs. While CMML is able to address the temporal structure of videos, it provides no specific means for referring to pixels, regions, or space-time volumes. The same holds for the Synchronized Multimedia Integration Language (SMIL) [33], which aims at the integration and synchronization of multiple different media elements.

Probably the most advanced metadata framework for multimedia content is the Multimedia content description interface defined in the ISO/IEC 15938 standard, which has been developed by the Moving Picture Experts Group and is also known as MPEG-7. It specifies a set of tools to describe various types of multimedia information. However, MPEG-7 has been criticized for the lack of a formal semantics, which causes ambiguity leading to interoperability problems and hinders a widespread application [17, 1]. MPEG-7 provides means to describe multimedia content at different degrees of abstraction. It is designed as an extensible format, providing its own "Description definition language" (DDL—basically an extended form of XML schema). Its basic structure defines some vocabulary, which can be used for different aims: GRID LAYOUT, TIME SERIES, MULTIPLE VIEW, SPATIAL 2D COORDINATES, and TEMPORAL INTERPOLATION. Especially the TEMPORAL INTERPOLATION is quite interesting for object labeling. As illustrated in Figure 1(a), it allows temporal interpolation using connected polynomials, with linear interpolation as a special case. On this basis, the SPATIO-TEMPORAL LOCATOR describes spatio-temporal regions of an object of interest in a video sequence, as shown in Figure 1(b).

3.3. Timed text

Timed text [31] refers to time-stamped text documents that allow relating sections of the text to certain time intervals. Typical applications of timed text are the subtitling of movies and captioning for hearing impaired people or people lacking audio devices. The most simple form of timed text consists of units providing a time interval and a text to be displayed during that interval. This is basically what is provided by the popular SubRip format (SRT). Additionally, the second edition of the Timed Text Markup Language 1 (TTML1) [34] introduces a layout with a vocabulary for regions. Today there are many timed text formats which are hardly interoperable and therefore require media players to provide a large set of decoders and rendering engines.
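A single unit of the SRT format mentioned above, for example, consists only of a running index, a time interval, and the text to be displayed during that interval:

    1
    00:00:10,000 --> 00:00:12,500
    A red car enters the scene from the left.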
3.4. Subtitle and Caption Formats

The universal subtitle format (USF) was an open specification [19] which aims at providing a modern format for encoding subtitles. It tries to be comprehensive by providing features from different existing subtitle formats. An XML-based representation is chosen to gain flexibility, human readability, portability, Unicode support, a hierarchical structure, and an easier management of entries. The subtitles are intended to be rendered by the multimedia player, allowing the display style (color, size, position, etc.) to be adapted to fit the needs of the viewer. However, there are also tools to generate pre-rendered subtitles, e.g., in the VobSub format, for players that do not support USF. Nowadays, the open source project USF [19] has become private, so the community does not have any influence on the development. The latest version 1.1 has parts which are still under development. Consequently, some parts—e.g., the draw commands—are incomplete. In addition to visualization of data on top of the video, USF provides a comment tag which would allow storing additional information—not for display but for exchange.

Sub Station Alpha (SSA) is a file format for video subtitles, which has been introduced with the subtitle editor Sub Station Alpha. It has been widely used by fansubbers, and support has been implemented in many multimedia players. The extended version V4.00+ [28]—also known as Advanced SSA (ASS)—includes simple drawing commands, supporting straight lines, 3rd degree Bézier curves, and 3rd degree uniform B-splines, which is probably sufficient to roughly mark objects in a scene. Today's video players usually support the ASS drawing commands, but in older players, the drawing commands are not implemented.

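As a sketch, a single ASS event could outline a bounding box by switching into drawing mode with the \p1 override tag and closing it with \p0; the event fields follow the format line of the [Events] section, and styling details are omitted here.

    Dialogue: 0,0:00:10.00,0:00:12.50,Default,,0,0,0,,{\p1}m 300 400 l 340 400 340 600 300 600{\p0}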

3.5. Multimedia container

Multimedia objects consist of multiple parallel tracks, usually a video track, one or more audio tracks, and some optional subtitle tracks. These tracks are often combined into a single container for storage, distribution, or broadcasting. In contrast to classical data archives, multimedia containers have to account for the temporal nature of their payload, in order to support seeking and synchronized playback of the relevant tracks, as shown in Figure 2. To embed data into a container, a packager needs some basic understanding of the data format, at least enough to understand its temporal structure. Further, certain aspects of the data or the container format may prevent a direct encapsulation. In brief: not every payload is suited for every container.

Figure 2. General data structure of a multimedia container: the header and the general metadata, which have no temporal dependencies, are stored before the temporal video, audio, subtitle, and metadata tracks. For streaming this video, only the non-temporal data have to be transmitted before playing; the temporal data are transmitted during playing.

Current multimedia containers include the VOB and EVO formats used for DVDs, which are based on the MPEG-PS standard. The more modern MP4 format, specified in MPEG-4 (ISO/IEC 14496, Part 14) [12], was established to hold video, audio, and timed text data. Though MP4 cannot handle arbitrary video, audio, and timed text formats, it does conform to the formats introduced in the rest of the standard. The free OGG container format was originally put forward by the non-profit Xiph.Org foundation [38] for streaming Vorbis encoded audio files and is nowadays supported by many portable devices. With the introduction of the Theora and Dirac video formats, OGG has also become a popular format for streaming multimedia content on the web. The open Matroška container format [16] (MKV) aims at flexibility to allow an easy inclusion of different types of payload. It serves as a basis for the WebM format, which is pushed forward to establish a standard for multimedia content on the web.

Beyond subtitling, the inclusion of timed metadata into multimedia containers seems to be discussed only sporadically. The embedding of CMML descriptions in an OGG container is one such approach [37], but it seems quite limited due to the restricted data format, not providing direct support for spatial features like region markup. Probably the most matured specified approach is the embedding of MPEG-7 [11] data into MP4 [12] containers. Up to now, we know of no software that is able to do this embedding.
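In practice, such an encapsulation can be performed with off-the-shelf muxers. For instance, assuming the mkvmerge tool from the MKVToolNix suite and one subtitle file per annotation track, a single streamable file could be created roughly as follows (option syntax may differ between versions):

    # mux a video and two annotation tracks into one Matroska container (sketch)
    mkvmerge -o annotated.mkv video.mp4 \
      --track-name 0:"label: red car"  red_car.ass \
      --track-name 0:"gaze: subject 1" gaze_s1.ass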
4. Suitable data format

What kind of annotation standard would be suited best for video annotations? Data formats for metadata, as seen in Section 3, abound and keep growing, fostered by the advance of the semantic web. While general formats, like RDF [32], are well established, they are not geared towards the description of video material, and hence miss the required vocabulary. We suggest three approaches for standardizing video annotation: i) extension of a metadata format like RDF with a video vocabulary, ii) using a well-defined standard for video metadata, and iii) reutilization of a commonly used format, e.g., for subtitles or captions, to store annotations. We will find that the first two approaches, though desirable in general, are not feasible to solve the problem with overseeable effort and on a short time scale. But the third approach is feasible with almost no additional effort and is compatible with today's media players¹.

Extension of a general, well-established, extensible metadata standard like RDF with a video vocabulary is the first approach, motivated by experiences in non-video domains. It requires the definition of a specialized vocabulary. Since we have to start from scratch, it would require the definition of a time-dependent vocabulary as well as the extension of tools and the development of players that can deal with these new vocabularies. For this, one can resort to existing libraries for serializing (writing) and deserializing (reading) the data.

Using a specialized, well-established standard for video metadata like MPEG-7 is the second possible approach. MPEG-7 has a well defined and established vocabulary for various annotations. Unfortunately, no standard media player (like VLC, MediaPlayer, and Windows Media Player) seems to support MPEG-7 video annotation formats. Hence, it requires implementing extensions for every single player—fortunately, one can build on existing libraries [2]. When implementing multimedia player extensions, one should aim for a generic interface that can be used to visualize other MPEG-7 annotations as well.

The third approach is the reutilization of other kinds of formats for storing annotations, which are already implemented for DVDs and Blu-rays. Examples are subtitles, captions, audio tracks, and online links [9]. This kind of "hijacking" of formats has the benefit that these formats are widely supported by multimedia players. Still, some relevant features, like drawing boxes or polygons, seem not to be widely used and implemented. By putting annotations into a hijacked format, one has to be careful that no information is lost. Additionally, one should bear in mind that the original format was designed for some other purpose, so it may not support desired features, e.g., simultaneous display of multiple objects.
Which of the three approaches is best suited for video annotation? In our opinion, the use of MPEG-7 or similar formats is preferable for any kind of video annotation. Due to its proper specification in the norm ISO/IEC 15938 [11], it provides a platform for consistent implementations in all media players. A further advantage is that MPEG-7 can theoretically be encapsulated into an MP4 [12] container so that both the video and its related annotations are stored as one streamable file. Unfortunately, the well-detailed specification leads to the issue that developing media player extensions requires considerable implementation effort.

Thus, we cannot easily provide an MPEG-7 extension as the optimal solution. But to highlight the advantages of a single multimedia container file—carrying all annotations as well as other metadata—we chose the reutilization of USF, which we can pack into the MKV multimedia container. These containers can be played back by video players like VLC and others. We hijacked the USF data format for two main reasons: firstly, its specification considers possible methods for drawing shapes, a necessity for object labeling; secondly, and more importantly, USF is—like the preferable MPEG-7—an XML-based format and thereby capable of providing complex data structures.

5. Prototypes

Following the previous discussion to hijack, or more precisely, to modify an existing subtitle format for video annotations, cf. Section 4, two complementary prototypes were implemented. The first one is based on USF and encapsulates the complete metadata for visualization in a modified version of the VLC media player. The second one is based on ASS and only able to carry selected metadata, but this data can be visualized by most current media players.

5.1. Metadata as USF

In order to use USF for encapsulating the annotation data, we analyzed which features of USF are available in the latest releases of common multimedia players. One of the most common media players is the VLC media player. The current version 3.0.0 already supports a variety of USF attributes, which are text, image, karaoke, and comment. The latest USF specification introduces an additional attribute shape that is still marked as under development, although this specification is already quite old. Since common object visualization methods rely on simple geometric shapes, like rectangles and polygons, the use of the shape attributes for video annotations seems to be quite appropriate.

Since the exact specification of the shape attribute is, as mentioned, not complete, we particularized it with respect to rectangles, polygons, and points, as illustrated in Listing 1. These simple geometric shapes were taken as first components in order to visualize a multitude of different object annotation types. Rectangles are most commonly used for bounding box annotations, whereas polygons provide a more specific, but complex way of describing the contour of an object. Finally, point-like annotations are useful to describe locations without region information, e.g., for gaze-point analysis in eye-tracking studies.

The visualization of USF data is handled by VLC in a codec module called subsusf. This codec module receives streams of the subtitle data for the current frame from the demuxer of VLC. We extended this module with additional parsing capabilities for our specified shape data, which is then drawn into the so-called subpictures and passed on to the renderer of VLC. Since the thread will be called for every frame, the implementation is time-critical, and we decided to use the fast rasterization algorithms of Bresenham [4]. Additionally, we added an option to fill the shapes, which is implemented with the scan line algorithm [36]. While extending the subsusf module, we noticed several issues in the already existing code, which resulted in unwanted crashes in some scenarios, even though the data used was completely valid. Thus, in addition to extending the module for our needs, we fixed some unstable parts of the original implementation as well¹. The result of the visualization of the USF attribute shape can be seen in Figure 3.

To test our implementation, we created an MKV container containing a video file as well as dummy USF files with all supported attributes. VLC can open these containers and yields the desired visualization of geometric object annotations. This proves our concept that the incorporation of metadata into USF is possible. Further, using MKV as container format implies possible usage for streaming, since content and metadata are integrated temporally. Opening the container in a standard version of VLC without the adapted subsusf module is still possible, but the shape metadata is then not visualized.

Figure 3. Visualization of the USF shape attributes implemented by our version of subsusf.c for the VLC player version 3.0.0: (a) rectangle-shaped metadata like bounding boxes; (b) arbitrary shapes, e.g., for pixel-wise labeling and areas of interest; and (c) pixel-bound metadata like gaze points on a video.

Listing 1. Section of the USF specification [19]; attributes marked with * are added to the specification and implemented in our altered VLC player.

    ...
    @-Type (0..1)
    +-text (0..N)
    +-image (0..N)
    +-karaoke (0..N)
    +-shape (0..N)*
    +-polygon (0..N)*
      @-posy (1)*
      ...
      @-height (1)*
      @-diameter (1)*
    +-comment (0..N)
    ...
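As a sketch, a USF subtitle entry carrying one of the particularized shape elements could look like the following; the element and attribute names mirror Listing 1 where visible, but the exact serialization shown here is illustrative and is not part of the published USF specification.

    <subtitle start="00:00:10.000" stop="00:00:12.500">
      <shape>
        <polygon posx="300" posy="400" width="40" height="200"/>
      </shape>
    </subtitle>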
Additional tests were carried out with fully annotated video data, i.e., labels were supplied for all pixels, such as "object" and "background", as exported from our interactive annotation and Segmentation tool (iSeg) [25]. We extended our tool with an option to export all labeled objects to USF files, which are encapsulated together with the original video in an MKV, being coded as different subtitle tracks. As a result, the visualization of the labeled objects can always be selected during playback using the same interface metaphors as for changing subtitles. The nature of the general subtitle system in VLC only allows for one active subtitle track at all times, which unfortunately limits the range of possibilities for annotations significantly. Often it is important to visualize inter-object relations, which becomes very tedious, if not even infeasible, in this context.

5.2. Metadata as ASS

Since the USF based prototype needs a modified version of a media player, a broad audience is still unable to watch the annotations without the dedicated media player. Therefore, a subtitle format which already can draw lines is used for implementing the second prototype. For this, we use the ASS subtitle format. In contrast to USF, the ASS subtitle format cannot carry all metadata. This can be put down to its plain text format, which is not capable of representing complex data structures and cannot hold non-visualizable content.

Using the existing USF files with pixel-wise labeled objects, created for the first prototype, ASS files are created using the Turing-complete transformation language XSLT (Extensible Stylesheet Language Transformations) with a simple translation stylesheet file¹. After the conversion, an MKV container is created including the video and one ASS file for each labeled object. The resulting container makes metadata accessible for a broad audience, because the simple visualization by ASS subtitles can be displayed with unmodified video players.
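A minimal XSLT stylesheet for this conversion could, for example, match each USF subtitle entry and emit one ASS Dialogue line per entry; the USF element names, the time reformatting, and the shape-to-drawing translation are simplified or omitted in this sketch.

    <xsl:stylesheet version="1.0"
        xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
      <xsl:output method="text"/>
      <!-- one ASS Dialogue line per USF subtitle entry (sketch) -->
      <xsl:template match="subtitle">
        <xsl:text>Dialogue: 0,</xsl:text>
        <xsl:value-of select="@start"/>
        <xsl:text>,</xsl:text>
        <xsl:value-of select="@stop"/>
        <xsl:text>,Default,,0,0,0,,</xsl:text>
        <!-- here the shape children would be rewritten as {\p1}...{\p0} drawing commands -->
        <xsl:text>&#10;</xsl:text>
      </xsl:template>
    </xsl:stylesheet>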

(c)

(a)

Figure 4. Data is easily available via menu selection for visualization or out of the MKV, e.g., for experiments: (a) the selection menu; (b)–(d) example visualizations of pixel-wise labels in the VLC player.

5.3. Representing semantic knowledge

Based on the latest version of iSeg [25], we implemented a semantic time line for adding semantic information to the labeled objects. Part of this information is the stack order of the labeled objects, aka the z-order. In our implementation of the USF export method, we are able to interpret the z-order such that a subtitle is created automatically which describes the visibility of the objects of interest, see Figure 5. Due to the encapsulation into a multimedia container, this automatically generated subtitle can be displayed in common multimedia players.

Figure 5. Label of the white car with semantic knowledge, visualized as an automatically generated subtitle: (a) white car visible; (b) in the next second, white car partially occluded by the red car; (c) the next time, white car visible again.

The automatic subtitle generation uses a simple grammar: if the labeled area of one object is intercepted by another label and that label's z-order is higher than the current object's, then the object is not visible or partially occluded; otherwise, if the labeled area is within the video resolution, the object is visible.
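The rule can be sketched in a few lines of Python; this is a simplification that works on bounding rectangles and a flat list of competing labels, whereas the actual export operates on the labeled areas and the semantic time line of iSeg.

    def intersects(a, b):
        # axis-aligned rectangles given as (x, y, width, height)
        ax, ay, aw, ah = a
        bx, by, bw, bh = b
        return ax < bx + bw and bx < ax + aw and ay < by + bh and by < ay + ah

    def visibility(rect, z, others, frame_width, frame_height):
        # subtitle text for one labeled object; others is a list of (rect, z) pairs
        if not intersects(rect, (0, 0, frame_width, frame_height)):
            return "not visible"              # outside the video resolution
        for other_rect, other_z in others:
            if other_z > z and intersects(rect, other_rect):
                return "partially occluded"   # intercepted by a label with higher z-order
        return "visible"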

6. User Study

The use of our multimedia containers allows performing multimodal analysis with a standard multimedia player. To investigate the gains of our approach, we conducted a first short user study [26]. We asked one expert and two non-experts in the field of multimodal analysis to exploratively investigate data sets with video and image stimuli as well as gaze and EEG data of several subjects that were made available in our container format. A first response of the expert was "it's a beautiful approach because one gets a first impression of the data without the need to run any scripts or install a special player". Further, the expert considers the exploration of unprocessed data highly useful for several reasons, e.g., the possibility to quickly perform plausibility checks of the material. The non-expert users highly appreciated the ease of interactively demonstrating multimodal data.

7. Results and Discussion

Comparing the USF based and the ASS based prototypes with the MPEG-7 approach, as shown in Table 1, one might assume that the USF approach is as good as the MPEG-7 approach. In our opinion this is not the case because of three facts: firstly, full support of complex metadata is very important for providing video annotations in standardized multimedia containers; in contrast to USF, MPEG-7 achieves this goal completely due to its variety of well-defined vocabulary. Secondly, we are convinced that the missing integration of MPEG-7 encoders in state-of-the-art media players would be solved soon if the semantic web evolves towards the incorporation of media data like videos. Finally, no known media player is designed to visualize several subtitles in parallel, which is quite important to visualize inter-object relations. This fact can be seen as a blocking point for approaches based on hijacking subtitle formats for providing metadata for visualization.

Table 1. Comparison of the USF based prototype, the ASS based prototype, and the MPEG-7 approach with respect to full, partial, or no support of the following criteria: general metadata representation; supporting complex metadata; out of the box media player integration; parallel visualization of different metadata entries; easy accessibility of metadata with scientific tools; open source development tools available; encapsulation in a single file container; streamable container with temporally dependent data. Supported container formats for encapsulation: MKV (USF prototype); MKV, MP4, OGG, ... (ASS prototype); MP4 (MPEG-7 approach).

7.1. Benefit of metadata in multimedia containers for visualization and research

A widespread use of multimedia containers for storing metadata will not only make metadata accessible to a broader audience, but also boost the use of metadata in science, in particular for machine learning tasks in computer vision. Instead of downloading GBs of zipped files to get an impression of what data is available, one can simply watch the data set online. Moreover, using a common metadata standard will terminate the tedious process where researchers have to adjust their software for every different data set. Finally, machine learning techniques like deep learning, which depend on large-scale data sets, can now be applied to the (so far too large) domain of video understanding. Additionally, consumers may profit from features like highlighting objects in movies for visually impaired persons, searching videos for semantic relations between objects, or generating an understandable auditory description out of a video.

7.2. Available data sets

Promoting the use of metadata encapsulated in multimedia containers, we are converting existing video data sets into MKV containers. For every data set, two types of the MKV multimedia container are available. In the first type, the complete metadata are stored in USF, thus one only needs the described modification of the subtitle drawing routine in order to visualize the metadata. In the other type, only parts of the metadata are stored in the ASS format, such that the metadata can be visualized with almost every currently available video player.

Currently, two converted data sets can be downloaded¹. The first data set, by Kurzhals et al. [13], contains video stimuli together with rectangle-shaped, labeled objects and gaze trajectories of multiple subjects as metadata. The second converted data set carries segmentations created manually by multiple persons [8] as metadata.

8. Conclusion

The importance of metadata in general and for video annotation in particular will significantly increase if metadata are provided in multimedia containers, as we suggest in this work. These multimedia containers can be interpreted by common video players in the same way as today's subtitles. In our opinion, the research community should prefer the use of MPEG-7 as the best-suited format for describing any kind of metadata due to its high-level description. Unfortunately, we recognize the lack of MPEG-7 integration in current media players. Therefore, we propose the reuse of subtitle formats as a feasible, transitory solution until MPEG-7 support in video players is available. Nevertheless, to promote the use of multimedia containers, like it has become common for text containers like PDF, we are converting existing video data sets into MKV containers which provide the corresponding metadata as USF or as ASS. In the future, if MPEG-7 is supported by media players, our USF based multimedia containers can be easily converted because both formats are XML-based.

References

[1] R. Arndt, S. Staab, R. Troncy, and L. Hardman. Adding formal semantics to MPEG-7. Arbeitsberichte des Fachbereichs Informatik 04/2007, Universität Koblenz-Landau, 2007.
[2] W. Bailer, H. Fürntratt, P. Schallauer, G. Thallinger, and W. Haas. A C++ library for handling MPEG-7 descriptions. In Proceedings of the 19th ACM International Conference on MM, pages 731–734. ACM, 2011.
[3] S. Bianco, G. Ciocca, P. Napoletano, and R. Schettini. An interactive tool for manual, semi-automatic and automatic video annotation. COMPUT VIS IMAGE UND, 131:88–99, 2015.
[4] J. E. Bresenham. Algorithm for computer control of a digital plotter. IBM SYST J, 4(1):25–30, 1965.
[5] S. Dasiopoulou, E. Giannakidou, G. Litos, P. Malasioti, and Y. Kompatsiaris. A survey of semantic image and video annotation tools. Lect Notes Comput Sc, 6050:196–239, 2011.
[6] A. F. de Araújo, J. Chaves, D. M. Chen, R. Angst, and B. Girod. Stanford I2V: a news video dataset for query-by-image experiments. In Proceedings of the 6th ACM MMSys Conference, pages 237–242, 2015.
[7] D. Doermann and D. Mihalcik. Tools and techniques for video performance evaluation. In Proceedings of the 15th IEEE ICPR, volume 4, pages 167–170, 2000.
[8] F. Galasso, N. S. Nagaraja, T. J. Cardenas, T. Brox, and B. Schiele. A unified video segmentation benchmark: Annotation, metrics and analysis. In Proceedings of the IEEE ICCV. IEEE, 2013.
[9] G. Bertellini and J. Reich. DVD supplements: A commentary on commentaries. CINEMA J, 49(3):103–105, 2010.
[10] C. Gonsalves and M. K. Pichora-Fuller. The effect of hearing loss and hearing aids on the use of information and communication technologies by community-living older adults. CAN J AGING, 27(2):145–157, 2008.
[11] ISO/IEC. Information technology—multimedia content description interface—Part 3: Visual (ISO/IEC 15938-3:2001), 2001.
[12] ISO/IEC. Information technology—coding of audio-visual objects—Part 14: MP4 file format (ISO/IEC 14496-14:2003), 2003.
[13] K. Kurzhals, C. F. Bopp, J. Bässler, F. Ebinger, and D. Weiskopf. Benchmark data for evaluating visualization and analysis techniques for eye tracking for video stimuli. Workshop on BELIV, pages 54–60, 2014.
[14] K. Kurzhals, F. Heimerl, and D. Weiskopf. ISeeCube - visual analysis of gaze data for video. In Proceedings of the 2014 Symposium on ETRA, pages 43–50. ACM, 2014.
[15] M. Martin, J. Charlton, and A. M. Connor. Mainstreaming video annotation software for critical video analysis. Journal of Technologies and Human Usability, 11(3):1–13, 2015.
[16] Matroska Media Container. https://www.matroska.org/, Jan 2017.
[17] F. Mokhtarian and M. Bober. Curvature Scale Space Representation: Theory, Applications, and MPEG-7 Standardization, volume 25 of Computational Imaging and Vision. Springer, 2003.
[18] Multimedia Knowledge and Social Media Analytics Laboratory. Video image annotation tool. http://mklab.iti.gr/project/via, Jan 2017.
[19] C. Paris, L. Vialle, and U. Hammer. TitleVision - USF specs. http://www.titlevision.com/usf.htm, Nov 2016.
[20] N. Petrovic, N. Jojic, and T. S. Huang. Adaptive video fast forward. MULTIMED TOOLS APPL, 26(3):327–344, 2005.
[21] S. Pfeiffer, C. D. Parker, and A. Pang. The Continuous Media Markup Language (CMML), Version 2.1. Internet Engineering Task Force (IETF), https://www.ietf.org/archive/id/draft-pfeiffer-cmml-03.txt, March 2006.
[22] E. M. Provost, Y. Shangguan, and C. Busso. UMEME: University of Michigan emotional McGurk effect data set. IEEE J AFFC, 6(4):395–409, 2015.
[23] C. Rackaway. Video killed the textbook star? Use of multimedia supplements to enhance student learning. Journal of Political Science Education, 8(2):189–200, 2010.
[24] J. Schöning. Interactive 3D reconstruction: New opportunities for getting CAD-ready models. In Proceedings of the ICCSW, volume 49, pages 54–61. Schloss Dagstuhl–LZI, 2015.
[25] J. Schöning, P. Faion, and G. Heidemann. Semi-automatic ground truth annotation in videos: An interactive tool for polygon-based object annotation and segmentation. In Proceedings of the 8th International Conference on K-CAP, pages 17:1–17:4. ACM, New York, 2015.
[26] J. Schöning, A. L. Gert, A. Açık, T. C. Kietzmann, G. Heidemann, and P. König. Exploratory multimodal data analysis with standard multimedia player — multimedia containers: a feasible solution to make multimodal research data accessible to the broad audience. In Proceedings of the International Conference on VISAPP. SCITEPRESS [in press], 2017.
[27] R. Schroeter, J. Hunter, and D. Kosovic. Vannotea - a collaborative video indexing, annotation and discussion system for broadband networks. Workshop on Knowledge Markup & Semantic Annotation, pages 1–8, 2003.
[28] Sub Station Alpha v4.00+ Script Format. moodub.free.fr/video/ass-specs.doc, Jan 2017.
[29] A. Stevenson, editor. Oxford Dictionary of English. Oxford University Press, 3rd edition, 2010.
[30] M. Vernier, M. Farinosi, and G. L. Foresti. A smart visual information tool for situational awareness. In Proceedings of the International Conference on VISAPP, volume 3, pages 238–247. SCITEPRESS, 2016.
[31] W3C. Timed text working group. http://www.w3.org/AudioVideo/TT/, Jan 2016.
[32] W3C. RDF - semantic web standards. https://www.w3.org/RDF/, Jan 2017.
[33] W3C. Synchronized multimedia integration language (SMIL 3.0). https://www.w3.org/TR/SMIL3/, Jan 2017.
[34] W3C. Timed text markup language 1 (TTML1) (second edition). https://www.w3.org/TR/ttml1/, Jan 2017.
[35] S. Wu, S. Zheng, H. Yang, Y. Fan, L. Liang, and H. Su. SAGTA: Semi-automatic ground truth annotation in crowd scenes. In Proceedings of IEEE ICME, pages 1–6. IEEE, 2014.
[36] C. Wylie, G. Romney, D. Evans, and A. Erdahl. Half-tone perspective drawings by computer. In Proceedings of the Fall Joint Computer Conference, AFIPS '67 (Fall), pages 49–58. ACM, 1967.
[37] Xiph.org. CMML mapping into Ogg. https://wiki.xiph.org/index.php/CMML, Feb 2016.
[38] Xiph.org. Ogg. https://xiph.org/ogg/, Jan 2017.