Providing Video Annotations in Multimedia Containers for Visualization and Research

Julius Schöning, Patrick Faion, Gunther Heidemann, Ulf Krumnack
Institute of Cognitive Science, Osnabrück University, Germany
{juschoening, pfaion, gheidema, krumnack}@uos.de

Abstract

There is an ever increasing amount of video data sets which comprise additional metadata, such as object labels, tagged events, or gaze data. Unfortunately, metadata are usually stored in separate files in custom-made data formats, which reduces accessibility even for experts and makes the data inaccessible for non-experts. Consequently, we still lack interfaces for many common use cases, such as visualization, streaming, data analysis, machine learning, high-level understanding, and semantic web integration. To bridge this gap, we want to promote the use of existing multimedia container formats to establish a standardized method of incorporating content and metadata. This will facilitate visualization in standard multimedia players, streaming via the Internet, and easy use without conversion, as shown in the attached demonstration video and files. In two prototype implementations, we embed object labels, gaze data from eye tracking, and the corresponding video into a single multimedia container and visualize these data using a media player. Based on these prototypes, we discuss the benefit of our approach as a possible standard. Finally, we argue for the inclusion of MPEG-7 in multimedia containers as a further improvement.

1. Introduction

Metadata of video files are still commonly stored in separate files using custom file formats. Metadata such as object annotations and segmentation information, as well as high-level information like gaze trajectories or behavior data of subjects, are stored next to the video in a single or multiple files.
The data structures of these files are mostly customized, sometimes unique, and they are stored in a diversity of formats, e.g., plain text, XML, MATLAB format, or binary. Accessing and visualizing these data requires special tools. For the general audience, the use of these data is quite impossible.

So, why not encapsulate these additional data in a standard container format? This has become common practice for storing text plus metadata, e.g., in the PDF container. State of the art video containers like the open container format [38] (OGG), MPEG-4 [12], or the Matroška container format [16] (MKV) can encapsulate video and metadata such that they can be interpreted as a single file by standard multimedia players and can be streamed via the Internet. By this means, the accessibility of video metadata will be increased substantially¹.

We argue that video data sets should be stored in a multimedia container, carrying all metadata like tags, labels, object descriptions, online links, and even complex data such as gaze trajectories of multiple subjects and other video related data. Within prototype implementations using multimedia containers, we present solutions that contain all metadata in a single file, e.g., for research [14, 30, 20, 24], entertainment [9] and education supplements [15, 23], or communication aids for disabled people [10]. The multimedia container already supports a variety of data formats, enabling visualization of annotations, gaze points, etc. in standard media players or slightly modified versions of them. We want to establish a standard format that facilitates application in various fields, ranging from annotation for deep learning in computer vision over highlighting objects in movies for visually impaired people to creating auditory displays for blind people. A standard format will boost accessibility and shareability of metadata.

Our contribution focuses on object annotation metadata but tries not to neglect the variety of possible other metadata. The paper starts with a general section about annotation, followed by an extensive review of available data formats for video annotation, metadata description, timed data formats, and multimedia containers. Based on a discussion on suitable data representations for scientific metadata, we present two prototype implementations for providing annotations, gaze data of subjects, and the corresponding video file in a single multimedia container. After comparing the advantages and drawbacks of our prototypes, we summarize the possible impact and the opportunities a standardized container will provide.

¹ For a demonstration video, data sets, source code, and supplementary material, see https://ikw.uos.de/%7Ecv/publications/WACV17

© 2017 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works. DOI: 10.1109/WACV.2017.78

2. Annotation

Annotation is the process of adding notes to, e.g., a text, a diagram [29], or a video. For videos, such "notes" usually describe image or action related entities, such as areas of interest, object locations and classes, or space-time volumes of certain actions. But annotations become ever more complex, now including data of subjects watching the video, like gaze movements [14], brain waves, and emotions [22]. All of these data are referred to as metadata. In principle, there is no restriction to the kind and extent of recorded information. However, to allow the exchange and reuse of metadata, and to make annotations available to a larger audience, compatibility with established standards would be desirable. But most current annotation tools [7, 35, 25, 18, 27] and data sets [6, 13] do not take care of this problem; thus, the metadata are inaccessible to other tools and to media players.

For high-level video understanding and semantic web applications, we are particularly interested in annotations where objects are tagged, e.g., "red car shown in frames 10, 23–30", and labeled, e.g., "red car is represented by pixels x300 y400, x301 y400, ..., x340 y600 in frames 10 and 25". To obtain data sets where objects are labeled or tagged pixel-wise, today one still has to download zipped archives of several GBs and write or adapt a customized software tool for interpretation — maybe only to find out whether the data set really fits the requirements! So a streamable data format for video annotation would simplify the utilization of metadata immensely.

Evidently, the major question in designing a metadata format is this: What metadata are required? While we cannot give an answer in general, we are able to show prototypes of annotation formats that already cover a variety of generic tasks, are streamable, and provide an out of the box visualization.

3. Data Formats

Today's annotation tools [7, 35, 25, 18, 27] are used to create and store video annotations. But hardly any of them uses data formats or methods that support the streaming of video data together with metadata. Nevertheless, many streamable formats exist in the domains of DVD, Blu-ray, or video compression, which might be capable of carrying metadata next to the video to fit the necessary requirements.

3.1. Video annotation formats

The Video Performance Evaluation Resource (ViPER) is a software system designed primarily for analyzing annotated videos compared to a given ground truth. This software system can also be used for the process of object annotation. For efficient annotation, automatically generated 2D propagations of already annotated objects can be used to speed up the manual annotation process. Though ViPER was already released in 2001, it is still quite popular because of its XML-based file format for storing annotation information, which is organized in so-called descriptors. A ViPER descriptor is a record describing some element of the video with a unique identifier and a timespan during which it is valid. The schema of the XML is specified as XSD. The XSD description facilitates designing interfaces for a straightforward usage of ViPER's annotations in the context of other applications. Unfortunately, this format only supports labeling and tracking of objects and is stored separately from the corresponding video footage. Further, we noticed that the currently provided XML samples on the ViPER website are not valid with regard to the provided XSD descriptions.

Other annotation tools [3, 5] mostly use customized XML schemes, plain text, resource description frameworks (RDF) [32], or MPEG-7—MPEG-7 will be discussed later. All these manual or partially automated annotation tools have one feature in common: the annotations are stored alongside the video. So far, no tool, not even the tools using MPEG-7, considers the integration of its metadata into multimedia containers. This might be caused by the fact that video labeling is a challenging task by itself [35, 25], so the storing and representation of the created metadata are not the primary focus of these works.
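For illustration, a ViPER descriptor for the "red car" example above could look roughly like the following sketch; the element and attribute names are simplified here and are not taken verbatim from the official XSD, which remains the authoritative schema.

    <object name="Car" id="1" framespan="10:25">
      <attribute name="Location">
        <data:bbox framespan="10:10" x="300" y="400" width="40" height="200"/>
        <data:bbox framespan="25:25" x="320" y="410" width="40" height="200"/>
      </attribute>
    </object>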
3.2. Metadata Formats

The content of video material is hard to access by machines, so the value of annotations for the management and processing of videos can hardly be overestimated. However, the lack of standardization makes it hard to combine or exchange such data, or to provide search across different repositories.

Standards for metadata abound and have become popular with the rise of the semantic web. The RDF standard [32] provides a general format to make statements about resources, i.e., virtually everything that can be given a unique name. It comes with well-defined formal semantics and allows a distributed representation. However, it provides only a very limited predefined vocabulary, requiring applications to extend the language for specific domains by specifying schemes. By now, several such schemes exist, but to our knowledge, no standard for the description of video material has evolved. Videos feature a temporal and spatial structure, distinguishing them from most other types of data and requiring a special metadata framework.
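To make the missing vocabulary concrete, the following Turtle sketch states that a video resource shows a red car between frames 10 and 30; the ex: terms are invented for illustration and are exactly the kind of domain-specific scheme an application would have to define itself.

    @prefix ex: <http://example.org/video-annotation#> .

    <http://example.org/videos/traffic.mkv>
        ex:hasAnnotation [
            ex:label      "red car" ;
            ex:startFrame 10 ;
            ex:endFrame   30
        ] .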

Figure 1. MPEG-7 supports temporal interpolation. (a) shows interpolation of 25 stored data points using five connected linear functions and two connected quadratic functions; cf. Figure 6 in [11]. In (b), spatio-temporal interpolation between regions in a video sequence is shown; cf. Figure 23 in [11].

The Continuous Media Markup Language (CMML) [21] is a format for annotating time-continuous data with different types of information. It was developed to improve the integration of multimedia content into the world wide web, providing markup facilities for timed objects in a way similar to what HTML provides for text documents. With CMML, temporal intervals (clips) can be marked and described using a predefined vocabulary, allowing textual descriptions, hyperlinks, images (e.g., a key frame), and not further specified metadata in the form of attribute-value pairs. While CMML is able to address the temporal structure of videos, it provides no specific means for referring to pixels, regions, or space-time volumes. The same holds for the Synchronized Multimedia Integration Language (SMIL) [33], which aims at the integration and synchronization of multiple different media elements.

Probably the most advanced metadata framework for multimedia content is the Multimedia content description interface defined in the ISO/IEC 15938 standard, which has been developed by the Moving Picture Experts Group and is also known as MPEG-7. It specifies a set of tools to describe various types of multimedia information. However, MPEG-7 has been criticized for the lack of a formal semantics, which causes ambiguity leading to interoperability problems and hinders a widespread application [17, 1]. MPEG-7 provides means to describe multimedia content at different degrees of abstraction. It is designed as an extensible format, providing its own "Description definition language" (DDL—basically an extended form of XML schema). Its basic structure defines some vocabulary, which can be used for different aims: GRID LAYOUT, TIME SERIES, MULTIPLE VIEW, SPATIAL 2D COORDINATES, and TEMPORAL INTERPOLATION. Especially the TEMPORAL INTERPOLATION is quite interesting for object labeling. As illustrated in Figure 1(a), it allows temporal interpolation using connected polynomials, with linear interpolation as a special case. On this basis, the SPATIO-TEMPORAL LOCATOR describes spatio-temporal regions of an object of interest in a video sequence, as shown in Figure 1(b).

3.3. Timed text

Timed text [31] refers to time-stamped text documents that allow relating sections of the text to certain time intervals. Typical applications of timed text are the subtitling of movies and captioning for hearing impaired people or people lacking audio devices. The most simple form of timed text consists of units providing a time interval and a text to be displayed during that interval. This is basically what is provided by the popular SubRip format (SRT). Additionally, the second edition of the Timed Text Markup Language 1 (TTML1) [34] introduces a layout with a vocabulary for regions. Today there are many timed text formats which are hardly interoperable and therefore require media players to provide a large set of decoders and rendering engines.
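A single unit of the SRT format mentioned above, for example, consists only of a running index, a time interval, and the text to be displayed during that interval:

    1
    00:00:10,000 --> 00:00:12,500
    A red car enters the scene from the left.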
3.4. Subtitle and Caption Formats

The universal subtitle format (USF) was an open specification [19] which aims at providing a modern format for encoding subtitles. It tries to be comprehensive by providing features from different existing subtitle formats. An XML-based representation is chosen to gain flexibility, human readability, portability, Unicode support, a hierarchical structure, and an easier management of entries. The subtitles are intended to be rendered by the multimedia player, allowing the display style (color, size, position, etc.) to be adapted to fit the needs of the viewer. However, there are also tools to generate pre-rendered subtitles, e.g., in the VobSub format, for players that do not support USF. Nowadays, the open source project USF [19] has become private, so the community does not have any influence on the development. The latest version 1.1 has parts which are still under development. Consequently, some parts—e.g., the draw commands—are incomplete. In addition to visualization of data on top of the video, USF provides a comment tag which would allow storing additional information—not for display but for exchange.

Sub Station Alpha (SSA) is a file format for video subtitles, which has been introduced with the subtitle editor Sub Station Alpha. It has been widely used by fansubbers, and support has been implemented in many multimedia players. The extended version V4.00+ [28]—also known as Advanced SSA (ASS)—includes simple drawing commands, supporting straight lines, 3rd degree Bézier curves, and 3rd degree uniform B-splines, which is probably sufficient to roughly mark objects in a scene. Today's video players usually support the ASS drawing commands, but in older players, the drawing commands are not implemented.

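As a sketch, a single ASS event could outline a bounding box by switching into drawing mode with the \p1 override tag and closing it with \p0; the event fields follow the format line of the [Events] section, and styling details are omitted here.

    Dialogue: 0,0:00:10.00,0:00:12.50,Default,,0,0,0,,{\p1}m 300 400 l 340 400 340 600 300 600{\p0}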

3.5. Multimedia container

Multimedia objects consist of multiple parallel tracks, usually a video track, one or more audio tracks, and some optional subtitle tracks. These tracks are often combined into a single container for storage, distribution, or broadcasting. In contrast to classical data archives, multimedia containers have to account for the temporal nature of their payload, in order to support seeking and synchronized playback of the relevant tracks, as shown in Figure 2. To embed data into a container, a packager needs some basic understanding of the data format, at least enough to understand its temporal structure. Further, certain aspects of the data or the container format may prevent a direct encapsulation. In brief: not every payload is suited for every container.

Figure 2. General data structure of a multimedia container: the header and the general metadata, which have no temporal dependencies, are stored before the temporal video, audio, subtitle, and metadata tracks. For streaming this video, only the non-temporal data have to be transmitted before playing; the temporal data are transmitted during playing.

Current multimedia containers include the VOB and EVO formats used for DVDs, which are based on the MPEG-PS standard. The more modern MP4 format, specified in MPEG-4 (ISO/IEC 14496, Part 14) [12], was established to hold video, audio, and timed text data. Though MP4 cannot handle arbitrary video, audio, and timed text formats, it does conform to the formats introduced in the rest of the standard. The free OGG container format was originally put forward by the non-profit Xiph.Org foundation [38] for streaming Vorbis encoded audio files and is nowadays supported by many portable devices. With the introduction of the Theora and Dirac video formats, OGG has also become a popular format for streaming multimedia content on the web. The open Matroška container format [16] (MKV) aims at flexibility to allow an easy inclusion of different types of payload. It serves as a basis for the WebM format, which is pushed forward to establish a standard for multimedia content on the web.

Beyond subtitling, the inclusion of timed metadata into multimedia containers seems to be discussed only sporadically. The embedding of CMML descriptions in an OGG container is one such approach [37], but it seems quite limited due to the restricted data format, not providing direct support for spatial features like region markup. Probably the most matured specified approach is the embedding of MPEG-7 [11] data into MP4 [12] containers. Up to now, we know of no software that is able to do this embedding.
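In practice, such an encapsulation can be performed with off-the-shelf muxers. For instance, assuming the mkvmerge tool from the MKVToolNix suite and one subtitle file per annotation track, a single streamable file could be created roughly as follows (option syntax may differ between versions):

    # mux a video and two annotation tracks into one Matroska container (sketch)
    mkvmerge -o annotated.mkv video.mp4 \
      --track-name 0:"label: red car"  red_car.ass \
      --track-name 0:"gaze: subject 1" gaze_s1.ass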
4. Suitable data format

What kind of annotation standard would be suited best for video annotations? Data formats for metadata, as seen in Section 3, abound and keep growing, fostered by the advance of the semantic web. While general formats, like RDF [32], are well established, they are not geared towards the description of video material, and hence miss the required vocabulary. We suggest three approaches for standardizing video annotation: i) extension of a metadata format like RDF with a video vocabulary, ii) using a well-defined standard for video metadata, and iii) reutilization of a commonly used format, e.g., for subtitles or captions, to store annotations. We will find that the first two approaches, though desirable in general, are not feasible to solve the problem with overseeable effort and on a short time scale. But the third approach is feasible with almost no additional effort and is compatible with today's media players¹.

Extension of a general, well-established, extensible metadata standard like RDF with a video vocabulary is the first approach, motivated by experiences in non-video domains. It requires the definition of a specialized vocabulary. Since we have to start from scratch, it would require the definition of a time-dependent vocabulary as well as the extension of tools and the development of players that can deal with these new vocabularies. For this, one can resort to existing libraries for serializing (writing) and deserializing (reading) the data.

Using a specialized, well-established standard for video metadata like MPEG-7 is the second possible approach. MPEG-7 has a well defined and established vocabulary for various annotations. Unfortunately, no standard media player (like VLC, MediaPlayer, and Windows Media Player) seems to support MPEG-7 video annotation formats. Hence, it requires implementing extensions for every single player—fortunately, one can build on existing libraries [2]. When implementing multimedia player extensions, one should aim for a generic interface that can be used to visualize other MPEG-7 annotations as well.

The third approach is the reutilization of other kinds of formats for storing annotations, which are already implemented for DVDs and Blu-rays. Examples are subtitles, captions, audio tracks, and online links [9]. This kind of "hijacking" of formats has the benefit that these formats are widely supported by multimedia players. Still, some relevant features, like drawing boxes or polygons, seem not to be widely used and implemented. By putting annotations into a hijacked format, one has to be careful that no information is lost. Additionally, one should bear in mind that the original format was designed for some other purpose, so it may not support desired features, e.g., simultaneous display of multiple objects.
Which of the three approaches is best suited for video annotation? In our opinion, the use of MPEG-7 or similar formats is preferable for any kind of video annotation. Due to its proper specification in the norm ISO/IEC 15938 [11], it provides a platform for consistent implementations in all media players. A further advantage is that MPEG-7 can theoretically be encapsulated into an MP4 [12] container so that both the video and its related annotations are stored as one streamable file. Unfortunately, the well-detailed specification leads to the issue that developing media player extensions requires considerable implementation effort.

Thus, we cannot easily provide an MPEG-7 extension as the optimal solution. But to highlight the advantages of a single multimedia container file—carrying all annotations as well as other metadata—we chose the reutilization of USF, which we can pack into the MKV multimedia container. These containers can be played back by video players like VLC and others. We hijacked the USF data format for two main reasons: firstly, its specification considers possible methods for drawing shapes, a necessity for object labeling; secondly, and more importantly, USF is—like the preferable MPEG-7—an XML-based format and thereby capable of providing complex data structures.

5. Prototypes

Following the previous discussion to hijack, or more precisely, to modify an existing subtitle format for video annotations, cf. Section 4, two complementary prototypes were implemented. The first one is based on USF and encapsulates the complete metadata for visualization in a modified version of the VLC media player. The second one is based on ASS and only able to carry selected metadata, but this data can be visualized by most current media players.

5.1. Metadata as USF

In order to use USF for encapsulating the annotation data, we analyzed which features of USF are available in the latest releases of common multimedia players. One of the most common media players is the VLC media player. The current version 3.0.0 already supports a variety of USF attributes, which are text, image, karaoke, and comment. The latest USF specification introduces an additional attribute shape that is still marked as under development, although this specification is already quite old. Since common object visualization methods rely on simple geometric shapes, like rectangles and polygons, the use of the shape attributes for video annotations seems to be quite appropriate.

Since the exact specification of the shape attribute is, as mentioned, not complete, we particularized it with respect to rectangles, polygons, and points, as illustrated in Listing 1. These simple geometric shapes were taken as first components in order to visualize a multitude of different object annotation types. Rectangles are most commonly used for bounding box annotations, whereas polygons provide a more specific, but complex way of describing the contour of an object. Finally, point-like annotations are useful to describe locations without region information, e.g., for gaze-point analysis in eye-tracking studies.

The visualization of USF data is handled by VLC in a codec module called subsusf. This codec module receives streams of the subtitle data for the current frame from the demuxer of VLC. We extended this module with additional parsing capabilities for our specified shape data, which is then drawn into the so-called subpictures and passed on to the renderer of VLC. Since the thread will be called for every frame, the implementation is time-critical, and we decided to use the fast rasterization algorithms of Bresenham [4]. Additionally, we added an option to fill the shapes, which is implemented with the scan line algorithm [36]. While extending the subsusf module, we noticed several issues in the already existing code, which resulted in unwanted crashes in some scenarios, even though the data used was completely valid. Thus, in addition to extending the module for our needs, we fixed some unstable parts of the original implementation as well¹. The result of the visualization of the USF attribute shape can be seen in Figure 3.

To test our implementation, we created an MKV container containing a video file as well as dummy USF files with all supported attributes. VLC can open these containers and yields the desired visualization of geometric object annotations. This proves our concept that the incorporation of metadata into USF is possible. Further, using MKV as container format implies possible usage for streaming, since content and metadata are integrated temporally. Opening the container in a standard version of VLC without the adapted subsusf module is still possible, but the shape metadata is then not visualized.

Figure 3. Visualization of the USF shape attributes implemented by our version of subsusf.c for the VLC player version 3.0.0: (a) rectangle-shaped metadata like bounding boxes; (b) arbitrary shapes, e.g., for pixel-wise labeling and areas of interest; and (c) pixel-bound metadata like gaze points on a video.

Listing 1. Section of the USF specification [19]; attributes marked with * are added to the specification and implemented in our altered VLC player.

    ...
    @-Type (0..1)
    +-text (0..N)
    +-image (0..N)
    +-karaoke (0..N)
    +-shape (0..N)*
    +-polygon (0..N)*
      @-posy (1)*
      ...
      @-height (1)*
      @-diameter (1)*
    +-comment (0..N)
    ...
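As a sketch, a USF subtitle entry carrying one of the particularized shape elements could look like the following; the element and attribute names mirror Listing 1 where visible, but the exact serialization shown here is illustrative and is not part of the published USF specification.

    <subtitle start="00:00:10.000" stop="00:00:12.500">
      <shape>
        <polygon posx="300" posy="400" width="40" height="200"/>
      </shape>
    </subtitle>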
Additional tests were carried out with fully annotated video data, i.e., labels were supplied for all pixels, such as "object" and "background", as exported from our interactive annotation and Segmentation tool (iSeg) [25]. We extended our tool with an option to export all labeled objects to USF files, which are encapsulated together with the original video in an MKV, being coded as different subtitle tracks. As a result, the visualization of the labeled objects can always be selected during playback using the same interface metaphors as for changing subtitles. The nature of the general subtitle system in VLC only allows for one active subtitle track at all times, which unfortunately limits the range of possibilities for annotations significantly. Often it is important to visualize inter-object relations, which becomes very tedious, if not even infeasible, in this context.

5.2. Metadata as ASS

Since the USF based prototype needs a modified version of a media player, a broad audience is still unable to watch the annotations without the dedicated media player. Therefore, a subtitle format which already can draw lines is used for implementing the second prototype. For this, we use the ASS subtitle format. In contrast to USF, the ASS subtitle format cannot carry all metadata. This can be put down to its plain text format, which is not capable of representing complex data structures and cannot hold non-visualizable content.

Using the existing USF files with pixel-wise labeled objects, created for the first prototype, ASS files are created using the Turing-complete transformation language XSLT (Extensible Stylesheet Language Transformations) with a simple translation stylesheet file¹. After the conversion, an MKV container is created including the video and one ASS file for each labeled object. The resulting container makes metadata accessible for a broad audience, because the simple visualization by ASS subtitles can be displayed with unmodified video players.
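A minimal XSLT stylesheet for this conversion could, for example, match each USF subtitle entry and emit one ASS Dialogue line per entry; the USF element names, the time reformatting, and the shape-to-drawing translation are simplified or omitted in this sketch.

    <xsl:stylesheet version="1.0"
        xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
      <xsl:output method="text"/>
      <!-- one ASS Dialogue line per USF subtitle entry (sketch) -->
      <xsl:template match="subtitle">
        <xsl:text>Dialogue: 0,</xsl:text>
        <xsl:value-of select="@start"/>
        <xsl:text>,</xsl:text>
        <xsl:value-of select="@stop"/>
        <xsl:text>,Default,,0,0,0,,</xsl:text>
        <!-- here the shape children would be rewritten as {\p1}...{\p0} drawing commands -->
        <xsl:text>&#10;</xsl:text>
      </xsl:template>
    </xsl:stylesheet>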

(c)

(a)

Figure 4. Data is easily available via menu selection for visualization or out of the MKV, e.g., for experiments: (a) the selection menu; (b)–(d) example visualizations of pixel-wise labels in the VLC player.

5.3. Representing semantic knowledge

Based on the latest version of iSeg [25], we implemented a semantic time line for adding semantic information to the labeled objects. Part of this information is the stack order of the labeled objects, aka the z-order. In our implementation of the USF export method, we are able to interpret the z-order such that a subtitle is created automatically which describes the visibility of the objects of interest, see Figure 5. Due to the encapsulation into a multimedia container, this automatically generated subtitle can be displayed in common multimedia players.

Figure 5. Label of the white car with semantic knowledge, visualized as an automatically generated subtitle: (a) white car visible; (b) in the next second, white car partially occluded by the red car; (c) the next time, white car visible again.

The automatic subtitle generation uses a simple grammar: if the labeled area of one object is intercepted by another label and that label's z-order is higher than the current object's, then the object is not visible or partially occluded; otherwise, if the labeled area is within the video resolution, the object is visible.
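The rule can be sketched in a few lines of Python; this is a simplification that works on bounding rectangles and a flat list of competing labels, whereas the actual export operates on the labeled areas and the semantic time line of iSeg.

    def intersects(a, b):
        # axis-aligned rectangles given as (x, y, width, height)
        ax, ay, aw, ah = a
        bx, by, bw, bh = b
        return ax < bx + bw and bx < ax + aw and ay < by + bh and by < ay + ah

    def visibility(rect, z, others, frame_width, frame_height):
        # subtitle text for one labeled object; others is a list of (rect, z) pairs
        if not intersects(rect, (0, 0, frame_width, frame_height)):
            return "not visible"              # outside the video resolution
        for other_rect, other_z in others:
            if other_z > z and intersects(rect, other_rect):
                return "partially occluded"   # intercepted by a label with higher z-order
        return "visible"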

6. User Study

The use of our multimedia containers allows performing multimodal analysis with a standard multimedia player. To investigate the gains of our approach, we conducted a first short user study [26]. We asked one expert and two non-experts in the field of multimodal analysis to exploratively investigate data sets with video and image stimuli as well as gaze and EEG data of several subjects that were made available in our container format. A first response of the expert was "it's a beautiful approach because one gets a first impression of the data without the need to run any scripts or install a special player". Further, the expert considers the exploration of unprocessed data highly useful for several reasons, e.g., the possibility to quickly perform plausibility checks of the material. The non-expert users highly appreciated the ease of interactively demonstrating multimodal data.

7. Results and Discussion

Comparing the USF based and the ASS based prototypes with the MPEG-7 approach, as shown in Table 1, one might assume that the USF approach is as good as the MPEG-7 approach. In our opinion this is not the case because of three facts: firstly, full support of complex metadata is very important for providing video annotations in standardized multimedia containers; in contrast to USF, MPEG-7 achieves this goal completely due to its variety of well-defined vocabulary. Secondly, we are convinced that the missing integration of MPEG-7 encoders in state-of-the-art media players would be solved soon if the semantic web evolves towards the incorporation of media data like videos. Finally, no known media player is designed to visualize several subtitles in parallel, which is quite important to visualize inter-object relations. This fact can be seen as a blocking point for approaches based on hijacking subtitle formats for providing metadata for visualization.

Table 1. Comparison of the USF based prototype, the ASS based prototype, and the MPEG-7 approach with respect to full, partial, or no support of the following criteria: general metadata representation; supporting complex metadata; out of the box media player integration; parallel visualization of different metadata entries; easy accessibility of metadata with scientific tools; open source development tools available; encapsulation in a single file container; streamable container with temporally dependent data. Supported container formats for encapsulation: MKV (USF prototype); MKV, MP4, OGG, ... (ASS prototype); MP4 (MPEG-7 approach).

7.1. Benefit of metadata in multimedia containers for visualization and research

A widespread use of multimedia containers for storing metadata will not only make metadata accessible to a broader audience, but also boost the use of metadata in science, in particular for machine learning tasks in computer vision. Instead of downloading GBs of zipped files to get an impression of what data is available, one can simply watch the data set online. Moreover, using a common metadata standard will terminate the tedious process where researchers have to adjust their software for every different data set. Finally, machine learning techniques like deep learning, which depend on large-scale data sets, can now be applied to the (so far too large) domain of video understanding. Additionally, consumers may profit from features like highlighting objects in movies for visually impaired persons, searching videos for semantic relations between objects, or generating an understandable auditory description out of a video.

7.2. Available data sets

Promoting the use of metadata encapsulated in multimedia containers, we are converting existing video data sets into MKV containers. For every data set, two types of the MKV multimedia container are available. In the first type, the complete metadata are stored in USF, thus one only needs the described modification of the subtitle drawing routine in order to visualize the metadata. In the other type, only parts of the metadata are stored in the ASS format, such that the metadata can be visualized with almost every currently available video player.

Currently, two converted data sets can be downloaded¹. The first data set, by Kurzhals et al. [13], contains video stimuli together with rectangle-shaped, labeled objects and gaze trajectories of multiple subjects as metadata. The second converted data set carries segmentations created manually by multiple persons [8] as metadata.

8. Conclusion

The importance of metadata in general and for video annotation in particular will significantly increase if metadata are provided in multimedia containers, as we suggest in this work. These multimedia containers can be interpreted by common video players in the same way as today's subtitles. In our opinion, the research community should prefer the use of MPEG-7 as the best-suited format for describing any kind of metadata due to its high-level description. Unfortunately, we recognize the lack of MPEG-7 integration in current media players. Therefore, we propose the reuse of subtitle formats as a feasible, transitory solution until MPEG-7 support in video players is available. Nevertheless, to promote the use of multimedia containers, like it has become common for text containers like PDF, we are converting existing video data sets into MKV containers which provide the corresponding metadata as USF or as ASS. In the future, if MPEG-7 is supported by media players, our USF based multimedia containers can be easily converted because both formats are XML-based.

References

[1] R. Arndt, S. Staab, R. Troncy, and L. Hardman. Adding formal semantics to MPEG-7. Arbeitsberichte des Fachbereichs Informatik 04/2007, Universität Koblenz-Landau, 2007.
[2] W. Bailer, H. Fürntratt, P. Schallauer, G. Thallinger, and W. Haas. A C++ library for handling MPEG-7 descriptions. In Proceedings of the 19th ACM International Conference on MM, pages 731–734. ACM, 2011.
[3] S. Bianco, G. Ciocca, P. Napoletano, and R. Schettini. An interactive tool for manual, semi-automatic and automatic video annotation. COMPUT VIS IMAGE UND, 131:88–99, 2015.
[4] J. E. Bresenham. Algorithm for computer control of a digital plotter. IBM SYST J, 4(1):25–30, 1965.
[5] S. Dasiopoulou, E. Giannakidou, G. Litos, P. Malasioti, and Y. Kompatsiaris. A survey of semantic image and video annotation tools. Lect Notes Comput Sc, 6050:196–239, 2011.
[6] A. F. de Araújo, J. Chaves, D. M. Chen, R. Angst, and B. Girod. Stanford I2V: a news video dataset for query-by-image experiments. In Proceedings of the 6th ACM MMSys Conference, pages 237–242, 2015.
[7] D. Doermann and D. Mihalcik. Tools and techniques for video performance evaluation. In Proceedings of the 15th IEEE ICPR, volume 4, pages 167–170, 2000.
[8] F. Galasso, N. S. Nagaraja, T. J. Cardenas, T. Brox, and B. Schiele. A unified video segmentation benchmark: Annotation, metrics and analysis. In Proceedings of the IEEE ICCV. IEEE, 2013.
[9] G. Bertellini and J. Reich. DVD supplements: A commentary on commentaries. CINEMA J, 49(3):103–105, 2010.
[10] C. Gonsalves and M. K. Pichora-Fuller. The effect of hearing loss and hearing aids on the use of information and communication technologies by community-living older adults. CAN J AGING, 27(2):145–157, 2008.
[11] ISO/IEC. Information technology—multimedia content description interface—Part 3: Visual (ISO/IEC 15938-3:2001), 2001.
[12] ISO/IEC. Information technology—coding of audio-visual objects—Part 14: MP4 file format (ISO/IEC 14496-14:2003), 2003.
[13] K. Kurzhals, C. F. Bopp, J. Bässler, F. Ebinger, and D. Weiskopf. Benchmark data for evaluating visualization and analysis techniques for eye tracking for video stimuli. Workshop on BELIV, pages 54–60, 2014.
[14] K. Kurzhals, F. Heimerl, and D. Weiskopf. ISeeCube - visual analysis of gaze data for video. In Proceedings of the 2014 Symposium on ETRA, pages 43–50. ACM, 2014.
[15] M. Martin, J. Charlton, and A. M. Connor. Mainstreaming video annotation software for critical video analysis. Journal of Technologies and Human Usability, 11(3):1–13, 2015.
[16] Matroska Media Container. https://www.matroska.org/, Jan 2017.
[17] F. Mokhtarian and M. Bober. Curvature Scale Space Representation: Theory, Applications, and MPEG-7 Standardization, volume 25 of Computational Imaging and Vision. Springer, 2003.
[18] Multimedia Knowledge and Social Media Analytics Laboratory. Video image annotation tool. http://mklab.iti.gr/project/via, Jan 2017.
[19] C. Paris, L. Vialle, and U. Hammer. TitleVision - USF specs. http://www.titlevision.com/usf.htm, Nov 2016.
[20] N. Petrovic, N. Jojic, and T. S. Huang. Adaptive video fast forward. MULTIMED TOOLS APPL, 26(3):327–344, 2005.
[21] S. Pfeiffer, C. D. Parker, and A. Pang. The Continuous Media Markup Language (CMML), Version 2.1. Internet Engineering Task Force (IETF), https://www.ietf.org/archive/id/draft-pfeiffer-cmml-03.txt, March 2006.
[22] E. M. Provost, Y. Shangguan, and C. Busso. UMEME: University of Michigan emotional McGurk effect data set. IEEE J AFFC, 6(4):395–409, 2015.
[23] C. Rackaway. Video killed the textbook star? Use of multimedia supplements to enhance student learning. Journal of Political Science Education, 8(2):189–200, 2010.
[24] J. Schöning. Interactive 3D reconstruction: New opportunities for getting CAD-ready models. In Proceedings of the ICCSW, volume 49, pages 54–61. Schloss Dagstuhl–LZI, 2015.
[25] J. Schöning, P. Faion, and G. Heidemann. Semi-automatic ground truth annotation in videos: An interactive tool for polygon-based object annotation and segmentation. In Proceedings of the 8th International Conference on K-CAP, pages 17:1–17:4. ACM, New York, 2015.
[26] J. Schöning, A. L. Gert, A. Açık, T. C. Kietzmann, G. Heidemann, and P. König. Exploratory multimodal data analysis with standard multimedia player — multimedia containers: a feasible solution to make multimodal research data accessible to the broad audience. In Proceedings of the International Conference on VISAPP. SCITEPRESS [in press], 2017.
[27] R. Schroeter, J. Hunter, and D. Kosovic. Vannotea - a collaborative video indexing, annotation and discussion system for broadband networks. Workshop on Knowledge Markup & Semantic Annotation, pages 1–8, 2003.
[28] Sub Station Alpha v4.00+ Script Format. moodub.free.fr/video/ass-specs.doc, Jan 2017.
[29] A. Stevenson, editor. Oxford Dictionary of English. Oxford University Press, 3rd edition, 2010.
[30] M. Vernier, M. Farinosi, and G. L. Foresti. A smart visual information tool for situational awareness. In Proceedings of the International Conference on VISAPP, volume 3, pages 238–247. SCITEPRESS, 2016.
[31] W3C. Timed text working group. http://www.w3.org/AudioVideo/TT/, Jan 2016.
[32] W3C. RDF - semantic web standards. https://www.w3.org/RDF/, Jan 2017.
[33] W3C. Synchronized multimedia integration language (SMIL 3.0). https://www.w3.org/TR/SMIL3/, Jan 2017.
[34] W3C. Timed text markup language 1 (TTML1) (second edition). https://www.w3.org/TR/ttml1/, Jan 2017.
[35] S. Wu, S. Zheng, H. Yang, Y. Fan, L. Liang, and H. Su. SAGTA: Semi-automatic ground truth annotation in crowd scenes. In Proceedings of IEEE ICME, pages 1–6. IEEE, 2014.
[36] C. Wylie, G. Romney, D. Evans, and A. Erdahl. Half-tone perspective drawings by computer. In Proceedings of the Fall Joint Computer Conference, AFIPS '67 (Fall), pages 49–58. ACM, 1967.
[37] Xiph.org. CMML mapping into Ogg. https://wiki.xiph.org/index.php/CMML, Feb 2016.
[38] Xiph.org. Ogg. https://xiph.org/ogg/, Jan 2017.