An Efficient, Streamable Text Format for Multimedia Captions and Subtitles Dick C.A
Total Page:16
File Type:pdf, Size:1020Kb
An Efficient, Streamable Text Format for Multimedia Captions and Subtitles Dick C.A. Bulterman, Jack Jansen, Pablo Cesar, Samuel Cruz-Lara To cite this version: Dick C.A. Bulterman, Jack Jansen, Pablo Cesar, Samuel Cruz-Lara. An Efficient, Streamable Text Format for Multimedia Captions and Subtitles. ACM Symposium on Document Engineering - DocEng 2007, Aug 2007, Winnipeg, Canada. inria-00192467 HAL Id: inria-00192467 https://hal.inria.fr/inria-00192467 Submitted on 28 Nov 2007 HAL is a multi-disciplinary open access L’archive ouverte pluridisciplinaire HAL, est archive for the deposit and dissemination of sci- destinée au dépôt et à la diffusion de documents entific research documents, whether they are pub- scientifiques de niveau recherche, publiés ou non, lished or not. The documents may come from émanant des établissements d’enseignement et de teaching and research institutions in France or recherche français ou étrangers, des laboratoires abroad, or from public or private research centers. publics ou privés. An Efficient, Streamable Text Format for Multimedia Captions and Subtitles Dick C.A. Bulterman, A.J. Jansen, Pablo Cesar and Samuel Cruz-Lara CWI: Centrum voor Wiskunde en Informatica INRIA Lorraine Kruislaan 413 Laboratoire Loria BP 239 1098 SJ Amsterdam F-54506 Vandoeuvre Les Nancy cedex The Netherlands France {Dick.Bulterman, Jack.Jansen, P.S.Cesar}@cwi.nl [email protected] ABSTRACT 1. INTRODUCTION In spite of the high profile of media types such as video, XML multimedia languages integrate a collection of audio, audio and images, many multimedia presentations rely graphics, image, text, and video media items into a single extensively on text content. Text can be used for incidental presentation. Since most media items are by nature very labels, or as subtitles or captions that accompany other large — they consist of sampled data that can run into the media objects. In a multimedia document, text content is not gigabytes — it has been common wisdom in multimedia only constrained by the need to support presentation styles document design to develop container formats that refer to and layout, it is also constrained by the temporal context of media objects by reference. A good example of such a lan- the presentation. This involves intra-text and extra-text tim- guage is SMIL [4], which does not contain any actual media ing synchronization with other media objects. This paper data but consists of scheduling and layout directives that describes a new timed-text representation language that is control the presentation of a set of external (to the docu- intended to be embedded in a non-text host language. Our ment) media object URIs. format, which we call aText (for the Ambulant Text Format), balances the need for text styling with the requirement for While the clean separation of content from document struc- an efficient representation that can be easily parsed and ture is a valid abstract concept, it often brings with it an scheduled at runtime. aText, which can also be streamed, is increased authoring burden. Instead of maintaining (or gen- defined as an embeddable text format for use within declar- erating) a single document, several documents need to be ative XML languages. The paper presents a discussion of the defined: one for the integrating structure and one each for requirements for the format, a description of the format and the media objects. It also brings media access problems — a comparison with other existing and emerging text formats. especially in mobile environments — where multiple trans- We also provide examples for aText when embedded within fer sessions need to be initiated to fetch a few bytes of con- the SMIL and MLIF languages and discuss our implementa- tent. While this is defendable for large, typically binary tion experiences of aText with the Ambulant Player. media objects, it is less appropriate for text data. Categories and Subject Descriptors One approach for increasing authoring and processing effi- D.3.2 [Language Classifications]: Specialized application ciency is to integrate an existing text format into the XML languages; I.7.2 [Document and Text Processing]: Docu- container language. While many public and proprietary text ment Preparation—Languages and systems. formats exist, they cannot be simply imported into a multi- General Terms media container format because most are extremely feature rich (meaning that they need a complex text processing Performance, Design, Experimentation, Standardization, engine to manage text styling and layout) and because they Languages. were designed to be timing agnostic (meaning that there is Keywords no existing syntax for describing intra-text timing, nor extra- text synchronization with other objects). Even dedicated lan- Timed text, SMIL, DFXP, RealText, streaming text, Ambu- guages for describing timed text (such as RealText [14], lant. DFXP [1], HTML+TIME [10] or MPEG4-Part 17 [7]) often are based on temporal models that clash with the host timing model used in the media presentation. Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that This paper presents a new language for integrating text in an copies are not made or distributed for profit or commercial (processing) efficient and streamable manner into XML- advantage and that copies bear this notice and the full citation on based multimedia languages. We call this language the the first page. To copy otherwise, or republish, to post on servers or Ambulant Text Format, or aText. Rather than being designed to redistribute to lists, requires prior specific permission and/or a to be a stand-alone text container format, aText was designed fee. to be an embedded timed text description format that could Conference '00, Month 1-2, 2000, City, State. extend existing XML languages such as SMIL and MLIF [5]. Copyright 200x ACM 1-58113-000-0/00/0000…$5.00. DocEng ’07 Submission Page 1 of 10 The principal contribution of aText is that a balance has been presented in parallel with the text. It may also require achieved between a text format that meets the stylistic needs style cues for speaker identification. This class will of a set of common multimedia applications classes and one require light formatting (alignment, line breaks) and that is easy to parse, schedule and render on a wide range of tight temporal coupling with other objects in the presen- multimedia implementation platforms (including low-pow- tation. Historically, captions and subtitles have been ered mobile devices and low-end set-top boxes). The format geared for use by the hearing impaired. has been implemented and integrated into the Ambulant 4. Foreign-language subtitles: this class is similar to the cap- open media player [3]. tions/subtitles described in class #3, except that audio cues and speaker identification are typically not pre- This paper is structured as follows. Section 2 provides a sented; there is an assumption that the foreign-language summary of the principal uses of a temporal text format in reader can see and hear the speakers and cues. multimedia documents. Section 3 surveys existing support for timed text, both within document containers and as 5. Time-constrained moving text: this class requires support stand-alone timed text languages. Section 4 describes the for marquee-style crawling or credits-style rolling text. aText format in detail, including its design goals, temporal The content may or may not be directly related to other characteristics and the styling facilities available in the lan- content in the presentation (such as an independent guage. We also discuss the use of aText as an external con- news crawl under an unrelated video). Such moving text tainer format. Section 5 provides an overview of our current needs to have a rate associated with it that is not related implementation status and the performance issues encoun- to its internal content. It may also need to loop based on tered when implementing aText within the Ambulant player. external (to the text) factors. We conclude with our expectations for further development 6. Inter-object triggered text: this class requires the specifica- and standardization of aText. tion of text fragments that are either triggered by exter- nal media (such as a particular frame in a video) or that 2. APPLICATIONS CLASSES FOR TEXT trigger external objects (such as the sound of a gong IN MULTIMEDIA DOCUMENTS when a certain word appears in the text). The level of granularity may be at the word or phrase level, rather The functionality proposed for aText is based on a set of gen- than at the text object level. eral use cases that we have found to be common when authoring multimedia presentations. 7. Conditional timed text: this class requires support for the conditional specification of text into a rendering area. We preface this discussion with the observation that, unlike The conditions may depend on the state of variables in languages such as XHTML [12], text often plays an inciden- the integrating document, and may be dynamically eval- tal role in multimedia presentations. In text-centric presenta- uated during the presentation of the text. tions, the flow of text in a document will dominate layout 8. Static block text: this class requires support for relatively and styling concerns. In multimedia documents, the tempo- large amounts of formatted text. The text block may con- ral nature of continuous media object (and the presentation sist of informative information that is conditionally pre- sequence of objects such as graphics elements and images) sented during a presentation, such as help text or content define the presentation structure. When text is present, it descriptions. normally plays a subordinate role. As we will see, aText was designed to provide substantial support for the first seven classes and reasonable support The subordinate role of text does not mean that it is unim- for the eight class of text.