Signal Processing: Image Communication 15 (2000) 281-298

MPEG-4 Systems: Overview

Olivier Avaro (a,*), Alexandros Eleftheriadis (b), Carsten Herpel (c), Ganesh Rajan (d), Liam Ward (e)

(a) Deutsche Telekom - Berkom GmbH, Abt. T.22, D-64307 Darmstadt, Germany
(b) Columbia University, Dept. of Electrical Engineering, 500 West 120th Street, Mail Code 4712, New York, NY 10027, USA
(c) Deutsche Thomson-Brandt GmbH, Karl-Wiechert-Allee 74, 30625 Hannover, Germany
(d) General Instrument, 6450 Sequence Dr., San Diego, CA 92121, USA
(e) Teltec Ireland, DCU, Dublin 9, Ireland

Abstract

This paper gives an overview of Part 1 of ISO/IEC 14496 (MPEG-4 Systems). It first presents the objectives of the MPEG-4 activity. In the MPEG-1 and MPEG-2 standards, "Systems" referred only to overall architecture, multiplexing, and synchronization. In MPEG-4, in addition to these issues, the Systems part encompasses scene description, interactivity, content description, and programmability. The description of the MPEG-4 specification follows, starting from the general architecture up to the description of the individual MPEG-4 Systems tools. Finally, a conclusion describes the future extensions of the specification, as well as a comparison between the solutions provided by MPEG-4 Systems and some alternative technologies. (c) 2000 Published by Elsevier Science B.V. All rights reserved.

Keywords: Architecture; Audio-visual; Buffer management; Composition; Content description; Interactivity; MPEG-4 Systems; Multiplex; Programmability; Scene description; Specification; Synchronization; Tools

* Corresponding author. This paper is a revised text from the original publication Advances in Multimedia: Systems, Standards, and Networks, Chapter 12 by Olivier Avaro, Editors: Atul Puri and Tsuhan Chen, originally published by Marcel Dekker Inc.

E-mail addresses: [email protected] (O. Avaro), eleft@ee.columbia.edu (A. Eleftheriadis), [email protected] (C. Herpel), ganesh_rajan@gi.com (G. Rajan), liam.ward@teltec.dcu.ie (L. Ward).

1. Introduction

The concept of "Systems" in MPEG has evolved dramatically since the development of the MPEG-1 and MPEG-2 standards. In the past, "Systems" referred only to overall architecture, multiplexing, and synchronization. In MPEG-4, in addition to these issues, the Systems part encompasses scene description, interactivity, content description, and programmability. The combination of the exciting new ways of creating compelling interactive audio-visual content offered by MPEG-4 Systems, and the efficient representation tools provided by the Visual and Audio parts, promises to be the foundation of a new way of thinking about audio-visual information.

This paper gives an overview of MPEG-4 Systems. It is structured around the objectives, architecture, and the tools of MPEG-4 Systems as follows:


- Objectives: This section describes the motivations and the rationale behind the development of the MPEG-4 Systems specifications. As with all MPEG activities, MPEG-4 Systems is guided by a set of requirements [8], i.e., the set of objectives that must be satisfied by the specifications resulting from the work or activities of the sub-group. This paper gives particular attention to the way the requirements of MPEG-4 Systems are derived from the principal concept behind MPEG-4, viz., the coding of audio-visual objects.
- Architecture: This section describes the overall structure of MPEG-4, known as the "MPEG-4 Systems Architecture". A complete walkthrough of an MPEG-4 session highlights the different phases that a user will, in general, follow in consuming MPEG-4 content.
- Tools: MPEG-4 is a "toolbox" standard, providing a number of tools, sets of which are particularly suited to certain applications. This section provides a functional description of the MPEG-4 Systems tools. These tools are further described in the sections that follow, and are fully specified in [3,9].

Of course, MPEG-4 is not the only initiative that attempts to provide solutions in the area described above. Several companies, industry consortia, and even other standardization bodies have developed technologies that, to some extent, also address objectives similar to those of MPEG-4 Systems. In concluding this look at MPEG-4 Systems, this paper provides an overview of some of these alternative technologies and makes a comparison with the solutions provided by MPEG-4 Systems.

2. Objectives

2.1. Requirements

To understand the rationale behind the activity, a good starting point is one of the most fundamental MPEG-4 documents, viz., the MPEG-4 requirements document [8]. It gives an extensive list of the objectives that needed to be satisfied by the MPEG-4 specifications. The goal of specifying a standard way for the description and coding of audio-visual objects was the primary motivation behind the development of the tools in MPEG-4 Systems.

MPEG-4 Systems requirements may be categorized into two groups:

- Traditional MPEG Systems requirements: The core requirements for the development of the systems specifications in MPEG-1 and MPEG-2 were to enable the transport of coded audio, video and user-defined private data, and to incorporate timing mechanisms to facilitate synchronous decoding and presentation of these data at the client side. These requirements also constitute a part of the fundamental requirements set for MPEG-4 Systems. The evolution of the traditional MPEG Systems activities to match the objectives for MPEG-4 Systems is detailed in Section 2.2.
- Specific MPEG-4 Systems requirements: The requirements in this set, most notably the notions of audio-visual objects and scene description, represent the ideas central to MPEG-4 and are completely new in MPEG Systems. The core competencies needed to fulfil these requirements were not present at the beginning of the activity but were acquired during the standards development process. Section 2.3 describes these specific MPEG-4 Systems requirements.

To round out this discussion of the MPEG-4 objectives, Section 2.4 finally provides an answer to the question "What is MPEG-4 Systems?" by summarizing the objectives of the MPEG-4 Systems activity and describing the charter of the MPEG-4 Systems sub-group during its four years of existence.

2.2. Traditional MPEG Systems requirements

The work of MPEG traditionally addressed the representation of audio-visual information. In the past, this included only natural audio and video material. ("Natural", in this context, is generally understood to mean representations of the real world that are captured using cameras, microphones and so on, as opposed to synthetically generated material.) As we will indicate in subsequent sections, the types of media included within the scope of the MPEG-4 standards have been significantly extended. Regardless of the type of the media, each one has spatial and/or temporal attributes and needs to be identified and accessed by the application consuming the content. This results in a set of requirements for MPEG Systems on streaming, synchronization and stream management, further described below.

- Streaming: The audio-visual information is to be delivered in a streaming manner, suitable for live broadcast of such content. In other words, the audio-visual data are to be transmitted piece by piece, in order to match the delivery of the content to clients with limited network and terminal capabilities. This is in stark contrast to some of the existing scenarios, the World Wide Web, for example, wherein the audio-visual information is completely downloaded on to the client terminal and then played back. It was thought that such scenarios would necessitate too much storage on the client terminals for the applications envisaged by MPEG.
- Synchronization: Typically, the different components of an audio-visual presentation are closely related in time. For most applications, audio samples with associated video frames have to be presented together to the user at precise instants in time. The MPEG representation needs to allow a precise definition of the notion of time so that data received in a streaming manner can be processed and presented at the right instants in time, and be temporally synchronized with each other.
- Stream management: Finally, the complete management of streams of audio-visual information implies the need for certain mechanisms to allow an application to consume the content. These include mechanisms for unambiguous location of the content, identification of the content type, description of the dependencies between content elements, access to the intellectual property information associated to the data, etc.

In the previous MPEG-1 and MPEG-2 standards, these requirements led to the definition of the following tools:

1. Systems target decoder (STD): The Systems Target Decoder is an abstract model of an MPEG decoding terminal that describes the idealized decoder architecture and defines the behavior of its architectural elements. The STD provides for a precise definition of time and recovery of timing information from information encoded within the streams themselves, as well as mechanisms to synchronize streams with each other. It also allows for the management of the decoder's buffers.
2. Packetization of streams: This set of tools defines the organization of the various audio-visual data into streams. First, a definition of the structure of individual streams containing data of a single type is provided (e.g., a video stream), followed by the multiplexing of different individual streams for transport over a network or storage in disk files. At each level, additional information is included to allow for the complete management of the streams (synchronization, intellectual property rights, etc.).

All these requirements are still relevant for MPEG-4. However, the existing tools needed to be extended and adapted for the MPEG-4 context. In some cases, these requirements led to the creation of new tools. More specifically:

1. Systems decoder model (SDM): The nature of MPEG-4 streams can be different from the ones dealt with in the traditional MPEG-1 and MPEG-2 decoder models. For example, MPEG-4 streams may have a bursty data delivery schedule. They may be downloaded and cached before their actual presentation to the user. Moreover, to implement the MPEG-4 principle of "create once, access everywhere", the transport of content does not need to be (indeed, should not be) integrated into the overall architecture. These new aspects, therefore, led to a modification of the MPEG-1 and MPEG-2 models that resulted in the MPEG-4 Systems Decoder Model.
2. Synchronization: The MPEG-4 principle of "create once, access everywhere" is easier to achieve when all of the content-related information forms part of the encoded representation of the multimedia content. This content-related information includes synchronization information also. The observation that the range of bit rates addressed by MPEG-4 is broader than in MPEG-1 and MPEG-2, and can range from a few kbit/s up to several Mbit/s, led to the definition of a flexible tool to encode the synchronization information: the Sync Layer (Synchronization Layer).
3. Packetization of streams: On the delivery side, most of the existing networks provide ways for packetization and transport of streams. Therefore, beyond defining the modes for the transport of MPEG-4 content on the existing infrastructures, MPEG-4 Systems did not see a need to develop any new tools for this purpose. However, due to the possibly unpredictable temporal behavior of MPEG-4 data streams as well as the possibly large number of such streams in MPEG-4 applications, MPEG-4 Systems developed an efficient and simple multiplexing tool to enhance the transport of MPEG-4 data: the FlexMux (Flexible Multiplex) tool.

2.3. MPEG-4 specific Systems requirements

The foundation of MPEG-4 is the coding of audio-visual objects. As per MPEG-4 terminology, an audio-visual object is the representation of a natural or synthetic object that has an audio and/or visual manifestation. Examples of audio-visual objects include a video sequence (perhaps with shape information), an audio track, an animated 3D face, speech synthesized from text, or a background consisting of a still image.

The advantages of coding audio-visual objects can be summarized as follows:

- It allows interaction with the content. At the client side, users can be given the possibility to access, manipulate, or activate specific parts of the content.
- It improves reusability and coding of the content. At the content creation side, authors can easily organize and manipulate individual components and reuse existing material. Moreover, each type of content can be coded using the most effective algorithm. Artifacts due to joint coding of heterogeneous objects (e.g., text overlaid on natural video) disappear.
- It allows content-based scalability. At various stages in the authoring/delivery/consumption process, content can be ignored or adapted to match bandwidth, complexity, or price requirements.

In order to be able to use these audio-visual objects in a presentation, additional information needs to be transmitted to the client terminals. The individual audio-visual objects are only a part of the presentation structure that an author wants delivered to the consumers. Indeed, for the presentation at the client terminals, the coding of audio-visual objects needs to be augmented by the following:

1. The coding of information that describes the spatio-temporal relationships between the various audio-visual objects present in the presentation content. In MPEG-4 terminology, this information is referred to as the scene description information.
2. The coding of information that describes how time-dependent objects in the scene description are linked to the streamed resources actually transporting the time-dependent information.

These considerations imply additional requirements for the overall architectural design, which are summarized below:

- Object description: In addition to the identification of the location of the streams, other information may need to be attached to streamed resources. This may include the identification of streams in alternative formats, a scalable stream hierarchy that may be attached to the object, or a description of the coding format of the object.
- Content authoring: The object description information is conceptually different from the scene description information. It will therefore have different life cycles. For example, at some instant in time, object description information may change (like the intellectual property rights of a stream or the availability of new streams) while the scene description information remains the same. Similarly, the structure of the scene may change (like changing the positions of the objects), while the streaming resources remain the same.
- Content consumption: The consumer of the content may wish to obtain information about the content (e.g., the intellectual property attached to it or the maximum bit-rate needed to access it) before actually requesting it. He then only needs to receive information about the object description, not about the scene description.

Besides the coding of audio-visual objects organized spatio-temporally according to a scene description, one of the key concepts of MPEG-4 is the idea of interactivity, that is, that the content reacts upon the action of a user. This general idea is expressed in three specific requirements:

1. Client-side interaction: The user should be able to manipulate the scene description as well as the properties of the audio-visual objects that the author wants to expose to interaction.
2. Audio-visual object behavior: It should be possible to attach behaviors to audio-visual objects. User actions or other events, like time, trigger these behaviors.
3. Client-server interaction: Finally, in case a return channel from the client to the server is available, the user should be able to send back information to the server, which will act upon it and eventually send updates or modifications of the content.

2.4. What is MPEG-4 Systems?

The main concepts that were described in this section are depicted in Fig. 1.

Fig. 1. MPEG-4 systems principles.

The mission, therefore, of the MPEG-4 Systems activity may be summarized by the following sentence: "Develop a coded, streamable representation for audio-visual objects and their associated time-variant data along with a description of how they are combined". More precisely, in this sentence:

- "Coded representation" should be seen in contrast to "textual representation". Indeed, all the information that MPEG-4 Systems contains (scene description, object description, synchronization information) is binary encoded for bandwidth efficiency.
- "Streamable" should not be seen in contrast to "stored", since storage and transport are dealt with in a similar and consistent way in the MPEG-4 framework. It should rather be seen in contrast to "downloaded". Indeed, MPEG-4 is built on the concept of streams that have a temporal extension, and not on the concept of files of finite size.
- "Elementary audio-visual sources along with a description of how they are combined" should be seen in contrast to "individual audio or visual streams". MPEG-4 Systems does not deal with the encoding of audio or visual information but only with the information related to the combination of streams: combination of audio-visual objects to create an interactive audio-visual scene, synchronization of streams, and multiplexing of streams for storage or transport. The term "description" here reinforces the fact that the combination involves explicit content authoring.

3. Architecture

The overall architecture of an MPEG-4 terminal is depicted in Fig. 2. Starting at the bottom of the figure, we first encounter the particular storage or transmission medium. This refers to the lower layers of the delivery infrastructure (network layer and below, as well as storage). The transport of the MPEG-4 data can occur on a variety of delivery systems. This includes MPEG-2 Transport Streams, UDP (User Datagram Protocol) over IP (Internet Protocol), ATM (asynchronous transfer mode) AAL2 (ATM adaptation layer 2), MPEG-4 (MP4) files or the DAB (digital audio broadcasting) multiplexer.

Fig. 2. MPEG-4 systems architecture.

Most of the currently available transport layer systems provide native means for multiplexing information. There are, however, a few instances where this is not the case, like in GSM (Global System for Mobile communication) data channels. In addition, the existing multiplexing mechanisms may not fit MPEG-4 needs in terms of low delay, or they may incur substantial overhead in handling the expected large number of streams associated with an MPEG-4 session. As a result, the FlexMux tool can optionally be used on top of the existing transport delivery layer.

Regardless of the transport layer used and the use (or not) of the FlexMux option, the delivery layer provides to the MPEG-4 terminal a number of elementary streams. Note that not all of the streams have to be downstream (server to client); in other words, it is possible to define elementary streams for the purpose of conveying data back from the terminal to the transmitter or server.

In order to isolate the design of MPEG-4 from the specifics of the various delivery systems, the concept of the DMIF (delivery multimedia integration framework) application interface (DAI) [7] was defined. This interface defines the process of exchanging information between the terminal and the delivery layer in a conceptual way, using a number of primitives. It should be pointed out that this interface is non-normative; MPEG-4 terminal implementations do not need to expose such an interface.

The DAI defines procedures for initializing an MPEG-4 session and obtaining access to the various elementary streams that are contained in it. These streams can contain a number of different kinds of information: audio-visual object data, scene description information, control information in the form of object descriptors, as well as meta-information that describes the content or associates intellectual property rights to it.
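Since the DAI is defined only conceptually, any concrete binding is implementation-specific. The following Python sketch conveys the flavor of such an interface; the class names, method names, and URL scheme are hypothetical and are not taken from ISO/IEC 14496-6.

```python
# Hypothetical sketch of a DAI-style binding: the terminal asks for a
# service and for channels carrying elementary streams, and never sees
# transport details (sockets, PIDs, file offsets) of the delivery layer.

class DeliveryLayer:
    """Stand-in for any delivery technology (MPEG-2 TS, IP, MP4 file, ...)."""

    def open_service(self, url):
        print(f"attaching to service {url}")
        return "session-1"

    def open_channel(self, session, es_id, qos):
        print(f"opening channel for ES {es_id} in {session} (QoS: {qos})")
        return f"channel-{es_id}"


class MPEG4Terminal:
    def __init__(self, delivery):
        self.delivery = delivery
        self.session = None
        self.channels = {}

    def service_attach(self, url):
        # One conceptual primitive to reach content, whatever lies below.
        self.session = self.delivery.open_service(url)

    def channel_add(self, es_id, qos="default"):
        # Request delivery of one elementary stream by identifier.
        self.channels[es_id] = self.delivery.open_channel(self.session, es_id, qos)


terminal = MPEG4Terminal(DeliveryLayer())
terminal.service_attach("x-mpeg4://example/presentation")
terminal.channel_add(es_id=1)  # e.g., a scene description stream
terminal.channel_add(es_id=2)  # e.g., an object descriptor stream
```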

Regardless of the type of data conveyed in each elementary stream, it is important that they use a common mechanism for conveying timing and framing information. The Sync Layer (SL) is defined for this purpose. It is a flexible and configurable packetization facility that allows the inclusion of timing, fragmentation, and continuity information on associated data packets. Such information is attached to data units that comprise complete presentation units, e.g., an entire video object plane (VOP) or an audio frame. These are called access units. An important feature of the SL is that it does not contain frame demarcation information; in other words, the SL header contains no packet length indication. This is because it is assumed that the delivery layer that processes SL packets will already make such information available. Its exclusion from the SL thus eliminates duplication.

The SL is the sole mechanism for implementing timing and synchronization in MPEG-4. The fact that it is highly configurable allows the use of several different models. At one end of the spectrum, traditional clock recovery methods using clock references and time stamps can be used. It is also possible to use a rate-based approach (rather than using explicit timestamps, the known rate of the access units implicitly determines their time stamps). At the other end of the spectrum, it is possible to operate without any clock information: data are processed as soon as they arrive. This would be suitable, for example, for a slide-show presentation. The primary mode of operation, and the one supported by the currently defined conformance points of the specification, involves the full complement of clock recovery and time stamps. By defining a system decoder model, this makes it possible both to synchronize the receiver's clock to the sender's and to manage the buffer resources at the receiver.
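As a rough illustration of the primary mode of operation, the sketch below maps SL time stamps onto a receiver's local clock once a clock reference has been observed. The fixed 90 kHz resolution, the field values, and the absence of drift correction and wrap-around handling are all simplifications invented for the example, not prescribed by the standard.

```python
# Simplified illustration of clock recovery from SL timing information.
CLOCK_HZ = 90_000  # assumed time base resolution for this example

class ObjectClock:
    def __init__(self):
        self.offset = None  # sender time base minus local arrival time (s)

    def on_clock_reference(self, ocr_ticks, local_arrival_s):
        # A clock reference ties the sender's time base to the
        # receiver's local clock at the arrival instant.
        self.offset = ocr_ticks / CLOCK_HZ - local_arrival_s

    def to_local_time(self, stamp_ticks):
        # Map a decoding or composition time stamp onto local time.
        assert self.offset is not None, "no clock reference received yet"
        return stamp_ticks / CLOCK_HZ - self.offset

clock = ObjectClock()
clock.on_clock_reference(ocr_ticks=900_000, local_arrival_s=2.0)  # OTB at 10 s
print(clock.to_local_time(909_000))  # stamp at OTB 10.1 s -> ~2.1 s local
```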
From the SL information we can recover a time base as well as the elementary streams. The streams are sent to their respective decoders that process the data and produce composition units (e.g., a decoded video object plane). In order for the receiver to know what type of information is contained in each stream, control information in the form of object descriptors is used. These descriptors associate sets of elementary streams to one audio or visual object, define a scene description stream, or even point to an object descriptor stream. These descriptors, in other words, are the way with which a terminal can identify the content being delivered to it. Unless a stream is described in at least one object descriptor, it is impossible for the terminal to make use of it.

At least one of the streams must be the scene description information associated with the content. The scene description information defines the spatial and temporal position of the various objects, their dynamic behavior, as well as any interactivity features made available to the user. As mentioned above, the audio-visual object data is actually carried in its own elementary streams. The scene description contains pointers to object descriptors when it refers to a particular audio-visual object. We should stress that it is possible that an object (in particular synthetic objects like text and simple graphics) may be fully described by the scene description. As a result, it may not be possible to uniquely associate an audio-visual object with just one syntactic component of MPEG-4 Systems. As detailed in Section 4.2, the scene description is tree-structured and is heavily based on the VRML (Virtual Reality Modeling Language [4]) structure.

A key feature of the scene description is that, since it is carried in its own elementary stream(s), it can contain full timing information. This implies that the scene can be dynamically updated over time, a feature that provides considerable power for content creators. In fact, the scene description tools provided by MPEG-4 also provide a special lightweight mechanism to modify parts of the scene description in order to effect animation. This is accomplished by coding, in a separate stream, only the parameters that need to be updated.

The system's compositor uses the scene description information, together with decoded audio-visual object data, in order to render the final scene that is presented to the user. It is important to note that the MPEG-4 Systems architecture does not define how information is to be rendered. In other words, the Systems part of the MPEG-4 standard does not detail mechanisms through which the values of the pixels to be displayed or audio samples to be played back can be uniquely determined. This is an unfortunate side-effect of providing synthetic content representation tools. Indeed, in the general case, it is not possible to define rendering without venturing into issues of terminal implementation. Although this makes compliance testing much more difficult (requiring subjective evaluation), it allows the inclusion of a very rich set of synthetic content representation tools. In some cases however, like in the Audio part of the MPEG-4 standard [6], composition can be and is fully defined.

The scene description tools provide mechanisms to capture user or system events. In particular, they allow the association of events to user operations on desired objects, which can, in turn, modify the behavior of the stream. Event processing is the core mechanism with which application functionality and differentiation can be provided. In order to provide flexibility in this respect, MPEG-4 allows the use of ECMAScript (also known as JavaScript) scripts within the scene description. Use of scripting tools is essential in order to access state information and implement sophisticated interactive applications.

It is important to point out that, in addition to the new functionalities that MPEG-4 makes available to content consumers, it provides tremendous advantages to content creators as well. The use of an object-based structure, where composition is performed at the receiver, considerably simplifies the content creation process. Starting from a set of coded audio-visual objects, it is very easy to define a scene description that combines these objects in a meaningful presentation. A similar approach is essentially used in HTML (Hyper Text Markup Language) and Web browsers, thus allowing even non-expert users to easily create their own content. The fact that the content's structure survives the process of coding and distribution also allows for its reuse. For example, content filtering and/or searching applications can be easily implemented using ancillary information carried in object descriptors (or their own elementary streams, as described in Section 4.1). Also, users themselves can easily extract individual objects, assuming that the intellectual property information allows them to do so.

In the following section the different components of this architecture are described in more detail.

4. Tools

4.1. Stream management: the object description framework

The object description framework provides the glue between the scene description and the streaming resources - the elementary streams - of an MPEG-4 presentation, as indicated in Fig. 1. Unique identifiers are used in the scene description to point to the object descriptor, the core element of the object description framework. The object descriptor is a container structure that encapsulates all of the setup and association information for a set of elementary streams. A set of sub-descriptors, contained in the object descriptor, describes the individual elementary streams, including the configuration information for the stream decoder as well as the flexible sync layer syntax for this stream. Each object descriptor, in turn, groups a set of streams that are seen as a single entity from the perspective of the scene description.

Object descriptors are transported in dedicated elementary streams, called object descriptor streams, which make it possible to associate timing information to a set of object descriptors. With the appropriate wrapper structures, called OD commands (object descriptor commands), around each object descriptor, it is possible to update and remove each object descriptor in a dynamic and timely manner. The existence or absence of descriptors determines the availability (or the lack thereof) of the associated elementary streams to the MPEG-4 terminal.
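The sketch below condenses this linking role into a few lines of Python. The structures are deliberately skeletal: the normative descriptors in [3] also carry decoder configuration, sync layer configuration, IPMP and OCI information, and the field names used here are invented for the illustration.

```python
# Minimal sketch of the object description framework's linkage role.
from dataclasses import dataclass, field

@dataclass
class ESDescriptor:
    es_id: int
    stream_type: str          # e.g. "visual", "audio", "scene", "od"

@dataclass
class ObjectDescriptor:
    od_id: int
    es_descriptors: list = field(default_factory=list)

class Terminal:
    def __init__(self):
        self.ods = {}

    def od_update(self, od):
        # An OD command makes the streams behind this descriptor available.
        self.ods[od.od_id] = od

    def od_remove(self, od_id):
        # Removing the descriptor removes the streams' availability too.
        self.ods.pop(od_id, None)

    def resolve(self, od_id):
        # A scene-description node points at an object descriptor by id;
        # only streams described by some OD can be consumed.
        if od_id not in self.ods:
            raise KeyError(f"OD {od_id} unknown: stream(s) unusable")
        return [d.es_id for d in self.ods[od_id].es_descriptors]

t = Terminal()
t.od_update(ObjectDescriptor(10, [ESDescriptor(3, "visual"),
                                  ESDescriptor(4, "audio")]))
print(t.resolve(10))   # -> [3, 4]
```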
The initial object descriptor, a derivative of the object descriptor, is a key element necessary for accessing MPEG-4 content. It conveys content complexity information in addition to the regular elements of an object descriptor. As depicted in Fig. 3, the initial object descriptor usually contains at least two elementary stream descriptors. One of the descriptors must point to a scene description stream while the others may point to an object descriptor stream. This object descriptor stream transports the object descriptors for the elementary streams that are referred to by some of the components in the scene description. Initial object descriptors may themselves be transported in object descriptor streams, since they allow content to be hierarchically nested, but they may as well be conveyed by other means, serving as starting pointers to MPEG-4 content.

In addition to providing essential information about the relation between the scene description and the elementary streams, the object description framework provides mechanisms to describe hierarchical relations between streams, reflecting scalable encoding of the content, and means to indicate multiple alternate representations of content. Furthermore, textual descriptors about content items, called object content information (OCI), and descriptors for intellectual property rights management and protection (IPMP) have been defined. The latter allow conditional access or other content control mechanisms to be associated to a particular content item. These mechanisms may be different on a stream-by-stream basis, and possibly even a multiplicity of such mechanisms could co-exist.

A single MPEG-4 presentation, or program, may consist of a large number of elementary streams with a multiplicity of data types. The object description framework has been separated from the scene description to account for this fact and for the related consequence that service providers may possibly wish to relocate streams in a simple way.

Fig. 3. The initial object descriptor and the linking of elementary streams to the scene description.

Such relocation may require changes in the object descriptors; however, it will not affect the scene description. Therefore, object descriptors improve content manageability.

The reader is referred to the MPEG-4 Systems specification [3] for the syntax and semantics of the various components of this framework and their usage within the context of MPEG-4.

4.2. Presentation engine: BIFS

MPEG-4 specifies a binary format for scenes (BIFS) that is used to describe scene composition information: the spatial and temporal locations of objects in scenes, along with their attributes and behaviors. Elements of the scene and the relationships between them form the scene graph that must be coded for transmission. The fundamental scene graph elements are the "nodes" that describe audio-visual primitives and their attributes, along with the structure of the scene graph itself. BIFS draws heavily on this and other concepts employed by VRML [4]. (VRML was proposed and developed by the VRML Consortium, and its VRML 2.0 specification became an ISO/IEC International Standard in 1998.)

Designed as a file format for describing 3D models and scenes ("worlds" in VRML terminology), VRML lacks some important features that are required for the types of multimedia applications targeted by MPEG-4. In particular, the support for natural video and audio is basic (e.g., streaming of audio or video objects is not supported) and the timing model is loosely specified, implying that synchronization in a scene consisting of multiple media types cannot be guaranteed. Furthermore, VRML worlds are often very large (e.g., there is neither compression nor streaming of animations); animations lasting around 30 s typically consume several megabytes of disk space. The strength of VRML is its scene graph description capabilities, and this strength has been the basis upon which MPEG-4 scene description has been built.

BIFS includes support for almost all of the nodes in the VRML specifications. In fact, BIFS is essentially a superset of VRML, although there are some exceptions. BIFS does not yet support the PROTO and EXTERNPROTO nodes, nor does it support the use of the Java language in the Script nodes (BIFS only supports ECMAScript). BIFS does, however, expand significantly on VRML's capabilities in ways that allow a much broader range of applications to be supported. Note that a fundamental difference between the two is that BIFS is a binary format, whereas VRML is a textual format. So, although it is possible to design scenes that are compatible with both BIFS and VRML, transcoding between the representation formats is required.

Here, we highlight the functionalities that BIFS adds to the basic VRML set. Readers unfamiliar with VRML might find it useful to first acquire some background knowledge from [4].

1. Compressed binary format: BIFS describes an efficient binary representation of the scene graph information. The coding may be either lossless or lossy. The coding efficiency derives from a number of classical compression techniques, plus some novel ones. The knowledge of context is exploited heavily in BIFS. This technique is based on the fact that, given some scene graph data that have been previously received, it is possible to anticipate the type and format of the data to be received subsequently. This technique is illustrated in Fig. 4. Quantization of numerical values is supported, as well as the compression of 2D meshes as specified in MPEG-4 Visual [5].

Fig. 4. Simple example of the use of context in efficient coding.

2. Streaming: BIFS is designed so that the scene may be transmitted as an initial scene followed by timestamped modifications to the scene. For dynamic scenes that change over time, this leads to a huge improvement in memory usage and reduced latency when compared to equivalent VRML scenes. The BIFS-Command protocol allows replacement of the entire scene, addition/deletion/replacement of nodes and behavioral elements in the scene graph, as well as modification of scene properties.
3. Animation: A second streaming protocol, BIFS-Anim, is designed to provide a low-overhead mechanism for the continuous animation of changes to numerical values of the components in the scene. These streamed animations provide an alternative to the interpolator nodes supported in both BIFS and VRML. The main difference is that interpolator nodes typically contain very large amounts of data that must be loaded in the scene and stored in memory. By streaming these animations, the amount of data that must be held in memory is reduced significantly. Secondly, by removing these data from the scene graph that must be initially loaded, the amount of data that must be processed (and therefore the time taken) in order to begin presenting the scene is also reduced. (A toy illustration of such streamed scene updates follows this list.)
4. 2D primitives: BIFS includes native support for 2D scenes. This facilitates content creators who wish to produce low-complexity scenes, including the traditional television and multimedia industries. Many applications cannot bear the cost of requiring decoders to have full 3D rendering and navigation. This is particularly true where hardware decoders must be of low cost, as for instance television set-top boxes. However, rather than simply partitioning the multimedia world into 2D and 3D, MPEG-4 BIFS allows the combination of 2D and 3D elements in a single scene.
5. Enhanced audio: VRML provides simple audio support. This support has been enhanced in MPEG-4. BIFS provides the notion of an "audio scene graph" where the audio sources, including streaming ones, can be mixed. It also provides node interfaces to the various MPEG-4 audio objects [6]. Audio content can even be processed and transformed with special procedural code to produce various sound effects [6].
6. Facial animation: BIFS provides support at the scene level for the MPEG-4 facial animation decoder [5]. A special set of BIFS nodes exposes the properties of the animated face at the scene level, making it a full participant of the scene that can be integrated with all BIFS functionalities, similarly to any other audio or visual object.
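To make the streamed-update idea of items 2 and 3 concrete, here is a toy scene graph to which timestamped commands are applied. The textual command forms and node fields are invented for this illustration; actual BIFS updates are binary-coded against the node tables of the specification.

```python
# Toy illustration of streamed scene updates in the spirit of BIFS-Command.
scene = {
    "root": {"children": ["video1", "caption"]},
    "video1": {"od_id": 10, "position": (0, 0)},
    "caption": {"text": "Hello", "position": (10, 200)},
}

def apply_command(scene, cmd):
    op = cmd["op"]
    if op == "replace_field":          # change one attribute of one node
        scene[cmd["node"]][cmd["field"]] = cmd["value"]
    elif op == "delete_node":          # remove a node from the graph
        scene.pop(cmd["node"], None)
        for node in scene.values():
            if "children" in node and cmd["node"] in node["children"]:
                node["children"].remove(cmd["node"])
    elif op == "insert_node":          # attach a new node under a parent
        scene[cmd["node"]] = cmd["value"]
        scene[cmd["parent"]]["children"].append(cmd["node"])

# A timestamped update stream: small commands instead of a new full scene.
updates = [
    (2.0, {"op": "replace_field", "node": "caption",
           "field": "text", "value": "Scene updated"}),
    (5.0, {"op": "delete_node", "node": "caption"}),
]
for ts, cmd in updates:
    apply_command(scene, cmd)          # a player would execute this at time ts
print(scene)
```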
decoding bu!er, instantaneously decoded, and 5. Enhanced audio: VRML provides simple audio moved to the composition memory. The composi- support. This support has been enhanced in tion time stamp allows the separation of decoding MPEG-4. BIFS provides the notion of an and composition times, to be used for example `audio scene grapha where the audio sources, in the case of bi-directional prediction in including streaming ones, can be mixed. It also visual streams. This idealized model allows the provides nodes interface to the various MPEG-4 encoding side to monitor the space available in the audio objects [6]. Audio content can even be decoding side'sbu!ers, thus helping it, for example, O. Avaro et al. / Signal Processing: Image Communication 15 (2000) 281}298 293 to schedule ahead-of-time delivery of data. Of part of the elementary stream descriptor that sum- course, a resource management for the memory for marizes the properties of each (SL-packetized) decoded data would be desirable as well. However, elementary stream. it was acknowledged that this issue is strongly linked with memory use for the composition pro- 4.4. The transport of MPEG-4 content cess itself, which is considered outside the scope of MPEG-4. Therefore, management of composition Delivery of MPEG-4 content is a task that is bu!ers is not part of the model. supposed to be dealt with outside the MPEG-4 Time stamps are readings of an object time base systems speci"cation. All access to delivery layer (OTB) that is valid for an individual stream or a set functionality is conceptually done only through of elementary streams. At least all the streams be- a semantic interface, called the DMIF application longing to one audio-visual object have to follow interface (DAI). It is speci"ed in Part 6 of MPEG-4, the same OTB. Since the OTB, in general, is not Delivery multimedia integration framework a universal clock, object clock reference time (DMIF) [7]. In practical terms this means that the stamps can be conveyed periodically with an ele- speci"cation of control and data mapping to under- mentary stream to make it known to the receiver. lying transport protocols or storage architectures is This is, in fact, done on the wrapper layer around to be done jointly with the respective organization elementary streams, called the sync layer (SL). The that manages the speci"cation of the particular sync layer provides the syntactic elements to en- delivery layer. For example, for the case of code the partitioning of elementary streams into MPEG-4 transport over IP, development work is access units and to attach both decoding and com- done jointly with the Internet Engineering Task position time stamps as well as object clock refer- Force (IETF). ences to a stream. The resulting stream is called an An analysis of existing delivery layer properties SL-packetized stream. This syntax provides a uni- showed that there might be a need for an additional form shell around elementary streams, providing layer of multiplexing, in order to map the occa- the information that needs to be shared between sionally bursty and low bit-rate MPEG-4 streams the compression layer and the delivery layer in to a delivery layer protocol that exhibits "xed order to guarantee timely delivery of each access packet size or too much packet overhead. Further- unit of an elementary stream. 
Time stamps are readings of an object time base (OTB) that is valid for an individual stream or a set of elementary streams. At least all the streams belonging to one audio-visual object have to follow the same OTB. Since the OTB, in general, is not a universal clock, object clock reference time stamps can be conveyed periodically within an elementary stream to make it known to the receiver. This is, in fact, done on the wrapper layer around elementary streams, called the sync layer (SL). The sync layer provides the syntactic elements to encode the partitioning of elementary streams into access units and to attach both decoding and composition time stamps, as well as object clock references, to a stream. The resulting stream is called an SL-packetized stream. This syntax provides a uniform shell around elementary streams, providing the information that needs to be shared between the compression layer and the delivery layer in order to guarantee timely delivery of each access unit of an elementary stream.

Different from its predecessor in MPEG-2, the packetized elementary stream (PES), the sync layer does not constitute a self-contained stream, but rather a packet-based interface to the delivery layer. This takes into account the properties of typical delivery layers like IP, H.223 or MPEG-2 itself, into which SL-packetized streams are supposed to be mapped. There is no need to encode either unique start codes or the length of an SL packet within the packet, since synchronization and length encoding are already provided by the mentioned delivery layer protocols.

Furthermore, MPEG-4 has to operate both at very low and at rather high bit-rates. This has led to a flexible design of the sync layer elements, making it possible to encode time stamps of configurable size and resolution, as required by a specific content or application scenario. The flexibility is made possible by means of a descriptor that is conveyed as part of the elementary stream descriptor that summarizes the properties of each (SL-packetized) elementary stream.

4.4. The transport of MPEG-4 content

Delivery of MPEG-4 content is a task that is supposed to be dealt with outside the MPEG-4 Systems specification. All access to delivery layer functionality is conceptually done only through a semantic interface, called the DMIF application interface (DAI). It is specified in Part 6 of MPEG-4, Delivery multimedia integration framework (DMIF) [7]. In practical terms this means that the specification of control and data mapping to underlying transport protocols or storage architectures is to be done jointly with the respective organization that manages the specification of the particular delivery layer. For example, for the case of MPEG-4 transport over IP, development work is done jointly with the Internet Engineering Task Force (IETF).

An analysis of existing delivery layer properties showed that there might be a need for an additional layer of multiplexing, in order to map the occasionally bursty and low bit-rate MPEG-4 streams to a delivery layer protocol that exhibits fixed packet sizes or too much packet overhead. Furthermore, the provision of a large number of delivery channels may impose a substantial burden in terms of management and cost. Therefore, a very simple multiplex packet syntax has been defined, called the FlexMux. It allows multiplexing a number of SL-packetized streams into a self-contained FlexMux stream with rather low overhead. It is proposed as an option to designers of the delivery layer mappings but is not used for the definition of MPEG-4 conformance points.
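In the spirit of that simplicity, the sketch below frames each multiplexed packet with a one-byte channel index and a one-byte length, which conveys how a FlexMux-style interleave keeps overhead low. It is an illustration of the idea, not a bit-exact implementation of the normative FlexMux syntax (which, for instance, also defines a MuxCode mode).

```python
# Sketch of FlexMux-style simple-mode framing: each packet is
# [channel index][payload length][payload], interleaving several
# SL-packetized streams into one self-contained byte stream.

def flexmux_pack(packets):
    """packets: list of (channel, payload_bytes); 2-byte header each."""
    out = bytearray()
    for channel, payload in packets:
        assert 0 <= channel < 256 and len(payload) < 256
        out += bytes([channel, len(payload)]) + payload
    return bytes(out)

def flexmux_unpack(stream):
    i, demuxed = 0, {}
    while i < len(stream):
        channel, length = stream[i], stream[i + 1]
        demuxed.setdefault(channel, []).append(stream[i + 2:i + 2 + length])
        i += 2 + length
    return demuxed

mux = flexmux_pack([(1, b"SL-audio-AU"), (2, b"SL-video-AU"), (1, b"more")])
print(flexmux_unpack(mux))
# -> {1: [b'SL-audio-AU', b'more'], 2: [b'SL-video-AU']}
```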

5. Conclusion

5.1. Extensions of the specification

The technologies considered for standardization in MPEG-4 were not all identically mature. Therefore, the MPEG-4 project in general, and MPEG-4 Systems in particular, was organized in two phases: Version 1 and Version 2. The tools described above already contain the majority of the functionality of MPEG-4 Systems and allow the development of compelling multimedia applications. These are provided by the current MPEG-4 standard, the so-called MPEG-4 "Version 1". Extension of the standard in the form of amendments, the so-called "Version 2", completes the Version 1 toolbox with new tools and new functionalities. Version 2 tools are not intended to replace any of the Version 1 tools. On the contrary, Version 2 is a completely backward compatible extension of Version 1.

Version 2 will provide for the following additional BIFS functionalities:

- Support of all the VRML constructs that were not supported in Version 1. The major ones are PROTO and EXTERNPROTO, allowing a more efficient compression of the scene description, more flexible content authoring, as well as a robust mechanism to define BIFS extensions;
- Specification of BIFS interfaces for new Version 2 audio and visual media, such as synthetic body description and animation or 3D model coding;
- Specification of advanced audio BIFS for more natural sound source and sound environment modeling. These new audio functionalities include modeling of air absorption, of the audio response of the visual scene, and more natural modeling of distance-dependent attenuation and sound source directivity.

In Version 1 of MPEG-4, there is no normative support for the structure of upstream data or its semantics. Version 2 standardizes both the mechanisms with which the transmission of such data is triggered at the terminal, as well as its format as it is transmitted back to the sender. The inclusion of a normative specification for backchannel information in Version 2 closes the loop between the terminal and its server, vastly expanding the types of applications that can be implemented on an MPEG-4 infrastructure.

The BIFS scene description framework, described in the previous section, offers a parametric methodology for scene structure representation, in addition to efficiently coding it for transmission over the wire. Version 2 of the MPEG-4 standard also offers a programmatic environment, in addition to this parametric capability. Version 2 defines a set of Java language APIs (MPEG-J) through which access to an underlying MPEG-4 engine can be provided to Java applets (called MPEG-lets). This tool forms the basis for very sophisticated applications, opening up completely new ways for audio-visual content creators to augment the use of their content.

Finally, MPEG-4 Systems completes the toolbox for transport and storage of MPEG-4 content by providing:

- A file format [9] for the exchange of MPEG-4 content. This file format provides meta-information about the content, in order to allow indexing, fast search and random access. Furthermore, it is possible to easily re-purpose the file content for delivery over various delivery layers.
- The specification of MPEG-4 delivery on IP, jointly developed with the IETF. This Internet-Draft will specify the encapsulation of MPEG-4 data in RTP, the Real-time Transport Protocol, which is the focal point for delivery of streaming multimedia content on the Internet.
- The specification of MPEG-4 delivery on MPEG-2 Systems [2]. This amendment to MPEG-2 Systems will define a set of descriptors that provide for the signaling of the presence in the MPEG-2 multiplex of MPEG-4 content, both in the form of individual MPEG-4 elementary streams as well as complete MPEG-4 presentations. MPEG-2 content and MPEG-4 content may not only co-exist within an MPEG-2 multiplex, but it is also possible to reference the MPEG-2 content in the MPEG-4 scene description.

5.2. MPEG-4 Systems and competing technologies

This section aims to complete the description of MPEG-4 Systems by trying to make a fair comparison between the tools provided by MPEG-4 Systems and the ones that can be found - or will be found in the near future - in applications in the market place.

Technical issues aside, the mere fact of being proprietary is a significant disadvantage in the content industry when open alternatives exist. With the separation of the content production, delivery, and consumption stages in the multimedia pipeline, the MPEG-4 standard will enable different companies to separately develop authoring tools, servers, or players, thus opening up the market to independent product offerings. This competition is then very likely to allow a fast proliferation of content and tools that will inter-operate.

5.2.1. Transport

As stated in Section 4.4, MPEG-4 Systems does not specify or standardize a transport protocol. In fact, it is designed to be transport-agnostic. However, in order to be able to utilize the existing transport infrastructures (e.g. MPEG-2 transport or IP networks), MPEG-4 defines an abstraction of the delivery layer with specific mappings of MPEG-4 content on existing transport mechanisms [7]. However, there are two exceptions in terms of the abstraction of the delivery layer: the MPEG-4 file format and the FlexMux tool.

There are several available file formats for storing, streaming, and authoring multimedia content. Among the ones most used presently are Microsoft's ASF (advanced streaming format), Apple's QuickTime, as well as RealNetworks' file format (RMFF). The ASF and QuickTime formats were proposed to MPEG-4 in response to a call for proposals on file format technology. QuickTime has been selected as the starting point for the collaborative development of the MPEG-4 file format (referred to as 'MP4') [9]. The RMFF format has several similarities with QuickTime (in terms of object tagging using four-character strings and the way indexing information is provided). MP4 inherits from QuickTime key technical features such as the ability to stream content from multiple sources (local or through a network), interactivity and known rendering quality. In addition to these key features, the MP4 format adds support for MPEG-4 specific features. In particular, MP4 inherits from MPEG-4 all the new and compelling audio, video and systems multimedia content.

The MPEG-4 proposal for light-weight multiplexing (FlexMux) addresses some MPEG-4 specific needs, as described in Section 4.4. The same kinds of requirements have recently been raised within the IETF with regard to the delivery of multimedia content over IP networks. With the increasing number of streams resident in a single multimedia program, and with possibly low network bandwidths and unpredictable temporal network behavior, the overhead incurred by the use of RTP streams and their management in the receiving terminals is becoming considerable. The IETF is therefore currently investigating a generic multiplexing solution that has requirements similar to those of the MPEG-4 FlexMux. We expect that, with the close collaboration between the IETF and MPEG-4, a consistent solution will be developed for the transport of MPEG-4 content over IP networks.

5.2.2. Streaming framework

With the specifications of the object description framework and the sync layer, MPEG-4 Systems provides a consistent and efficient framework for the description of content and the means for its synchronized presentation at client terminals. At this juncture, this framework, with its flexibility and dynamics, does not have any equivalent in the standards arena. A parallel could be drawn with the combination of RTP and SDP (session description protocol); however, such a solution is Internet-specific and cannot be applied directly on other systems like digital cable or DVDs (digital video discs).

5.2.3. Scene description representation

Within the context of Web applications, a number of competitors to MPEG-4 BIFS base their syntax architecture on the XML (Extensible Markup Language) [1] syntax, while MPEG-4 bases its syntax architecture on VRML, using a binary form described in SDL (MPEG-4 Syntactic Description Language). (SDL originates from the need to formalize bit-stream representation. It is an intuitive, natural and well-defined extension of the typing systems of C++ and Java, explicitly designed to follow, as much as possible, a declarative approach to bit-stream syntax specification: SDL-based specifications define how the data is laid out on the bit-stream. SDL makes the task of understanding the specification and transforming it into running code straightforward, as is done in other domains, e.g., Sun ONC RPC or OMG CORBA.) The main difference between the two is that XML is a general-purpose text-based description language for tagged data, while VRML with SDL provides a binary format for a scene description language.

The competition is first at the level of the semantics, i.e., the functionality provided by the representation. At the time the MPEG-4 standard was published, several specifications were providing semantics with an XML-compliant syntax to solve multimedia representation in specific domains. For example, the W3C (World Wide Web Consortium) HTML-NG (HTML next generation) activity was redesigning HTML to be XML compliant [12]. The W3C SMIL (Synchronized Multimedia Integration Language) working group has produced a specification for 2D-multimedia scene description [11]. The ATSC/DASE (Advanced Television Systems Committee/digital-TV application software environments) BHTML (broadcast HTML) specifications were working at providing broadcast extensions to HTML-NG. The Web3D Consortium X3D (extensible 3D) requirements were investigating the use of XML for 3D scene description, whereas W3C SVG (scalable vector graphics) was standardizing 2D vector graphics, also in an XML-compliant way [10].

MPEG-4 is built on a true 3D scene description, including the event model, as provided by VRML. None of the XML-based specifications currently available reaches the sophistication of MPEG-4 in terms of composition capabilities and interactivity features. Furthermore, incorporation of the temporal component in terms of streamed media, including scene descriptions, is a non-trivial matter. MPEG-4 has successfully addressed this issue, as well as the overall timing and synchronization issues, whereas alternative approaches are lacking in this respect.

A second level of competition can be seen in the coded representation of the scene structure (text-based versus binary representation). An advantage of a text-based approach is ease of authoring; documents can be easily generated using a text editor. Such textual representations for MPEG-4 content, based on extensions of VRML or on XML, were under consideration at the time this paper was published. However, for delivery and streaming over a finite-bandwidth medium, a compressed representation of multimedia information is without a doubt the best approach from the bandwidth efficiency point of view. This is the problem primarily addressed and solved by MPEG-4. None of the XML-based approaches has addressed this issue satisfactorily up to now.

Indeed, in addition to XML semantics that leverage the functionality developed by MPEG-4, a complete competitive solution would also need to define a binary mapping as well as media streaming and synchronization mechanisms. In June 1999, there was no evidence how this could happen on a short- or medium-term schedule. Indeed, all the potential alternative frameworks are at the stage of research and specification development, while MPEG-4 is at the stage of standard verification and deployment. There is no evidence that any of these frameworks will be able to leverage efficiently all of the advantages of MPEG-4 specifics, including compression, streaming and synchronization. Finally, the fragmented nature of the development of XML-based specifications by different bodies and industries certainly hinders integrated solutions. This may therefore cause a distorted vision of the integrated, targeted system as well as duplication of functionality.

5.3. The key features of MPEG-4 Systems

In summary, the key features of MPEG-4 Systems can be stated as follows:

1. MPEG-4 Systems first provides a consistent and complete architecture for the coded representation of the desired combination of streamed elementary audio-visual information. This framework covers a broad range of applications, functionality and bit-rates. However, not all of the features need to be implemented in a single device. Through profile and level definitions, MPEG-4 Systems establishes a framework that allows consistent progression from simple applications (e.g., an audio broadcast application with graphics) to more complex ones (e.g., a virtual reality home theater).
Within the context of Web applications, a number of competitors to MPEG-4 BIFS base their syntax architecture on the XML (Extensible Markup Language) [1] syntax, while MPEG-4 bases its syntax architecture on VRML, using a binary, SDL-described form (MPEG-4 Syntactic Description Language). The main difference between the two is that XML is a general-purpose text-based description language for tagged data, while VRML with SDL provides a binary format for a scene description language.

SDL originates from the need to formalize bit-stream representation. It is an intuitive, natural and well-defined extension of the typing systems of C++ and Java, explicitly designed to follow (as much as possible) a declarative approach to bit-stream syntax specification: SDL-based specifications describe how the data is laid out on the bit-stream. SDL makes the task of understanding the specification and transforming it into running code straightforward, as is done in other domains (e.g., Sun ONC RPC or OMG CORBA).
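SDL is a specification language rather than a programming language, so the fragment below is only an illustration of the idea: a hypothetical SDL-style class (shown in the comment, not taken from the standard) declares its bit-stream layout, and a translator can derive parsing code from it mechanically, sketched here by hand in C++.

// Hypothetical SDL-style declaration (not from the standard):
//
//   class SimpleHeader {
//     unsigned int(8)  version;   // 8 bits on the wire
//     unsigned int(16) length;    // 16 bits on the wire
//   }
//
// A translator can derive a parser from it mechanically, roughly:
#include <cstddef>
#include <cstdint>
#include <stdexcept>

class BitReader {                         // MSB-first bit-stream cursor
public:
    BitReader(const std::uint8_t* d, std::size_t nbytes)
        : data(d), bits(nbytes * 8) {}
    std::uint32_t read(unsigned n) {      // read the next n bits
        std::uint32_t v = 0;
        while (n--) {
            if (pos >= bits) throw std::runtime_error("bit-stream underflow");
            v = (v << 1) | ((data[pos >> 3] >> (7 - (pos & 7))) & 1u);
            ++pos;
        }
        return v;
    }
private:
    const std::uint8_t* data;
    std::size_t bits;
    std::size_t pos = 0;
};

struct SimpleHeader {                     // mirrors the declaration above
    std::uint8_t  version;
    std::uint16_t length;
    void parse(BitReader& br) {           // field order equals wire order
        version = std::uint8_t(br.read(8));
        length  = std::uint16_t(br.read(16));
    }
};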
The competition is first at the level of the semantics, i.e., the functionality provided by the representation. At the time the MPEG-4 standard was published, several specifications were providing semantics with an XML-compliant syntax to solve multimedia representation in specific domains. For example, the W3C (World Wide Web Consortium) HTML-NG (HTML next generation) activity was re-designing HTML to be XML compliant [12]. The W3C SMIL (Synchronized Multimedia Integration Language) working group has produced a specification for 2D multimedia scene description [11]. The ATSC/DASE (Advanced Television Systems Committee/digital-TV application environments) BHTML (broadcast HTML) specifications were working at providing broadcast extensions to HTML-NG. The Web3D Consortium X3D (Extensible 3D) requirements were investigating the use of XML for 3D scene description, whereas W3C SVG (Scalable Vector Graphics) was also standardizing vector graphics in an XML-compliant way [10].

MPEG-4 is built on a true 3D scene description, including the event model, as provided by VRML. None of the XML-based specifications currently available reaches the sophistication of MPEG-4 in terms of composition capabilities and interactivity features. Furthermore, incorporation of the temporal component in terms of streamed media, including scene descriptions, is a non-trivial matter. MPEG-4 has successfully addressed this issue, as well as the overall timing and synchronization issues, whereas alternative approaches are lacking in this respect.

A second level of competition can be seen in the coded representation of the scene structure (text-based versus binary representation). An advantage of a text-based approach is ease of authoring; documents can be easily generated using a text editor. Such textual representations for MPEG-4 content, based on extensions of VRML or on XML, were under consideration at the time this paper was published. However, for delivery and streaming over a finite-bandwidth medium, a compressed representation of multimedia information is without a doubt the best approach from the bandwidth-efficiency point of view. This is the problem primarily addressed and solved by MPEG-4. None of the XML-based approaches has addressed this issue satisfactorily up to now.
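The bandwidth argument can be made concrete with a toy comparison (ours, not the actual BIFS encoding): the same three-component translation field carried as authorable text and as three quantized 16-bit values, where the [-8, 8) quantization range stands in for parameters that a real encoder would signal in the scene.

// Toy comparison (not the actual BIFS encoding): the same 3-D
// translation as authorable text and as three quantized 16-bit
// values, using an assumed [-8, 8) range for illustration.
#include <cstdint>
#include <cstdio>
#include <cstring>

int main() {
    const char* text = "translation 1.5 0.0 -2.25";  // easy to write and edit
    const float v[3] = {1.5f, 0.0f, -2.25f};

    std::uint16_t q[3];                              // binary, quantized form
    for (int i = 0; i < 3; ++i)
        q[i] = std::uint16_t((v[i] + 8.0f) / 16.0f * 65535.0f + 0.5f);

    std::printf("text form  : %u bytes\n", (unsigned)std::strlen(text)); // 25
    std::printf("binary form: %u bytes\n", (unsigned)sizeof q);          // 6
    return 0;
}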
Indeed, in addition to XML semantics that leverage the functionality developed by MPEG-4, a complete competitive solution would also need to define a binary mapping as well as media streaming and synchronization mechanisms. In June 1999, there was no evidence of how this could happen on a short- or medium-term schedule. Indeed, all the potential alternative frameworks are at the stage of research and specification development, while MPEG-4 is at the stage of standard verification and deployment. There is no evidence that any of these frameworks will be able to leverage efficiently all of the advantages of MPEG-4 specifics, including compression, streaming and synchronization.

Finally, the fragmented nature of the development of XML-based specifications by different bodies and industries certainly hinders integrated solutions. It may therefore cause a distorted vision of the integrated, targeted system as well as duplication of functionality.

5.3. The key features of MPEG-4 systems

In summary, the key features of MPEG-4 Systems can be stated as follows:

1. MPEG-4 Systems first provides a consistent and complete architecture for the coded representation of the desired combination of streamed elementary audio-visual information. This framework covers a broad range of applications, functionality and bit-rates. However, not all of the features need to be implemented in a single device. Through profile and level definitions, MPEG-4 Systems establishes a framework that allows consistent progression from simple applications (e.g., an audio broadcast application with graphics) to more complex ones (e.g., a virtual-reality home theater).

2. MPEG-4 Systems augments this architecture with a set of tools for the representation of the multimedia content, viz., a framework for object description (the OD framework), a binary language for the representation of multimedia interactive 2D and 3D scene description (BIFS), a framework for monitoring and synchronizing elementary data streams (the SDM and the SyncLayer), and programmable extensions to access and monitor MPEG-4 content (MPEG-J). (The descriptor framing used by the OD framework is sketched just after this summary.)

3. Finally, MPEG-4 Systems completes this set of tools by defining an efficient mapping of the MPEG-4 content onto existing delivery infrastructures. This mapping is supported by the following additional tools: an efficient and simple multiplexing tool to optimize the carriage of MPEG-4 data (FlexMux), extensions allowing the carriage of MPEG-4 content on MPEG-2 and IP systems, and a flexible file format for authoring, streaming and exchanging MPEG-4 data.

The MPEG-4 Systems tools can be used separately in some applications, but MPEG-4 Systems also guarantees that they will work together in an integrated way, as well as with the other tools specified within the MPEG-4 standards.
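As an illustration of the binary conventions underlying the OD framework, the sketch below decodes the header shared by MPEG-4 descriptors: a one-byte tag followed by an expandable size field that carries seven size bits per byte and uses the high bit as a continuation flag. The function name and buffer handling are ours.

// Sketch of the common header of MPEG-4 descriptors: a one-byte
// tag followed by an expandable size field carrying 7 size bits
// per byte, with the high bit set when another size byte follows.
#include <cstddef>
#include <cstdint>

// Illustrative helper (names are ours). Returns false on overrun
// or on a malformed size field; advances pos past tag and size.
bool readDescriptorHeader(const std::uint8_t* buf, std::size_t len,
                          std::size_t& pos,
                          std::uint8_t& tag, std::uint32_t& size) {
    if (pos >= len) return false;
    tag  = buf[pos++];                 // identifies the descriptor type
    size = 0;
    for (int i = 0; i < 4; ++i) {      // at most four size bytes (28 bits)
        if (pos >= len) return false;
        const std::uint8_t b = buf[pos++];
        size = (size << 7) | (b & 0x7Fu);
        if (!(b & 0x80u)) return true; // high bit clear: size complete
    }
    return false;                      // too many continuation bytes
}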
Acknowledgements

The MPEG-4 Systems specification reflects the results of teamwork within a worldwide project in which many people invested enormous time and energy. The authors would like to thank them all, and hope that the experience and results achieved at least matched the level of their expectations. Among the numerous contributors to the MPEG-4 Systems project, three individuals stand out for their critical contributions to the development of MPEG-4: Cliff Reader is the person who originally articulated the vision of MPEG-4 in an eloquent way and led it through its early formative stages; Phil Chou was the first to propose a consistent and complete architecture that could support such a vision; and Zvi Lifshitz, by leading the IM1 software implementation project, was the one who made the vision manifest itself in the form of a real-time player.

Finally, the authors would like to thank Ananda Allys (France Telecom - CNET) for providing some of the pictures used in this paper; the European project MoMuSys for its support of Olivier Avaro and Liam Ward; the National Science Foundation and the industrial sponsors of Columbia's ADVENT Project for their support of Alexandros Eleftheriadis; the German project MINT for its support of Carsten Herpel; and the General Instrument Corporation for its support of Ganesh Rajan.

References

[1] T. Bray, J. Paoli, C.M. Sperberg-McQueen (Eds.), Extensible Markup Language (XML) 1.0, W3C Recommendation, 10 February 1998. http://www.w3.org/TR/REC-xml.
[2] ISO/IEC 13818-1, Generic coding of moving pictures and associated audio information - Part 1: Systems.
[3] ISO/IEC 14496-1, Coding of audio-visual objects: systems, final draft international standard, ISO/IEC JTC1/SC29/WG11 N2501, October 1998.
[4] ISO/IEC 14772-1, The Virtual Reality Modeling Language, 1997. http://www.vrml.org/Specifications/VRML97.
[5] ISO/IEC 14496-2, Coding of audio-visual objects: visual, final draft international standard, ISO/IEC JTC1/SC29/WG11 N2502, October 1998.
[6] ISO/IEC 14496-3, Coding of audio-visual objects: audio, final draft international standard, ISO/IEC JTC1/SC29/WG11 N2503, October 1998.
[7] ISO/IEC 14496-6, Coding of audio-visual objects: delivery multimedia integration framework, final draft international standard, ISO/IEC JTC1/SC29/WG11 N2506, October 1998.
[8] ISO/IEC JTC1/SC29/WG11 N2723, MPEG-4 requirements document, March 1999.
[9] ISO/IEC JTC1/SC29/WG11 N2739, Text of ISO/IEC 14496-1/PDAM1, March 1999.
[10] Scalable Vector Graphics (SVG) specification, W3C working draft, 11 February 1999. http://www.w3.org/TR/WD-SVG/.
[11] Synchronized Multimedia Integration Language (SMIL) 1.0 specification, W3C Recommendation, 15 June 1998. http://www.w3.org/TR/REC-smil/.
[12] XHTML 1.0: The extensible hypertext markup language. A reformulation of HTML 4.0 in XML 1.0, W3C working draft, 24 February 1999. http://www.w3.org/TR/WD-html-in-xml/.
Olivier Avaro was born in Provence, France on 27 July 1968. He received his Dipl. Ing. degree in telecommunication engineering in 1992 from the Higher School of Telecommunication of Brittany. He joined the France Telecom-CNET department on image communication in Paris in 1992, where he worked on image representation techniques and analysis. His research areas cover video compression algorithms, error resiliency of video compression algorithms, invariant representation of images and shapes, and model-based representation for interpersonal communications. He was involved early in the ISO/MPEG-4 project, in particular through the European platform MAVT and its successor MoMuSys. He later managed the France Telecom CNET project Oxygen on MPEG-4 multimedia teleconferencing systems. In 1999, he joined the Deutsche Telekom - Berkom department on Multimedia Communication in Darmstadt within the context of joint Deutsche Telekom - France Telecom projects. He is currently chairman of the MPEG-4 Systems sub-group.

Carsten Herpel was born in Cologne, Germany in 1962. He received his diploma in communication engineering from RWTH Aachen, Germany, in 1988. He then joined Thomson Multimedia's research facility in Hannover, Germany, and developed video coding algorithms, with a focus on scalable coding tools. His research interests have since moved to multimedia systems, with a current focus on MPEG-4. Mr. Herpel serves as a co-editor of the MPEG-4 Systems specification.

Alexandros Eleftheriadis was born in Athens, Greece, in 1967, and received his Ph.D. in Electrical Engineering from Columbia University in 1995. Since 1995 he has been an Assistant Professor at Columbia, where he leads a research team working on media representation, with emphasis on multimedia software, video signal processing and compression, video communication systems, and the mathematical fundamentals of compression. Dr. Eleftheriadis is a member of the ANSI NCITS L3.1 Committee and of the ISO/IEC JTC1/SC29/WG11 (MPEG) group, where he serves as the Editor of the MPEG-4 Systems specification. His awards include an NSF CAREER Award.

Ganesh Rajan received his Ph.D. degree in Electrical Engineering from the University of California, Santa Barbara, in 1987. He was with the Tektronix Laboratories, Tektronix, Inc., until October 1995; during his tenure there, he led the Digital Signal Processing group in projects encompassing the areas of digital signal and image processing and digital communications. From October 1995 until May 1999, he was with the Advanced Technology group, General Instrument Corp., where he was involved in the areas of multimedia coding and presentation architectures. Ganesh has been involved with MPEG-related activities since early 1996, especially in the Synthetic and Natural Hybrid Coding (SNHC) and Systems subgroups. He was the chairman of the SNHC group during the 36th ISO/IEC JTC1/SC29/WG11 (MPEG) meeting in Chicago, in October 1996. He is the co-editor of the ISO/IEC 14496-1 (MPEG-4 Systems) specifications. Ganesh is a member of the IEEE and the ACM.

Liam Ward graduated from Dublin City University (Ireland) in 1991 with a degree in Electronic Engineering. He is currently a Senior Research Officer with R&D consultants Teltec Ireland. He is an editor of the recently completed MPEG-4 Systems specification, in particular the part dealing with scene description (BIFS).
He is also leader of the European ACTS-DICEMAN consortium, a project which is developing techniques for the trading of audio-visual material on digital networks using MPEG-7.