Standards
Editor: Peiya Liu, Siemens Corporate Research

MPEG-4: A Multimedia Standard for the Third Millennium, Part 1

Stefano Battista, bSoft
Franco Casalino, Ernst & Young Consultants
Claudio Lande, CSELT

MPEG-4 (formally ISO/IEC international standard 14496) defines a multimedia system for interoperable communication of complex scenes containing audio, video, synthetic audio, and graphics material. In part 1 of this two-part article we provide a comprehensive overview of the technical elements of the Moving Pictures Expert Group's MPEG-4 multimedia system specification. In part 2 (in the next issue) we describe an application scenario based on digital satellite television broadcasting, discuss the standard's envisaged evolution, and compare it to other activities in forums addressing multimedia specifications.

Evolving standard
MPEG-4 started in July 1993, reached Committee Draft level in November 1997, and achieved International Standard level in April 1999. MPEG-4 combines some typical features of other MPEG standards, but aims to provide a set of technologies to satisfy the needs of authors, service providers, and end users.

For authors, MPEG-4 will enable the production of content with greater reusability and flexibility than possible today with individual technologies such as digital television, animated graphics, World Wide Web (WWW) pages, and their extensions. Also, it permits better management and protection of content owner rights.

For network service providers, MPEG-4 will offer transparent information, interpreted and translated into the appropriate native signaling messages of each network with the help of relevant standards bodies. However, the foregoing excludes quality-of-service (QoS) considerations, for which MPEG-4 will provide a generic QoS descriptor for different MPEG-4 media. The exact translations from the QoS parameters set for each media to the network QoS exceed the scope of MPEG-4 and remain for network providers to define.

For end users, MPEG-4 will enable many functionalities potentially accessible on a single compact terminal and higher levels of interaction with content, within the limits set by the author.

MPEG-4 achieves these goals by providing standardized ways to support

❚ Coding—representing units of audio, visual, or audiovisual content, called media objects. These media objects are natural or synthetic in origin, meaning they could be recorded with a camera or microphone, or generated with a computer.

❚ Composition—describing the composition of these objects to create compound media objects that form audiovisual scenes.

❚ Multiplex—multiplexing and synchronizing the data associated with media objects for transport over network channels providing a QoS appropriate for the nature of the specific media objects.

❚ Interaction—interacting with the audiovisual scene at the receiver's end or, via a back channel, at the transmitter's end.

The structure of the MPEG-4 standard consists of six parts: Systems,1 Visual,2 Audio,3 Conformance Testing, Reference Software, and Delivery Multimedia Integration Framework (DMIF).4

Systems
The Systems subgroup1 defined the framework for integrating the natural and synthetic components of complex multimedia scenes. The Systems level shall integrate the elementary decoders for the media components specified by other MPEG-4 subgroups—Audio, Video, Synthetic and Natural Hybrid Coding (SNHC), and Intellectual Property Management and Protection (IPMP)—providing the specification for the parts of the system related to composition and multiplex.

Composition information consists of the representation of the hierarchical structure of the scene. A graph describes the relationship among elementary media objects comprising the scene. The MPEG-4 Systems subgroup adopted an approach for composition of elementary media objects inspired by the existing Virtual Reality Modeling Language (VRML)5 standard. VRML provides the specification of a language to describe the composition of complex scenes containing 3D material, plus audio and video.

The resulting specification addresses issues specific to an MPEG-4 system:

❚ description of objects representing natural audio and video with streams attached, and

❚ description of objects representing synthetic audio and video (2D and 3D material) with streams attached (such as streaming text or streaming parameters for animation of a facial model in the terminal).

The techniques adopted for multiplexing the elementary streams borrow from the experience of MPEG-1 and MPEG-2 Systems for timing and synchronization of continuous media. A specific three-layer multiplex strategy defined for MPEG-4 fits the requirements of a wide range of networks and of very different application scenarios.

Figure 1 shows a high-level diagram of an MPEG-4 system's components. It serves as a reference for the terminology used in the system's design and specification: the demultiplexer, the elementary media decoders (natural audio, natural video, synthetic audio, and synthetic video), the specialized decoders for the composition information, and the specialized decoders for the protection information.

Figure 1. MPEG-4 high-level system architecture (receiver model).

The following subsections present the composition and multiplex aspects of the MPEG-4 Systems in more detail.

Composition
The MPEG-4 standard deals with frames of audio and video (vectors of samples and matrices of pixels). Further, it deals with the objects that make up the audiovisual scene. Thus, a given scene has a number of video objects, of possibly differing shapes, plus a number of audio objects, possibly associated to video objects, to be combined before presentation to the user. Composition encompasses the task of combining all of the separate entities that make up the scene.

The model adopted by MPEG-4 to describe the composition of a complex multimedia scene relies on the concepts VRML uses. Basically, the Systems group decided to reuse as much of VRML as possible, extending and modifying it only when strictly necessary. The main areas featuring new concepts according to specific application requirements are

❚ dealing with 2D-only content, for a simplified scenario where 3D graphics is not required;

❚ interfacing with streams (video, audio, streaming text, streaming parameters for synthetic objects); and

❚ adding synchronization capabilities.

The outcome is the specification of a VRML-based composition format with extensions tuned to match MPEG-4 requirements. The scene description represents complex scenes populated by synthetic and natural audiovisual objects with their associated spatiotemporal transformations. The author can generate this description in textual format, possibly through an authoring tool. The scene's description then conforms to the VRML syntax with extensions. For efficiency, the standard defines a way to encode the scene description in a binary representation—Binary Format for Scene Description (BIFS).
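As a toy illustration of what such a scene description captures (grouping nodes with media objects at the leaves, each bound to the elementary stream that feeds it), consider the following sketch. The class and field names are hypothetical; they are not actual BIFS or VRML node types.

```python
# A minimal, hypothetical model of a composed scene: grouping nodes form the
# hierarchy, and each leaf references the elementary stream that feeds it.
from dataclasses import dataclass, field
from typing import List

@dataclass
class MediaObject:          # leaf: a natural or synthetic audio/visual object
    name: str
    stream_id: int          # elementary stream carrying its coded data

@dataclass
class Group:                # intermediate node: groups children for composition
    name: str
    children: List = field(default_factory=list)

scene = Group("root", children=[
    Group("presenter", children=[
        MediaObject("speaker_video", stream_id=3),    # arbitrarily shaped video object
        MediaObject("speaker_audio", stream_id=4),
    ]),
    MediaObject("background_still", stream_id=7),
    MediaObject("face_model_animation", stream_id=9), # synthetic object fed by a parameter stream
])
```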


Multimedia scenes are conceived as hierarchical structures represented as a graph. Each leaf of the graph represents a media object (audio; video; synthetic audio like a Musical Instrument Digital Interface, or MIDI, file; synthetic video like a face model). The graph structure isn't necessarily static, as the relationships can evolve over time as nodes or subgraphs are added or deleted. All the parameters describing these relationships are part of the scene description sent to the decoder.

The snapshot of the scene is sent or retrieved on a dedicated stream. It is then parsed, and the whole scene structure is reconstructed (in an internal representation) at the receiver terminal. All the nodes and graph leaves that require streaming support to retrieve media contents or ancillary data (video stream, audio stream, facial animation parameters) are logically connected to the decoding pipelines.

An update of the scene structure may be sent at any time. These updates can access any field of any updatable node in the scene. An updatable node is one that received a unique node identifier in the scene structure. The user can also interact locally with the scenes, which may change the scene structure or the value of any field of any updatable node.

Composition information (information about the initial scene composition and the scene updates during the sequence evolution) is, like other streaming data, delivered in one elementary stream. The composition stream is treated differently from others because it provides the information required by the terminal to set up the scene structure and map all other elementary streams to the respective media objects.

Spatial relationships. The media objects may have 2D or 3D dimensionality. A typical video object (a moving picture with associated arbitrary shape) is 2D, while a wire-frame model of a person's face is 3D. Audio also may be spatialized in 3D, specifying the position and directional characteristics of the source.

Each elementary media object is represented by a leaf in the scene graph and has its own local coordinate system. The mechanism to combine the scene graph's nodes into a single global coordinate system uses spatial transformations associated to the intermediate nodes, which group their children together. Following the graph branches from bottom to top, the spatial transformations cascade to reach the unique coordinate system associated to the root of the graph.
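The cascading of local coordinate systems can be pictured with a small sketch. The node class below is hypothetical and 2D-only; real BIFS grouping nodes carry richer 2D and 3D transformations.

```python
# Sketch of composing local-to-global transforms along a scene graph path
# (hypothetical classes; only translation and uniform scale are modeled).
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    name: str
    translation: tuple = (0.0, 0.0)   # local transform carried by this node
    scale: float = 1.0
    parent: Optional["Node"] = None

def to_global(node: Node, point=(0.0, 0.0)):
    """Apply transforms from the leaf up to the root of the graph."""
    x, y = point
    n = node
    while n is not None:
        x, y = x * n.scale + n.translation[0], y * n.scale + n.translation[1]
        n = n.parent
    return x, y

root = Node("scene")
group = Node("group", translation=(100.0, 50.0), scale=2.0, parent=root)
video = Node("video_object", translation=(10.0, 0.0), parent=group)
print(to_global(video))   # origin of the video object in scene coordinates: (120.0, 50.0)
```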

Temporal relationships. The composition stream (BIFS) has its own associated time base. Even if the time bases for the composition and for the elementary data streams differ, they must be consistent except for translation and scaling of the time axis. Time stamps attached to the elementary media streams specify at what time the access unit for a media object should be ready at the decoder input (DTS, decoding time stamp), and at what time the composition unit should be ready at the compositor input (CTS, composition time stamp). Time stamps associated to the composition stream specify at what time the access units for composition must be ready at the input of the composition information decoder.

In addition to the time stamps mechanism (derived from MPEG-1 and MPEG-2), fields within the scene description also carry a time value. They indicate either a duration in time or an instant in time. To make the latter consistent with the time stamps scheme, MPEG-4 modified the semantics of these absolute time fields in seconds to represent a relative time with respect to the time stamp of the BIFS elementary stream. For example, the start of a video clip represents the relative offset between the composition time stamp of the scene and the start of the video display.

Multiplex
Because MPEG-4 is intended for use on a wide variety of networks with widely varying performance characteristics, it includes a three-layer multiplex standardized by the Delivery Multimedia Integration Framework (DMIF)4 working group. The three layers separate the functionality of

❚ adding MPEG-4-specific information for timing and synchronization of the coded media (synchronization layer);

❚ multiplexing streams with very different characteristics, such as average bit rate and size of access units (flexible multiplex layer); and

❚ adapting the multiplexed stream to the particular network characteristics in order to facilitate the interface to different network environments (transport multiplex layer).

The goal is to exploit the characteristics of each network, while adding functionality that these environments lack and preserving a homogeneous interface toward the MPEG-4 system.

Elementary streams are packetized, adding headers with timing information (clock references) and synchronization data (time stamps). They make up the synchronization layer (SL) of the multiplex.
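To illustrate the relative-time convention described under Temporal relationships above, the following sketch (hypothetical variable names, not the normative BIFS syntax) rewrites an absolute start time as an offset from the composition time stamp of the scene description and recovers it at the receiver.

```python
# Illustrative only: BIFS time fields are expressed relative to the
# composition time stamp (CTS) of the scene description access unit.

def encode_start_field(absolute_start_s: float, scene_cts_s: float) -> float:
    """Store the clip start as an offset from the scene's CTS."""
    return absolute_start_s - scene_cts_s

def resolve_start_field(relative_start_s: float, scene_cts_s: float) -> float:
    """At the receiver, recover the instant on the common time base."""
    return scene_cts_s + relative_start_s

scene_cts = 12.0        # seconds: CTS of the BIFS access unit
clip_starts_at = 14.5   # author intent: start the video 2.5 s after the scene appears
stored_field = encode_start_field(clip_starts_at, scene_cts)   # 2.5 is what the stream carries
assert resolve_start_field(stored_field, scene_cts) == clip_starts_at
```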

Figure 2. General structure of the MPEG-4 multiplex. Different cases have multiple SL streams multiplexed in one FML stream and multiple FML streams multiplexed in one TML stream.

Streams with similar QoS requirements are then multiplexed on a content multiplex layer, termed the flexible multiplex layer (FML). It efficiently interleaves data from a variable number of variable bit-rate streams.

A service multiplex layer, known as the transport multiplex layer (TML), can add a variety of levels of QoS and provide framing of its content and error detection. Since this layer is specific to the characteristics of the transport network, the specification of how data from SL or FML streams is packetized into TML streams refers to the definition of the network protocols—MPEG-4 doesn't specify it.

Figure 2 shows these three layers and the relationship among them.

Synchronization layer (timing and synchronization). Elementary streams consist of access units, which correspond to portions of the stream with a specific decoding time and composition time. As an example, an elementary stream for a natural video object consists of the coded video object instances at the refresh rate specific to the video sequence (for example, the video of a person captured at 25 pictures per second). Or, an elementary stream for a face model consists of the coded animation parameters instances at the refresh rate specific to the face model animation (for example, a model animated to refresh the facial animation parameters 30 times per second). Access units like a video object instance or a facial animation parameters instance are the self-contained semantic units in the respective streams, which have to be decoded and used for composition synchronously with a common system time base.

Elementary streams are first framed in SL packets, not necessarily matching the size of the access units in the streams. The header attached by this first layer contains fields specifying

❚ Sequence number—a continuous number for the packets, to perform packet loss checks

❚ Instantaneous bit rate—the bit rate at which the elementary stream is coded

❚ OCR (object clock reference)—a time stamp used to reconstruct the time base for the single object

❚ DTS (decoding time stamp)—a time stamp to identify the correct time to decode an access unit

❚ CTS (composition time stamp)—a time stamp to identify the correct time to render a decoded access unit

The information contained in the SL headers maintains the correct time base for the elementary decoders and for the receiver terminal, plus the correct synchronization in the presentation of the elementary media objects in the scene. The clock references mechanism supports timing of the system, and the mechanism of time stamps supports synchronization of the different media.
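A rough model of the per-packet information listed above might look as follows; the field names and types are illustrative and do not reproduce the normative SL packet header bit syntax.

```python
# Illustrative SL packet model carrying the header fields described above.
from dataclasses import dataclass
from typing import Optional

@dataclass
class SLPacket:
    sequence_number: int                   # continuity check for packet-loss detection
    instantaneous_bit_rate: Optional[int]  # bit rate of the elementary stream (bps)
    ocr: Optional[int]                     # object clock reference: rebuilds the object time base
    dts: Optional[int]                     # decoding time stamp of the access unit
    cts: Optional[int]                     # composition time stamp of the access unit
    payload: bytes = b""                   # a fragment of (or a whole) access unit

def ready_to_decode(packet: SLPacket, object_clock: int) -> bool:
    """An access unit should be at the decoder input by its DTS."""
    return packet.dts is not None and object_clock >= packet.dts
```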


Flexible multiplex layer (content). Given the wide range of possible bit rates associated to the elementary streams—ranging, for example, from 1 Kbps for facial animation parameters to 1 Mbps for good-quality video objects—an intermediate multiplex layer provides more flexibility. The SL serves as a tool to associate timing and synchronization data to the coded material. The transport multiplex layer adapts the multiplexed stream to the specific transport or storage media. The intermediate (optional) flexible multiplex layer provides a way to group together several low-bit-rate streams for which the overhead associated to a further level of packetization is not necessary or introduces too much redundancy. With conventional scenes, like the usual audio plus video of a motion picture, this optional multiplex layer can be skipped; the single audio stream and the single video stream can be mapped each to a single transport multiplex stream.

Transport multiplex layer (service). The multiplex layer closest to the transport level depends on the specific transmission or storage system on which the coded information is delivered. The Systems part of MPEG-4 doesn't specify the way SL packets (when no FML is used) or FML packets are mapped on TML packets. The specification simply references several different transport packetization schemes. The "content" packets (the coded media data wrapped by SL headers and FML headers) may be transported directly using an Asynchronous Transfer Mode (ATM) Adaptation Layer 2 (AAL2) scheme for applications over ATM, MPEG-2 transport stream packetization over networks providing that support, or transmission control protocol/Internet protocol (TCP/IP) for applications over the Internet.

Intellectual property protection
The MPEG-4 standard specifies a multimedia bit-stream syntax, a set of tools, and interfaces for designers and builders of a wide variety of multimedia applications. Each of these applications has a set of requirements regarding protection of the information it manages. These applications can produce conflicting content management and protection requirements. By implication, the Intellectual Property Management and Protection (IPMP) framework design needs to consider the MPEG-4 standard's complexity and the diversity of its applications.

The IPMP framework consists of a normative interface that permits an MPEG-4 terminal to host one or more IPMP systems. An IPMP system is a non-normative component that provides intellectual property management and protection functions for the terminal.

Video
The most important goal of both the MPEG-1 and MPEG-2 standards was to make the storage and transmission of digital audiovisual material more efficient through compression techniques. To achieve this, both deal with frame-based video and audio. Interaction with the content is limited to the video frame level, with its associated audio.

MPEG-4 Video functionalities
MPEG-4 Video2 supports different functionalities that divide into three nonorthogonal classes based on the requirements they support:

❚ Content-based interactivity. This class includes four functionalities focused on requirements for applications involving some form of interactivity between the user and the data: content-based multimedia data access tools, content-based manipulation and bit-stream editing, hybrid natural and synthetic data coding, and improved temporal random access.

❚ Compression. This class consists of two functionalities: improved coding efficiency and coding of multiple concurrent data streams. These essentially target applications requiring efficient storage or transmission of audiovisual information and their effective synchronization.

❚ Universal access. The remaining two functionalities are robustness in error-prone environments and content-based scalability. These functionalities make MPEG-4 encoded data accessible over a wide range of media, with various qualities in terms of temporal and spatial resolutions for specific objects, decodable by a range of decoders with different complexities.

The error resilience tools developed for video divide into synchronization, data recovery, and error concealment. The basic scalability tools offered are temporal scalability and spatial scalability. MPEG-4 Video also supports combinations of these basic scalability tools, referred to as hybrid scalability. Basic scalability allows two layers of video, referred to as the lower layer and the enhancement layer, whereas hybrid scalability supports up to four layers.
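The base-plus-enhancement organization can be sketched as below. The layer objects and the selection rule are purely illustrative, under the assumption that a receiver simply takes as many enhancement layers as its channel budget allows.

```python
# Illustrative only: a scalable stream as a base layer plus enhancement layers.
from dataclasses import dataclass
from typing import List

@dataclass
class Layer:
    kind: str            # "base", "spatial", or "temporal"
    bitrate_kbps: int

def layers_to_decode(base: Layer, enhancements: List[Layer], budget_kbps: int) -> List[Layer]:
    """Always take the stand-alone base layer, then add enhancement layers
    (lower to higher) while the channel budget allows."""
    chosen, spent = [base], base.bitrate_kbps
    for layer in enhancements:
        if spent + layer.bitrate_kbps > budget_kbps:
            break
        chosen.append(layer)
        spent += layer.bitrate_kbps
    return chosen

print(layers_to_decode(Layer("base", 128),
                       [Layer("spatial", 256), Layer("temporal", 256)],
                       budget_kbps=400))
# -> base and spatial layers fit; the temporal enhancement is dropped
```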


MPEG-4 Video provides tools and algorithms for

❚ efficient compression of images and video

❚ efficient compression of textures for texture mapping on 2D and 3D meshes

❚ efficient compression of implicit 2D meshes

❚ efficient compression of time-varying geometry streams that animate meshes

❚ efficient random access to all types of visual objects

❚ extended manipulation functionality for images and video sequences

❚ content-based coding of images and video

❚ content-based scalability of textures, images, and video

❚ spatial, temporal, and quality scalability

❚ error robustness and resilience in error-prone environments

MPEG-4 video encoder and decoder structure
MPEG-4 includes the concepts of video object and video object plane. A video object in a scene is an entity that a user may access and manipulate. The instances of video objects at a given time are called video object planes (VOPs). The encoding process generates a coded representation of a VOP plus composition information necessary for display. Further, at the decoder a user may interact with and modify the composition process as needed.

The full syntax allows coding of rectangular as well as arbitrarily shaped video objects in a scene. Further, the syntax supports both nonscalable and scalable coding. The scalability syntax enables reconstructing useful video from pieces of a bit stream by structuring the total bit stream in two or more layers, starting from a stand-alone base layer and adding a number of enhancement layers. The base layer can be coded using a nonscalable syntax or, in the case of picture-based coding, even using the syntax of a different video coding standard.

The ability to access individual objects requires achieving a coded representation of their shape. A natural video object consists of a sequence of 2D representations (at different points in time) referred to here as VOPs. Efficient coding of VOPs exploits both temporal and spatial redundancies. Thus a coded representation of a VOP includes representation of its shape, its motion, and its texture.

Figure 3 shows the block diagram of the MPEG-4 video encoder/decoder. The most important feature is the intrinsic representation based on video objects when defining a visual scene. In fact, a user—or an intelligent agent—may choose to encode the different video objects composing source data with different parameters or different coding methods, or may even choose not to code some of them at all.

Figure 3. General structure of the MPEG-4 video encoder/decoder. (VOP represents video object plane, VO represents video object.)

In most applications, each video object represents a semantically meaningful object in the scene. To maintain a certain compatibility with available video materials, each uncompressed video object is represented as a set of Y, U, and V components, plus information about its shape, stored frame after frame at predefined temporal intervals.

Another important feature of the video standard is that this approach doesn't explicitly define a temporal frame rate. This means that the encoder and decoder can function in different frame rates, which don't even need to stay constant throughout the video sequence (or the same for the various video objects).
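A minimal sketch of this object-based representation, with invented field names rather than the MPEG-4 Visual bit-stream syntax, might look like this:

```python
# Illustrative model of a video object as a sequence of video object planes (VOPs),
# each carrying shape, motion, and texture, with per-object coding choices.
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class VOP:
    time_s: float
    shape_mask: Optional[bytes]      # arbitrary shape (None for rectangular objects)
    motion_vectors: bytes            # exploits temporal redundancy w.r.t. earlier VOPs
    texture: bytes                   # Y, U, V texture data

@dataclass
class VideoObject:
    name: str
    frame_rate_hz: float             # each object may use its own (even varying) rate
    quantizer: int                   # the encoder may spend bits differently per object
    vops: List[VOP] = field(default_factory=list)

scene_objects = [
    VideoObject("news_reader", frame_rate_hz=25.0, quantizer=4),
    VideoObject("static_backdrop", frame_rate_hz=1.0, quantizer=12),
]
```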

79 Standards

Interactivity between the user and the encoder or the decoder takes place in different ways. The user may decide to interact at the encoding level, either in coding control to distribute the available bit rate between different video objects, for instance, or to influence the multiplexing to change parameters such as the composition script at the encoder. In cases where no back channel is available, or when the compressed bit stream already exists, the user may interact with the decoder by acting on the compositor to change either the position of a video object or its display depth order. The user can also influence the decoding at the receiving terminal by requesting the processing of a portion of the bit stream only, such as the shape.

The decoder's structure resembles that of the encoder—except in reverse—apart from the composition block at the end. The exact method of composition (blending) of different video objects depends on the application and the method of multiplexing used at the system level.

Audio
To achieve the highest audio quality within the full range of bit rates and at the same time provide extra functionalities, the MPEG-4 Audio3 standard includes six types of coding techniques:

❚ Parametric coding modules

❚ Linear predictive coding (LPC) modules

❚ Time/frequency (T/F) coding modules

❚ Synthetic/natural hybrid coding (SNHC) integration modules

❚ Text-to-speech (TTS) integration modules

❚ Main integration modules, which combine the first three modules to a scalable encoder

While the first three parts describe real coding schemes for the low-bit-rate representation of natural audio sources, the SNHC and TTS parts only standardize the interfaces to general SNHC and TTS systems. Because of this, already established synthetic coding standards, such as MIDI, can be integrated into the MPEG-4 Audio system. The TTS interfaces permit plugging TTS modules optimized for a special language into the general framework.

The main module contains global modules, such as the speed change functionality of MPEG-4 and the scalability add-ons necessary to implement large-step scalability by combining different coding schemes. Such scalability modules also allow the use of the International Telecommunications Union-Telecommunications (ITU-T) codecs within the scalable schemes.

Each of the natural coding schemes should cover a specific range of bit rates and applications. The following focuses on the tools most important in the current standard.

Parametric coding
The HVXC (harmonic vector excitation coding) decoder tools allow decoding of speech signals at 2 Kbps (and higher, up to 6 Kbps), while the individual line decoder tools allow decoding of nonspeech signals like music at bit rates of 4 Kbps and higher. Both sets of decoder tools allow independent change of speed and pitch during the decoding and can be combined to handle a wider range of signals and bit rates.

The HVXC decoder's basic decoding process consists of four steps: inverse quantization of parameters, generation of excitation signals for voiced frames by sinusoidal synthesis (harmonic synthesis), generation of excitation signals for unvoiced frames by codebook look-up, and linear predictive coding synthesis. Spectral postfilters enhance the synthesized speech quality.

CELP coding
While the parametric schemes currently allow for the lowest bit rates in MPEG-4 Audio, both narrow-band (4 kHz audio gross bandwidth) and wide-band (8 kHz audio gross bandwidth) code-excited linear prediction (CELP) encoders cover the next higher range of bit rates. In general they offer the following advantages over the parametric encoders in the bit-rate range from about 6 to 24 Kbps:

❚ Lower delay (15- to 40-ms algorithmic delay, compared to about 90 ms for the parametric speech coder)

❚ At higher rates, better performance for signals not easily described by parametric models

In MPEG-4, linear predictive coding (LPC) is realized by means of CELP coding techniques. CELP is a general analysis-by-synthesis model, based on the combination of LPC and codebook excitation. In this model, linear prediction deals with the relevant speech parameters of spectral envelope (short-term prediction) and pitch (long-term prediction), and codebook excitation takes into account the nonpredictive part of the signal.

The new feature of CELP coding in MPEG-4 is the scalability in audio bandwidth, bit rate, and delay. Different coding schemes, using the same set of basic functions, can be combined, including a bit-rate-adjustable narrow-band speech coder operating at bit rates from 5 to 12 Kbps with an algorithmic delay of 25 ms, a coder offering better performance at a higher delay, or a wide-band speech coder operating with bit rates from 16 to 24 Kbps. Just recently, the addition of a true scalable coding scheme has been proposed for narrow-band CELP coding.
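As a back-of-the-envelope illustration of the bit-rate ranges quoted for the natural audio tools (parametric, CELP, and the time/frequency coder described next), a receiver or author might route material along these lines. The thresholds are approximations taken from the figures in the text.

```python
# Illustrative mapping of target bit rate and signal type to an MPEG-4 audio tool,
# using the approximate ranges quoted in the article.
def pick_audio_tool(bitrate_kbps: float, is_speech: bool) -> str:
    if bitrate_kbps < 6:
        return "parametric (HVXC)" if is_speech else "parametric (individual lines)"
    if bitrate_kbps <= 24 and is_speech:
        return "CELP (narrow-band or wide-band)"
    return "time/frequency (AAC-based)"

print(pick_audio_tool(2, is_speech=True))    # parametric (HVXC)
print(pick_audio_tool(12, is_speech=True))   # CELP (narrow-band or wide-band)
print(pick_audio_tool(64, is_speech=False))  # time/frequency (AAC-based)
```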

Time/frequency coding
This coding scheme is characterized best by coding the input signal's spectrum. The input signal is first transformed into a different representation that gives access to its spectral components. Since these codecs don't rely on a special model of the input signal, they suit encoding any type of input signal. The usable bit rate ranges from 16 Kbps for a 7-kHz audio bandwidth to up to more than 64 Kbps per audio channel for CD-like quality coding of mono, stereo, or multichannel audio.

The most important tools included in the time/frequency part derive from the MPEG-2 Advanced Audio Coding (AAC) standard. Many optional tools modify one or more of the spectra to provide more efficient coding. For those operating in the spectral domain, the option to "pass through" is retained, and in all cases where a spectral operation is omitted, the spectra at its input pass directly through the tool without modification.

Synthetic and natural hybrid coding
SNHC deals with the representation and coding of synthetic (2D and 3D graphics) and natural (still images and natural video) audiovisual information. SNHC represents an important aspect of MPEG-4 for combining mixed media types including streaming and downloaded A/V objects.

SNHC fields of work include 2D and 3D graphics, human face and body description and animation, integration of text and graphics, scalable textures encoding, 2D/3D mesh coding, hybrid text-to-speech coding, and synthetic audio coding (structured audio).

Media integration of text and graphics
MITG provides a way to encode, synchronize, and describe the layout of 2D scenes composed of animated text, audio, video, synthetic graphic shapes, and pointers. The 2D BIFS graphics objects derive from and are a restriction of the corresponding VRML 2.0 3D nodes. Many different types of textures can be mapped on plane objects: still images, moving pictures, complete MPEG-4 scenes, or even user-defined patterns. Alternatively, many material characteristics (color, transparency, border type) can be applied on 2D objects.

Other VRML-derived nodes are the interpolators and the sensors. Interpolators allow predefined object animations like rotations, translations, and morphing. Sensors generate events that can be redirected to other scene nodes to trigger actions and animations. The user can generate events, or events can be associated to particular time instants.

MITG provides a Layout node to specify the placement, spacing, alignment, scrolling, and wrapping of objects in the MPEG-4 scene. Still images or video objects can be placed in a scene graph in many ways, and they can be texture-mapped on any 2D object. The most common way, though, is to use the Bitmap node to insert a rectangular area in the scene in which pixels coming from a video or still image can be copied.

The 2D scene graphs can contain audio sources by means of the Sound2D nodes. Like visual objects, they must be positioned in space and time. They are subject to the same spatial transformations of their parents' nodes hierarchically above them in the scene graph.

Text can be inserted in a scene graph through the Text node. Text characteristics (font, size, style, spacing, and so on) can be customized by means of the FontStyle node.
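A toy 2D scene built from the node types just mentioned might be organized as below; the Python classes are hypothetical stand-ins for the Layout, Bitmap, Text, and FontStyle nodes, since the real scene description is carried in BIFS.

```python
# Hypothetical stand-ins for some MITG / 2D BIFS node types.
from dataclasses import dataclass, field
from typing import List

@dataclass
class FontStyle:
    family: str = "SERIF"
    size: float = 12.0

@dataclass
class Text:
    string: str
    font_style: FontStyle = field(default_factory=FontStyle)

@dataclass
class Bitmap:                 # rectangular area receiving pixels from a video or still image
    stream_id: int

@dataclass
class Layout:                 # placement, spacing, alignment, scrolling, wrapping
    wrap: bool
    children: List = field(default_factory=list)

caption = Text("Le tour de France", FontStyle("SANS", 18.0))
scene_2d = Layout(wrap=True, children=[Bitmap(stream_id=3), caption])
```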


Figure 4 shows a rather complicated MPEG-4 scene from "Le tour de France" with many different object types like video, icons, text, still images for the map of France and the trail map, and a semitransparent pop-up menu with clickable items. These items, if selected, provide information about the race, the cyclists, the general placing, and so on.

Figure 4. An MPEG-4 application called "Le tour de France" featuring many different A/V objects.

3D graphics
The advent of 3D graphics triggered the extension of MPEG-4 to the third dimension. BIFS 3D nodes—an extension of the ones defined in VRML specifications—allow the creation of virtual worlds. Like in VRML, it's possible to add behavior to objects through Script nodes. Script nodes contain functions and procedures (the terminal must support the Javascript language) that can define arbitrary complex behaviors like performing object animations, changing the values of nodes' fields, modifying the scene tree, and so on. MPEG-4 allows the creation of much more complex scenes than VRML, of 2D/3D hybrid worlds where contents are not downloaded once but can be streamed to update the scene continuously.

Face animation
Face animation focuses on delineating parameters for face animation and definition. It has a very tight relationship with hybrid scalable text-to-speech synthesis for creating interesting applications based on speech-driven avatars. Despite previous research on avatars, the face animation work is the first attempt to define in a standard way the sets of parameters for synthetic anthropomorphic models.

Face animation is based on the development of two sets of parameters: facial animation parameters (FAPs) and facial definition parameters (FDPs). FAPs allow having a single set of parameters regardless of the face model used by the terminal or application. Most FAPs describe atomic movements of the facial features; others (expressions and visemes) define much more complex deformations. Visemes are the visual counterparts of phonemes and hence define the position of the mouth (lips, jaw, tongue) associated with phonemes. In the context of MPEG-4, the expressions mimic the facial expressions associated with human primary emotions like joy, anger, fear, surprise, sadness, and disgust.

Animated avatars' animation streams fit very low bit-rate channels (about 4 Kbps). FAPs can be encoded either with arithmetic encoding or with discrete cosine transform (DCT).

FDPs are used to calibrate (that is, modify or adapt the shape of) the receiver terminal default face models or to transmit completely new face model geometry and texture.

2D mesh encoding
A 2D mesh object in MPEG-4 represents the geometry and motion of a 2D triangular mesh, that is, tessellation of a 2D visual object plane into triangular patches. A dynamic 2D mesh is a temporal sequence of 2D triangular meshes. The initial mesh can be either uniform (described by a small set of parameters) or Delaunay (described by listing the coordinates of the vertices or nodes and the edges connecting the nodes). Either way, it must be simple—it cannot contain holes.

Once the mesh has been defined, it can be animated by moving its vertices and warping its triangles. To achieve smooth animations, motion vectors are represented and coded with half-pixel accuracy. When the mesh deforms, its topology remains unchanged. Updating the mesh shape requires only the motion vectors that express how to move the vertices in the new mesh.

An example of a rectangular mesh object borrowed from the MPEG-4 specification appears in Figure 5.

Figure 5. Mesh object with uniform triangular geometry.

Dynamic 2D meshes inserted in an MPEG-4 scene create 2D animations. This results from mapping textures (video object planes, still images, 2D scenes) onto 2D meshes.
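The update mechanism, in which vertices move by half-pixel-accurate motion vectors while the triangulation stays fixed, can be sketched as follows (illustrative data structures, not the coded mesh syntax):

```python
# Illustrative 2D dynamic mesh: topology (triangles) is fixed; each update
# moves the vertices by motion vectors coded with half-pixel accuracy.
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Mesh2D:
    vertices: List[Tuple[float, float]]
    triangles: List[Tuple[int, int, int]]   # indices into vertices; never changes

def apply_motion(mesh: Mesh2D, motion_vectors: List[Tuple[int, int]]) -> Mesh2D:
    """Motion vectors are in half-pixel units, one per vertex."""
    moved = [(x + dx * 0.5, y + dy * 0.5)
             for (x, y), (dx, dy) in zip(mesh.vertices, motion_vectors)]
    return Mesh2D(moved, mesh.triangles)

mesh = Mesh2D([(0, 0), (16, 0), (8, 16)], [(0, 1, 2)])
mesh = apply_motion(mesh, [(1, 0), (1, 0), (2, -1)])   # warp the single triangle
print(mesh.vertices)    # [(0.5, 0.0), (16.5, 0.0), (9.0, 15.5)]
```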

Texture coding
MPEG-4 supports an ad-hoc tool for encoding textures and still images based on a wavelet algorithm that provides spatial and quality scalability, content-based (arbitrarily shaped) object coding, and very efficient compression over a large range of bit rates. Texture scalability comes through many (up to 11) different levels of spatial resolutions, allowing progressive texture transmission and many alternative resolutions (the analog of mipmapping in 3D graphics). In other words, the wavelet technique provides for scalable bit-stream coding in the form of an image-resolution pyramid for progressive transmission and temporal enhancement of still images. For animation, arbitrarily shaped textures mapped onto 2D dynamic meshes yield animated video objects with a very limited data transmission.

Texture scalability can adapt texture resolution to the receiving terminal's graphics capabilities and the transmission rate to the channel bandwidth. For instance, the encoder may first transmit a coarse texture and then refine it with more texture data (levels of the resolution pyramid).

Structured Audio
Structured Audio allows creating synthetic sounds starting from coded input data. A special synthesis language called Structured Audio Orchestra Language (SAOL) permits defining a synthetic orchestra whose instruments can generate sounds like real musical instruments or process prestored sounds. MPEG-4 doesn't standardize SAOL's methods of generating sounds; it standardizes the method of describing synthesis.

Downloading scores in the bit stream controls the synthesis. Scores resemble scripts in a special language called Structured Audio Score Language (SASL). They consist of a set of commands for the various instruments. These commands can affect different instruments at different times to generate a large range of sound effects. If fine control over the final synthesized sound isn't needed, it's easier to control the orchestra through the MIDI format. Supporting MIDI adds to MPEG-4 scenes the ability to reuse and import a huge quantity of existing audio contents.

The bit rate needed by Synthetic Audio applications ranges from few bits per second to 2 or 3 Kbps when controlling many instruments and performing very fine coding. For terminals with less functionality, and for applications that don't need sophisticated synthesis, MPEG-4 also standardizes a wavetable bank format. This format permits downloading sound samples for use in wavetable synthesis, as well as simple processing tools (reverb, chorus, and so on).

MPEG-4 text-to-speech
MPEG-4 doesn't define a specific text-to-speech technique but rather the binary representation of a TTS stream and the interfaces of an MPEG-4 text-to-speech (M-TTS) system with the other parts of an MPEG-4 decoder. An M-TTS stream may contain many different information types about the synthetic voice apart from text: gender, age, speech rate, prosody, and lip shape information. It may contain fields that allow trick mode (fast-forwarding, pausing, playing, or rewinding the synthetic speech).

An M-TTS stream can also carry the International Phonetic Alphabet (IPA) coded phonemes with their time duration. Handed to the face animation engine in the MPEG-4 player, they can produce speech-driven face animation. In this case the face animation system doesn't receive a FAP stream from the MPEG-4 demultiplexer; instead it converts phonemes into visemes and uses them to perform the face model deformations. The phoneme duration synchronizes model animation and speech.

Interestingly, such applications require a tiny channel bandwidth—from 200 bps to 1.2 Kbps.

This concludes part 1. We'll look at applications and what comes next for MPEG-4 in part 2. MM

For More Information
ISO official site: http://www.iso.ch/
MPEG official site: http://www.cselt.it/mpeg/
MPEG-4 Systems site: http://garuda.imag.fr/MPEG4/
MPEG-4 Visual site: http://wwwam.hhi.de/mpeg-video/
MPEG-4 Audio site: http://www.tnt.uni-hannover.de/project/mpeg/audio/
MPEG-4 SNHC site: http://www.es.com/mpeg4-snhc/
MPEG-4 Synthetic Audio site: http://sound.media.mit.edu/mpeg4/
Web3D (formerly VRML) official site: http://www.web3d.org/
IPA (International Phonetic Alphabet) site: http://www.arts.gla.ac.uk/IPA/ipa.

References
Available to MPEG members or from ISO (http://www.iso.ch) or the national standards bodies (for example, the American National Standards Institute, or ANSI, in the US):
1. MPEG-4 Part 1: Systems (IS 14496-1), doc. N2501, Atlantic City, N.J., USA, Oct. 1998.
2. MPEG-4 Part 2: Visual (IS 14496-2), doc. N2502, Atlantic City, N.J., USA, Oct. 1998.
3. MPEG-4 Part 3: Audio (IS 14496-3), doc. N2503, Atlantic City, N.J., USA, Oct. 1998.
4. MPEG-4 Part 6: DMIF (IS 14496-6), doc. N2506, Atlantic City, N.J., USA, Oct. 1998.
5. VRML (IS 14772-1), "Virtual Reality Modeling Language," April 1997.

Readers may contact Casalino at Ernst & Young Consultants, Corso Vittorio Emanuele II, n. 83, 10128 Torino, Italy, e-mail [email protected].

Contact Standards editor Peiya Liu, Siemens Corporate Research, 755 College Road East, Princeton, NJ 08540, e-mail [email protected].
