Standards
Editor: Peiya Liu, Siemens Corporate Research

MPEG-4: A Multimedia Standard for the Third Millennium, Part 1

Stefano Battista, bSoft
Franco Casalino, Ernst & Young Consultants
Claudio Lande, CSELT

MPEG-4 (formally ISO/IEC international standard 14496) defines a multimedia system for interoperable communication of complex scenes containing audio, video, synthetic audio, and graphics material. In part 1 of this two-part article we provide a comprehensive overview of the technical elements of the Moving Pictures Expert Group's MPEG-4 multimedia system specification. In part 2 (in the next issue) we describe an application scenario based on digital satellite television broadcasting, discuss the standard's envisaged evolution, and compare it to other activities in forums addressing multimedia specifications.

Evolving standard
MPEG-4 started in July 1993, reached Committee Draft level in November 1997, and achieved International Standard level in April 1999. MPEG-4 combines some typical features of other MPEG standards, but aims to provide a set of technologies to satisfy the needs of authors, service providers, and end users.

For authors, MPEG-4 will enable the production of content with greater reusability and flexibility than possible today with individual technologies such as digital television, animated graphics, World Wide Web (WWW) pages, and their extensions. Also, it permits better management and protection of content owner rights.

For network service providers, MPEG-4 will offer transparent information, interpreted and translated into the appropriate native signaling messages of each network with the help of relevant standards bodies. However, the foregoing excludes quality-of-service (QoS) considerations, for which MPEG-4 will provide a generic QoS descriptor for different MPEG-4 media. The exact translations from the QoS parameters set for each media to the network QoS exceed the scope of MPEG-4 and remain for network providers to define.

For end users, MPEG-4 will enable many functionalities potentially accessible on a single compact terminal and higher levels of interaction with content, within the limits set by the author.

MPEG-4 achieves these goals by providing standardized ways to support

❚ Coding—representing units of audio, visual, or audiovisual content, called media objects. These media objects are natural or synthetic in origin, meaning they could be recorded with a camera or microphone, or generated with a computer.

❚ Composition—describing the composition of these objects to create compound media objects that form audiovisual scenes.

❚ Multiplex—multiplexing and synchronizing the data associated with media objects for transport over network channels providing a QoS appropriate for the nature of the specific media objects.

❚ Interaction—interacting with the audiovisual scene at the receiver's end or, via a back channel, at the transmitter's end.

The structure of the MPEG-4 standard consists of six parts: Systems,1 Visual,2 Audio,3 Conformance Testing, Reference Software, and Delivery Multimedia Integration Framework (DMIF).4

Systems
The Systems subgroup1 defined the framework for integrating the natural and synthetic components of complex multimedia scenes. The Systems level shall integrate the elementary decoders for the media components specified by other MPEG-4 subgroups—Audio, Video, Synthetic and Natural Hybrid Coding (SNHC), and Intellectual Property Management and Protection (IPMP)—providing the specification for the parts of the system related to composition and multiplex.

Composition information consists of the representation of the hierarchical structure of the scene. A graph describes the relationship among elementary media objects comprising the scene. The MPEG-4 Systems subgroup adopted an approach for composition of elementary media objects inspired by the existing Virtual Reality Modeling Language (VRML)5 standard. VRML provides the specification of a language to describe the composition of complex scenes containing 3D material, plus audio and video.

The resulting specification addresses issues specific to an MPEG-4 system:

❚ description of objects representing natural audio and video with streams attached, and

❚ description of objects representing synthetic audio and video (2D and 3D material) with streams attached (such as streaming text or streaming parameters for animation of a facial model in the terminal).

The techniques adopted for multiplexing the elementary streams borrow from the experience of MPEG-1 and MPEG-2 Systems for timing and synchronization of continuous media. A specific three-layer multiplex strategy defined for MPEG-4 fits the requirements of a wide range of networks and of very different application scenarios.

Figure 1 shows a high-level diagram of an MPEG-4 system's components. It serves as a reference for the terminology used in the system's design and specification: the demultiplexer, the elementary media decoders (natural audio, natural video, synthetic audio, and synthetic video), the specialized decoders for the composition information, and the specialized decoders for the protection information.

Figure 1. MPEG-4 high-level system architecture (receiver model).

The following subsections present the composition and multiplex aspects of the MPEG-4 Systems in more detail.

Composition
The MPEG-4 standard deals with frames of audio and video (vectors of samples and matrices of pixels). Further, it deals with the objects that make up the audiovisual scene. Thus, a given scene has a number of video objects, of possibly differing shapes, plus a number of audio objects, possibly associated to video objects, to be combined before presentation to the user. Composition encompasses the task of combining all of the separate entities that make up the scene.

The model adopted by MPEG-4 to describe the composition of a complex multimedia scene relies on the concepts VRML uses. Basically, the Systems group decided to reuse as much of VRML as possible, extending and modifying it only when strictly necessary. The main areas featuring new concepts according to specific application requirements are

❚ dealing with 2D-only content, for a simplified scenario where 3D graphics is not required;

❚ interfacing with streams (video, audio, streaming text, streaming parameters for synthetic objects); and

❚ adding synchronization capabilities.

The outcome is the specification of a VRML-based composition format with extensions tuned to match MPEG-4 requirements. The scene description represents complex scenes populated by synthetic and natural audiovisual objects with their associated spatiotemporal transformations. The author can generate this description in textual format, possibly through an authoring tool. The scene's description then conforms to the VRML syntax with extensions. For efficiency, the standard defines a way to encode the scene description in a binary representation—Binary Format for Scene Description (BIFS).
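As a toy illustration of what such a scene description captures (grouping nodes with media objects at the leaves, each bound to the elementary stream that feeds it), consider the following sketch. The class and field names are hypothetical; they are not actual BIFS or VRML node types.

```python
# A minimal, hypothetical model of a composed scene: grouping nodes form the
# hierarchy, and each leaf references the elementary stream that feeds it.
from dataclasses import dataclass, field
from typing import List

@dataclass
class MediaObject:          # leaf: a natural or synthetic audio/visual object
    name: str
    stream_id: int          # elementary stream carrying its coded data

@dataclass
class Group:                # intermediate node: groups children for composition
    name: str
    children: List = field(default_factory=list)

scene = Group("root", children=[
    Group("presenter", children=[
        MediaObject("speaker_video", stream_id=3),    # arbitrarily shaped video object
        MediaObject("speaker_audio", stream_id=4),
    ]),
    MediaObject("background_still", stream_id=7),
    MediaObject("face_model_animation", stream_id=9), # synthetic object fed by a parameter stream
])
```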


Multimedia scenes are conceived as hierarchical structures represented as a graph. Each leaf of the graph represents a media object (audio; video; synthetic audio like a Musical Instrument Digital Interface, or MIDI, file; synthetic video like a face model). The graph structure isn't necessarily static, as the relationships can evolve over time as nodes or subgraphs are added or deleted. All the parameters describing these relationships are part of the scene description sent to the decoder.

The snapshot of the scene is sent or retrieved on a dedicated stream. It is then parsed, and the whole scene structure is reconstructed (in an internal representation) at the receiver terminal. All the nodes and graph leaves that require streaming support to retrieve media contents or ancillary data (video stream, audio stream, facial animation parameters) are logically connected to the decoding pipelines.

An update of the scene structure may be sent at any time. These updates can access any field of any updatable node in the scene. An updatable node is one that received a unique node identifier in the scene structure. The user can also interact locally with the scenes, which may change the scene structure or the value of any field of any updatable node.

Composition information (information about the initial scene composition and the scene updates during the sequence evolution) is, like other streaming data, delivered in one elementary stream. The composition stream is treated differently from others because it provides the information required by the terminal to set up the scene structure and map all other elementary streams to the respective media objects.

Spatial relationships. The media objects may have 2D or 3D dimensionality. A typical video object (a moving picture with associated arbitrary shape) is 2D, while a wire-frame model of a person's face is 3D. Audio also may be spatialized in 3D, specifying the position and directional characteristics of the source.

Each elementary media object is represented by a leaf in the scene graph and has its own local coordinate system. The mechanism to combine the scene graph's nodes into a single global coordinate system uses spatial transformations associated to the intermediate nodes, which group their children together. Following the graph branches from bottom to top, the spatial transformations cascade to reach the unique coordinate system associated to the root of the graph.
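The cascading of local coordinate systems can be pictured with a small sketch. The node class below is hypothetical and 2D-only; real BIFS grouping nodes carry richer 2D and 3D transformations.

```python
# Sketch of composing local-to-global transforms along a scene graph path
# (hypothetical classes; only translation and uniform scale are modeled).
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    name: str
    translation: tuple = (0.0, 0.0)   # local transform carried by this node
    scale: float = 1.0
    parent: Optional["Node"] = None

def to_global(node: Node, point=(0.0, 0.0)):
    """Apply transforms from the leaf up to the root of the graph."""
    x, y = point
    n = node
    while n is not None:
        x, y = x * n.scale + n.translation[0], y * n.scale + n.translation[1]
        n = n.parent
    return x, y

root = Node("scene")
group = Node("group", translation=(100.0, 50.0), scale=2.0, parent=root)
video = Node("video_object", translation=(10.0, 0.0), parent=group)
print(to_global(video))   # origin of the video object in scene coordinates: (120.0, 50.0)
```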

Temporal relationships. The composition stream (BIFS) has its own associated time base. Even if the time bases for the composition and for the elementary data streams differ, they must be consistent except for translation and scaling of the time axis. Time stamps attached to the elementary media streams specify at what time the access unit for a media object should be ready at the decoder input (DTS, decoding time stamp), and at what time the composition unit should be ready at the compositor input (CTS, composition time stamp). Time stamps associated to the composition stream specify at what time the access units for composition must be ready at the input of the composition information decoder.

In addition to the time stamps mechanism (derived from MPEG-1 and MPEG-2), fields within the scene description also carry a time value. They indicate either a duration in time or an instant in time. To make the latter consistent with the time stamps scheme, MPEG-4 modified the semantics of these absolute time fields in seconds to represent a relative time with respect to the time stamp of the BIFS elementary stream. For example, the start of a video clip represents the relative offset between the composition time stamp of the scene and the start of the video display.

Multiplex
Because MPEG-4 is intended for use on a wide variety of networks with widely varying performance characteristics, it includes a three-layer multiplex standardized by the Delivery Multimedia Integration Framework (DMIF)4 working group. The three layers separate the functionality of

❚ adding MPEG-4-specific information for timing and synchronization of the coded media (synchronization layer);

❚ multiplexing streams with very different characteristics, such as average bit rate and size of access units (flexible multiplex layer); and

❚ adapting the multiplexed stream to the particular network characteristics in order to facilitate the interface to different network environments (transport multiplex layer).

The goal is to exploit the characteristics of each network, while adding functionality that these environments lack and preserving a homogeneous interface toward the MPEG-4 system.

Elementary streams are packetized, adding headers with timing information (clock references) and synchronization data (time stamps). They make up the synchronization layer (SL) of the multiplex.
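To illustrate the relative-time convention described under Temporal relationships above, the following sketch (hypothetical variable names, not the normative BIFS syntax) rewrites an absolute start time as an offset from the composition time stamp of the scene description and recovers it at the receiver.

```python
# Illustrative only: BIFS time fields are expressed relative to the
# composition time stamp (CTS) of the scene description access unit.

def encode_start_field(absolute_start_s: float, scene_cts_s: float) -> float:
    """Store the clip start as an offset from the scene's CTS."""
    return absolute_start_s - scene_cts_s

def resolve_start_field(relative_start_s: float, scene_cts_s: float) -> float:
    """At the receiver, recover the instant on the common time base."""
    return scene_cts_s + relative_start_s

scene_cts = 12.0        # seconds: CTS of the BIFS access unit
clip_starts_at = 14.5   # author intent: start the video 2.5 s after the scene appears
stored_field = encode_start_field(clip_starts_at, scene_cts)   # 2.5 is what the stream carries
assert resolve_start_field(stored_field, scene_cts) == clip_starts_at
```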

Figure 2. General structure of the MPEG-4 multiplex. Different cases have multiple SL streams multiplexed in one FML stream and multiple FML streams multiplexed in one TML stream.

Streams with similar QoS requirements are then multiplexed on a content multiplex layer, termed the flexible multiplex layer (FML). It efficiently interleaves data from a variable number of variable bit-rate streams.

A service multiplex layer, known as the transport multiplex layer (TML), can add a variety of levels of QoS and provide framing of its content and error detection. Since this layer is specific to the characteristics of the transport network, the specification of how data from SL or FML streams is packetized into TML streams refers to the definition of the network protocols—MPEG-4 doesn't specify it.

Figure 2 shows these three layers and the relationship among them.

Synchronization layer (timing and synchronization). Elementary streams consist of access units, which correspond to portions of the stream with a specific decoding time and composition time. As an example, an elementary stream for a natural video object consists of the coded video object instances at the refresh rate specific to the video sequence (for example, the video of a person captured at 25 pictures per second). Or, an elementary stream for a face model consists of the coded animation parameters instances at the refresh rate specific to the face model animation (for example, a model animated to refresh the facial animation parameters 30 times per second). Access units like a video object instance or a facial animation parameters instance are the self-contained semantic units in the respective streams, which have to be decoded and used for composition synchronously with a common system time base.

Elementary streams are first framed in SL packets, not necessarily matching the size of the access units in the streams. The header attached by this first layer contains fields specifying

❚ Sequence number—a continuous number for the packets, to perform packet loss checks

❚ Instantaneous bit rate—the bit rate at which the elementary stream is coded

❚ OCR (object clock reference)—a time stamp used to reconstruct the time base for the single object

❚ DTS (decoding time stamp)—a time stamp to identify the correct time to decode an access unit

❚ CTS (composition time stamp)—a time stamp to identify the correct time to render a decoded access unit

The information contained in the SL headers maintains the correct time base for the elementary decoders and for the receiver terminal, plus the correct synchronization in the presentation of the elementary media objects in the scene. The clock references mechanism supports timing of the system, and the mechanism of time stamps supports synchronization of the different media.
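A rough model of the per-packet information listed above might look as follows; the field names and types are illustrative and do not reproduce the normative SL packet header bit syntax.

```python
# Illustrative SL packet model carrying the header fields described above.
from dataclasses import dataclass
from typing import Optional

@dataclass
class SLPacket:
    sequence_number: int                   # continuity check for packet-loss detection
    instantaneous_bit_rate: Optional[int]  # bit rate of the elementary stream (bps)
    ocr: Optional[int]                     # object clock reference: rebuilds the object time base
    dts: Optional[int]                     # decoding time stamp of the access unit
    cts: Optional[int]                     # composition time stamp of the access unit
    payload: bytes = b""                   # a fragment of (or a whole) access unit

def ready_to_decode(packet: SLPacket, object_clock: int) -> bool:
    """An access unit should be at the decoder input by its DTS."""
    return packet.dts is not None and object_clock >= packet.dts
```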


Flexible multiplex layer (content). Given the wide range of possible bit rates associated to the elementary streams—ranging, for example, from 1 Kbps for facial animation parameters to 1 Mbps for good-quality video objects—an intermediate multiplex layer provides more flexibility. The SL serves as a tool to associate timing and synchronization data to the coded material. The transport multiplex layer adapts the multiplexed stream to the specific transport or storage media. The intermediate (optional) flexible multiplex layer provides a way to group together several low-bit-rate streams for which the overhead associated to a further level of packetization is not necessary or introduces too much redundancy. With conventional scenes, like the usual audio plus video of a motion picture, this optional multiplex layer can be skipped; the single audio stream and the single video stream can be mapped each to a single transport multiplex stream.

Transport multiplex layer (service). The multiplex layer closest to the transport level depends on the specific transmission or storage system on which the coded information is delivered. The Systems part of MPEG-4 doesn't specify the way SL packets (when no FML is used) or FML packets are mapped on TML packets. The specification simply references several different transport packetization schemes. The "content" packets (the coded media data wrapped by SL headers and FML headers) may be transported directly using an Asynchronous Transfer Mode (ATM) Adaptation Layer 2 (AAL2) scheme for applications over ATM, MPEG-2 transport stream packetization over networks providing that support, or transmission control protocol/Internet protocol (TCP/IP) for applications over the Internet.

Intellectual property protection
The MPEG-4 standard specifies a multimedia bit-stream syntax, a set of tools, and interfaces for designers and builders of a wide variety of multimedia applications. Each of these applications has a set of requirements regarding protection of the information it manages. These applications can produce conflicting content management and protection requirements. By implication, the Intellectual Property Management and Protection (IPMP) framework design needs to consider the MPEG-4 standard's complexity and the diversity of its applications.

The IPMP framework consists of a normative interface that permits an MPEG-4 terminal to host one or more IPMP systems. An IPMP system is a non-normative component that provides intellectual property management and protection functions for the terminal.

Video
The most important goal of both the MPEG-1 and MPEG-2 standards was to make the storage and transmission of digital audiovisual material more efficient through compression techniques. To achieve this, both deal with frame-based video and audio. Interaction with the content is limited to the video frame level, with its associated audio.

MPEG-4 Video functionalities
MPEG-4 Video2 supports different functionalities that divide into three nonorthogonal classes based on the requirements they support:

❚ Content-based interactivity. This class includes four functionalities focused on requirements for applications involving some form of interactivity between the user and the data: content-based multimedia data access tools, content-based manipulation and bit-stream editing, hybrid natural and synthetic data coding, and improved temporal random access.

❚ Compression. This class consists of two functionalities: improved coding efficiency and coding of multiple concurrent data streams. These essentially target applications requiring efficient storage or transmission of audiovisual information and their effective synchronization.

❚ Universal access. The remaining two functionalities are robustness in error-prone environments and content-based scalability. These functionalities make MPEG-4 encoded data accessible over a wide range of media, with various qualities in terms of temporal and spatial resolutions for specific objects, decodable by a range of decoders with different complexities.

The error resilience tools developed for video divide into synchronization, data recovery, and error concealment. The basic scalability tools offered are temporal scalability and spatial scalability. MPEG-4 Video also supports combinations of these basic scalability tools, referred to as hybrid scalability. Basic scalability allows two layers of video, referred to as the lower layer and the enhancement layer, whereas hybrid scalability supports up to four layers.
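The base-plus-enhancement organization can be sketched as below. The layer objects and the selection rule are purely illustrative, under the assumption that a receiver simply takes as many enhancement layers as its channel budget allows.

```python
# Illustrative only: a scalable stream as a base layer plus enhancement layers.
from dataclasses import dataclass
from typing import List

@dataclass
class Layer:
    kind: str            # "base", "spatial", or "temporal"
    bitrate_kbps: int

def layers_to_decode(base: Layer, enhancements: List[Layer], budget_kbps: int) -> List[Layer]:
    """Always take the stand-alone base layer, then add enhancement layers
    (lower to higher) while the channel budget allows."""
    chosen, spent = [base], base.bitrate_kbps
    for layer in enhancements:
        if spent + layer.bitrate_kbps > budget_kbps:
            break
        chosen.append(layer)
        spent += layer.bitrate_kbps
    return chosen

print(layers_to_decode(Layer("base", 128),
                       [Layer("spatial", 256), Layer("temporal", 256)],
                       budget_kbps=400))
# -> base and spatial layers fit; the temporal enhancement is dropped
```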


MPEG-4 Video provides tools and algorithms for

❚ efficient compression of images and video

❚ efficient compression of textures for texture mapping on 2D and 3D meshes

❚ efficient compression of implicit 2D meshes

❚ efficient compression of time-varying geometry streams that animate meshes

❚ efficient random access to all types of visual objects

❚ extended manipulation functionality for images and video sequences

❚ content-based coding of images and video

❚ content-based scalability of textures, images, and video

❚ spatial, temporal, and quality scalability

❚ error robustness and resilience in error-prone environments

MPEG-4 video encoder and decoder structure
MPEG-4 includes the concepts of video object and video object plane. A video object in a scene is an entity that a user may access and manipulate. The instances of video objects at a given time are called video object planes (VOPs). The encoding process generates a coded representation of a VOP plus composition information necessary for display. Further, at the decoder a user may interact with and modify the composition process as needed.

The full syntax allows coding of rectangular as well as arbitrarily shaped video objects in a scene. Further, the syntax supports both nonscalable and scalable coding. The scalability syntax enables reconstructing useful video from pieces of a bit stream by structuring the total bit stream in two or more layers, starting from a stand-alone base layer and adding a number of enhancement layers. The base layer can be coded using a nonscalable syntax or, in the case of picture-based coding, even using the syntax of a different video coding standard.

The ability to access individual objects requires achieving a coded representation of their shape. A natural video object consists of a sequence of 2D representations (at different points in time) referred to here as VOPs. Efficient coding of VOPs exploits both temporal and spatial redundancies. Thus a coded representation of a VOP includes representation of its shape, its motion, and its texture.

Figure 3 shows the block diagram of the MPEG-4 video encoder/decoder. The most important feature is the intrinsic representation based on video objects when defining a visual scene. In fact, a user—or an intelligent agent—may choose to encode the different video objects composing source data with different parameters or different coding methods, or may even choose not to code some of them at all.

Figure 3. General structure of the MPEG-4 video encoder/decoder. (VOP represents video object plane, VO represents video object.)

In most applications, each video object represents a semantically meaningful object in the scene. To maintain a certain compatibility with available video materials, each uncompressed video object is represented as a set of Y, U, and V components, plus information about its shape, stored frame after frame at predefined temporal intervals.

Another important feature of the video standard is that this approach doesn't explicitly define a temporal frame rate. This means that the encoder and decoder can function in different frame rates, which don't even need to stay constant throughout the video sequence (or the same for the various video objects).
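A minimal sketch of this object-based representation, with invented field names rather than the MPEG-4 Visual bit-stream syntax, might look like this:

```python
# Illustrative model of a video object as a sequence of video object planes (VOPs),
# each carrying shape, motion, and texture, with per-object coding choices.
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class VOP:
    time_s: float
    shape_mask: Optional[bytes]      # arbitrary shape (None for rectangular objects)
    motion_vectors: bytes            # exploits temporal redundancy w.r.t. earlier VOPs
    texture: bytes                   # Y, U, V texture data

@dataclass
class VideoObject:
    name: str
    frame_rate_hz: float             # each object may use its own (even varying) rate
    quantizer: int                   # the encoder may spend bits differently per object
    vops: List[VOP] = field(default_factory=list)

scene_objects = [
    VideoObject("news_reader", frame_rate_hz=25.0, quantizer=4),
    VideoObject("static_backdrop", frame_rate_hz=1.0, quantizer=12),
]
```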

79 Standards

Interactivity between the user and the encoder or the decoder takes place in different ways. The user may decide to interact at the encoding level, either in coding control to distribute the available bit rate between different video objects, for instance, or to influence the multiplexing to change parameters such as the composition script at the encoder. In cases where no back channel is available, or when the compressed bit stream already exists, the user may interact with the decoder by acting on the compositor to change either the position of a video object or its display depth order. The user can also influence the decoding at the receiving terminal by requesting the processing of a portion of the bit stream only, such as the shape.

The decoder's structure resembles that of the encoder—except in reverse—apart from the composition block at the end. The exact method of composition (blending) of different video objects depends on the application and the method of multiplexing used at the system level.

Audio
To achieve the highest audio quality within the full range of bit rates and at the same time provide extra functionalities, the MPEG-4 Audio3 standard includes six types of coding techniques:

❚ Parametric coding modules

❚ Linear predictive coding (LPC) modules

❚ Time/frequency (T/F) coding modules

❚ Synthetic/natural hybrid coding (SNHC) integration modules

❚ Text-to-speech (TTS) integration modules

❚ Main integration modules, which combine the first three modules to a scalable encoder

While the first three parts describe real coding schemes for the low-bit-rate representation of natural audio sources, the SNHC and TTS parts only standardize the interfaces to general SNHC and TTS systems. Because of this, already established synthetic coding standards, such as MIDI, can be integrated into the MPEG-4 Audio system. The TTS interfaces permit plugging TTS modules optimized for a special language into the general framework.

The main module contains global modules, such as the speed change functionality of MPEG-4 and the scalability add-ons necessary to implement large-step scalability by combining different coding schemes. Such scalability modules also allow the use of the International Telecommunications Union-Telecommunications (ITU-T) codecs within the scalable schemes.

Each of the natural coding schemes should cover a specific range of bit rates and applications. The following focuses on the tools most important in the current standard.

Parametric coding
The HVXC (harmonic vector excitation coding) decoder tools allow decoding of speech signals at 2 Kbps (and higher, up to 6 Kbps), while the individual line decoder tools allow decoding of nonspeech signals like music at bit rates of 4 Kbps and higher. Both sets of decoder tools allow independent change of speed and pitch during the decoding and can be combined to handle a wider range of signals and bit rates.

The HVXC decoder's basic decoding process consists of four steps: inverse quantization of parameters, generation of excitation signals for voiced frames by sinusoidal synthesis (harmonic synthesis), generation of excitation signals for unvoiced frames by codebook look-up, and linear predictive coding synthesis. Spectral postfilters enhance the synthesized speech quality.

CELP coding
While the parametric schemes currently allow for the lowest bit rates in MPEG-4 Audio, both narrow-band (4 kHz audio gross bandwidth) and wide-band (8 kHz audio gross bandwidth) code-excited linear prediction (CELP) encoders cover the next higher range of bit rates. In general they offer the following advantages over the parametric encoders in the bit-rate range from about 6 to 24 Kbps:

❚ Lower delay (15- to 40-ms algorithmic delay, compared to about 90 ms for the parametric speech coder)

❚ At higher rates, better performance for signals not easily described by parametric models

In MPEG-4, linear predictive coding (LPC) is realized by means of CELP coding techniques. CELP is a general analysis-by-synthesis model, based on the combination of LPC and codebook excitation. In this model, linear prediction deals with the relevant speech parameters of spectral envelope (short-term prediction) and pitch (long-term prediction), and codebook excitation takes into account the nonpredictive part of the signal.

The new feature of CELP coding in MPEG-4 is the scalability in audio bandwidth, bit rate, and delay. Different coding schemes, using the same set of basic functions, can be combined, including a bit-rate-adjustable narrow-band speech coder operating at bit rates from 5 to 12 Kbps with an algorithmic delay of 25 ms, a coder offering better performance at a higher delay, or a wide-band speech coder operating with bit rates from 16 to 24 Kbps. Just recently, the addition of a true scalable coding scheme has been proposed for narrow-band CELP coding.
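As a back-of-the-envelope illustration of the bit-rate ranges quoted for the natural audio tools (parametric, CELP, and the time/frequency coder described next), a receiver or author might route material along these lines. The thresholds are approximations taken from the figures in the text.

```python
# Illustrative mapping of target bit rate and signal type to an MPEG-4 audio tool,
# using the approximate ranges quoted in the article.
def pick_audio_tool(bitrate_kbps: float, is_speech: bool) -> str:
    if bitrate_kbps < 6:
        return "parametric (HVXC)" if is_speech else "parametric (individual lines)"
    if bitrate_kbps <= 24 and is_speech:
        return "CELP (narrow-band or wide-band)"
    return "time/frequency (AAC-based)"

print(pick_audio_tool(2, is_speech=True))    # parametric (HVXC)
print(pick_audio_tool(12, is_speech=True))   # CELP (narrow-band or wide-band)
print(pick_audio_tool(64, is_speech=False))  # time/frequency (AAC-based)
```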

Time/frequency coding
This coding scheme is characterized best by coding the input signal's spectrum. The input signal is first transformed into a different representation that gives access to its spectral components. Since these codecs don't rely on a special model of the input signal, they suit encoding any type of input signal. The usable bit rate ranges from 16 Kbps for a 7-kHz audio bandwidth to up to more than 64 Kbps per audio channel for CD-like quality coding of mono, stereo, or multichannel audio.

The most important tools included in the time/frequency part derive from the MPEG-2 Advanced Audio Coding (AAC) standard. Many optional tools modify one or more of the spectra to provide more efficient coding. For those operating in the spectral domain, the option to "pass through" is retained, and in all cases where a spectral operation is omitted, the spectra at its input pass directly through the tool without modification.

Synthetic and natural hybrid coding
SNHC deals with the representation and coding of synthetic (2D and 3D graphics) and natural (still images and natural video) audiovisual information. SNHC represents an important aspect of MPEG-4 for combining mixed media types including streaming and downloaded A/V objects.

SNHC fields of work include 2D and 3D graphics, human face and body description and animation, integration of text and graphics, scalable textures encoding, 2D/3D mesh coding, hybrid text-to-speech coding, and synthetic audio coding (structured audio).

Media integration of text and graphics
MITG provides a way to encode, synchronize, and describe the layout of 2D scenes composed of animated text, audio, video, synthetic graphic shapes, and pointers. The 2D BIFS graphics objects derive from and are a restriction of the corresponding VRML 2.0 3D nodes. Many different types of textures can be mapped on plane objects: still images, moving pictures, complete MPEG-4 scenes, or even user-defined patterns. Alternatively, many material characteristics (color, transparency, border type) can be applied on 2D objects.

Other VRML-derived nodes are the interpolators and the sensors. Interpolators allow predefined object animations like rotations, translations, and morphing. Sensors generate events that can be redirected to other scene nodes to trigger actions and animations. The user can generate events, or events can be associated to particular time instants.

MITG provides a Layout node to specify the placement, spacing, alignment, scrolling, and wrapping of objects in the MPEG-4 scene. Still images or video objects can be placed in a scene graph in many ways, and they can be texture-mapped on any 2D object. The most common way, though, is to use the Bitmap node to insert a rectangular area in the scene in which pixels coming from a video or still image can be copied.

The 2D scene graphs can contain audio sources by means of the Sound2D nodes. Like visual objects, they must be positioned in space and time. They are subject to the same spatial transformations of their parents' nodes hierarchically above them in the scene graph.

Text can be inserted in a scene graph through the Text node. Text characteristics (font, size, style, spacing, and so on) can be customized by means of the FontStyle node.
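A toy 2D scene built from the node types just mentioned might be organized as below; the Python classes are hypothetical stand-ins for the Layout, Bitmap, Text, and FontStyle nodes, since the real scene description is carried in BIFS.

```python
# Hypothetical stand-ins for some MITG / 2D BIFS node types.
from dataclasses import dataclass, field
from typing import List

@dataclass
class FontStyle:
    family: str = "SERIF"
    size: float = 12.0

@dataclass
class Text:
    string: str
    font_style: FontStyle = field(default_factory=FontStyle)

@dataclass
class Bitmap:                 # rectangular area receiving pixels from a video or still image
    stream_id: int

@dataclass
class Layout:                 # placement, spacing, alignment, scrolling, wrapping
    wrap: bool
    children: List = field(default_factory=list)

caption = Text("Le tour de France", FontStyle("SANS", 18.0))
scene_2d = Layout(wrap=True, children=[Bitmap(stream_id=3), caption])
```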


Figure 4 shows a rather complicated MPEG-4 scene from "Le tour de France" with many different object types like video, icons, text, still images for the map of France and the trail map, and a semitransparent pop-up menu with clickable items. These items, if selected, provide information about the race, the cyclists, the general placing, and so on.

Figure 4. An MPEG-4 application called "Le tour de France" featuring many different A/V objects.

3D graphics
The advent of 3D graphics triggered the extension of MPEG-4 to the third dimension. BIFS 3D nodes—an extension of the ones defined in VRML specifications—allow the creation of virtual worlds. Like in VRML, it's possible to add behavior to objects through Script nodes. Script nodes contain functions and procedures (the terminal must support the Javascript language) that can define arbitrary complex behaviors like performing object animations, changing the values of nodes' fields, modifying the scene tree, and so on. MPEG-4 allows the creation of much more complex scenes than VRML, of 2D/3D hybrid worlds where contents are not downloaded once but can be streamed to update the scene continuously.

Face animation
Face animation focuses on delineating parameters for face animation and definition. It has a very tight relationship with hybrid scalable text-to-speech synthesis for creating interesting applications based on speech-driven avatars. Despite previous research on avatars, the face animation work is the first attempt to define in a standard way the sets of parameters for synthetic anthropomorphic models.

Face animation is based on the development of two sets of parameters: facial animation parameters (FAPs) and facial definition parameters (FDPs). FAPs allow having a single set of parameters regardless of the face model used by the terminal or application. Most FAPs describe atomic movements of the facial features; others (expressions and visemes) define much more complex deformations. Visemes are the visual counterparts of phonemes and hence define the position of the mouth (lips, jaw, tongue) associated with phonemes. In the context of MPEG-4, the expressions mimic the facial expressions associated with human primary emotions like joy, anger, fear, surprise, sadness, and disgust.

Animated avatars' animation streams fit very low bit-rate channels (about 4 Kbps). FAPs can be encoded either with arithmetic encoding or with discrete cosine transform (DCT).

FDPs are used to calibrate (that is, modify or adapt the shape of) the receiver terminal default face models or to transmit completely new face model geometry and texture.

2D mesh encoding
A 2D mesh object in MPEG-4 represents the geometry and motion of a 2D triangular mesh, that is, tessellation of a 2D visual object plane into triangular patches. A dynamic 2D mesh is a temporal sequence of 2D triangular meshes. The initial mesh can be either uniform (described by a small set of parameters) or Delaunay (described by listing the coordinates of the vertices or nodes and the edges connecting the nodes). Either way, it must be simple—it cannot contain holes.

Once the mesh has been defined, it can be animated by moving its vertices and warping its triangles. To achieve smooth animations, motion vectors are represented and coded with half-pixel accuracy. When the mesh deforms, its topology remains unchanged. Updating the mesh shape requires only the motion vectors that express how to move the vertices in the new mesh.

An example of a rectangular mesh object borrowed from the MPEG-4 specification appears in Figure 5.

Figure 5. Mesh object with uniform triangular geometry.

Dynamic 2D meshes inserted in an MPEG-4 scene create 2D animations. This results from mapping textures (video object planes, still images, 2D scenes) onto 2D meshes.
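The update mechanism, in which vertices move by half-pixel-accurate motion vectors while the triangulation stays fixed, can be sketched as follows (illustrative data structures, not the coded mesh syntax):

```python
# Illustrative 2D dynamic mesh: topology (triangles) is fixed; each update
# moves the vertices by motion vectors coded with half-pixel accuracy.
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Mesh2D:
    vertices: List[Tuple[float, float]]
    triangles: List[Tuple[int, int, int]]   # indices into vertices; never changes

def apply_motion(mesh: Mesh2D, motion_vectors: List[Tuple[int, int]]) -> Mesh2D:
    """Motion vectors are in half-pixel units, one per vertex."""
    moved = [(x + dx * 0.5, y + dy * 0.5)
             for (x, y), (dx, dy) in zip(mesh.vertices, motion_vectors)]
    return Mesh2D(moved, mesh.triangles)

mesh = Mesh2D([(0, 0), (16, 0), (8, 16)], [(0, 1, 2)])
mesh = apply_motion(mesh, [(1, 0), (1, 0), (2, -1)])   # warp the single triangle
print(mesh.vertices)    # [(0.5, 0.0), (16.5, 0.0), (9.0, 15.5)]
```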

Texture coding
MPEG-4 supports an ad-hoc tool for encoding textures and still images based on a wavelet algorithm that provides spatial and quality scalability, content-based (arbitrarily shaped) object coding, and very efficient compression over a large range of bit rates. Texture scalability comes through many (up to 11) different levels of spatial resolutions, allowing progressive texture transmission and many alternative resolutions (the analog of mipmapping in 3D graphics). In other words, the wavelet technique provides for scalable bit-stream coding in the form of an image-resolution pyramid for progressive transmission and temporal enhancement of still images. For animation, arbitrarily shaped textures mapped onto 2D dynamic meshes yield animated video objects with a very limited data transmission.

Texture scalability can adapt texture resolution to the receiving terminal's graphics capabilities and the transmission rate to the channel bandwidth. For instance, the encoder may first transmit a coarse texture and then refine it with more texture data (levels of the resolution pyramid).

Structured Audio
Structured Audio allows creating synthetic sounds starting from coded input data. A special synthesis language called Structured Audio Orchestra Language (SAOL) permits defining a synthetic orchestra whose instruments can generate sounds like real musical instruments or process prestored sounds. MPEG-4 doesn't standardize SAOL's methods of generating sounds; it standardizes the method of describing synthesis.

Downloading scores in the bit stream controls the synthesis. Scores resemble scripts in a special language called Structured Audio Score Language (SASL). They consist of a set of commands for the various instruments. These commands can affect different instruments at different times to generate a large range of sound effects. If fine control over the final synthesized sound isn't needed, it's easier to control the orchestra through the MIDI format. Supporting MIDI adds to MPEG-4 scenes the ability to reuse and import a huge quantity of existing audio contents.

The bit rate needed by Synthetic Audio applications ranges from few bits per second to 2 or 3 Kbps when controlling many instruments and performing very fine coding. For terminals with less functionality, and for applications that don't need sophisticated synthesis, MPEG-4 also standardizes a wavetable bank format. This format permits downloading sound samples for use in wavetable synthesis, as well as simple processing tools (reverb, chorus, and so on).

MPEG-4 text-to-speech
MPEG-4 doesn't define a specific text-to-speech technique but rather the binary representation of a TTS stream and the interfaces of an MPEG-4 text-to-speech (M-TTS) system with the other parts of an MPEG-4 decoder. An M-TTS stream may contain many different information types about the synthetic voice apart from text: gender, age, speech rate, prosody, and lip shape information. It may contain fields that allow trick mode (fast-forwarding, pausing, playing, or rewinding the synthetic speech).

An M-TTS stream can also carry the International Phonetic Alphabet (IPA) coded phonemes with their time duration. Handed to the face animation engine in the MPEG-4 player, they can produce speech-driven face animation. In this case the face animation system doesn't receive a FAP stream from the MPEG-4 demultiplexer; instead it converts phonemes into visemes and uses them to perform the face model deformations. The phoneme duration synchronizes model animation and speech.

Interestingly, such applications require a tiny channel bandwidth—from 200 bps to 1.2 Kbps.

This concludes part 1. We'll look at applications and what comes next for MPEG-4 in part 2. MM

For More Information
ISO official site: http://www.iso.ch/
MPEG official site: http://www.cselt.it/mpeg/
MPEG-4 Systems site: http://garuda.imag.fr/MPEG4/
MPEG-4 Visual site: http://wwwam.hhi.de/mpeg-video/
MPEG-4 Audio site: http://www.tnt.uni-hannover.de/project/mpeg/audio/
MPEG-4 SNHC site: http://www.es.com/mpeg4-snhc/
MPEG-4 Synthetic Audio site: http://sound.media.mit.edu/mpeg4/
Web3D (formerly VRML) official site: http://www.web3d.org/
IPA (International Phonetic Alphabet) site: http://www.arts.gla.ac.uk/IPA/ipa.

References
Available to MPEG members or from ISO (http://www.iso.ch) or the national standards bodies (for example, the American National Standards Institute, or ANSI, in the US):
1. MPEG-4 Part 1: Systems (IS 14496-1), doc. N2501, Atlantic City, N.J., USA, Oct. 1998.
2. MPEG-4 Part 2: Visual (IS 14496-2), doc. N2502, Atlantic City, N.J., USA, Oct. 1998.
3. MPEG-4 Part 3: Audio (IS 14496-3), doc. N2503, Atlantic City, N.J., USA, Oct. 1998.
4. MPEG-4 Part 6: DMIF (IS 14496-6), doc. N2506, Atlantic City, N.J., USA, Oct. 1998.
5. VRML (IS 14772-1), "Virtual Reality Modeling Language," April 1997.

Readers may contact Casalino at Ernst & Young Consultants, Corso Vittorio Emanuele II, n. 83, 10128 Torino, Italy, e-mail [email protected].

Contact Standards editor Peiya Liu, Siemens Corporate Research, 755 College Road East, Princeton, NJ 08540, e-mail [email protected].
