On the Applications of Multimedia Processing to Communications
Scanning the Technology

RICHARD V. COX, FELLOW, IEEE, BARRY G. HASKELL, FELLOW, IEEE, YANN LECUN, MEMBER, IEEE, BEHZAD SHAHRARAY, MEMBER, IEEE, AND LAWRENCE RABINER, FELLOW, IEEE

Invited Paper

The challenge of multimedia processing is to provide services that seamlessly integrate text, sound, image, and video information and to do it in a way that preserves the ease of use and interactivity of conventional plain old telephone service (POTS) telephony, irrelevant of the bandwidth or means of access of the connection to the service. To achieve this goal, there are a number of technological problems that must be considered, including:

• compression and coding of multimedia signals, including algorithmic issues, standards issues, and transmission issues;
• synthesis and recognition of multimedia signals, including speech, images, handwriting, and text;
• organization, storage, and retrieval of multimedia signals, including the appropriate method and speed of delivery (e.g., streaming versus full downloading), resolution (including layering or embedded versions of the signal), and quality of service, i.e., perceived quality of the resulting signal;
• access methods to the multimedia signal (i.e., matching the user to the machine), including spoken natural language interfaces, agent interfaces, and media conversion tools;
• searching (i.e., based on machine intelligence) by text, speech, and image queries;
• browsing (i.e., based on human intelligence) by accessing the text, by voice, or by indexed images.

In each of these areas, a great deal of progress has been made in the past few years, driven in part by the relentless growth in multimedia personal computers and in part by the promise of broad-band access from the home and from wireless connections. Standards have also played a key role in driving new multimedia services, both on the POTS network and on the Internet.

It is the purpose of this paper to review the status of the technology in each of the areas listed above and to illustrate current capabilities by describing several multimedia applications that have been implemented at AT&T Labs over the past several years.

Keywords—AAC, access, agents, audio coding, cable modems, communications networks, content-based video sampling, document compression, fax coding, H.261, HDTV, image coding, image processing, JBIG, JPEG, media conversion, MPEG, multimedia, multimedia browsing, multimedia indexing, multimedia searching, optical character recognition, PAC, packet networks, perceptual coding, POTS telephony, quality of service, speech coding, speech compression, speech processing, speech recognition, speech synthesis, spoken language interface, spoken language understanding, standards, streaming, teleconferencing, video coding, video telephony.

Manuscript received June 9, 1997; revised December 3, 1997. The Guest Editor coordinating the review of this paper and approving it for publication was T. Chen.
The authors are with the Speech and Image Processing Services Research Laboratory, AT&T Labs, Florham Park, NJ 07932-0971 USA.
Publisher Item Identifier S 0018-9219(98)03279-4.
0018–9219/98$10.00 © 1998 IEEE
PROCEEDINGS OF THE IEEE, VOL. 86, NO. 5, MAY 1998

I. INTRODUCTION

In a very real sense, virtually every individual has had experience with multimedia systems of one type or another. Perhaps the most common multimedia experiences are reading the daily newspaper or watching television. These may not seem like the exotic multimedia experiences that are discussed daily in the media or on television, but nonetheless, these are multimedia experiences.

Before proceeding further, it is worthwhile to define exactly what constitutes a multimedia experience or a multimedia signal so we can focus clearly on a set of technological needs for creating a rich multimedia communications experience. The dictionary definition of multimedia is:

    including or involving the use of several media of communication, entertainment, or expression.

A more technological definition of multimedia, as it applies to communications systems, might be the following:

    integration of two or more of the following media for the purpose of transmission, storage, access, and content creation:

    • text;
    • images;
    • graphics;
    • speech;
    • audio;
    • video;
    • animation;
    • handwriting;
    • data files.

With these definitions in mind, it should be clear that a newspaper constitutes a multimedia experience since it integrates text and halftone images and that television constitutes a multimedia experience since it integrates audio and video signals. However, for most of us, when we think about multimedia and the promise for future communications systems, we tend to think about movies like Who Framed Roger Rabbit? that combine video, graphics, animation with special effects (e.g., morphing of one image to another) and compact disc (CD)-quality audio. On a more business-oriented scale, we think about creating virtual meeting rooms with three-dimensional (3-D) realism in sight and sound, including sharing of whiteboards, computer applications, and perhaps even computer-generated business meeting notes for documenting the meeting in an efficient communications format. Other glamorous applications of multimedia processing include:

• distance learning, in which we learn and interact with instructors remotely over a broad-band communication network;
• virtual library access, in which we instantly have access to all of the published material in the world, in its original form and format, and can browse, display, print, and even modify the material instantaneously;
• living books, which supplement the written word and the associated pictures with animations and hyperlink access to supplementary material.

It is important to distinguish multimedia material from what is often referred to as multiple-media material. To illustrate the difference, consider using the application of messaging. Today, messaging consists of several types, including electronic mail (e-mail), which is primarily text messaging, voice mail, image mail, video mail, and handwritten mail [often transmitted as a facsimile (fax) document]. Each of these messaging types is generally (but not always) a single medium and, as such, is associated with a unique delivery mechanism and a unique repository or mailbox. For convenience, most consumers would like to have all messages (regardless of type or delivery mechanism) delivered to a common repository or mailbox—hence the concept of multiple media’s being integrated into a single location. Eventually, the differences between e-mail, voice mail, image mail, video mail, and handwritten mail will disappear, and they will all be seamlessly integrated into a true multimedia mail system that will treat all messages equally in terms of content, display mechanism, and even media translation (converting the media that are not displayable on the current access device to a medium that is displayable, e.g., text messages to voice messages for playback over a conventional

A. Elements of Multimedia Systems

There are two key communications modes in which multimedia systems are generally used, namely, person-to-person (or equivalently people-to-people) communications and person-to-machine (or equivalently people-to-machine) communications. Both of these modes have a lot of commonality, as well as some differences. The key elements are shown in Fig. 1.

Fig. 1. Elements of multimedia systems used in (a) person-to-person and (b) person-to-machine modes.

In the person-to-person mode, shown in Fig. 1(a), there is a user interface that provides the mechanisms for all users to interact with each other and a transport layer that moves the multimedia signal from one user location to some or all other user locations associated with the communications. The user interface has the job of creating the multimedia signal, i.e., integrating seamlessly the various media associated with the communications, and allowing users to interact with the multimedia signal in an easy-to-use manner. The transport layer has the job of preserving the quality of the multimedia signals so that all users receive what they perceive to be high-quality signals at each user location. Examples of applications that rely on the person-to-person mode include teleconferencing, video phones, distance learning, and shared workspace scenarios.

In the person-to-machine mode, shown in Fig. 1(b), there is again a user interface for interacting with the machine, along with a transport layer for moving the multimedia signal from the storage location to the user, as well as a mechanism for storage and retrieval of multimedia signals that are either created by the user or requested by the user. The storage and retrieval mechanisms involve browsing and searching (to find existing multimedia data) and storage and archiving to move user-created multimedia data to the appropriate place for access by others. Examples of applications that rely on the person-to-machine mode include creation and access of business meeting notes, access of broadcast video and audio documents and performances, and access to video and document archives from a digital library or other repositories.

B. Driving Forces in the Multimedia Communications Revolution

Modern voice communications networks evolved around the turn of the twentieth century with a focus on creating universal service, namely, the ability to automatically connect any telephone user with any other telephone user