INVITED PAPER

The Road to Immersive Communication

This comprehensive description of rich-media communications defines how quality of experience should be understood. The increasing role of new modalities such as augmented reality and haptics is discussed.

By John G. Apostolopoulos, Fellow IEEE, Philip A. Chou, Fellow IEEE, Bruce Culbertson, Member IEEE, Ton Kalker, Fellow IEEE, Mitchell D. Trott, Fellow IEEE, and Susie Wee, Fellow IEEE

ABSTRACT | Communication has seen enormous advances over the past 100 years, including radio, television, mobile phones, video conferencing, and Internet-based voice and video calling. Still, remote communication remains less natural and more fatiguing than face-to-face. The vision of immersive communication is to enable natural experiences and interactions with remote people and environments in ways that suspend disbelief in being there. This paper briefly describes the current state of the art of immersive communication, provides a vision of the future and the associated benefits, and considers the technical challenges in achieving that vision. The attributes of immersive communication are described, together with the frontiers of video and audio for achieving them. We emphasize that the success of these systems must be judged by their impact on the people who use them. Recent high-quality video conferencing systems are beginning to deliver a natural experience, provided all participants are in custom-designed studios. Ongoing research aims to extend the experience to a broader range of environments. Augmented reality has the potential to make remote communication even better than being physically present. Future natural and effective immersive experiences will be created by drawing upon intertwined research areas including multimedia signal processing, computer vision, graphics, networking, sensors, displays and sound reproduction systems, haptics, and perceptual modeling and psychophysics.

KEYWORDS | Immersive environments; signal processing; telepresence; video conferencing

Manuscript received May 22, 2011; revised September 19, 2011; accepted October 20, 2011. Date of publication February 16, 2012; date of current version March 21, 2012.
J. G. Apostolopoulos, B. Culbertson, and M. D. Trott are with Hewlett-Packard Laboratories, Palo Alto, CA 94304 USA (e-mail: [email protected]).
P. A. Chou is with Microsoft Research, Redmond, WA 98052 USA.
T. Kalker is with Huawei Innovation Center, Santa Clara, CA 95050 USA.
S. Wee is with Cisco Systems, San Jose, CA 95134 USA.
Digital Object Identifier: 10.1109/JPROC.2011.2182069

I. INTRODUCTION

Communication technology strives to pull us together faster than transportation technology and globalization push us apart. Today we can communicate with almost anyone else in the world by voice, e-mail, text, and even video. This has permitted us to serve our customers far from where we work, work far from where we live, and live far from our friends and families [1]. Yet today's communication experiences are impoverished compared to face-to-face interaction. Thus, while advances in communication have increased the quantity of connections we can maintain, they have failed to preserve their quality.

Humans are social animals, and have evolved to interact with each other most effectively face-to-face. As humans, we continue to meet in person whenever feasible. We continue to meet in conference rooms; we continue to commute to work; we continue to fly to conferences, trade shows, and weddings [2]. The reasons are clear. Colocation enables us to exhibit and interpret various nonverbal signals and cues, such as touch, proximity, position, direction of gesture, eye contact, level of attention, etc. [3]–[5]. It enables us to greet each other with a handshake or embrace, dine together, and appreciate objects in a shared environment. Today's technologies do not permit us to perform many of these social interactions.

A critical point is that the quality of an immersive communication system is judged by its impact on the humans who use it. Unfortunately, it is exceedingly difficult to develop objective metrics to assess that impact.


Therefore, designing, analyzing, and predicting the future of immersive communication systems is as much art as science.

Our need for more effective ways to communicate with remote people has never been greater. Numerous societal conditions are aligning to create this need: the urgency of reducing environmental impact, the demand to reduce travel costs and fatigue, the need for richer collaboration in an ever more complex business environment, and the difficulty of travel during natural disasters such as volcanic eruptions and influenza outbreaks.

In this paper, we attempt to indicate how, despite the wide gap between today's communication technologies and the experience of actually being in another location, this gap will eventually be substantially reduced. Technology advances faster than biology, and soon it will be practical to deliver more bits per second to an observer than can pass through the sensory cutset separating the environment from the nervous system. The human visual system can absorb only about 10 Mb/s; the haptic system about 1 Mb/s; the auditory and olfactory systems about 100 kb/s each; and the gustatory system about 1 kb/s [6]. In contrast, fiber-to-the-home is already delivering on the order of 10 Mb/s, while links in the Internet backbone are approaching 1 Tb/s [7]. With such vast information rates already possible, the question is not how to communicate information; rather, it is about transduction: how to capture and render that information in ways that match the human system, and, more fundamentally, what we want our communication and remote experiences to be like.

The answer to the latter question, to the extent that popular culture reflects our desires, is immersive. For the purposes of this paper, immersive communication means exchanging natural social signals with remote people, as in face-to-face meetings, and/or experiencing remote locations (and having the people there experience you) in ways that suspend disbelief in being there. There is transparency, fluidity of interaction, a sense of presence. Star Trek's Holodeck, Star Wars' Jedi council meetings, The Matrix's matrix, and Avatar's Pandora all illustrate visions of immersive communication.

A key feature of immersive communication is the ability for participants to interact with the remote environment, and to detect and evaluate that interaction via their senses. Visually, as we move, we see the world from different perspectives, the world reacts, and people we are looking at react, helping us feel as if we are part of the scene. Similarly, we hear the scene as we turn around, the scene adapts to us, and we hear the results. This two-way interaction gives us a feeling of immersion. Other senses are analogous, but are technically harder to convey. Our vestibular sense (balance and acceleration) is coupled deeply to our visual system, and inconsistency between these two can be profoundly distressing. Touch is the next most important sense after vision and hearing, but we will only briefly discuss it in this paper. Communication of smell and taste is in its infancy, and we will not dwell upon it here.

What humans are able to sense at a given moment is a tiny slice of the full environment. For example, high-resolution detail cannot be perceived outside the current center of gaze, loud sounds mask quiet ones, and polarized light cannot be directly sensed at all. Exploiting these features of human perception can vastly simplify immersive communication.

Immersion evokes a state of mind or emotion of being in another place, or of others being in your place; that is, suspending your disbelief in being elsewhere, as when you are immersed in a book, movie, or video game. These examples indicate the power of the human ability to "fill in" missing or inconsistent audio, video, and related cues, a power exploited by present and likely future immersive systems.

We do not posit a single ideal notion or degree of immersion. Far from it: there will always be a range of communication needs and purposes, each served best by different means. Even simple text messaging is likely to survive for good reasons. Certain immersive forms of communication may be undesirable due to lack of privacy and attention requirements [8], [15]. Price, portability, power, and other constraints will prevail. For example, a person may shift between a handheld, a wall-sized, and a head-mounted display, each providing a rather different degree and style of immersion. Virtual spaces need not depict a perfectly consistent physical geometry; indeed, when combining several remote sites, nonphysical layouts may be best. Thus, diverse and partial illusions of immersion will continue to be achieved by ingenious blending of immutable elements of the physical environment with the virtual.

Similarly, there is no single technical roadmap to immersive communication. The invention, for example, of ideal light field cameras and displays would not suddenly enable immersive communication for a smartphone user; the camera will not be positioned well, and even a near-perfect 9-cm screen remains but a small window into another space. To enable immersive experiences in this and many other scenarios, technical innovations must be paired with equally challenging nontechnical ones in areas such as creative design.

In addition to natural ("as good as being there") communication enabled by immersion, supernatural ("better than being there") communication will also be important. For example, it will be possible to look or listen across large distances, rewind or accelerate time, speed through space, record experiences, or provide translation abilities. A politician could make eye contact with everyone, simultaneously. An orchestra, limited today by acoustic delays to perhaps a hundred people, could swell to thousands of networked participants.

Many open questions remain about how to foster the illusion of immersion. Transparency and fluidity of interaction likely play an important role.


For example, joining an interaction today is far different from physically entering a room or encountering someone at a party. Immersion could also be interpreted to include being bathed in a sea of data, such as text message feeds from compatriots during a revolution. Other extrasensory aspects may also be important. Nevertheless, in this paper, we will focus primarily on aspects of immersive communication involving perception and the senses, particularly auditory and visual information.

We believe that highly immersive communication will become ubiquitous over the coming decades, as the cost of computation, bandwidth, and device resolution continues to drop. The goal of this paper is to illustrate possible technical paths to this end.

An immersive communication system consists of the user experience, sensory signal processing, and communications, as shown in Fig. 1. The sense of immersion that a user experiences is driven by the quality of experience (QoE) that the system provides. The QoE rests, in turn, on how video, audio, and other sensory signals are captured, processed, and rendered. Ultimately, in an immersive environment, these signals rely for their reproduction on more bandwidth and more sophisticated communication than is common today.

Fig. 1. The elements of an immersive communication system.

The outline of the paper is as follows. After a brief history of immersive communication in Section II, we lay out a framework for thinking about various categories of immersive scenarios and applications in Section III, including a discussion on augmented reality, mobility, robots, and virtual worlds. We establish the technical core of the paper in Sections IV–VI, which lay out the fundamental notions of sensory signal processing and communications, specifically in the immersive rendering, acquisition, and compression of visual and auditory immersive communication. Special emphasis is paid to the interaction between rendering and acquisition due to interactivity requirements. We also briefly discuss in Section VI the important topic of haptics and its compression and communication. Section VII discusses the role of human perception on QoE, including peripheral awareness, consistency between local and remote spaces, and 3-D cues. Section VIII provides concluding remarks.

Two types of scenarios are helpful to consider throughout, as they illustrate a wide range of human interaction and motivate corresponding forms of immersive communication. These scenarios include small group collaborations such as business meetings and large social gatherings such as weddings. Business meetings typically have, say, 3–30 participants in a relatively constrained environment with a relatively limited set of interactions. Weddings are usually an order of magnitude bigger, and present a much wider range of social interactions. Immersive communication for business meetings is on the immediate horizon, while immersive communication for weddings may be more than a generation away from full technical feasibility and social acceptance. However, both scenarios, as well as other scenarios of interest (such as lectures, dining, sporting events, gaming, conferences, trade shows, or poster sessions), share a common need for accurate communication of social signals and interactions with the environment, made possible by immersive communication.

II. BRIEF HISTORY OF IMMERSIVE COMMUNICATION

Long before it could be realized, people dreamed of technology that would communicate sight as well as sound. The invention of the telephone spawned a flurry of examples, including George du Maurier's 1878 cartoon in Punch Almanack [9], shown in Fig. 2.

Fig. 2. George du Maurier anticipates immersive communication in this 1878 cartoon.


Fanciful depictions in movies and on TV, including Metropolis (1927), Modern Times (1936), 2001: A Space Odyssey (1968), Star Trek (1966), and The Jetsons (1962), offer hints of people's hopes and expectations for immersive communication, and actually inspire developers of the technology. Scientists have speculated more seriously on the possibilities of such technology. Alexander Graham Bell [10] wrote about seeing with electricity, Ivan Sutherland [11] expounded on ideal displays and the interaction they would enable, and Marvin Minsky [8] refined and popularized the vision of telepresence. Minsky's telepresence focused on high-quality sensory feedback of sight, sound, and touch/manipulation, which would enable people to work remotely, for example, to perform an operation or work in a damaged nuclear reactor.

The telephone and television are critical enabling technologies for immersive communication. A number of inventors contributed to the telephone, with Alexander Graham Bell receiving the key patent in 1876. Many television-like devices were developed in the early 20th century. A system demonstrated in 1928 [12] used electronic scanning for both image capture and display, anticipating virtually all succeeding television systems. One of the first television broadcasts was the 1936 Berlin Olympic Games.

Reasonable people will differ on when the history of actual immersive communication systems begins, due to varying judgments of the amount of immersiveness a system needs to qualify. The German Reichspost, or post office, built one of the first public video conference networks, connecting Berlin, Nuremberg, Munich, and Hamburg, operating from 1936 to 1940. AT&T demonstrated its videophone, trademarked Picturephone, at the 1964 New York World's Fair. The product was dropped after a few years due to its low video quality and high cost (US$16 per three minutes). Though quality and cost gradually improved, they continued to limit the commercial success of these systems into the 1990s. At the end of that decade, systems with high-quality, life-size video began to appear from companies like TeleSuite, Teliris, Hewlett-Packard, Polycom, and Cisco, albeit still at a high price. Microsoft NetMeeting, Apple iChat, and Skype, which added video to PC conferencing beginning in 1995, 2003, and 2006, respectively, provided three additional features whose absence had inhibited widespread use of many earlier systems: ubiquity, ease of use, and extremely low cost. The Universal Mobile Telecommunications System (UMTS) enabled video calling on cell phones in the early 2000s. International standards have been created by the ITU, ISO/IEC, IETF, and other standardization bodies to foster interoperability across endpoints developed by different manufacturers and networks operated by different providers.

With new applications appearing at an accelerating pace, immersive communication is poised to advance explosively, aided by progress in computation, bandwidth, and device capabilities. It is also significantly benefiting from the flexibility of the generic data carried by IP packets, in sharp contrast to analog television signaling. Though initially often used by communication system developers to mimic earlier video functionality, IP is enabling a host of enhanced and entirely new modes of communication.

Cave Automatic Virtual Environments (CAVEs) [13] surround users with displays on the walls, and sometimes even the ceiling and floor, literally immersing them in imagery. Haptic systems convey the sense of touch, for example, enhancing computer games and helping pilots fly airplanes remotely. The concept of virtual reality, which ultimately would stimulate the senses to create an experience indistinguishable from reality, significantly predates actual immersive communication systems. Augmented reality enhances a depiction of a real environment, for example, annotating live video of a mechanism to facilitate a repair.

III. CLASSES OF IMMERSIVE SYSTEMS

This section points out various classes of immersive systems by means of the framework illustrated in Fig. 3, which shows a spectrum of communication and collaboration systems ranging from communication systems designed to support natural conversation between people, e.g., using voice and video, to collaboration systems designed to support sharing information between people. The framework shows how currently existing systems anywhere along this spectrum have become, and are continuing to become, increasingly immersive.

Fig. 3. Communication and collaboration framework.

Specifically, communication systems for natural conversation have progressed from phone calls, to multiparty video conferences, to high-end telepresence systems such as HP Halo and Cisco Telepresence that share voices and life-size video at a quality that gives participants the feeling of being in the same room.


This progression increases the presence and connection between remote participants and leads to more advanced immersive communication systems, which are evolving in a number of directions. One form is a high-fidelity telepresence system in which space is virtually extended by adjoining remote locations through ideal, wall-sized displays. Another form of immersive communication gives remote participants a virtual 3-D presence by capturing and analyzing their movements and rendering them locally using a type of augmented reality. A third form gives remote participants a physical presence through telerobotic conferences. In these systems, the video, voice, and movements of a remote participant are captured and analyzed, and are then emulated by a physical robot. The remote participant may even control the robot's movement to navigate through an area [16].

Collaboration systems for sharing information have progressed from telegraphs and e-mail, to web conferencing systems that allow the sharing of documents, applications, and computer desktops, to interactive shared whiteboards. Another degree of information sharing is achieved by augmenting the collaboration topic at hand with digital data mined from the Internet and social media tools such as Twitter and Facebook, which allow large numbers of participants to contribute to a conversation both synchronously and asynchronously. We can imagine these tools evolving into truly immersive systems in which augmented reality techniques are used to share data between participants, for example, allowing interactive 3-D manipulation of objects that is augmented by an inflow of contextually relevant digital information.

One can consider a continuum of systems that combine natural conversation and sharing information to create shared environments for collaboration. IM chat can be considered one such system, as a chat room is a shared space where people can go to share information. Further immersion is achieved in allowing participants to join a shared environment in online multiplayer games such as Xbox Live and in virtual worlds such as Second Life. Pose-controlled environments such as Avatar Kinect allow people to be represented by avatars that emulate their expressions and movements. Advanced immersive and shared environments can take different forms: head-mounted virtual reality systems allow a participant to more fully experience the shared environment. An ultimate goal is to mimic collaborative virtual environments depicted in science fiction such as the Holodeck in Star Trek.

Thus, immersive communication is evolving in a variety of directions. One possible goal is to emulate reality, in which case high-fidelity, realistic renderings are important and spatial relationships must be maintained. Another possible goal is to achieve a state of immersive communication that goes beyond reality, for example, enabling millions of people to attend a lecture or a music concert in which everyone has a front row seat.

The next few sections address technologies expected to underlie a large variety of such immersive communication systems.

IV. VISUAL RENDERING, CAPTURE, AND INTERACTION

Visual information is a key element of immersive communication. This section begins by considering the display and rendering of visual information, and the desired attributes to provide an immersive experience. Given these attributes, we examine the capture of the visual information. The desire to provide natural interaction between people leads to additional constraints on the capture and rendering.

A. Visual Rendering

The goal of the visual rendering is to provide an immersive visual experience for each viewer. An ideal display places a light field across the pupil of each eye as if the viewer were viewing the scene directly. This has several key attributes: each eye of each viewer sees an appropriate view (stereoscopy), the view changes as the viewer moves (motion parallax), and objects move in and out of focus as the viewer changes his or her focal plane. Peripheral vision is also very important; an ideal display fills the field of view. If the display is very close to a viewer, for example, head-mounted, it can completely fill the field of view in a compact space. More distant displays must be large enough to wrap around the viewer to achieve such an effect. Ideal displays can thus be either large or small, with the caveat that the larger, more distant displays can be occluded by objects in the real world, and hence such displays may not always be able to strictly control the light reaching the viewer.

A light field [23], [24], or, more generally, a plenoptic function [17], expresses the intensity of light rays within a space. Formally, the 7-D plenoptic function P(x, y, z, θ, φ, λ, t) is the intensity of the light ray passing through spatial location (x, y, z), at angle (θ, φ), for wavelength λ, and at time t. Polarization and phase (important for holography) can also be included by considering the vector-valued function P⃗(x, y, z, θ, φ, λ, t), where P⃗ is the vector strength of the electric (or magnetic) field in the plane perpendicular to the direction of the light ray.
Since light travels through free space in a straight line (ignoring relativistic effects), essentially all light rays that pass through a volume of space also intersect a nearby plane. Hence, the plenoptic function within the volume can be characterized by the value of the plenoptic function on the plane. Parameterizing positions on the plane by x and y, and suppressing λ and t for simplicity, we get the simplified 4-D plenoptic function P(x, y, θ, φ), known as the light field or lumigraph in computer vision and computer graphics [23], [24].

An ideal flat display, whether large or small, is able to reproduce a light field P(x, y, θ, φ) across its surface, such as would be produced across an open window. That is, each spatial location (x, y) on the surface of the ideal display is able to cast different light rays in different directions (θ, φ), as illustrated in Fig. 4 (left).
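To make the discretized light field concrete, the following minimal sketch (our illustration, not from the cited literature) stores a sampled P(x, y, θ, φ) as an array and looks up individual rays; the resolutions and the 90° field of view are assumptions chosen only to keep the example small.

```python
import numpy as np

# A minimal sketch: a discretized 4-D light field P(x, y, theta, phi)
# stored as an RGB array. Sampling resolutions and field of view are
# illustrative assumptions, not values from any real display.
NX, NY = 64, 48            # spatial samples across the display surface
NT, NP = 16, 16            # angular samples per spatial location
FOV = np.pi / 2            # angular extent covered by the NT x NP samples
P = np.zeros((NX, NY, NT, NP, 3), dtype=np.float32)

def ray(x, y, theta, phi):
    """RGB intensity of the ray leaving location (x, y) at angles
    (theta, phi), by nearest-neighbor lookup in the sampled field."""
    t = np.clip(int((theta / FOV + 0.5) * NT), 0, NT - 1)
    p = np.clip(int((phi / FOV + 0.5) * NP), 0, NP - 1)
    return P[x, y, t, p]

# A conventional display is the degenerate case in which, for each (x, y),
# every (theta, phi) entry holds the same color.
```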


Fig. 4. For each spatial location in a rectangular display, an ideal light field display (left) emanates different light rays in different directions, as compared to a conventional display (right), which emanates the same color in all directions.

Conventional displays, in contrast to the aforementioned ideal display, correspond to a 2-D matrix of light emitters, referred to as pixels, where each pixel emits a single color in all directions (θ, φ) corresponding to a large cone of light, as illustrated in Fig. 4 (right).

A stereoscopic 3-D display improves on a conventional display by providing two views for each (x, y) location: one view for the right eye and a second for the left eye. This is often accomplished using temporal, polarization, or wavelength multiplexing of the emitted light, with eyeglasses that demultiplex the light so the appropriate view reaches each eye [32]. Multiple viewers with glasses will see the same images, whether or not the images are appropriate for their relative position. A multiview display extends two-view stereo display to N views, where N is typically on the order of tens of views. In an N-view multiview display, each (x, y) location emits N light cones in N directions.

Autostereoscopic displays provide two or more views without requiring glasses. This is often accomplished by placing a parallax barrier or lenticular array in front of a conventional 2-D matrix of pixels, which directs light from different pixels in different directions [32]. Note that if the 2-D matrix has a total of M pixels, and there are N views, each view has M/N pixels. Therefore, achieving a large number of views and a high resolution for each view requires a large number of pixels.

A stereoscopic two-view display forces the viewer to focus on the surface of the screen even when viewing an object depicted at a different depth. Indeed, it is the display, not the viewer, which selects the parts of the scene that are in focus. In contrast, a light field display allows the viewer to adapt focus and track depth in a natural manner, as if the viewer were looking through an open window.

The limitations of human perception may allow the light field to be heavily subsampled across each dimension. For example, it may be sufficient to sample the angular dimensions (θ, φ) to resolve different rays at 1/2 pupil width. Assuming a 2-mm pupil and a display distance of 1 m, covering all 180° of viewing angle amounts to 3000 rays horizontally and 3000 vertically, or 9 000 000 rays per (x, y) coordinate. Such a display bathes a room in light, further enhancing the illusion of a window. The complexity can be dramatically reduced by directing light only to the pupils of the viewers, for example, by using a camera to track the viewers and their pupils, thereby reducing the required number of rays by many orders of magnitude.
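The ray count above can be verified with a few lines of arithmetic; the sketch below simply restates the stated assumptions (2-mm pupil, rays resolved at half the pupil width, 1-m viewing distance, 180° of viewing angle) in code.

```python
import math

# Back-of-envelope check of the ray count quoted above.
pupil_m = 2e-3                       # 2-mm pupil
ray_spacing_m = pupil_m / 2          # resolve rays at 1/2 pupil width
distance_m = 1.0                     # display-to-viewer distance

arc_m = math.pi * distance_m         # arc swept by 180 degrees at 1-m radius
rays_per_axis = arc_m / ray_spacing_m
print(f"{rays_per_axis:.0f} rays per axis, "
      f"{rays_per_axis**2:.2e} rays per (x, y) location")
# -> roughly 3142 rays per axis, i.e., about 3000 x 3000, or ~9e6 rays.
```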
The feeling of immersion can be improved by providing a large field of view, for example, by employing large displays. Current display fabrication technologies, such as for LCD or plasma displays, are not amenable to large sizes, as the yield decreases with increasing size, leading to higher cost. Therefore, large displays are currently produced using projector systems, sometimes involving multiple projectors, or a mosaic of smaller LCD displays. A promising approach for future economical manufacture of large-area displays is the use of roll-to-roll (R2R) manufacturing techniques, where the active-matrix electronics backplane for displays is imprinted on a roll of plastic substrate. A major challenge for R2R techniques has been preserving alignment during the various fabrication steps, which induce stretching, shearing, etc., across the large, flexible, plastic substrate. The development of self-aligned imprint lithography (SAIL) overcomes this problem and provides submicron alignment [18], [19]. Displays made on plastic substrate also facilitate the creation of flexible displays or displays with new form factors, e.g., large, curved displays that surround the viewers, further increasing their sense of immersion.

Volumetric displays [20] provide a benefit over a planar display in that the viewers can walk around the display. Flexible displays may provide the ability to produce cylindrical displays, providing a similar effect to volumetric displays.

A head-mounted display can completely fill the field of view, and can replace or augment the ambient light, for example, to insert a virtual person into an otherwise empty chair. Precise, low-latency head-tracking coupled to the rendering system is an essential component for such a display. As head-mounted displays become more comfortable, higher performing, and less visually intrusive to others, they will likely occupy larger and larger niches in immersive communication.

B. Capture

A conceptually ideal visual capture system, complementary to an ideal display that emulates a large window, captures all of the visual information that would go through this window, namely the light field or 4-D plenoptic function P(x, y, θ, φ) (again suppressing color and time for simplicity).


It is useful to compare this idealized capture device with a conventional camera. A conventional camera uses a lens to focus the visual information onto a 2-D sensor, corresponding to a 2-D matrix of sensing elements. Each sensor element captures the integral of the light impinging on it, irrespective of direction. Therefore, a conventional camera captures only a 2-D projection of the 4-D plenoptic function. A stereo camera incorporates two lenses and two 2-D sensors, to capture two projections of the 4-D plenoptic function. A camera array can capture multiple views of the scene.

Efforts to capture the plenoptic function include placing a 2-D lens array in front of a 2-D sensor. The lens array provides spatial sampling of (x, y), and each lens then focuses different angles onto different sensing elements, thereby capturing both spatial and angular information. Note that the product of spatial and angular sampling is bounded above by the number of sensing elements in the 2-D sensor. As the lens array gets larger, the individual lenses get smaller, and the 2-D sensor increases in resolution, this approaches a light-field capture device, and is sometimes referred to as a plenoptic camera [21], [22].

The above discussion highlights the need to understand the minimum rate required to sample the plenoptic function, analogous to the Nyquist sampling theory for 1-D signals. A number of studies have examined the plenoptic function in the spectral domain under different assumptions, and the tradeoffs between camera sampling, depth of the scene, etc. [25]–[27]. For example, Do et al. [28] showed that for the simplistic case of a bandlimited image painted on a flat surface in 3-D space, the plenoptic function is bandlimited. However, they also showed that the plenoptic function is not bandlimited unless the surface is flat. They provide rules to estimate the bandwidth depending on the maximum and minimum scene depths, the maximum slope of the surface, and the maximum frequency of the painted signal.

The future will see increased use of capture devices that provide more information than what a human can see through a window. For example, capturing other spectrum bands such as infrared or ultraviolet may be useful in some applications, including enhancement. Also, depth cameras extend the conventional camera or the light field camera to include an associated depth per pixel. Depths can be estimated using a variety of techniques, including passive stereo depth estimation, and active time-of-flight or structured light patterns [29], [31], where the scene is illuminated with typically a near-infrared signal. Both passive and active techniques have strengths and weaknesses, which are somewhat complementary, and their strengths can be combined by fusing both passive and active depth estimation [30]. Accurate depth information simplifies a variety of applications, including foreground/background segmentation, pose estimation, gesture recognition, etc. [31].
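As a concrete illustration of the passive approach, the sketch below estimates depth from a rectified stereo pair by brute-force disparity search and the standard relation z = f·B/d. The focal length, baseline, and disparity range are invented for illustration; practical systems match windows rather than single pixels and refine the result considerably.

```python
import numpy as np

# Toy passive stereo: for each pixel of the left image, find the horizontal
# shift (disparity d) in the right image with the smallest absolute
# difference, then convert disparity to depth via z = f * B / d.
def depth_from_stereo(left, right, f_px=800.0, baseline_m=0.10, max_d=32):
    h, w = left.shape
    depth = np.full((h, w), np.inf)
    for y in range(h):
        for x in range(1, w):
            costs = [abs(float(left[y, x]) - float(right[y, x - d]))
                     for d in range(1, min(max_d, x + 1))]
            if costs:
                d = 1 + int(np.argmin(costs))   # best-matching disparity
                depth[y, x] = f_px * baseline_m / d
    return depth

# Smoke test on random 8-bit images (real input: a rectified stereo pair).
left = right = np.random.default_rng(0).integers(0, 256, size=(24, 32))
depth = depth_from_stereo(left, right)
```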
C. Two-Way Interaction

Two-way, and generally N-way, interaction between people leads to additional requirements on capture and rendering. For example, face-to-face discussions employ important nonverbal signals such as eye contact, gaze direction, and gestures such as pointing. However, conventional video conferencing systems have the camera above the display, which does not provide good eye contact or gaze awareness. If an individual looks at the display, the camera will capture a view in which the individual appears to be looking down. Looking at the camera corrects the remote view, but then the local viewer cannot concentrate on the display. The directions of gaze and gestures are also incorrectly expressed using current systems. This problem occurs because the appropriate position of the camera is behind the display, looking through it. Making a hole in a display and placing a conventional camera at the hole does not solve the problem, as the viewpoint is too close. These problems have motivated the design of systems with see-through displays to provide correct eye contact and gaze awareness [33], [50].

In scenarios such as wedding receptions, where participants are free to move about, the desired viewpoint for remote participants is not known a priori, hence camera placement becomes problematic. In such cases, view synthesis algorithms may be required to estimate the desired views. Such techniques remain an active and highly challenging field of research.

D. Summary

The future promises many advances that will improve the visual immersive experience. Visual rendering will move from pixels, which emit the same color in all directions, to light field displays, which emit different colors in different directions. This would enable glasses-free 3-D viewing, while overcoming the vergence–accommodation conflict that afflicts conventional stereoscopic 3-D displays. Capture systems will support the capture of these light fields. In addition, see-through display systems will provide correct eye contact and gaze awareness. It is important to emphasize that these systems involve an immense amount of data to capture, process, compress, transport, decode, and display. The implications for compression and transport are discussed in Section VI. The challenge of working with the immense amount of data associated with entire light fields motivates techniques to intelligently select a subset of information to capture or render. For example, if the goal is to reproduce the light field at the location of the human eyes, then only those associated light rays need to be captured. In addition, sometimes the light rays needed for display may not have been captured, leading to the necessity of interpolating the missing rays from those available, which relates to the well-known problem of view interpolation [34], [35]. It is also noteworthy that future displays may include both emitters and sensors on the same surface, where the surface can simultaneously act as a display and a capture device [36].


V. AUDITORY RENDERING, CAPTURE, AND INTERACTION

Two-way audio connectivity is universally compelling, as witnessed by the remarkable uptake first of wired telephony and then of cell phones. Paradoxically, however, once basic connectivity is available, humans seem remarkably tolerant of bad audio quality; historic wired telephony is passable for voice but can hardly be considered high fidelity, while cell phones are noticeably worse. It remains commonplace, for example, to endure wired and wireless phone calls where the remote party is marginally intelligible. A range of human experiences needs better audio, with fidelity beyond the distortions of today's phone calls, going beyond single-channel audio to some form of multichannel audio that provides a sense of spatial richness and immersion not fully achievable even with today's research systems. While the beneficial effects are difficult to quantify, it appears that immersive audio grows in importance when the number of participants increases, when the interaction entails moving through a real or virtual space, or when emotion plays a role, for example, when teaching or persuading.

In a formal meeting, immersive audio allows participants to quickly determine who is talking, to detect the subtle signals that direct conversation flow (e.g., throat clearing), and to discern mood (e.g., a snort of derision versus an appreciative chuckle). In an informal group gathering like a wedding reception, side conversations form, people talk over each other, and there are social rules around eavesdropping that govern when it is appropriate to join a conversation and when to move along. These rich forms of human interaction need more than intelligibility; most crucially, they need a sense of distance and directionality. For example, a virtual party is nearly impossible when all voices are merged and rendered with equal volume through a single loudspeaker atop a table. Nor is it easy to replicate the spontaneous pre- and post-meeting discussions that occur as people shuffle in and out of a conference room.

In the remainder of this section, we discuss aspects of audio capture and rendering that better enable a sense of immersion. How is a desired audio field created in an environment? How is audio information extracted from an environment? Perhaps most importantly, exactly what audio field is desirable, especially when coupled with video?

A. Audio Rendering

An audio experience is created, for those with normal hearing, by establishing a sound pressure field in each ear canal. (Low frequencies felt by the whole body are less important for human communication.) This can be achieved, for example, by using a stereo headset, by positioning an array of transducers in a hemisphere around a person's head, or by using one or more transducers positioned some distance from the listener.

The shortcomings of the naive use of a stereo headset are familiar: sounds often seem to originate from inside the skull of the listener, and their apparent position moves inappropriately when the listener's head moves. Yet stereo headphones can do far better. For those listeners familiar only with the limited experiences of today's portable music players, online recordings (see, e.g., [69]) are an effective demonstration of how much more spatial richness is readily achievable. Further, by tailoring the sound to the listener's ear and head shape, and by tracking the position and orientation of the listener's head and re-rendering the sound field appropriately, a stereo headset is capable, at least in principle, of delivering a fully immersive experience [68].

While possible in principle, real-time rendering of spatially consistent binaural audio for headsets has proved difficult. Sound interacts in important ways with the outer ear, head, torso, and the surrounding room. Every ear is shaped differently, and people shift their head and body to sample the sound field from different poses. The human auditory system is especially sensitive to, and in fact relies on, these static and dynamic interactions. An aspirational goal in binaural sound reproduction is to compute in real time, with sufficient fidelity to fool a human listener, a sound pressure field in the ear canal that corresponds to a sound source with arbitrary direction and range, ideally without resorting to extensive measurements of listeners, rooms, and transducers. Psychoacoustics plays a key role in current approaches; human insensitivity to certain acoustic details allows many modeling shortcuts to be taken while preserving an illusion of directionality. See [70] for a deeper discussion and a survey of results.
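As a minimal illustration of how head geometry enters such rendering, the toy sketch below pans a mono signal using only an interaural time difference (Woodworth's approximation) and a fixed interaural level difference. This is far short of a full head-related transfer function, and every constant here is an assumption chosen for illustration.

```python
import numpy as np

FS = 48_000           # sample rate, Hz
HEAD_RADIUS = 0.0875  # m, a commonly used average; illustrative
C = 343.0             # speed of sound, m/s

def toy_binaural(mono, az_rad, ild_gain=0.7):
    """Pan a mono signal to azimuth az_rad (positive = right) using an
    interaural time difference from Woodworth's formula plus a crude,
    frequency-independent level difference (the 0.7 gain is arbitrary)."""
    itd = (HEAD_RADIUS / C) * (abs(az_rad) + np.sin(abs(az_rad)))
    lag = int(round(itd * FS))                          # far-ear delay
    far = np.concatenate([np.zeros(lag), mono])[:len(mono)] * ild_gain
    near = mono
    return (far, near) if az_rad > 0 else (near, far)   # (left, right)

tone = np.sin(2 * np.pi * 440 * np.arange(FS) / FS)     # 1 s at 440 Hz
left, right = toy_binaural(tone, np.deg2rad(45))        # source 45 deg right
```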
Specialized audio capture solutions can partly sidestep this challenging problem. For example, to immerse a remote participant into a wedding reception, one could use a mobile robot avatar equipped with a synthetic head and stereo microphones. For improved results, the head can be sonically matched to the target listener and can track the listener's head pose in real time [71].

Apart from headphones, sound fields can also be created via one or more stationary loudspeakers. The key challenge remains: what do we emit from the loudspeakers to induce a desired sound field in the ear canals of the listeners? If we constrain the listener's position, apply appropriate sound treatment to the room and furniture, and limit the sound locations we attempt to synthesize, a well-engineered stereo or three-way loudspeaker system can be remarkably effective. This is exactly the situation in many high-end telepresence systems. Relaxing these constraints makes the rendering problem harder, potentially even infeasible.

Given a sufficient number of loudspeakers, coupled with sufficiently accurate models for the room and its occupants, including tracking of listener head positions, one can in principle deliver completely different sounds to every eardrum in the room. In ideal circumstances, it suffices to have just one loudspeaker per eardrum. Mathematically, if H_ij(ω) is the transfer function from loudspeaker i ∈ {1, ..., N} to eardrum j ∈ {1, ..., M}, then by assembling the transfer functions into an N × M matrix and computing its inverse, we can calculate the N loudspeaker excitation signals needed to deliver the desired sound signals to the M target eardrums.
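A single-frequency sketch of this inversion, under the idealized assumptions that N = M and that the transfer functions are perfectly known, looks as follows; in practice the solve is repeated per frequency bin with regularization, because of the ill-conditioning discussed next.

```python
import numpy as np

# Idealized crosstalk cancellation at one frequency bin: N loudspeakers,
# M = N eardrums, H[i, j] = (assumed known) transfer function from
# loudspeaker i to eardrum j. Values here are random placeholders.
rng = np.random.default_rng(0)
N = 4                                   # e.g., two listeners, four eardrums
H = rng.normal(size=(N, N)) + 1j * rng.normal(size=(N, N))
desired = rng.normal(size=N) + 1j * rng.normal(size=N)  # eardrum signals

# Eardrum j receives sum_i H[i, j] * x[i], i.e., H^T x, so solve H^T x = d.
x = np.linalg.solve(H.T, desired)
assert np.allclose(H.T @ x, desired)    # excitations reproduce the targets
# np.linalg.cond(H) indicates how strongly noise and model error are
# amplified; large values motivate regularized inverses.
```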


Many challenges must be overcome to implement this idea in practice: the transfer functions are difficult to determine with sufficient accuracy, especially in reverberant rooms and at high frequencies; the transfer functions can change quickly due to head movements, or because the occupants' movements alter the room acoustics; the inversion can be ill-conditioned, which can drive the speakers past linearity; and the signal processing demands a low-latency implementation. The single-listener implementations of [72] and [73] illustrate the progress made over the last decade. The jump from a single- to a many-listener system, particularly in a poorly controlled environment like a typical meeting room, appears at present to be singularly challenging.

B. Audio Capture

The goal of immersive audio capture is the identification and capture of sound field elements that are relevant for remote rendering. One broad category of capture methods isolates clean versions of individual sound sources, such as the individual voices of active talkers, from other interfering sound sources, in particular ambient noise, reverberation, and remote sounds rendered locally by loudspeakers. This can be achieved by mechanical and algorithmic methods, including close-talking microphones, microphone arrays, and blind source separation techniques. In many cases, local reverberations are undesirable at the remote site; removing reverberation algorithmically remains an active area of research. Remote sounds rendered by local loudspeakers lead to the well-known and hard problem of echo cancellation, which is endemic to two-way audio communication. Immersive rendering systems that employ loudspeakers must thus solve two coupled difficult problems: how to induce desired sounds in the ears of the listeners, and how to remove the resulting contamination from the microphone signals.

A second broad category of capture methods does not attempt to isolate individual sounds, but rather captures a sound field at a location or over an area, for example, at the ears of a virtual listener, around the head of a virtual listener, or across a virtual window. The sound field is then conveyed in its entirety to a remote location. An advantage of this approach is its ability to provide all the ambient sounds in an environment, not just human speech. It also sidesteps many of the difficult algorithmic issues described above, though not echo cancellation. Disadvantages include the need for a potentially high network bandwidth, the difficulty in synthesizing a virtual geometry different from that of the capture location, and the difficulty in knitting together captured sound fields from multiple remote locations simultaneously. Certain high-end telepresence rooms today adopt a highly simplified but effective version of this capture method: three microphones in a local room separately drive three correspondingly positioned speakers in a remote location. This approach is effective at providing a natural form of spatial audio for a collaborative meeting between two rooms with several people in each room. Specifically, this helps provide the necessary audio/video correspondence, described next.

C. Audio/Video Correspondence

Video is a powerful cue for audio, particularly when the video is perceived as "relevant" to the audio, such as a talking head [74]. When confronted with a conflict between visual position and audio position, even a trained listener will fuse the audio to the video position for a separation of up to 10°. An untrained listener will tolerate up to 20°. Thus, it would seem that one could relax some constraints on audio rendering when video is present, an effect that is relied upon in movies. Indeed, in a communication setting, if the remote side of a collaboration session is portrayed on a display that subtends less than 40° of view (which is about as close as most people get to a screen today), for untrained listeners it may be acceptable to place all spoken audio in the middle of the display.

The simplification above likely breaks down when there are multiple remote parties talking, or when the ambient sounds are an important part of the communication. Consider the case of a family dinner in which the two halves of the dinner table are separated by a thousand miles with a display in between. There will be multiple simultaneous talkers, at least in many families, and the ambient sounds are an important part of the social impact of the event. Audio should be good enough in this situation to allow the "cocktail party effect" to take place [75], [76].

Audio and video interact strongly with respect to quality; it is well known that good audio makes people feel that the video is better. Also, nonsynchronized audio and video, resulting in a lack of lip-sync, may cause a reduced overall experience.

The requirement for immersive audio rendering becomes more acute when the listener is surrounded on several sides by collaboration screens, or when the video portrays a virtual geometry with a more complex structure. For example, in Second Life [77], there is the sense that the world continues left, right, and behind what can be seen on the display, hence there is a need to render sounds from positions that are not currently visible.

If we have a video rendering system that can place a virtual Yoda [78] on a physical chair, then we want his voice to follow him. But in the absence of an augmented reality representation of a person, do we want audio to arrive as a disembodied voice floating over an otherwise empty chair? Maybe so, maybe not; what seems socially awkward today could be viewed as entirely natural in the future.
Also, nonsynchronized audio and sired sounds in the ears of the listeners, and how to remove video, resulting in lack of lip-sync, may cause a reduced the resulting contamination from the microphone signals. overall experience. A second broad category of capture methods does not The requirement for immersive audio rendering be- attempt to isolate individual sounds, but rather captures a comes more acute when the listener is surrounded on sound field at a location or over an area, for example, at the several sides by collaboration screens, or when the video ears of a virtual listener, around the head of a virtual lis- portrays a virtual geometry with a more complex structure. tener, or across a virtual window. The sound field is then Forexample,inSecondLife[77],thereisthesensethat conveyed in its entirety to a remote location. An advantage the world continues left, right, and behind what can be of this approach is its ability to provide all the ambient seen on the display, hence there is a need to render sounds sounds in an environment, not just human speech. It also from positions that are not currently visible. sidesteps many of the difficult algorithmic issues described If we have a video rendering system that can place a above, though not echo cancellation. Disadvantages virtual Yoda [78] on a physical chair, then we want his include the need for a potentially high network bandwidth, voice to follow him. But in the absence of an augmented the difficulty in synthesizing a virtual geometry different reality representation of a person, do we want audio to from that of the capture location, and the difficulty in arrive as a disembodied voice floating over an otherwise knitting together captured sound fields from multiple empty chair? Maybe so, maybe notVwhat seems socially remote locations simultaneously. Certain high-end tele- awkward today could be viewed as entirely natural in the presence rooms today adopt a highly simplified but effec- future. tive version of this capture method: three microphones in a What happens when we combine two wildly disparate local room separately drive three correspondingly posi- acoustic environments, like a living room and a train tioned speakers in a remote location. This approach is station? Do we want the noise and echoing openness of the effective at providing a natural form of spatial audio for a train station to flood in? The answer may depend on the


What happens when we combine two wildly disparate acoustic environments, like a living room and a train station? Do we want the noise and echoing openness of the train station to flood in? The answer may depend on the context and purpose of the communication; a business meeting will likely have different requirements than the case of a family keeping an at-home relative immersed in an ongoing vacation.

VI. COMMUNICATION ASPECTS

Multimedia communication has been made possible today by massive improvements in our ability to compress multimedia signals, coupled with ever increasing network capacity. Better compression has arisen from a better understanding of compression principles as well as from enormous increases in available computation and memory. The striking improvements in computation in recent decades help suggest the scale of potential future improvements. When HDTV was being developed in 1991, it took 24 h to compress 1 s of 720p HDTV video on a high-end workstation; in 2011, this can be done in real time on a smartphone you hold in your hand. The improvements arose from faster processors and greater memory bandwidth, as well as instruction-level parallelism, multicore processors, and graphics processing units, which can contain hundreds of cores. Similarly, the advent of high-bandwidth wired and wireless packet networks, with their flexibility for dynamic resource allocation, has been crucial for transporting these multimedia signals. The future promises additional improvements in computation, storage, and networking, and these are necessary to enable the immersive communication envisioned in this paper. In addition, the future will see more processing being performed in the cloud. This is especially true for mobile devices, which may have limited compute capability or battery life. For example, many applications such as visual search or 3-D scene reconstruction can be partitioned such that select processing is performed on the device (e.g., feature extraction) and the remaining processing in the cloud (e.g., compute-intensive search across large databases).

Video and audio coding has been a research focus in the academic and industry communities for many years, and this has resulted in a variety of international standards by the ISO/IEC, ITU, etc., that have enabled interoperable multimedia communication [37]. The increasing desire for interoperability across different end-devices and systems is likely to continue the pressure to develop and deploy standardized solutions for compression, transport, and interaction.

Four likely trends in the future are: 1) increased scale of the raw size of the captured and rendered signals; 2) increased scale of the range of bit rates and diversity of endpoints that need to be supported; 3) new sensors that capture different attributes; and 4) new associated representations, compression algorithms, and transport techniques. These four trends are briefly discussed next.

The future will likely see large immersive displays, as described in Section IV-A, with 2-D displays going from their current several megapixels per frame, to tens or hundreds of megapixels per frame (and possibly gigapixels per frame in special cases). Light field displays will require a number of light rays several orders of magnitude larger. This scaling of visual information is illustrated in Table 1. These dramatic increases in the size of the raw visual or auditory information will require new compression techniques to squeeze them within the available network bandwidth and storage capacity in the future. One important characteristic that makes this feasible is that as the sampling rate increases, so does the redundancy between samples, and therefore the compressed bit rate increases much more slowly than the sampling rate. For example, when increasing the number of light rays by 10× for every spatial location, the redundancy across the light rays results in the compressed data rate increasing by much less than a factor of 10. This is one of the assumptions made in Table 1. Furthermore, a second important characteristic is that the perceptually relevant information also does not scale linearly with sampling rate.
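The sub-linear growth of compressed rate can be illustrated with a toy experiment of our own construction, using invented data: a second view that is a shifted, slightly noisy copy of the first costs far fewer bits when coded as a difference, approximating "bits" by empirical entropy.

```python
import numpy as np

def entropy_bits(a):
    """Total bits to code array a at its empirical per-symbol entropy."""
    _, counts = np.unique(a, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum()) * a.size

rng = np.random.default_rng(1)
view0 = rng.integers(0, 256, size=(64, 64))      # one view, 8-bit samples
shifted = np.roll(view0, 1, axis=1)              # neighboring viewpoint:
view1 = np.clip(shifted + rng.integers(-2, 3, size=view0.shape), 0, 255)

independent = entropy_bits(view0) + entropy_bits(view1)
joint = entropy_bits(view0) + entropy_bits(view1 - shifted)
print(f"independent: {independent:,.0f} bits; differential: {joint:,.0f} bits")
# The residual has low entropy, so adding the second view costs only a
# fraction of coding it independently.
```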
High-quality multimedia compression relies on reducing the redundancy and the perceptually irrelevant information within the media signals [54]. The research community has many years of understanding of the redundant and perceptually irrelevant information in 2-D video and audio; the near future will see an increased focus on understanding these issues for immersive visual, auditory, and haptic data. This new knowledge will most likely direct the creation of new multimedia representations and compression algorithms.

Table 1. Example of the Large Increase in Raw and Compressed Data Rates for Going From a Conventional Digital TV Video (Row 1) to a Large Display Providing an Immersive Experience (Row 2), and to Future 3-D Light Field Displays (Rows 3 and 4). For Simplicity, 60 Frames/s Is Assumed for All Applications, and 1000 × 1000 Light Rays (Horizontally and Vertically) for Each (x, y) Coordinate


The future will likely see a wide diversity of communication endpoints, ranging from custom-designed immersive studios, to systems that can be rolled into an existing room or environment, to desktop and mobile devices. These endpoints will be able to support, and require, vastly different versions of the multimedia content. For example, an immersive studio may need 100× or 1000× the number of pixels of a desktop or mobile system. Similarly, there will probably be wide diversity in the bandwidth available to different endpoints, based on the network they are connected to and the time-varying demands placed on the network by other applications. These heterogeneous and time-varying requirements provide additional challenges for future systems, which may need to simultaneously communicate in a multiparty scenario with, for example, an immersive studio, a desktop system, and a mobile device.
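One standard way to serve such heterogeneous endpoints, not detailed in the text above, is a scalable (layered) representation from which each endpoint takes the largest prefix of layers its bandwidth allows. The sketch below is hypothetical; the layer names and rates are invented for illustration.

```python
# Hypothetical scalable bitstream: each enhancement layer refines the one
# before it. Names and rates (Mb/s) are invented for illustration.
LAYERS = [("base video", 2.0), ("HD enhancement", 4.0),
          ("multiview enhancement", 12.0), ("light-field enhancement", 80.0)]

def layers_for(budget_mbps):
    """Longest prefix of layers fitting the endpoint's bandwidth budget."""
    chosen, total = [], 0.0
    for name, rate in LAYERS:
        if total + rate > budget_mbps:
            break
        chosen.append(name)
        total += rate
    return chosen, total

for endpoint, budget in [("mobile", 3.0), ("desktop", 20.0), ("studio", 150.0)]:
    names, rate = layers_for(budget)
    print(f"{endpoint}: {rate} Mb/s -> {names}")
```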
Immersion in a remote environment requires the ability to physically interact with objects in that environment. This relies on the haptics modality, comprising the tactile sense (touch, pressure, temperature, pain) and kinaesthetics (the perception of muscle movements and joint positions). Providing timely haptic feedback requires a control loop with roughly 1-ms latency, which corresponds to a high packet transmission rate of nominally 1000 packets/s. To overcome this problem, recent work has leveraged perceptual models that account for human haptic sensitivity. For example, by Weber's law of the just noticeable difference (JND), the JND is linearly related to the intensity of the stimulus. Changes smaller than the JND would not be perceived and hence do not need to be transmitted. This leads to the notion of perceptual dead-bands, within which changes to the haptic signal would in principle be imperceptible to the user [44]. Accounting for haptic signal perceptibility, coupled with conventional compression techniques, can lead to significant reductions in both the transmitted packet rate and the total transmitted bit rate [45]. Haptics, and the compression and low-delay two-way transmission of haptic information, is an emerging area that promises to dramatically increase the realism and range of uses of immersive communication systems.
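A minimal sketch of the perceptual dead-band idea discussed above: a haptic sample is transmitted only when it deviates from the last transmitted value by more than a just-noticeable fraction of that value, per Weber's law. The threshold k = 0.1 is an illustrative assumption; actual systems along the lines of [44] also handle velocity signals and signal reconstruction at the receiver.

```python
# Perceptual dead-band filtering of haptic samples (sketch): send a new
# sample only if it differs from the last transmitted value by more than
# a just-noticeable fraction k of that value (Weber's law). k = 0.1 is an
# illustrative assumption, not a value from the cited work.
def deadband_filter(samples, k=0.1):
    """Yield (index, value) pairs for the samples that must be sent."""
    last_sent = None
    for i, x in enumerate(samples):
        if last_sent is None or abs(x - last_sent) > k * abs(last_sent):
            last_sent = x
            yield i, x
        # Otherwise the change is below the JND: the receiver simply
        # holds the last transmitted value, and no packet is sent.

force = [1.00, 1.02, 1.05, 1.25, 1.26, 0.90, 0.91]
sent = list(deadband_filter(force))
print(sent)                                  # only perceptually significant updates
print(len(sent), "of", len(force), "samples transmitted")
```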


VII. QUALITY OF EXPERIENCE AND HUMAN PERCEPTION

Ultimately, the success or failure of any system for immersive communication lies in the quality of the human experience that it provides, not in the technology that it uses. As mentioned in the Introduction, immersive communication is about exchanging natural social signals between remote people and/or experiencing remote locations in ways that suspend one's disbelief in being there. Fundamentally, the quality of such an immersive experience relates to human perception and mental state. For immersive communication, while it may be sufficient to reproduce the sensory field at a distant location, it may not be necessary, or even feasible, to do so. More important to achieving a high QoE is inducing in each participant an intended illusion or mental state.

Hence, it may come as no surprise that the latest generation of immersive communication products, which set a new standard for QoE, was conceived by veteran storytellers in Hollywood. Moviemaker DreamWorks Animation SKG, in partnership with technologist Hewlett-Packard, developed the HP Halo Video Collaboration Studio, shown in Fig. 5, bringing a moviemaker's perspective to video conferencing. (Various companies, including Cisco, Tandberg, Teliris, and Polycom, have since offered similar systems.) Unlike their predecessor video conferencing systems, these high-end telepresence systems are carefully designed to induce participants to suspend their disbelief in being with each other in the same room.

Fig. 5. HP Halo (top) and Cisco Telepresence (bottom).

In fact, these high-end telepresence systems preserve many of the important immersive cues listed in Table 2, auditory and visual, beginning with peripheral awareness and consistency. Peripheral awareness is the awareness of everything that is going on in one's immediate vicinity, providing a sense of transparency, of being there. Halo achieves that, in part, by a large field of view. Consistency between local and remote sites is important for maintaining the illusion of a single room, so all elements of room design, such as tables, chairs, colors, wall coverings, acoustics, and lighting, match across all locations by design. In addition, a number of measures were taken to maintain consistent 3-D cues. For example:
• remote people appear life size and occupy the same field of view as if they were present (see the sketch below);
• multichannel audio is spatially consistent with the video;
• video streams from remote sites are placed in an order that is consistent with a virtual meeting layout, improving eye contact and maintaining consistent gaze direction.
In addition:
• low latency and full-duplex echo-controlled audio allow natural conversation;
• people's facial expressions are accurately conveyed by a combination of sufficient video resolution and constrained seating that keeps participants in front of the cameras.
Despite the attention paid to these details, other conflicting cues can serve to break the illusion. One example is eye contact. In a real physical space, each person is able to make eye contact with at most one other person. But in today's telepresence systems, eye contact is compromised because all local participants see the same view of each remote participant. Future multiview 3-D displays promise to improve on this aspect.
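The "life size" design rule in the list above can be captured with simple similar-triangle geometry: the on-screen image should subtend the same visual angle as the real person would at their virtual position behind the screen plane. The sketch below is illustrative only; the specific dimensions are assumptions, not Halo design values.

```python
# Life-size rendering as a visual-angle constraint (illustrative sketch):
# the image on the screen should subtend the same angle at the viewer's
# eye as the real person would at their virtual distance.
import math

def on_screen_height(person_height_m, viewer_to_screen_m, screen_to_person_m):
    """Image height on the screen plane that makes the person appear
    life size to the viewer, by similar triangles."""
    virtual_distance = viewer_to_screen_m + screen_to_person_m
    return person_height_m * viewer_to_screen_m / virtual_distance

def visual_angle_deg(size_m, distance_m):
    return math.degrees(2 * math.atan(size_m / (2 * distance_m)))

# Assumed numbers: seated torso ~0.9 m tall, viewer 2 m from the screen,
# remote seat placed 1 m "behind" the screen plane.
h = on_screen_height(0.9, 2.0, 1.0)
print(round(h, 2), "m on screen")                  # 0.6 m
print(round(visual_angle_deg(h, 2.0), 1), "deg")   # matches...
print(round(visual_angle_deg(0.9, 3.0), 1), "deg") # ...the real person at 3 m
```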

Table 2 Immersive Cues


Evaluating the QoE of an immersive communication system can be done on several levels, which one could call the performance, psychometric, and psychophysical levels.

At the performance level, QoE is evaluated in terms of the overall effectiveness of the experience with respect to performing some task. For example, Bailenson et al. [46] studied whether gaze awareness could reduce the number of questions, time per question, or completion time in a "game of 20 questions" with multiple players. Such evaluations are often performed as user studies in a lab, since well-defined, measurable tasks must be carried out. Clearly, measures of performance may be completely different for one task (e.g., playing poker) compared to another (e.g., negotiating a business deal). Hence, it is possible for different immersive communication systems to excel in different settings.

At the psychometric level, QoE is evaluated in terms of how the participants feel about the experience. Such evaluations can be carried out either as a study in a lab (often in the context of performing a task) or as a field study (in the context of a regular workload). As an example of a field study, Venolia et al. [47] studied the deployment of embodied social proxies, or telepresence robots, in four real-world software development teams in which one member of each team was remote. Within each team, an embodied social proxy for the remote team member was deployed for six weeks. A baseline survey was completed before the deployment, and the same survey was completed after the deployment, by all team members. The survey listed a set of assertions to be scored on an eight-point range from 0 (low agreement) to 7 (high), known as a Likert scale in the psychometric literature [48]: one set of assertions related to meeting effectiveness (e.g., "I think X has a good sense of my reactions"); another set related to awareness (e.g., "I am aware of what X is currently working on and what is important to X"); and a final set related to social aspects (e.g., "I have a sense of closeness to X"). Comparing the surveys before and after the deployment yielded quantitative psychometric results on the overall experience with respect to meeting effectiveness, awareness, and social aspects. Qualitative results were also obtained through freeform comments on the surveys, as well as direct observations of the teams in action by ethnographers.

Other examples of psychometric evaluations include those of [49]–[51], which showed the importance of awareness of remote participants' object of attention in collaborative tasks, and [52], which showed that trust can be improved using a multiview display to support proper eye gaze. These were assessed in lab studies using surveys.

The third level on which to evaluate QoE is the psychophysical level. Psychophysics is the study of the relation between stimuli and sensation. Weber's law, that the threshold of perception of a change in a stimulus is proportional to the amplitude of the stimulus (resulting in a logarithmic relationship between a stimulus' amplitude and its perception), is a well-known psychophysical result. Psychophysical evaluations of the QoE of an immersive communication system are based on subjective perception of a set of stimuli provided by the system. Audio and visual quality have been studied extensively in psychophysical terms, through mean opinion scores (MOS) [53], JNDs due to quantization noise [54], frequency sensitivity, and so forth. Rules of thumb for high-frequency sensitivity are that human hearing falls off beyond 15–20 kHz [55] and that visual acuity in the fovea falls off beyond 60 cycles per degree [56]. There are also well-known results on delay, such as that interactive voice conversations can tolerate up to 150 ms of delay [57], and that for lip sync, audio can lag video by up to 100–125 ms but can lead video by only 25–45 ms [58]. Interestingly, recent evidence suggests that video tends to mask audio delay, and that interactive audiovisual conversations can tolerate delays significantly larger than 150 ms [59], the hypothesis being that intent to speak can be signaled through video instead of audio. Eye gaze has also been studied: eye contact can be perceived (at the 90th percentile) within an error of 7° downward, but only 1° left, right, or upward [60].
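The delay and lip-sync rules of thumb above lend themselves to a direct encoding, for example as admission checks in a conferencing pipeline. The sketch below uses the tighter end of each published range; the exact limits vary by source, content, and viewer.

```python
# Rules of thumb from the text, taking the conservative end of each range.
MAX_ONE_WAY_DELAY_MS = 150   # interactive voice, ITU-T G.114 [57]
MAX_AUDIO_LAG_MS = 100       # audio behind video, ITU-R BT.1359 [58]
MAX_AUDIO_LEAD_MS = 25       # audio ahead of video [58]

def conversation_ok(one_way_delay_ms):
    """True if one-way delay is within the classic conversational limit."""
    return one_way_delay_ms <= MAX_ONE_WAY_DELAY_MS

def lip_sync_ok(av_skew_ms):
    """av_skew_ms > 0 means audio lags video; < 0 means audio leads."""
    return -MAX_AUDIO_LEAD_MS <= av_skew_ms <= MAX_AUDIO_LAG_MS

print(conversation_ok(120))  # True: comfortable for interactive voice
print(lip_sync_ok(80))       # True: a modest audio lag is masked
print(lip_sync_ok(-40))      # False: leading audio is far more noticeable
```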
However, many other psychophysical characterizations of QoE for immersive communication are incipient. One area that is important to immersive communication is the quality of 3-D modeling [61] and stereoscopy [62], which are largely still art forms. We are only beginning to understand how humans perceive mismatches between the various cues in Table 2, such as any mismatch between the spatially perceived source of audio and its visual counterpart [63]; between visual stereo parallax, motion parallax, and focus [64]; between the lighting, color, or acoustics of local and remote sites; etc. As of yet, we have no studies of the tolerable delay between head motion and the change of visual field for motion parallax. We have little understanding of the fidelity of geometric space, for example, how much it can be warped before one perceives changes in gesture. One thing is clear: people can watch video on flat screens of any size at a wide range of orientations and perceive the scene from the camera's point of view, easily accommodating changes in the camera's position and focal length. There is an analogous robustness to color balance. These factors point to remarkable "constancy" properties of human perception [65] whose role in immersive communication has yet to be understood. Overall, some elements of an immersive experience may be quite flexible, while others must match physical reality or else risk user fatigue that threatens long-term use.

Finally, it must be noted that human psychology can play an unexpectedly critical role in determining QoE. This is evident, for example, in our ability to detect when the representation of a human has entered an "uncanny valley" [66] and in our tendency to imbue devices with anthropomorphic properties [66], [67].


VIII. CONCLUDING REMARKS

After many decades of steady improvements, underlying technical trends (e.g., in computation, bandwidth, and resolution) are rapidly accelerating progress toward new and more immersive communication systems. Indeed, we are experiencing a blossoming variety of immersive communication systems beyond standard audio and video, as outlined in Section III.

Moreover, as technology pushes (i.e., supplies), society pulls (i.e., demands). The societal need for more effective ways for remote people to communicate with each other, as if they were face to face, has never been greater. Numerous societal conditions are aligning to create this need: the need to reduce environmental impact, the need to increase economic efficiency through reduced travel costs and fatigue (both in terms of fees and wasted time), the need for richer collaboration in an ever more complex business environment, and the difficulty or undesirability of travel during natural disasters such as volcanic eruptions and influenza outbreaks [79], [80]. Some of the foremost in importance relate to climate, energy, and the environment. Many of us travel internationally several times per year to attend conferences and other events. Each time we do so, we consume several barrels of oil and release on the order of a ton of CO2 into the atmosphere, per person, contributing to global warming at an unprecedented scale.

The economy and productivity are not far behind in importance. Whether an economy is booming or busting, it is becoming more and more critical to connect people with jobs and businesses with workers, wherever they may be around the world. The possibilities for immersive communication to address this need are ripe. Studies have shown that information workers around the world spend over 60% of their time in some form of communication with other people. Furthermore, over 60% of information workers report that they could fulfill their job duties from a remote location if they were permitted to do so [79]. Moreover, the time saved commuting would be substantial (in the United States, the 45-min average daily commute amounts to almost 10% of overall productivity). These are gains we cannot afford to ignore.

In a sign of the times, physical security has also risen to the top of the needs addressable by more immersive communication. When Iceland's Eyjafjallajökull volcano erupted in April 2010, clouds of ash ended up canceling over 100 000 flights, stranding over eight million passengers in Europe, and costing the airline industry over 2.5 billion euros. Similarly costly disruptions accompanied hurricanes in the United States, earthquakes and tsunamis in Japan, and bird flu in Asia, not to mention acts of terrorism. The economic impact of such threats could be reduced by improved forms of communication, which could allow most business to be carried on as usual.

Then, of course, there are the ever-present needs to bring families closer together and to improve the availability of quality education and healthcare around the world.

In short, the technology is advancing and the social need is there. In this paper, we have tried to put technology, applications, and scenarios in a historical perspective with an eye on promising future technology directions. Furthermore, we have tried to emphasize the importance of evaluating the experience from the human user's perspective. Some of the most important future advances will come from ingenious combinations of core components (e.g., light field displays and 3-D depth sensors) in unanticipated ways. We are confident that we will see substantial progress over the next decade on the road to immersive communication.

REFERENCES

[1] T. Friedman, The World Is Flat: A Brief History of the Twenty-First Century, 3rd ed. New York: Farrar, Straus and Giroux, 2007.
[2] T. Clark, "Faraway, so close: Doing global iPhone dev from New Zealand," Jan. 2011. [Online]. Available: http://arstechnica.com/gaming/2011/01/sell-in-america-live-in-new-zealand.ars
[3] J. N. Bailenson, A. C. Beall, and J. Blascovich, "Mutual gaze and task performance in shared virtual environments," Personality Social Psychol. Bull., pp. 1–15, 2003.
[4] A. Pentland, Honest Signals: How They Shape Our World. Cambridge, MA: MIT Press, 2008.
[5] A. Vinciarelli, M. Pantic, and H. Bourland, "Social signal processing: Survey of an emerging domain," J. Image Vis. Comput., vol. 27, no. 12, pp. 1743–1759, 2009.
[6] M. Zimmerman, "The nervous system in the context of information theory," in Human Physiology, R. F. Schmidt and G. Thews, Eds. New York: Springer-Verlag, 1989, pp. 166–173.
[7] B. Hoang and O. Perez, Terabit Networks. [Online]. Available: http://www.ieee.org/portal/site/emergingtech/techindex.jsp?techId=1140
[8] M. Minsky, "Telepresence," OMNI Mag., pp. 44–52, Jun. 1980.
[9] G. du Maurier, "Edison's telephonoscope (transmits light as well as sound)," Punch's Almanack, Dec. 9, 1878.
[10] A. G. Bell, "On the possibility of seeing by electricity," in Alexander Graham Bell Family Papers. Washington, DC: Library of Congress, Apr. 10, 1891.
[11] I. E. Sutherland, "The ultimate display," in Proc. IFIP Congr., 1965, pp. 506–508.
[12] "S.F. man's invention to revolutionize television," San Francisco Chronicle, vol. CXXXIII, no. 50, Sep. 3, 1928.
[13] C. Cruz-Niera, D. J. Sandin, T. A. DeFanti, R. V. Kenyon, and J. C. Hart, "The CAVE: Audio visual experience automatic virtual environment," Commun. ACM, vol. 35, no. 6, Jun. 1992, DOI: 10.1145/129888.129892.
[14] J. Fihlo, K. M. Inkpen, and M. Czerwinski, "Image, appearance and vanity in the use of media spaces and videoconference systems," Proc. ACM Int. Conf. Supporting Group Work, May 2009, DOI: 10.1145/1531674.1531712.
[15] D. Lindbergh, "The past, present, and future of video telephony: A systems perspective," Keynote, Packet Video Workshop, Dec. 2004.
[16] N. Jouppi, S. Iyer, S. Thomas, and A. Slayden, "BiReality: Mutually-immersive telepresence," Proc. ACM Int. Conf. Multimedia, 2004, DOI: 10.1145/1027527.1027725.
[17] E. H. Adelson and J. R. Bergen, "The plenoptic function and the elements of early vision," in Computational Models of Visual Processing, M. Landy and J. A. Movshon, Eds. Cambridge, MA: MIT Press, 1991, pp. 3–20.
[18] A. Jeans, M. Almanza-Workman, R. Cobene, R. Elder, R. Garcia, R. F. Gomez-Pancorbo, W. Jackson, M. Jam, H.-J. Him, O. Kwon, H. Luo, J. Maltabes, P. Mei, C. Perlov, M. Smith, and C. Taussig, "Advances in roll-to-roll imprint lithography for display applications," in Proc. SPIE, Feb. 2010, vol. 7637, p. 763719.
[19] H. J. Kim, M. Almanza-Workman, B. Garcia, O. Kwon, F. Jeffrey, S. Braymen, J. Hauschildt, K. Junge, D. Larson, D. Stieler, A. Chaiken, B. Cobene, R. Elder, W. Jackson, M. Jam, A. Jeans, H. Luo, P. Mei, C. Perlov, and C. Taussig, "Roll-to-roll manufacturing of electronics on flexible substrates using self-aligned imprint lithography (SAIL)," J. Soc. Inf. Display, vol. 17, p. 963, 2009, DOI: 10.1889/JSID17.11.963.


[20] A. Jones, M. Lang, G. Fyffe, X. Yu, J. Busch, I. McDowall, M. Bolas, and P. Debevec, "Achieving eye contact in a one-to-many 3D video teleconferencing system," ACM Trans. Graphics, vol. 28, no. 3, Jul. 2009, DOI: 10.1145/1531326.1531370.
[21] E. H. Adelson and J. Y. A. Wang, "Single lens stereo with plenoptic camera," IEEE Trans. Pattern Anal. Mach. Intell., vol. 14, no. 2, pp. 99–106, Feb. 1992.
[22] R. Ng, M. Levoy, M. Bredif, G. Duval, M. Horowitz, and P. Hanrahan, "Light field photography with a hand-held plenoptic camera," Stanford Univ. Comput. Sci., Stanford, CA, Tech. Rep. CSTR 2005-02, Apr. 2005.
[23] M. Levoy and P. Hanrahan, "Light field rendering," Proc. ACM SIGGRAPH, 1996, pp. 31–42.
[24] S. J. Gortler, R. Grzeszczuk, R. Szeliski, and M. Cohen, "The lumigraph," Proc. ACM SIGGRAPH, 1996, pp. 43–54.
[25] C. Zhang and T. Chen, "A survey on image-based rendering: Representation, sampling and compression," EURASIP Signal Process., Image Commun., 2004, vol. 19, pp. 1–28.
[26] J. X. Chai, S. C. Chan, H. Y. Shum, and X. Tong, "Plenoptic sampling," in Proc. SIGGRAPH, 2000, pp. 307–318.
[27] C. Zhang and T. Chen, "Spectral analysis for sampling image-based rendering data," IEEE Trans. Circuits Syst. Video Technol., vol. 13, no. 11, pp. 1038–1050, Nov. 2003.
[28] M. N. Do, D. Marchand-Maillet, and M. Vetterli, "On the bandwidth of the plenoptic function," IEEE Trans. Image Process., vol. 21, no. 2, pp. 708–717, Feb. 2012.
[29] J. Salvi, S. Fernandez, T. Pribanic, and X. Llado, "A state of the art in structured light patterns for surface profilometry," Pattern Recognit., vol. 43, no. 8, pp. 2666–2680, Aug. 2010.
[30] Q. Yang, K. H. Tan, B. Culbertson, and J. Apostolopoulos, "Fusion of active and passive sensors for fast 3D capture," Proc. IEEE Int. Workshop Multimedia Signal Process., Oct. 2010, pp. 69–74.
[31] A. Kolb, E. Barth, and R. Koch, "ToF-sensors: New dimensions for realism and interactivity," Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2008, DOI: 10.1109/CVPRW.2008.4563159.
[32] H. Urey, K. V. Chellappan, E. Erden, and P. Surman, "State of the art in stereoscopic and autostereoscopic displays," Proc. IEEE, vol. 99, no. 4, pp. 540–555, Apr. 2011.
[33] K.-H. Tan, I. Robinson, R. Samadani, B. Lee, D. Gelb, A. Vorbau, B. Culbertson, and J. Apostolopoulos, "Connectboard: A remote collaboration system that supports gaze-aware interaction and sharing," Proc. IEEE Int. Workshop Multimedia Signal Process., 2009, DOI: 10.1109/MMSP.2009.5293268.
[34] S. C. Chan, H. Y. Shum, and K. T. Ng, "Image-based rendering and synthesis," IEEE Signal Process. Mag., vol. 24, no. 7, pp. 22–33, Nov. 2007.
[35] R. Szeliski, Computer Vision: Algorithms and Applications. New York: Springer-Verlag, 2011.
[36] M. Hirsch, D. Lanman, H. Holtzman, and R. Raskar, "BiDi screen: A thin, depth-sensing LCD for 3D interaction using light fields," ACM Trans. Graphics, vol. 28, no. 5, 2009.
[37] L. Chiariglione, "Multimedia standards: Interfaces to innovation," Proc. IEEE, to be published.
[38] V. M. Bove, "Display holography's digital second act," Proc. IEEE, vol. 100, no. 4, Apr. 2012, DOI: 10.1109/JPROC.2011.2182071.
[39] A. Vetro, T. Wiegand, and G. J. Sullivan, "Overview of the stereo and multiview video coding extensions of the H.264/MPEG-4 AVC standard," Proc. IEEE, vol. 99, no. 4, pp. 626–642, Apr. 2011.
[40] B. Girod, A. M. Aaron, S. Rane, and D. Rebollo-Monedero, "Distributed video coding," Proc. IEEE, vol. 93, no. 1, pp. 71–83, Jan. 2005.
[41] J. Y. A. Wang and E. Adelson, "Representing moving images with layers," IEEE Trans. Image Process., vol. 3, no. 5, pp. 625–638, Sep. 1994.
[42] A. Smolic and P. Kauff, "Interactive 3D video representations and coding technologies," Proc. IEEE, vol. 93, no. 1, pp. 98–110, Jan. 2005.
[43] B. L. Vercoe, W. G. Gardner, and E. D. Scheirer, "Structured audio: Creation, transmission, and rendering of parametric sound representations," Proc. IEEE, vol. 86, no. 5, pp. 922–940, May 1998.
[44] P. Hinterseer, S. Hirche, S. Chaudhuri, E. Steinbach, and M. Buss, "Perception-based data reduction and transmission of haptic data in telepresence and teleaction systems," IEEE Trans. Signal Process., vol. 56, no. 2, pp. 588–597, Feb. 2008.
[45] E. Steinbach, S. Hirche, J. Kammerl, I. Vittorias, and R. Chaudhari, "Haptic data compression and communication," IEEE Signal Process. Mag., vol. 28, no. 1, pp. 87–96, Jan. 2011.
[46] J. N. Bailenson, A. C. Beall, and J. Blascovich, "Gaze and task performance in shared virtual environments," J. Vis. Comput. Animat., vol. 13, no. 5, pp. 313–320, 2002.
[47] G. Venolia, J. Tang, R. Cervantes, S. Bly, G. Robertson, B. Lee, K. Inkpen, and S. Drucker, "Embodied social proxy: Connecting hub-and-satellite teams," Proc. Comput. Supported Cooperative Work, 2010.
[48] P. E. Spector, Summated Rating Scale Construction: An Introduction. Thousand Oaks, CA: Sage Publications, 1992.
[49] J. Tang and S. Minneman, "VideoWhiteboard: Video shadows to support remote collaboration," Proc. SIGCHI Conf. Human Factors Comput. Syst., 1991, pp. 315–322, DOI: 10.1145/108844.108932.
[50] H. Ishii and M. Kobayashi, "ClearBoard: A seamless medium for shared drawing and conversation with eye contact," Proc. SIGCHI Conf. Human Factors Comput. Syst., 1992, DOI: 10.1145/142750.142977.
[51] A. Tang, M. Pahud, K. Inkpen, H. Benko, J. C. Tang, and B. Buxton, "Three's company: Understanding communication channels in a three-way distributed collaboration," Proc. Comput. Supported Cooperative Work, 2010, pp. 271–280.
[52] D. T. Nguyen and J. Canny, "Multiview: Improving trust in group video conferencing through spatial faithfulness," Proc. SIGCHI Conf. Human Factors Comput. Syst., 2007, DOI: 10.1145/1240624.1240846.
[53] Methods for Objective and Subjective Assessment of Quality, ITU-T Recommendation P.800, 1996.
[54] N. S. Jayant, J. D. Johnston, and R. J. Safranek, "Signal compression based on models of human perception," Proc. IEEE, vol. 81, no. 10, pp. 1385–1422, Oct. 1993.
[55] B. C. J. Moore, Psychology of Hearing, 5th ed. New York: Academic, 2003.
[56] M. H. Pirenne, "Visual acuity," in The Eye, vol. 2, H. Davson, Ed. New York: Academic, 1962.
[57] One-Way Transmission Time, ITU-T Recommendation G.114, 2003.
[58] Relative Timing of Sound and Vision for Broadcasting, ITU-R BT.1359-1, 1998.
[59] E. Geelhoed, A. Parker, D. J. Williams, and M. Groen, "Effects of latency on telepresence," HP Labs, Tech. Rep. HPL-2009-120, 2009.
[60] M. Chen, "Leveraging the asymmetric sensitivity of eye contact for videoconferencing," Proc. SIGCHI Conf. Human Factors Comput. Syst., 2002, DOI: 10.1145/503376.503386.
[61] I. Cheng, R. Shen, X.-D. Yang, and P. Boulanger, "Perceptual analysis of level-of-detail: The JND approach," Proc. Int. Symp. Multimedia, 2006, pp. 533–540.
[62] A. Benoit, P. Le Callet, P. Campisi, and R. Cousseau, "Quality assessment of stereoscopic images," EURASIP J. Image Video Process., vol. 2008, 2008, DOI: 10.1155/2008/659024.
[63] S. Boyne, N. Pavlovic, R. Kilgore, and M. H. Chignell, "Auditory and visual facilitation: Cross-modal fusion of information in multi-modal displays," in Meeting Proc. Visualisation Common Operational Picture, Neuilly-sur-Seine, France, 2005, Paper 19, pp. 19-1–19-4, RTO-MP-IST-043. [Online]. Available: http://www.rto.nato.int/abstracts.asp
[64] M. Lambooij, W. IJsselsteijn, and I. Heynderickx, "Stereoscopic displays and visual comfort: A review," in SPIE Newsroom, Apr. 2, 2007, DOI: 10.1117/2.1200703.0648.
[65] V. Walsh and J. Kulikowski, Eds., Perceptual Constancy: Why Things Look as They Do. Cambridge, U.K.: Cambridge Univ. Press, 1996.
[66] J. Blascovich and J. Bailenson, Infinite Reality: Avatars, Eternal Life, New Worlds, and the Dawn of the Virtual Revolution. New York: Harper-Collins, 2011.
[67] B. Reeves and C. Nass, The Media Equation: How People Treat Computers, Television, and New Media Like Real People and Places. Stanford, CA: CSLI Publications, 1996.
[68] QSound Labs, Virtual Barber Shop, May 5, 2011. [Online]. Available: http://www.qsound.com/demos/binaural-audio.htm
[69] V. R. Algazi and R. O. Duda, "Headphone-based spatial sound," IEEE Signal Process. Mag., vol. 28, no. 1, pp. 33–42, Jan. 2011.
[70] B. Kapralos, M. R. Jenkin, and E. Milios, "Virtual audio systems," Presence, vol. 17, no. 6, pp. 527–549, 2008.
[71] I. Toshima, S. Aoki, and T. Hirahara, "An acoustical tele-presence robot: TeleHead II," in Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst., Sendai, Japan, Sep. 2004, pp. 2105–2110.
[72] W. Gardner, "3-D audio using loudspeakers," Ph.D. dissertation, Progr. Media Arts Sci., Sch. Archit. Planning, Massachusetts Inst. Technol., Cambridge, MA, 1997.
[73] T. Grämer, "Efficient modeling of head movements and dynamic scenes in virtual acoustics," M.S. thesis, Univ. Zurich/ETH, Faculty Med., 2010.


[74] D. R. Begault, "Auditory and non-auditory factors that potentially influence virtual acoustic imagery," in Proc. 16th Audio Eng. Soc. Int. Conf. Spatial Sound Reproduction, Mar. 1999.
[75] A. R. A. Conway, N. Cowan, and M. F. Bunting, "The cocktail party phenomenon revisited: The importance of working memory capacity," Psychonomic Bull. Rev., vol. 8, no. 2, pp. 331–335, 2001.
[76] S. Haykin and Z. Chen, "The cocktail party problem," Neural Comput., vol. 17, no. 9, pp. 1875–1902, 2005.
[77] Second Life. [Online]. Available: http://www.secondlife.com
[78] Yoda. [Online]. Available: http://en.wikipedia.org/wiki/Yoda
[79] U.S. National Remote Working Survey, 2010. [Online]. Available: http://www.microsoft.com/presspass/download/features/2010/NationalRemoteWorkingSummary.pdf
[80] "Iceland volcano ash hits European flights again," Agence France-Presse, May 8, 2010. [Online]. Available: http://news.ph.msn.com/top-stories/article.aspx?cp-documentid=4076015; http://www.google.com/hostednews/afp/article/ALeqM5gBr_mBjSK4JC-W3iODp3Hhw5S5SiQ

ABOUT THE AUTHORS

John G. Apostolopoulos (Fellow, IEEE) received the B.S., M.S., and Ph.D. degrees in electrical engineering and computer science from Massachusetts Institute of Technology (MIT), Cambridge.
He is Director of the Mobile & Immersive Experience Lab, within Hewlett-Packard Laboratories, Palo Alto, CA, directing about 75 researchers in the United States, India, and the United Kingdom. His research interests include mobile and immersive multimedia communications and improving their naturalness, reliability, fidelity, scalability, and security. In graduate school, he worked on the U.S. Digital TV standard and received an Emmy Award Certificate for his contributions. He enjoys teaching and conducts joint research at Stanford University, Stanford, CA, where he was a Consulting Associate Professor of Electrical Engineering (2000–2009), and is a frequent visiting lecturer at MIT's Department of Electrical Engineering and Computer Science.
Dr. Apostolopoulos received a best student paper award for part of his Ph.D. dissertation, and the Young Investigator Award (best paper award) at the 2001 Visual Communication and Image Processing (VCIP 2001) Conference for his work on multiple description video coding and path diversity. He was named "one of the world's top 100 young (under 35) innovators in science and technology" (TR100) by MIT Technology Review, and was coauthor of the best paper award at the 2006 International Conference on Multimedia and Expo (ICME 2006) on authentication for streaming media. His work on secure transcoding was adopted by the JPSEC standard. He served as chair of the IEEE Image, Video, and Multidimensional Signal Processing Technical Committee, member of the Multimedia Signal Processing Technical Committee, and technical co-chair for the 2007 IEEE International Conference on Image Processing (ICIP'07), and recently was co-guest editor of the special issue of the IEEE SIGNAL PROCESSING MAGAZINE on "Immersive Communication." He serves as technical co-chair of the 2011 IEEE International Conference on Multimedia Signal Processing (MMSP'11) and the IEEE 2012 International Conference on Emerging Signal Processing Applications (ESPA'12) (and co-founder).

Philip A. Chou (Fellow, IEEE) received the B.S.E. degree from Princeton University, Princeton, NJ, in 1980 and the M.S. degree from the University of California Berkeley, Berkeley, in 1983, both in electrical engineering and computer science, and the Ph.D. degree in electrical engineering from Stanford University, Stanford, CA, in 1988.
From 1988 to 1990, he was a Member of Technical Staff at AT&T Bell Laboratories, Murray Hill, NJ. From 1990 to 1996, he was a Member of Research Staff at the Xerox Palo Alto Research Center, Palo Alto, CA. In 1997, he was manager of the compression group at VXtreme, an Internet video startup in Mountain View, CA, before it was acquired by Microsoft in 1997. From 1998 to the present, he has been a Principal Researcher with Microsoft Research in Redmond, WA, where he currently manages the Communication and Collaboration Systems research group. He has served as Consulting Associate Professor at Stanford University (1994–1995), Affiliate Associate Professor at the University of Washington (1998–2009), and Adjunct Professor at the Chinese University of Hong Kong (since 2006). He has longstanding research interests in data compression, signal processing, information theory, communications, and pattern recognition, with applications to video, images, audio, speech, and documents. He is co-editor, with M. van der Schaar, of the book Multimedia over IP and Wireless Networks (Amsterdam, The Netherlands: Elsevier, 2007).
Dr. Chou served as an Associate Editor in source coding for the IEEE TRANSACTIONS ON INFORMATION THEORY from 1998 to 2001, and as a Guest Editor for special issues in the IEEE TRANSACTIONS ON IMAGE PROCESSING, the IEEE TRANSACTIONS ON MULTIMEDIA (TMM), and the IEEE SIGNAL PROCESSING MAGAZINE in 1996, 2004, and 2011, respectively. He was a member of the IEEE Signal Processing Society (SPS) Image and Multidimensional Signal Processing Technical Committee (IMDSP TC), where he chaired the awards subcommittee (1998–2004). Currently, he is chair of the SPS Multimedia Signal Processing TC, member of the ComSoc Multimedia TC, member of the IEEE SPS Fellow selection committee, and member of the TMM and ICME Steering Committees. He was the founding technical chair for the inaugural NetCod 2005 workshop, special session and panel chair for the 2007 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2007), publicity chair for the Packet Video Workshop 2009, and technical co-chair for the 2009 IEEE International Workshop on Multimedia Signal Processing (MMSP 2009). He is a member of Phi Beta Kappa, Tau Beta Pi, Sigma Xi, and the IEEE Computer, Information Theory, Signal Processing, and Communications societies, and was an active member of the MPEG committee. He is the recipient, with T. Lookabaugh, of the 1993 Signal Processing Society Paper Award; with A. Seghal, of the 2002 ICME Best Paper Award; with Z. Miao, of the 2007 IEEE TRANSACTIONS ON MULTIMEDIA Best Paper Award; and with M. Ponec, S. Sengupta, M. Chen, and J. Li, of the 2009 ICME Best Paper Award.

Bruce Culbertson (Member, IEEE) received the M.A. degree in mathematics from the University of California San Diego, La Jolla, in 1976 and the M.S. degree in computer and information science from Dartmouth College, Hanover, NH, in 1981.
He is a Research Manager currently leading a project in continuous 3-D at Hewlett-Packard Laboratories, Palo Alto, CA, where he has worked since 1984. He previously led a project in immersive communication and remote collaboration. He has also worked on projects involving virtual reality, novel cameras and immersive displays, 3-D reconstruction, defect-tolerant computer design, reconfigurable computers, logic synthesis and computer-aided design of digital logic, computer design, and voice and data networks. His current research interests are in computer vision, 3-D, and immersive communication.
Mr. Culbertson has served on the program committee of conferences including the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) and the IEEE International Conference on Computer Vision (ICCV).


Ton Kalker (Fellow, IEEE) graduated in mathematics in 1979 and received the Ph.D. degree in mathematics from Rijksuniversiteit Leiden, The Netherlands, in 1986.
He is VP of Technology at the Huawei Innovation Center, Santa Clara, CA, leading the research on next generation media technologies. He made significant contributions to the field of media security, in particular digital watermarking, robust media identification, and interoperability of digital rights management systems. His watermarking technology solution was accepted as the core technology for the proposed DVD copy protection standard and earned him the title of Fellow of the IEEE. His subsequent research focused on robust media identification, where he laid the foundation of the content identification business unit of Philips Electronics, successful in commercializing watermarking and other identification technologies. In his Philips period, he coauthored 30 patents and 39 patent applications. His interests are in the field of signal and audiovisual processing, media security, biometrics, information theory, and cryptography. He joined Huawei in December 2010. Prior to Huawei, he was with Hewlett-Packard Laboratories, focusing on the problem of noninteroperability of DRM systems. He became one of the three lead architects of the Coral consortium, publishing a standard framework for DRM interoperability in summer 2007. He served for six years as visiting faculty at the University of Eindhoven, Eindhoven, The Netherlands.
Dr. Kalker participates actively in the academic community through students, publications, keynotes, lectures, membership in program committees, and serving as conference chair. He is a cofounder of the IEEE TRANSACTIONS ON INFORMATION FORENSICS AND SECURITY. He is the former chair of the associated Technical Committee on Information Forensics and Security.

Susie Wee (Fellow, IEEE) received the B.S., M.S., and Ph.D. degrees from the Massachusetts Institute of Technology (MIT), Cambridge, in 1990, 1991, and 1996, respectively.
She is the Vice President and Chief Technology and Experience Officer of Collaboration at Cisco Systems, San Jose, CA, where she is responsible for driving innovation and user experience in Cisco's collaboration products and services, which include unified communications, IP voice and video, telepresence, web conferencing, social collaboration, and mobility and virtualization solutions. Prior to this, she was at Hewlett-Packard Company, Palo Alto, CA, in the roles of founding Vice President of the Experience Software Business and Chief Technology Officer of Client Cloud Services in HP's Personal Systems Group, and Lab Director of the HP Labs Mobile and Media Systems Lab. She was the coeditor of the JPSEC standard for the security of JPEG-2000 images and the editor of the JPSEC amendment on File Format Security. While at HP Labs, she was a consulting Assistant Professor at Stanford University, Stanford, CA, where she co-taught a graduate-level course on digital video processing. She has over 50 international publications and over 40 granted patents.
Dr. Wee was an Associate Editor for the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY and for the IEEE TRANSACTIONS ON IMAGE PROCESSING. She received Technology Review's Top 100 Young Innovators award, ComputerWorld's Top 40 Innovators under 40, the INCITS Technical Excellence award, and the Women In Technology International Hall of Fame award. She is an IEEE Fellow for her contributions in multimedia technology.

Mitchell D. Trott (Fellow, IEEE) received the B.S. and M.S. degrees in systems engineering from Case Western Reserve University, Cleveland, OH, in 1987 and 1988, respectively, and the Ph.D. degree in electrical engineering from Stanford University, Stanford, CA, in 1992. He was an Associate Professor in the Depart- ment of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge, from 1992 until 1998, and Director of Research at ArrayComm, Inc., San Jose, CA, from 1997 through 2002. He is now a Distinguished Technologist at Hewlett-Packard Laboratories, Palo Alto, CA, where he leads the Seamless Collaboration project. His research interests include streaming media systems, multimedia collaboration, multiuser and wireless communication, and information theory.
