LiU-ITN-TEK-A--09/050--SE

3D Teleconferencing

Magnus Lång

2009-09-07

Department of Science and Technology, Linköping University, SE-601 74 Norrköping, Sweden. LiU-ITN-TEK-A--09/050--SE

3D Teleconferencing. Master's thesis in media technology, carried out at the Institute of Technology at Linköping University. Magnus Lång

Supervisor: Jonas Unger. Examiner: Jonas Unger

Norrköping 2009-09-07

Copyright

The publishers will keep this document online on the Internet - or its possible replacement - for a considerable time from the date of publication barring exceptional circumstances. The online availability of the document implies a permanent permission for anyone to read, to download, to print out single copies for your own use and to use it unchanged for any non-commercial research and educational purpose. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional on the consent of the copyright owner. The publisher has taken technical and administrative measures to assure authenticity, security and accessibility. According to intellectual property law the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement. For additional information about the Linköping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its WWW home page: http://www.ep.liu.se/

© Magnus Lång

Abstract

This report summarizes the work done to develop a 3D teleconferencing system, which enables a remote participant anywhere in the world to be scanned in 3D, transmitted and displayed on a 3D display with correct vertical and horizontal parallax, correct eye contact and eye gaze. The main focus of my work, and thus of this report, has been the development of this system and especially how to render to the novel 3D display in an efficient and general manner. The 3D display is built out of modified commodity hardware and shows a 3D scene to observers up to 360 degrees around it and at all heights. The result is a fully working 3D teleconferencing system, resembling the communication envisioned in movies, such as the holograms from Star Wars. The system transmits over the internet, with bandwidth requirements similar to those of contemporary 2D video conferencing systems.

Preface

This master's thesis project was completed at the Graphics Lab of the University of Southern California's Institute for Creative Technologies (ICT), in Los Angeles, California, between the summers of 2008 and 2009, and it has been a pleasure to work and learn at ICT. The project was part of a bigger project conducted in cooperation with Andrew Jones, Graham Fyffe, Xueming Yu, Jay Busch, Ian McDowall, Mark Bolas and Dr. Paul Debevec. Many thanks to all of you, I had a great time and learned a lot. The idea for this project came from Andrew Jones and Dr. Paul Debevec and was contingent on the resources and advanced research of the Graphics Lab at ICT. I also wish to thank my examiner Jonas Unger at Linköping University, ITN, and Pieter Peers and Cyrus Wilson from the Graphics Lab of USC ICT. In addition I wish to thank David Krum, Monica Nichelson, Richard DiNinni, Scott Fisher, Bill Swartout, Randy Hill, Randolph Hall and John Parmentola for their support, assistance, and inspiration for this work. This work was sponsored by the U.S. Army Research, Development, and Engineering Command (RDECOM) and the University of Southern California Office of the Provost. The high-speed projector was originally developed by a grant from the Office of Naval Research under the guidance of Ralph Wachter and Larry Rosenblum. The content of the information does not necessarily reflect the position or the policy of the US Government, and no official endorsement should be inferred.

Contents

Preface

1 Introduction
   1.1 Background
   1.2 Purpose
   1.3 Method
   1.4 Contributions
   1.5 Structure

2 Background and prior work
   2.1 Types of 3D displays
      2.1.1 Holographic display
      2.1.2 Highly multi-view 3D displays
      2.1.3 Volumetric display
      2.1.4 This project's display class
   2.2 Definitions and properties
      2.2.1 Motion parallax
      2.2.2 Vergence and accommodation
      2.2.3 Size, resolution and refresh rate
      2.2.4 Integrated display systems
      2.2.5 Anisotropic surfaces
   2.3 Human interaction
   2.4 Earlier and concurrent 3D displays
   2.5 System Overview
   2.6 Conclusion

3 Teleconferencing - Overview
   3.1 Acquisition
   3.2 Transmission and Processing
   3.3 Display
   3.4 Human interaction
   3.5 Conclusion

4 Acquisition of Remote Participant
   4.1 Projector
   4.2 Cameras
   4.3 Conclusion

5 Transmission and Processing
   5.1 Fast processing
   5.2 Data transmission
   5.3 Data transmission and processing pipeline
   5.4 Synchronization
      5.4.1 Capture synchronization
      5.4.2 Display synchronization
   5.5 Conclusion

6 3D Display system
   6.1 Hardware
      6.1.1 Projector
      6.1.2 Mirrors
   6.2 Theory
      6.2.1 Algebraic problem
      6.2.2 Pre-calculated Look Up Table
      6.2.3 LUT generation on the GPU
      6.2.4 LUT function fitting
   6.3 Software
      6.3.1 Advancing the possibilities
      6.3.2 Dithering
   6.4 Conclusion

7 Results
   7.1 Analysis of the result
      7.1.1 Acquisition
      7.1.2 Transmission and Processing
      7.1.3 Display
   7.2 User perception
   7.3 Fulfillment of the purpose

8 Conclusions
   8.1 Future work
   8.2 Publicity

A Appendix
   A.1 LUT vertex shader code
   A.2 Fitted LUT vertex shader code
   A.3 LUT fitting code excerpt

List of Figures

1.1 An image from the movie Star Wars: Episode III - Revenge of the Sith, in which one of the characters is being transmitted in 3D, holographically, to a meeting, and appears to be there in person.

2.1 These diagrams illustrate the underlying principle of how holograms are stored and recreated by coherent light, often lasers. This is very different from how this 3D teleconferencing system's display works.
2.2 The 3D display of Jones et al. [24]. To the left a diagram, and to the right the finished construction.
2.3 Projection of a tie-fighter 3D model. (a) Projecting regular perspective images exaggerates vertical perspective and causes stretching when the viewpoint rises. (b) Projecting a perspective image that would appear correct to the viewer if the mirror were diffuse exaggerates horizontal perspective and causes keystoning. (c) The MCOP rendering is correct for any chosen viewpoint height and distance.
2.4 (a) Model of the Jones et al. [24] setup. Rays are shown intersecting and vertically diffusing off of the mirror, then intersecting the cylindrical area of possible viewers (Vi) at a given height h and distance d from the mirror's pivot point, the origin. (b) Seen from above, the projector position can be unfolded in the mirror to P'. The projector projects to multiple viewers in the same image. Images courtesy of Jones et al. [24].

3.1 System diagram. Shows all the major parts of the teleconferencing system and how they are connected.
3.2 The left column shows the audience and the display, conceptual and final construction. The right column shows the same for the scanning part of the system, with the remote participant being scanned.

4.1 The real-time 3D scanning system showing the structured light scanning system (120 Hz video projector and camera) in use, and the large 2D video feed monitor displaying the audience for the remote participant.
4.2 Schematic diagram of the real-time 3D scanning system, showing the conceptual setup of the scanning rig.

5.1 Diagram of the data transmission and processing pipeline from 3D capture to display.

6.1 Photo of the display in action, with multiple viewers interacting in real-time.
6.2 The 3D display construction showing the two-sided display surface, high-speed video projector, front beam splitter, and 2D video feed camera. Crossed polarizers prevent the video feed camera from seeing past the beam splitter.
6.3 The photograph shows a point laser hitting an anisotropic mirror surface, creating a conical reflection.
6.4 (a) 3D face shown intersecting the designed display surface. (b) Convex, flat, and concave display surfaces made from brushed aluminum sheet metal.
6.5 A ray of light reflected off of the mirror, as seen from above. The cone reflected from the surface intersects the viewing circle. The angle of the conical reflection's apex is equal to the incident angle. At the apex there is also a slight specular highlight.
6.6 Diagram of the main structure of the 3D display program.
6.7 Comparison of different dithering techniques. (a) The original image. (b) Thresholding/average dithering. (c) Random dithering. (d) Ordered dithering, used on the display. (e) Dithering with Ostromoukhov's algorithm.

7.1 The image shows the three different mirrors built and used on the display. These were all designed to have different properties when used, which are discussed in this section. Displayed are, from the left: concave, flat and convex.
7.2 (a) Light diverging from a flat anisotropic display surface can illuminate multiple viewers simultaneously, requiring a single projector frame to accommodate multiple viewer heights. (b) Light reflected by a concave display surface typically projects imagery to at most one viewer at a time, simplifying the process of rendering correct vertical parallax to each tracked viewer.
7.3 A grid showing how a plane will be warped by the LUT projection technique to reflect perspective-correct off of the mirror. Three LUTs for three mirror shapes are shown, evaluated for three different mirror angles. At 0° the mirror is facing a viewer standing straight in front of the display. For mirror angle 30°, the image is reflected to a viewpoint at 60°. The black rectangles show the extent of the projector frame. The graphs are oriented so that 'up' corresponds to the top of the mirror.

7.4 The images show, for three different angles, a projection of a plane onto the concave mirror. First using the fitted function LUT in lines (black) and then next to it, on the right, the full 6D LUT rendering translation in a grid (red and blue). As can be seen, both methods yield the same result.
7.5 Comparison of the different mirror shapes for two simultaneously tracked upper and lower viewpoints. For the convex mirror, the geometry was scaled by 0.75 to fit within the smaller display volume. In the 4th and 8th columns the mirror is replaced with the actual mannequin head to provide a ground truth reference.
7.6 A test object (left) is aimed at a camera shown on the 2D display (right). The camera photographs the transmitted 3D displayed image of the test object to measure gaze accuracy for the 3D display. The accuracy of the 2D display is measured in the same way by switching the locations of the camera and test object.

List of Tables

7.1 Performance comparison between the developed 3D display, Cossairt et al. [11] and Jones et al. [24]

Chapter 1

Introduction

1.1 Background

This project is based upon a vision. Imagine a world where traveling far distances to have business meetings is long forgotten. No time, money or greenhouse gases spent on traveling. A world where distances, even across continents, fade into nothingness, because whenever you talk to your friends or family, you feel like you are there with them. When you talk to them, you are sitting in the same room for a little while, looking them straight in the eye. If there are more people in the room, they can see who you are talking to and looking at, your gestures and non-verbal communication. For just a short while it is as if you teleported anywhere in the world and could interact with your loved ones, business associates or friends, just as simply as making a phone call today. How would this impact the world? How would we change our habits and relations?

Figure 1.1: An image from the movie Star Wars: Episode III - Revenge of the Sith, in which one of the characters is being transmitted in 3D, holographically, to a meeting, and appears to be there in person.

Compare this to the advanced holograms of fictional TV series such as Star Trek or the Star Wars movies, where people can interact with holograms which to the participants appear indistinguishable from real persons in the vicinity, whether they are live transmitted people or computer generated characters. This possible future is imagined and depicted in movies via visual effects, showing important communiqués delivered via 3D holograms, making the transmitted individual appear in 3D in the room. An example of this from Star Wars is shown in figure 1.1. In reality the human race seems far away from accomplishing actual teleportation of people, so it may yet be a while before we can all have these exact experiences. But how far away are we from being able to simulate these effects? Modern technology is moving ahead at a never before seen pace. Telephones are everywhere, and are becoming more and more capable of handling everything from e-mail to showing videos and playing games. Voice-to-voice communication, though, is very limited in the amount of information transmitted. Important visual cues which we take for granted in a face-to-face conversation are omitted. How is the other person reacting to what I'm saying? Is he interested? Paying attention? Looking at something else? Modern society produces visual effects, movies and immersive experiences today which might be static or dynamic and specialized to certain conditions, settings or occasions. But the fact remains that the boundary for the stored or saved experiences shown is being pushed forward all the time.

The telephone communication system has been around for well over a century and has surely revolutionized the world, but it has only recently evolved to the next level. At the end of the 20th century it became feasible for the general public to have 2D video feeds at the same time as the voice communication. 2D video definitely adds another dimension to remote communication. To see the other party's expressions, non-verbal cues and appearance adds greatly to the valuable information transmitted. To use this kind of system, for example over the internet with web cameras, moves the communication from being voice-only to an interactive version of the television. This is a lot more eye catching and informative than the telephone, but still with limitations and far away from the vision described. For example, teleconferencing systems built with 2D technology do not as of today fully replace in-person conferences. There are many details which are not communicated. These include eye contact and correct eye gaze, which are a major part of our interpersonal communication [5]. What are we mostly looking at when we meet someone and talk? Is the other party looking at me? Embarrassedly looking away? Looking at or addressing a colleague on my right? In a 2D teleconferencing system or 1-to-1 video feed the participants are actually looking at their screens, offset from the capturing camera or eye position of the other participant. In a one-to-many or many-to-many system, whenever a participant looks away from the camera, to another participant or at the screen, all remote participants will see that participant looking away from them, but will be unable to intuitively tell where his focus shifted to.

This project is not trying to teleport. It is not trying to create a full body hologram indistinguishable from other persons in the room. The goal of this project is to advance the front lines in this direction. To create new possibilities for at least pushing

some of these aspects, properties and visions of future communication into realities today. This does not only present the possibility to do novel and interesting work, but may in the future make a great impact on the entire world and how it communicates and functions. Studies show that a major part of our communication takes place through visual cues and timed non-verbal gestures. Head movement, for example, is not only used when actually speaking to someone, but is a major communication tool for us whilst listening to the other party in a dialogue. The head can for example be used to indicate 'yes' or 'no', confusion, interest, impatience, attention to what is being said, or to signal that you have something to say, to claim your turn to speak and get others to give you a chance to speak and prepare to listen themselves [37, 23, 9, 18]. The problems of achieving eye contact and correct eye gaze are difficult and interesting, and there certainly appears to be value in creating a system which can do this [32, 5, 26]. Gaze has been called the window into the human soul, for good reason. It reveals more than we are saying. It provides a channel of information about what we are thinking and feeling inside, maybe even information more true than the words we are saying. These human functions and behaviors have been studied to some extent [19, 27] and have for example been used to successfully create very immersive virtual human agents [6].

1.2 Purpose

The main purpose and also challenge of the project is to achieve a 3D teleconferencing system. To be able to visually reproduce someone in 3D at a remote location, that person first has to be captured in a way that makes this possible. Today 3D scanning is applied in many fields using 3D models, for example in industrial design, orthotics and prosthetics, reverse engineering and prototyping, quality control/inspection, documentation of cultural artifacts and the visual effects industries. But it is only recently that research on real-time, high quality 3D acquisition of human faces has developed enough to be useful for application in this kind of system [43]. The scanning and processing of the data that is to be transmitted needs to be in real-time and at a high enough rate to enable fluent communication. This must be achieved without too much latency. Latency in a bidirectional human interaction system is very detectable by the users of the system, just as in a phone call across the world. Uninterrupted and flowing real life conversation needs more or less instant feedback. The capture, processing and transmission latency would also be added to possible latency introduced on the receiving side, which has to process and display the information on a 3D display. Another goal for the 3D display is to make it big enough to display a life sized human head, with an audience being able to move around it and see it in 3D without any visual aids like 3D glasses or VR helmets. In addition, the goal is to display the head for 180 degrees around the face displayed, at an update rate comfortable for the users, and showing correctly from all viewing directions. This is because an outspoken goal of the project is to achieve something new, and hopefully better than existing 3D displays, and to, in conjunction with the scanning system, make it into a very real and

useful system. A more specific goal, following the requirement of a new display size, is the requirement of being able to project onto any mirror shape and any position of both the mirror and the projector. This goal will be further explained in section 6.1.2. Since the goal is to show real persons, the rendered content must not only be geometrically correct, it must also be textured, occlusion capable and lit. The details of these goals and their implementations and limitations will be further described in the coming chapters of this report.

1.3 Method

In the first stages of this project, much effort was spent on planning and reading about how to make the 3D display and the scanning system fulfill the goals and purposes set for the project. Most of the information which informed the decisions made comes from technical papers in the field of computer graphics, and is often concurrent research. These papers are referenced in the text of this report and listed in the Bibliography section at the end of the report. Weekly meetings and coordination had the group working in parallel with the physical construction of the system and the development of software. The project was organized and supervised by Dr. Paul Debevec and Andrew Jones. Once the declaration of purpose and the feasibility study were finished, the work of implementing the system started. The system was developed iteratively, to make sure all the new research was working and proven before the next step could be taken. The coding was done in C++, OpenGL and Nvidia Cg in the MS Visual Studio IDE. Image editing was done in HDRShop 3.0, and scripts for this were written in the JavaScript language. A first version of the system was displayed in December 2008, but it is still under continuous development. The main subject of the project is computer graphics and fast 3D rendering, but also fast 3D capture and human interaction aspects. Several people, listed in the preface of this document, are involved in different parts of the project. My part in this project was to develop algorithms and system parts for fast and efficient processing, transmission and display of 3D data. Most of the time was spent writing programs for this.

1.4 Contributions

This project was developed in collaboration with Andrew Jones, Graham Fyffe and Xueming Yu, with Paul Debevec as supervisor, and I can thus not take credit for all the parts of the system I am describing; but a description of the whole system is to some extent necessary to put my work in the right context and explain why and how. My contributions were:

• On the 3D display program which is responsible for rendering to the display:

– Implementing display and projection code for:

∗ Orthographic projection rendering onto the mirror.
∗ Rendering of textured geometry, as opposed to rendering binary colored wire frame models.
∗ Real-time dithering of the textured geometry. This enables rendering more than binary images, and is crucial for rendering with color.
∗ Functionality for projection from arbitrary positions, so that the projector can be moved around freely.
∗ Implementing rendering with pre-calculated lookup table projection, which gives an essential increase in rendering speed.
∗ Implementing rendering with function-fitted, pre-calculated lookup tables for an even faster rendering rate.
∗ Functionality for rendering onto arbitrary mirror shapes, which makes it possible to form the projection surface to any shape and optimize it for the goals of the project.
∗ Code for displaying animated geometry. This includes the possibilities of displaying 3D animations, saving 3D animations and playback of the saved data.
∗ Modes for receiving and displaying real-time streamed, animated data, live captured from the 3D scanning system.

– Many speed optimizations for rendering the content at a high enough frame rate. This includes optimizing code both on the CPU and GPU.
– Helped generalize the projection math needed to generate the lookup tables used for rendering.
– Wrote code for fitting lookup tables to functions for fast data access.
– Programmed the shaders needed to implement and render the above functionality.

• Helped construct and design the new 3D display rig, including the new anisotropic mirrors, with new materials and shapes.

• Implemented system communication and networking for the computers and hardware involved in the whole system. This includes:
   – Fast integrated transmission of content through the whole pipeline, end-to-end.
   – Multi-threading of different functionalities, to utilize multi-core hardware.
   – Synchronization of the hardware and computers working together in the system, and a centralized control system for easy handling and increased usability of the teleconferencing system.

My contributions in the scanning and tracking software are negligible and will therefore not be described more than necessary in this report.

1.5 Structure

Following this introductory part, chapter 2 describes, to some extent, the facts found in the feasibility study and current information within the field of 3D displays necessary to understand many of the design decisions made in the project. It explains terminology and concepts later used to explain the work conducted. The next chapters are a description of the problems encountered and the solutions found for these, the main body of the work done. It starts with an overview of the different parts and considerations of the project. The chapters following the overview go into greater detail. They are divided into three major parts: Acquisition of Remote Participant, Transmission and Processing, and 3D Display System. The 3D Display System chapter is the biggest part, reflecting the amount of contributions made there. This is followed by a results chapter (7) in which the results are presented and analyzed. The report is concluded by a discussion in chapter 8.

Chapter 2

Background and prior work

The background for this project stems directly from the first version of this system, built by Jones et al. [24]. This paper and earlier work developed a 3D display system with a 360 degree autostereoscopic view. To make this more applicable to the real world, as explained in the introduction, the next evolutionary step for this project was to make it into a 3D teleconferencing system, where users can be captured in 3D in real-time and transmitted to an evolved version of the 3D display, thus creating a much more live and real communication experience than today's communication channels.

This chapter will explain some of the major concepts of 3D displays, then some properties and terms used, followed by the starting point, previous work, environmental setting and prerequisites of this project.

2.1 Types of 3D displays

There are many different solutions for displaying imagery which appears three dimensional. What is seen is often not physically three dimensional. The displays are constructed with the goal to use the way we perceive visual information to make us believe it is. There are some major categories for classifying these displays, depending on which approach they take towards achieving the three dimensionality. These categories are holographic, volumetric and highly multi-view 3D displays. The display built and used in this project is a multi-view swept volumetric display, implementing properties from two of these major categories.

2.1.1 Holographic display

This is a commonly known and used term, but in some cases not correctly. For example, many observers of our system falsely label what they see as a hologram. A hologram is an image that changes when looked at from different angles. The image changes depending on the position and orientation of the observer, and the perceived image should optimally change in exactly the same way as if the object displayed thereon was still present, thus making the recorded image (hologram) appear three dimensional. This requires a way of capturing an object or scene for display as a hologram, a science called holography. Holography was invented in 1948 by Dennis Gabor, a Hungarian-born scientist. He received the Nobel Prize in Physics more than 20 years later for this invention (1971) [1]. Holography differs in that it records not only the intensity of the light, like normal photography, but also its phase and the degree to which the wave fronts making up the reflected light are coherent. This is done with coherent light scattering off an object and illuminating the recording medium at the same time as it is lit by a reference beam of coherent light, creating an interference pattern. The holographic light field can then be recreated by illuminating the recording medium with the same coherent light beam. The most practical way of generating a sufficiently coherent and intense light beam is by using lasers, which means that almost all holographic displays are built with these.

Figure 2.1: These diagrams illustrate the underlying principle of how holograms are stored and recreated by coherent light, often lasers. This is very different from how this 3D teleconferencing system's display works.

While holography is commonly used to display static 3D pictures, such as the shifting pictures on our ID cards, it is not yet possible to generate arbitrary, animated scenes with a holographic volumetric display. The 3D display community has yet to agree on whether or not holograms create volumetric images. The technology continuously advances, and as an example the Zebra Imaging company has created holograms that generate full-color, full-parallax imagery that the user essentially perceives to be volume-filling [14].

2.1.2 Highly multi-view 3D displays

Highly multi-view 3D displays are a class of 3D displays which physically reconstruct 3D light fields by projecting from 30 to 200 views of a 3D scene with various trajectories from or through an image surface. In these cases, the imagery satisfies a so called super multi-view condition, in which the observer's eye automatically focuses on each voxel as if it were projected from that region of space [20]. Lenticular and parallax barrier displays, which are 2D screens controlling what can be seen depending on viewing position, belong to this group. When placed in the middle of the image space, a diffuser, such as a business card, appears to be slicing into an object. These characteristics may compel developers to admit highly multi-view 3D displays to the volumetric display family.

2.1.3 Volumetric display

A graphical display is generally considered to belong to the volumetric display class if it forms a visual representation of the displayed scene or object in three physical dimensions. This differs from the holographic and multi-view displays, which display planar, 2D images that in different ways visually simulate the 3D effect. For some displays it might be difficult to distinguish which class they belong to, but in general a display is considered to have to be emitting, scattering, or relaying illumination from well-defined regions in 3D space to be in this class. Volumetric 3D displays have many useful properties, including a wide field of view, and following from their definition they are mostly autostereoscopic. This means that the stereo 3D effect for the viewer is created without any special aids, such as special glasses. However, nearly all 3D displays other than those requiring head wear, i.e. stereo goggles and stereo head-mounted displays, are considered to be autostereoscopic, but this is an important property from a usability standpoint. Displays in this category support multiple simultaneous observers. Most provide high-resolution imagery compared to other techniques, and excel at several generic visualization tasks. A collection of advantages, parameters, and tradeoffs applies to many volumetric displays, some of which will be explained below. The first design and suggestion for a volumetric display came in 1912. It was suggested that a volume-filling image could be produced by reflecting or transmitting light from a rotating or oscillating 2D surface within a 3D volume. These kinds of volumetric displays are called swept volume displays. There are also some volumetric displays which can emit light from a volume without major moving parts. These use for example lasers to emit light from chosen regions of space within a volume of a solid, liquid or gas substance. They are referred to as static volume displays. Although it has been almost a century since the first proposal of a volumetric display, they are still under development and there are none commercially available for the general population. There is a variety of systems proposed, and some built and in use in small quantities in military, medical and academic fields [14].

2.1.4 This project's display class

The display developed in this project relates to swept volumetric displays in that it has a moving surface which enables it to show images originating from points in three physical dimensions, and also to highly multi-view displays in that it shows different views to viewers depending on their position around the display. This is further explained in section 6.1.2.

2.2 Definitions and properties

This section explains some important definitions and properties, and how they relate to the display developed in this project.

2.2.1 Motion parallax

Since volumetric displays create a 3D volume of the displayed content in physical space, they have the property of horizontal, vertical or even full parallax imagery, meaning that they allow one or more observers to see a 3D scene from a variety of horizontal and vertical viewpoints. Unrestricted lookaround is a key benefit of volumetric displays, opposed to multi-view lenticular or parallax barrier displays, which force observers to keep their heads uncomfortably within centimeters of a designated viewing zone. Lenticular and parallax barrier displays also have the limitation that they only permit horizontal lookaround and are thus classified as horizontal parallax only displays.

2.2.2 Vergence and accommodation

Accommodation is the equivalent of focus, applied to our eyes. Vergence is a term which refers to the eyes' tendency to rotate so that their optical axes converge at the region they are gazing upon. Because volumetric displays create imagery that truly occupies a spatial region, a viewer's eyes comfortably swivel to fix on the same point they are focusing on. This is not the case for many other types of 3D displays, for example those using stereoscopic goggles or lenticular and parallax-barrier displays.

2.2.3 Size, resolution and refresh rate

There are many different solutions and projects to build a volumetric display, and the size of the displayed image varies from less than one inch to approximately three feet across. Volumetric displays generally have resolution advantages over flat-panel autostereoscopic displays such as lenticular and parallax barrier screens, which trade off display resolution for the number of views they present to the observer. In the case of volumetric displays the resolution stays constant, and the number of views shown to the observer is instead dependent on the frame rate of the projection device or its equivalent. This is the same as for a general computer screen, which has a constant refresh rate, since it is actually independent of the rendering performance and capabilities of the computer or other connected devices. The volume's buffer content might, however, be updated at a different pace than the display itself, and this will have an impact on the design of such a system. Resolution and refresh rate are a big issue when choosing and designing hardware and techniques for a volumetric display, since they are often limited by bottlenecks in the system. Together they determine the need for optical bandwidth in the system, and thus many choices, for example in our case which projector to use. Several 3D displays use the Texas Instruments DLP chip technology in their chosen projector, and so does this display. DLP chips can generate imagery with a resolution of 1024x768 pixels at speeds above 10,000 binary frames per second. In terms of raw optical bandwidth, this corresponds to 6.3 GPixels per second. The resolution and refresh rate incur several bottlenecks in need of consideration; for example, a standard projector DVI cable (single link) can only support a data rate of 3.96 Gbit/s.
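As a rough sanity check of these numbers, the small sketch below estimates how many 1024x768 binary frames per second a single-link DVI connection can carry. It only restates figures already quoted in this report; treat it as an illustration, not a hardware specification.

#include <cstdio>

int main() {
    // Figures quoted in this report (illustrative, not a hardware spec).
    const double dvi_gbit_s = 3.96;                        // single-link DVI data rate
    const double bits_per_binary_frame = 1024.0 * 768.0;   // one bit per pixel

    // Binary frames per second that fit through the cable.
    const double max_binary_fps = dvi_gbit_s * 1e9 / bits_per_binary_frame;

    // The DMD itself is rated above 10,000 binary fps, so the DVI link,
    // not the chip, becomes the limiting bottleneck.
    std::printf("DVI-limited binary frame rate: ~%.0f fps\n", max_binary_fps);
    return 0;
}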

2.2.4 Integrated display systems

Many volumetric display systems, including ours, are built upon standardized technology, using the same known computer architecture as most of today's PCs, although machines from the high-end spectrum. The system may consist of very complex subsystems, software, algorithms, and mechanical, optical and electronic parts, but will be integrated with and driven by a modern computer with specialized software. The software is built with standardized APIs and interfaces, which allows faster and more efficient development in modern, higher level languages.

2.2.5 Anisotropic surfaces

The 3D display of this project is based on the functionality of an anisotropic reflective surface. Anisotropy is the property of being directionally dependent, as opposed to isotropy, which means homogeneity in all directions. It can be defined as a difference in a physical property (absorbance, refractive index, density, etc.) for a material when measured along different axes. This property is leveraged for the purpose of building the 3D display in this project, see section 6.1.2 for more on this.

2.3 Human interaction

As discussed in the introduction to this report, and as one of the main motivations for this project to create a teleconferencing system, further developing communication technologies to become more like real person-to-person interactions has great potential value. Studies discuss the fact that when people communicate in person, numerous cues of attention, eye contact, and gaze direction provide important additional channels of information [26, 19]. This makes in-person meetings more efficient and effective than telephone conversations and 2D teleconferences, since these methods cannot communicate many of these properties correctly or even at all [37, 9, 32]. In a world which is becoming more and more globalized, where time is of the essence and travel might not only be a burden but has economic and environmental

impacts, long distance communication is and will continue to be an important part of our lives. Thus, improving the amount of information transmitted over a video teleconference is of significant interest [22, 27]. In the communication between humans who meet in person, a significant amount of the important information exchanged lies in the use of gaze, attention and eye contact. The information we get from these behaviors is bound into the conversation, but also reveals what the person is thinking but not saying out loud [18]. For example, it determines attention and might indicate the focus of the conversation, and transmits important social cues for when to speak and when not to [5]. There are also studies which confirm that this would allow for faster confirmation of the communication channel [30] and, for example, make it easier to develop trust in a group [31].

2.4 Earlier and concurrent 3D displays

Earlier and simultaneous work includes the CNN ”hologram” shown on U.S. election night (2008), the long known and used Pepper's ghost [38], the lenticular sheets in our everyday life and other displays, the Perspecta Spatial 3D display, the Seelinder, the Transpost system, the LiveDimension and so forth, briefly described and compared in this section. Specific attention is given to displays which build upon the same concept as the one developed in this project, namely a moving anisotropic mirror surface.

Films showing fictional holograms generated by visual effects usually depict a single person interactively captured and transmitted in 3D from one location to interact with one or a group of persons at another distant location. The films show correct gaze and eye contact cues, making the scenes cinematically more compelling. The goal of 3D displays would be to replicate this vision, but this has yet to be accomplished in real life. On November 4th 2008 [42], during election night in the USA, the news channel CNN showed television viewers what was called a full body hologram of a remote correspondent transmitted to the news studio, appearing to be making eye contact with the news anchor. This was unfortunately just an effect performed with image compositing in postproduction and could only be seen by viewers at home. The anchor actually stared across empty space and had to pretend to be talking to the remote correspondent [36]. There are other examples of systems which claim the same holographic effect or 3D transmission already in use today for special events, even for big audiences. One example is the Musion Eyeliner, which shows 2D high definition video with a 3D feel to it on big stages. The 3D effect is achieved by using a Pepper's Ghost setup [38]. This effect has been known for a century, and apart from not being a hologram it has the same problems as the above mentioned technique: it is only visible from the audience, not projected into real space, and whoever wants to be seen interacting with it has to pretend that it is there. Another example of a commercially available teleconferencing system is the CISCO Systems TelePresence system. This system uses a spatially controlled setup of high-definition video cameras and life-size video screens to produce the impression of multiple people from different locations sitting around a conference room table. This is however done with 2D screens, and even if you get the feeling that people are sitting around you, the only time you will see them looking at you is when they are actually looking into the camera. This system is thus not really in 3D and will not achieve correct eye contact and gaze directions, and is thus very different from the goal of this project.

One dilemma of holograms is that the volume presented is transparent, and therefore all the visual elements shown will shine through to the viewer, superimposed. This can however be avoided in swept volumetric displays, by the use of anisotropic screens. With an anisotropic screen, the displayed content is only visible in one viewing direction at a time, which means that the content can be customized for that particular view. The concept of this solution has been known for a while and tested in previous research [7, 28, 11]. The system built in Maeda et al. [28] uses a spinning LCD monitor with an anisotropic privacy-guard film attached to achieve the effect of projecting to one viewing direction at a time, which is similar in concept to this display. But this system has a very limited refresh rate, depending on both the refresh rate of the screen and how fast it can safely be rotated. Another resembling approach is taken in the Transpost system, which projects 24 views at the same time in one projector frame. By then projecting this onto a spinning mirror via a circle of static mirrors it achieves a volumetric 3D display. This actually produces 24 different views around the display, in color. However, the number of views is not very dynamic, and neither is the resolution of the display, as all images must be packed in a circle in one projector image with limited resolution. A great advantage of the Transpost system, though, is that it easily achieves color [34]. The same concept is extended in the LiveDimension system, which uses an inward-pointing circular array of 12 projectors and a vertically-oriented light-control film, similar to that used in Maeda et al. [28]. This improves on the previous concept in that the resolution is much higher, and still in full color. It does however incur complexity and cost with its 12 projectors, at the same time as it has even fewer unique viewing directions, in fact not enough views for binocular parallax [40]. Other approaches are not dependent on a spinning mirror. One project places a horizontal array of projectors behind a large holographic diffuser, creating a multi-user horizontal parallax only display. They produce a display which is large, bright, interactive and in color, but it is limited in viewing directions to 45 degrees. This limits the audience area, the possibility to walk around it, and the illusion that the object is there in 3D space. It is also a complicated and expensive system, running off of 16 computers and 64 projectors in need of calibration [4]. Matusik et al. [29] have a related project using a horizontal array of projectors and a lenticular screen to create the different views. This actually requires fewer projectors than the above mentioned display, since it produces one view per projector. However, it suffers from the same limitations. This approach does not scale well to many views and viewing directions.

More information and surveys of previous and concurrent 3D display techniques can be found in [14, 41, 13].

None of the systems described here compensate for changing vertical perspective and parallax, and/or they require either many projectors or very specialized hardware. Learning from these projects is what led up to the approach taken in Jones et al. [24]. This is the basis for the project described in this report and is therefore described in some detail in the next section.

2.5 System Overview

Much of the work done on the 3D display developed in this project is either related to, or a direct development or extension of, the display system of Jones et al. [24]. This section will therefore cover that system in some detail.

System

The idea that became the first version of the 3D display was to make a light field display with a 360 degree field of view, with the novelty of doing this with just one projector and a spinning anisotropic flat mirror. The setup was built out of a stable construction material, 80/20 metal bars [3], as depicted in figure 2.2. They hold a motor for rotating the mirror in place, rotating about a vertical axis. On top of the motor there is a balanced flywheel, onto which the mirror construction can be mounted. The mirror used was mounted in the center of the rotation axis and had an angle of 45 degrees. The anisotropic surface of the mirror is designed to be highly specular horizontally and very diffuse vertically, which means that light reflected off the mirror will be viewable from all vertical directions, but only from the reflected angle horizontally. This means that an image projected onto the mirror can only be seen in one horizontal viewing direction, but the viewer can still see the image when moving up and down. As the mirror rotates, this enables, with the right timing, the possibility to choose what every viewer around the display will see. Thus the timing was set up in the system and the motion controlled motor connected to spin at a multiple of the output frequency of the graphics card, creating a relation between the angular velocity and the frames projected by the computer.

Figure 2.2: The 3D display of Jones et al. [24]. To the left a diagram, and to the right the finished construction.

The mirror spins at 15 Hz, 900 rpm. The angular spacing between different displayed views is therefore dependent upon how many frames can be rendered and projected onto it per rotation. To achieve a fast projection system, an off-the-shelf DLP projector was modified by removing the color wheel, and coupling the data transmission and the DMD chip with a customized FPGA, a MULE (Multi-Use Light Engine) card from Fakespace Labs, which is programmed to unpack binary frames packed into the color channels of a normal frame. This technique was first pioneered in Cossairt et al. [10]. Thus, with 24-bit color, every normal projector frame transmitted over a standard DVI cable contains 24 subframes. By then pushing the graphics card to 180 Hz, an effective one-bit frame rate of 24 * 180 = 4320 fps is achieved. This delivers 4320/15 = 288 frames per rotation, giving one distinct view for every 360/288 = 1.25 degrees of horizontal spacing around the full 360 degrees of the display.

The volume available for the 3D projection is limited by the size of the mirror, and the resolution to the square portion of the projector's pixels, in this case 768x768. What is projected onto the display is also limited in that the graphics card must be able to render it effectively at the given frame rate of 4320 fps. This is quite fast compared to the normal rate of the average computer monitor today, which updates at 60 Hz. Therefore the system is limited to mesh geometry of around 3000 vertices, or 2000 faces.

To be able to project onto the moving reflective surface and know what is reflected to the viewer, some calculations and mathematics need to be done. Otherwise, the image seen will possibly only look good from one angle. Think of it as positioning a projector against a wall to project images onto it. If the projector is not positioned straight on, the image will be skewed or keystoned. This is only one of the unwanted effects of the moving surface of the spinning mirror. Figure 2.3 (a) and (b) show rendering to the spinning mirror without using special projection mathematics to correct for this, while (c) shows the correct projection. The next section explains how this is done.
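Before moving on to the projection mathematics, here is a compact restatement of the timing figures above. Nothing in it is new; it only repeats the numbers already quoted in this section.

#include <cstdio>

int main() {
    const double subframes_per_dvi_frame = 24.0;  // one binary frame per color bit
    const double dvi_refresh_hz = 180.0;          // graphics card output rate
    const double mirror_hz = 15.0;                // mirror rotation rate (900 rpm)

    const double binary_fps     = subframes_per_dvi_frame * dvi_refresh_hz; // 4320
    const double views_per_turn = binary_fps / mirror_hz;                   // 288
    const double view_spacing   = 360.0 / views_per_turn;                   // 1.25 deg

    std::printf("%.0f binary fps, %.0f views per rotation, %.2f degrees apart\n",
                binary_fps, views_per_turn, view_spacing);
    return 0;
}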

Projection mathematics, MCOP

Figure 2.3: Projection of a tie-fighter 3D model. (a) Projecting regular perspective images exaggerates vertical perspective and causes stretching when the viewpoint rises. (b) Projecting a perspective image that would appear correct to the viewer if the mirror were diffuse exaggerates horizontal perspective and causes keystoning. (c) The MCOP rendering is correct for any chosen viewpoint height and distance.

To project onto the spinning mirror, some design constraints are made, together with a careful calibration of the projector position in relation to the mirror. The model chosen for ray tracing assumes that the mirror is centered at the origin of the coordinate system, spinning around the y-axis, and that the projector is positioned at the nodal point P above the mirror. The viewer is then said to be at height h along the y-axis and at distance d from the y-axis, see figure 2.4(a). By positioning the mirror at 45 degrees and the projector right above it, the system will be rotationally symmetric as the mirror moves around, simplifying the ray tracing since it behaves the same for every mirror position. Thus, for any given moment in time (angle of the spinning mirror) the same calculations will have to be made. As shown in figure 2.4, the projected image from the projector will be seen by multiple viewers at the same time, although because of the anisotropy each viewer will see a different plane of reflection, in the figure seen as a plane from the intersection point M to a viewer Vi. In effect this means that a single viewer will only see a vertical stripe of the image at any given moment in time. Multiple viewers around the display see the same image, but different stripes. This means that the image has to be rendered for multiple viewpoints. To be able to render 3D scenes and models to the display, one needs to know, for a given 3D point, denoted Q in the diagrams, which projector 2D pixel (u, v) will reflect off of the mirror and hit the sought viewpoint V'. Note that V' is chosen to be at a specific h and d.


Figure 2.4: (a) Model of the Jones et al. [24] setup. Rays are shown intersecting and vertically diffusing off of the mirror, then intersecting the cylindrical area of possible viewers (Vi) at a given height h and distance d from the mirror's pivot point, the origin. (b) Seen from above, the projector position can be unfolded in the mirror to P'. The projector projects to multiple viewers in the same image. Images courtesy of Jones et al. [24].

If the system is viewed from above, the anisotropic mirror behaves as a normal mirror, so the projector's reflected path can be unfolded in the plane of the mirror: P can be seen as P', see figure 2.4(b). The ray from the unfolded projector point P' going through a given 3D scene point Q will continue on to intersect the cylindrical viewing area at distance d. It will not, however, in most cases intersect the viewing circle V, which also has a specific height h. By positioning the projector straight over the mirror, and the mirror at a 45 degree incident angle, the reflection off of the mirror can be assumed to be almost in a vertical plane. Thus, the vertical plane containing the ray P'Q is intersected with the viewing circle V to get the viewpoint V' which will see the 3D point Q. The reverse ray tracing from V' to Q onto a point on the mirror M, and then from M to P, will yield a projector pixel (u, v). When this pixel is projected, the viewer at V' will see the 3D point Q lit. As the mirror rotates, the point Q will eventually be rendered for all other viewpoints V. This enables correct perspective rendering for all 3D points and all viewers around the display with just a few geometrical intersections. Figure 2.3 compares the result, as seen by a viewer, to that of a perspective (orthographic) rendering and a projective rendering (for an isotropic mirror). This rendering cannot be done with a traditional projection matrix, but can be implemented as a vertex shader, allowing a mesh to be rendered in one pass [24]. This actually renders multiple-center-of-projection (MCOP) images. This solution was informed by Hou et al. [21].
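The unfolding construction above can be sketched in a few lines of vector math. The snippet below is a hedged CPU-side illustration written for this report, not the system's actual Cg vertex shaders (those appear in appendix A); the vector type, the helper functions and the simplifying assumption that the mirror plane passes through the origin are mine.

#include <cmath>

struct Vec3 { double x, y, z; };

static Vec3 sub(Vec3 a, Vec3 b) { return {a.x - b.x, a.y - b.y, a.z - b.z}; }
static Vec3 add(Vec3 a, Vec3 b) { return {a.x + b.x, a.y + b.y, a.z + b.z}; }
static Vec3 scale(Vec3 a, double s) { return {a.x * s, a.y * s, a.z * s}; }
static double dot(Vec3 a, Vec3 b) { return a.x * b.x + a.y * b.y + a.z * b.z; }

// Reflect a point through the mirror plane, here assumed to pass through the
// origin with unit normal n (the normal depends on the current mirror angle).
static Vec3 reflectThroughMirror(Vec3 p, Vec3 n) {
    return sub(p, scale(n, 2.0 * dot(p, n)));
}

// Given a scene point Q, the projector nodal point P, the mirror normal n and
// the viewing circle (height h, radius d), return the viewpoint V' that will
// see Q, following the unfolding construction in figure 2.4(b).
static Vec3 findViewpoint(Vec3 Q, Vec3 P, Vec3 n, double h, double d) {
    Vec3 Pu  = reflectThroughMirror(P, n);  // unfolded projector position P'
    Vec3 dir = sub(Q, Pu);                  // ray from P' through Q

    // Intersect the ray Pu + t*dir with the viewing cylinder x^2 + z^2 = d^2.
    double a = dir.x * dir.x + dir.z * dir.z;
    double b = 2.0 * (Pu.x * dir.x + Pu.z * dir.z);
    double c = Pu.x * Pu.x + Pu.z * Pu.z - d * d;
    double t = (-b + std::sqrt(b * b - 4.0 * a * c)) / (2.0 * a); // far hit
    Vec3 hit = add(Pu, scale(dir, t));

    // The intersection lies on the circle of radius d; replacing its height by
    // the nominal viewing height h gives the viewpoint V'.
    return {hit.x, h, hit.z};
}

From V' the remaining steps are the ones described above: trace the ray from V' through Q back to the mirror to find M, and project M with the calibrated projector model to obtain the pixel (u, v).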

Limitations

The first version of the 3D display described above has some limitations. Summarized, these limitations are:

• The display size is only 5 inches.
• The frame rate is limited to 15 frames per second, and is to some degree perceived as flickering.
• The implementation only supports black and white wire frame or point cloud rendering of static scenes, or pre-recorded and processed lightfield playback, not animated, textured and lit geometry.
• Assumptions are made that the projector rays are perpendicular to the anisotropic axis, meaning that the projector can only be in one position.
• Needs to use a physical tracker to render correct vertical parallax.

• No real-time 3D model acquisition or transmission.
• No consideration of eye contact.

2.6 Conclusion

This chapter explains major terms and concepts of the field of 3D displays. It summarizes the prerequisites for this project and the basis for the further development, as well as an overview of concurrent work and research in the same field, to give the reader an understanding of the choices and decisions made in this project, and the base for its development. The next chapters explain how the challenges listed here are faced and solved, beginning with an overview of the project in chapter 3, followed by a more detailed explanation in chapters 4, 5 and 6. These chapters are intended to give the reader insight into the system, from scanning a user on one end of the system to creating a 3D visualization on the other end.

Chapter 3

Teleconferencing - Overview

This chapter is meant to give the reader an overview of the teleconferencing system and what is built. As explained in the introductory section 1.1, there is great interest in and potential for developing a teleconferencing system. This is done by further developing the system described in section 2.5, with the addition of a scanning apparatus and a system for transmission. This chapter explains the system built through the natural flow in it: beginning with the acquisition of data, then the transfer and processing of data through the system, and then how it ends up being displayed on the constructed 3D display. It also discusses an encompassing goal for the system: how to maximize the usability and experience for users interacting with the system. A schematic overview of the system is given in figure 3.1.

3.1 Acquisition

From the initial experiences, studies, descriptions and visions, several new demands were posed on the new system this project is developing. First of all, the person being transmitted has to be scanned in 3D, in real time, and the data this generates has to be efficiently transmitted over the internet, so a 3D scanner is needed. A 3D system is inherently well suited for displaying created 3D content. While there is an abundance of such content, and it is fairly easy to create more with today's knowledge and computers, there is great value in being able to capture real-life objects and display them in 3D. For example, instead of looking at a flat 2D video or image of someone you want to have a dialogue with, one could chat in 3D. This would create a very different experience, where eye contact and other subtle non-verbal gestures are much more prevalent and real. The idea is to make a system where the participants feel that the remote participant, located anywhere in the world, is present in the room more than ever before in a telecommunication system. To achieve this goal, a real-time 3D scanning system is constructed. This system had many possible solutions and also many challenges.

Figure 3.1: System diagram. Shows all the major parts of the teleconferencing system and how they are connected.

Options for solving this were explored, and we concluded that not many systems fulfill all of the requirements wished for, but that an implementation of Zhang and Huang's [44] system is well adapted for the purposes of this project. It acquires 3D geometry of the remote participant (RP) by projecting phase-shifted patterns onto the target object and capturing how the patterns change with a calibrated camera and projector. This enables the generation of detailed and robust geometry with only 3 frames and a marker, or 4 frames of patterns. In this project 4 patterns is the best option, and a rate of 30 geometry frames per second is possible when the system runs at 120 Hz. The scanning system is able to:

• capture geometry at high frame rate
• process in real-time
• produce decent quality geometry, texture and lighting

• transmit geometry at a sustainable rate
• be compatible with rendering on the 3D display part of the system
• do all of this with as little human-detectable delay/lag as possible

For a deeper explanation of the scanning system see section 4.

3.2 Transmission and Processing

A big part of the system is the connection between the scanning and the display side. Data has to be processed at high rates and then transmitted through the system to be displayed on the other side. This is not a very visible part of the system, but many design choices were made with it in mind. Networking code is implemented to give full control and centralized command over the four computers involved, as well as integrated, fast transmission of data. The two physically separate systems, acquisition and display, are connected here. Both systems use the same solution for synchronization. Chapter 5 goes into detail on these solutions and explains the synchronization and the entire pipeline for processing and transmission.

3.3 Display

The 3D display presented in section 2.5 is the basis for the new 3D display, but it is developed to become much larger, to display a life-sized head. This is achieved by developing a new mirror with a new size, shape and material. The display built for this project is a horizontal-parallax multi-view 3D display that combines one or more video projectors to generate view-dependent images on a non-stationary anisotropic screen. Depending on the viewer's position, different views of the 3D scene being displayed are generated and shown to the surrounding viewers. Via face tracking it also displays projections with correct vertical parallax for the viewers.

The display hardware as well as the software is developed to render textured solid content, animation and live streaming. This is needed to show a human head, for which the static wireframe models possible in Jones et al. [24] do not suffice. Since MCOP thus far was the only method which provided the correct perspective for all viewing angles at a given h and d, it would be used for the new display. But specific demands on the new system are set to get rid of some of the restrictions of MCOP projection. To make the display better, the mirror and the angle of its inclination are optimized for displaying a face, and the projector is positioned accordingly. This means that the assumption used in the MCOP technique no longer holds; a new rendering technique and a fast enough implementation are therefore developed. For a deeper explanation of the 3D display system see section 6. The resulting solution is a fully operational system which solves many of the issues and goals discussed in section 1.2.

3.4 Human interaction

There have been attempts to build systems which can achieve eye contact, for example by leveraging the property that eye contact sensitivity is asymmetric [8]. It is also possible to configure and train users on a system so that they can determine which gaze directions mean mutual eye contact [17]. These approaches do, however, not provide a general and straightforward solution for achieving natural eye contact, though the usefulness of such a system remains. The developed 3D teleconferencing system described in this report is intended to be a one-to-many teleconferencing system which achieves accurate reproduction of gaze and eye contact. This is done by accurately calibrating all the components of the system (display projector and camera, scanning projector and camera, and the wide-screen remote participant monitor) to be in the same virtual coordinate space. The remote participant is scanned, then modeled and textured, in a coordinate space set to be centered at head height right in front of the remote monitor. It is then displayed in the same coordinate space at the display side, making the remote monitor virtually occupy the same place as the audience. The same is done on the display side, where the unfolded path of the capturing camera is in the head displayed on the 3D display. This is done by reflecting the camera off of a polarized semi-reflective Lexan plexiglas surrounding the spinning mirror. Thus the audience will see through the glass to the displayed head, but the polarized camera sees only the reflection. Images of the setup together with conceptual diagrams are shown in figure 3.2. The effectiveness of this setup in creating a system which intrinsically solves the issues discussed above is to some extent measured in section 7.2.

3.5 Conclusion

This chapter gave an overview of the system. It explained what was built, prominent properties of the system and the motivation for implementing it.


Figure 3.2: The left column shows the audience and the display, conceptual and final construction. The right column shows the same for the scanning part of the system with the remote participant being scanned.

The next sections give a more in-depth explanation of the whole system, starting with the acquisition of data on the scanning side (section 4), then explaining how this data is processed to meet the requirements set and transmitted over the internet to the display side (section 5). The last section, 6, details the display system.

Chapter 4

Acquisition of Remote Participant

Figure 4.1: The real-time 3D scanning system showing the structured light scanning system (120Hz video projector and camera) in use, and the large 2D video feed monitor displaying the audience for the remote participant.

The scanning system is set on the side of the remote participant (RP), and is the system responsible for the acquisition of the 3D geometry seen on the display side. It also has a 2D display for the RP to see the audience watching the 3D display.


Figure 4.2: Schematic diagram of the real-time 3D scanning system, showing the con- ceptual setup of the scanning rig.

The remote participant is scanned in 3D by projecting sinusoidal patterns onto her at 120 Hz. Four different patterns are needed to recover the 3D geometry, and thus the output rate of the textured geometry is 30 Hz. This is all processed in real time and then streamed over the network to the display side of the system. The geometry is generated in the same manner as in Zhang et al. [43]. A lot of time in this project was spent on optimizing the code and buffers reading the data from the camera and processing it to generate the geometry and textures in a timely manner. This is a challenging task, since much processing has to be done in very little time, and with high accuracy, in an environment with many variables and sources of error in the generated geometry. The setup, shown in the photo in figure 4.1 and the schematic diagram in figure 4.2, is arranged so that the remote participant, while being scanned, is positioned in the same virtual space as she is displayed in on the other end of the system, in front of the audience. The monitor is likewise calibrated to show the audience in the same way, so that they appear at the same size and position as in real life on the display side of the system. For more on this see section 3.4. The data is captured using highly programmable Point Grey Grasshopper research cameras in conjunction with another of the high-speed projectors described in chapter 2.5. The images are processed in real time on an Intel quad-core machine with an Nvidia Quadro 5600 with 1.5 GB of graphics memory, one of the fastest and most capable graphics cards available at the time of construction. The processing pipeline is further detailed in section 5.
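As an illustration of the underlying principle, the standard four-step phase-shifting relation can be written as a small function. The sketch below is not the project's actual code (the function name and frame ordering are assumptions); it only shows how a wrapped phase value is recovered from four intensity samples when the projected sinusoid is shifted by 90 degrees between frames.

#include <cmath>

// Minimal sketch (not the project's actual code): recover the wrapped phase at one
// pixel from four captured intensities I0..I3, assuming the projected sinusoid is
// shifted by 90 degrees between consecutive frames:
//   I_k = A + B * cos(phi + k * pi/2),  k = 0..3
// Solving for phi gives the standard four-step phase-shifting formula.
double wrappedPhase(double I0, double I1, double I2, double I3)
{
    // atan2 returns a value in (-pi, pi]; the result still has to be
    // phase-unwrapped and triangulated against the calibrated projector
    // (as described in chapter 5) to yield actual depth.
    return std::atan2(I3 - I1, I0 - I2);
}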

4.1 Projector

This is another MULE projector, modified in the same way as the 3D display projector in section 2.5. It is rigidly mounted in a stand, made from 80/20 construction material [3], together with two Point Grey Grasshopper research cameras. These are calibrated together to provide the basis for the 3D capture of the remote participant.

4.2 Cameras

Highly programmable Point Grey Grasshopper cameras are used to capture the projected patterns at a high frame rate, 640x480 at 200 frames per second in full color. They have a FireWire 800 interface, necessary for the high capture rates used for the geometry capture. The cameras are made to be programmed and used for research. To still enable capture at such a high frame rate (120 Hz) in color, a customized shader was programmed to do real-time Bayer pattern interpolation on the graphics card.
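Conceptually, the interpolation reconstructs three color channels from the single-channel mosaic the sensor delivers. The following is a simplified CPU sketch of such a demosaicing step, written as C++ for readability; the actual implementation is a fragment shader, and the sensor's exact mosaic layout (RGGB is assumed here) and filtering may differ.

#include <cstdint>
#include <vector>

// Simplified sketch of Bayer-pattern interpolation (the project does this in a
// fragment shader on the GPU). Assumes an RGGB mosaic; each output channel is the
// average of the same-colored sensor sites in the surrounding 3x3 neighborhood.
struct RGB { float r, g, b; };

static bool isRed(int x, int y)   { return (y % 2 == 0) && (x % 2 == 0); }
static bool isBlue(int x, int y)  { return (y % 2 == 1) && (x % 2 == 1); }
static bool isGreen(int x, int y) { return !isRed(x, y) && !isBlue(x, y); }

std::vector<RGB> demosaic(const std::vector<uint8_t>& raw, int w, int h)
{
    std::vector<RGB> out(w * h);
    for (int y = 0; y < h; ++y)
        for (int x = 0; x < w; ++x) {
            float sum[3] = {0, 0, 0};
            int   cnt[3] = {0, 0, 0};
            for (int dy = -1; dy <= 1; ++dy)
                for (int dx = -1; dx <= 1; ++dx) {
                    int xx = x + dx, yy = y + dy;
                    if (xx < 0 || yy < 0 || xx >= w || yy >= h) continue;
                    int c = isRed(xx, yy) ? 0 : (isGreen(xx, yy) ? 1 : 2);
                    sum[c] += raw[yy * w + xx];
                    ++cnt[c];
                }
            out[y * w + x] = { sum[0] / cnt[0], sum[1] / cnt[1], sum[2] / cnt[2] };
        }
    return out;
}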

4.3 Conclusion

This chapter details the hardware and the workings of the scanning system used to scan the remote participant. This is the basis for the processing done to get the 3D geometry transmitted to the display side. How this data is handled is the subject of chapter 5.

Chapter 5

Transmission and Processing

This section explains the need for fast processing and transmission of the data acquired and generated in the teleconferencing system. It goes into detail on the pipeline and how it is implemented and handles the demands posed. Since there is a high demand for synchronization in the constructed system, a section on this topic follows.

5.1 Fast processing

Initial calculations showed that to transmit the captured 3D geometry and texture data for the new system while preserving high quality, large amounts of data would have to be transmitted in real time. If possible, the rate would be 30 frames per second in both directions. This does not only impose demands on the internet bandwidth required for the system, but might also introduce lag and delays in a system that is already hard pressed to run in real time in all its parts. A delay in this system makes it harder to use and not as well perceived, as it affects the conversation, as discussed in section 1.2. This would not be in line with the goal of giving the participants the feeling of a real-life face-to-face conversation. The bandwidth of the internet connection is not the only issue, especially not in the development environment, where a local gigabit network was used. To process all the captured data in real time, a quad-core computer runs parts of the processing, and it in turn uses a modern GPU to accomplish this in a timely manner. The captured frames are also Bayer-interpolated to get them in color. The frames sent from the display side are sent as monochrome Bayer patterns, since this format requires less data for the transmission. They are then interpolated to be shown in color on the remote participant side of the system. Since the scanning is already CPU intensive, this is done on the GPU, in real time.

5.2 Data transmission

The data transmission discussed above required its own implementation of networking classes. The decision was made early to have the system send all communication over the commonly available TCP/IP network, also known as the Internet. The TCP/IP code is closely integrated with the allocated buffers holding the data from the cameras and the processed data, both on the GPU and the CPU. Since there was a concern that this part would introduce lag/delay, it is implemented using the low-level Winsock functions in Windows and integrated into the rest of the code to give full control over the data transmission, as opposed to using a 3rd-party library or code. The network classes are also extended to enable connection of all parts of the system, so that central control of the four participating computers is possible. The networking code transmits the generated geometry (in the form of a depth map) to the display computer. These processes are threaded and run in parallel using the same memory buffers to achieve the highest possible speed.
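A minimal sketch of what such low-level transmission code can look like is shown below. It assumes an already connected Winsock TCP socket (WSAStartup and connect done elsewhere) and uses only standard Winsock calls; the function name and structure are illustrative and not taken from the project's actual networking classes.

#include <winsock2.h>
#include <ws2tcpip.h>
#include <cstdint>
// link with Ws2_32.lib

// Illustrative sketch of pushing one frame of data over an already connected TCP
// socket with the low-level Winsock API. Error handling is reduced to a boolean.
bool sendFrame(SOCKET s, const uint8_t* data, int size)
{
    // Disable Nagle's algorithm so small packets are not delayed.
    BOOL noDelay = TRUE;
    setsockopt(s, IPPROTO_TCP, TCP_NODELAY,
               reinterpret_cast<const char*>(&noDelay), sizeof(noDelay));

    // TCP may accept fewer bytes than requested, so loop until the
    // whole buffer has been handed to the socket.
    int sent = 0;
    while (sent < size) {
        int n = send(s, reinterpret_cast<const char*>(data) + sent, size - sent, 0);
        if (n == SOCKET_ERROR)
            return false;
        sent += n;
    }
    return true;
}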

5.3 Data transmission and processing pipeline

The data transmission and processing pipeline is explained step by step, from beginning to end, following the diagram depicted in figure 5.1. The system is based on three computers working together: the Pattern-generating computer, which generates the timing and patterns needed for the scanning; the Capture computer, which does the acquisition and processing of the data; and the Display computer, which receives and renders to the display.

1. A control program, running on the Pattern computer CPU, receives and sends timing signals to the GPU.
2. The GPU generates the time-multiplexed subframe patterns which will create the sinusoidal gray-scale patterns needed for geometry capture. This depends on the right synchronization together with the cameras.
3. The patterns are time-multiplexed and sent via DVI to the projector at 120 Hz.
4. The projector illuminates the Remote Participant (RP) with the patterns. The projector is mounted rigidly in the same stand as the camera, and they are calibrated to have known intrinsics and extrinsics.
5. The camera captures the projected frames; synchronization is in part handled by an external PIC which fine-tunes the synchronization based on the DVI signal refresh rate.
6. The images are saved in a camera buffer and read off via a FireWire 800 interface to a buffer on the Capture computer.
7. The Capture computer CPU automatically processes the pattern frame order to know which frame contains what information. It also processes light levels and finds a point in the image that is guaranteed to be on the face/object being scanned. This is an important starting point for the phase unwrapping of the sinusoidal patterns; for more on this see Zhang and Huang [43]. Before sending the data on to the GPU, the CPU also does the phase unwrapping and finds correspondences between the camera and projector.

Figure 5.1: Diagram of the data transmission and processing pipeline from 3D capture to display.

8. The Capture computer GPU does triangulation of the generated correspondences; it also does noise filtering on the output to get rid of obvious errors. Then it renders the geometry and texture from a given view to get a depth map and the associated texture. This is what we need to send to the display side of the teleconferencing system.
9. Before the data is sent over the network, the CPU half-sizes the geometry and performs some hole filling. The textures are kept in full resolution and are currently not compressed.
10. Data is sent over a TCP/IP connection to the display side via threads dedicated to quick and efficient transfer, to minimize latency and/or lag.
11. The Display computer CPU builds 3D geometry from the depth map (a minimal sketch of this back-projection follows the list) and filters out geometry outside of the calibrated 3D volume that the system can project onto. Simultaneously, textures are loaded onto the GPU via pre-allocated Frame Buffer Objects. The generated geometry is asynchronously loaded onto the GPU and, as soon as possible, passed to the rendering part of the program.
12. The Display computer uses different methods to render the given scene to the 3D display at 4320 fps; for more on this see chapter 6.3.
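The depth-map-to-geometry step in item 11 can be illustrated with a minimal back-projection sketch, assuming a simple pinhole camera model. The intrinsic parameter names and the function itself are hypothetical, and the real implementation additionally triangulates neighboring points into faces and filters spikes.

#include <vector>

// Minimal sketch (hypothetical names, pinhole model assumed): back-projecting a
// depth map into 3D vertices. fx, fy, cx, cy are calibrated camera intrinsics;
// depth values of zero mark holes.
struct Vec3 { float x, y, z; };

std::vector<Vec3> depthMapToPoints(const std::vector<float>& depth,
                                   int w, int h,
                                   float fx, float fy, float cx, float cy)
{
    std::vector<Vec3> pts;
    pts.reserve(w * h);
    for (int v = 0; v < h; ++v)
        for (int u = 0; u < w; ++u) {
            float z = depth[v * w + u];
            if (z <= 0.0f) continue;                  // hole: skip
            pts.push_back({ (u - cx) * z / fx,        // X
                            (v - cy) * z / fy,        // Y
                            z });                     // Z
        }
    return pts;
}
// Neighboring valid pixels would then be connected into triangles, and any vertex
// falling outside the calibrated display volume discarded.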

Much of this pipeline can be built for high speed, but there are still some bottlenecks. Reading the data off the first graphics card is not the most efficient part of the pipeline, since the graphics card is not optimized for this kind of operation. But the geometry needs to be processed at the same size as the texture, which gives a quality/speed trade-off. Processing of the camera-captured pattern images also takes a while, even on a fast CPU; this is a critical component since it decides the frame rate of the captured geometry, and thus of the 3D display too. Quite some time was spent on keeping the algorithms for this fairly simple and highly optimized. Once the data is transmitted over a sufficiently fast network connection and received in a timely manner by the network code, the delay and lag introduced by rendering must be minimized. This is done by preloading all the data onto the GPU before rendering, so that the GPU always has data to render from. Although this might introduce a delay in the system, it is constant and avoids lag, which would be even more detectable by an audience. This solution gives a smooth rendering at the transmitted frame rate as long as the 3D display keeps up with the speed of the other parts of the system.

5.4 Synchronization

5.4.1 Capture synchronization

The whole teleconferencing system requires very reliable synchronization between the many computers, projectors and cameras. It also requires the mirror and projector to be synchronized. This synchronization is all based on the feed given to the projector on the respective sides of the system. Since the system is set up to pack 24 subframes into one normal frame, these frames are first unpacked by the FPGA on the projector. For every subframe the projector displays, it also generates a sync-signal output. This signal is then used as the basis for triggering the cameras, so that they are always capturing the pattern projected by the MULE projector, and likewise to ensure that the mirror is always synchronized to the images being projected. The synchronization signal is output as a positive DC current pulse from the scanner projector, and read by a Peripheral Interface Controller (PIC) [2], which is programmed to control the other units connected to the system. It counts the number of subframes and signals the cameras to expose for the right number of subframes projected by the pattern-displaying projector, at exactly the right time. This gives the pattern computer the ability to time-multiplex the one-bit subframes into gray-scale frames. The cameras are then connected to the scanning computer and set up to notify the computer whenever new frames have been captured into the camera buffers. When a new frame is ready, the computer automatically, as fast as possible, reads it back from the cameras and makes it available to the program charged with generating the geometry and texture that will be transmitted to the display computer.

5.4.2 Display synchronization

The synchronization on the display side depends on the mirror being regulated by the output sync signal from the display projector. Thus the image projected only needs to be calibrated/offset on the first spin-up of the mirror. It will then know which angle it is rendering for at any given time. The mirror also has a mechanical counter connected to the computer. This is to adjust for the case when the computer, for any reason, is rendering slower than the current display speed. In this case the mirror hardware is doing what it should, but the computer might be slower at times, as it is being pushed to its absolute limits. This is corrected for by detection and then skipping ahead. Great care is taken to time the subframes right, since the system must always transmit 24 packed subframes at once to the projectors.
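The bookkeeping described above can be sketched as follows. The constants are derived from figures quoted elsewhere in this report (24 subframes per DVI frame, views spaced 1.25° apart, i.e. 288 subframes per revolution); the function names and the exact correction scheme are illustrative assumptions, not the actual display program code.

// Illustrative sketch of the display-side subframe bookkeeping (not the actual
// project code). 24 binary subframes are packed into each DVI frame, and 288
// subframes cover one mirror revolution (360 / 1.25 degrees).
constexpr int   kSubframesPerDviFrame   = 24;
constexpr int   kSubframesPerRevolution = 288;   // 360 deg / 1.25 deg
constexpr float kDegreesPerSubframe     = 1.25f;

// Mirror angle (degrees) that a given absolute subframe index is rendered for,
// after the initial spin-up calibration offset has been determined.
float mirrorAngleForSubframe(long long subframeIndex, int calibrationOffset)
{
    long long idx = (subframeIndex + calibrationOffset) % kSubframesPerRevolution;
    return idx * kDegreesPerSubframe;
}

// If the renderer falls behind by `droppedDviFrames` (detected via the flywheel
// counter), it must skip ahead by whole 24-subframe packages to stay in sync.
long long resyncSubframeIndex(long long subframeIndex, int droppedDviFrames)
{
    return subframeIndex
         + static_cast<long long>(droppedDviFrames) * kSubframesPerDviFrame;
}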

5.5 Conclusion

This chapter detailed how data flows through the system, and how this is achieved in a fast and timely manner through network code and hardware synchronization. The final result of this pipeline is passed to the 3D display program for rendering on the 3D display. This is the topic of the coming chapter (6).

Chapter 6

3D Display system

Figure 6.1: Photo of the display in action, with multiple viewers interacting in real time.

This chapter of the report gives the reader a deeper understanding of the hardware and software built for the 3D display. Some of the hardware is a modified version of the system described in chapter 2.5 and some is new, specialized, and constitutes an integral part of the system. The hardware provides solutions necessary for the whole concept to work, and sets the stage for all the software developed. First the physical construction and hardware used are explained. Thereafter the mathematics and theoretical aspects of rendering to this kind of moving display are detailed in section 6.2. The next part is an explanation of the software which renders to the display, in section 6.3.

6.1 Hardware

Figure 6.2: The 3D display construction showing the two-sided display surface, high-speed video projector, front beam splitter, and 2D video feed camera. Crossed polarizers prevent the video feed camera from seeing past the beam splitter.

The hardware for the 3D display has many similarities with Cossairt 07 [12], in that both systems use a single high-speed DLP projector to project patterns onto a spinning anisotropic surface. In contrast, the system developed in this project uses a non-proprietary system architecture and correctly addresses the problem of rendering 3D scenes with both correct horizontal and vertical perspective on this type of display. The perspective-correct projection technique for arbitrary mirror shapes and projector positions is a central focus and contribution of this work. The 3D display works by projecting aimed views of a 3D scene onto a rapidly spinning anisotropic mirror, revolving at 900 rpm. The mirror and projector are synchronized and calibrated so that the system can project a different view to every viewer around the display. The views are displayed at an angular horizontal spacing of 1.25 degrees, which gives the solution for the horizontal parallax. The vertical parallax problem is solved by having a face-tracking camera detect viewers around the display and then adjusting accordingly. A diagram of this setup is shown in figure 6.2. A fisheye camera observes the audience, reflected off the polarized glass surface. This allows the camera's unfolded position to be in the 3D display, watching the audience at the same time as the audience sees through the polarized film to the 3D display. The display setup was built from 80/20 material [3], which is a very modular and adaptable way of constructing a sturdy, stable and durable rig.
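As a sanity check, the projection and refresh rates quoted in this report follow from these numbers (assuming one binary subframe per 1.25° view and counting both sides of the tent-shaped mirror described in section 6.1.2):

    900 rpm = 15 revolutions per second
    360° / 1.25° = 288 subframes per revolution
    288 × 15 = 4320 projected subframes per second
    2 mirror sides × 15 revolutions per second = 30 Hz refresh for each viewpoint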

This chapter will go in depth into the solutions making it possible to project onto this kind of mirror, and how this was achieved at a very high frame rate.

6.1.1 Projector

To project onto the spinning mirror in the chosen setup, there is a need to project at very high speeds. This could possibly be solved by using multiple projectors, but that would introduce problems such as synchronization and overlapping. Another solution is to use a high-speed projector; then the problems are limited to bottlenecks in data transfer and frame rate. Some high-speed projectors use proprietary PCI data transfer boards [10] [39]. Typically such boards generate voxel geometry which is rasterized on the display itself. The voxel transfer is relatively slow. In order to achieve interactive rates, the DepthCube display [39] limits the field of view and transfers only the front surface of the scene volume. This project uses the projector setup described in chapter 2.5, with some modifications. The projector is placed on top of the display but aimed at and reflected off a first-surface mirror so that the unfolded position of the projector is above the heads of the audience, as shown in figure 6.2. This unfolded position is optimal for reflecting as much light as possible to the audience, with the mirror having a low inclination and the projected light coming from above. The projected area is also extended to use the full resolution of the projector, 1024x768.

6.1.2 Mirrors

Spinning mirror

The concept for this kind of display is based on having a spinning mirror which can display one image for every angle around the display, by using the properties of an anisotropic surface. By projecting images onto this mirror, the reflected image will hit different viewers around the display and enable the 360-degree field of view, and since this is done with an anisotropic diffuser mirror, which only reflects incoming light in one direction horizontally, something similar to a one-to-one projection (instead of a one-to-many projection) is achieved from the projector. The incoming direction of projection towards the mirror should preferably lie in a plane perpendicular to the direction of anisotropy, as the light would then be reflected within a flat vertical plane; otherwise the reflection forms a bent conical surface with its origin at the point of intersection with the mirror, see figure 6.3. As shown in earlier work [24], it is possible to impose constraints and then assume that the approximation of a flat plane is good enough, and thus significantly simplify the problem of finding the correct projection perspectives onto the moving surface. One goal for this project was to investigate the effects of more general mirror shapes, and to decouple the position of the projector with respect to the mirror. As a consequence the developed projection system must move away from these restrictions and become more general. This will enable arbitrary projector positions and mirror shapes, but will mean real-time calculation and rendering of quadric intersections (explained in section 6.2), thus introducing a speed concern which needed an efficient solution.

Figure 6.3: The photograph shows a point laser hitting an anisotropic mirror surface, creating a conical reflection.

Mirror shapes

The first mirror built is flat, and thus the intersection calculations and light reflectance are fairly straightforward. Two more mirrors were built, one concave and one convex. The theory is that, when projected on, the convex mirror spreads the light wider apart across the viewers, giving more time to project to each viewer and thus more light and a brighter image. The concave mirror focuses the light at a calculated distance from the display, giving more control over which viewer will see the projection at any given time, even giving control over which eye the system is projecting to. Images of the mirrors and a conceptual sketch of size and form are included as figure 6.4.

Finding a suitable mirror material

The mirror used in Jones et al. [24] was a special anisotropic diffuser surface, consisting of very small, parallel, plastic half-cylinders. This kind of surface is complicated to make and expensive. A cheaper, more common alternative is used in the current display: brushed aluminum.


Figure 6.4: (a) 3D face shown intersecting the designed display surface. (b) Convex, flat, and concave display surfaces made from brushed aluminum sheet metal.

Whilst searching for a suitable replacement material, the brushed metal interior of the department elevator seemed to have the correct anisotropic properties. After investigating laser pointer reflections off of every refrigerator, sink, microwave and all other brushed metals found in the vicinity and local hardware stores, it was found that brushed aluminum had the brightest specular highlight and in general reflected the most light with the right anisotropic properties. Another big advantage of the aluminum is that this material can be bent and shaped, unlike the previously used material. Thus big sheets of brushed aluminum were bought, and a shear, roll and break tool procured. All mirrors are built by hand in the lab. Plastic frames are modeled in Maya and sent to be laser cut before mounting. The resulting double-sided mirrors fit onto the spinning flywheel that mounts on the display system's motor. All display surfaces have the same 15°-from-vertical, double-sided design and a surface 20 cm wide by 25 cm high. They are double-sided in a tent shape, so that at every rotation of the construct two frames will be reflected, effectively doubling the frame rate seen by the audience. The inclination was chosen to match the shape of a human face as well as possible. For a mathematical explanation of why the brushed aluminum has anisotropic properties see chapter 6.2.

6.2 Theory

As suggested in chapter 3, a new projection method was needed due to three important restrictions of the old MCOP technique: fixed projector position, mirror angle and mirror shape. In addition to this, textures with dithering and streamed interactive data need to be handled; the solution for this is explained in chapter 6.3.

6.2.1 Algebraic problem

Looking at the problem from a physical point of view and setting up an algebraic model for the rendering shows that the function that needs to be evaluated for each pixel is in six variables. What is needed is a function from the x, y and z coordinates of the 3D points to be projected, here called (Qx, Qy, Qz), to the corresponding projector pixel, (pu, pv). This is dependent upon the mirror rotation, i.e. the angle at the moment of projection, θ. It is dependent on the height, h, of the viewer, since the reflection off of the mirror is vertically diffused; the projection must be different if the viewer, for example, is looking straight on or from above. The same holds for the distance, as the light will spread more the further away the viewer is from the mirror. The function sought is therefore of the form:

(Qx, Qy, Qz, θ, Vh, Vd) → (pu, pv)   (6.1)

The solution to this will find the best point on the mirror to project onto for a certain viewer circle V given by (Vh, Vd), but there can be many viewers around the display at different azimuths Vψ, see figure 6.5. The sought projector pixel (pu, pv), as a function of the above parameters, will be given by tracing a ray from a given view point (Vh, Vd, Vψ) on the viewer circle V, but at any one moment in time the projector is projecting to many Vψ simultaneously. This problem was solved in Jones et al. 2007 [24], but in this setup the reflections off the mirror will be conical in nature before intersecting the viewing circle. In addition, the mirror surface can be of arbitrary shape, so the same assumptions that allowed for their analytical solution are no longer valid. This is because the brushed aluminum surface reflects light in accordance with a cylindrical micro-facet surface model of anisotropic reflection [35, 25]. Incoming light is reflected to all angles perpendicular to the cylinders (brushed curvatures) in the material. The outgoing light forms a cone, with the apex of the cone having the same angle as the angle of incidence. At the apex there is also a slight specular highlight, where more of the light is reflected. This is taken into consideration later so that as much light as possible reaches the viewers. The intersection calculation between the conical reflection and the circle leads to a quartic equation, since both are quadrics, and thus has up to four potential solutions. In this system, which is already limited to less than 2000 faces, it is not practical to implement a real-time quartic equation solver.

6.2.2 Pre-calculated Look Up Table

The above-described projection calculations were necessary to calculate the correct perspective projections for all the viewers around the display, but were not possible to do in real time. Since the function (Qx, Qy, Qz, θ, Vh, Vd) → (pu, pv) is smooth over all dimensions, it should however be possible to calculate samples which create a reasonable approximation over all dimensions. Further, this discrete representation can be interpolated to get an approximation of the smooth function, or even better: fitted to low-degree functions, easily evaluated at runtime.

Figure 6.5: A reflected ray of light off of the mirror, as seen from above. The cone reflected from the surface intersects the viewing circle. The angle of the conical reflection's apex is equal to the incident angle. At the apex there is also a slight specular highlight.

Thus, a lookup table would grow in the amount of data over six dimensions, but would not have to be very densely sampled. With the modern graphics cards at hand, memory is a good trade-off for speed. The lookup table (LUT) has to cover the entire projection space of the spinning mirror, which was measured to be fully contained within a 30 cm cube for all mirrors built. In this box a 3D grid of points, Q's, is evaluated. Since the sample density in the volume grows with the third power of the resolution, it was chosen to be as sparse as possible. The viewing positions are in a cylindrical coordinate system, evaluated at every 1.25° around the display's 360° field of view, for a mathematical representation of the double-sided mirrors. It is evaluated for typical viewing distances for the audience, at depths d of 0.5 m to 2.0 m in increments of 50 cm and heights of -50 cm to +10 cm in increments of 10 cm, all relative to the position of the mirror. To find the best possible projection for every 3D point Q, all the points are evaluated for every Vh, Vd and every mirror angle θ. Some of the cones reflected from the mirror might not be seen from any possible viewer, and some might be seen from more than one mirror position, depending on the mirror shape. Since the viewer positions are discrete, there is no guarantee of an exact solution (i.e. that the reflected cone hits exactly that viewer position). The ray generating the cone with the most light (the apex with its specular highlight) which is closest to some viewing angle Vψ is chosen to be the best (pu, pv) for that (Qx, Qy, Qz, θ, Vh, Vd). The calculations are well suited for parallel processing, so a GPU-accelerated numerical search was implemented to generate the lookup table. This is done by evaluating a 2D slice of the 3D space of Q, for example (Qx, Qy), and then iterating over the third dimension Qz. This is significantly faster than a CPU implementation.

6.2.3 LUT generation on the GPU

First, the mathematical model of the mirror is used to generate a 3D mesh of triangles, with normal and tangent information at each vertex, and also which (pu, pv) projects to this vertex from the projector. For the discretized possible viewer positions along the viewing circle V, given by:

V′ = (Vh, Vd, Vψ)   (6.2)

a vertex shader then projects the vertices onto the current 2D plane in the lookup table being evaluated, i.e. with the frustum set to the corners of this plane. The distance from the view point to the vertices is computed and saved as pd. The projected mirror surface now contains all the information necessary to compute how well the incident and reflected rays match at each point. The image is rasterized and discretized in (Qx, Qy); these become the cells in the lookup table. The evaluation of how big the difference, ε, is between the traced ray and the reflected cone is efficiently done in a fragment shader, which calculates:

ε = |I · a − L · a|   (6.3)

yielding the absolute value of the difference between the components of the two rays along the direction of anisotropy, where I is the direction toward the viewer position and L the direction toward the projector. a is the direction of anisotropy, which determines where the cone is reflected; it is static on the mirror surface but changes with the mirror's rotation, i.e. information already in the 3D mesh. This calculation finds the closest intersection of the reflected cone and the viewer circle. This is iterated for each of the possible samples of Vψ over the viewing circle V, and most positions will yield a value for the difference ε at every point (Qx, Qy). The ε values are stored in the alpha channel of the frame buffer and overwritten whenever a value is found to be less than the previously stored value, and the corresponding values (pu, pv, pd) are written to the RGB channels of the frame buffer. When all the iterations are done, the best values found for the current 2D section of the LUT are read back from the GPU and stored. This whole process is then iterated for every sought mirror angle θ, and has become a search over six variables, but in the end only with a simple mathematical evaluation over the 1D space of viewer positions along the viewer circle. The distance, pd, is later used in the rendering to enable correct occlusion via depth buffering. It is worth especially noting some facts about this method: First, the output is given as the sought mapping (Qx, Qy, Qz, θ, Vh, Vd) → (pu, pv), with the addition of extra information, pd.

Secondly, since the geometry given as input is an arbitrary scene, modeled or mathematically generated, there are no restrictions on the shape of the mirror.

This technique is further evaluated in section 7, including a visualization of the LUT, and the results for different mirror shapes.
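To make the search described in section 6.2.3 concrete, the sketch below shows, in plain C++ and with illustrative names, the inner loop performed for one lookup-table cell: every discretized viewer azimuth is tested with equation (6.3), and the projector pixel and distance of the best candidate are kept. The actual implementation evaluates this per fragment on the GPU and stores the running minimum in the alpha channel.

#include <cmath>

// Illustrative sketch of the per-cell candidate search used for LUT generation.
struct Vec3 { float x, y, z; };

static float dot(const Vec3& a, const Vec3& b) { return a.x*b.x + a.y*b.y + a.z*b.z; }

struct Candidate {
    Vec3  towardProjector;   // L: direction from the mirror point toward the projector
    Vec3  anisotropyDir;     // a: brush direction at this point for the current mirror angle
    float pu, pv, pd;        // projector pixel and distance stored if this candidate wins
};

// bestError is carried across all candidates belonging to the same (Qx,Qy) cell,
// mirroring how the alpha channel of the frame buffer is reused on the GPU.
void evaluateCandidate(const Candidate& c, const Vec3* viewerDirs, int numViewers,
                       float& bestError, float out[3])
{
    for (int i = 0; i < numViewers; ++i) {
        // Equation (6.3): how far the reflected cone misses this viewer direction.
        float eps = std::fabs(dot(viewerDirs[i], c.anisotropyDir) -
                              dot(c.towardProjector, c.anisotropyDir));
        if (eps < bestError) {
            bestError = eps;
            out[0] = c.pu; out[1] = c.pv; out[2] = c.pd;
        }
    }
}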

6.2.4 LUT function fitting

For the mirror chosen in the project, the lookup table data is well behaved and can be well approximated with a second-degree polynomial in three variables. This is done as an optimization for the rendering, since evaluating this polynomial is faster on graphics hardware than a three-dimensional texture lookup. The function:

a·Qx² + b·Qx·Qy + c·Qx·Qz + d·Qy² + e·Qy·Qz + f·Qz² + g·Qx + h·Qy + i·Qz + j → pu   (6.4)

is fitted to the spatial data (Qx, Qy, Qz) in a least-squares solution. The same is done for pv. Given (θ, Vh, Vd), the shader can now quickly calculate (pu, pv) from the coefficients a to j, for all possible Q. The fitting is done in a Matlab script, which also provides functionality to plot and visually verify the fit. Visualizations of this are shown in section 7.1 and excerpts from the code can be viewed in Appendix A.3.
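Evaluating the fitted polynomial at runtime then amounts to ten multiply-adds per output. A minimal C++ sketch of equation (6.4) is shown below; the project evaluates the same expression in the Cg vertex shader, and the function name and coefficient layout here are illustrative.

// Evaluate the fitted second-degree polynomial (equation 6.4) for one output,
// e.g. pu. The ten coefficients c[0..9] correspond to a..j and are assumed to
// have been fitted off-line for the current (theta, Vh, Vd); pv uses its own set.
float evalFittedLUT(const float c[10], float Qx, float Qy, float Qz)
{
    return c[0]*Qx*Qx + c[1]*Qx*Qy + c[2]*Qx*Qz +
           c[3]*Qy*Qy + c[4]*Qy*Qz + c[5]*Qz*Qz +
           c[6]*Qx    + c[7]*Qy    + c[8]*Qz    + c[9];
}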

6.3 Software

The program which renders the 3D scene to the display is maybe the biggest part of this contribution and is detailed in this section. The program was rewritten and optimized in several steps, as every new piece of functionality placed high demands on speed and efficiency. The program was written in C++, uses the OpenGL graphics libraries, and the shaders are written in Nvidia's Cg language. This section assumes in part that the reader has some knowledge of coding and techniques for rendering graphics on computers. The program was written especially for the high-end hardware at hand, dual Nvidia Quadro 5600 cards in SLI mode with 1.5 GB of memory each. The final program outputs 4320 frames per second of rendered output, implementing the following functionality:

• rendering of a standardized scene in .obj format or geometry generated from depth maps
• textures

• real-time ordered dithering with a 4 by 4 Bayer matrix
• lighting shader

• several different modes of rendering, including wireframe, textured and dithered
• several different modes of projection mathematics, including LUT, MCOP, and combinations of horizontally or vertically correct perspective development modes
• streamed, live data modes

• video: save and playback
• fast data transmission/reception for live video via OpenGL Buffer Objects
• network connectivity and synchronization

• keyboard and Wii-mote control interface

As seen from the list above, the rendering program handles more than just rendering the scene. It has network capabilities, enabling it to receive streamed display data, and also acts as the control center for the pattern computer, relaying commands via the keyboard and Wii-mote. For the 3D display program to be able to project anything with any certainty, the angular velocity of the mirror must be known or constant and the angle of the mirror determined. The setup which provides this is described in section 2.5. However, since the graphics cards are constantly being pushed to the max, they are always on the verge of their capacity and risk dropping a frame (not finishing rendering it in time). If a frame is dropped, the mirror angle and the display program will be out of sync. This is detected by the interrupt signal from a sensor in the flywheel (shown in figure 3.1). To correct for a dropped frame, special care has to be taken, since the program is actually rendering 24 subframes into one normal frame. This means that if a frame is dropped, the system is actually 24 subframes behind. Similarly, when adjusting the initial timing to render for the right mirror angle, the shift is on a subframe level, but frames can only be sent to the projector as a package of 24.

6.3.1 Advancing the possibilities

The program was developed one piece of functionality at a time, with testing and completion of every part before the next was commenced. In many cases the different steps led to results which guided and informed the further development of the program. The diagram shown in figure 6.6 is a schematic description of the structure of the 3D display program source code. This section goes into some detail about how it was built, in what order, and explains the most important solutions.

Orthogonal projection

The first step in advancing the display program was to implement a normal orthogonal projection mode. This mode could display static content on the display at the rendered rate of 3000 faces at the needed 4320 frames per second. As discussed in section 6.2, the orthographic projection will not work well for different angles around this kind of display, but directly in front of the display it produces a visible, first-stage result.

Figure 6.6: Diagram of the main structure of the 3D display program.

This mode was used to test and develop the rendering to include textured content at the same rendering rate in real time. The textures can only be displayed in binary; this conversion is done by a process called dithering. A real-time ordered dither was implemented in a fragment shader to do this, see section 6.3.2.

Modified orthogonal projection

An experiment was made where the orthogonal projection was modified, by visual judgement, to approximate the rendering needed for the flat mirror, since the flat mirror bends the light for different angles around the display as a somewhat well-behaved function of the display angle. The resulting projection looked fairly decent, but it became even more evident that an implementation of the projection math described in section 6.2 was imperative.

LUT projection

The next step was to implement the correct projection mode for the new projector position and mirror shapes. This was implemented as a lookup table, LUT, as motivated and explained in section 6.2, which also explains how the LUT was generated off-line.

The 3D display program loads a lookup table generated for the currently mounted mirror, and uses it to achieve the correct projection. Since the LUTs used are discrete samples of well-behaved functions in six dimensions, they are sparsely sampled, and the entire data structure occupies less than 40 MB. This is loaded onto the graphics card, formatted as one 3D texture for each mirror angle, viewer height and distance (θ, Vh, Vd). Each of these 3D textures then contains the samples for the three dimensions (Qx, Qy, Qz), and each pixel in the texture contains the data for the mapping (pu, pv, pd) in the single-precision floating-point RGB color channels. There are some important advantages to this solution. First, the graphics card has functions for fast texture lookups, i.e. in this case for fast LUT lookups. Second, the graphics card used supports automatic interpolation of values within the 3D texture in three dimensions, making it possible to easily render for any point in the texture, i.e. any (Qx, Qy, Qz). Texture lookups from 3D textures are not as fast as the more commonly used 2D texture lookups, but perform very well in this case. The mirror's angular velocity is constant and synchronized (see section 2.5), so after startup and an initial manual calibration the mirror angle is always known. Since the viewer might be positioned at any distance and height around the display, the position is measured by face tracking, and the closest four discrete values of this height and depth are linearly interpolated in these two dimensions. This gives the final (pu, pv, pd) to be lit by the projector. This was all implemented in a vertex shader for this projection mode, shown in appendix A.1. There is also an appendix included with the next advancement of this method, the LUT fitting, which gives a further speed improvement. There is sample code for how to convert the LUT to a fitted function (appendix A.3) and then modify the shader to use the fitted LUT (appendix A.2).

MCOP projection

The MCOP (Multiple Center Of Projection) projection mode was used for reference and as a starting point. Its development and workings are explained in section 2.5.

Streaming mode

In the next step, the capability to display live, updated data is implemented. To be able to stream data from the capture side of the system, fast transport is needed, but once the data is received it also needs to be loaded onto the GPU and used in the rendering with as little latency as possible. The incoming data is in the form of a texture and a depth map. The incoming texture is loaded directly to the graphics card from system memory, a common operation that is fast by the design of the graphics card. The depth map is turned into geometry by treating the 3D point cloud given by the depth values as vertices and connecting nearby vertices with edges in a triangular fashion. The geometry generation also takes into account that the surface (a face) is assumed to be somewhat smooth, and removes errors and spikes. This yields a list of vertices, face indices and texture coordinates. For fast rendering, these are loaded into pre-allocated memory on the graphics card. This was done by pre-allocating Vertex Buffer Objects (VBOs) for the vertices and texture coordinates, and an Element Array Buffer Object (EABO) for the indices. Buffer Objects are a newly introduced, powerful feature of OpenGL, supported by modern hardware and drivers. They allow direct, managed access from the CPU code to high-performance memory on the graphics card. Data can be encapsulated in a Buffer Object and handled without having to read back or overwrite it, increasing the rate of data transfers. They are especially optimized for data pointed to by client-side functions and for arrays of indices used to draw sets of elements, turning these client-side arrays into server-side data. Once initiated, these can be bound in the same manner as display lists or textures for manipulation and/or rendering. The memory mapped via Buffer Objects is seamlessly accessible to the CPU code, in this case the C++ code. Behind the mapping is, however, a complex memory management system optimized for this usage and closely integrated with the system's graphics drivers. There are also option flags used in the initialization code to tell the drivers to optimize in specific ways, in this case for fast streaming of live data, write and modify only. Another important optimization is that the data transfer in this case is done asynchronously, saving precious time, since the CPU need not wait for the GPU functions to complete before continuing processing. Appendix ?? demonstrates some of the code used for these optimizations. Besides creating the capability to display live streamed data, the possibility to save a sequence and later play it back, or do an off-line rendering for debugging purposes, was implemented. This also functions as video playback and for computer-animated sequences.
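The streaming pattern described above can be sketched with standard OpenGL buffer-object calls as below. This is not the appendix code; the function names are illustrative, an OpenGL loader such as GLEW is assumed, and buffer orphaning is shown as one common way to realize the asynchronous, write-only usage described in the text.

#include <GL/glew.h>
#include <cstring>

// Allocate a vertex buffer object once with a streaming usage hint.
GLuint createStreamingVBO(size_t maxBytes)
{
    GLuint vbo = 0;
    glGenBuffers(1, &vbo);
    glBindBuffer(GL_ARRAY_BUFFER, vbo);
    // NULL data: reserve storage only; GL_STREAM_DRAW hints that the contents
    // will be respecified every frame and used only a few times.
    glBufferData(GL_ARRAY_BUFFER, maxBytes, nullptr, GL_STREAM_DRAW);
    glBindBuffer(GL_ARRAY_BUFFER, 0);
    return vbo;
}

// Map the buffer and overwrite it with the newly received vertex data.
void uploadVertices(GLuint vbo, const float* vertices, size_t bytes)
{
    glBindBuffer(GL_ARRAY_BUFFER, vbo);
    // Orphan the old storage so the driver need not wait for in-flight draws,
    // then map and write the new data (write-only, as in the text above).
    glBufferData(GL_ARRAY_BUFFER, bytes, nullptr, GL_STREAM_DRAW);
    void* dst = glMapBuffer(GL_ARRAY_BUFFER, GL_WRITE_ONLY);
    if (dst) {
        std::memcpy(dst, vertices, bytes);
        glUnmapBuffer(GL_ARRAY_BUFFER);
    }
    glBindBuffer(GL_ARRAY_BUFFER, 0);
}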

Wireframe rendering

All rendering is by default done in solid mode, rendering the scene, but there is an option to draw in wireframe mode. This was used during development to make artifacts or problems easier to spot. The wireframe mode does two rendering passes: one rendering of the geometry all in black, and then one rendering of the wireframe, i.e. the edges between the vertices of the object's mesh. If the first pass is skipped, there will be no occlusion and the wireframe will be transparent.

Handling occlusion

To correctly handle occlusion through the depth buffer, another rendering pass is necessary: first one pass establishing the depth and occlusion, and then another to render the color. This two-pass approach resembles the depth pre-pass used as the first stage of deferred shading.

6.3.2 Dithering

To convert the gray-scale images sent to the display program from the capture computer, these images are dithered. Dithering is a way of converting floating-point or discretized images to binary images; it is also called halftoning or color reduction. This process heavily affects the quality of the resulting image and can be done in many different ways. In the case of the display program, the ordered dither is done to be able to pack data into subframes to achieve the fast frame rates required, but dithering is for example also commonly used when printing newspapers.


Figure 6.7: Comparison of different dithering techniques. (a) The original image. (b) Thresholding/average dithering. (c) Random dithering. (d) Ordered dithering, used on the display. (e) Dithering with Ostromoukhov's algorithm.

In this case we are dithering to reduce the color resolution (not the spatial resolution), to be able to show the rendered scene with 1 bit per pixel, i.e. each pixel can only be on or off. One simple way of doing this would be to threshold the value of each pixel against some value, for example 50% of the maximum pixel value. This method is called thresholding, or average dithering when using the average pixel value as the threshold, and yields a result which loses a lot of detail and creates a lot of contours, see figure 6.7 (b). Attempts were made to use the dithering algorithm introduced in Ostromoukhov (2001) [33] (figure 6.7 (e)), as it produces a very good image and is reasonably fast, comparable to the widely used and popular Floyd-Steinberg algorithm [15]. These algorithms give a more accurate result as they implement error diffusion, which takes the errors from the dithering into account and tries to minimize them. While Ostromoukhov's algorithm was implemented, it was found to be too slow to run at the high speeds required. The algorithm found to produce a good result at the required speed was ordered dithering. Ordered dithering uses an array of optimized threshold values to get 'gray levels' in small regions. The fragment shader used to do the dithering in real time in the display program uses a 4 by 4 matrix, which produces 16 different perceived gray levels. The small regions created are however visible in the result as cross-hatch pattern artifacts, see figure 6.7 (d). These artifacts are the major drawback of the technique, which is otherwise very fast and powerful.
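A sketch of the ordered dithering step is shown below, written as plain C++ for readability (the display program performs the equivalent comparison per fragment in a shader), using the classic 4 by 4 Bayer threshold matrix; the function name is illustrative.

#include <cstdint>
#include <vector>

// Ordered dithering with a 4x4 Bayer matrix. Each pixel's gray value in [0,1] is
// compared against a position-dependent threshold, producing a 1-bit image whose
// local on/off density approximates the original gray level (16 perceived levels).
static const float kBayer4x4[4][4] = {
    {  0,  8,  2, 10 },
    { 12,  4, 14,  6 },
    {  3, 11,  1,  9 },
    { 15,  7, 13,  5 },
};

std::vector<uint8_t> orderedDither(const std::vector<float>& gray, int w, int h)
{
    std::vector<uint8_t> out(w * h);
    for (int y = 0; y < h; ++y)
        for (int x = 0; x < w; ++x) {
            float threshold = (kBayer4x4[y % 4][x % 4] + 0.5f) / 16.0f;
            out[y * w + x] = gray[y * w + x] > threshold ? 1 : 0;
        }
    return out;
}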

6.4 Conclusion

This chapter detailed the hardware built for the 3D display and the requirements and limits this sets for the system. It also explained the software created on these premises. With this system built and tested in many steps, the resulting research and system are analyzed and discussed in the coming chapter.

Chapter 7

Results

This chapter analyzes the result of this master-thesis project. The resulting system is compared to the goals and purpose set, but also to the closest similar 3D display system and the previous version of the 3D display.

7.1 Analysis of the result

7.1.1 Acquisition

The scanning side of the system performs well at 30 frames per second of captured and processed geometry, sent on to the display side. There are still some spikes and holes in the geometry, even though hole filling and smoothing filtering are done. This is due to the many variables in this part of the system and the relative roughness of the geometry. Hair, for example, cannot be captured, but will instead generate artifacts. Many of the errors are also due to motion. With time and more development this could probably be solved to some extent, but it is made difficult by the limited time available for processing.

7.1.2 Transmission and Processing

This part of the system is what connects the display and acquisition sides, and it does so while fulfilling the demands posed on it. The data, textures and a depth map, is transmitted over regular TCP/IP (the Internet). Currently the data is uncompressed and requires approximately 10.5 MB/s for all of the data transmitted. This would be significantly less with compression, since most of the data consists of images. The data transmitted from the display side is only the video stream from the display-side camera. With time, the code could be extended to compress these data feeds, but since the network in the current setup of the system was fast enough, this is regarded as a non-essential goal at this time.

                         | 3D Teleconferencing system | Jones07 [24]                  | Cossairt et al. 2007 [11]
Visual FPS               | 30                         | 15-20 (30-40 color)           | 30
Projected FPS            | 4320                       | 4320-4800                     | 6000
Resolution               | 1024x768                   | 768x768                       | 768x768
Occlusion capable        | Yes                        | Yes                           | Yes
Interactive content      | Yes                        | Yes                           | No
Streamed content         | Yes                        | No                            | No
Textured content         | Yes                        | No                            | No
Light fields             | Yes                        | Yes                           | No
Color depth              | dithered B&W               | dithered B&W or 2-color       | dithered RGB
Image diameter           | 20-25 cm                   | 13 cm                         | 25 cm
Angular resolution       | 1.25°                      | 1.25°                         | 0.91°
Horizontal field of view | 180°                       | 360°                          | 180°
Horizontal parallax      | Yes                        | Yes                           | Yes
Vertical parallax        | Yes                        | Yes, for 1 viewer w/ tracking | No
Electronic interface     | DVI                        | DVI                           | SCSI-3 Ultra
Mirror shapes            | Arbitrary                  | Flat, 45° angle               | Flat
Projection technique     | 6D LUT based MCOP¹         | MCOP¹                         | single-view perspective

Table 7.1: Performance comparison between the developed 3D display, Cossairt et al. [11] and Jones et al. [24]

7.1.3 Display

The 3D display system is fully functional and has some very good properties. There are advantages and disadvantages of the developed system compared to the systems developed in Cossairt et al. [11] and Jones et al. [24]; these differing properties are summarized in table 7.1. The 3D display is able to render perspective-correct images with quadric rendering equations at 4320 frames per second, for scenes consisting of less than 2000 faces, by pre-calculating a 6-dimensional look-up table (LUT) for a given mirror geometry and projector position. This was the only solution found to be fast enough to render the needed scenes. The rendered images are also textured and dithered with ordered dithering. The size of the LUTs used depends on the spatial resolution assumed for the viewers. It was found that LUTs with good enough resolution easily fit on the graphics cards, two Nvidia Quadro 5600 in SLI mode. The frame rate seen on the 3D display by the user is analogous to the most common frame rates of today's video and TV, 30 frames per second, for all angles around the 3D display. Since the mirror is constantly moving, the image seen is slightly blurred. This actually helps the displayed dithered images by blending the pixels for the simulated gray levels. The positions of the faces of the audience are tracked, so that the display can render the correct vertical perspective of the face to each viewer. It is also occlusion capable, so that it can depict solid objects, as opposed to many other volumetric 3D displays, whose omnidirectionally diffusing voxels shine through each other. With the new mirrors built, the projected light is spread in different ways, enabling fine tuning of chosen aspects and properties of the display. This is a great advantage, and it was very informative for how to move on with the development of the display during the development process.

¹ MCOP, multiple centers of projection: the same image is rendered for many different viewer positions; explained in section 2.5.

Figure 7.1: The image shows the three different mirrors built and used on the display. These were all designed to have different properties when used, which are discussed in this section. Displayed are, from the left: concave, flat and convex.

The flat mirror, in the middle in figure 7.1, projects a whole frame to every viewer. The viewer will perceive the whole frame at once, with both eyes, and the frame will sweep somewhat over the viewer's face. The view displayed will correctly change as the viewer moves around the display, giving the impression of a 3D scene. The diverging beam from the projector continues to spread horizontally after reflecting off the flat display surface, so that approximately a 20° wedge of the audience area can see part of what is projected at the same time, see figure 7.2(a). The flat mirror is the simplest to build and calibrate.

The convex mirror built, to the right in figure 7.1, spreads the light out in a wider wedge, giving a longer time of visibility for each user, and thus more light and a brighter image. It is in the shape of a 40° cylindrical arc, curving ±20° over its 20 cm of width. The convex curve spreads reflected light over 100° of the audience. The benefit of this mirror shape is that the line of reflected light traces over the audience more slowly compared to the flat mirror. To see why, consider the limiting case where the surface forms a single complete cylinder: the specular reflection would then not move at all. Thus, the convex mirror yields higher angular resolution, enabling higher-quality 3D imagery. There are, however, some disadvantages to this shape. Due to the large angular divergence of the mirror, many projector rays reflect to the far side of the display, creating a viewable image from 360 degrees around the display. Unfortunately, more than 180 degrees cannot be scanned, so these extra views remain unused. At the same time, many forward-facing directions cannot be reached by any mirror angle. The result is a smaller usable volume relative to the mirror size. The missing sample squares in row 3 of figure 7.3 indicate the regions of the curved plane where projection is not possible. The convex approach would potentially be very useful if the mirror were large compared to the light intensity of the projector, or if the light conditions in the room could not be controlled. As for the conditions in the test environment, with a relatively small mirror and controllable light conditions, this setup has more than enough light, and neutral density filters were used to reduce the light intensity.


Figure 7.2: (a) Light diverging from a flat anisotropic display surface can illuminate multiple viewers simultaneously, requiring a single projector frame to accommodate multiple viewer heights. (b) Light reflected by a concave display surface typically projects imagery to at most one viewer at a time, simplifying the process of rendering correct vertical parallax to each tracked viewer.

The concave mirror, to the left in figure 7.1, concentrates the light reflected off of it at the focus of its parabolic shape. The mirror built has a focus point at a distance of approximately one meter from the center of rotation, shown in figure 7.2(b). This enables projection of a single frame for each eye, optimized at the distance of the focus point. With this mirror the display has an even better autostereoscopic 3D capability, which gives a better experience for the users. The focal surface seen by any given viewer is composed of the multiple mirror slices that are illuminated as the mirror spins. For the flat mirror, the focal surface is a cone centered around the mirror's axis. For the other mirrors, the focal surfaces are not so well behaved: they are asymmetric and change based on viewing angle.

The convex mirror produces a concave focal plane and the concave mirror produces a convex focal plane. In this case this is an advantage for the concave mirror, since the human face is better represented by a convex display surface, matching its shape. When the shape of the focal surface approximates the object being displayed, accommodation cues are more accurate and aliasing [45] is minimized.


Figure 7.3: A grid showing how a plane will be warped by the LUT projection technique so that it reflects off of the mirror with correct perspective. Three LUTs for three mirror shapes (concave, flat and convex) are shown, evaluated for three different mirror angles (−30°, 0° and +30°). At 0° the mirror is facing a viewer standing straight in front of the display. For mirror angle 30°, the image is reflected to a viewpoint at 60°. The black rectangles show the extent of the projector frame. The graphs are oriented so that 'up' corresponds to the top of the mirror.

The three different mirrors are used to demonstrate the LUT projection technique developed. The output of the LUT is the geometry mapped into projector UV coordinates. If projected directly onto a flat screen, the geometry appears warped in order to compensate for the mirror shape, mirror angle, and projector perspective and keystoning.

This is shown in figure 7.3 for the three different mirrors, projected from three different rotation angles of the mirrors. As can be seen, the transform applied by the LUT is generally smooth, though curvature increases towards grazing angles. The effect of using the fitted-function LUTs to speed up the code is shown in figure 7.4. The values generated are very close to the ground truth; the only time they deviate is at the steepest angles of the mirror rotation, when the reflection faces further away than 90 degrees to either the left or the right. These angles are not rendered to, since the display only renders to the front-facing 180 degrees, so the fitted approximation is a very good one for the region of interest.
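For reference, the mapping realized by the LUT and its fitted approximation can be written out explicitly. The following is reconstructed from the shader and fitting code in appendix A; the symbol names are mine, not notation used elsewhere in this report. The table implements

(u, v) = LUT(ω, h, d, x, y, z),

where ω is the mirror rotation angle, (h, d) are the assumed viewer height and distance, and Q = (x, y, z) is the scene point; these six arguments are the six dimensions of the table, and the output (u, v) is the projector coordinate. The fitted-function variant replaces the dense table with one quadratic polynomial per sampled (ω, h, d) and per output coordinate, of the form used in the fitting code of appendix A.3:

u(x, y, z) = c1·x² + c2·xy + c3·xz + c4·y² + c5·yz + c6·z² + c7·x + c8·y + c9·z + c10,

and analogously for v, with the ten coefficients obtained by a least-squares solve over all table samples for that (ω, h, d).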


Figure 7.4: The images show, for three different mirror angles (−30°, 0° and +30°), a projection of a plane onto the concave mirror: first using the fitted-function LUT, drawn as lines (black), and next to it, on the right, the full 6D LUT translation, drawn as a grid (red and blue). As can be seen, both methods yield the same result.

The result is a teleconferencing system which takes a significant step towards maintaining the many nonverbal cues used in face-to-face human communication and adding to the experience perceived by the user.

7.2 User perception

This system is all about perception. How well it performs is the result of its many difficult parts, but how it will be measured depends on how people perceive it. To improve this, the frame rate was increased to double the rate of the first version. This is important both for the viewers and for the possibility to film the system with a normal camera running at 30 fps. A tradeoff was made to increase the frame rate at the cost of degrees of viewing angle for the display. Since the system is unable to scan the back of the head of the subject being transmitted, there is no need to display the back-facing 180 degrees of view. Instead a double-sided tent mirror was built, and the system made to render only the front of the given object, at twice the rate, thus giving double the frame rate for the front-facing 180 degrees.

The resolution of the projected image was also increased to use the full resolution of the projector, 1024x768. This produces an image of 4:3 aspect ratio, which is leveraged by turning the projector on its side and projecting an oblong image along the vertical axis. This better frames the human head.

The system was built, and the new mathematics derived, such that the projector can be arbitrarily positioned. This is leveraged for usability: since the specular highlight from the mirror material is strongest in the directly reflected vertical direction, the projector is best positioned over the viewers' heads. The projector is therefore placed on top of the display and reflected off of a system of mirrors, so that its unfolded position is above the audience's heads. With this solution, the projector is not in the way of the audience, but is still in the optimal position.

Figure 7.5: Comparison of the different mirror shapes (concave, flat and convex) for two simultaneously tracked upper and lower viewpoints. For the convex mirror, the geometry was scaled by 0.75 to fit within the smaller display volume. In the 4th and 8th columns the mirror is replaced with the actual mannequin head to provide a ground truth reference.

An evaluation of the three mirrors was done by photographing them during runtime, simultaneously from two different tracked viewpoints at different heights. The scenes displayed can be seen in figure 7.5. The scenes are displayed with the LUT projection technique, with each LUT calculated specifically for the appropriate mirror. The images show a generated scene of a cube, a scanned person and a scanned mannequin. Next to the scanned mannequin image is a photo where the mirror has been replaced by the real mannequin head for comparison. As can be seen, the reproduced 3D display version is quite close in size and shape to the ground truth, for the changing viewpoints as well as the changing mirror geometry. Note that when using the convex mirror, it was necessary to shrink objects to fit within the smaller usable display volume. In general, the concave mirror yields the best combination of display volume, user addressability, and focal cues for a head-sized display.


Figure 7.6: A test object (left) is aimed at a camera shown on the 2D display (right). The camera photographs the transmitted 3D displayed image of the test object to measure gaze accuracy for the 3D display. The accuracy of the 2D display is measured in the same way by switching the locations of the camera and test object.

Audiences watching the display perceived eye contact and could see at which person the scanned person was looking, but to measure the eye contact accuracy a test object was captured and transmitted. The test object features five registration targets that enable angular orientation to be measured, see figure 7.6. By placing a camera at one end of the teleconferencing system and the test object at the other end, and sighting the transmitted image of the camera lens through the sighting hole of the test object, the transmitted image of the test object is photographed. By measuring in the photograph how far the foremost registration target is from the center of the four edge registration targets, its angle relative to the camera can be calculated. The measured errors range between 3 and 5 degrees on the 3D display, most of which is attributable to geometric noise and the 2.5 degree separation between independent views. For the 2D display in front of the remote participant, the error ranges between 1 and 2 degrees.
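One way to read this measurement, using symbols that are not part of the original description and are offered as an illustration only: if the foremost registration target sits a known distance f in front of the plane of the four edge targets, and its image in the photograph appears offset by a distance s (converted to the same physical units) from the center of those targets, then the gaze error is approximately

θ ≈ arctan(s / f),

which for small angles is simply proportional to the measured offset.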

7.3 Fulfillment of the purpose

The purpose of achieving a fully functional 3D teleconferencing system has been fulfilled, and likewise the live, real-time scanning of a person's head. All connectivity between the two sides of the system is handled over a standard internet connection. The goal of extending the 3D display to correctly display a life-size head autostereoscopically has been achieved for more than the required 180 degrees. It has also been done with different mirrors, and the projector position can be arbitrarily chosen. The success of the goal stated in the introductory vision, to create novel results within the field of 3D displays and to push the boundaries, is hard to measure, but the project has gotten some attention, see section 8.2, and has future prospects for development.


Chapter 8

Conclusions

This report describes how these problems were solved and why specific solutions were favored. It concentrates on the main part of my contributions: the efficient and timely transmission and display of the captured 3D data of the remote participant.

Many users have tested and seen the teleconferencing system, but a qualified user study has yet to be done to verify to what degree the system achieves eye contact and how connected the participants feel. Another note about the usability of the system is that the spinning mirror and motor make an audible sound, though not enough to displease the audience. As the system is built now, the remote participant does not see the audience in 3D, even though the 2D screen is placed and calibrated to optimally match the actual audience position. To achieve a many-to-many, fully 3D teleconferencing system in a meeting between N participants in L different locations, N × (L − 1) 3D displays would be needed, since each display only shows one participant at a time. For example, three participants at three different locations would require 3 × (3 − 1) = 6 displays.

In this project I have been part of building an intricate hardware system and have developed software controlling it in parallel. This has been a great experience, as has learning new concepts and techniques within the field of computer graphics; the hands-on experience has been very educational. This, in conjunction with doing the project abroad with the involvement of very driven members of many nationalities, has certainly taught me a lot.

Contributions

The system developed is quite large and complex. My contributions are in the 3D display part of the system and the communication between the different parts. In essence, this entails software and hardware for the 3D display and networking software for communication. These contributions are mainly described in chapters 5 and 6. A more detailed enumeration can be found in section 1.4.

8.1 Future work

Experience gained during the development of this system shows several areas where it could be improved, for example by enabling the projections to be in color. One way to do this would be to build a projector with 3 DLP chips, one for each color channel. This is not as simple as it might sound, since it would demand a much higher data rate, already a bottleneck in the system. Full RGB color would mean 24 times the projected data rate. For a three-color system with dithered color, three times the data rate would suffice, but this is more than even a dual-link DVI cable supports at the rates used (a rough estimate is given at the end of this section). Another approach to achieve color could be to place two or more synchronized projectors in the same optical path, e.g. using a beam splitter (trichroic prism or similar) in reverse to combine several color channels.

The gray-level reproduction of the halftoning (dithering) could be improved by enabling an algorithm that implements error diffusion, as shown in section 6.3.2. This is a matter of a fast enough implementation.

The size of the 3D projection could be made larger, with a bigger mirror and a projector with higher resolution and potentially higher intensity. This would also require the larger mirror to be constructed in a safe, balanced manner, spun by a stronger motor.

The textured content transmitted between the capture side and the display side could be efficiently compressed. This requires computing resources on both sides of the system.

If possible, it would be a great improvement to enable the projection without a spinning mirror that moves air around, or to have a mirror spinning without moving air around, for example in a vacuum. This would make the system quieter and enable larger sizes.

To find further areas for improvement from a usability point of view, it would also be of interest to perform a fully qualified user study.
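To make the color data rates discussed at the start of this section concrete, here is a back-of-the-envelope estimate based on the display parameters stated earlier (1024x768 resolution, 4320 binary frames per second); the exact video-link budget of the hardware is not restated here, so the numbers are indicative only:

1024 × 768 pixels × 4320 frames/s × 1 bit ≈ 3.4 Gbit/s (current dithered black & white)
3 × 3.4 Gbit/s ≈ 10.2 Gbit/s (dithered three-color)
24 × 3.4 Gbit/s ≈ 81.5 Gbit/s (full 24-bit RGB)

Dual-link DVI carries roughly 8 Gbit/s of video data (2 × 165 MHz pixel clock × 24 bits), so even the dithered three-color stream exceeds it at these rates, consistent with the statement above.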

8.2 Publicity

The teleconferencing system has, at the time of writing, been shown at the lab at ICT for guests and at the Army Science Conference 2008 in Orlando, Florida. A technical paper has been written on the system and has been accepted to the SIGGRAPH 2009 international conference, where it is to be presented in the following categories:

• Technical Papers

• Emerging Technologies, as an exhibition

It has been published in the following media:

• The Wall Street Journal
• Forbes.com
• Telepresence Options
• Computer Government News
• The Chronicle of Higher Education
• USC News
• IEEE Spectrum Online
• Jump Into Tomorrow

Videos of the results of this project and news of the developments can be viewed at the University of Southern California's Institute for Creative Technologies home page [16].

Bibliography

[1] Encyclopædia Britannica 2009, chapter holography. Encyclopædia Britannica Online, 02 Apr. 2009.
[2] Picdem hpc explorer board, w/ pic18f8722, 2006. http://www.microchip.com/.
[3] 80/20 inc. - the industrial erector set, May 2009. http://www.8020.net/.
[4] Tibor Agocs, Tibor Balogh, Tamas Forgacs, Fabio Bettio, Enrico Gobbetti, Gianluigi Zanetti, and Eric Bouvier. A large scale interactive holographic display. In VR '06: Proceedings of the IEEE Virtual Reality Conference (VR 2006), page 57, Washington, DC, USA, 2006. IEEE Computer Society.
[5] M. Argyle and M. Cook. Gaze and Mutual Gaze. Cambridge University Press, London, 1976.
[6] J. Lee, S. Marsella, D. Traum, J. Gratch, and B. Lance. The Rickel gaze model: A window on the mind of a virtual human. In 7th International Conference on Intelligent Virtual Agents, Paris, France, pages 296–303, 2007.
[7] R. G. Batchko. Three-hundred-sixty degree electro-holographic stereogram and volumetric display system. In Proc. SPIE, volume 2176, pages 30–41, 1994.
[8] Milton Chen. Leveraging the asymmetric sensitivity of eye contact for videoconference. In CHI '02: Proceedings of the SIGCHI conference on Human factors in computing systems, pages 49–56, New York, NY, USA, 2002. ACM.
[9] W. S. Condon and W. D. Ogston. A segmentation of behavior. Journal of Psychiatric Research, 5(3):221–235, 1967.
[10] Oliver Cossairt, Adrian R. Travis, Christian Moller, and Stephen A. Benton. Novel view sequential display based on DMD technology. In Andrew J. Woods, John O. Merritt, Stephen A. Benton, and Mark T. Bolas, editors, Proc. SPIE, Stereoscopic Displays and Virtual Reality Systems XI, volume 5291, pages 273–278, May 2004.
[11] Oliver S. Cossairt, Joshua Napoli, Samuel L. Hill, Rick K. Dorval, and Gregg E. Favalora. Occlusion-capable multiview volumetric three-dimensional display. Applied Optics, 46(8):1244–1250, Mar 2007.
[12] Oliver S. Cossairt, Joshua Napoli, Samuel L. Hill, Rick K. Dorval, and Gregg E. Favalora. Occlusion-capable multiview volumetric three-dimensional display. Applied Optics, 46(8):1244–1250, Mar 2007.
[13] Neil A. Dodgson. Autostereoscopic 3D displays. Computer, 38(8):31–36, 2005.
[14] Gregg E. Favalora. Volumetric 3D displays and application infrastructure. Computer, 38(8):37–44, 2005.
[15] R. W. Floyd and L. Steinberg. An adaptive algorithm for spatial gray scale. In Int. Symp. Dig. Tech. Papers, SID 75, page 36, 1975.
[16] Institute for Creative Technologies Graphics Lab at University of Southern California. 3D teleconference, May 2009. http://gl.ict.usc.edu/Research/3DTeleconferencing/.
[17] David M. Grayson and Andrew F. Monk. Are you looking at me? Eye contact and desktop video conferencing. ACM Trans. Comput.-Hum. Interact., 10(3):221–243, 2003.
[18] U. Hadar, T. J. Steiner, and F. Clifford Rose. Head movement during listening turns in conversation. Journal of Nonverbal Behavior, 9(4):214–228, 1985.
[19] Dirk Heylen. A closer look at gaze. In AAMAS Workshop on Creating Bonds, 2005.
[20] T. Honda et al. Three-dimensional display technologies satisfying super multiview condition. In B. Javidi and F. Okano, editors, Proc. Three-Dimensional Video and Display: Devices and Systems, volume CR76, pages 218–249. SPIE Press, 2000.
[21] Xianyou Hou, Li-Yi Wei, Heung-Yeung Shum, and Baining Guo. Real-time multi-perspective rendering on graphics hardware. In Rendering Techniques 2006: 17th Eurographics Workshop on Rendering, pages 93–102, June 2006.
[22] Y. Iwano, S. Kageyama, E. Morikawa, S. Nakazato, and K. Shirai. Analysis of head movements and its role in spoken dialogue. In Spoken Language, 1996. ICSLP 96. Proceedings., Fourth International Conference on, volume 4, pages 2167–2170, Oct 1996.
[23] R. Jakobson. Motor signs for 'yes' and 'no'. In Language in Society, pages 91–96, April 1972.
[24] Andrew Jones, Ian McDowall, Hideshi Yamada, Mark Bolas, and Paul Debevec. Rendering for an interactive 360° light field display. ACM Transactions on Graphics, 26(3):40:1–40:10, July 2007.
[25] James T. Kajiya and Timothy L. Kay. Rendering fur with three dimensional textures. In Computer Graphics (Proceedings of SIGGRAPH 89), pages 271–280, July 1989.
[26] Chris L. Kleinke. Gaze and eye contact: A research review. Psychological Bulletin, 100(1):78–100, July 1986.
[27] J. Lee and S. Marsella. Nonverbal behavior generator for embodied conversational agents. In 6th International Conference on Intelligent Virtual Agents, Marina del Rey, CA, pages 243–255, 2006.
[28] Hiroyuki Maeda, Kazuhiko Hirose, Jun Yamashita, Koichi Hirota, and Michitaka Hirose. All-around display for video avatar in real world. In ISMAR '03: Proceedings of the 2nd IEEE and ACM International Symposium on Mixed and Augmented Reality, page 288, Washington, DC, USA, 2003. IEEE Computer Society.
[29] Wojciech Matusik and Hanspeter Pfister. 3D TV: a scalable system for real-time acquisition, transmission, and autostereoscopic display of dynamic scenes. ACM Transactions on Graphics, 23(3):814–824, August 2004.
[30] Naoki Mukawa, Tsugumi Oka, Kumiko Arai, and Masahide Yuasa. What is connected by mutual gaze?: user's behavior in video-mediated communication. In CHI '05: CHI '05 extended abstracts on Human factors in computing systems, pages 1677–1680, New York, NY, USA, 2005. ACM.
[31] David Nguyen and John Canny. Multiview: spatially faithful group video conferencing. In CHI '05: Proceedings of the SIGCHI conference on Human factors in computing systems, pages 799–808, New York, NY, USA, 2005. ACM.
[32] David T. Nguyen and John Canny. Multiview: improving trust in group video conferencing through spatial faithfulness. In CHI '07: Proceedings of the SIGCHI conference on Human factors in computing systems, pages 1465–1474, New York, NY, USA, 2007. ACM.
[33] Victor Ostromoukhov. A simple and efficient error-diffusion algorithm. In Proceedings of ACM SIGGRAPH 2001, Computer Graphics Proceedings, Annual Conference Series, pages 567–572, August 2001.
[34] Rieko Otsuka, Takeshi Hoshino, and Youichi Horry. Transpost: A novel approach to the display and transmission of 360 degrees-viewable 3D solid images. IEEE Transactions on Visualization and Computer Graphics, 12(2):178–185, 2006.
[35] Pierre Poulin and Alain Fournier. A model for anisotropic reflection. In Computer Graphics (Proceedings of SIGGRAPH 90), pages 273–282, August 1990.
[36] Jeremy Rees. Critics pan CNN's fake election holograms. New Zealand Herald, Nov 7 2008.
[37] H. M. Rosenfeld. Conversational control functions of nonverbal behavior. In Nonverbal Behavior and Communication, Hillsdale, NJ, 1978. Lawrence Erlbaum Associates.
[38] Jim Steinmeyer. The Science Behind the Ghost: A Brief History of Pepper's Ghost. Hahne, 1999.
[39] Alan Sullivan. A solid-state multi-planar volumetric display. SID Symposium Digest of Technical Papers, 32(1):1531–1533, May 2003.
[40] K. Tanaka and S. Aoki. A method for the real-time construction of a full parallax light field. In A. J. Woods, N. A. Dodgson, J. O. Merritt, M. T. Bolas, and I. E. McDowall, editors, Stereoscopic Displays and Virtual Reality Systems XIII, Proceedings of the SPIE, volume 6055, pages 397–407, February 2006.
[41] A. R. L. Travis. The display of three-dimensional video images. Proceedings of the IEEE, 85(11):1817–1832, Nov 1997.
[42] Chris Welch. Beam me up, Wolf! CNN debuts election-night 'hologram', 2008.
[43] Song Zhang and Peisen Huang. High-resolution, real-time three-dimensional shape measurement. Optical Engineering, 45(12), 2006.
[44] Ying Zhang and Adrian Travis. A projection-based multi-view time-multiplexed autostereoscopic 3D display system. In IV '06: Proceedings of the Tenth International Conference on Information Visualization, pages 778–784, Washington, DC, USA, November 2006. IEEE Computer Society.
[45] Matthias Zwicker, Wojciech Matusik, Frédo Durand, and Hanspeter Pfister. Antialiasing for automultiscopic 3D displays. In Rendering Techniques 2006: 17th Eurographics Workshop on Rendering, pages 73–82, June 2006.

Appendix A

Appendix

A.1 LUT vertex shader code

void LUT_VP( float4 Qpos : POSITION,        // vertex position
             float4 Qcol : COLOR,           // vertex color
             float4 Qtex : TEXCOORD0,       // texture coords

             uniform float    scale,        // object scale
             uniform float4   translate,    // object translate
             uniform float4x4 rotate,       // rotation matrix

             uniform float4 lutOrigin,        // origin of lut volume (low corner of cube) in world coordinates
             uniform float  lutInverseWidth,  // 1/width of lut volume in world coordinates
             uniform float  lutSamples,       // lut samples per dimension

             uniform sampler3D lut0,        // 3D lut, Q(x,y,z) => (u,v)
             uniform sampler3D lut1,        // 3D lut, Q(x,y,z) => (u,v)
             uniform sampler3D lut2,        // 3D lut, Q(x,y,z) => (u,v)
             uniform sampler3D lut3,        // 3D lut, Q(x,y,z) => (u,v)

             float4 Lw,                     // weights for the 4 luts

             out float4 out_pos      : POSITION,
             out float4 out_Qcol     : COLOR0,
             out float4 out_uv       : TEXCOORD0,    // texture
             out float4 out_uvScreen : TEXCOORD1 )   // uv
{
    // do transforms on the scene
    Qpos.xyz *= scale;
    Qpos = mul( rotate, Qpos );
    Qpos += translate;

    // direct 3D tex lookup with 4 pt interpolation for h and d (omega is set)
    float3 pos = (Qpos.xyz - lutOrigin.xyz) * lutInverseWidth;
    float3 l0 = tex3D(lut0, pos).xyz * 2.0 - 1.0;
    float3 l1 = tex3D(lut1, pos).xyz * 2.0 - 1.0;
    float3 l2 = tex3D(lut2, pos).xyz * 2.0 - 1.0;
    float3 l3 = tex3D(lut3, pos).xyz * 2.0 - 1.0;

    out_pos.xyz = l0 * Lw.x + l1 * Lw.y + l2 * Lw.z + l3 * Lw.w;
    out_pos.w   = 1.0;

    // pass data to fragment shader
    out_uvScreen = out_pos;
    out_uv       = Qtex;
    out_Qcol     = Qcol;
}

A.2 Fitted LUT vertex shader code

void FittedLUT_VP( float4 Qpos : POSITION,      // vertex position
                   float4 Qcol : COLOR,         // vertex color
                   float4 Qtex : TEXCOORD0,     // texture coords

                   uniform float    scale,      // object scale
                   uniform float4   translate,  // object translate
                   uniform float4x4 rotate,     // rotation matrix

                   uniform float4 lutOrigin,         // origin of lut volume (low corner of cube) in world coordinates
                   uniform float  lutInverseWidth,   // 1/width of lut volume in world coordinates
                   uniform float  lutSamples,        // lut samples per dimension
                   uniform float  lutCubeToSample,   // lut cubespace to samplespace
                   uniform float3 viewerPos,         // position of the assumed viewer

                   float4 Lw,   // weights for the 4 different positions to be interpolated

                   // Fitted LUTs ================================================
                   // this is extensive, but fast; naming: FittedLut#_[u/v]C_[0-2], each a float4
                   // 0
                   uniform float4 FL0_uC_0, uniform float4 FL0_uC_1, uniform float4 FL0_uC_2,   // coefficients for u equation
                   uniform float4 FL0_vC_0, uniform float4 FL0_vC_1, uniform float4 FL0_vC_2,   // coefficients for v equation
                   // 1
                   uniform float4 FL1_uC_0, uniform float4 FL1_uC_1, uniform float4 FL1_uC_2,   // coefficients for u equation
                   uniform float4 FL1_vC_0, uniform float4 FL1_vC_1, uniform float4 FL1_vC_2,   // coefficients for v equation
                   // 2
                   uniform float4 FL2_uC_0, uniform float4 FL2_uC_1, uniform float4 FL2_uC_2,   // coefficients for u equation
                   uniform float4 FL2_vC_0, uniform float4 FL2_vC_1, uniform float4 FL2_vC_2,   // coefficients for v equation
                   // 3
                   uniform float4 FL3_uC_0, uniform float4 FL3_uC_1, uniform float4 FL3_uC_2,   // coefficients for u equation
                   uniform float4 FL3_vC_0, uniform float4 FL3_vC_1, uniform float4 FL3_vC_2,   // coefficients for v equation
                   // ============================================================

                   out float4 out_pos      : POSITION,
                   out float4 out_Qcol     : COLOR0,
                   out float4 out_uv       : TEXCOORD0,    // texture
                   out float4 out_uvScreen : TEXCOORD1 )   // uv
{
    // Geometry transformations =====================================
    Qpos.xyz *= scale;
    Qpos = mul( rotate, Qpos );
    Qpos += translate;

    // LUT interpolation ============================================
    float3 pos, posSS;
    float3 l0, l1, l2, l3;
    float3 lutVal;

    // compute position in LUT coordinates
    pos = (Qpos.xyz - lutOrigin.xyz) * lutInverseWidth;

    // convert to SampleSpace for the fitted LUTs
    posSS = pos * lutSamples + 1;

    // use fitted LUTs to get uv's
    float xx = pow(posSS.x, 2);
    float xy = posSS.x * posSS.y;
    float xz = posSS.x * posSS.z;
    float yy = pow(posSS.y, 2);
    float yz = posSS.y * posSS.z;
    float zz = pow(posSS.z, 2);

    l0.x = FL0_uC_0[0]*xx + FL0_uC_0[1]*xy + FL0_uC_0[2]*xz + FL0_uC_0[3]*yy + FL0_uC_1[0]*yz
         + FL0_uC_1[1]*zz + FL0_uC_1[2]*posSS.x + FL0_uC_1[3]*posSS.y + FL0_uC_2[0]*posSS.z + FL0_uC_2[1];
    l0.y = FL0_vC_0[0]*xx + FL0_vC_0[1]*xy + FL0_vC_0[2]*xz + FL0_vC_0[3]*yy + FL0_vC_1[0]*yz
         + FL0_vC_1[1]*zz + FL0_vC_1[2]*posSS.x + FL0_vC_1[3]*posSS.y + FL0_vC_2[0]*posSS.z + FL0_vC_2[1];

    l1.x = FL1_uC_0[0]*xx + FL1_uC_0[1]*xy + FL1_uC_0[2]*xz + FL1_uC_0[3]*yy + FL1_uC_1[0]*yz
         + FL1_uC_1[1]*zz + FL1_uC_1[2]*posSS.x + FL1_uC_1[3]*posSS.y + FL1_uC_2[0]*posSS.z + FL1_uC_2[1];
    l1.y = FL1_vC_0[0]*xx + FL1_vC_0[1]*xy + FL1_vC_0[2]*xz + FL1_vC_0[3]*yy + FL1_vC_1[0]*yz
         + FL1_vC_1[1]*zz + FL1_vC_1[2]*posSS.x + FL1_vC_1[3]*posSS.y + FL1_vC_2[0]*posSS.z + FL1_vC_2[1];

    l2.x = FL2_uC_0[0]*xx + FL2_uC_0[1]*xy + FL2_uC_0[2]*xz + FL2_uC_0[3]*yy + FL2_uC_1[0]*yz
         + FL2_uC_1[1]*zz + FL2_uC_1[2]*posSS.x + FL2_uC_1[3]*posSS.y + FL2_uC_2[0]*posSS.z + FL2_uC_2[1];
    l2.y = FL2_vC_0[0]*xx + FL2_vC_0[1]*xy + FL2_vC_0[2]*xz + FL2_vC_0[3]*yy + FL2_vC_1[0]*yz
         + FL2_vC_1[1]*zz + FL2_vC_1[2]*posSS.x + FL2_vC_1[3]*posSS.y + FL2_vC_2[0]*posSS.z + FL2_vC_2[1];

    l3.x = FL3_uC_0[0]*xx + FL3_uC_0[1]*xy + FL3_uC_0[2]*xz + FL3_uC_0[3]*yy + FL3_uC_1[0]*yz
         + FL3_uC_1[1]*zz + FL3_uC_1[2]*posSS.x + FL3_uC_1[3]*posSS.y + FL3_uC_2[0]*posSS.z + FL3_uC_2[1];
    l3.y = FL3_vC_0[0]*xx + FL3_vC_0[1]*xy + FL3_vC_0[2]*xz + FL3_vC_0[3]*yy + FL3_vC_1[0]*yz
         + FL3_vC_1[1]*zz + FL3_vC_1[2]*posSS.x + FL3_vC_1[3]*posSS.y + FL3_vC_2[0]*posSS.z + FL3_vC_2[1];

    lutVal = l0 * Lw.x + l1 * Lw.y + l2 * Lw.z + l3 * Lw.w;

    // convert LUT values to screen UV
    out_pos.xy = lutVal.xy * 2.0 - 1.0;
    out_pos.z  = length(pos - viewerPos) / 200;
    out_pos.w  = 1.0;

    // PASS DATA TO FS
    out_uvScreen = out_pos;

    // scale, for dithering; assumes resolution to be 1024 x 768
    out_uvScreen.x *= 128;
    out_uvScreen.y *= 96;

    out_uv   = Qtex;
    out_Qcol = Qcol;
}

A.3 LUT fitting code excerpt

% MATLAB code
% This code shows some of the core functionality of the fitting.
% The data handling, plotting and verification scripts are omitted.
% The code is only excerpts, and all dependencies are not explained in detail.

% function for fitting functions to a lut and saving them for use in C++
function generateFittedLUT()

    disp('Fitting functions to LUT');

    % load LUT: loads a binary file already converted from ascii,
    % generates data in global variable space
    loadLUT('LookupTable.txt.bin');

    % access globally stored data
    global HSamples;
    global DSamples;
    global OmegaSamples;
    global paramVector    % header info for the custom file format

    % write the result to a binary file
    % HEADERS
    [fid msg] = fopen('LUT.txt.fitted.bin', 'wb');
    if( msg ~= '' )
        disp('File opened with msg:');
        disp(msg);
    end
    fwrite(fid, paramVector, 'float32');

    % DATA
    % loop over LUT omega, ih, id
    for io = 1:OmegaSamples
        for ih = 1:HSamples
            for id = 1:DSamples
                % fit functions for u and v for each omega, height and distance.
                % fitting is done over all Q(x,y,z) and returned as two param vectors
                [ffu ffv] = fitFunction(io, ih, id);

                % write the result to the binary file
                fwrite(fid, [ffu ffv], 'float32');
            end
        end
    end

    fclose(fid);
    disp('Fitted LUT written.');
end

function [ffu ffv] = fitFunction(iangle, ih, id)

    global DATA;
    global SpatialSamples;

    count = 1;
    A  = zeros(SpatialSamples^3, 10);
    Bu = zeros(SpatialSamples^3, 1);
    Bv = zeros(SpatialSamples^3, 1);
    io = iangle;

    % loop through the cube for the given angle, h, d
    for ix = 1:SpatialSamples
        for iy = 1:SpatialSamples
            for iz = 1:SpatialSamples
                % for each row, add an equation to A and the answers to Bu and Bv
                element = getLUTValue(DATA, io, ih, id, ix, iy, iz);
                Bu(count) = element(1);
                Bv(count) = element(2);

                % filter out outliers
                if( Bu(count) ~= 0 || Bv(count) ~= 1 )
                    A(count, :) = [ ix^2 ix*iy ix*iz iy^2 iy*iz iz^2 ix iy iz 1 ];
                    count = count + 1;
                end
            end
        end
    end

    if( count < 100 )
        disp('WARNING: Rows in linear system:');
        disp(count);
    end

    % least-squares solve of the system Ax = B for x
    ffu = A\Bu;
    ffv = A\Bv;
end
