
Mixed Reality in Virtual World Teleconferencing

Tuomas Kantonen (1), Charles Woodward (1), Neil Katz (2)

(1) VTT Technical Research Centre of Finland, (2) IBM Corporation

ABSTRACT: In this paper we present a mixed reality (MR) teleconferencing application based on Second Life (SL) and the OpenSim virtual world. Augmented reality (AR) techniques are used for displaying virtual avatars of remote meeting participants in real physical spaces, while augmented virtuality (AV), in the form of video-based gesture detection, enables capturing of human expressions to control avatars and to manipulate virtual objects in virtual worlds. The use of Second Life for creating a shared augmented space to represent different physical locations allows us to incorporate the application into existing infrastructure. The application is implemented using the open source Second Life viewer and the ARToolKit and OpenCV libraries.

KEYWORDS: mixed reality, virtual worlds, Second Life, teleconferencing, immersive virtual environments, collaborative augmented reality.

INDEX TERMS: H.4.3 [Information Systems Applications]: Communications Applications – computer conferencing, teleconferencing, and video conferencing; H.5.1 [Information Systems]: Multimedia Information Systems – artificial, augmented, and virtual realities.

e-mail: [email protected], [email protected], [email protected]

Figure 1. Illustration of a Mixed Reality teleconference: a Second Life avatar among real people wearing ultra-lightweight data glasses, sharing a virtual object on the table, inside a virtual room displayed in a CAVE.

1 INTRODUCTION

The need for effective teleconferencing systems is increasing, mainly due to economic and environmental reasons, as transporting people for face-to-face meetings consumes a lot of time, money and energy. Massively multi-user virtual 3D worlds have lately gained popularity as teleconferencing environments. This interest is not only academic, as one of the largest virtual conferences was held by IBM in late 2008 with over 200 participants. The conference, hosted in a private installation of the Second Life virtual world, was a great success, also saving an estimated $320,000 compared to the expense of having the conference held in the physical world [1].

In this paper, we present a system for mixed reality teleconferencing where a mirror world of a conference room is created in Second Life and the virtual world is displayed in the real-life conference room using augmented reality techniques. The real people's gestures are reflected back to Second Life. The participants are also able to interact with shared virtual objects on the conference table. A synthetic illustration of such a setting is shown in Figure 1.

We see several advantages of using a 3D virtual environment, such as Second Life or OpenSim among many other platforms, as an alternative means for real-time teleconferencing and collaboration. First, the users are able to see all meeting participants and get a sense of presence not possible in a traditional conference call. Second, the integrated voice capability of 3D virtual worlds provides spatial and stereo audio. Third, the 3D environment itself provides a visually appealing shared meeting environment that is just not possible with other means of teleconferencing. However, the lack of natural gestures constitutes a major drawback for real interaction between the participants.

The structure of the paper is as follows. Section 2 describes the background and motivation for our work. Section 3 explains previous work related to the subject. Section 4 goes into some explanation of Second Life technical detail. Section 5 gives an overview of the system we are developing. Section 6 gives a description of our prototype implementation. Section 7 provides a discussion of results, as well as items for future work. Conclusions are in Section 8.

2 BACKGROUND

There are several existing teleconference systems, ranging from the old but still often used audio teleconferencing and video teleconferencing to web-based conferencing applications. 2D groupware and even massively multi-user 3D virtual worlds have been used for teleconferencing.

Each of these existing systems has its pros and cons. Conference calls are quick and easy to set up with no other hardware than a mobile phone, yet they are limited to audio only and require a separate channel e.g. for document sharing. Videoconferencing adds a new modality as pictures of the participants are transferred, but it requires more hardware and bandwidth and can be quite expensive at the high end. Web conferencing is lightweight and readily supports document and application sharing, but it lacks natural interaction between users.

3 RELATED WORK

In our work, virtual reality and augmented reality are combined in a similar manner as in the original work by Piekarski et al. [2]. Their work was quite limited in the amount of augmented virtuality, as only the position and orientation of users were transferred into the virtual environment. Our work focuses on interaction between augmented reality and a virtual environment. Therefore our work is closely related to immersive environments such as [3, 4]. Several different immersive 3D video conferencing systems are described in [5].

Local collaboration in augmented reality has been studied for example in [6, 7]. Collaboration is achieved by presenting co-located users the same virtual scene from their respective viewpoints and providing the users simple collaboration tools such as virtual pointers. Remote AR collaboration has mostly been limited to augmenting live video, such as in [8], or later augmenting a 3D model reconstructed from multiple video cameras, as in [9]. Remote sharing of the augmented virtual objects and applications has been studied for example in [10].

Our work uses Second Life and the open source implementation of Second Life called OpenSim, which are multi-user virtual worlds, as the virtual environment for presenting shared virtual objects. Using Second Life in AR has been previously studied by Lang et al. [11] as well as Stadon [12], although their work does not include augmented virtuality.

In the simplest case, augmented virtuality can be achieved by displaying real video inside a virtual environment, as in [13]. This approach has also been used for virtual videoconferencing in [14] and for augmenting avatar heads in [15]. Another form of augmented virtuality is avatar puppeteering, where human body gestures are recognized and used to control the avatar, either only the avatar's face as in [16] or the whole avatar body as in [17]. However, only little previous work has been presented on augmenting Second Life avatars with real life gestures. The main exception is the VR-Wear system [18] for controlling the avatar's facial expressions.

4 SECOND LIFE VIRTUAL WORLD

Second Life is a free, massively multi-user on-line game-like 3D virtual world for social interaction. It is based on community created content and it even has a thriving economy. The virtual world users, called residents, are represented by customizable avatars and can take part in different activities provided by other residents.

For interaction, Second Life features spatial voice chat, text chat and avatar animations. Only the left hand of the avatar can be freely animated on-the-fly, while all other animations rely on pre-recorded skeletal animations that the user can create and upload to the SL server.

For non-expert SL users, however, meetings in SL can be quite static, with the 'who is currently speaking' indicator being the only active element. From our experience, actively animating the avatar while talking takes considerable training and directs the user's focus away from the discussion.

Second Life has a client-server architecture and each server is scalable to tens of thousands of concurrent users. The server is proprietary to Linden Lab, but there also exists the community developed SL-compatible server OpenSimulator [19].

5 SYSTEM OVERVIEW

In this project we developed a prototype and proof-of-concept of a video conference meeting taking place between Second Life and the real world. Our system combines an immersive virtual environment, collaborative augmented reality and human gesture recognition in a way to support collaboration between real and virtual worlds. We call the system Augmented Collaboration in Mixed Environments (ACME).

In the ACME system, some participants of the meeting occupy a space in Second Life while others are located around a table in the real world. The physical meeting table is replicated in Second Life to support virtual object interactions as well as avatar occlusions. The people in the real world see the avatars augmented around a real-world table, displayed by video see-through glasses, immersive stereoscopic walls or within a video teleconference screen. Participants in Second Life see the real-world people as avatars around the meeting table, augmented with hand and body gestures. Both the avatars and the real people can interact with virtual objects shared between them, on the virtual and physical conference tables respectively.

The main components of the system are: co-located users wearing video-see-through HMDs, a laptop for each user running the modified SL client, a ceiling mounted camera above each user for hand tracking, and remote users using the normal SL client. The system is designed for restricted conference room environments where meeting participants are seated around a well lit, uniformly colored table. As an alternative to HMDs, a CAVE style stereo display environment or plain video screens can be used.

Figure 2 shows how the ACME system is experienced in a meeting between two participants, one attending the meeting in Second Life and the other one in the real world. It should be noted that the system is designed for multiple simultaneous remote and co-located users. A video of the ACME system is available at [20].

Figure 2. User views of ACME: Second Life view (screenshot, left), real life view (augmented video, right).
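As a rough illustration of the per-site setup described above (HMD camera, hand-tracking camera, modified SL client per co-located user), the following sketch lists the kind of configuration the real-world side of the system needs. The structure and field names are purely illustrative assumptions; the paper only states that items such as the anchor object (Section 6.2) are selected in an ACME configuration file.

```cpp
// Illustrative sketch only: a per-site description of the ACME components
// listed above. Field names and types are assumptions, not the real format.
#include <string>
#include <vector>

struct CoLocatedUser {
    std::string hmdCameraConfig;   // USB camera behind the video-see-through HMD
    std::string handCameraGuid;    // ceiling-mounted firewire camera for hand tracking
    std::string avatarName;        // SL avatar representing this real-world user
};

struct AcmeSite {
    std::string slRegion;          // Second Life / OpenSim region holding the mirror room
    std::string anchorObjectName;  // SL object defining the real-world origin (Sec. 6.2)
    std::vector<CoLocatedUser> users;  // one modified SL client per co-located user
};
```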
6 IMPLEMENTATION

6.1 General

The ACME system is implemented by modifying the open source Second Life viewer [21]. The viewer is kept backward compatible with the original Second Life so that, even though more advanced features might require server side changes, all major ACME features are also available when the user is logged in to the original Second Life world.

The SL client was run on Dell Precision M6400 laptops (Intel Mobile Core 2 Duo 2.66 GHz, 4 GB DDR3 533 MHz). Logitech QuickCam Pro for Notebooks USB cameras (640x480 RGB, 30 FPS) were used for video-see-through functionality, while a Unibrain Fire-i firewire camera (640x480 YUV, 7.5 FPS) was used for hand tracking. eMagin Z800 (800x600, 40° diagonal FOV) and MyVu Crystal 701 (640x480, 22.5° diagonal FOV) HMDs were used as video-see-through displays.

Usability studies of the system are currently limited to the project's internal testing of individual components. The author has evaluated the technical feasibility of each feature and comments have been collected during multiple public demonstrations, including a demo at ISMAR 2009. We have been able to identify key points where the application has possibilities to overcome limitations of current systems, and also points where improvements need to be made to create a really usable system. A proper user study will be conducted during 2010 with HIT Lab NZ, comparing the ACME system with other means of telecommunication. Detailed plans of the study have not yet been made.

6.2 Augmenting reality

To be able to use SL for video see-through AR, three steps are required: video capture, camera pose estimation and rendering of correctly registered virtual objects.

Currently the ACME system supports two different video sources: either the ARToolKit [22] video capture routines for USB devices or the CMU [23] firewire camera driver API. ARToolKit OpenGL subroutines are used for video rendering.
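For illustration, the following is a minimal sketch of the USB capture path written against the ARToolKit 2.x C API (AR/video.h, AR/gsub.h). The small wrapper class and its names are ours, error handling and the CMU 1394 path are omitted, and drawing the backdrop assumes the camera parameters have already been registered with argInit, as in the standard ARToolKit examples.

```cpp
// Minimal sketch of the USB video capture path using the ARToolKit 2.x C API.
// The VideoSource wrapper is illustrative; the CMU 1394 path is omitted.
#include <AR/ar.h>
#include <AR/video.h>
#include <AR/gsub.h>

class VideoSource {
public:
    bool open(const char* config = "-width=640 -height=480") {  // config string is platform dependent
        if (arVideoOpen(const_cast<char*>(config)) < 0) return false;
        arVideoInqSize(&width_, &height_);    // actual capture resolution
        return arVideoCapStart() == 0;
    }
    ARUint8* grab()    { return arVideoGetImage(); }  // NULL if no new frame is ready yet
    void     release() { arVideoCapNext(); }          // hand the frame buffer back to the driver
    // Draw the camera image as the AR backdrop (camera parameters must have
    // been registered with argInit beforehand).
    void drawBackground(ARUint8* frame) {
        argDrawMode2D();
        argDispImage(frame, 0, 0);
    }
    void close() { arVideoCapStop(); arVideoClose(); }
private:
    int width_ = 0, height_ = 0;
};
```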

The HMD camera pose is estimated by the ARToolKit marker tracking subroutines. Multiple markers are placed around the walls of the conference room and on the table so that at least one marker is always seen by the user wearing an HMD. We experimented with 20 cm by 20 cm and 50 cm by 50 cm markers at distances from 1 to 3 meters from the user. The distance between markers was about three times the width of the marker.

The real world coordinate system is defined by a marker that lies on the conference table. Registration with SL coordinates is done by fixing one SL object to the real world origin and using the object's coordinate axes as unit vectors. This anchor object is selected in the ACME configuration file. If the marker is not on the table, the anchor object must be transformed accordingly.

Occlusion is the ability of a physical object to cover those parts of virtual objects that are physically behind it. In the ACME system, occlusion is implemented by modeling the physical space in the virtual world and using the virtual model as a mask when rendering virtual objects. The virtual model itself is not visible in the augmented image, as otherwise it would cover the very physical objects we want to see. A similar method was used in [24].

The ACME system does not place any restrictions on what kind of virtual objects can be augmented. Any virtual object can also be used as an occlusion model. However, properly augmenting transparent objects has not yet been implemented.
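The sketch below illustrates these steps under stated assumptions: the arDetectMarker and arGetTransMat calls are the real ARToolKit 2.x API, while the Anchor type, the table-to-SL mapping and the depth-only occlusion pass are our own illustration of the registration and masking described above, not the actual ACME code.

```cpp
// Pose estimation, anchor-based registration and occlusion masking (sketch).
#include <AR/ar.h>
#include <GL/gl.h>

struct Vec3 { double x, y, z; };

// The anchor object: an SL object fixed to the real-world origin whose local
// axes serve as the unit vectors of the table coordinate system (Section 6.2).
struct Anchor {
    Vec3 origin;               // anchor position in SL region coordinates (m)
    Vec3 xAxis, yAxis, zAxis;  // anchor's local axes expressed in SL coordinates
};

// Estimate the marker-to-camera transform from the most confident visible
// marker. 'markerWidthMm' is the printed marker size; ARToolKit uses millimetres.
bool estimateCameraPose(ARUint8* frame, int threshold,
                        double markerWidthMm, double markerToCamera[3][4]) {
    ARMarkerInfo* info = 0;
    int num = 0;
    if (arDetectMarker(frame, threshold, &info, &num) < 0 || num == 0)
        return false;
    int best = 0;
    for (int i = 1; i < num; ++i)
        if (info[i].cf > info[best].cf) best = i;   // highest confidence wins
    double center[2] = { 0.0, 0.0 };
    arGetTransMat(&info[best], center, markerWidthMm, markerToCamera);
    return true;
}

// Express a point given in table coordinates (metres, origin at the table
// marker) in SL region coordinates through the anchor object.
Vec3 tableToSL(const Vec3& p, const Anchor& a) {
    return Vec3{ a.origin.x + p.x * a.xAxis.x + p.y * a.yAxis.x + p.z * a.zAxis.x,
                 a.origin.y + p.x * a.xAxis.y + p.y * a.yAxis.y + p.z * a.zAxis.y,
                 a.origin.z + p.x * a.xAxis.z + p.y * a.yAxis.z + p.z * a.zAxis.z };
}

// One common way to realize the occlusion mask: draw the virtual replica of
// the room into the depth buffer only, so it hides virtual objects behind
// real ones without itself becoming visible.
void renderOcclusionMask(void (*drawRoomModel)()) {
    glColorMask(GL_FALSE, GL_FALSE, GL_FALSE, GL_FALSE);
    drawRoomModel();
    glColorMask(GL_TRUE, GL_TRUE, GL_TRUE, GL_TRUE);
}
```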
6.3 Hand tracking

For hand tracking, a camera is set up over the conference room table. The camera is oriented downwards so that the whole table is visible in the camera image. The current implementation supports only one hand tracking camera.

Hand tracking video capture and processing is done in a separate thread from rendering, so that a lower video frame rate can be used without affecting rendering of the augmented video.

Hands are recognized from the video image by HSV (hue, saturation and value) segmentation. The HSV color space has been shown to perform well for skin detection [25]. Each HSV channel is thresholded and the results are combined into a single binary mask. A calibration utility was created for calibrating the threshold limits to take different lighting conditions into account.

The current implementation uses only a single camera for hand tracking, therefore proper 3D hand tracking has not yet been implemented. The user's hand is always assumed to hover 15 cm above the table, so that the user can do simple interactions with virtual objects on the table.
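A minimal sketch of this segmentation step, written against the OpenCV C++ interface, is shown below. The threshold limits stand in for values produced by the calibration utility, the input is assumed to be a BGR frame, and taking the hand point as the centroid of the largest skin-colored blob is our assumption; the paper does not spell out that rule.

```cpp
// HSV skin segmentation for hand tracking (sketch, OpenCV C++ interface).
// Threshold values are illustrative; the calibration utility would set them.
#include <opencv2/imgproc/imgproc.hpp>
#include <vector>

struct HsvLimits {
    cv::Scalar lower{0, 40, 60};      // per-channel lower limits (H, S, V)
    cv::Scalar upper{25, 255, 255};   // per-channel upper limits
};

// Threshold each HSV channel and combine the results into one binary mask.
cv::Mat segmentSkin(const cv::Mat& bgrFrame, const HsvLimits& lim) {
    cv::Mat hsv, mask;
    cv::cvtColor(bgrFrame, hsv, cv::COLOR_BGR2HSV);
    cv::inRange(hsv, lim.lower, lim.upper, mask);   // AND of the three channel tests
    cv::medianBlur(mask, mask, 5);                  // suppress speckle noise
    return mask;
}

// Image-space hand position taken as the centroid of the largest blob
// (assumption); 'found' is false when no sufficiently large region exists.
cv::Point2f findHand(const cv::Mat& mask, bool& found, double minArea = 500.0) {
    std::vector<std::vector<cv::Point> > contours;
    cv::findContours(mask.clone(), contours, cv::RETR_EXTERNAL, cv::CHAIN_APPROX_SIMPLE);
    int best = -1;
    double bestArea = minArea;
    for (size_t i = 0; i < contours.size(); ++i) {
        double area = cv::contourArea(contours[i]);
        if (area > bestArea) { bestArea = area; best = (int)i; }
    }
    found = (best >= 0);
    if (!found) return cv::Point2f();
    cv::Moments m = cv::moments(contours[best]);
    return cv::Point2f(float(m.m10 / m.m00), float(m.m01 / m.m00));
}
```

With the single overhead camera, such an image-space point can then be mapped onto the plane assumed to lie 15 cm above the table to obtain the hand position used for the interactions described below.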

6.4 Gesture interaction

Interaction in the ACME system is divided into two categories: interacting with other avatars and interacting with virtual objects. Avatar interaction is more relaxed, as the intent of body language is conveyed even when avatar movements don't precisely match the user's motion. Object interaction requires finer control, as objects can be small and in many cases the precise relative position of objects is of importance.

The orientation of the user's face is a strong cue about where the user is currently focusing. When the user is wearing a video-see-through HMD, we use the orientation of the camera, already computed for augmented reality visualization, to rotate the avatar's head accordingly.

User hands are tracked by the hand tracking camera as explained in Section 6.3. This hand position information is used to move the avatar's hand towards the same position. The Second Life viewer has a simple built-in inverse kinematics (IK) logic to control the shoulder and elbow joints, so that the palm of the avatar is placed approximately at the correct position. As the current implementation limits the hand to a plane over the table, interaction is restricted to simple pointing gestures. Other animations, for example waving goodbye, can still be used by manually triggering animations from the SL client.

6.5 Object interaction

For easy interaction with objects, direct and correct visual feedback is needed. This is achieved by moving a feedback object with the user's hand. Any SL object can be used as the feedback object by attaching the object to the avatar's hand. This feedback object is moved only locally to avoid any network latency.

Currently we provide three object interaction techniques: pointing, grabbing and dragging. Interaction is controlled by two different gestures: thumb visible and thumb hidden. Gestures are interpreted from the point of view of the hand tracking camera, therefore the hand must be kept in a proper pose.

If the user moves her hand inside an object, the object is highlighted by rendering a white silhouette around the object. If there is a gesture transition from thumb visible to thumb hidden while an object is highlighted, the object is grabbed. The grabbed object is highlighted with a yellow silhouette. By moving the hand while an object is grabbed, the object can be dragged, that is, the object will move with the hand. Releasing the grabbed object is done with a gesture transition from thumb hidden to thumb visible.
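The grab-and-drag logic described above can be summarized as a small state machine. In the sketch below the thumb gesture is assumed to arrive from the hand tracker as a boolean, and the object query, highlight and move calls are illustrative stubs standing in for the corresponding viewer functionality.

```cpp
// Sketch of the pointing / grabbing / dragging state machine (Section 6.5).
// Viewer-side hooks are stubbed; names and types are illustrative.
#include <cstdint>

struct HandPos { float x, y, z; };                  // hand in table coordinates
enum class State { Idle, Highlighted, Grabbed };

class ObjectInteraction {
public:
    // Called once per hand-tracker frame; thumbVisible is the recognized gesture.
    void update(const HandPos& hand, bool thumbVisible) {
        uint32_t hit = objectUnderHand(hand);       // 0 = hand is not inside any object
        switch (state_) {
        case State::Idle:
            if (hit) { target_ = hit; highlight(target_, White); state_ = State::Highlighted; }
            break;
        case State::Highlighted:
            if (!hit) { clearHighlight(target_); state_ = State::Idle; }
            else if (prevThumbVisible_ && !thumbVisible) {   // visible -> hidden: grab
                highlight(target_, Yellow);
                state_ = State::Grabbed;
            }
            break;
        case State::Grabbed:
            moveLocally(target_, hand);                      // drag, locally only (no round trip)
            if (!prevThumbVisible_ && thumbVisible) {        // hidden -> visible: release
                clearHighlight(target_);
                commitMove(target_);
                state_ = State::Idle;
            }
            break;
        }
        prevThumbVisible_ = thumbVisible;
    }
private:
    enum Color { White, Yellow };
    // Stubs for viewer-side functionality (illustrative only):
    uint32_t objectUnderHand(const HandPos&) { return 0; }  // id of object containing the hand
    void highlight(uint32_t, Color) {}                      // draw white/yellow silhouette
    void clearHighlight(uint32_t) {}
    void moveLocally(uint32_t, const HandPos&) {}           // move the local feedback object
    void commitMove(uint32_t) {}                            // apply final position to the shared object (assumption)

    State state_ = State::Idle;
    uint32_t target_ = 0;
    bool prevThumbVisible_ = true;
};
```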

Figure 3. Interaction with virtual objects: Second Life view (left), and real life view (right). Feedback object shown as a red ball.

7 RESULTS

The current implementation of the ACME system is still quite limited. Even when using multiple large markers there can be registration errors of tens of pixels, creating annoying visual effects particularly at occlusion boundaries. Augmented objects also jerk a lot when markers become visible or disappear from the view. Better vision based tracking techniques or fusion with inertial sensors are clearly required for the system to be usable.

Visualizing virtual avatars with a head mounted video see-through display is limited by the current HMD technology. Affordable HMDs do not provide a wide enough field of view to be really usable in multi-user conferencing. On the other hand, when augmentation is done into a video teleconferencing image, the user is able to follow virtual participants as easily as the other video conference participants.

Hand tracking with non-adaptive HSV segmentation is extremely sensitive to lighting and skin color changes. Careful calibration is needed for each user, and recalibration needs to be done whenever the room lighting changes.

The current hand gesture recognition is prone to errors and lacks haptic feedback. This makes the interaction feel very unnatural and requires very fine control from the user. Also the current limitation of the hand motion to a 2D plane makes any sensible interaction rather difficult.

It should be noted that most of these shortcomings can be fixed by applying existing, more advanced algorithms. The only major issue without a direct solution is the low quality of currently available affordable HMDs.

8 CONCLUSIONS

In this paper, we have presented a system called ACME for teleconferencing between virtual and physical worlds, including two-way interaction with shared virtual objects, using augmented reality and gesture detection in combination with the Second Life viewer and the ARToolKit and OpenCV libraries. Currently the ACME system contains augmenting of avatars and virtual objects based on marker tracking, visualization including occlusions, and, for interaction, head tracking, 2D hand tracking from a monocular camera and a grab-and-hold gesture based interaction with virtual objects. Items for future work include enhanced AR visualization with markerless tracking, more elaborate hand gesture interactions and body language recognition, controlling avatar facial expressions, as well as various user interface issues.

Overall, we believe this early work with the ACME system has demonstrated the feasibility of using a mixed reality environment as a means to enhance a collaborative teleconference.
Certainly, the ACME system is not a replacement for a face-to-face meeting, but it should simplify and even enhance the 3D meeting experience to the point where mixed world teleconference meetings could be a low cost yet effective alternative for many business meetings. Our aim within the next few months is to employ the ACME system in our internal project meetings between overseas partners, which we have so far held in the pure virtual Second Life environment.

ACKNOWLEDGMENTS

The system has been developed in the project "MR-Conference", started in October 2008, with VTT as the main developer, IBM and Nokia Research Center as partner companies, and main funding provided by Tekes (Finnish Funding Agency for Technology and Innovation). Various people in the project team helped us with their ideas and discussions, special thanks going to Suzy Deffeyes at IBM and Martin Schrader at Nokia Research Center.

REFERENCES

[1] How Meeting In Second Life Transformed IBM's Technology Elite Into Virtual World Believers, http://secondlifegrid.net/casestudies/IBM.
[2] W. Piekarski, B. Gunther, B. Thomas (1999), "Integrating virtual and augmented realities in an outdoor application", Proc. IWAR 1999, pp. 45-49.
[3] P. Kauff and O. Schreer (2002), "An immersive 3D video-conferencing system using shared virtual team user environments", in Proc. CVE '02, pp. 338-354.
[4] M. Gross et al. (2003), "blue-c: a spatially immersive display and 3D video portal for telepresence", ACM Transactions on Graphics 22(3), Jul 2003, pp. 819-827.
[5] P. Eisert (2003), "Immersive 3-D video conferencing: challenges, concepts, and implementations", Proc. VCIP 2003, pp. 69-79.
[6] D. Schmalstieg, A. Fuhrmann, G. Hesina, Z. Szalavári, L. Encarnaçäo, M. Gervautz, W. Purgathofer (2002), "The Studierstube Augmented Reality Project", Presence: Teleoperators and Virtual Environments, Feb 2002, pp. 33-54.
[7] M. Billinghurst, I. Poupyrev, H. Kato, R. May (2000), "Mixing realities in shared space: an augmented reality interface for collaborative computing", in Proc. ICME 2000.
[8] M. Billinghurst and H. Kato (1999), "Real world teleconferencing", Proc. CHI '99, pp. 194-195.
[9] S. Prince et al. (2002), "Real-time 3D interaction for augmented and virtual reality", ACM SIGGRAPH 2002 Conference Abstracts and Applications, p. 238.
[10] D. Schmalstieg, G. Reitmayr, G. Hesina (2003), "Distributed applications for collaborative three-dimensional workspaces", Presence: Teleoperators and Virtual Environments 12(1), Feb 2003, pp. 52-67.
[11] T. Lang, B. MacIntyre, I. J. Zugaza (2008), "Massively Multiplayer Online Worlds as a Platform for Augmented Reality Experiences", IEEE VR '08, pp. 67-70.
[12] J. Stadon (2009), "Project SLARiPS: An investigation of mediated mixed reality", in Arts, Media and Humanities Proc. of the 8th IEEE ISMAR 2009, pp. 43-47.
[13] K. Simsarian, K.-P. Åkesson (1997), "Windows on the World: An example of Augmented Virtuality", Proceedings of Interfaces 97: Man-Machine Interaction.
[14] H. Regenbrecht, C. Ott, M. Wagner, T. Lum, P. Kohler, W. Wilke, E. Mueller (2003), "An Augmented Virtuality Approach to 3D Videoconferencing", in Proc. of the 2nd IEEE and ACM ISMAR, 2003.
[15] P. Quax, T. Jehaes, P. Jorissen, W. Lamotte (2003), "A Multi-User Framework Supporting Video-Based Avatars", in Proceedings of the 2nd Workshop on Network and System Support for Games, 2003, pp. 137-147.
[16] F. Pighin, R. Szeliski, D. Salesin (1999), "Resynthesizing Facial Animation through 3D Model-Based Tracking", Proceedings of the 7th ICCV, 1999, pp. 143-150.
[17] J. Lee, J. Chai, P. Reitsma, J. Hodgins, N. Pollard (2002), "Interactive Control of Avatars Animated with Human Motion Data", ACM Transactions on Graphics 21, Jul 2002, pp. 491-500.
[18] VR-WEAR SL head analysis viewer, http://sl.vr-wear.com/, unpublished.
[19] OpenSimulator, http://opensimulator.org/.
[20] Video of the ACME system, http://www.youtube.com/watch?v=DNB0_c-5TSk
[21] Second Life Source Downloads, http://wiki.secondlife.com/wiki/Source_archive.
[22] ARToolKit homepage, http://www.hitl.washington.edu/artoolkit/.
[23] CMU 1394 Digital Camera Driver, http://www.cs.cmu.edu/~iwan/1394/.
[24] A. Fuhrmann et al. (1999), "Occlusion in Collaborative Augmented Environments", Computers and Graphics 23(6), pp. 809-819.
[25] B. D. Zarit, B. J. Super, F. K. H. Quek (1999), "Comparison of Five Color Models in Skin Pixel Classification", in Proc. ICCV '99 International Workshop, pp. 58-63.
