COST-EFFICIENT VIDEO INTERACTIONS FOR VIRTUAL TRAINING ENVIRONMENT
Arsen Gasparyan
A Thesis
Submitted to the Graduate College of Bowling Green State University in partial fulfillment of the requirements for the degree of
MASTER OF SCIENCE
August 2007
Committee:
Hassan Rajaei, Advisor
Guy Zimmerman
Walter Maner
ABSTRACT
Rajaei Hassan, Advisor
In this thesis we propose a paradigm for video-based interaction in a distributed virtual training environment, develop a proof of concept, examine its limitations, and explore the opportunities it provides.
To make this interaction possible, we explore and develop methods that estimate the position of the user's head and hands via image recognition techniques and map this data onto a virtual 3D avatar, allowing the user to control virtual objects in much the same way as real ones. Moreover, we aim to develop a cost-efficient system using only cross-platform, freely available software components, establishing the interaction via common hardware (a computer and a monocular web-cam) and an ordinary Internet connection.
Consequently, we aim to increase the accessibility and cost efficiency of the system and to avoid expensive instruments such as data gloves and CAVE systems for interaction in virtual space.
The results of this work are the following: a method for estimating hand positions; proposed design solutions for the system; a proof of concept based on two test cases ("Ball game" and "Chemistry lab") which shows that the proposed ideas enable cost-efficient video-based interaction over the Internet; and a discussion of the advantages, limitations, and possible future research on video-based interactions in virtual environments.
This work is dedicated to my ancestors, teachers, colleagues and other kind people.
ACKNOWLEDGMENTS
This study could not have been completed without my scientific advisor, Dr. Rajaei, who had enough patience and goodwill to guide me through the whole process of scientific research.
I would also like to thank Dr. Maner for his persistent help and care, and Dr. Zimmerman for his support and valuable comments at the committee meetings.
A special acknowledgement goes to the ACM Digital Library, Wikipedia contributors, and the Open Source community members who made all the necessary information and software components available for this research.
TABLE OF CONTENTS
Page
CHAPTER 1. INTRODUCTION ...... 1
1.1 Hypothesis...... 1
1.2 Goals ...... 1
1.3 Research questions...... 3
1.4 Preliminary Research...... 3
CHAPTER 2. LITERATURE REVIEW ...... 5
2.1 Virtual Reality/Environment...... 5
2.2 Visual Aspects ...... 6
2.3 Computer Vision...... 10
2.4 Preliminary Conclusions...... 12
CHAPTER 3. METHODOLOGY ...... 14
3.1 The interaction paradigm ...... 14
3.2 Test cases ...... 14
3.2.1 Test Case 1: Simple Ball Game ...... 15
3.2.2 Test Case 2: Virtual Chemistry Lab...... 16
3.2.3 Observations ...... 17
3.3 Preliminary gesture tracking algorithm ...... 17
3.4 Proposed methods ...... 19
CHAPTER 4. IMPLEMENTATION...... 21
4.1 Problem Statement...... 22
4.1.1 Problems and solutions ...... 23
4.1.2 Test cases ...... 24
4.2 Overview of available platforms...... 25
4.2.1 Graphics engines...... 26
4.2.2 Physics engines ...... 27
4.2.3 Sound engines ...... 28
4.2.4 Video acquisition ...... 29
4.2.5 Network engines ...... 29
4.3 Networking ...... 30
4.3.1 Preliminary estimation of channel usage ...... 30
4.3.2 Network architecture...... 31
4.4 Prototype 1...... 32
4.4.1 Prototype 1 Goals...... 32
4.4.2 Prototype 1 Design...... 32
4.4.3 Prototype 1 Implementation...... 34
4.5 Prototype 2...... 36
4.5.1 Prototype 2 Goals...... 36
4.5.2 Prototype 2 Design...... 36
4.5.3 Prototype 2 Implementation...... 37
4.6 Prototype 3...... 39
4.6.1 Prototype 3 Goals...... 39
4.6.2 Prototype 3 Design...... 40
4.6.3 Prototype 3 Implementation...... 42
4.7 Prototype 4...... 43
4.7.1 Prototype 4 Goals...... 43
4.7.2 Prototype 4 Design...... 43
4.7.3 Prototype 4 Implementation...... 43
CHAPTER 5. RESULTS AND DISCUSSION...... 46
5.1 Proof of concept...... 46
5.2 Identified problems ...... 47
5.2.1 Physical avatar ...... 47
5.2.2 Mapping movements to virtual body ...... 47
5.2.3 Object representation ...... 48
5.3 Proposed solutions ...... 49
5.3.1 Interaction paradigm ...... 49
5.3.2 Human communication...... 50
5.3.3 Scalability and networking ...... 50
5.4 Developed algorithm...... 51
5.5 Estimated feasibility of created interface...... 51
5.6 Integration with web services ...... 52
5.7 Future research...... 53
REFERENCES ...... 55
CHAPTER 1. INTRODUCTION
A virtual environment allows spatially distributed people to meet and collaborate over the Internet; however, there is no ultimate solution yet that lets ordinary users do so effectively and naturally.
There are various accessible applications which solve small parts of this problem (such as VoipBuster [35] for voice, Skype [27] for video, and Second Life [26] for virtual interaction). However, they do not allow people to see each other's faces and gestures in the virtual space while at the same time interacting with virtual objects with their hands as they do in real life.
On the other hand, there are many research projects which aim to enhance interaction in virtual environments or to increase the scalability of videoconferencing (such as CAVE [6] and AccessGrid [1]), but they require expensive equipment (sensors, stereo cameras, special routers with wide channels, etc.) which is normally not available to common users.
1.1 Hypothesis
The hypothesis of this study is that it is possible to interact with virtual objects using hand gestures acquired from a monocular web-cam, and that the computational power of modern PCs (or laptops) is sufficient to perform all the required image recognition and simulation procedures.
1.2 Goals
The goal of this thesis is to examine the potential of and propose a solution for a system which could provide video-based interaction in a distributed virtual environment, using only standard, available hardware and reducing bandwidth to increase scalability where possible. The main goal is to enable video-based hand tracking for interaction in an accessible and scalable virtual training environment.
The objectives of this research are to focus on the following aspects of video interaction:
1. Face detection and mapping of the small face image onto the virtual body
2. Gesture recognition and conversion into commands which control the virtual body
3. Groundwork for a hand-tracking technique using a web-cam
Face detection should allow video communication between people within a virtual world with relatively low bandwidth usage, since only the relevant area of the image acquired from the web-cam is selected. Potentially this should allow more people to see each other at the same time.
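The bandwidth saving can be estimated with rough numbers. The frame sizes, frame rate, and compression ratio below are illustrative assumptions for the sake of the sketch, not figures from this thesis:

```python
def stream_kbps(width, height, fps, bits_per_pixel=12, compression_ratio=20):
    # Rough bit rate of a compressed video stream: raw YUV-style pixel data
    # divided by an assumed codec compression ratio (illustrative values).
    return width * height * bits_per_pixel * fps / compression_ratio / 1000.0

full_frame = stream_kbps(320, 240, 15)   # the whole web-cam image
face_only = stream_kbps(64, 64, 15)      # the cropped face region
print(f"full frame: {full_frame:.1f} kbps, face only: {face_only:.1f} kbps")
# → full frame: 691.2 kbps, face only: 36.9 kbps
```

Under these assumptions the cropped stream is almost 19 times cheaper, which is why more participants could be shown simultaneously on the same channel.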
Gesture recognition should allow users to interact with virtual objects, as proposed in two examples which demonstrate the proof of concept. The first example enables a simple video-based interaction: kicking a ball with a hand gesture. The second example allows a more complicated interaction: pouring a substance into a chemical pot held by another person.
More specifically, hand tracking should allow making basic three-dimensional movements with a virtual hand, thus kicking, grabbing, or performing other actions on virtual objects, while the user performs them with his real hands and imaginary objects in front of the web-cam, getting visual feedback from the computer display.
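One way to recover the missing third dimension from a monocular camera is the pinhole-projection relation between apparent size and distance. The focal length and hand width below are hypothetical values chosen for illustration; the method proposed in this thesis combines this kind of perspective cue with assumptions about probable movements:

```python
def depth_from_size(apparent_width_px, real_width_m=0.09, focal_length_px=500.0):
    # Pinhole model: apparent_width = focal_length * real_width / depth,
    # so depth = focal_length * real_width / apparent_width.
    # real_width_m (hand width) and focal_length_px are assumed values.
    return focal_length_px * real_width_m / apparent_width_px

# As the hand moves toward the camera it appears larger, so depth shrinks:
near = depth_from_size(90)   # hand spans 90 px in the image
far = depth_from_size(45)    # hand spans 45 px in the image
print(near, far)  # → 0.5 1.0 (meters)
```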
The proposal is to use the detected face of the user and his/her motion to track the position of the hands in the image plane, and then to use knowledge about perspective projection, with some assumptions regarding probable movements, to extend this estimation to 3D and to assign the corresponding behavior to the virtual model of the physical world.
1.3 Research questions
- Is it possible to interact with virtual objects with the gestures acquired from a
monocular web-cam?
- Can hand tracking be used to transform the hand gestures into virtual environment in
such a way that it allows grabbing and holding the virtual objects without gloves or
other sensors?
- Do modern PCs (or laptops) have enough computational power to perform image
recognition, 3D rendering, networking and other necessary routines required for the
system?
- What software architecture could fit the purpose and which software components
could be used to implement the system?
- How is it possible to integrate the proposed solution with the web interface so that it
could be easily used for other online projects, including Virtual Training
Environment?
- What approaches could be used to increase the scalability of the system?
- What are the potential strengths and limitations of the proposed techniques?
1.4 Preliminary Research
The prior investigation (in summer 2006) involved experiments on image acquisition and recognition for controlling a virtual flight over a landscape with the user's hand gestures. It inspired the current research and suggested that organizing human-computer interaction via web-cam was doable, even though the interaction was not very stable or precise.
The obvious application of this kind of interface lies in the field of virtual reality, since it makes it possible to control a virtual body in a very natural way without expensive hardware. Since the virtual body in this case repeats the natural movements of the user, it makes it possible to:
1. Enhance existing voice over IP communication with gestures
2. Decrease the bandwidth required for teleconferencing sending only crucial
video data (mainly, the face)
3. Interact with virtual objects the same way as with real ones
We believe this functionality can become a part of the ongoing Web-Based Simulation (WBS) project [24, 23, 25] supervised by Dr. Rajaei, since the target of the WBS project is a Virtual Training Environment (VTE) which needs to be equipped with the mentioned interface.
CHAPTER 2. LITERATURE REVIEW
Inspired by the first practical results, we started a literature review to get a bigger picture of the field and to find out what has been done so far by other researchers and what could be done by us in this project.
The ACM Digital Library contains thousands of articles related to virtual environments; hundreds of them were considered, but only about 50 were selected for the literature review. The following aspects of virtual environments were of interest during this research:
• What virtual reality and virtual environments are
• Visual representation
• Image recognition or other methods of gesture input
• Physics simulation
• Simulated sound distribution
• Scalability & Networking
• Usability issues
• Related projects
2.1 Virtual Reality/Environment
There is a lot of confusion about what Virtual Reality (VR) is and what a Virtual Environment (VE) is. Wikipedia (the free encyclopedia) shows the same article for both terms, claiming that VR is a technology which allows a user to interact with a computer-simulated environment, but not clarifying what a VE is [33]. The Columbia Electronic Encyclopedia states that VE and VR mean the same thing: a computer-generated environment with and within which people can interact [34]. According to Burdea and Coiffet [4], Virtual Reality is a high-end user-computer interface that involves real-time simulation and multimodal interactions; at the same time, they state that telepresence and augmented reality are not Virtual Reality in the strictest sense, since they incorporate real images. By this definition, the system we aim to build will not be "strict" Virtual Reality. However, we will use the terms VR and VE as equivalents in a less strict manner, so that they can be applied to our project, keeping in mind that it can in fact be classified as augmented reality.
Many research and commercial companies have been working in the field of virtual reality for years. The purpose of this section is to review the valuable results of their work and to see what has been done and what could be done.
2.2 Visual Aspects
Graphical representation is crucial for virtual environments, since vision is agreed to be the most important sense [2], which is confirmed by the fact that the visual cortex is the largest sensory area in the human brain [18]. Many techniques and algorithms have been developed in the field to make VR look more realistic. We conducted a literature search to obtain an overview of existing projects and a better picture of what has already been done.
VR2S is a generic VR software system, an extension of the high-level rendering system VRS, focused on real-time human-computer interaction and multimedia responsiveness; for example, it provides virtual shadows for the real image of the head, as shown in Figure 2-1 [29].
Online Model Reconstruction is a system for generating real-time 3D reconstructions of the user and other real objects in an immersive virtual environment for visualization and interaction. It bypasses an explicit 3D modeling stage and does not use additional sensors or prior object knowledge, relying on a set of outside-looking-in cameras instead. The visual hull technique (proposed by Benjamin Lok) was used to rapidly determine which volume samples lie within the visual hull, and then to build an object reconstruction from any given viewpoint at interactive rates [17]. An example of rendered interaction is shown in Figure 2-2.
Unlike our project, Online Model Reconstruction utilizes specialized hardware, such as five wall-mounted NTSC cameras (720x486) and one additional camera (640x480), to extract the visual hull of the human body. Obviously, this kind of system is not portable or accessible for common users.
Figure 2-1: Rendered shadows (top-left image) of the user (from [29])
Figure 2-2: Reconstructed hands interact with environment (from [17])
CU-SeeMe is a distributed video-conferencing system which embeds live video and audio in a virtual space. The system uses its own protocol on top of UDP and utilizes a threshold-difference approach to transfer only valuable video data (compressed via lossless compression), as shown in Figure 2-3 [10]. Sound attenuation is used to improve the perception of the virtual space.
An immersive 3D video-conferencing system: this project is based on the principle of a shared virtual table environment, allowing tele-presence and human-centered communication, as shown in Figure 2-4. Images from two cameras are used to perform depth analysis and to create a virtual view of the remote conferee. The extracted data has a compact form and can be efficiently encoded via an MPEG-4 codec [13].
Figure 2-3: Segmented video in virtual 3D space – only face is being transferred (from [10])
Figure 2-4: Real images are merged with virtual objects (from [13])
Figure 2-5: Side-by-side comparison of photographs (left) and heads (right) for OpenGL rendering [30]
Face modeling and mimicking. It is possible to do amazing things with digital images of faces. Models of human faces based on photos are shown in Figure 2-5, and Figure 2-6 shows face manipulation based on a single image, processed in several stages: 3D reconstruction, rotation, modification, rotation back, and rendering [3]. Most likely those techniques are not real-time; however, real-time techniques also exist and produce a fairly good picture (Figure 2-7) [11].
Figure 2-6: Face manipulations based on a single image [3]
Figure 2-7: Real-time video mimicking [11]
There is a variety of other projects which do not concentrate on graphics but are somewhat close to the virtual environment, implementing some of its important features.
Second Life [26], There [31], and massively multiplayer games provide scalable 3D worlds with interactive virtual objects and means of interaction. However, they do not allow audio/video conferencing, nor do they provide means for realistic and natural interaction: users have to use keyboard and mouse to control their avatars and deal with simplistic icons and menu items to perform interactions.
Voice-over-IP applications allow real-time audio communication, and some of them (such as Skype [27] and Access Grid [1]) allow video conferencing. These projects specialize in teleconferencing and do not provide a 3D space or virtual objects for interaction.
CAVE [6] and similar projects create an illusion of immersion in the virtual world; some of them even use physics for interaction [15]. However, unlike our project, they require special hardware and cannot become globally accessible.
Potentially, the synthesis and further development of existing ideas and technologies could lead to a scalable and universal immersive virtual environment with a natural interface, integrating audio/video communication with interaction in a 3D world. Our contribution to such a natural interface lies in the field of video-based interaction via common hardware. We aim to provide a natural way of interacting in the virtual environment without uncommon or expensive hardware. We plan to trade computing power (which seems to be the cheapest and fastest-growing resource) for the special hardware, replacing it with a combination of an ordinary web camera and computer vision techniques.
2.3 Computer Vision
Most of the existing projects which aim to use cameras for control require professional hardware such as stereo cameras; however, there are projects which use a monocular camera for user input [7, 38, 14], as well as the computer vision techniques which make it possible [37, 32].
For example, one of the projects [7] uses Intel's OpenCV [21] library to detect the user's face, thus locating his/her position in order to allow pointing via the user's fingertip. The library is Open Source and free for non-commercial use. It utilizes a cascade of boosted classifiers working with Haar-like features, originally proposed by Paul Viola [9].
Viola and Jones used four types of features (shown in Figure 2-8) and introduced the notion of the Integral Image, each pixel (X, Y) of which is the sum of the brightness of all pixels with x ≤ X and y ≤ Y. After the Integral Image is built, it takes only constant time to extract image features [32]. A cascade of features (preliminarily selected via a special learning algorithm) is used to detect faces or other objects (Figure 2-10). Later, Wilson and Fernandez [37] extended the algorithm with diagonal features.
Figure 2-8. Rectangle features relative to the detection window. The sum of pixel brightness within the white rectangles is subtracted from the sum of pixel brightness within the gray rectangles. [32]
Figure 2-9. The Integral Image allows calculating the sum of pixels in region D in constant time using only 4 array references: 4 + 1 − (2 + 3). [32]
Figure 2-10. Two most important features used for face detection. [32]
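The constant-time feature extraction rests entirely on the Integral Image. The following is a minimal pure-Python sketch of the construction and of the 4-reference rectangle sum described above (the function names are ours for illustration, not OpenCV's API):

```python
def integral_image(img):
    # ii[y][x] = sum of brightness of all pixels (px, py) with px <= x, py <= y
    h, w = len(img), len(img[0])
    ii = [[0] * w for _ in range(h)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += img[y][x]
            ii[y][x] = row_sum + (ii[y - 1][x] if y > 0 else 0)
    return ii

def rect_sum(ii, top, left, bottom, right):
    # Sum over an inclusive rectangle using 4 array references,
    # i.e. 4 + 1 - (2 + 3) in the notation of Figure 2-9.
    s = ii[bottom][right]
    if top > 0:
        s -= ii[top - 1][right]
    if left > 0:
        s -= ii[bottom][left - 1]
    if top > 0 and left > 0:
        s += ii[top - 1][left - 1]
    return s
```

A Haar-like feature is then just the difference of two or three such rectangle sums, so its cost is independent of the rectangle size.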
2.4 Preliminary Conclusions
To get a better understanding of the existing systems, it is necessary to try them in practice (download the projects when possible, register in the online systems, games, etc.). However, this is not always possible, especially in a limited timeframe.
According to covered literature, the following has been done:
• VR with gloves, glasses, "caves" and other special equipment
• Advanced processing of video, including image segmentation (based on a single
camera), face/hands tracking, 3D reconstruction and video-based avatars (based on
multiple cameras).
• Real-time mimicking; Mimicking based on a single photograph
• Gesture recognition (to some degree)
• EEG as an input device for primitive games (to some degree)
• Decentralized networking, including multi-server and P2P technologies, dynamic load
distribution, zoning and message distribution algorithms.
• Collision detection and modeled physics
• Acoustic modeling: sound distribution and attenuation
• Scalable video-conferencing, IP-telephony
What could be done is to integrate existing achievements from different fields in a single project, possibly modifying them, and to see if it works:
• Use image tracking and recognition techniques to get a compact representation of the
user thus modeling the customizable virtual human based on the video from a single
camera.
• Place the virtual human into the modeled physical world.
• Provide an interface for attaching externally simulated models into the virtual
environment so that virtual humans could interact with them.
• Use decentralized architecture to simulate this world in a scalable manner.
• Extend the world with acoustic interaction.
• Do all this in such a way, that ordinary desktop owners could use the system.
The long-term goal of the project is to create a scalable virtual environment with a natural interface, generally accessible to common users. This system could be used for education, training, collaboration and entertainment by everyone who has a more or less modern computer with a 3D accelerator and a reasonable Internet connection.
The short-term goal of the project is to implement the key parts of the system as a basic prototype in order to provide a proof of concept: to answer the question of to what degree it is doable, or what constraints prevent it from being fully implemented. This thesis chose the latter approach, providing evidence that it can be done.
CHAPTER 3. METHODOLOGY
3.1 The interaction paradigm
Before we start discussing the implementation, we should present the methodology we used to carry out our research. The proposed methodology has to answer the question: how is it possible to make (at least theoretically) the interaction with virtual objects closer to the way we interact with real ones, i.e., how can we kick a virtual ball, hold it, transfer it, or place arbitrary virtual objects in desired positions with real hand gestures?
We focus on solving this problem with common hardware. The literature review showed that the web camera, being a widespread and inexpensive device, has great potential for use as a natural interface to a virtual environment. The proposed solution lies at the intersection of the fields of computer vision and virtual environments.
Even though the general scope of the problem is broad and could apply to a variety of actions and objects with different physical properties, we narrow it down to the set of specific functions required for the test cases explained below.
3.2 Test cases
In order to get some evidence that our ideas could work, we developed two test cases which demonstrate the idea, identify the basic problems, and indicate whether we solved them or not.
The first test case is designed to test very basic and simple functionality: interaction using physics with rough hand tracking. The results of the first test case will show whether a hand-tracking interface is doable or not.
The second test case involves more accurate and complicated interaction, simulating a possible real application of the proposed solutions. The results of the second test case will show whether a hand-tracking interface is feasible for a virtual training environment or not.
3.2.1 Test Case 1: Simple Ball Game
Real-world gaming uses physical laws as an interface to the game objects; e.g., in football and many other games, the game object is an elastic ball. The player interacts with it by kicking it, i.e., moving his/her extremity towards it, experiencing collision forces which push the extremity back and the ball forward. The ball receives the impulse, accelerates, and moves accordingly. Friction with the air and collisions with the ground and other objects affect its speed and rotation.
Figure 3-1. Simplistic representation of the environment-mediated interaction with a ball: the sender applies a force to the ball (a mass subject to gravity), which carries the interaction to the recipient.
The interaction between players in real life happens via passing the ball, which is in fact the same as kicking the ball in the direction of the recipient. This observation demonstrates that the real world, with its physical laws and properties, is a mediator for the interaction between the players, as shown in Figure 3-1.
In our work we target implementing the first test case, allowing the user to control the extremities of a virtual body with his hands and to pass a virtual ball to another user using them. As mentioned above, the images from the monocular web-cam are acquired and processed to enable the interaction.
3.2.2 Test Case 2: Virtual Chemistry Lab
In our test case for the chemistry lab, we want a chemistry student to pour chemicals from his pot into a pot held by another student in the cyber lab.
Figure 3-2. Chemicals pouring from one pot to another following the gravity law
Unlike the previous case of passing the ball, more actions are required here:
1) Student A grabs the empty pot with his fingers (causing small collisions and friction, which do not let the pot fall).
2) Student B grabs the pot with the chemical in a similar way.
3) Student B moves his pot towards the pot of Student A and flips it (the pot simply follows the trajectory of the hand, since the pot's movement is constrained by the hand's fingers).
4) The contents of the pot fall into the empty pot due to the force of gravity (Figure 3-2).
3.2.3 Observations
Based on those examples, we observe that a small number of fundamental physical laws allows establishing a variety of different interactions, potentially all the interactions we can observe in real life. Having understood how they are actually performed in the real world, we can model them in the virtual world as well. Potentially, this allows us to find a general solution for realistic and natural interaction in a virtual environment instead of developing specialized interfaces for each possible type of user action and object (as is commonly done in most virtual reality and gaming projects).
3.3 Preliminary gesture tracking algorithm
During the preliminary research, an algorithm was developed which estimates the angle of the hands when they move in a vertical plane (parallel to the camera's lens). This kind of movement was used to control the wings of a virtual bird flying over a landscape.
The algorithm uses some a-priori knowledge and assumptions about the image and the human body. The basic idea of the algorithm can be described as follows:
1) Generate the motion map as the difference between consecutive frames, filtering out everything beneath a threshold and adjusting the threshold if the difference is too big or too small.
2) Remove noise (isolated pixels in the motion map; pixels which form small segments).
3) Assume that in the long run the average position of the body center should be in the middle of the detected motion, and that it does not change as frequently as the position of the hands (whose size is about half of the average motion width). Based on this, adjust the estimates of the body position and hand size.
4) Given the estimated body position, calculate the average angle of all the pixels to the left and to the right of the body, and visualize it as the estimate of the human hand positions (as shown in Figure 3-3).
Figure 3-3. On the left: two consecutive images of the movement overlaid; on the right: the difference between the two consecutive images (red) and the angles estimated using the assumptions about the body position.
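The four steps above can be sketched in code. The frames are grayscale images as nested lists; the fixed threshold and the omitted noise pass are simplifications relative to the actual prototype:

```python
import math

def motion_map(prev, curr, threshold=25):
    # Step 1: threshold the absolute difference of two consecutive frames
    return [[1 if abs(curr[y][x] - prev[y][x]) > threshold else 0
             for x in range(len(prev[0]))] for y in range(len(prev))]

def hand_angles(mask, body_x, body_y):
    # Step 4: average the angle (relative to the assumed body center) over
    # the moving pixels to the left and to the right of the body.
    left, right = [], []
    for y, row in enumerate(mask):
        for x, moving in enumerate(row):
            if moving:
                angle = math.atan2(body_y - y, x - body_x)  # image y points down
                (left if x < body_x else right).append(angle)

    def avg(angles):
        return sum(angles) / len(angles) if angles else None

    return avg(left), avg(right)
```

Steps 2 and 3 (noise removal and the running estimate of the body center) would sit between these two functions; they are omitted here to keep the sketch short.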
This algorithm is quite simple; nevertheless, it is a good starting point for our experiments and serves our purpose well as a step forward. Even though the result of this algorithm is not very accurate, it allows controlling the direction of flight with hand movements, giving somewhat satisfactory accuracy for this purpose, as shown in Figure 3-4.
This algorithm, however, has several limitations:
1. It estimates only 2 angles in a single plane.
2. The hands should be more or less straightened and not overlapping, and unlike the body they should be in motion most of the time.
3. Both hands should be used with similar frequency; otherwise, the estimated position of the body center moves towards the more active hand, thus drastically decreasing the accuracy of the method.
Furthermore, it is assumed that there is no other movement in the camera's field of view, and that the lighting conditions do not change.
Figure 3-4. Flying over a 3D landscape using hand gestures acquired from the web camera. Blue lines are the estimated positions of the hands (wings), whereas green and red rectangles represent the estimated long-term and short-term body positions, respectively.
3.4 Proposed methods
Based on the preliminary research and the observations from the test cases, the decision was made to use a combination of hand tracking and a physical model as the basis of the interaction paradigm in this research. Together with the corresponding experiments, proof of concept, and the following proposals, it forms the essence of this work.
Thus we suggest:
1. To enhance the developed hand tracking algorithm to 3 dimensions
2. To use it to control the extremities of the virtual avatar with real hand movements in
front of the web-cam
3. To use the extremities of the virtual avatar for interaction with virtual objects
4. To establish the interaction between users and objects in terms of the interaction of physical bodies
5. To enhance the communication of users in virtual space by detecting their faces and mapping them onto the virtual avatars
6. To use client-side simulation to decrease the bandwidth requirements and server load, thus increasing the scalability of the system
We also propose to use only freely available components to implement a system providing the mentioned functionality.
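Point 6 above can be illustrated with a dead-reckoning scheme, a common way to implement client-side simulation: every client extrapolates remote objects locally, and the owning client sends an update only when the true state drifts past a tolerance. This sketch is our illustration of the general idea, not code from the prototypes:

```python
def extrapolate(pos, vel, dt):
    # What every client computes locally from the last received state
    return tuple(p + v * dt for p, v in zip(pos, vel))

def needs_update(true_pos, last_sent_pos, last_sent_vel, dt, tolerance=0.05):
    # The owning client sends a new state packet only when its true position
    # drifts beyond the tolerance from what the other clients predict.
    predicted = extrapolate(last_sent_pos, last_sent_vel, dt)
    error = sum((t - p) ** 2 for t, p in zip(true_pos, predicted)) ** 0.5
    return error > tolerance

# A ball flying exactly as predicted generates no traffic...
print(needs_update((1.0, 2.0, 0.0), (0.0, 2.0, 0.0), (1.0, 0.0, 0.0), 1.0))  # → False
# ...while a deflected one triggers a correction packet.
print(needs_update((1.0, 2.5, 0.0), (0.0, 2.0, 0.0), (1.0, 0.0, 0.0), 1.0))  # → True
```

Since a correction is sent only on prediction failure, both the server load and the per-client bandwidth grow with the amount of unpredictable motion rather than with the update rate.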
In the following sections we present the details of our implementation of this approach, including the criteria for component selection and the architecture, with a justification for each of the proposed solutions.
CHAPTER 4. IMPLEMENTATION
Since the subject of this study is video input in the virtual environment, we had to find either an existing virtual environment or software modules which could provide the functionality and flexibility necessary for the project, considering the learning curve and the limited timeframe.
One of the first issues we had was to decide whether it is better to use an existing system or to develop a prototype of a new one using freely available engines and libraries.
Before we made a selection, we identified our basic requirements for the system:
1. The system should run on common hardware.
2. The system should be capable of 3D-rendering of dynamic scenes on the fly.
3. Unlike other systems, interaction in the system should be implemented as in the real world: via physics.
4. The system should use the network efficiently, considering the real-time nature of the application.
It is easy to notice that most of these requirements are very common for computer games. The use of gaming techniques could be considered "non-scientific"; however, gaming is in fact a highly developed sector of the economy which drives the whole computing industry forward and extracts as much as it can from existing computer hardware and user interfaces. The virtual reality field is in fact very close to gaming, which is confirmed by the use of VR equipment in computer games and the growing popularity of the Wii Remote (a relatively cheap 3D manipulator) which recently appeared on the market.
4.1 Problem Statement
We condensed our view of the project's long-term future into a hierarchy of implementation goals, with two top-level requirements that come from one of the expected uses of the system (the Virtual Training Environment), in order to determine the software modules we might require to implement them:
• allow ordinary, spatially distributed people to communicate online, seeing and hearing each other
  o make the system accessible
    - decreasing bandwidth
    - decreasing hardware requirements
    - making it run on different platforms
    - making it free or cheap enough
  o enhance the user experience
    - allow users to interact using natural body movements
    - allow them to see each other in 3D space
    - enhance their communication with realistic 3D sound effects
  o implement the system in a scalable way so that it can serve as many people as required
• allow this system to be used for training and education purposes
o embed virtual objects
o allow users to interact with them as well
  o integrate it with the web
Ideally, it should eventually become a scalable system able to serve millions of people at the same time, allowing them to communicate with each other and to interact with virtual objects in much the same way as they interact with real ones.
4.1.1 Problems and solutions
Keeping in mind the project's goals, we can identify the basic problems and propose solutions for them, as shown in Figure 4-1.
Problem: Make it cross-platform, free or cheap enough
Solution: Use free components when possible; avoid requiring a costly multi-server infrastructure for system maintenance

Problem: Embed voice over IP and 3D sound
Solution: Find and adapt a cross-platform sound engine

Problem: Allow users to see each other in 3D space
Solution: Find and adapt a cross-platform 3D engine; embed optional video acquisition and mapping onto the virtual body

Problem: Allow users to interact with virtual objects
Solution: Find and adapt a cross-platform physics engine

Problem: Decrease the required bandwidth
Solution:
• track the body and head position and send only the necessary data
• use modern audio/video codecs and data compression
• if bandwidth is insufficient, video quality can be sacrificed, but other functions should keep working
• use an existing cross-platform game networking library to utilize efficient UDP-based data transmission

Problem: Decrease hardware requirements
Solution: Unlike many VR applications, use a simple monocular webcam for hand/head/body tracking

Problem: Implement the system in a scalable way so that it can serve huge numbers of users
Solution: Use a distributed architecture with an approach similar to P2P or grid computing

Problem: Allow users to interact via natural body movements
Solution: Implement head, hand and body tracking

Figure 4-1. Identified problems and solutions
Obviously, not all of the above problems can be solved within a very limited timeframe, and some lie beyond the scope of this study; however, a proof of concept and a working (though perhaps not fully functional) prototype can be completed.
4.1.2 Test cases
As mentioned earlier, the complete system should support the following test cases:
• Ball game: users should be able to share a virtual room and play with a single ball
• Chemistry lesson: users should be able to mix chemicals collaboratively, pouring the chemicals from one pot (held by one user) into another (held by another user).
The final version of the project should allow users to communicate orally and to see each other's facial expressions as well as hand movements in both of these cases, as shown in Figure 4-2.
Figure 4-2. Project preview: ball game1
4.2 Overview of available platforms
The computer gaming industry has made significant progress in the field of interactive virtual worlds. It incorporates the most recent advances in 3D graphics, 3D sound and real-time networking to build fast and accessible entertainment products. The most sophisticated commercial game engines cost many thousands of dollars, but others, as they become outdated, often move into the public domain, though with restrictions on commercial use. For example, the sources of the Doom and Quake I-III engines were published and forked into many projects developed by enthusiasts [16].
Game engines usually incorporate many different features (for example: graphics, sound, networking, collision detection, a scripting language) and allow a variety of games to be developed on the same engine using its scripting language. At the same time, engines are usually
designed and optimized for a specific application to meet specific requirements. For example, an engine for first-person shooters should be very fast and responsive, but need not be as scalable as an engine for a massively multiplayer role-playing game, for which responsiveness is less crucial than networking and scalability.
1 Photomontage based on images downloaded on October 30, 2006 from: http://www.animationindia.biz/images/gallery/man.jpg (image of 3D man); http://www.threedy.com/site/forum/attachment.php?attachmentid=85627&stc=1 (image of user hands); http://www.nv2.cc.va.us/home/hzell/art279/level/level_room2.jpg (image of the room); http://tcos.com/sbforum/images/downloaded/arm1.jpg (image of the hand for a wheel creature)
Our engine should be both responsive (though perhaps not as responsive as a first-person shooter, where milliseconds sometimes matter greatly) and scalable. It should incorporate some primitive general-purpose physics and should be able to work with the web-cam. In order to avoid the limitations of complete game engines optimized for a specific purpose, and to get a fully customizable, free, cross-platform architecture, the decision was made to use several freely available components instead of a monolithic game engine.
The desired components should take care of:
- 3D graphics
- Sound
- Collision detection and physics
- Networking
4.2.1 Graphics engines
There are plenty of free graphics engines, and many of them were considered, tested and filtered out during the research. Figure 4-3 contains my comparison of the three engines that are most popular according to the Top 10 Open Source Engines list provided by DevMaster.Net [8]. Since Crystal Space turned out to be a full game engine (which is not what we need), the most difficult choice was between Irrlicht [12] and the Object-Oriented Graphics Rendering Engine (OGRE) [19].
Both are free and cross-platform, with impressive demos. Irrlicht seems very fast and relatively simple, but it is developed and maintained by a single person. OGRE has more advanced graphics and a much bigger community, but it is a bit more complicated and possibly slower on older PCs.
                      Irrlicht      OGRE        Crystal Space
Graphics              Good          Very Good   Good
Performance           Very Good     Good        No info
Ease of Use           Easy          Average     Hard?
Status                Alpha         Stable      Stable
License               Zlib/libpng   LGPL        LGPL
Google "3d engine"    33,800        50,100      29,300
Google "game engine"  52,600        88,300      64,400
Figure 4-3. Comparison of 3D engines
It was not easy to decide which one to choose based on the available information. OGRE seemed to have better prospects due to its larger community and high-end graphical features, whereas Irrlicht had everything we needed at that moment and looked like the easiest one to start with.
Demo examples for both were downloaded, compiled and tested successfully before the decision was made to use Irrlicht as the simplest solution.
4.2.2 Physics engines
Physics engines provide collision detection and simulation of fairly realistic behavior of virtual 3D objects. Figure 4-4 lists the top three engines (according to my own online research). The main dilemma was the choice between the Open Dynamics Engine (ODE), which has a bigger community, and Bullet Physics, which has more sophisticated collision detection. Demo examples for both were downloaded and tested successfully; no principal difference between them was found.

                             ODE       Newton Game Dynamics        Bullet Physics
Trimesh collision detection  No        Limited to static objects   Yes
Performance                  High      Average                     No info
License                      LGPL      Proprietary, but free       LGPL
Google                       152,000   12,100                      27,800
Figure 4-4. Comparison of physics engines
ODE's demos look a bit more impressive and were successfully compiled, though the lack of triangular-mesh collision detection could be a serious disadvantage. However, it turned out that these engines do not compete with, but rather complement, each other, as shown in Figure 4-5.
Q: Does it inter-operate or compete with ODE?
A: It inter-operates. ODE can benefit from the collision detection features like
GJK convex primitives and persistent manifold. Bullet can benefit from ODE, it can use the lcp solver, and from the ODE user community for its feedback.
Figure 4-5. Fragment of Bullet FAQ
Further research showed that Newton Game Dynamics is the only available deterministic engine, which is a significant advantage in case we decide to use this feature to minimize network traffic via local simulation of the world. In our view this advantage outweighed its major disadvantage (it is proprietary and closed-source), so the decision was made to use the Newton Game Dynamics engine.
4.2.3 Sound engines
The following software could be used for the audio subsystem:
- FMOD (cross-platform library for playing music in different formats, free for noncommercial use)
- SDL (cross-platform abstraction of sound routines, LGPL)
- OpenAL (cross-platform API for 3D sound, LGPL)
OpenAL seems the most appropriate for our purpose. It provides the illusion of 3D sound via stereo output, sound attenuation, Doppler effects, etc. It has been used in the development of many games, including Unreal Tournament 2004, Doom 3 and Quake 4 [20]. However, our first prototype will have no sound, due to the limited timeframe and the focus of this research on the visual aspects of the virtual environment.
4.2.4 Video acquisition
The only cross-platform tool found was the Open Source Computer Vision (OpenCV) library [21], and we were very satisfied with it, since it provides not only a video acquisition interface but also a variety of image recognition algorithms, including face detection, which can be used in the project.
4.2.5 Network engines
Several network engines were considered to enable low-overhead, low-latency multi-channel communication between the network nodes:
- RakNet (cross-platform; Creative Commons Attribution License, or a commercial license starting at $100 for a single application; Google: 114,000); with Speex it can transmit voice at ~0.6-2 kB/s
- OpenTNL (GPL or $995 per developer; Google: 10,800)
- HawkNL (LGPL; Google: 18,000) with HawkVoice (~0.2-2 kB/s)
The RakNet engine has been selected because it has clear online tutorials, it is available for free, its commercial license is cheap, and it allows voice transfer over low-bandwidth channels. Moreover, it takes care of network time synchronization, network ID generation and other useful functionality.
4.3 Networking
One of our goals was to enable communication via a low-bandwidth channel, which makes it possible to use the system over the internet for ordinary users. By a low-bandwidth channel we mean an ordinary internet channel, which usually carries no more than 1 Mb/s (as opposed to high-bandwidth local networks, which now range from 100 Mb/s to 1 Gb/s). We do not target narrow modem connections, as they are outdated, and assume that our channel is at least 64 kb/s, which theoretically allows us to send up to 8 kilobytes of data per second.
4.3.1 Preliminary estimation of channel usage
According to the preliminary estimation, 64 kb/s should be enough for basic interaction:
1. 0.6-2 kB/s – audio (depending on quality)
2. 2-4 kB/s – video, which is enough to send 5-10 uncompressed 256-color images of 20*20 pixels per second (see Figure 4-6)
3. 1-4 kB/s – world state and overhead
We cannot guarantee any minimum FPS (frames-per-second) value, since it depends on the number of users in range as well as their bandwidths, but a user with 64 kb/s should get about 10 fps when communicating with a single person, or 1 fps when communicating with 10 people. Video compression techniques could improve those values.
Figure 4-6. Low-resolution face image (8-bit 20*17 pixels, on the left) could be mapped to the 3D bodies
(photo-montage in the middle and on the right).
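The budget above can be checked with simple arithmetic. The sketch below (illustrative names, not project code) splits the assumed 64 kb/s = 8000 bytes/s channel into audio, world-state and video shares, and computes how many uncompressed 8-bit frames fit into the video share:

```cpp
#include <cassert>

// Bytes per uncompressed 8-bit (256-color) video frame.
constexpr int frameBytes(int w, int h) { return w * h; }

// Frames per second that fit into a given video budget (bytes per second).
constexpr int framesPerSecond(int budgetBytes, int w, int h) {
    return budgetBytes / frameBytes(w, h);
}

// Total channel: 64 kb/s = 8000 bytes/s, split across audio, state, video.
constexpr int kChannelBytes = 8000;
constexpr int kAudioBytes   = 2000; // upper audio estimate (2 kB/s)
constexpr int kStateBytes   = 2000; // world state and overhead
constexpr int kVideoBytes   = kChannelBytes - kAudioBytes - kStateBytes;
```

With these assumptions a 20*20 frame costs 400 bytes, so a 2-4 kB/s video share yields the 5-10 fps quoted above.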
4.3.2 Network architecture
There are several commonly used network architectures which could be applied to the problem:
- Single server
- Multiserver
- Peer-to-peer
The least preferable architecture is the single server. Its disadvantages are obvious:
- Limited computing resources, which do not scale with the growth of the world or the number of connected clients.
- Limited bandwidth, which does not scale with the number of connected clients.
- Multiplied traffic: instead of direct data streams between peers, there is a downstream and an upstream for every client/server connection.
- Higher latency, since event information from one client must pass through the server before reaching another client.
A multiserver architecture could be a better solution, since it is a compromise between the single-server and peer-to-peer approaches. This architecture is used, for example, in Second Life by Linden Lab [26]. There are millions of registered users, yet when more than 30,000 are online simultaneously, they still experience scalability and performance problems.
Since the scalability of the system is important for our ambitious long-term goals, the most promising architecture for us is peer-to-peer, because it is supposed to be scalable and it does not require purchasing and maintaining thousands of expensive servers (as Linden Lab does).
Requiring little expense, it makes it possible to implement a virtual reality that could be free and accessible to everyone. Ideally, that would stimulate volunteers from the open-source community to continue development of the project.
4.4 Prototype 1
4.4.1 Prototype 1 Goals
The purpose of this prototype was to integrate the RakNet library with the Irrlicht engine and Newton Game Dynamics (though not yet using physics simulation), to test our architecture, and to see if it works satisfactorily with primitive virtual bodies equipped with a virtual hand and controlled via keyboard and mouse.
4.4.2 Prototype 1 Design
As a basis for our first prototype we decided to use a client/server architecture, as it is the simplest case to start with. Since a single-server solution may have scalability issues in the future, we implement it in such a way that it can later be distributed into a multi-server or peer-to-peer system.
Initially it will be a centralized system running on Windows XP. The server will allow multiple clients to connect. Most of the code will be cross-platform and, as mentioned before, is developed with the likely future distribution of the system in mind.
The first prototype of the system will consist of a client and a server with the following functionality:
1. Server
a. Accepts connections
b. Sends coordinates of objects to clients
c. Distributes changes in coordinates requested by clients
2. Client
a. Connects to server
b. Gets coordinates of objects to display
c. Renders 3D world from the virtual body’s point of view
d. Gets user input and changes the coordinates of corresponding bodies
[Diagram: a centralized system in which the server just relays messages between clients, each client running local graphics and physics; and the same system decentralized into peers connected directly over the internet, each with local graphics and physics.]
Figure 4-7. Idea of the centralized system (on the top) being converted to the decentralized system (on the bottom)
As mentioned earlier, the centralized architecture limits the scalability of the system and is just an intermediate step. Since the server does not simulate physics and is responsible only for message distribution, its role can be distributed among several servers or clients, thus converting the architecture into a multi-server or peer-to-peer network if needed (as shown in Figure 4-7).
4.4.3 Prototype 1 Implementation
During this stage the working name VEVI (which stands for Virtual Environment with
Video Interface) was given to the project.
The bodies were implemented as cubes, and the hand as a parallelepiped connected to the body with a joint. The graphics and the idea of how to integrate the Irrlicht engine with Newton Game Dynamics were borrowed from Mercior's example (http://www.mercior.com/files/newton-irrlicht.zip).
Figure 4-8. Prototype 1 – the green parallelepiped on the right is a user-controlled “hand”.
Having implemented the first prototype, we came to the following conclusions:
- The Irrlicht engine can be (and has been) successfully integrated with RakNet and Newton Game Dynamics.
- The server application (and the RakNet library in particular) is truly cross-platform: it successfully compiles and runs under both Windows XP and Linux operating systems.
- The proposed client/server architecture allows different users to connect and to see each other in the virtual space. However, despite the use of a specialized library for fast UDP-based networking, significant latency sometimes occurs while the bodies move.
The suggested explanation is that while a body moves, all the coordinate changes are transmitted too frequently. Two possible solutions were considered:
1) To limit the frequency of outgoing data (i.e., avoid sending changed coordinates more than 30 times a second).
2) Instead of sending the coordinates of the moving bodies, to send messages corresponding to the user's input. The second approach is preferable, since users usually do not generate much input at a time, and it allows smooth motion to be simulated locally based on the received messages. If users generate more than 30 messages a second, we can still introduce an unnoticeable 0.03-second delay which should not affect the smoothness of the movements.
Since user input can be represented in a very compact form (in comparison with an array of object coordinates) and is not likely to be generated too often, we decided that the second solution reduces the utilized bandwidth and is preferable for the second prototype.

4.5 Prototype 2
4.5.1 Prototype 2 Goals
The second prototype’s goals were:
- To implement the networking changes, thus addressing the high-latency issue.
- To implement the physics simulation on the client side.
- To adjust the physical parameters and physical bodies so that they can be used for further experiments.
- To check whether Newton Game Dynamics really behaves deterministically (otherwise the chosen networking solution will not work, due to a growing difference between the locally simulated worlds).
And, of course, to test whether all this works satisfactorily, so that we can move on to the video-related part.
4.5.2 Prototype 2 Design
The second prototype of the system has a slightly more complicated server. Instead of just dispatching all the coordinates to all the clients, it has to store the incoming messages in a historical queue and to send the missing parts of the queue to the clients who do not yet have them.
This is required to allow clients to synchronize the world state without sending the coordinates of every object. Only the users' actions need to be sent, since they are the only unpredictable factor affecting the deterministic behavior of the simulated physical world.
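A minimal sketch of such a server-side history queue follows (all names are illustrative, not the actual VEVI code): the server stamps each user action with a global sequence number, and a client that has fallen behind asks for everything after the last sequence number it has seen.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// A user action, stamped with a global sequence number by the server.
struct ActionMsg {
    std::uint64_t seq;       // position in the global history
    int           clientId;  // who generated the input
    std::string   input;     // encoded user action (e.g. a joint command)
};

class HistoryQueue {
public:
    // Store an incoming action and assign it the next sequence number.
    std::uint64_t append(int clientId, std::string input) {
        history_.push_back({nextSeq_, clientId, std::move(input)});
        return nextSeq_++;
    }
    // Return every message with a sequence number above lastSeenSeq,
    // i.e. the part of the history a lagging client is missing.
    std::vector<ActionMsg> missingAfter(std::uint64_t lastSeenSeq) const {
        std::vector<ActionMsg> out;
        for (const ActionMsg& m : history_)
            if (m.seq > lastSeenSeq) out.push_back(m);
        return out;
    }
private:
    std::vector<ActionMsg> history_;
    std::uint64_t nextSeq_ = 0;
};
```

A real server would also trim the queue once every client has acknowledged a prefix of the history; this sketch keeps everything for clarity.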
The client also became more complicated, not only because it needs to simulate the world, but also because it needs to synchronize it with the other clients. In order to do so, it stores two states of the world: the last simulated state and the guaranteed state. The world state is considered guaranteed (unchangeable) after a specific timeout (for example, 3 seconds). The past older than the timeout may not be changed, but the newer past may be, because messages from other clients arrive with some latency compared to locally generated and simulated ones; in that case the client must re-simulate (in the general case, starting from the state that existed before the message's timestamp; in the second prototype, starting from the guaranteed state).
The simulation algorithm for the client could be briefly represented as the following loop:
1. Increment simulation time.
2. Receive network and local messages and place them to the queue.
3. If any of the received messages was generated before the current simulation time, re-simulate the world from the guaranteed state (and propagate the guaranteed state up to the simulation time minus the timeout); otherwise, simulate the next state of the world.
4.5.3 Prototype 2 Implementation
Since the second prototype involved significant changes in networking and introduced physics, it took much experimentation and several intermediate prototypes to make it work.
The first major challenge was the physics and the physical control over the body. In order to obey physical laws we had to make the body horizontal, so that it looked like a bug, thus preserving a relatively simple body structure and avoiding falls after each step or kick of the ball. An interesting side effect of this design is that if the virtual bug falls on its back, it experiences the same difficulties as a real one.
As shown in Figure 4-9, the body model was implemented via joined parallelepipeds, used not only for the hands and torso but also for the "wheel-legs" which produced the illusion of stepping. The second problem was to address constraint issues for the joints, allowing us to control the "hands" and "legs" without breaking the stability of the model through excessive forces and accelerations.
Figure 4-9. Virtual bodies are physically stable and simple when shaped like bugs.
Fortunately, it turned out that Newton Game Dynamics has a special API for calculating the accelerations required to set a joint to a desired angle. Eventually the masses, sizes and friction coefficients were adjusted so that the robots were controllable enough to play with a ball, and that concluded the first big change to the project.
However, this body structure looks too different from a human being, and the human body is too complicated to model physically in full, so a better solution was found and used: the up-vector constraint provided by the Newton Game Dynamics API, which allowed constructing a stable vertical body that is a bit more human-like, as shown in Figure 4-10.
Having implemented the client and the server according to the algorithms explained in the previous section, tests showed that the algorithm is feasible for real-time networking with client-side re-simulation: the latency is almost unnoticeable. However, due to unknown bugs in the code and differences in the performance of the test computers, de-synchronization of the world still happens from time to time. If the de-synchronized client is disconnected and reconnected, it retrieves the whole event list from the server and reaches the correct state of the world, which proves that Newton Game Dynamics behaves deterministically on the tested machines and that the general idea of sending only information about user actions can work.
Figure 4-10. Two avatars play with a ball (on the left – camera is outside the body, on the right – camera is
inside one of the avatars)
That means that, setting aside minor defects due to the limited timeframe, we are at last ready to move on to the video part of the project.
4.6 Prototype 3
4.6.1 Prototype 3 Goals
The general goal of this prototype is to implement the first test case ("Ball game"). The objectives are the following:
- To integrate the project with the OpenCV library and to see if it is able to get the image from the web-cam.
- To integrate the face detection algorithm provided by OpenCV into the project, and to map the detected face onto arbitrary 3D objects (later it should be mapped onto the head of the virtual body only).
- To adapt and enhance the algorithm developed in the preliminary research for hand tracking in three dimensions, and to kick the ball with a hand gesture.
4.6.2 Prototype 3 Design
The main design innovation, apart from adding OpenCV calls, was the embedding and improvement of the hand tracking algorithm.
As explained in the preliminary research, the hand angles were estimated from the average angles of pixels on the filtered difference maps, relative to the estimated (almost guessed) position of the body. That algorithm worked well only during intense and frequent motions of both hands, and it allowed estimation of only two angles in one plane.
However, having integrated the Haar classifier cascade into the project, we got an opportunity to improve the estimation of the human pose using the knowledge acquired from face detection (as shown in Figure 4-11). This increases the accuracy of the method and makes it more stable, reducing its dependence on constant motion evenly distributed between the hands (Figure 4-12).
Figure 4-11. Detected face coordinates and size allow easy approximation of the positions of the body center and shoulders.
Figure 4-12. Hand tracking based on information from face detection
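The geometric idea of Figure 4-11 can be sketched as follows. The specific proportions below (shoulder height and spacing relative to the face box) are illustrative guesses, not the calibrated values used in the prototype:

```cpp
struct Rect  { int x, y, w, h; };  // face bounding box from the detector
struct Point { int x, y; };

// Rough body landmarks derived from the detected face box.
// Proportions are illustrative, not calibrated anthropometry.
Point bodyCenter(Rect face) {
    return { face.x + face.w / 2, face.y + 3 * face.h };
}
Point leftShoulder(Rect face) {
    return { face.x - face.w / 2, face.y + 3 * face.h / 2 };
}
Point rightShoulder(Rect face) {
    return { face.x + 3 * face.w / 2, face.y + 3 * face.h / 2 };
}
```

Because every landmark is expressed in units of the face box, the estimate automatically scales as the user moves closer to or farther from the camera.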
This algorithm can also be slightly modified to roughly estimate the angles in the new dimension, based on the visual deformation of the hand due to motion in the horizontal plane (as shown in Figures 4-13 and 4-14).
Figure 4-13. Visual hand deformation due to the movement in the horizontal plane
Figure 4-14. A rough guess of the angle in the horizontal plane can be achieved by estimating the visual deformation of the hand due to movement in that dimension.

4.6.3 Prototype 3 Implementation
The OpenCV library was integrated into the project without any serious problems. Code from the face detection example was also integrated easily. However, it took some time to figure out how to update Irrlicht's textures on the fly and how to set the camera's fps (the OpenCV library is still in development and lacks this feature).
The first attempt to implement the prototype caused a drastic drop in frame rate, down to 2-4 fps, which was obviously unacceptable for the project and could have resulted in its failure.
However, moving the camera-related routines out to a separate thread solved the issue, and the fps came back to normal (it is hard-coded to about 30 fps).
Later, processing of hand motion was moved to its own thread, since it is important to get this information as fast as possible: hands normally change their position faster than the head, while head detection requires a lot of computation.
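The core of this fix can be sketched as a shared "latest frame" slot between the capture thread and the render loop (illustrative code, using modern C++ synchronization primitives for brevity; the 2007 prototype would have used a platform threading API). The render loop never blocks on the camera: it simply takes whatever frame is newest, or skips the texture update if nothing new has arrived.

```cpp
#include <mutex>
#include <utility>
#include <vector>

// Shared slot holding the most recent camera frame.
class FrameSlot {
public:
    // Called by the capture thread after each grab.
    void publish(std::vector<unsigned char> frame) {
        std::lock_guard<std::mutex> lock(m_);
        latest_ = std::move(frame);
        fresh_ = true;
    }
    // Called by the render loop once per rendered frame.
    // Returns false if no new frame has arrived since the last take.
    bool takeLatest(std::vector<unsigned char>& out) {
        std::lock_guard<std::mutex> lock(m_);
        if (!fresh_) return false;
        out = latest_;
        fresh_ = false;
        return true;
    }
private:
    std::mutex m_;
    std::vector<unsigned char> latest_;
    bool fresh_ = false;
};
```

The capture thread loops over grab-and-publish at the camera's rate, while the renderer calls takeLatest per frame; a camera stall then costs a stale texture rather than a stalled render loop.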
This prototype showed that it is in fact possible to integrate image recognition with world (re-)simulation and rendering on an ordinary computer, and that even with the existing primitive hand tracking algorithm (which could probably be improved or replaced with other computer vision techniques) it is possible to kick the virtual ball with a real hand gesture.
The secondary conclusion is that mapping the low-resolution (~20x20 pixels) image of the detected face turned out to work quite stably: regardless of the position of the face in the original image retrieved from the web-cam, it always appears on the object it is mapped to. At the same time, the mapped low-resolution image of the face seems sufficient to recognize its owner and to guess its expression. However, further research involving human subjects could provide more accurate data on this issue.

4.7 Prototype 4
4.7.1 Prototype 4 Goals
The goals of the 4th prototype are the following:
- To create and import graphical 3D models for hands and the pot.
- To implement the simplified second test case (“Chemistry lab”) adding the pot to the
virtual environment and attempting to grab it with hands.
4.7.2 Prototype 4 Design
The design of prototype 4 does not differ much from that of prototype 3, since we use exactly the same principles for hand control and object interaction. Only two changes need to be made:
- To import designed graphical models (3D-meshes) from Blender
- To add the physical model of the pot from the primitives – cylinders and a box.
4.7.3 Prototype 4 Implementation
The new feature of this prototype is the use of graphical models which differ from their physical representation: for simulation purposes the hands are still considered stretched spheres, whereas 3D meshes are used for their visualization, as shown in Figure 4-15.
Figure 4-15. Visual representation of the avatar (on the left) and its shape for simulation (on the right)
A similar approach was used for pot visualization (Figure 4-17).
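The decoupling of physical and visual representations can be sketched as an entity holding both a coarse collision body and a detailed render mesh, with the visual transform copied from the simulated one each frame (all names here are illustrative, not the actual VEVI classes):

```cpp
#include <string>

struct Transform { float x, y, z; };

struct PhysicsBody {        // what the physics engine simulates
    std::string shape;      // e.g. "stretched sphere" for a hand
    Transform   t{0, 0, 0};
};

struct RenderNode {         // what the graphics engine draws
    std::string mesh;       // e.g. a detailed hand mesh from Blender
    Transform   t{0, 0, 0};
};

struct Entity {
    PhysicsBody body;
    RenderNode  node;
    // After each simulation step the visual node follows the physics body.
    void syncVisual() { node.t = body.t; }
};
```

This keeps the simulation cheap (few, simple shapes) while the rendered world stays detailed, at the cost of possible small mismatches between what the user sees and what actually collides.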
Another new feature is the pot: a new object with the special shape shown in Figure 4-16, which makes it possible to grab it with the hands using the laws of physics (gravity and friction). Despite the fact that this simplified object looks more like a chalice, we intentionally call it a pot to be consistent with our "Chemistry lab" example.
Figure 4-16. The physical shape of the pot
Figure 4-17. The pot being grabbed with the hand
Experiments showed that it is possible to grab the pot with one or two hands (Figure 4-17, Figure 4-18). However, it was much harder to do before we added a vertical attractor to the pot to prevent it from tilting. The drawback of the vertical attractor is that it limits the pot's range of applications: it is not possible to pour liquid from it the way we proposed for the second test case.
We expect that when the accuracy of the hand tracking algorithm is improved and more degrees of freedom are added to the hands, it will become easier to grab the pot and to pour liquid from it without the vertical attractor. This will also require more detailed physical models of the hands, pot and liquid.
Figure 4-18. The pot being grabbed with 2 hands (first person view)
CHAPTER 5. RESULTS AND DISCUSSION
In this work we examined the potential of, and proposed solutions for, a distributed virtual environment with a video-based interface. We proved the concept and showed that common hardware is sufficient to allow users to interact in virtual 3D space via hand movements acquired from a web-cam and tracked with the help of computer vision techniques. We identified the challenges, developed a hand tracking algorithm, explored the limitations of the chosen solutions, answered the research questions, proposed the framework and suggested topics for future research.
5.1 Proof of concept
This study showed that a video-based interface is feasible on common hardware with an internet connection, using simple algorithms. The proposed interaction paradigm works: hand movements acquired from the web-cam, mapped onto the virtual body and mediated via modeled physics, can be used to kick a virtual ball or grab a virtual pot, i.e. to manipulate virtual objects in the same fashion as we do real ones.
The proof of concept is significant, since it demonstrates that once a few issues are resolved, hand-tracking interaction can become a very useful technique; coupled with simulated physics and 3D graphics, it can provide an essential and globally accessible interface for virtual environments.

5.2 Identified problems
5.2.1 Physical avatar
As shown in this work, we proposed using simulated physics as a mediator for interaction in the virtual environment. This is a significant difference between this project and many others, which either do not use physics at all, or use it in a primitive fashion to give the virtual world a realistic look without employing it for actual interaction. In such projects the virtual body looks good because it has been carefully modeled by professional artists and animated in special 3D software, but its physical model (if any) is usually approximated by a stretched sphere, which makes it impossible to interact with objects as we do in real life.
Since this project did not focus on digital-arts presentation, it was difficult to make the avatar look realistic. Nevertheless, we managed to demonstrate that our prototype interacts with simulated objects using an ordinary web-cam. The physical model is also simplistic, but it is detailed down to the level of the extremities, which makes it possible to kick or grab objects with the virtual hands.
Enhanced and developed in more detail, this method of interaction should allow precise and diverse manipulation of objects and avatars.
5.2.2 Mapping movements to virtual body
It is hard to simultaneously control the 3-dimensional motion of two extremities via keyboard and mouse in real time. Simulating physical behavior in the virtual environment does not make much sense if we do not map real movements onto the virtual body. However, this is not a trivial task, because it forces us to use a non-traditional input device to track the user's movements. Even though there are a few devices on the market which allow 3D hand tracking (such as wired gloves [22] or the Wii Remote [36]), they are not that popular and have a limited range of applications. This does not mean that they should not be used in the virtual environment, but that topic could be the subject of other research.
The advantages of using a web-camera as an input device are the following:
- it is cheap and very common now
- theoretically it allows not only hand tracking, but tracking of the whole body
- it is needed anyway to enable full-fledged visual communication between people
Even though the field of computer vision offers many techniques for intelligent image processing, the problem of precise real-time 3D human pose estimation from the images of a monocular camera has not been solved yet. This work is an attempt to develop a combination of simple algorithms and freely available techniques for this purpose, and it shows that this is doable to some degree. Further development of the physical model and the image-processing algorithms should make it possible to control virtual objects from virtually anywhere using a laptop with an embedded web-cam.
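The mapping from a flat image to avatar joint angles can be illustrated with a small sketch. Everything here is hypothetical (the helper name `hand_to_shoulder_angles`, the assumption that the arm spans about two face-widths); it only shows how a 2D hand offset, scaled by the detected face box, could yield two rough shoulder angles, with the missing depth inferred from the apparent reach:

```python
import math

def hand_to_shoulder_angles(face_cx, face_cy, face_w, hand_x, hand_y,
                            arm_len_faces=2.0):
    """Map a 2D hand position (pixels) to rough shoulder angles.

    Hypothetical helper: the detected face box supplies the scale
    (face_w) and the origin (face center); the arm length is assumed
    to be a fixed multiple of the face width.
    """
    # Offset of the hand from the face center, in face-width units.
    dx = (hand_x - face_cx) / face_w
    dy = (face_cy - hand_y) / face_w  # image y grows downward
    # In-plane angle of the limb (0 = pointing right, 90 = straight up).
    plane_angle = math.degrees(math.atan2(dy, dx))
    # If the hand looks "short" of full reach, the arm must be rotated
    # toward or away from the camera; recover that angle from the ratio.
    reach = min(1.0, math.hypot(dx, dy) / arm_len_faces)
    depth_angle = math.degrees(math.acos(reach))
    return plane_angle, depth_angle
```

A hand two face-widths to the right of the face maps to a fully extended horizontal arm; a hand closer to the face maps to an arm rotated out of the image plane.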
5.2.3 Object representation
Creating even simple virtual objects (the avatar, a ball, a chemical pot) proved to be a challenge. Several issues needed to be addressed.
First, each object has physical properties such as shape, mass and friction, which need to be adjusted so that the object behaves realistically and consistently in the virtual environment.
Second, each object has a visual representation, which is usually implemented in higher detail than its physical model. Careful object modeling therefore has to find the right balance between detail and simplicity, while ensuring that the difference between the physical and graphical models does not significantly affect the user experience.
The third issue related to object representation is that a variety of potential objects cannot be represented adequately with the existing physics engine: for example, objects which are not rigid bodies, or objects whose behavior depends not only on Newton's laws but also on internal states or other types of interaction.
Most physics-simulation engines behave realistically only within a certain range of physical parameters (for example, they do not consider relativistic or quantum effects). Beyond a certain level of detail, realistic simulation of even simple physical objects or phenomena, considering all possible kinds of forces and physical effects, becomes computationally expensive.
However, if a specific kind of complex behavior is needed, we suggest embedding techniques used in other projects, such as scripted objects and qualitative physics [5], which could supplement the existing interaction paradigm and add more functionality and value to the virtual environment as a tool for education and collaboration.
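The separation between the coarse physical model and the more detailed visual representation discussed above can be sketched as a pair of structures. The names and fields below are illustrative, not the prototype's actual data layout:

```python
from dataclasses import dataclass

@dataclass
class PhysicalBody:
    """Simplified physical model: just enough for the simulation."""
    shape: str        # coarse collision shape, e.g. "sphere" or "box"
    mass: float       # kilograms
    friction: float   # 0.0 (ice) .. 1.0 (rubber)
    restitution: float = 0.3  # bounciness on collision

@dataclass
class VirtualObject:
    """Pairs a coarse physical body with a (usually more detailed)
    visual mesh, referenced by an asset name for the renderer."""
    name: str
    physics: PhysicalBody
    visual_mesh: str

# The ball from test case 1: physically a sphere, visually a detailed mesh.
ball = VirtualObject("ball", PhysicalBody("sphere", 0.4, 0.6), "ball_hires.mesh")
```

The point of the split is that tuning `mass` or `friction` never touches the art assets, and swapping in a higher-detail mesh never changes the simulated behavior.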
5.3 Proposed solutions
Important aspects of the virtual environment were considered and tested, and possible solutions for networking, simulation, bandwidth reduction, video conferencing and interaction were proposed.
5.3.1 Interaction paradigm
Using a physical world and physical models instead of purely visual avatars (as is done in several projects) potentially allows interaction with virtual objects as with real ones. Widespread web-cams are likely a cost-efficient way to make this interaction easy and natural. Our experiments showed that the proposed methods allow users to interact with each other in the simulated virtual environment, for example by kicking a virtual ball or by grabbing and throwing a virtual object.
We are not aware of any other project which uses a detailed physical model of the human body and integrates it with monocular video input to control the character, so we consider this one of our major contributions.
5.3.2 Human communication
We propose mapping the video-acquired face onto the virtual head and sending only this video data over the network in unreliable, lower-priority UDP packets. This should solve the scalability issue for video conferencing, enabling the faces of many people to be seen in the virtual reality simultaneously. This is partially done in other projects, such as CU-SeeMe [10].
We went further by integrating it with a physical virtual body whose head and hands reproduce the gestures of the real ones, thus improving non-verbal communication between the parties.
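A minimal sketch of such a face-frame datagram follows, assuming a hypothetical header layout (user id, frame counter, payload length). Because every frame is self-contained, a receiver can drop lost or late packets and simply keep the newest frame per user, which is exactly why unreliable UDP suffices here:

```python
import struct

# Hypothetical wire format: user id (4 bytes), frame number (2 bytes),
# payload length (2 bytes), then the compressed face image itself.
HEADER = struct.Struct("!IHH")

def pack_face_frame(user_id, frame_no, jpeg_bytes):
    """Wrap one compressed face image in a datagram payload.
    Frames are independent, so no retransmission logic is needed."""
    return HEADER.pack(user_id, frame_no & 0xFFFF, len(jpeg_bytes)) + jpeg_bytes

def unpack_face_frame(datagram):
    """Recover (user_id, frame_no, image bytes) from a received datagram."""
    user_id, frame_no, length = HEADER.unpack_from(datagram)
    return user_id, frame_no, datagram[HEADER.size:HEADER.size + length]
```

On the wire this payload would be handed to sendto() on a SOCK_DGRAM socket; the receiver discards any frame whose counter is older than the last one shown for that user.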
5.3.3 Scalability and networking
Using UDP decreases the network latency. Only the user input events need to be sent, since everything else is deterministic and can be simulated (and re-simulated for delayed events) in real time on the client side.
In some projects (such as Second Life) the clients extrapolate object coordinates based on object velocities. This allows smooth movement even when the delay between packets is high, but it sometimes causes unrealistic behavior such as objects penetrating each other or falling through the ground.
With physics simulated on the client side we are able to hide the latency, so that the locally estimated behavior of objects is smooth and realistic even if packets arrive with significant delays. If no events affecting those objects were generated, this estimation is exact. This technique reduces the bandwidth requirements at the expense of increased requirements for the client station.
Since the communication channel is the bottleneck which limits the scalability of the system, we consider this tradeoff reasonable. Moreover, modern multi-core processors allow efficient execution of many threads simultaneously, so a separate thread for the physics simulation would not slow the client software down.
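The re-simulation idea can be demonstrated with a deterministic toy simulation; the event format and fixed timestep are assumptions for illustration only. Because the simulation is deterministic, replaying the same timestamped events always yields the same state, regardless of the order in which the packets arrived:

```python
def simulate(events, steps, dt=0.1):
    """Deterministic toy world: a 1-D ball whose velocity is changed by
    timestamped 'kick' events (step, delta_v). Same events, same result."""
    pos, vel = 0.0, 0.0
    # Index events by their simulation step, not by arrival order.
    by_step = {}
    for t, dv in events:
        by_step[t] = by_step.get(t, 0.0) + dv
    for step in range(steps):
        vel += by_step.get(step, 0.0)
        pos += vel * dt
    return pos

# A kick event arrives late: the client re-simulates from the start
# (in practice, from a recent snapshot) with the event inserted at its
# original timestamp, and converges to exactly the same state.
in_order = simulate([(2, 1.0), (5, -0.5)], 10)
late     = simulate([(5, -0.5), (2, 1.0)], 10)  # different arrival order
assert in_order == late
```

A real client would keep periodic state snapshots so that a late event only forces a rewind over the last few frames instead of a full replay.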
We also suggest increasing the scalability via a distributed architecture. The major bottlenecks of massively multiplayer games and virtual environments are the communication channels and the server resources; we tried to address both. Even though we use a client/server architecture, the only responsibility of the server is queuing and relaying messages. This is computationally cheap and could even be done by the clients themselves, preferably utilizing multicasting functionality where available.
5.4 Developed algorithm
During this project several variations of a simple image-recognition algorithm were developed and improved (although we cannot guarantee that the same approach has not been used before). The idea behind the algorithm is to use some a-priori knowledge about the image, together with statistical methods, to determine the angles of the hands. An advanced version of the algorithm uses more reliable knowledge obtained from the face-detection functionality provided by the OpenCV library, and tries to estimate the angles of the hands in the (horizontal) plane.
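One plausible illustration of "statistical methods to determine the angles of the hands" is a principal-axis computation over the skin-colored pixels found beside the detected face. The sketch below is our own illustration under that assumption, not the exact thesis algorithm: the blob's second central moments give the dominant orientation of the limb.

```python
import math

def limb_angle(pixels):
    """Estimate the dominant angle of a blob of (x, y) pixel
    coordinates via its second central moments (principal axis).
    Returns degrees, with 0 = horizontal and 45 = diagonal."""
    n = len(pixels)
    mx = sum(x for x, _ in pixels) / n
    my = sum(y for _, y in pixels) / n
    mu20 = sum((x - mx) ** 2 for x, _ in pixels) / n
    mu02 = sum((y - my) ** 2 for _, y in pixels) / n
    mu11 = sum((x - mx) * (y - my) for x, y in pixels) / n
    return math.degrees(0.5 * math.atan2(2 * mu11, mu20 - mu02))

# An arm raised diagonally: skin pixels lie along a 45-degree line.
arm = [(i, i) for i in range(30)]
assert abs(limb_angle(arm) - 45.0) < 1e-6
```

In the real system the pixel set would come from thresholding the camera image in a search window whose position and scale are anchored to the OpenCV face detection.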
5.5 Estimated feasibility of the created interface
This algorithm allows a rough estimation of the hands' positions, which is sufficient for primitive cases (for example, hitting the ball as in test case 1, “Ball game”), but the accuracy and stability of the estimation need to be improved to make it useful for more complex cases.
In order to implement the second test case (“Chemistry lab” with a pot) we needed to significantly simplify and constrain the physical properties of the pot so that it could be hooked by a hand. This method allows grabbing and throwing objects, but it is not yet good enough for real applications. It is, however, good enough to show that a virtual hand can be controlled in three dimensions via real-time processing of flat images from a cheap web-camera and can be used to interact with virtual objects.
We believe that the constraints and simplifications of this project stem from its limited timeframe and can be addressed in future research. For example, in order to implement pouring chemicals from one pot to another, the pot should be modeled in more detail and the liquid should be represented as a set of drops which obey Newton's laws (and possibly others). As an alternative approach, qualitative physics with particle effects could be used to simulate the liquid flow, as was done by Cavazza, Hartley, Lugrin and Le Bras in their work [5].
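The drops-obeying-Newton's-laws representation could start as simply as the particle sketch below. Names and constants are illustrative, and a real implementation would add collisions between drops and with the pot walls; this only shows each drop integrated independently under gravity:

```python
GRAVITY = -9.8  # m/s^2

def pour(drops, dt, steps, floor=0.0):
    """Toy 'liquid': each drop is a (height, vertical_velocity) pair
    that falls under gravity with simple Euler integration and comes
    to rest on the floor. A sketch, not a real fluid model."""
    for _ in range(steps):
        drops = [
            (max(floor, y + vy * dt),
             vy + GRAVITY * dt if y > floor else 0.0)
            for y, vy in drops
        ]
    return drops

# Drops released from the pot's rim eventually settle on the floor.
settled = pour([(1.0, 0.0), (1.2, 0.0)], dt=0.05, steps=200)
assert all(y == 0.0 for y, _ in settled)
```

Even this crude model already captures the qualitative behavior needed for pouring: drops leave one container, follow ballistic paths, and accumulate at the lowest surface they reach.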
We also believe that future research will improve the video-based interface and reduce the role of the keyboard and mouse in the virtual environment, making the interaction more natural.
This requires improving the accuracy and stability of the hand-tracking algorithms, as well as adding more degrees of freedom and detail, down to the level of individual fingers.
5.6 Integration with web services
It could be beneficial to integrate this work with a web interface and embed it into a variety of online projects, like the mentioned Web Based Simulation project. Even though this was not possible during this work, it can be done once efficient Java-based modules for the required components of the system (such as fast deterministic physics, an image-recognition library, etc.) become available. Some of those components already exist, such as the Java version of the Irrlicht engine [28], making this an interesting subject for future research.
5.7 Future research
Several issues should be addressed to fully complete the scalable virtual environment with a video interface:
• Integrate audio (VoIP) and implement sound distribution and attenuation to allow users to communicate with each other without touching the keyboard.
• Switch to a peer-to-peer or other distributed network architecture to increase the scalability of the system.
• Improve the video recognition techniques and the user interface to allow a truly convenient and powerful way of interaction.
• Examine the scalability, usability and performance issues in studies involving human subjects.
• Integrate with web services such as the Web-Based Simulation project.
• Embed a scripting language to enable interactive objects.
• Implement qualitative physics to enable other kinds of physical interaction and effects (such as heat exchange, pouring of water, etc.).
• Embed interfaces for external applications and devices (such as electronic microscopes, robots, satellites, physical devices, etc.) which could be used for education and research purposes.
Regarding 3D human pose estimation, much work is being conducted by leading research institutions. When better techniques for real-time tracking of body parts become available, they will open the potential for commercial and educational projects enabling a 3D interface similar to the real physical world (instead of the menus, clicks and icons used nowadays).
Having addressed all the mentioned issues, it would be possible to start an open-source project which could make virtual reality available to everybody. Enthusiasts can significantly facilitate the creation of the worlds they live in (as they do in Second Life [26] and other systems).
REFERENCES
1. AccessGrid.Org (n. d.). Retrieved October 2, 2006 from http://www.accessgrid.org/
2. Auyang S. Y. (1999, May) Cognitive and neural processes that make possible vision.
Retrieved June 4, 2007 from http://www.creatingtechnology.org/papers/vision.htm
3. Blanz, V. (2004). Learning-Based Approaches. In Facial modeling and animation. ACM
SIGGRAPH 2004 Course Notes (Los Angeles, CA, August 08 - 12, 2004). SIGGRAPH '04.
ACM Press, New York, NY, 6.
4. Burdea, G. & Coiffet, P. (1994). Virtual reality technology. New York: Wiley.
5. Cavazza, M., Hartley, S., Lugrin, J., and Le Bras, M. (2004). Qualitative physics in virtual
environments. In Proceedings of the 9th international Conference on intelligent User
interfaces (Funchal, Madeira, Portugal, January 13 - 16, 2004). IUI '04. ACM Press, New
York, NY, 54-61
6. CAVE User’s guide (n. d.). Retrieved June 08, 2007 from
http://www.evl.uic.edu/pape/CAVE/prog/CAVEGuide.html
7. Cheng K. & Takatsuka M. (2005). Real-time Monocular Tracking of View Frustrum for
Large Screen Human-Computer Interaction. ACM International Conference Proceeding
Series; Vol. 102
8. DevMaster.Net – Your source for game development (n. d.) Retrieved October 30, 2006 from
http://www.devmaster.net/engines/
9. Face Detection using OpenCV (2006, August 27) Retrieved June 9, 2007 from
http://opencvlibrary.sourceforge.net/FaceDetection
10. Han, J. & Smith, B. (1996). CU-SeeMe VR immersive desktop teleconferencing. In Proceedings of the Fourth ACM International Conference on Multimedia (Boston, Massachusetts, United States, November 18 - 22, 1996). MULTIMEDIA '96. ACM Press, New York, NY, 199-207.
11. Hiwada, K., Maki, A., and Nakashima, A. (2003). Mimicking video: real-time morphable 3D
model fitting. In Proceedings of the ACM Symposium on Virtual Reality Software and
Technology (Osaka, Japan, October 01 - 03, 2003). VRST '03. ACM Press, New York, NY,
132-139.
12. Irrlicht Engine – A free open source 3D engine (n. d.) Retrieved October 30, 2006 from
http://irrlicht.sourceforge.net
13. Kauff, P. & Schreer, O. (2002). An immersive 3D video-conferencing system using shared
virtual team user environments. In Proceedings of the 4th international Conference on
Collaborative Virtual Environments (Bonn, Germany, September 30 - October 02, 2002).
CVE '02. ACM Press, New York, NY, 105-112.
14. Kjeldsen R. (2005, August). IBM Head Tracking Pointer User’s Manual, Retrieved June 8,
2007 from
http://dl.alphaworks.ibm.com/technologies/headpointer/Head_Tracking_Pointer_Users_Man
ual.pdf
15. Lee, S., Kim, I., Ahn, S. C., Lim, M., and Kim, H. (2005). Toward immersive
telecommunication: 3D video avatar with physical interaction. In Proceedings of the 2005
international Conference on Augmented Tele-Existence (Christchurch, New Zealand,
December 05 - 08, 2005). ICAT '05, vol. 157. ACM Press, New York, NY, 56-61.
16. List of game engines. (2007, July 5). In Wikipedia, The Free Encyclopedia. Retrieved July 8, 2006, from http://en.wikipedia.org/wiki/List_of_game_engines
17. Lok, B. (2001). Online model reconstruction for interactive virtual environments. In Proceedings of the 2001 Symposium on Interactive 3D Graphics, I3D '01. ACM Press, New York, NY, 69-72.
18. Mather, G. (2006). Foundations of Perception. Psychology Press, 2006
Retrieved June 4, 2007 from
http://www.psypress.com/mather/resources/topic.asp?topic=ch01-tp-01
19. OGRE 3D: Open source graphics engine (n. d.) Retrieved October 30, 2006 from
http://www.ogre3d.org
20. OpenAL. (2007, June 11). In Wikipedia, The Free Encyclopedia. Retrieved July 17, 2006,
from http://en.wikipedia.org/wiki/OpenAL
21. OpenCV Library Wiki (n. d.) Retrieved February 16, 2007 from http://opencvlibrary.sourceforge.net/
22. P5 Glove – Scratchpad Wiki Labs (n. d.) Retrieved June 08, 2007 from http://scratchpad.wikia.com/wiki/P5_Glove
23. Rajaei, H. & Barnes, A. (2006). A Real-Time Interactive Web-Based Environment for
Training, In Proceedings of the International Conference on Modeling and Simulation -
Methodology, Tools, Software Application (M&S-MTSA'06) July 31-August 2, 2006,
Calgary Canada
24. Rajaei, H. & Dieball, A. (2007) A Shared-View Web-Based Environment for Training, In
Proceedings of the 10th Communications and Networking Simulation Symposium, CNS'07,
Sponsored by ACM/SCS, March 26-28, Norfolk, Virginia 6.
25. Rajaei, H. (2004). Distributed Virtual Training Environment. In Proceedings of the Swedish-American Workshop on Modeling and Simulation, February 2004, Florida, 195-200.
26. Second Life (n. d.). Retrieved October 2, 2006 from http://secondlife.com
27. Skype (n. d.). Retrieved October 2, 2006 from http://skype.com
28. SourceForge.net: Jirr (n. d.). Retrieved July 8, 2006, from
https://sourceforge.net/projects/jirr/
29. Steinicke, F., Ropinski, T. & Hinrichs, K. (2005, December). A Generic Virtual Reality
Software System’s Architecture and Application. In ACM International Conference
Proceeding Series, ICAT’05, Vol. 157, 220-227
30. Tarini, M., Yamauchi, H., Haber, J. & Seidel, H. P. (2004) Texturing Faces. In Facial
modeling and animation. ACM SIGGRAPH 2004 Course Notes (Los Angeles, CA, August
08 - 12, 2004). SIGGRAPH '04. ACM Press, New York, NY, 6.
31. There – the online virtual world that is your everyday hangout (n. d.). Retrieved June 09,
2007 from http://www.there.com/
32. Viola, P. & Jones, M. (2001). Rapid object detection using a boosted cascade of simple features. In IEEE Conference on Computer Vision and Pattern Recognition.
33. Virtual Environment. (2006, September 26). In Wikipedia, The Free Encyclopedia. Retrieved
October 02, 2006, from http://en.wikipedia.org/wiki/Virtual_environment
34. Virtual reality. (n.d.). In The Columbia Electronic Encyclopedia, Sixth Edition. Retrieved
October 02, 2006, from http://www.answers.com/topic/virtual-reality
35. VoipBuster (n. d.). Retrieved October 2, 2006 from http://voipbuster.com
36. Wii.Nintendo.com – In-Depth Regional Wii Coverage (n. d.) Retrieved June 08, 2007 from
http://wii.nintendo.com/controller.jsp
37. Wilson, P. I. & Fernandez, J. (2006, April). Facial feature detection using Haar classifiers. Journal of Computing Sciences in Colleges, Volume 21, Issue 4, 127-133.
38. Zhu, Z. & Ji, Q. (2004). Real Time 3D Face Pose Tracking From an Uncalibrated Camera. Retrieved June 08, 2007 from http://www.cv.iit.nrc.ca/VI/fpiv04/pdf/12ft.pdf