COST-EFFICIENT VIDEO INTERACTIONS FOR VIRTUAL TRAINING ENVIRONMENT

Arsen Gasparyan

A Thesis

Submitted to the Graduate College of Bowling Green State University in partial fulfillment of the requirements for the degree of

MASTER OF SCIENCE

August 2007

Committee:

Hassan Rajaei, Advisor

Guy Zimmerman

Walter Maner


ABSTRACT

Hassan Rajaei, Advisor

In this thesis we propose a paradigm for video-based interaction in a distributed virtual training environment, develop a proof of concept, examine its limitations, and explore the opportunities it provides.

To make this interaction possible, we explore and develop methods that estimate the position of the user's head and hands via image recognition techniques and map this data onto a virtual 3D avatar, allowing the user to control virtual objects in much the same way as he/she does real ones. Moreover, we aim to develop a cost-efficient system using only cross-platform, freely available software components, common hardware (a computer and a monocular web-cam), and an ordinary internet connection.

Consequently, we aim to increase the accessibility and cost efficiency of the system and to avoid expensive instruments, such as gloves and CAVE systems, for interaction in virtual space.

The results of this work are the following: a method for estimating the hand positions; proposed design solutions for the system; a proof of concept based on two test cases ("Ball game" and "Chemistry lab"), which shows that the proposed ideas allow cost-efficient video-based interaction over the internet; and a discussion of the advantages, limitations, and possible future research on video-based interaction in virtual environments.


This work is dedicated to my ancestors, teachers, colleagues and other kind people.

ACKNOWLEDGMENTS

This study could not have been completed without my advisor, Dr. Rajaei, who had the patience and goodwill to guide me through the whole process of scientific research.

I would also like to thank Dr. Maner for his persistent help and care, and Dr. Zimmerman for his support and valuable comments at the committee meetings.

A special acknowledgement goes to the ACM Digital Library, Wikipedia contributors, and the Open Source community members who made all the necessary information and software components available for this research.

TABLE OF CONTENTS

CHAPTER 1. INTRODUCTION
    1.1 Hypothesis
    1.2 Goals
    1.3 Research questions
    1.4 Preliminary Research

CHAPTER 2. LITERATURE REVIEW
    2.1 Virtual Reality/Environment
    2.2 Visual Aspects
    2.3 Computer Vision
    2.4 Preliminary Conclusions

CHAPTER 3. METHODOLOGY
    3.1 The interaction paradigm
    3.2 Test cases
        3.2.1 Test Case 1: Simple Ball Game
        3.2.2 Test Case 2: Virtual Chemistry Lab
        3.2.3 Observations
    3.3 Preliminary gesture tracking algorithm
    3.4 Proposed methods

CHAPTER 4. IMPLEMENTATION
    4.1 Problem Statement
        4.1.1 Problems and solutions
        4.1.2 Test cases
    4.2 Overview of available platforms
        4.2.1 Graphics engines
        4.2.2 Physics engines
        4.2.3 Sound engines
        4.2.4 Video acquisition
        4.2.5 Network engines
    4.3 Networking
        4.3.1 Preliminary estimation of channel usage
        4.3.2 Network architecture
    4.4 Prototype 1
        4.4.1 Prototype 1 Goals
        4.4.2 Prototype 1 Design
        4.4.3 Prototype 1 Implementation
    4.5 Prototype 2
        4.5.1 Prototype 2 Goals
        4.5.2 Prototype 2 Design
        4.5.3 Prototype 2 Implementation
    4.6 Prototype 3
        4.6.1 Prototype 3 Goals
        4.6.2 Prototype 3 Design
        4.6.3 Prototype 3 Implementation
    4.7 Prototype 4
        4.7.1 Prototype 4 Goals
        4.7.2 Prototype 4 Design
        4.7.3 Prototype 4 Implementation

CHAPTER 5. RESULTS AND DISCUSSION
    5.1 Proof of concept
    5.2 Identified problems
        5.2.1 Physical avatar
        5.2.2 Mapping movements to virtual body
        5.2.3 Object representation
    5.3 Proposed solutions
        5.3.1 Interaction paradigm
        5.3.2 Human communication
        5.3.3 Scalability and networking
    5.4 Developed algorithm
    5.5 Estimated feasibility of created interface
    5.6 Integration with web services
    5.7 Future research

REFERENCES

CHAPTER 1. INTRODUCTION

A virtual environment allows spatially distributed people to meet and collaborate via the Internet; however, there is no ultimate solution yet which allows ordinary users to do this effectively and in a natural way.

There are various accessible applications which solve small parts of this problem (such as VoipBuster [35] for voice, Skype [27] for video, and Second Life [26] for virtual interaction). However, none of them allows people to see each other's faces and gestures in the virtual space while at the same time interacting with virtual objects using their hands as they do in real life.

On the other hand, there are many research projects which aim to enhance interaction in the virtual environment or to increase the scalability of videoconferencing (such as CAVE [6] and AccessGrid [1]), but they require expensive equipment (sensors, stereo cameras, special routers with wide channels, etc.) which is normally not available to common users.

1.1 Hypothesis

The hypothesis of this study is that it is possible to interact with virtual objects using hand gestures acquired from a monocular web-cam, and that the computational power of modern PCs (or laptops) is enough to perform all the required image recognition and simulation procedures.

1.2 Goals

The goal of this thesis is to examine the potential of, and propose a solution for, a system which could provide video-based interaction in a distributed virtual environment using only standard, available hardware, reducing bandwidth to increase scalability where possible. The main goal is to enable video-based hand tracking for interaction in an accessible and scalable virtual training environment.

The objectives of this research are to focus on the following aspects of video interaction:

1. Face detection and mapping of the small face image onto the virtual body

2. Gesture recognition and conversion into the commands which control the virtual body

3. Laying the groundwork for a hand tracking technique using a web-cam

Face detection should allow video communication between people within a virtual world with relatively low bandwidth usage, due to the selection of only the relevant area of the image acquired from the web-cam. Potentially, this should allow more people to see each other at the same time.

Gesture recognition should allow users to interact with virtual objects, as proposed in two examples which together form the proof of concept. The first example enables a simple video-based interaction: kicking a ball with a hand gesture. The second example allows a more complicated interaction: adding a substance to a chemical pot held by another person.

More specifically, hand tracking should allow the user to make basic 3-dimensional movements with a virtual hand, thus kicking, grabbing, or performing other actions on virtual objects, while the user performs the movements with his real hands on imaginary objects in front of the web-cam, getting visual feedback from the computer display.

The proposal is to use the detected face of the user and his/her motion to track the position of the hands in the image plane, and then to use knowledge about perspective projection, together with some assumptions regarding the probable movements, to extend this estimation to 3D and to assign the corresponding behavior to the virtual model of the physical world.

1.3 Research questions

- Is it possible to interact with virtual objects using gestures acquired from a monocular web-cam?

- Can hand tracking be used to transform hand gestures into the virtual environment in such a way that it allows grabbing and holding virtual objects without gloves or other sensors?

- Do modern PCs (or laptops) have enough computational power to perform the image recognition, 3D rendering, networking, and other routines required for the system?

- What software architecture could fit the purpose, and which software components could be used to implement the system?

- How can the proposed solution be integrated with a web interface so that it could easily be used by other online projects, including the Virtual Training Environment?

- What approaches could be used to increase the scalability of the system?

- What are the potential strengths and limitations of the proposed techniques?

1.4 Preliminary Research

The prior investigation (in summer 2006) targeted experiments on image acquisition and recognition for controlling a virtual flight over a landscape with the user's hand gestures. It inspired the current research and suggested that organizing human-computer interaction via a web-cam was doable, even though the interaction was not very stable or precise.

The obvious application of such an interface lies in the field of virtual reality, since it makes it possible to control a virtual body in a very natural way without expensive hardware. Since the virtual body in this case repeats the natural movements of the user, it makes it possible to:

1. Enhance existing voice-over-IP communication with gestures

2. Decrease the bandwidth required for teleconferencing by sending only crucial video data (mainly the face)

3. Interact with virtual objects the same way as with real ones

We believe this functionality can become a part of the ongoing Web-Based Simulation (WBS) project [23, 24, 25], supervised by Dr. Rajaei, since the target of the WBS project is a Virtual Training Environment (VTE) which needs to be equipped with such an interface.

CHAPTER 2. LITERATURE REVIEW

Inspired by the first practical results, we started a literature review to get a bigger picture of the field and to find out what had been done so far by other researchers and what could be done by us in this project.

The ACM Digital Library contains thousands of articles related to virtual environments; hundreds of them were considered, but only about 50 were selected for the literature review. The following aspects of virtual environments were of interest during this research.

• What virtual reality is and what a virtual environment is

• Visual representation

• Image recognition or other methods of gesture input

• Physics simulation

• Simulated sound distribution

• Scalability & Networking

• Usability issues

• Related projects

2.1 Virtual Reality/Environment

There is a lot of misunderstanding about what Virtual Reality (VR) is and what a Virtual Environment (VE) is. Wikipedia (the free encyclopedia) shows the same article for both terms, claiming that VR is a technology which allows a user to interact with a computer-simulated environment, but not clarifying what a VE is [33]. The Columbia Electronic Encyclopedia states that VE and VR mean the same thing: a computer-generated environment with and within which people can interact [34]. According to Burdea and Coiffet [4], Virtual Reality is a high-end user-computer interface that involves real-time simulation and multimodal interactions, but at the same time they state that telepresence and augmented reality are not Virtual Reality in the strictest sense, since they incorporate real images. By this definition, the system we aim to build will not be "strict" Virtual Reality. However, we will use the terms VR and VE as equivalents in a less strict manner, so that they can be applied to our project, keeping in mind that it can in fact be classified as augmented reality.

Many research and commercial companies have been working in the field of virtual reality for years. The purpose of this section is to get acquainted with the valuable results of their work and to see what has been done and what could be done.

2.2 Visual Aspects

Graphical representation is crucial for virtual environments, since vision is agreed to be the most important sense [2], which is confirmed by the fact that the visual cortex is the largest sensory area in the human brain [18]. Many techniques and algorithms have been developed in the field to make VR look more realistic. We conducted a literature search to obtain an overview of the existing projects and to get a better picture of what has already been done.

VR2S – a generic VR software system which is an extension of the high-level rendering system VRS. It is focused on real-time human-computer interaction and multimedia responsiveness; for example, it provides virtual shadows for the real image of the head, as shown in Figure 2-1 [29].

Online Model Reconstruction – a system for generating real-time 3D reconstructions of the user and other real objects in an immersive virtual environment for visualization and interaction. It bypasses an explicit 3D modeling stage and does not use additional sensors or prior object knowledge, but uses a set of outside-looking-in cameras instead. The visual hull technique (proposed by Benjamin Lok) is used to rapidly determine which volume samples lie within the visual hull, and then to build an object reconstruction from any given viewpoint at interactive rates [17]. An example of rendered interaction is presented in Figure 2-2. Unlike our project, Online Model Reconstruction utilizes specialized hardware, such as five wall-mounted NTSC cameras (720x486) and one additional camera (640x480), to extract the visual hull of the human body. Obviously, this kind of system is not portable or accessible for common users.

Figure 2-1: Rendered shadows (top-left image) of the user (from [29])

Figure 2-2: Reconstructed hands interact with environment (from [17])

CU-SeeMe – a distributed video-conferencing system which embeds live video and audio in a virtual space. The system uses its own protocol on top of UDP and utilizes a threshold-difference approach to transfer only valuable video data (compressed via lossless compression), as shown in Figure 2-3 [10]. Sound attenuation is used to improve the perception of the virtual space.

An immersive 3D video-conferencing system – this project is based on the principle of a shared virtual table environment, allowing tele-presence and human-centered communication, as shown in Figure 2-4. Images from 2 cameras are used to perform depth analysis and to create a virtual view of the remote conferee. The extracted data has a compact form and can be efficiently encoded via an Mpeg-4 codec [13].

Figure 2-3: Segmented video in virtual 3D space – only face is being transferred (from [10])

Figure 2-4: Real images are merged with virtual objects (from [13])

Figure 2-5: Side-by-side comparison of photographs (left) and heads (right) for OpenGL rendering [30]

Face modeling and mimicking. It is possible to do amazing things with digital images of faces. Models of human faces based on photos are shown in Figure 2-5, and in Figure 2-6 you can see face manipulation based on a single image, processed in several stages: 3D reconstruction; rotation; modification; rotation back; rendering [3]. Most likely, those techniques are not real-time; however, real-time techniques also exist and produce a fairly good picture (Figure 2-7) [11].

Figure 2-6: Face manipulations based on a single image [3]

Figure 2-7: Real-time video mimicking [11]

There is a variety of other projects which do not concentrate on graphics but are somewhat close to the virtual environment, implementing some of its important features.

Second Life [26], There [31], and massively multiplayer games provide scalable 3D worlds with interactive virtual objects and means of interaction. However, they do not allow audio/video conferencing, nor do they provide means for realistic and natural interaction: users have to use the keyboard and mouse to control their avatars and deal with simplistic icons and menu items to perform interactions.

Voice-over-IP applications allow real-time audio communication; some of them (such as Skype [27] and Access Grid [1]) also allow video conferencing. These projects specialize in teleconferencing and do not provide a 3D space or virtual objects for interaction.

CAVE [6] and similar projects create the illusion of immersion in a virtual world; some of them even use physics for interaction [15]. However, unlike our project, they require special hardware and cannot become globally accessible.

Potentially, the synthesis and further development of existing ideas and technologies could lead to a scalable and universal immersive virtual environment with a natural interface, which could integrate audio/video communication with interaction in a 3D world. Our contribution to such an interface lies in the field of video-based interaction via common hardware. We aim to provide a natural way of interacting in the virtual environment without using uncommon or expensive hardware. We plan to spend computing power (which seems to be the cheapest and fastest-growing resource) to reduce the need for special hardware, replacing it with the combination of an ordinary web-camera and computer vision techniques.

2.3 Computer Vision

Most of the existing projects which aim to use cameras for control require professional hardware such as stereo cameras; however, there are projects which use a monocular camera for user input [7, 14, 38], as well as the computer vision techniques which make it possible [32, 37].

For example, one of the projects [7] uses Intel's OpenCV library [21] to detect the user's face, thus locating his/her position in order to allow pointing via the user's fingertip. The library is Open Source and free for non-commercial use. It utilizes a cascade of boosted classifiers working with Haar-like features, originally proposed by Paul Viola [9].

Viola and Jones used four types of features (shown in Figure 2-8) and introduced the notion of the Integral Image, each pixel (X, Y) of which is the sum of the brightness of all pixels with x ≤ X and y ≤ Y. Once the Integral Image is built, it takes only constant time to extract image features [32]. A cascade of features (preliminarily selected via a special learning algorithm) is used to detect faces or other objects (Figure 2-10). Later, Wilson and Fernandez [37] extended the algorithm with diagonal features.

Figure 2-8. Rectangle features relative to the detection window. The sum of pixel brightness within the white rectangles is subtracted from the sum of pixel brightness within the gray rectangles. [32]

Figure 2-9. The Integral Image allows calculating the sum of pixels within region D in constant time using only 4 array references: 4 + 1 - (2 + 3). [32]

Figure 2-10. Two most important features used for face detection. [32]
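To make the constant-time lookup concrete, here is a small self-contained C++ sketch of the technique (our own illustration of [32], not code from the original paper):

#include <vector>

// Build the Integral Image: ii(X, Y) = sum of the brightness of all
// pixels with x <= X and y <= Y.
std::vector<long> buildIntegralImage(const std::vector<unsigned char>& img,
                                     int w, int h) {
    std::vector<long> ii(w * h, 0);
    for (int y = 0; y < h; ++y)
        for (int x = 0; x < w; ++x) {
            long left = (x > 0) ? ii[y * w + (x - 1)] : 0;
            long up   = (y > 0) ? ii[(y - 1) * w + x] : 0;
            long diag = (x > 0 && y > 0) ? ii[(y - 1) * w + (x - 1)] : 0;
            ii[y * w + x] = img[y * w + x] + left + up - diag;
        }
    return ii;
}

// Sum of pixels inside the rectangle (x0, y0)..(x1, y1), inclusive, using
// only 4 array references: the "4 + 1 - (2 + 3)" of Figure 2-9.
long rectSum(const std::vector<long>& ii, int w,
             int x0, int y0, int x1, int y1) {
    long p1 = (x0 > 0 && y0 > 0) ? ii[(y0 - 1) * w + (x0 - 1)] : 0;
    long p2 = (y0 > 0) ? ii[(y0 - 1) * w + x1] : 0;
    long p3 = (x0 > 0) ? ii[y1 * w + (x0 - 1)] : 0;
    long p4 = ii[y1 * w + x1];
    return p4 + p1 - (p2 + p3);
}

A rectangle feature from Figure 2-8 is then just the difference of two or three such sums, which is why the whole cascade can run at interactive rates.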

2.4 Preliminary Conclusions

To get a better understanding of the existing systems, it is necessary to try them in practice (downloading projects when possible, registering in the online systems, games, etc.). However, this is not always possible, especially in a limited timeframe.

According to the covered literature, the following has been done:

• VR with gloves, glasses, "caves" and other special equipment

• Advanced processing of video, including image segmentation (based on a single camera), face/hand tracking, 3D reconstruction, and video-based avatars (based on multiple cameras)

• Real-time mimicking; mimicking based on a single photograph

• Gesture recognition (to some degree)

• EEG as an input device for primitive games (to some degree)

• Decentralized networking, including multi-server and P2P technologies, dynamic load distribution, zoning, and message distribution algorithms

• Collision detection and modeled physics

• Acoustic modeling: sound distribution and attenuation

• Scalable video-conferencing, IP-telephony

What could be done is to integrate existing achievements from different fields in a single project, possibly modifying them, and to see if it works:

• Use image tracking and recognition techniques to get a compact representation of the user, thus modeling a customizable virtual human based on the video from a single camera.

• Place the virtual human into the modeled physical world.

• Provide an interface for attaching externally simulated models to the virtual environment so that virtual humans can interact with them.

• Use a decentralized architecture to simulate this world in a scalable manner.

• Extend the world with acoustic interaction.

• Do all this in such a way that ordinary desktop owners can use the system.

The long-term goal of the project is to create a scalable virtual environment with a natural interface, generally accessible to common users. This system could be used for education, training, collaboration, and entertainment by everyone who has a more or less modern computer with a 3D accelerator and a reasonable internet connection.

The short-term goal of the project is to implement the key parts of the system as a basic prototype in order to provide a proof of concept answering the question of to what degree it is doable, or what constraints prevent implementing it fully. This thesis chose the latter approach, providing evidence that it can be done.

CHAPTER 3. METHODOLOGY

3.1 The interaction paradigm

Before we start discussing the implementation, we should present the methodology we used to carry out our research. The proposed methodology has to answer the question: how is it possible to make (at least theoretically) the interaction with virtual objects closer to the way we interact with real ones, i.e., how can we kick a virtual ball, hold it, transfer it, or place arbitrary virtual objects in desired positions with real hand gestures?

We focus on solving this problem with common hardware. The literature review showed that the web camera, being a widespread and inexpensive device, has great potential for use as a natural interface to a virtual environment. The proposed solution lies at the intersection of the fields of computer vision and virtual environments.

Even though the general scope of the problem is broad and could be applied to a variety of actions and objects with different physical properties, we narrow it down to the set of specific functions required for the test cases explained below.

3.2 Test cases

In order to get some evidence that our ideas could work, we developed two test cases which demonstrate the idea, identify the basic problems, and indicate whether we solved them or not.

The first test case is designed to test very basic and simple functionality: interaction using physics with rough hand tracking. Its results will show whether a hand-tracking interface is doable or not.

The second test case involves more accurate and complicated interaction, simulating a possible real application of the proposed solutions. Its results will show whether a hand-tracking interface is feasible for a virtual training environment or not.

3.2.1 Test Case 1: Simple Ball Game

Real-world gaming uses physical laws as an interface to the game objects; e.g., in football and many other games, the game object is an elastic ball. The player interacts with it by kicking it, i.e., moving his/her extremity towards it, experiencing collision forces which push the extremity back and the ball forward. The ball receives the impulse, accelerates, and moves accordingly. Friction with the air and collisions with the ground and other objects affect its speed and rotation.


Figure 3-1. Simplistic representation of the environment-mediated interaction with a ball

The interaction between players in real life happens via passing the ball, which is in fact the same as kicking the ball in the direction of the recipient. This observation demonstrates that the real world, with its physical laws and properties, is a mediator for the interaction between the players, as shown in Figure 3-1.

In our work we target to implement the first test case, allowing the user to control the extremities of the virtual body with his hands and to pass the virtual ball to another user using them. As mentioned above, the images from the monocular web-cam are going to be acquired and processed to allow this interaction.
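As a toy illustration of this environment-mediated interaction (our own sketch, not the prototype's code), a kick reduces to applying an impulse to the ball and letting a simple integrator handle the rest:

struct Vec3 { float x, y, z; };

Vec3 add(Vec3 a, Vec3 b)    { return { a.x + b.x, a.y + b.y, a.z + b.z }; }
Vec3 scale(Vec3 a, float s) { return { a.x * s, a.y * s, a.z * s }; }

struct Ball {
    float mass;     // kg
    Vec3  position; // m
    Vec3  velocity; // m/s
};

// A kick is an impulse J (N*s) applied at the moment of contact: v += J / m.
void applyImpulse(Ball& b, Vec3 J) {
    b.velocity = add(b.velocity, scale(J, 1.0f / b.mass));
}

// One explicit Euler step under gravity and a crude linear air drag.
void step(Ball& b, float dt) {
    const Vec3  gravity = { 0.0f, -9.81f, 0.0f };
    const float airDrag = 0.1f; // illustrative drag coefficient
    b.velocity = add(b.velocity, scale(gravity, dt));
    b.velocity = scale(b.velocity, 1.0f - airDrag * dt);
    b.position = add(b.position, scale(b.velocity, dt));
}

A physics engine adds collision detection and rotation on top of this, but the principle stays the same: users never call a "pass the ball" function; they only apply forces, and the simulated world mediates the rest.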

3.2.2 Test Case 2: Virtual Chemistry Lab

In our test case for the chemistry lab, we want a chemistry student to pour chemicals from his pot into a pot held by another student in the cyber lab.

Figure 3-2. Chemicals pouring from one pot to another following the gravity law

Unlike the previous case of passing the ball, more actions are required here:

1) Student A grabs the empty pot with his fingers (causing small collisions and friction, which keep the pot from falling).

2) Student B grabs the pot with the chemical in a similar way.

3) Student B moves his pot towards the pot of Student A and flips it (the pot simply follows the trajectory of the hand, since the pot's movement is constrained by the hand's fingers).

4) The contents of the pot fall into the empty pot due to the force of gravity (Figure 3-2).

3.2.3 Observations

Based on those examples, we observed that a small number of fundamental physical laws allows establishing a variety of different interactions, potentially all the interactions we can observe in real life. Having understood how they are actually performed in the real world, we can model them in the virtual world as well. Potentially, this allows us to find a general solution for realistic and natural interaction in a virtual environment, instead of developing specialized interfaces for each possible type of user action and object (as is commonly done in most virtual reality and gaming projects).

3.3 Preliminary gesture tracking algorithm

During the preliminary research, an algorithm was developed which estimates the angles of the hands when they move in a vertical plane (parallel to the camera's lens). This kind of movement was used to control the wings of a virtual bird flying over a landscape.

The algorithm uses some a priori knowledge and assumptions about the image and the human body. The basic idea of the algorithm can be described as follows:

1) Generate a motion map as the difference between consequent frames, filtering out everything beneath a threshold and adjusting the threshold if the difference is too big or too small.

2) Remove noise (isolated pixels in the motion map; pixels which form small segments).

3) Assume that in the long run the average position of the body center should be in the middle of the detected motion, and that it does not change as frequently as the positions of the hands (whose size is about half of the average motion width). Using this knowledge, adjust the estimates of the body position and hand size.

4) Given the estimated body position, calculate the average angle over all the motion pixels to the left and to the right of the body and visualize it as the estimate of the positions of the human's hands (as shown in Figure 3-3).

Figure 3-3. On the left: two consequent images of the movement overlaid on one another; on the right: the difference between the two consequent images (red) and the angles estimated using the assumptions about the body position.
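Condensed into code, steps 1 and 4 look roughly as follows (a simplified sketch for 8-bit grayscale frames; step 2's noise removal and step 3's running body estimate are omitted for brevity, and all names and constants are our illustrative choices, not the original code):

#include <cmath>
#include <cstdlib>

// prev/cur are consequent w x h grayscale frames; bodyX is the current
// long-term estimate of the body center maintained by step 3.
void estimateHandAngles(const unsigned char* prev, const unsigned char* cur,
                        int w, int h, int threshold, float bodyX,
                        float& leftAngle, float& rightAngle) {
    double lSum = 0, rSum = 0;
    int lCount = 0, rCount = 0;
    const double shoulderY = h / 3.0; // guessed shoulder height
    for (int y = 0; y < h; ++y)
        for (int x = 0; x < w; ++x) {
            // Step 1: motion map = thresholded difference of consequent frames.
            int diff = std::abs(int(cur[y * w + x]) - int(prev[y * w + x]));
            if (diff < threshold) continue;
            // Step 4: accumulate the angle of each motion pixel relative to
            // the assumed shoulder point on the estimated body axis.
            double dx = x - bodyX, dy = shoulderY - y;
            double angle = std::atan2(dy, std::fabs(dx));
            if (dx < 0) { lSum += angle; ++lCount; }
            else        { rSum += angle; ++rCount; }
        }
    leftAngle  = lCount ? float(lSum / lCount) : 0.0f; // average = hand estimate
    rightAngle = rCount ? float(rSum / rCount) : 0.0f;
}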

This algorithm is quite simple. Nevertheless, it is a good starting point for our experiments and serves our purpose well enough to move forward to the next step. Even though the result of this algorithm is not very accurate, it allows controlling the direction of the flight with hand movements, giving somewhat satisfactory accuracy for this purpose, as shown in Figure 3-4.

This algorithm, however, has several limitations:

1. It estimates only 2 angles in a single plane.

2. The hands should be more or less straight, not overlapping, and, unlike the body, they should be in motion most of the time.

3. Both hands should be used with similar frequency; otherwise the estimated position of the body center drifts towards the most active hand, drastically decreasing the accuracy of the method.

Furthermore, it is assumed that there is no other movement in the camera's field of view, and that the lighting conditions do not change.

Figure 3-4. Flying over a 3D landscape using hand gestures acquired from the web camera. Blue lines are the estimated positions of the hands (wings), whereas green and red rectangles represent the estimated long-term and short-term body positions, respectively.

3.4 Proposed methods

Based on the preliminary research and the observations from the test cases, the decision was made to use the combination of hand tracking and a physical model as the foundation of the interaction paradigm in this research. Together with the corresponding experiments, the proof of concept, and the resulting proposals, it forms the essence of this work.

Thus we suggest:

1. To enhance the developed hand tracking algorithm to 3 dimensions

2. To use it to control the extremities of the virtual avatar with real hand movements in front of the web-cam

3. To use the extremities of the virtual avatar for interaction with virtual objects

4. To establish the interaction between users and objects in terms of the interaction of physical bodies

5. To enhance the communication of users in virtual space via detecting their faces and mapping them onto the virtual avatars

6. To use client-side simulation to decrease the bandwidth requirements and server load, thus increasing the scalability of the system

We also propose to use only freely available components to implement the system providing the mentioned functionality.

In the following sections we present the details of our implementation of this approach, including the criteria for component selection and the architecture, with justification for each of the proposed solutions.

CHAPTER 4. IMPLEMENTATION

Since the subject of this study is video input in a virtual environment, we had to find either an existing virtual environment or software modules which could provide the functionality and flexibility necessary for the project, considering the learning curve and the limited timeframe.

One of the first issues we had was to decide whether it is better to use an existing system or to develop a prototype of a new one using freely available engines and libraries.

Before we made a selection, we identified our basic requirements for the system:

1. The system should run on common hardware.

2. The system should be capable of 3D rendering of dynamic scenes on the fly.

3. Unlike other systems, interaction in the system should be implemented the way it is in the real world: via physics.

4. The system should use the network efficiently, considering the real-time nature of the application.

It is easy to notice that most of those requirements are very common for computer games. Usage of gaming techniques could be considered "non-scientific"; however, gaming is in fact a very developed sector of the economy which moves the whole computing industry forward and gets as much as it can from existing computer hardware and user interfaces. The virtual reality field is in fact very close to gaming, which is confirmed by the usage of VR equipment in computer games and the growing popularity of the Wii remote (a relatively cheap 3D manipulator) which appeared on the market recently.

4.1 Problem Statement

We condensed our view of the project's long-term future into a hierarchy of implementation goals with two top-level requirements, which come from one of the expected uses of the system (the Virtual Training Environment), in order to determine the software modules we might need to implement them:

• allow ordinary spatially distributed people to communicate online, seeing and hearing each other

  o make the system accessible

    - decreasing bandwidth
    - decreasing hardware requirements
    - making it run on different platforms
    - making it free or cheap enough

  o enhance the user experience

    - allow users to interact using natural body movements
    - allow them to see each other in 3D space
    - enhance their communication with realistic 3D sound effects

  o implement the system in a scalable way so that it could serve as many people as required

• allow this system to be used for training and education purposes

  o embed virtual objects
  o allow users to interact with them as well
  o integrate it somehow with the web

Ideally, it should eventually become a scalable system which is able to serve millions of people at the same time, allowing them to communicate with each other and to interact with virtual objects in much the same way as they interact with real ones.

4.1.1 Problems and solutions

Keeping in mind the project's goals, we can identify the basic problems and propose solutions for them, as shown in Figure 4-1.

Problem: Make it cross-platform, free or cheap enough
Solution: Use free components when possible; avoid requiring a costly multi-server infrastructure for system maintenance

Problem: Embed voice over IP and 3D sound
Solution: Find and adapt a cross-platform sound engine

Problem: Allow users to see each other in 3D space
Solution: Find and adapt a cross-platform 3D engine; embed optional video acquisition and mapping onto the virtual body

Problem: Allow users to interact with virtual objects
Solution: Find and adapt a cross-platform physics engine

Problem: Decrease the bandwidth required
Solution: Track the body & head position and send only the necessary data; use modern audio/video codecs and data compression; if bandwidth is not sufficient, video quality can be sacrificed, but other functions should work properly; use an existing cross-platform game networking library for efficient UDP-based data transmission

Problem: Decrease hardware requirements
Solution: Unlike many VR applications, use a simple monocular webcam for hand/head/body tracking

Problem: Implement the system in a scalable way so that it could serve huge numbers of users
Solution: Utilize a distributed architecture; use an approach similar to P2P or GRID

Problem: Allow users to interact with natural body movements
Solution: Implement head, hand and body tracking

Figure 4-1. Identified problems and solutions

Obviously, not all of the above goals can be achieved within a very limited timeframe, and some are beyond the scope of this study, but a proof of concept and a working (though maybe not fully functional) prototype can be completed.

4.1.2 Test cases

As mentioned earlier, the complete system should allow implementing the following test cases:

• Ball game: users should be able to share a virtual room and play with a single ball

• Chemistry lesson: users should be able to mix chemicals in collaboration, pouring chemicals from one pot (held by one user) into another (held by another user)

The final version of the project should allow users to communicate orally and to see each other's facial expressions as well as hand movements in both of those cases, as shown in Figure 4-2.

Figure 4-2. Project preview: ball game¹

¹ Photomontage based on images downloaded on October 30, 2006, from:
http://www.animationindia.biz/images/gallery/man.jpg - image of a 3D man
http://www.threedy.com/site/forum/attachment.php?attachmentid=85627&stc=1 - image of user hands
http://www.nv2.cc.va.us/home/hzell/art279/level/level_room2.jpg - image of the room
http://tcos.com/sbforum/images/downloaded/arm1.jpg - image of the hand for a wheel creature

4.2 Overview of available platforms

The computer gaming industry has made significant progress in the field of interactive virtual worlds. It incorporates the most recent advances in 3D graphics, 3D sound, and real-time networking to build speedy and accessible entertainment products. The most sophisticated commercial game engines cost many thousands of dollars, but others, becoming outdated, often move to the public domain, though with restrictions on commercial use. For example, the Doom and Quake I-III engines' sources were published and forked into many projects developed by enthusiasts [16].

Usually game engines incorporate many different features (for example: graphics, sound, networking, collision detection, a scripting language) and allow developing a variety of different games on the same engine using its scripting language. At the same time, engines are usually designed and optimized for a specific application to meet specific requirements. For example, an engine for first-person shooters should be very speedy and responsive, but might not be as scalable as an engine for a massively multiplayer role-playing game, for which responsiveness is not as crucial as networking and scalability.

Our engine should be both responsive (though maybe not as responsive as first-person shooters, where milliseconds are sometimes of great importance) and scalable. It should incorporate some primitive general-purpose physics and should be able to work with the web-cam. In order to avoid the limitations of complete game engines optimized for a specific purpose, and to get a fully customizable, free, cross-platform architecture, the decision was made to use several freely available components instead of a monolithic game engine.

The desired components should take care of:

- 3D graphics

- Sound

- Collision detection and physics

- Networking

4.2.1 Graphics engines

There are plenty of free graphics engines, and many of them were considered, tested, and filtered out during the research. Figure 4-3 contains our comparison of the 3 engines which are most popular according to the Top 10 Open Source Engines list provided by DevMaster.Net [8]. Since Crystal Space appeared to be a game engine (which is not what we need), the most difficult choice was between Irrlicht [12] and the Object-Oriented Graphics Rendering Engine (OGRE) [19]. Both of them are free and cross-platform, with impressive demos. Irrlicht seems to be very speedy and relatively simple, but it is being developed and maintained by a single person. OGRE has more advanced graphics and a much bigger community, but it is a bit more complicated and possibly slower on older PCs.

                        Irrlicht      OGRE        Crystal Space
Graphics                Good          Very Good   Good
Performance             Very Good     Good        No info
Ease of Use             Easy          Average     Hard?
Status                  Alpha         Stable      Stable
License                 Zlib/libpng   LGPL        LGPL
Google "3d engine"      33800         50100       29300
Google "game engine"    52600         88300       64400

Figure 4-3. Comparison of 3D engines

It was not easy to decide which one to choose based on the available information. OGRE seemed to have better prospects due to its larger community and high-end graphical features, whereas Irrlicht had everything we needed at that moment and could be the easiest thing to start with. Demo examples for both of them were downloaded, compiled, and tested successfully before the decision was made to use Irrlicht as the simplest solution.

4.2.2 Physics engines

Physics engines allow collision detection and the simulation of quite realistic behavior of virtual 3D objects. Figure 4-4 lists the top 3 engines (according to our own online research). The main dilemma was the choice between the Open Dynamics Engine (ODE), which has a bigger community, and Bullet Physics, which has more sophisticated collision detection. Demo examples for both of them were downloaded and tested successfully; no principal difference between them was found.

Engine                  Trimesh collision detection   Performance   License                 Google
Open Dynamics Engine    No                            High          LGPL                    152000
Newton Game Dynamics    Limited to static objects     Average       Proprietary, but free   12100
Bullet Physics          Yes                           No info       LGPL                    27800

Figure 4-4. Comparison of physics engines

ODE's demos look a bit more impressive and were successfully compiled, though the lack of triangular mesh collision detection could be a serious disadvantage. However, it appeared that those engines do not compete, but rather complement each other, as shown in Figure 4-5.

Q: Does it inter-operate or compete with ODE?

A: It inter-operates. ODE can benefit from the collision detection features like GJK convex primitives and persistent manifold. Bullet can benefit from ODE, it can use the lcp solver, and from the ODE user community for its feedback.

Figure 4-5. Fragment of Bullet FAQ

Further research showed that Newton Game Dynamics is the only available deterministic engine, which is a significant advantage in case we decide to use this feature to minimize network traffic via local simulation of the world. In our view, this advantage outweighed its major disadvantage (it is proprietary and closed source), so the decision was made to use the Newton Game Dynamics engine.

4.2.3 Sound engines

The following software could be used for the audio subsystem:

- FMOD (a cross-platform library for playing music of different formats, free for noncommercial use)

- SDL (a cross-platform abstraction of sound routines, LGPL)

- OpenAL (a cross-platform API for 3D sound, under the LGPL)

OpenAL seems to be the most appropriate for our purpose. It provides the illusion of 3D sound via stereo output, sound attenuation, Doppler effects, etc. It has been used in the development of many games, including Unreal Tournament 2004, Doom 3, and Quake 4 [20]. However, our first prototype will be without sound, due to the limited timeframe and the focus of this research on the visual aspects of the virtual environment.

4.2.4 Video acquisition

The only cross-platform tool found was the Open Source Computer Vision (OpenCV) library [21], and we were very satisfied with it, since it provides not only a video acquisition interface but also a variety of image recognition algorithms, including the face detection used in the project.
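As a minimal sketch of how this looks with OpenCV's C interface of that period (the cascade file name and detection parameters below are illustrative, not the exact values used in the prototype):

#include <cv.h>
#include <highgui.h>

int main() {
    CvCapture* capture = cvCaptureFromCAM(0);        // default web-cam
    CvHaarClassifierCascade* cascade = (CvHaarClassifierCascade*)
        cvLoad("haarcascade_frontalface_default.xml", 0, 0, 0);
    CvMemStorage* storage = cvCreateMemStorage(0);

    while (IplImage* frame = cvQueryFrame(capture)) {
        cvClearMemStorage(storage);
        // Haar-cascade face detection (Section 2.3); returns CvRect entries.
        CvSeq* faces = cvHaarDetectObjects(frame, cascade, storage,
                                           1.2,  // scale step
                                           2,    // min neighbors
                                           CV_HAAR_DO_CANNY_PRUNING,
                                           cvSize(40, 40)); // min face size
        if (faces && faces->total > 0) {
            CvRect* r = (CvRect*)cvGetSeqElem(faces, 0);
            // Crop r, downscale it to ~20x20 pixels and map it onto the
            // avatar's head texture (see Section 4.3.1 for the bandwidth).
        }
        if (cvWaitKey(10) >= 0) break;
    }
    cvReleaseCapture(&capture);
    return 0;
}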

4.2.5 Network engines

Several network engines were considered to enable low-overhead, low-latency, multi-channel communication between the network nodes:

- RakNet (cross-platform; Creative Commons Attribution License, or a price starting from $100 for a single application; Google: 114,000); with Speex it can transmit voice (~0.6-2 kB/s)

- OpenTNL (GPL or $995 per developer; Google: 10,800)

- HawkNL (LGPL; Google: 18,000) with HawkVoice (~0.2-2 kB/s)

The RakNet engine was selected because it has clear tutorials online, it is available for free, its commercial license is cheap, and it allows voice transfer via low-bandwidth channels. Moreover, it takes care of network time synchronization, network ID generation, and other useful functionality.

4.3 Networking

By a low-bandwidth channel we mean just an ordinary internet channel, which is usually not more than 1 Mb/s (as opposed to high-bandwidth local networks, which are now from 100 Mb/s to 1 Gb/s). We do not target narrow modem connections, as they are outdated, and assume that our channel is at least 64 kb/s, which theoretically allows us to send up to 8 kilobytes of data per second. One of our goals was to enable communication via a low-bandwidth channel, which makes it possible for ordinary users to use the system over the internet.

4.3.1 Preliminary estimation of channel usage

According to the preliminary estimation, 64 kb/s should be enough for basic interaction:

1. 0.6-2 kB/s – audio (depending on quality)

2. 2-4 kB/s – video, which is enough to send 5-10 uncompressed 256-color images of 20x20 pixels per second (see Figure 4-6)

3. 1-4 kB/s – world state and overhead

We cannot guarantee any minimal FPS (frames-per-second) value, since it depends on the number of users in range as well as their bandwidths, but those who have 64 kb/s should get about 10 fps communicating with a single person, or 1 fps communicating with 10 persons. Video compression techniques could improve those values.

Figure 4-6. A low-resolution face image (8-bit, 20x17 pixels, on the left) could be mapped onto 3D bodies (photo-montage in the middle and on the right).
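A quick sanity check of this budget (the split between the streams is our illustrative choice):

#include <iostream>

int main() {
    // A 64 kb/s channel carries at most 64000 / 8 = 8000 bytes per second.
    const int channelBytesPerSec = 64000 / 8;
    // One uncompressed 256-color (1 byte per pixel) 20x20 face image:
    const int faceImageBytes = 20 * 20 * 1;         // 400 bytes
    const int videoBudget    = 10 * faceImageBytes; // 10 fps -> 4000 B/s
    const int audioBudget    = 2000;                // ~2 kB/s voice
    const int stateBudget    = channelBytesPerSec - videoBudget - audioBudget;
    std::cout << "video: " << videoBudget << " B/s, audio: " << audioBudget
              << " B/s, world state and overhead: " << stateBudget << " B/s\n";
    // Prints: video: 4000 B/s, audio: 2000 B/s, world state and overhead: 2000 B/s
}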

4.3.2 Network architecture

There are several commonly used network architectures which could be applied to the problem:

- Single server

- Multi-server

- Peer-to-peer

The least preferable architecture is the single server. Its disadvantages are obvious:

- Limited computing resources which do not scale with the growth of the world or the number of connected clients.

- Limited bandwidth which does not scale with the number of connected clients.

- Multiplied traffic: instead of direct data streams between peers, there is a downstream and an upstream for every client/server connection.

- Higher latency, since event information from one client has to go through the server before it reaches another client.

A multi-server architecture could be a better solution, since it is a compromise between the single server and peer-to-peer. This architecture is used, for example, in Second Life by Linden Lab [26]. There are millions of registered users there; however, when more than 30,000 are online simultaneously, they still experience scalability and performance problems.

Since the scalability of the system is important for our ambitious long-term goals, the most promising architecture for us is peer-to-peer, because it is supposed to be scalable and it does not require the purchase and maintenance of thousands of expensive servers (as Linden Lab does).

Not requiring much expense, it makes it possible to implement a virtual reality which could be free and accessible to everyone. Ideally, that would stimulate volunteers from the Open Source community to continue development of the project.

4.4 Prototype 1

4.4.1 Prototype 1 Goals

The purpose of this prototype was to integrate the RakNet library with the Irrlicht engine and Newton Game Dynamics (though without using physics simulation yet), to test our architecture, and to see if it works satisfactorily with primitive virtual bodies equipped with a virtual hand and controlled via keyboard and mouse.

4.4.2 Prototype 1 Design

As a basis for our first prototype we decided to use a client/server architecture, as it is the simplest case to start with. Since a single-server solution may have scalability issues in the future, we implement it in such a way that it can be distributed into a multi-server or peer-to-peer system later.

Initially it will be a centralized system running on Windows XP. The server will allow multiple clients to connect. Most of the code will be cross-platform and, as mentioned before, we develop it keeping in mind that this system is likely to be distributed in the future.

The first prototype of the system will consist of a client and a server with the following functionality:

1. Server

   a. Accepts connections

   b. Sends coordinates of objects to clients

   c. Distributes changes in coordinates requested by clients

2. Client

   a. Connects to the server

   b. Gets coordinates of objects to display

   c. Renders the 3D world from the virtual body's point of view

   d. Gets user input and changes the coordinates of the corresponding bodies

Figure 4-7. Idea of the centralized system (top) being converted to the decentralized system (bottom)

As mentioned earlier, the centralized architecture limits the scalability of the system and is just an intermediate step. Since the server does not simulate physics and is responsible for message distribution only, its role can be distributed among several servers or clients, thus converting the architecture into a multi-server or peer-to-peer network if needed (as shown in Figure 4-7).

4.4.3 Prototype 1 Implementation

During this stage, the working name VEVI (which stands for Virtual Environment with Video Interface) was given to the project.

The bodies were implemented as cubes, and the hand as a parallelepiped connected to the body with a joint. The graphics and the idea of how to integrate the Irrlicht engine with Newton Game Dynamics were borrowed from Mercior's example (http://www.mercior.com/files/newton-irrlicht.zip).

Figure 4-8. Prototype 1 – the green parallelepiped on the right is a user-controlled “hand”.

Having implemented the first prototype, we came to the following conclusions:

- The Irrlicht engine can be (and has been) successfully integrated with RakNet and Newton Game Dynamics.

- The server application (and the RakNet library in particular) is really cross-platform; it successfully compiles and runs under both the Windows XP and Linux operating systems.

- The proposed client/server architecture allows different users to connect and to see each other in the virtual space. However, despite the fact that we use a specialized library for fast UDP-based networking, significant latency sometimes occurs during the movement of the bodies.

The suggested explanation is that while a body moves, all the changes of coordinates are transferred too frequently. Two possible solutions to the problem were considered:

1) Limit the frequency of outgoing data (i.e., avoid sending the changed coordinates more than 30 times a second).

2) Instead of sending the coordinates of the moving bodies, send the messages which correspond to the user's input. The second way is preferable, since users usually do not generate too much input at a time, and it allows smooth motion to be simulated locally based on the received messages. If users generate more than 30 messages a second, we may still introduce 0.03 seconds of unnoticeable delay, which should not affect the smoothness of the movements.

Since user input can be represented in a very compact form (in comparison with the array of objects' coordinates) and is not likely to be generated too often, we decided that the second solution would reduce the utilized bandwidth and is preferable for the second prototype.
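To illustrate how compact such a message can be, consider the following hypothetical layout (our sketch; the actual packets were built on top of RakNet's own message format):

#include <cstdint>

// Instead of streaming every object's coordinates, only the user's action
// is sent and every client simulates its consequences locally.
#pragma pack(push, 1)
struct InputMessage {
    uint8_t  messageId; // protocol-level packet type
    uint16_t playerId;  // who generated the input
    uint32_t simTime;   // simulation tick the input belongs to
    uint8_t  action;    // e.g. MOVE_FORWARD, RAISE_LEFT_HAND, ...
    int8_t   param[3];  // quantized direction or joint angles
};
#pragma pack(pop)

// 11 bytes per action versus hundreds of bytes for a full coordinate dump;
// even 30 inputs per second stay well under 1 kB/s.
static_assert(sizeof(InputMessage) == 11, "unexpected padding");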

4.5 Prototype 2

4.5.1 Prototype 2 Goals

The second prototype's goals were:

- To implement the changes in networking, thus addressing the high latency issue.

- To implement the physics simulation on the client side.

- To adjust the physical parameters and physical bodies so that they could be used for further experiments.

- To check whether Newton Game Dynamics really behaves deterministically (otherwise the chosen networking solution would not work, due to a growing difference between the locally simulated worlds).

And, of course, to test whether all this works in a satisfactory fashion so that we can move on to the video-related part.

4.5.2 Prototype 2 Design

The second prototype of the system has a slightly more complicated server. Instead of just dispatching all the coordinates to all the clients, it has to store the incoming messages in a historical queue and to send the missing parts of the queue to the clients which do not have them yet. This is required to allow clients to synchronize the world state without sending the coordinates of every object. Only the users' actions need to be sent, since they are the only unpredictable factors which affect the deterministic behavior of the simulated physical world.

The client also became more complicated, not only because it needs to simulate the world, but also because it needs to synchronize it with the other clients. In order to do so, it stores 2 states of the world: the last simulated state and the guaranteed state. The world state is considered guaranteed (unchangeable) after a specific timeout (for example, 3 seconds). It is not allowed to change the past which is more than timeout ticks old, but it is allowed to change the newer past, since messages from other clients arrive with latency in comparison with locally generated and simulated ones; in that case the client is required to re-simulate (in the general case, starting from the state which existed before that message's timestamp, but for the second prototype, starting from the guaranteed state).

The simulation algorithm for the client can be briefly represented as the following loop:

1. Increment the simulation time.

2. Receive network and local messages and place them in the queue.

3. If any of the received messages was generated before the current simulation time, re-simulate the world from the guaranteed state (and advance the guaranteed state to the simulation time minus the timeout); otherwise simulate the next state of the world.
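The following C++ sketch shows one possible structure for this loop; the types are illustrative, and the deterministic simulate() step stands in for the actual Newton-based code:

#include <deque>

struct WorldState { /* positions, velocities, joint angles, ... */ };
struct Message    { long timeGenerated; /* encoded user input */ };

// Deterministic physics step: the same state plus the same inputs
// always yields the same result (provided by the physics engine).
WorldState simulate(const WorldState& s, const std::deque<Message>& inputs);

class Client {
public:
    void tick(const std::deque<Message>& incoming) {
        ++simTime;                                      // step 1
        bool late = false;
        for (const Message& m : incoming) {             // step 2
            queue.push_back(m);
            if (m.timeGenerated < simTime) late = true; // arrived "in the past"
        }
        if (late) {
            // Step 3, first branch: re-simulate forward from the guaranteed
            // state, replaying every queued message at its own tick.
            current = guaranteed;
            for (long t = guaranteedTime + 1; t <= simTime; ++t)
                current = simulate(current, messagesAt(t));
        } else {
            current = simulate(current, messagesAt(simTime)); // second branch
        }
        // Anything older than `timeout` ticks can no longer be changed by
        // late messages, so the guaranteed state is advanced up to there.
        while (simTime - guaranteedTime > timeout) {
            ++guaranteedTime;
            guaranteed = simulate(guaranteed, messagesAt(guaranteedTime));
        }
    }

private:
    std::deque<Message> messagesAt(long t) const { // inputs stamped with tick t
        std::deque<Message> out;
        for (const Message& m : queue)
            if (m.timeGenerated == t) out.push_back(m);
        return out;
    }

    long simTime = 0;
    long guaranteedTime = 0;
    static const long timeout = 90; // e.g. 3 seconds at 30 ticks per second
    WorldState current, guaranteed;
    std::deque<Message> queue;      // full message history
};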

4.5.3 Prototype 2 Implementation

Since the second prototype involved significant changes in networking and introduced physics, it took much experimentation and required several intermediate prototypes to make it work.

The first major challenge was the physics and physical control over the body. In order to obey physical laws, we had to make the body horizontal so that it looked like a bug, thus preserving a relatively simple body structure and avoiding falls after each step or ball kick. An interesting side effect of this design is that if the virtual bug falls on its back, it experiences the same difficulties as a real one.

As shown in Figure 4-9, the body model was implemented via joined parallelepipeds, which were used not only for the hands and torso, but also for "wheel-legs" which produced the illusion of stepping. The second problem was to address constraint issues for the joints, allowing control of the "hands" and "legs" without breaking the stability of the model through excessive forces and accelerations.

Figure 4-9. Virtual bodies are physically stable and simple when shaped like bugs.

Fortunately, it appeared that Newton Game Dynamics has a special API for calculating the accelerations required to set a joint to the desired angle. Eventually the required masses, sizes, and friction coefficients were adjusted so that those robots were controllable enough to play with a ball, and that concluded the first big change to the project.

However, this body structure looks too different from a human being, and the human body is too complicated to model physically, so a better solution for the body was found and used: the up-vector constraint provided by the Newton Game Dynamics API, which allowed constructing a stable vertical body which is a bit more human-like, as shown in Figure 4-10.

Having implemented the client and the server according to the algorithms explained in the previous section, the tests showed that the algorithm is feasible for real-time networking and client-side re-simulation: the latency is almost unnoticeable. However, due to unknown bugs in the code and differences in the performance of the computers used for the tests, de-synchronization of the world still happens from time to time. If the de-synchronized client is disconnected and connected again, it retrieves the whole event list from the server and reaches the correct state of the world, which proves that Newton Game Dynamics behaves deterministically on the tested machines and that the general idea of sending only information about user actions can work.

Figure 4-10. Two avatars play with a ball (on the left – camera is outside the body, on the right – camera is inside one of the avatars)

That means that, leaving the minor defects aside due to the limited timeframe, we are at last ready to move on to the video part of the project.

4.6 Prototype 3

4.6.1 Prototype 3 Goals

The general goal of this prototype is to implement the first test case ("Ball game"). The objectives are the following:

- To integrate the project with the OpenCV library and to see if it is able to get the image from the web-cam.

- To integrate the face detection algorithm provided by OpenCV into the project, and to map the detected face onto arbitrary 3D objects (later it should be mapped onto the head of the virtual body only).

- To adapt and enhance the algorithm developed in the preliminary research for hand tracking in 3 dimensions, and to kick the ball with a hand gesture.

4.6.2 Prototype 3 Design

The main design innovation, apart from adding the OpenCV calls, was the embedding and improvement of the hand tracking algorithm.

As explained in the preliminary research, the hand angles were estimated based on the average angles of pixels in the filtered difference maps relative to the estimated (almost guessed) position of the body. That algorithm worked well only during intense and frequent motions of both hands, and it allowed the estimation of only two angles in one plane.

However, having integrated the Haar classifier cascade into the project, we got an opportunity to improve the estimation of the human pose using the knowledge acquired from the face detection (as shown in Figure 4-11). This increases the accuracy of the method and makes it more stable, reducing the reliance on permanent motion evenly distributed between the hands (Figure 4-12).


Figure 4-11. The detected face coordinates and size allow the positions of the body center and shoulders to be approximated easily.

Figure 4-12. Hand tracking based on information from face detection
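
To make the method concrete, here is a minimal sketch of the face-anchored hand-angle estimate using the OpenCV 1.x C API of that period; the shoulder proportions, the motion threshold and the angle statistics are illustrative guesses, not the thesis' exact constants:

```cpp
// Sketch: estimate one hand's angle as the mean direction of motion pixels
// on that side of the body, with the shoulder position derived from the
// detected face rectangle.
#include <cv.h>
#include <math.h>

// prev/curr: consecutive grayscale frames; face: rectangle from the detector;
// side: -1 for the left hand, +1 for the right hand.
float EstimateHandAngle(IplImage* prev, IplImage* curr, CvRect face, int side)
{
    // 1. Filtered difference map: only moving pixels survive.
    IplImage* diff = cvCreateImage(cvGetSize(curr), IPL_DEPTH_8U, 1);
    cvAbsDiff(prev, curr, diff);
    cvThreshold(diff, diff, 30, 255, CV_THRESH_BINARY);

    // 2. Shoulder guessed from the face box (assumed body proportions).
    float sx = face.x + face.width * (0.5f + side * 1.0f);
    float sy = face.y + face.height * 1.5f;

    // 3. Mean angle of motion pixels on this side, relative to the shoulder.
    double sum = 0.0; int count = 0;
    int x0 = (side > 0) ? (int)sx : 0;
    int x1 = (side > 0) ? diff->width : (int)sx;
    for (int y = 0; y < diff->height; ++y)
        for (int x = x0; x < x1; ++x)
            if (CV_IMAGE_ELEM(diff, unsigned char, y, x)) {
                sum += atan2((double)(y - sy), (double)((x - sx) * side));
                ++count;
            }
    cvReleaseImage(&diff);
    return count ? (float)(sum / count) : 0.0f; // radians; 0 if no motion seen
}
```

The third angle (motion toward or away from the camera) would then be guessed from the deformation of the moving blob, as described next.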

The same algorithm could also be slightly modified to roughly estimate the angles in the remaining dimension, based on the visual deformation of the hand as it moves in the horizontal plane (as shown in Figure 4-13 and Figure 4-14).

Figure 4-13. Visual hand deformation due to the movement in the horizontal plane

Figure 4-14. A rough guess of the angle in the horizontal plane can be achieved by estimating the visual deformation of the hand as it moves in that dimension.

4.6.3 Prototype 3 Implementation

The OpenCV library was integrated into the project without any serious problems, and the code from the face detection example was also integrated easily. However, it took some time to figure out how to update Irrlicht's textures on the fly and how to set the camera's frame rate (the OpenCV library was still in development and lacked this feature).
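
For reference, the detection call the prototype builds on looks roughly as follows; the parameters follow the stock OpenCV 1.x face detection example, and the cascade file name is illustrative:

```cpp
// Sketch: Viola-Jones face detection with the OpenCV 1.x C API.
#include <cv.h>
#include <highgui.h>

CvRect* DetectFace(IplImage* frame, CvHaarClassifierCascade* cascade,
                   CvMemStorage* storage)
{
    cvClearMemStorage(storage);
    CvSeq* faces = cvHaarDetectObjects(frame, cascade, storage,
                                       1.1,  // scale step between search sizes
                                       3,    // neighbors needed to accept a hit
                                       CV_HAAR_DO_CANNY_PRUNING,
                                       cvSize(30, 30)); // smallest face searched
    return (faces && faces->total > 0) ? (CvRect*)cvGetSeqElem(faces, 0) : 0;
}

// Typical setup (file name illustrative):
// CvHaarClassifierCascade* cascade = (CvHaarClassifierCascade*)
//     cvLoad("haarcascade_frontalface_default.xml", 0, 0, 0);
// CvMemStorage* storage = cvCreateMemStorage(0);
```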

The first attempt to implement the prototype caused a drastic drop of the frame rate down to 2-4 fps, which was obviously unacceptable for the project and could have resulted in its failure.

However, moving the camera-related routines out to a separate thread solved the issue and the frame rate came back to normal (it is hard-coded to be about 30 fps).

Later, the processing of hand motion was also moved to an individual thread, since it is important to get this information as quickly as possible: hands normally change their position faster than the head, while head detection requires a lot of computation.
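
The resulting thread layout can be sketched as below; we use C++11 std::thread for brevity (the 2007 prototype would have used platform threads), and the shared state and empty worker bodies are illustrative:

```cpp
// Sketch: capture, hand tracking and face detection each run on their own
// thread so the render/simulation loop never waits on the camera.
#include <cv.h>
#include <highgui.h>
#include <atomic>
#include <mutex>
#include <thread>

std::atomic<bool> g_running(true);
std::mutex g_frameMutex;
IplImage* g_latestFrame = 0;  // most recent web-cam frame, guarded by g_frameMutex

void CaptureThread(CvCapture* capture)   // runs at camera rate (~30 fps)
{
    while (g_running) {
        IplImage* f = cvQueryFrame(capture);  // blocks until a frame arrives
        std::lock_guard<std::mutex> lock(g_frameMutex);
        g_latestFrame = f;
    }
}

void HandTrackingThread()  { /* cheap difference-map angles on g_latestFrame */ }
void FaceDetectionThread() { /* expensive cvHaarDetectObjects, runs less often */ }

// Started once at startup, e.g.:
// std::thread cap(CaptureThread, capture);
// std::thread hands(HandTrackingThread);
// std::thread face(FaceDetectionThread);
```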

This prototype showed that it is in fact possible to integrate image recognition with world (re-)simulation and rendering on an ordinary computer, and that even with the existing primitive hand tracking algorithm (which could probably be improved or replaced with other computer vision techniques) it is possible to kick the virtual ball with a real hand gesture.

The secondary conclusion is that mapping the low-resolution (~20x20 pixels) image of the detected face works quite stably: regardless of the position of the face in the original image retrieved from the web-cam, it always appears on the object it is mapped to. At the same time, the mapped low-resolution image of the face seems to be sufficient to recognize its owner and to guess its expression. However, further research involving human subjects could give more accurate data on this issue.
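
Updating the mapped face at run time relies on Irrlicht's texture locking; a minimal sketch, assuming a 32-bit texture whose dimensions match the face image (the conversion from the OpenCV frame is omitted):

```cpp
// Sketch: copy the latest face image into an Irrlicht texture on the fly.
#include <irrlicht.h>
#include <cstring>

void UploadFaceTexture(irr::video::ITexture* tex,
                       const unsigned char* faceRGBA, int w, int h)
{
    void* pixels = tex->lock();   // direct CPU access to the texture
    if (!pixels) return;
    std::memcpy(pixels, faceRGBA, (size_t)w * h * 4);
    tex->unlock();                // re-upload; the head shows the new frame
}
```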

4.7 Prototype 4

4.7.1 Prototype 4 Goals

The goals of the 4th prototype are the following:

- To create and import graphical 3D models for the hands and the pot.

- To implement the simplified second test case ("Chemistry lab"), adding the pot to the virtual environment and attempting to grab it with the hands.

4.7.2 Prototype 4 Design

The design of prototype 4 does not differ much from that of prototype 3, since we use exactly the same principles for hand control and object interaction. Only two changes need to be made:

- To import the designed graphical models (3D meshes) from Blender.

- To add the physical model of the pot, built from primitives: cylinders and a box.

4.7.3 Prototype 4 Implementation

The new feature of this prototype is the use of graphical models which differ from their physical representation: for simulation purposes the hands are still considered to be stretched spheres, whereas 3D meshes are used for their visualization, as shown in Figure 4-15.

Figure 4-15. Visual representation of the avatar (on the left) and its shape for simulation (on the right)

A similar approach was used for the pot visualization (Figure 4-17).

Another new feature is the pot: an object with the special shape shown in Figure 4-16, which makes it possible to grab it with the hands using the laws of physics (gravity and friction). Despite the fact that this simplified object looks more like a chalice, we intentionally call it a pot to be consistent with our "Chemistry lab" example.

Figure 4-16. The physical shape of the pot

Figure 4-17. The pot being grabbed with the hand

Experiments showed that it is possible to grab the pot with one or two hands (Figure 4-17, Figure 4-18). However, it was much harder to do before we added a vertical attractor to the pot to prevent it from tilting. The drawback of the vertical attractor is that it limits the range of applications of the pot: it is not possible to use it for pouring liquid the way we proposed for the second test case.
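
The text above does not fix a particular implementation of the attractor; one plausible sketch is a corrective torque in the pot's Newton 1.x force-and-torque callback, where the gains, gravity handling and damping are our illustrative choices (the up-vector constraint shown earlier would be an alternative):

```cpp
// Sketch: torque the pot's up axis back toward the world up axis each step.
#include <Newton.h>

void PotForceAndTorque(const NewtonBody* body)
{
    dFloat m, Ixx, Iyy, Izz, mat[16], omega[3];
    NewtonBodyGetMassMatrix(body, &m, &Ixx, &Iyy, &Izz);
    NewtonBodyGetMatrix(body, mat);
    NewtonBodyGetOmega(body, omega);

    dFloat gravity[3] = {0.0f, -9.8f * m, 0.0f};
    NewtonBodyAddForce(body, gravity);

    // Body's up axis is the second row of the Newton transform matrix.
    dFloat ux = mat[4], uz = mat[6];

    // cross(up, worldUp) gives the righting axis; damping suppresses wobble.
    const dFloat k = 20.0f, d = 2.0f;             // illustrative gains
    dFloat torque[3] = { -k * uz - d * omega[0],
                                 - d * omega[1],
                          k * ux - d * omega[2] };
    NewtonBodyAddTorque(body, torque);
}
// Installed with: NewtonBodySetForceAndTorqueCallback(potBody, PotForceAndTorque);
```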

We expect that when the accuracy of the hand tracking algorithm is improved and more degrees of freedom are added to the hands, it will become easier to grab the pot and to pour liquid from it without the vertical attractor. This will also require more detailed physical models of the hands, the pot and the liquid.

Figure 4-18. The pot being grabbed with 2 hands (first person view)


CHAPTER 5. RESULTS AND DISCUSSION

In this work we examined the potential of, and proposed solutions for, a distributed virtual environment with a video-based interface. We proved the concept and showed that common hardware is sufficient to let users interact in virtual 3D space via hand movements acquired from a web-cam and tracked with the help of computer vision techniques. We identified the challenges, developed a hand tracking algorithm, explored the limitations of the chosen solutions, answered the research questions, proposed the framework and suggested topics for future research.

5.1 Proof of concept

This study showed that a video-based interface is feasible on common hardware with an internet connection and simple algorithms. The proposed interaction paradigm works: hand movements acquired from the web-cam, mapped onto the virtual body and mediated via modeled physics, can be used to kick the virtual ball or grab the virtual pot, i.e. to manipulate virtual objects in the same fashion as we do with real ones.

The proof of concept is significant since it demonstrates that, once a few issues are resolved, hand-tracking interaction can become a very useful technique and, coupled with simulated physics and 3D graphics, can provide an essential and globally accessible interface for virtual environments.

5.2 Identified problems

5.2.1 Physical avatar

As shown in this work, we proposed to use simulated physics as a mediator for interaction in the virtual environment. This is a significant difference between this project and many others, which either do not use physics at all, or use it in a primitive fashion to give the virtual world a realistic look without employing it for actual interaction. In such projects the virtual body looks good because it has been carefully modeled by professional artists and animated in special 3D software, but its physical model (if there is one) is usually approximated by a stretched sphere, which makes it impossible to interact with objects the way we do in real life.

Since this project did not focus on digital art presentation, it was difficult to make the avatar look realistic. Nevertheless, we managed to demonstrate that our prototype interacts with simulated objects using an ordinary web-cam. The physical model is also simplistic, but it is detailed up to the level of the extremities, which makes it possible to kick or grab objects with the virtual hands.

Enhanced and developed in more detail, this method of interaction should allow precise and diverse manipulation of objects and avatars.

5.2.2 Mapping movements to virtual body

It is hard to control the 3-dimensional motion of two extremities simultaneously via keyboard and mouse in real time. Simulating physical behavior in the virtual environment does not make much sense if we do not map real movements onto the virtual body. However, this is not a trivial task, because it requires a non-traditional input device to track the user's movements. Even though there are a few devices on the market which allow 3D hand tracking (such as wired gloves [22] or the Wii remote [36]), they are not that popular and have a limited range of applications. This does not mean that they should not be used in the virtual environment, but that topic could be the subject of separate research.

The advantages of using a web-camera as an input device are the following:

- it is cheap and very common now;

- theoretically it allows not only hand tracking, but tracking of the whole body;

- it has to be used anyway to enable full-fledged visual communication between people.

Even though the field of computer vision offers many techniques for intelligent image processing, the problem of precise real-time 3D human pose estimation from images acquired by a monocular camera has not been solved yet. This work is an attempt to develop a combination of simple algorithms and freely available techniques for this purpose, and it shows that this is doable to some degree. Further development of the physical model and the image processing algorithms should make it possible to control virtual objects from anywhere with a laptop and an embedded web-cam.

5.2.3 Object representation

Creation of even simple virtual objects (the avatar, the ball, the chemical pot) turned out to be a challenge. Several issues needed to be addressed.

First, each object has physical properties such as shape, mass and friction, which need to be adjusted so that it behaves realistically and consistently in the virtual environment.

Second, those objects have a visual representation, which is usually implemented in higher detail than the physical model. This raises the problem of careful object modeling, which has to strike the right balance between detail and simplicity, and at the same time ensure that the difference between the physical and graphical models does not significantly affect the user experience.

The third issue related to object representation is that there is a variety of potential objects which cannot be represented adequately with the existing physics engine: for example, objects which are not rigid bodies, or objects whose behavior depends not only on Newtonian laws, but also on internal states or other types of interaction.

Most physics simulation engines behave realistically only within a certain range of physical parameters (for example, they do not consider relativistic or quantum effects). Beyond a certain level of detail, realistic simulation of even simple physical objects or phenomena, considering all possible kinds of forces and physical effects, becomes computationally expensive.

However, if a specific kind of complex behavior is needed, we suggest embedding techniques used in other projects, such as scripted objects and qualitative physics [5], which could supplement the existing interaction paradigm and add more functionality and value to the virtual environment as a tool for education and collaboration.

5.3 Proposed solutions

Important aspects of the virtual environment were considered and tried; possible solutions for networking, simulation, bandwidth reduction, video conferencing and interaction were proposed.

5.3.1 Interaction paradigm

Using a physical world and physical models instead of purely visual avatars (as done in several projects) potentially allows interaction with virtual objects as with real ones. Widely spread web-cams are likely to be a cost-efficient way to make this interaction easy and natural. Our experiments showed that the proposed methods allow users to interact with each other in the simulated virtual environment, for example by kicking the virtual ball or by grabbing and throwing a virtual object.

We are not aware of any other project which uses a detailed physical model of the human body and integrates it with monocular video input to control the character, so we consider this to be one of our major contributions.

5.3.2 Human communication

We propose mapping the face acquired from video onto the virtual head and sending only this video data over the network in unreliable, lower-priority UDP packets. This should solve the scalability issue for video conferencing, enabling the faces of many people to be seen in the virtual reality simultaneously. This is partially done in other projects, such as CU-SeeMe [10].

We went further by integrating it with a physical virtual body whose head and hands reproduce the gestures of the real ones, thus improving non-verbal communication between the parties.

5.3.3 Scalability and networking

Using UDP decreases the network latency. Only the user input events need to be sent, since everything else is deterministic and can be simulated (and re-simulated for delayed events) in real time on the client side.
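
The payload of such an event is tiny compared to streaming object positions; a sketch follows, with an illustrative event layout and raw BSD sockets standing in for whatever networking layer is actually used (a real implementation would also convert fields to network byte order):

```cpp
// Sketch: sending only a user action, from which every client can
// deterministically re-simulate the world.
#include <sys/socket.h>
#include <netinet/in.h>

struct UserEvent {              // illustrative wire layout
    unsigned int seq;           // sequence number, used for ordering and resync
    unsigned int userId;        // who acted
    float        time;          // simulation time of the action
    float        handAngles[4]; // tracked hand joint angles
};

void SendEvent(int sock, const sockaddr_in& server, const UserEvent& e)
{
    sendto(sock, &e, sizeof(e), 0,
           (const sockaddr*)&server, sizeof(server));
}
```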

In some projects (such as Second Life) the clients extrapolate coordinates based on the speed of objects. This allows smooth movement even when the delay between packets is high, but it sometimes causes unrealistic behavior such as penetrating other objects or falling through the ground.

With physics simulated on the client side we are able to hide the latency, so that the locally estimated behavior of objects is smooth and realistic even if packets arrive with significant delays. If nobody generates events affecting those objects, this estimation is exact. This technique decreases the required channel capacity at the expense of increasing the requirements on the client station.

Since the communication channel is the bottleneck which limits the scalability of the system, we consider this tradeoff reasonable. Moreover, modern multi-core processors allow efficient execution of many threads simultaneously, so a separate thread for physics simulation would not slow the client software down.

We also suggest increasing scalability via a distributed architecture. The major bottlenecks of massively multiplayer games and virtual environments are the communication channels and the server resources, and we tried to address both. Even though we use a client/server architecture, the only responsibility of the server is queuing and relaying messages. This is computationally cheap and could even be done by the clients themselves, preferably utilizing multicast functionality where available.
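
Because the server never simulates anything, its whole job fits in a short loop; a sketch under the same assumptions as above (raw sockets, illustrative containers, client registration omitted):

```cpp
// Sketch: relay-only server that logs each event datagram and forwards it
// to all known clients; the log lets reconnecting clients resynchronize.
#include <vector>
#include <sys/socket.h>
#include <netinet/in.h>

void RelayLoop(int sock, std::vector<sockaddr_in>& clients,
               std::vector<std::vector<char> >& eventLog)
{
    char buf[512];
    for (;;) {
        sockaddr_in from; socklen_t len = sizeof(from);
        ssize_t n = recvfrom(sock, buf, sizeof(buf), 0, (sockaddr*)&from, &len);
        if (n <= 0) continue;
        eventLog.push_back(std::vector<char>(buf, buf + n)); // full history
        for (size_t i = 0; i < clients.size(); ++i)          // fan out
            sendto(sock, buf, (size_t)n, 0,
                   (const sockaddr*)&clients[i], sizeof(clients[i]));
    }
}
```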

5.4 Developed algorithm

During this project, several variations of a simple image recognition algorithm were developed and improved (although we cannot guarantee that nobody has used the same approach before). The idea behind the algorithm is to use some a priori knowledge about the image together with statistical methods to determine the angles of the hands. An advanced version of the algorithm uses the more reliable knowledge resulting from the face detection functionality provided by the OpenCV library, and also tries to estimate the angles of the hands in the (horizontal) plane.

5.5 Estimated feasibility of created interface

This algorithm allows a rough estimation of the hand positions which is sufficient for primitive cases (for example, hitting the ball as in test case 1, "Ball game"), but the accuracy and stability of the estimation need to be improved in order to make it useful for more complex cases.

In order to implement the second test case ("Chemistry lab" with a pot), we had to significantly simplify and constrain the physical properties of the pot so that it could be hooked by a hand. This method allows grabbing and throwing objects, but it is not good enough for real applications yet. It is, however, good enough to show that a virtual hand can be controlled in 3 dimensions via real-time processing of flat images from a cheap web-camera and can be used to interact with virtual objects.

We believe that the constraints and simplifications of this project stem from the limited timeframe and can be addressed in future research. For example, in order to implement pouring chemicals from one pot to another, the pot should be designed in more detail and the liquid should be represented as a set of drops which obey Newtonian (and possibly other) laws. As an alternative approach, qualitative physics with particle effects could be used to simulate the liquid flow, as was done by Cavazza, Hartley, Lugrin and Le Bras in their work [5].

We also believe that future research will improve the video-based interface and reduce the role of keyboard and mouse in the virtual environment, making the interaction more natural.

This requires improving the accuracy and stability of the hand tracking algorithms, as well as adding more degrees of freedom and detail, down to the level of the fingers.

5.6 Integration with web services

It could be beneficial to integrate this work with a web interface and embed it into a variety of online projects, like the mentioned Web-Based Simulation project. Even though this was not possible during this work, it can be done once efficient Java-based modules for the required components of the system (such as fast deterministic physics and an image recognition library) become available. Some of those components already exist, such as the Java version of the Irrlicht engine [28], which makes this an interesting subject for future research.

5.7 Future research

Several issues should be addressed to fully complete the scalable virtual environment with a video interface:

• Integrate audio (VoIP) and implement sound distribution and attenuation to allow users to communicate with each other without touching the keyboard.

• Switch to a peer-to-peer or other distributed network architecture to increase the scalability of the system.

• Improve the video recognition techniques and the user interface to allow a really convenient and powerful way of interaction.

• Examine the scalability, usability and performance issues, involving human subjects in the research.

• Integrate with web services such as the Web-Based Simulation project.

• Embed a scripting language to enable interactive objects.

• Implement qualitative physics to enable other kinds of physical interaction and effects (such as heat exchange, pouring of water, etc.).

• Embed interfaces for external applications and devices (such as electronic microscopes, robots, satellites, physical devices, etc.) which could be used for education and research purposes.

Regarding 3D human pose estimation, much work is being conducted by leading research institutions. When better techniques for real-time tracking of body parts become available, they will open the potential for commercial and educational projects enabling a 3D interface similar to the real physical world (instead of the menus, clicks and icons used nowadays).

Having addressed all the mentioned issues, it would be possible to start an open-source project which could make virtual reality available to everybody. Enthusiasts could significantly facilitate the creation of the worlds they live in (as they do in Second Life [26] and other systems).

REFERENCES

1. AccessGrid.Org (n. d.). Retrieved October 2, 2006 from http://www.accessgrid.org/

2. Auyang, S. Y. (1999, May). Cognitive and neural processes that make possible vision. Retrieved June 4, 2007 from http://www.creatingtechnology.org/papers/vision.htm

3. Blanz, V. (2004). Learning-Based Approaches. In Facial modeling and animation. ACM SIGGRAPH 2004 Course Notes (Los Angeles, CA, August 08-12, 2004). SIGGRAPH '04. ACM Press, New York, NY, 6.

4. Burdea, G. & Coiffet, P. (1994). Virtual reality technology. New York: Wiley.

5. Cavazza, M., Hartley, S., Lugrin, J., and Le Bras, M. (2004). Qualitative physics in virtual environments. In Proceedings of the 9th International Conference on Intelligent User Interfaces (Funchal, Madeira, Portugal, January 13-16, 2004). IUI '04. ACM Press, New York, NY, 54-61.

6. CAVE User's Guide (n. d.). Retrieved June 08, 2007 from http://www.evl.uic.edu/pape/CAVE/prog/CAVEGuide.html

7. Cheng, K. & Takatsuka, M. (2005). Real-time Monocular Tracking of View Frustum for Large Screen Human-Computer Interaction. ACM International Conference Proceeding Series, Vol. 102.

8. DevMaster.Net – Your source for game development (n. d.). Retrieved October 30, 2006 from http://www.devmaster.net/engines/

9. Face Detection using OpenCV (2006, August 27). Retrieved June 9, 2007 from http://opencvlibrary.sourceforge.net/FaceDetection

10. Han, J. & Smith, B. (1996). CU-SeeMe VR immersive desktop teleconferencing. In Proceedings of the Fourth ACM International Conference on Multimedia (Boston, Massachusetts, United States, November 18-22, 1996). MULTIMEDIA '96. ACM Press, New York, NY, 199-207.

11. Hiwada, K., Maki, A., and Nakashima, A. (2003). Mimicking video: real-time morphable 3D model fitting. In Proceedings of the ACM Symposium on Virtual Reality Software and Technology (Osaka, Japan, October 01-03, 2003). VRST '03. ACM Press, New York, NY, 132-139.

12. Irrlicht Engine – A free open source 3D engine (n. d.). Retrieved October 30, 2006 from http://irrlicht.sourceforge.net

13. Kauff, P. & Schreer, O. (2002). An immersive 3D video-conferencing system using shared virtual team user environments. In Proceedings of the 4th International Conference on Collaborative Virtual Environments (Bonn, Germany, September 30 - October 02, 2002). CVE '02. ACM Press, New York, NY, 105-112.

14. Kjeldsen, R. (2005, August). IBM Head Tracking Pointer User's Manual. Retrieved June 8, 2007 from http://dl.alphaworks.ibm.com/technologies/headpointer/Head_Tracking_Pointer_Users_Manual.pdf

15. Lee, S., Kim, I., Ahn, S. C., Lim, M., and Kim, H. (2005). Toward immersive telecommunication: 3D video avatar with physical interaction. In Proceedings of the 2005 International Conference on Augmented Tele-Existence (Christchurch, New Zealand, December 05-08, 2005). ICAT '05, Vol. 157. ACM Press, New York, NY, 56-61.

16. List of game engines. (2007, July 5). In Wikipedia, The Free Encyclopedia. Retrieved July 8, 2006 from http://en.wikipedia.org/wiki/List_of_game_engines

17. Lok, B. (2001). Online model reconstruction for interactive virtual environments. In Proceedings of the 2001 Symposium on Interactive 3D Graphics, I3D '01. ACM Press, New York, NY, 69-72.

18. Mather, G. (2006). Foundations of Perception. Psychology Press. Retrieved June 4, 2007 from http://www.psypress.com/mather/resources/topic.asp?topic=ch01-tp-01

19. OGRE 3D: Open source graphics engine (n. d.). Retrieved October 30, 2006 from http://www.ogre3d.org

20. OpenAL. (2007, June 11). In Wikipedia, The Free Encyclopedia. Retrieved July 17, 2006 from http://en.wikipedia.org/wiki/OpenAL

21. OpenCV Library Wiki (n. d.). Retrieved February 16, 2007 from http://opencvlibrary.sourceforge.net/

22. P5 Glove – Scratchpad Wiki Labs (n. d.). Retrieved June 08, 2007 from http://scratchpad.wikia.com/wiki/P5_Glove

23. Rajaei, H. & Barnes, A. (2006). A Real-Time Interactive Web-Based Environment for Training. In Proceedings of the International Conference on Modeling and Simulation - Methodology, Tools, Software Application (M&S-MTSA'06), July 31 - August 2, 2006, Calgary, Canada.

24. Rajaei, H. & Dieball, A. (2007). A Shared-View Web-Based Environment for Training. In Proceedings of the 10th Communications and Networking Simulation Symposium, CNS'07, sponsored by ACM/SCS, March 26-28, Norfolk, Virginia, 6.

25. Rajaei, H. (2004). Distributed Virtual Training Environment. In Proceedings of the Swedish-American Workshop on Modeling and Simulation, February 2004, Florida, 195-200.

26. Second Life (n. d.). Retrieved October 2, 2006 from http://secondlife.com

27. Skype (n. d.). Retrieved October 2, 2006 from http://skype.com

28. SourceForge.net: Jirr (n. d.). Retrieved July 8, 2006 from https://sourceforge.net/projects/jirr/

29. Steinicke, F., Ropinski, T. & Hinrichs, K. (2005, December). A Generic Virtual Reality Software System's Architecture and Application. In ACM International Conference Proceeding Series, ICAT '05, Vol. 157, 220-227.

30. Tarini, M., Yamauchi, H., Haber, J. & Seidel, H. P. (2004). Texturing Faces. In Facial modeling and animation. ACM SIGGRAPH 2004 Course Notes (Los Angeles, CA, August 08-12, 2004). SIGGRAPH '04. ACM Press, New York, NY, 6.

31. There – the online virtual world that is your everyday hangout (n. d.). Retrieved June 09, 2007 from http://www.there.com/

32. Viola, P. & Jones, M. (2001). Rapid object detection using a boosted cascade of simple features. IEEE Conference on Computer Vision and Pattern Recognition.

33. Virtual environment. (2006, September 26). In Wikipedia, The Free Encyclopedia. Retrieved October 02, 2006 from http://en.wikipedia.org/wiki/Virtual_environment

34. Virtual reality. (n. d.). In The Columbia Electronic Encyclopedia, Sixth Edition. Retrieved October 02, 2006 from http://www.answers.com/topic/virtual-reality

35. VoipBuster (n. d.). Retrieved October 2, 2006 from http://voipbuster.com

36. Wii.Nintendo.com – In-Depth Regional Wii Coverage (n. d.). Retrieved June 08, 2007 from http://wii.nintendo.com/controller.jsp

37. Wilson, P. I. & Fernandez, J. (2006, April). Facial feature detection using Haar classifiers. Journal of Computing Sciences in Colleges, Vol. 21, Issue 4, 127-133.

38. Zhu, Z. & Ji, Q. (2004). Real Time 3D Face Pose Tracking From an Uncalibrated Camera. Retrieved June 08, 2007 from http://www.cv.iit.nrc.ca/VI/fpiv04/pdf/12ft.pdf