
ACTA UNIVERSITATIS OULUENSIS C Technica 531

MATTI POUKE

AUGMENTED VIRTUALITY
Transforming real human activity into virtual environments

Academic dissertation to be presented with the assent of the Doctoral Training Committee of Technology and Natural Sciences of the University of Oulu for public defence in the OP auditorium (L10), Linnanmaa, on 21 August 2015, at 12 noon

UNIVERSITY OF OULU, OULU 2015

Copyright © 2015
Acta Univ. Oul. C 531, 2015

Supervised by
Professor Petri Pulli
Doctor Seamus Hickey

Reviewed by
Professor Tapio Takala
Professor Yoshitsugu Manabe

Opponent
Professor Kunihiro Chihara

ISBN 978-952-62-0833-6 (Paperback) ISBN 978-952-62-0834-3 (PDF)

ISSN 0355-3213 (Printed) ISSN 1796-2226 (Online)

Cover Design Raimo Ahonen

JUVENES PRINT TAMPERE 2015

Pouke, Matti, Augmented virtuality. Transforming real human activity into virtual environments
University of Oulu Graduate School; University of Oulu, Faculty of Information Technology and Electrical Engineering, Department of Information Processing Science; Center for Internet Excellence
Acta Univ. Oul. C 531, 2015
University of Oulu, P.O. Box 8000, FI-90014 University of Oulu, Finland

Abstract

The topic of this work is the transformation of real-world human activity into virtual environments. More specifically, the topic is the process of identifying various aspects of visible human activity with sensor networks and studying the different ways in which the identified activity can be visualized in a virtual environment.

The transformation of human activities into virtual environments is a rather new research area. While there is existing research on sensing and visualizing human activity in virtual environments, that research is usually carried out within a specific type of human activity, such as basic actions and locomotion. However, different types of sensors can provide very different human activity data and lend themselves to very different use-cases. This work is among the first to study the transformation of human activities on a larger scale, comparing various types of transformations from multiple theoretical viewpoints.

This work utilizes constructs built for use-cases that require the transformation of human activity for various purposes. Each construct is a mixed reality application that utilizes a different type of source data and visualizes human activity in a different way. The constructs are evaluated from practical as well as theoretical viewpoints.

The results imply that different types of activity transformations have significantly different characteristics. The most distinct theoretical finding is that there is a relationship between the level of detail of the transformed activity, the specificity of the sensors involved and the extent of world knowledge required to transform the activity. The results also provide novel insights into using human activity transformations for various practical purposes. Transformations are evaluated as control devices for virtual environments, as well as in the context of visualization and simulation tools in elderly home care and urban studies.

Keywords: 3D user interfaces, healthcare information technology, information visualization, mixed reality, urban planning, virtual reality

Pouke, Matti, Augmented virtuality. Transforming real-world human activity into virtual reality
University of Oulu Graduate School; University of Oulu, Faculty of Information Technology and Electrical Engineering, Department of Information Processing Science; Center for Internet Excellence
Acta Univ. Oul. C 531, 2015
University of Oulu, P.O. Box 8000, FI-90014 University of Oulu, Finland

Tiivistelmä

The topic of this doctoral dissertation is the transformation of human activity from the real world into virtual reality. The work examines how different properties of visible human activity can be recognized with the help of sensor networks and how these properties can be presented in different ways in virtual environments.

Transforming human activity into virtual environments is a fairly new research area. Existing research usually concentrates on recognizing and visualizing only one type of human activity at a time, such as basic actions or locomotion. Different sensors and other data sources can, however, produce very different types of data and are thus suited to very different use-cases. This work is among the first to study the recognition and visualization of human activity in virtual environments on a larger scale and from several theoretical viewpoints.

The work makes use of constructs that have been developed for different use-cases. The constructs are mixed reality applications that utilize different types of source data and visualize human activity in different ways. The constructs are evaluated both with respect to their practical application areas and against several theoretical frameworks.

The results suggest that different transformations have clearly different characteristics. The clearest theoretical finding is that the more detailed the activity in question, the less the recognition can rely on contextual knowledge or conventional data sources. The results also offer new perspectives on utilizing visualizations of human activity in various practical applications. The application areas include using the human body as a control device, as well as visualizing and simulating human activity in the domains of home care and urban planning.

Keywords: 3D user interfaces, information technology in healthcare, urban planning, mixed reality, information visualization, virtual reality

Acknowledgements

First, I would like to express my gratitude towards my two supervisors. My official supervisor, Prof. Petri Pulli, not only aided me in the writing of this dissertation, but also enticed me into the career of a researcher in the first place. Prof. Pulli encouraged me to carry out a great deal of my studies, both Master’s and Doctoral, in Japan. These research visits to Japanese universities were unforgettable experiences that also had a huge impact on my career. While Dr. Seamus Hickey could act as my official thesis supervisor only for a few months, he also had a great influence on this work. I was a co-worker of Dr. Hickey for years, and without his help in both scientific and practical matters, I would have had great difficulties finishing this work. I would also like to thank the external reviewers, Prof. Tapio Takala and Prof. Yoshitsugu Manabe, for providing their expertise to help in the completion of this work.

I would also like to thank several other people who have significantly helped me during my doctoral studies. Prof. Tomohiro Kuroda, Dr. Haruo Noma and Dr. Masahiro Tada have all made significant efforts to help me at multiple phases of my studies, and beyond. I would also like to thank Dr. Risto Honkanen, Mr. Antti Karhu, Dr. Leena Arhippainen, Prof. Jonna Häkkilä, Prof. Vassilis Kostakos, Mr. Jorge Goncalves and Dr. Denzil Ferreira for co-authoring the articles that constitute this dissertation. I also want to express my gratitude towards all my co-workers at the Department of Information Processing Science, as well as at the Center for Internet Excellence. I would also like to thank the Nokia Foundation and the Academy of Finland for funding my research visits.

I would like to thank my parents Airi and Reijo for their support, financial and otherwise, throughout the various phases of my studies. Lastly, I would like to thank my wife Anne for her endless support.

Oulu, April 2015 Matti Pouke

Abbreviations

2D Two-dimensional
3D Three-dimensional
6DOF Six Degrees of Freedom
EPM Extent of Presence Metaphor
EWK Extent of World Knowledge
FG Focus Group
HMD Head Mounted Display
InfoVis Information Visualization
IoT Internet of Things
MUD Multi-user dungeon/dimension
NetVE Networked Virtual Environment
RF Reproduction Fidelity
VE Virtual Environment
VW Virtual World

List of original publications

This dissertation is based on the following articles, which are referred to in the text by their Roman numerals (I–VII):

I Pouke M, Hickey S, Kuroda T & Noma H (2010) Activity recognition of the elderly. In: Proceedings of the 4th ACM International Workshop on Context-Awareness for Self-Managing Systems (CASEMANS ’10). Article No. 7. DOI=10.1145/1858367.1858374
II Pouke M & Honkanen RT (2011) Comparison of nearest neighbour and neural network based classifications of patient’s activity. In: Pervasive Computing Technologies for Healthcare (PervasiveHealth 2011) 5th International Conference on, pp. 331-335. © 2011 IEEE
III Pouke M, Karhu A, Hickey S & Arhippainen L (2012) Gaze Tracking and Non-Touch Gesture Based Interaction Method for Mobile 3D Virtual Spaces. In: Proceedings of the 25th Annual Conference of the Australian Computer-Human Interaction Special Group (OZCHI 2012), pp. 505-512. DOI=10.1145/2414536.2414614
IV Pouke M (2013) Using GPS Data to Control an Agent in a Realistic 3D Environment. In: Proceedings of the 7th International Conference on Next Generation Mobile Apps, Services and Technologies (NGMAST 2013), pp. 87-92. © 2013 IEEE
V Pouke M (2013) Using 3D Virtual Environments to Monitor Elderly Patient Activity With Low Cost Sensors. In: Proceedings of the 7th International Conference on Next Generation Mobile Apps, Services and Technologies (NGMAST 2013), pp. 81-86. © 2013 IEEE
VI Pouke M & Häkkilä J (2013) Elderly Healthcare Monitoring Using an Avatar-Based 3D Virtual Environment. International Journal of Environmental Research and Public Health, vol. 10, no. 12, pp. 7283-7298.
VII Pouke M, Goncalves J, Ferreira D & Kostakos V (2014) Practical simulation of virtual crowds using points of interest. Manuscript.

Contents

Abstract
Tiivistelmä
Acknowledgements 7
Abbreviations 9
List of original publications 11
Contents 13
1 Introduction 15
  1.1 Background 17
  1.2 Objectives and scope 19
  1.3 Delimitations 19
  1.4 Research process and dissertation structure 20
    1.4.1 Problem identification and motivation 21
    1.4.2 Objectives of a solution 22
    1.4.3 Design and development 23
    1.4.4 Demonstration 23
    1.4.5 Evaluation 24
    1.4.6 Communication 25
    1.4.7 Dissertation structure 25
2 Theoretical foundation 27
  2.1 Virtual environments 27
    2.1.1 Basic components 27
    2.1.2 Immersion 31
    2.1.3 Mixed Reality 34
  2.2 Activity recognition 36
  2.3 Information Visualization 38
3 Results 41
  3.1 Construct 1: Using Gaze Tracking and non-touch gestures to interact with a mobile 3D VE 42
    3.1.1 System Development 42
    3.1.2 Experimental setup 44
    3.1.3 Experiment results and validation 46
    3.1.4 Discussion on the human action transformation 49
  3.2 Construct 2: Avatar and 3D VE based visualization of elderly patient activities 50
    3.2.1 System overview 51
    3.2.2 Evaluation and specification gathering 54
    3.2.3 Results 56
    3.2.4 Discussion on the human action transformation 62
  3.3 Construct 3: Using GPS to Control an Avatar in a 3D VE 63
    3.3.1 System overview 63
    3.3.2 Experimental setup 66
    3.3.3 Results and validation 66
    3.3.4 Discussion on the human action transformation 69
  3.4 Construct 4: Pedestrian Simulation and Visualization 71
    3.4.1 System overview 71
    3.4.2 Results and validation 75
    3.4.3 Discussion on the human action transformation 78
4 Discussion 81
  4.1 Theoretical implications 81
    4.1.1 Immersion 82
    4.1.2 Mixed Reality taxonomy 83
    4.1.3 Activity recognition 84
    4.1.4 Extent of activity passing the transformation 85
    4.1.5 Theoretical implications summarized 85
  4.2 Practical implications 86
    4.2.1 Utilization potential 87
  4.3 Reliability and validity 87
    4.3.1 Technical limitations 88
  4.4 Recommendations for future research 89
References 91
Original publications 97

1 Introduction

Virtual environments, 3D and otherwise, have been utilized by the gaming industry for decades. Virtual environments are capable of depicting virtual counterparts of real-world locations or any type of imaginary location limited only by its creator’s imagination (and possibly the technical properties of the platform used). Video games offer a wide range of virtual environment applications, from movie-like plot-intensive adventures to open-world role-playing games with tens of thousands of concurrent users capable of interacting with each other in real time.

Although video games are by far the most popular application category utilizing virtual environments, they are not the only one. Industry and academia alike are experimenting with single-user as well as multi-user virtual environments developed for purposes besides games. While there are non-game virtual online communities existing purely for social purposes and experimentation, such as Second Life and Meshmoon, virtual environments are under research or already in use for very practical purposes as well. For example, virtual environments for healthcare domains are under research, and military training applications have utilized 3D virtual environments for years. Virtual environments are also making their way into tourism. While 3D graphics representations of major cities are common in contemporary map and navigation software such as Google Maps and Nokia’s HERE, actual multi-user virtual environments depicting museums or historical locations are also emerging in e-tourism applications and research.

Virtual environments can also be used for controlling privacy in surveillance applications. Utilizing computer graphics to visualize surveillance footage allows the adjustment of the level of realism in the transmitted footage. This type of abstraction of reality can be useful when there is a conflict between security and intrusiveness, for example when surveillance is required for the safety of a patient. By transforming the patient’s actions onto an abstract virtual avatar and utilizing a virtual environment for visualization, the intrusiveness of surveillance can be decreased significantly in comparison to regular video footage.

The recent development of Internet technology affects the field of virtual environments as well. The advent of web browsers capable of displaying 3D graphics opens up new opportunities for using virtual environments. Even though browsers cannot yet display virtual environments as complex as those run by specialized software, the lack of need to install anything on the user’s computer greatly reduces the threshold to try various virtual environment applications. In addition to browser-based games and multi-user communities, 3D-graphics-enabled browsers provide a great deal of new possibilities to merge traditional web applications, such as online stores, with 3D virtual environments.

Virtual environments are also beginning to combine forces with the Internet of Things (IoT). Utilizing data from an ever-increasing number of networked sources, many types of data can be harvested and visualized through virtual environments. Measurements from sensors connected to the Internet, such as temperature sensors or traffic counters, provide interesting use-cases for virtual environment based visualizations and simulations. Even more complex phenomena can be detected by applying data-mining techniques to the various data sources available in contemporary smart cities. For example, microblogging applications such as Twitter can be combined with text mining techniques to detect events in real time. Not only does the vast and ever-increasing amount of networked data allow the real-time sensing and understanding of the events of the surrounding world, but modern virtual environment technologies can also bring these events to us as immersive, graphical representations.

Inspired by the possibilities brought by the increasingly networked world and modern virtual environment technologies, this dissertation examines the sensing of human activity and the visualization of the sensed activity in virtual environments. While transforming human activity into virtual environments is a rather new research problem, there are existing studies investigating this area. The existing studies, however, usually focus on a specific sub-topic, such as detecting and visualizing the locomotion or daily actions of an individual. This work attempts to investigate the transformation of human activities from a wider perspective. Human activity is a complex phenomenon, and different data sources have the potential to provide different subsets of human activity. In this work, the transformation of human activity into virtual environments is examined from multiple viewpoints. This is achieved by leveraging results acquired from multiple studies, each utilizing a different type of transformation of real human activity into virtual environments. In this way, different abstractions of human activity can be systematically examined from multiple viewpoints such as activity recognition and immersion in virtual reality. This work is among the very first to examine human activity transformations at multiple levels of detail and from multiple theoretical viewpoints. The findings of this work will help future research towards a common theoretical framework for transforming real-world human activity into virtual environments.

1.1 Background

The potential of interactive virtual environments was discussed in 2006 by researchers and industry experts; virtual worlds and virtual environments were seen as being on the verge of a breakthrough into areas beyond gaming applications [1]. It was seen that besides technical properties, future research should also consider the psychological and sociological areas revolving around virtual environments. Five years later, the introduction to the MIS Quarterly special issue "Stepping Into the Internet: New Ventures in Virtual World" from 2011 implied that the true potential of 3D virtual environments was now becoming visible after learning from past failures during the earlier hype; users and organizations were becoming more adept at leveraging the advantages of virtual environments in various tasks instead of simply replicating all aspects of real-world environments [2]. Despite the increasing knowledge and interest in both industry and academia, it is still relatively uncommon for the average person to encounter virtual environments dedicated to purposes other than games or pure experimentation.

What exactly is a virtual environment then? Although virtual worlds and virtual environments lack agreed-upon definitions and the terms are often used in various ways, there have been attempts to generate clear definitions for both. An often cited definition for virtual worlds was made by Singhal and Zyda in 1999: a networked virtual world provides its users with a shared sense of space, a shared sense of presence, a shared sense of time, a way to communicate and a way to share information and manipulate objects [3]. In short, this means that users of virtual worlds are using a networked computer application in which they can interact with an artificial environment and with each other in a manner which is observable to everyone in real time. Mark W. Bell [4] used prior definitions from multiple sources to generate his own, somewhat similar, combined definition of a virtual world as a synchronous, persistent network of people, represented as avatars, facilitated by networked computers. This means that a virtual world exists perpetually in networked computers and shares a common time, as well as a common geographical representation, among multiple networked users that are represented as avatars performing the actions commanded by their users. For example, both World of Warcraft, a 3D graphics based multi-user game, and BatMUD, a text-based multi-user game, match this definition of virtual worlds, whereas Facebook and the Leisure Suit Larry games do not. Whenever a user logs into World of Warcraft or BatMUD, he or she becomes a participant in a world that has existed before and continues to exist independently of his/her interactions after logging out. Furthermore, when the user’s avatar moves and performs actions in these worlds, other users within appropriate in-game distance can simultaneously observe these actions in real time from their own computers. Even though a Facebook user can also perform actions that are immediately observable by other networked users, there is no concept of space and a Facebook profile is not an avatar. A BatMUD avatar is only a textual description, but it still counts as an avatar because the description is rich enough and the user can interact in BatMUD in a way that expresses the avatar as performing actions on the user’s behalf. In Leisure Suit Larry games, the user moves and interacts with an avatar within a richly described environment, but each game instance runs only on the user’s own computer. There is no perpetual existence of the virtual world where users can observe each other’s actions and interact with each other. [4]

In a similar short essay, Ralph Schroeder [5] gives a slightly stricter definition for virtual worlds and virtual environments. He emphasizes the need for a realistic sensory experience for something to be described as a virtual environment or a virtual world. Thus, text-based multi-user games and environments, such as BatMUD, are neither virtual environments nor virtual worlds because they do not utilize realistic graphics to represent the artificial environment. Immersion, the notion of "being there" (more thoroughly defined in Chapter 2), delivered by realistic graphics, is the definite requirement of a virtual world and a virtual environment. The difference between the two is that, as in Bell’s definition [4], a virtual world is a synchronized persistent space shared by multiple users, while a virtual environment is a more general definition without multi-user requirements. Therefore, a virtual world is a persistent virtual environment with multiple users.

The concept of Cyberspace sometimes comes up in the context of virtual environments. This is a term coined by William Gibson and made popular by his novel Neuromancer [6] in the early eighties. Cyberspace, originating from fantasy, continues to live on, as the term was soon adopted by computer professionals and is currently associated with numerous meanings from computer networks to virtual reality [7].

In short, a "virtual world" (VW from now on) refers to a multi-user virtual environment, while a "virtual environment" is a more general definition for a computer generated virtual space with interaction capabilities. Following the categorization above, the experiments in this dissertation were done using virtual environments. The results can be leveraged in VWs as well, and some would even make sense in VWs only. However, the constructs described in the results section of this dissertation were developed utilizing virtual environments (VEs from now on) without multiple concurrent users.

1.2 Objectives and scope

The aim of this dissertation is to examine the transformation of human activity from the real world into VEs. Here, "human activity from the real world" refers to regular everyday human activity, as opposed to activity taking place within a laboratory setting or activity consisting of a small set of predetermined actions. Transformation means the process of sensing human activity from a data source and visualizing the sensed results in a VE. Human activity is an extremely large topic with various and complex influences, ranging from culture to genetics, so the application of some generalizations is necessary. This leads to the first research question: "What attributes of human activity are relevant and should pass through the transformation process into virtual environments?". Exploring the generalization aspect further, the second research question is "What are the levels of abstraction that can be identified from various transformations?". The third research question concerns the validation of the first two research questions and is stated as "Analysing use-cases for human activity transformations, how do different levels of abstraction map to the use-cases?". The research questions are examined through several constructs developed for various use-cases, each utilizing a different type of human activity transformation.

1.3 Delimitations

The studies presented focus only on transforming the visual aspects of human action. Therefore, sound, haptics (tactile feedback), or olfactics (sense of smell) resulting from human activity are not considered in the transformations. It should also be noted that human communication, as well as the display of emotion, are not specifically considered in the activity transformations, being rather large categories demanding their own detailed studies. This work also does not consider human action transformations from the viewpoint of state melding in virtual worlds. State melding refers to the consistency maintenance and state dissemination in virtual worlds and forms a fundamental requirement for a synchronized multi-user experience in online worlds [8]. The reason for this delimitation is that most of the human activity data used in this work was collected (sometimes years) before the visualization experiments took place. As such, no actual testing of any real-time network properties of the developed constructs took place.

19 1.4 Research process and dissertation structure

The research approach is the constructive research method known as the Design Science paradigm, as described by Hevner [9]. Design Science is a pragmatic research paradigm originating from information systems (IS) research. Design Science is constructive; it focuses on developing artefacts that solve organizational problems. The basis of Design Science lies in the link between business needs and the researcher's knowledge base. The needs and problems within an organization justify a problem that researchers attempt to find a solution for. While the solution is being developed, it is iteratively evaluated and improved in collaboration between researchers and users. The usefulness of the artefact with respect to organizational needs determines the success of the artefact. Peffers et al. described Design Science research as consisting of the following parts: Motivation, Objectives, Design and Development, Demonstration, Evaluation and Communication, claiming that Design Science research should be able to fulfil each category [13]. Another way to describe the iterative research process in Design Science is with three distinct iteration cycles: the Relevance cycle, the Design cycle and the Rigor cycle [10]. The specification process done by the user organization is referred to as the Relevance cycle, while scientific knowledge is produced through the Rigor cycle. Artefact design and development is done within the Design cycle.

An example of a very rigorous application of Design Science can be found in the work of Reinecke et al. [11]. In this work, the authors developed an interface that adapts to the user's culture. While the need for the artefact did not come from a specific organization, the authors were nevertheless able to justify the benefits such an interface would bring to businesses. Multiple artefacts were then developed while evaluating them iteratively with user tests. Another example can be found in the work of Bemelmans and Voordijk, where the motivation was defined by five Dutch construction companies as the need to assess the purchasing maturity of a business unit [12].

The original motivation for the research presented in this dissertation arose from discussions with healthcare providers. Combining activity recognition and VEs was seen as a potential way to develop novel solutions in support of elderly home care. Later, two cases of using body movements for controlling a VE were included and investigated in parallel with elderly activity visualization. Finally, simulating and visualizing pedestrian flows was investigated because of an identified need for visualization tools for city planners.

As the transformation of human activity into VEs as various types of abstraction is a novel problem, the research in this dissertation is what Hevner called a search process; the research attempts to find good solutions for the research questions. Furthermore, the approach to solving each research question is to utilize constructs that attempt to solve the questions. Each of the constructs in this dissertation is an instantiation, a fully functioning application that can be used to demonstrate the solution it proposes for the research question. Some of these instantiations might consist of multiple sub-constructs, such as alternative models. The artefacts leverage known methods from other fields to overcome the novel problems they attempt to solve (borrowing a term from evolutionary biology, Hevner called this kind of approach exaptation). For example, many artefacts in this dissertation utilize pattern recognition theory to identify human activity from the real world in order for the activity to be presented in the VE. The results achieved from experiments performed with the instantiations are used to discuss the research questions. The following subsections describe the Design Science process of this dissertation in detail, following the categorization introduced by Peffers et al. [13], while an overview can be seen in Fig. 1.

Motivation:
- Augmenting real human activity into virtual environments is a novel research problem.
- Novel control methods enrich VE interaction.
- VE visualization can help in elderly home care and city planning.

Objectives:
- Investigate human activity transformations from various data sources and examine their differences and utilizations.

Design and Development:
- Develop constructs utilizing human activity transformations. Transformations depend on purpose and activity data source.
- Two control method constructs.
- Two visualization constructs.

Demonstration:
- Constructs were demonstrated to users when applicable.

Evaluation:
- Constructs were evaluated with user tests and experiments.

Communication:
- Results are communicated through scientific publications and this work.

Fig 1. Research process according to Peffers et al. [13].

1.4.1 Problem identification and motivation

Modern VEs can display very realistic human activity, and various technologies exist for capturing human activity data. While augmented reality systems are becoming increasingly popular, there is little research concerning augmenting virtual environments with real-world human activity, which constitutes a novel research problem. Investigating the transformation of various human activity properties can be carried out utilizing multiple constructs that focus on different activity properties. The constructs were provided by use-cases originating from the local research infrastructure and research projects that the original publications of this work contributed to.

As is common in the Design Science paradigm, this research addresses issues originating also from outside of the research community. While two constructs were developed to serve as interfaces to VEs, another two constructs were developed for the real-world use cases of elderly home care and city planning. The applicability of using virtual environments for visualizing elderly patient activity was studied in conjunction with healthcare professionals working with elderly home and hospice care. Also, the author's earlier research [14] identified that municipal traffic officials could benefit greatly from traffic visualizations and simulations. This led to the development of a pedestrian simulation tool that can utilize VEs. Each construct provides a case of transforming human activity into VEs. The research approach according to Hevner's Design Science paradigm [9] can be seen in Fig. 2.

Fig 2. The research approach of the dissertation as described by the Design Science paradigm.

1.4.2 Objectives of a solution

The constructs can detect real-life human activity at various levels of detail utilizing various data sources. The constructs also allow the detected activities to be visualized in 3D virtual environments. By examining the properties of the constructs, which transform human activity data from various sources for various purposes, new knowledge can be gained on what kinds of abstractions can be made of human activity for it to be visualized in VEs.

1.4.3 Design and development

Four construct instantiations, more specifically two interfaces (Constructs 1 and 3) and two visualization systems (Constructs 2 and 4), were developed that each transform human activity from various data sources (see Table 1). Construct 1 is a mobile prototype system detecting fine hand movements as well as estimating gaze direction. Hand and eye motions are transformed into control inputs in a mobile 3D VE. In Construct 2, elderly patient activities are recognized through a portable sensor network, and the recognized activities, as well as the rough location of the patient, are visualized with an avatar mirroring the patient's activity in a 3D VE. Construct 3 uses GPS data to control an avatar in a VE representing a real location. In Construct 4, the spatio-temporal characteristics of pedestrian flows in downtown Oulu are analyzed and transformed into agent-based pedestrian simulation models.
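To make the kind of mapping involved concrete, the sketch below shows one simple way in which GPS fixes, such as those used by Construct 3, could be projected onto the local coordinate system of a VE that depicts a real location. It is an illustrative equirectangular approximation only; the reference point, scale and function names are assumptions of this sketch, not the implementation used in the construct.

```python
import math

EARTH_RADIUS_M = 6371000.0  # mean Earth radius in metres

def gps_to_ve(lat, lon, ref_lat, ref_lon, metres_per_unit=1.0):
    """Project a WGS84 fix to (x, y) VE coordinates around a reference point."""
    d_lat = math.radians(lat - ref_lat)
    d_lon = math.radians(lon - ref_lon)
    # East-west distances shrink with latitude; north-south distances do not.
    x = EARTH_RADIUS_M * d_lon * math.cos(math.radians(ref_lat))
    y = EARTH_RADIUS_M * d_lat
    return x / metres_per_unit, y / metres_per_unit

# Example: a fix a short walk north-east of a hypothetical reference point in Oulu.
print(gps_to_ve(65.0595, 25.4665, ref_lat=65.0590, ref_lon=25.4655))
```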

1.4.4 Demonstration

The hand-eye interface of Construct 1 was demonstrated in user studies. The elderly home care visualization system of Construct 2 was both analyzed in terms of classification performance and demonstrated in focus groups and in an online survey. The GPS avatar control method of Construct 3 was applied in an experiment regarding the 3D VE visualization of a test user's walking patterns. The simulation and visualization of pedestrian flows by Construct 4 were thoroughly analyzed in simulation experiments.
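As an illustration of the kind of classification analysis referred to above (Paper II compares nearest-neighbour and neural-network classifiers of patient activity), the toy sketch below assigns an activity label to a feature vector using a one-nearest-neighbour rule. The feature values and labels are invented for illustration and do not come from the original studies.

```python
import math

def euclidean(a, b):
    """Distance between two feature vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def classify_1nn(sample, training_set):
    """Return the label of the training example closest to the sample."""
    label, _ = min(((lbl, euclidean(sample, feats)) for lbl, feats in training_set),
                   key=lambda pair: pair[1])
    return label

# Hypothetical (label, [mean acceleration, variance]) training examples.
training = [("lying", [0.1, 0.01]), ("sitting", [0.3, 0.05]), ("walking", [1.2, 0.60])]
print(classify_1nn([1.0, 0.5], training))  # -> "walking"
```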

Table 1. Papers corresponding to each artefact.

Artefact                                                      Article
Construct 1: Eye-hand VE interface                            III
Construct 2: Elderly home care visualization tool             I, II, V, VI
Construct 3: GPS avatar control method                        IV
Construct 4: Pedestrian crowd simulation and visualization    VII

1.4.5 Evaluation

Each construct was originally developed for a specific purpose that determines its primary evaluation. However, this dissertation leverages the constructs to seek answers to research questions regarding human action transformations at various levels of abstraction; this requires additional evaluation. The original evaluations were as follows. With data gathered from user studies, various properties of the hand-eye interface of Construct 1 were evaluated. The purpose of these studies was to evaluate the usability of the hand-eye interface. For Construct 2, the elderly home care visualization system, the classification accuracy of the activity transformation prototypes was evaluated in several studies. The usefulness of the VE visualization was evaluated together with home care professionals. For the GPS avatar control method of Construct 3, the performance of three alternative models was evaluated. The acquired GPS trail was transformed with the three alternative models and their performances were evaluated against ground truth. The pedestrian simulation and visualization system of Construct 4 was evaluated by comparing the performance of a generalized pedestrian simulation model against a model based on detailed mobility traces from a municipal WiFi network. For the purposes of this dissertation, the aspects of human activity transformations that each construct utilizes are evaluated against multiple theoretical viewpoints. In short, these viewpoints can be described as follows:

1. Dimension of the VE affected by the transformation
2. Immersion
3. Mixed Reality categorization
4. Activity recognition
5. Information visualization

The theoretical viewpoints allow the identification and analysis of the various differences that exist in the human activity transformations of each construct. The viewpoints are described in detail in Chapter 2.

1.4.6 Communication

The studies are communicated through multiple scientific publications and this dissertation. The original publications corresponding to each artefact can be seen in Table 1.

1.4.7 Dissertation structure

The remainder of the dissertation is structured as follows. Chapter 2 describes the theoretical foundations of the studies presented. Chapter 3 presents the original studies and evaluates their results from their original viewpoints as well as in the light of the research questions described above. Finally, Chapter 4 discusses the results, their theoretical and practical implications and limitations, and introduces suggestions for future research.

2 Theoretical foundation

The research presented within this dissertation can be examined from multiple theoretical viewpoints. This chapter gives a brief introduction to the main theoretical viewpoints and their current state of research where relevant.

2.1 Virtual environments

This section gives a three-fold overview of VEs and VWs. First, an overview is given of the technical properties that are common to VEs and VWs. The second theoretical viewpoint concerns immersion and what it means in the context of virtual environments. Finally, the specific types of VEs that combine reality with virtuality are introduced. These types of applications are called Mixed Reality (MR) applications, and the theory for analyzing and categorizing them is presented at the end of this section.

2.1.1 Basic components

In addition to providing a definition for VWs, Singhal and Zyda [3] provide a thorough overview on many technical requirements of multi-user VWs. According to their classification, the four basic components of a networked virtual environment (net-VE) are graphics engines and displays, communication and control devices, processing systems and a data network [3]. Although virtual reality technologies have advanced greatly since 1999, these four cornerstones still stand today.

Graphics and game engines

Graphics engines and display devices provide immersive visual experiences for the users. 3D computer graphics form the basis of 3D VEs and virtual worlds. Graphics engines perform the operations required to display the complex objects and scenery that form 3D VEs, which can be viewed through the display devices. A game engine is a larger concept that, besides rendering, also handles other tasks, such as object navigation and physics, which Singhal and Zyda refer to as processing systems [3]. A good mathematical basis for 3D computer graphics can be found, for example, in "3D Math Primer for Graphics and Game Development" by Dunn and Parberry [15].

In VEs, visually observable phenomena consist mostly of translations of vertices in three-dimensional space¹. The vertices interconnect and form surfaces that represent the objects that make up the VE, such as people, trees or houses. A primitive example of such an object, modeled with the Blender modeling software, can be seen in Figure 3. Objects can be static or animated. The objects in virtual environments are defined in multiple nested coordinate spaces. Dunn and Parberry define these coordinate spaces as the World Space, Object Space, Camera Space and Inertial Space [15]. The World Space defines the largest area of interest, for example a city, or even an entire planet. The location and orientation of objects are defined in the World Space. The movement of objects and their paths between points are also defined in the World Space. For moving non-user-controlled objects, such as non-player characters (or agents), pathfinding algorithms such as A* [16] can be used to determine paths between points. The pathfinding algorithms can leverage a navigation mesh that describes the navigable and non-navigable areas of the VE as a connectivity graph [17].

¹ There are also other visual phenomena, such as background imagery, particle effects and post-processing effects; however, they are not relevant to the studies presented in this work.

Fig 3. A cube consisting of eight vertices.
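As a minimal illustration of the coordinate spaces discussed above (not taken from the original studies), the sketch below defines the eight vertices of a cube like the one in Figure 3 in its Object Space and translates them into World Space; a full game engine would also apply rotation, scaling and perspective projection.

```python
# Eight vertices of a unit cube, defined in Object Space around the origin.
CUBE_VERTICES = [(x, y, z) for x in (-0.5, 0.5)
                           for y in (-0.5, 0.5)
                           for z in (-0.5, 0.5)]

def to_world_space(vertices, object_position):
    """Translate Object Space vertices by the object's World Space position."""
    px, py, pz = object_position
    return [(x + px, y + py, z + pz) for x, y, z in vertices]

# Place the cube at a hypothetical World Space position.
world_vertices = to_world_space(CUBE_VERTICES, (10.0, 0.0, -3.0))
print(world_vertices[0])  # -> (9.5, -0.5, -3.5)
```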

The navigation mesh does not determine all agent movement. Besides finding the shortest path between two points, the agents often need local avoidance methods to dodge other agents. There is an abundance of research concerning different local avoidance methods, especially in conjunction with the simulation of realistic human crowds. An example of realistic micro-level crowd movement is the massively parallel simulation by Guy et al. [18]. Their method allows massive amounts of pedestrian agents to successfully move past each other, even in crowded spaces. Another example of micro-level simulation is the work of Lemercier et al. [19], which allows realistic following behavior of pedestrian agents. Populating entire cities with pedestrian agents has been studied as well: DaSilveira et al. [20] researched the parametric generation of populated cities and Tecchia et al. [21] the real-time rendering of massive amounts of pedestrians.

Object Space refers to the local coordinate space of an object. The vertices that form the objects in a virtual environment are typically defined in the Object Space. The location of an object in the World Space refers to the origin of the object's Object Space. Object animations, such as arm and leg movements, are defined as translations of vertices within the Object Space. Objects might also contain multiple nested object spaces to simplify the calculation of the animations. Micro-level human activity, such as realistic limb movements, is typically transformed into VEs by utilizing motion capture. This usually involves markers or a specialized suit placed on an actor, as well as a laboratory with multiple sensors capable of accurately capturing the subject's movements [22]. The movements of this real-world actor are then translated directly onto the vertices of the virtual humanoid model. Animations can also be produced manually, but this is typically a time-consuming task. There is, however, research concerning user-friendly manual animation modeling. For example, the MAge-AniM system allows simple humanoid animation modeling for novice users, as well as presenting the animations on a mobile device [23]. The work of Buttussi et al. also produced software for simple animation generation, as well as for re-using animations and poses in a way that simplifies future modeling efforts [24]. An example of the simulation of micro-level human limb movements without real-life source data is the research by Kenwright et al. [25]. Their model was able to generate physically accurate results, as well as human-like features, without motion capture or extensive manual animation.

The Camera Space is an Object Space which refers to the coordinate space of the point from which the virtual environment is observed. Camera Space is used to calculate what is rendered on the display device. Finally, Inertial Space is used to calculate transformations between Object Space and World Space.

Display devices

Contemporary display devices are much the same as those described by Singhal and Zyda [3], although their capabilities have increased significantly. In addition to displaying high resolution 2D projections of 3D virtual spaces, modern desktop displays are capable of producing stereoscopic 3D graphics when combined with specialized glasses [26]. Devices such as the Nintendo 3DS [27] can even produce autostereoscopic graphics, that is, stereoscopy without the use of any kind of headgear. CAVE environments project the virtual environment onto the walls surrounding the user, covering the user's entire field of vision with the virtual environment [28, 29]. At the time of writing this thesis, the world is following the development of the Oculus Rift [30], a head mounted display (HMD) that enables full six degrees of freedom (6-DOF) head tracking in addition to a field-of-vision covering stereoscopic display.

Communication and control devices

Communication and control devices for the VEs of today have advanced from what was described by Singhal and Zyda in 1999 [3]. Mouse and keyboard are still very common interface devices in desktop PC games and other desktop virtual environments, but game controllers for gaming consoles have advanced since. The Nintendo Wii [31] and PlayStation Move [32] allow the transfer of physical arm movements into VE applications such as tennis, golf or even sword fighting. Microsoft Kinect [33] takes the idea even further, abandoning hand controllers and utilizing machine vision to track the user's physical location and pose. Contemporary touch-screen devices such as tablets and smartphones are also capable of running VEs, which makes touch-screens a special category of communication and control devices. Another special category of control devices is simulation controllers that imitate the real-world controls of cars, aeroplanes, helicopters and the like.

User interfaces

User interfaces for VEs, and especially the 3D aspects of VE interfaces, are of interest within the scientific community as well. Doug Bowman's book "3D User Interfaces, Theory and Practice" gives a thorough overview of 3D interfaces for those who wish to pursue the topic further [34]. A recent study on 3D interfaces in multi-user scenarios and their special requirements on tablets was given by Pakanen et al. [35]. The authors found that while personal 3D GUIs were considered interesting, designers should pay attention to visual design that creates a feeling of security (in multi-user environments, the 3D visuals should indicate a distinction between public and private elements), does not occlude the background, and is distinct enough not to be confused with the background [35]. The doctoral dissertation by Tony Manninen studied the interaction of virtual characters in virtual worlds and provided heuristics for virtual character interaction [39]. These heuristics provided development suggestions, for example, for expressive non-verbal interactions and collaborative construction. In this context, the study of virtual humans should also be mentioned. Already in the nineties, Norman Badler carried out research on how complex, responsive and task-driven human activity can be performed by a virtual human [36, 37]. Although virtual humans look realistic and are capable of complex interaction tasks, they are so far controlled with complex laboratory equipment [38]. The work of Funatsu et al. provided a novel way to control a virtual character through voice commands [40].

Data network

While network speeds have increased and processing power has evolved as expected by Moore's Law [41], contemporary virtual environments are also beginning to enjoy benefits in terms of network performance and processing power that were unavailable in 1999. Namely, Cloud Computing [42] provides the possibility to separate the physical location of information processing from the user interface. An example of this is the NVidia SHIELD [43], which allows the user to stream game data from a home computer to a tablet device. State melding refers to the simultaneous upkeep and dissemination of the state of virtual objects and users in VWs. A state-of-the-art overview as well as a taxonomy for state melding can be found in [8].

2.1.2 Immersion

It has been stated that while early ideas of (non-military) VWs were sense-spanning, intimate experiences utilizing HMDs and other advanced equipment, VWs have since evolved into video games simply utilizing 3D graphics through desktop displays [44]. Immersion is generally thought of as the notion of being present in a VW, but various authors have different interpretations of the exact definition.

In "Communication in the age of virtual reality", immersion is defined as follows: "Immersive is a term that refers to the degree to which a virtual environment submerges the perceptual system of the user in computer-generated stimuli". The more the system captivates the senses and blocks out stimuli from the physical world, the more the system is considered immersive [45, pp. 57]. This interpretation means that immersion is strictly linked to the capabilities of the display devices that are used to view the VW. Pausch et al. attempted to quantify the immersion of a VW by examining the effect of different VE displays on perception; this was achieved by measuring the users' ability to find letters in a virtual room [46]. In "Narrative as Virtual Reality: Immersion and Interactivity in Literature and Electronic Media", Marie-Laure Ryan examines the link between immersion in literary theory and VWs, examining the role of interaction in immersion [47]. Ryan examines various aspects of immersion, also comparing virtual reality immersion to being "caught in a book", thus not considering immersion as a purely technical capability of a virtual environment. "Suspension of Disbelief", immersion by voluntarily overlooking obvious unrealistic aspects in a work of art, a concept attributed to Samuel Taylor Coleridge's "Biographia Literaria" [48], is also included in Ryan's book as a theory of immersion applicable to virtual environments [47].

The definitions presented earlier suggest that immersion can be interpreted as 1) the technical capability of a VE display device to provide the user's senses with a thorough presentation of virtual reality while omitting the true reality surrounding the user, and 2) the capability of the VE to provide a captivating experience regardless of the type of technology used. The study of Slater and Wilbur [49] treats these definitions as separate concepts, definition 1 being immersion, while definition 2 is not immersion but presence. Slater and Wilbur define immersion as an objective and quantifiable description that can be applied to any system, while presence is a state of consciousness, evaluated in a more qualitative manner. Their definition of immersion consists of three dimensions. The first dimension is the capability of a display system to provide an inclusive, extensive, surrounding and vivid illusion of reality to the user. Inclusive refers to the display system's capability of shutting off the true reality surrounding the user. Extensive refers to the range of senses that the VR system is capable of stimulating. Surrounding refers to the extent to which the VE can be observed in a panoramic manner rather than limited to a narrow view. Vivid refers to the quality of the visuals, for example, resolution, color range, texture quality and such. The two remaining dimensions are matching and plot. Matching refers to how the VE responds to the bodily movements of the user and thus requires at least some type of body tracking. Plot means the capability

32 of the VE to provide an alternate reality, a self-contained world in which the user can participate. This can mean, for example, autonomous agents that interact with each other in the VE, as well as the interactions that the user can perform with the VE. [49] According to Slater and Wilbur, presence is a description of the user’s state with respect to an environment [49]. It is estimated in both subjective and objective manner. The subjective description is the user’s own evaluation of how much he/she feels like "being there" when using a VE application; it is the degree to which the user is a subject to "Suspension of Disbelief". The objective description (sometimes referred to as behavioral presence) is the extent of the user being caught up in the VE in an observable manner. It can be seen by observing how much the user is following the "rules" of the VE. If a user acts in the VE in a very similar way that he/she would act in a similar situation in the real world, it can be attributed to a high level of presence. Presence often correlates with immersion. However, Slater and Wilbur state that the correlation is not necessarily due to a causal connection. While poor graphics, bad controls or lack of sounds often have a negative effect on presence, the effect of immersion on presence seems to be context-dependent. Realism, whether it relates to graphics or other properties of the VE, does not necessarily have a positive impact on presence if the highly realistic aspect is irrelevant to the user. For example, Slater and Wilbur refer to a study by Uno and Slater where the effect of realistic physics to presence was studied in a bowling alley simulation. While users reported realistic simulation of friction having a positive influence on presence, according to the users, elasticity and collision detection did not have such an effect [50]. The authors did, however, observe behavioral presence in the form of subjects trying to actually dodge virtual objects bouncing back at them [50]. The authors also state that the effect of display aspects (Inclusive, Extensive, Surrounding and Vivid) is mediated through two filters. Firstly, certain display aspects might be more or less relevant to the application, and secondly, different users might have very different expectations to the display aspects. Slater and Wilbur give an example of a concert hall simulation where realistic audio can be seen as being more important than visual aspects. However, in this kind of simulation, realistic spatial audio might be a crucial element of presence to some users, while others might find it barely noticeable. [49] The constructs presented in this work are evaluated in terms of immersion to investigate whether activity transformations have properties that look blatantly unrealistic to observers. In this work, immersion is defined as a combination of Slater and Wilbur’s immersion and presence. The constructs are evaluated in terms of display properties (I,

33 S, E, V), matching and plot as well as hypothetical impacts on presence. The emphasis, however, is more on the latter three properties because many aspects of the display properties are also discussed according to Milgram’s Mixed Reality taxonomy [51] that is used for evaluating the constructs as well. It should also be noted that the property "Extensive" is not relevant as this work focuses only on visual aspects, (as stated in section 1.3). It should also be noted that the original studies did not include any user tests that would specifically address presence, therefore, presence in this work is estimated in a purely speculative manner. While Slater and Wilbur discussed the effect of unrealistic graphics having a negative effect on presence, we seek visually unrealistic phenomena when estimating activity transformations’ negative effect on presence.

2.1.3 Mixed Reality

Mixed Reality (MR) is a type of VE application that combines properties from true reality and virtual reality and covers a wide range of application types. Paul Milgram's influential article from 1994 [51] not only offers a thorough explanation of the concept of MR, but also suggests a taxonomy for classifying various types of MR applications. The taxonomy has many similarities to Slater and Wilbur's [49] framework of immersion and presence. This taxonomy is later used to situate the constructs presented in this work. A simple explanation of an MR system can be given using the Virtuality continuum, which depicts the degree of true elements and virtual elements visualized in an application, see Figure 4. For example, while Augmented Reality applications augment true reality (for example, a video feed) with virtual elements, Augmented Virtuality, on the other hand, augments an application consisting mostly of virtual elements with real elements. However, because the aspects of "real" and "virtual" are rather ambiguous, Milgram's taxonomy explains how these aspects can be evaluated in MR systems in more detail, examining "real" and "virtual" from three viewpoints. [51]

Fig 4. Virtuality continuum as a simplified representation [51].

In Milgram’s three-dimensional taxonomy, the first dimension is called the Extent of world knowledge (EWK). It is depicted in Figure 5. The purpose of this dimension is to tell how much of reality is modeled in the virtual environment application. It should be noted, however, that modeling here does not mean merely the graphical modeling of reality. The higher the EWK, the more the underlying application knows about the aspects of reality it displays. For example, a virtual city application with high EWK could not only display realistic virtual models of the city buildings, but also provide their accurate geographical locations in reality and the purpose of each building. [51]

Fig 5. Extent of World Knowledge [51].

Reproduction fidelity (RF) essentially means the quality of the graphics used in representing the environment within the application, similar to Slater and Wilbur’s vivid display property. It should be noted that the computer graphics scale in Figure 6 is from the original 1994 article and, even then, showed only a rough generalization of 3D computer graphics techniques. In any case, the purpose of this dimension is to show the level of realism of the visual aspects of the MR application. [51]

Fig 6. Reproduction Fidelity [51].

While the final dimension, Extent of Presence Metaphor (EPM), lists display devices in a similar manner to the definition of immersion in "Communication in the age of virtual reality" [45, pp. 57], it also refers to the manner in which the VE is constructed to allow the user to feel "present" in the VE in terms of viewpoint, movement and real-time display update. As can be seen from Figure 7, the upper portion of the scale rates display devices based on how immersive the sensory experiences they can provide are. Here, immersive is defined similarly to Slater and Wilbur’s inclusive and surrounding display properties [49]. Regular 2D displays are on the low end of the realism scale, while CAVE and HMD devices provide the highest realism. The lower portion of the scale refers to viewpoint manipulation capabilities in terms of location, orientation and response time. [51]

Fig 7. Extent of Presence Metaphor [51].

Milgram’s 20-year-old taxonomy provides a convenient way to measure various levels of abstraction in contemporary mixed reality systems as well. While in Augmented Reality applications reality plays the dominant role and is usually given through direct observation [52], Augmented Virtuality usually yields an abstracted projection of reality on top of a predominantly virtual environment. Milgram’s taxonomy can be used to categorize these abstractions. The taxonomy is directly applicable to the constructs presented in this work as well, although it will be seen later in this work that EWK is the most relevant dimension concerning human activity transformation in the constructs.

2.2 Activity recognition

Activity recognition means the computational recognition of human activities and is a subtype of pattern recognition. Pattern recognition problems have been addressed by the scientific community throughout history. Within the last century, one of the earliest examples can be found from 1936 [53]. This example introduces discriminant analysis for the purpose of assigning labels to input data. This sort of problem is usually referred to as classification, which is a fundamental problem in pattern recognition and is also often utilized in this dissertation. The book “Classification, estimation and pattern recognition” by Young et al. from 1974 [54] is one of the early examples cited as “pattern recognition theory”, for example in [55], which itself is one of the first articles concerning human activity recognition.

However, patterns in data have been identified as early as the 16th century, when Tycho Brahe’s astronomical observations were used by Johannes Kepler to discover the empirical laws of planetary motion [56].

Pattern recognition is about finding regularities in data. Regularities in data are described as the "opposite of chaos" by Watanabe [57]. When these regularities can be computationally identified, a wide range of applications can be developed, from sorting spam from regular email to understanding human speech patterns from an audio signal. The purpose of a pattern recognition system can be to assign each part of a dataset a label, which is referred to as classification, as noted earlier. A pattern recognition system can also be used to predict continuous variables, which is referred to as regression. Supervised learning problems in pattern recognition refer to using a teaching set of examples for correctly identifying unknown patterns. In unsupervised learning problems, the input data has no target variables to be assigned to, but pattern recognition can be used to discover groups or other regularities of interest. [56]

Current technology allows real human action to be identified and analyzed in several ways. A common way to identify human activities is using wearable sensors and pattern recognition. One of the most exhaustive early works on recognizing activities (such as walking, running and household work) in an everyday context utilizing wearable accelerometers was performed by Bao et al. [58]. A good recent overview of activity recognition through wearable sensors is offered by Preece et al. [59]. While wearable sensors, such as accelerometers, are common in activity recognition, sound can also be harnessed for this purpose [60, 61]. Modern wearable computing also offers an interesting alternative for today’s activity recognition [62]. As for tracking massive amounts of activity, the work of Bandini et al. [63] offers a good overview of various methods for estimating real-world human flows and pedestrian density. The movement of human masses as estimated from cell phone tracking is discussed by Calabrese et al. [64, 65]. A thorough look at utilizing computer vision for crowd tracking is given by Junior et al. [66]. Finally, estimating crowd densities in real time utilizing probes is presented by Wirz et al. [67].

Even though sensor-based recognition systems can achieve very high recognition rates using specific sensors in limited settings, more general recognition systems remain a challenge. Some scientists argue that future activity recognition should adapt to the data provided by the ever-increasing number of pervasive sensors [68] so that it can offer more general, smart and context-specific activity recognition solutions [69].

Giannotti et al. even suggest the planet-spanning harvesting of human activity from big data to understand global phenomena and to generate a true knowledge society [70]. Pattern recognition systems, both classification and regression, are utilized in the transformation processes described in this dissertation. However, the scope is not the development of novel pattern recognition algorithms; the transformation methods described here merely leverage pattern recognition.
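To make the distinction between the two pattern recognition tasks used in this work concrete, the following minimal Python sketch contrasts classification (predicting a discrete label) with regression (predicting a continuous variable). The data is an invented placeholder and is not tied to any dataset used in this dissertation.

    # Toy illustration of classification vs. regression with placeholder data.
    import numpy as np
    from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 4))              # feature vectors, e.g. sensor features

    labels = rng.integers(0, 2, size=100)      # discrete classes -> classification
    speeds = rng.uniform(0.5, 2.0, size=100)   # continuous variable -> regression

    classifier = KNeighborsClassifier().fit(X, labels)
    regressor = KNeighborsRegressor().fit(X, speeds)

    print(classifier.predict(X[:1]), regressor.predict(X[:1]))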

2.3 Information Visualization

This dissertation often refers to the representation of human activity on screen through a VE as "visualization". However, it should be noted that this dissertation does not study the use of 3D VEs as a method of Information Visualization or Scientific Visualization. Information visualization (InfoVis) generally refers to systems leveraging visuals and interactivity to amplify cognition [71], often for a specific task. The purposes of Scientific Visualization are similar, but the focus is specifically on visualizing data from scientific sensors and instruments. This dissertation studies the transformation of human activity for 3D VEs in a more general manner than in InfoVis or Scientific Visualization systems. However, for the purposes of this work, it is useful to review some of the theoretical frameworks the InfoVis community provides.

Even though InfoVis is also a relatively young research tradition often described as lacking a theoretical background [72, 73], in recent years the InfoVis community has made suggestions for introducing one. In [73], Liu and Stasko discuss applying theories in InfoVis in general, and interpret theories as laws, models, frameworks, taxonomies and interpretations. They state that whereas scientific laws, such as Newton’s three laws of motion, have predictive power when accurate and precise enough, the dynamic and context-dependent nature of InfoVis renders the generation of such laws impossible. Models and frameworks are more abstract descriptions of reality, whereas laws describe strict correlations and dependencies between variables. Models can describe InfoVis as mechanical procedures, and frameworks act as slightly more abstract worldviews based on assumptions. Taxonomies resemble models and are often used in InfoVis to categorize and classify data, representations, tasks and interactions. Interpretations refer to theories that do not attempt to make valid generalizations of a phenomenon, but merely interpret it. They can also bring a novel perspective to a previously described phenomenon. [73]

In their work, Purchase et al. [72] discuss three theoretical approaches to information visualization: data-centric predictive theory, information theory, and scientific modeling.

The predictive data-centric theory focuses on finding the right tools when considering the nature of the data to be visualized. The information theory approach, using the concept from Shannon’s information theory [74], treats visualization as a noisy channel leading from data to human cognition. The third approach argues that InfoVis research currently relies too much on a purely engineering approach. Scientific foundations from other fields should be leveraged to address specific aspects of visualization, such as color perception, in order to generate predictive models for InfoVis systems. Two scientific model approaches are suggested: the Visualization Exploration model and the Visualization Transform Design model. [72]

Fig 8. Virtual reality as a noisy channel for transmitting true reality.

In this work, the information theory approach seems most compatible with the research questions, as a transformation into a VE, as well as its representative quality, can be treated as a channel for information transmission. An illustration can be seen in Figure 8. A data-centric approach as suggested by Purchase et al. would require evaluating the compatibility of the visualization with the data at hand [72]. Since this thesis specifically focuses on VEs, it would be artificial (and probably fruitless) to argue for the superiority of VEs in each study in this thesis. The scientific modeling approaches suggest very thorough models encompassing all aspects of a typical InfoVis system, including different interaction aspects [72]. These approaches are too detailed, as the transformation methods that are the topic of this dissertation are not complete InfoVis systems designed for specialized tasks. The information theory approach, however, seems suitable, as it suggests treating a visualization system as a noisy channel transmitting data. Human activity transformations map rather intuitively onto the concept of a noisy channel: reality is the origin, and the receiver observes reality in reduced detail.

There is previous research concerning the visualization of real-life data, including human activity, in VEs that is closely related to the transformations presented in this work. A large portion of that research utilizes the Second Life [75] platform for visualizing the activities. One of the early discussions about mixing real objects, such as human hands, with virtual environments was related to CAVE research in the early nineties [29]. Real-world data feeds, such as GeoRSS feeds, were connected to Second Life by Boulos and Burden to explore the linking between data connected with a geographical location and virtual worlds [76]. Gorini et al. proposed a new e-health paradigm called "P-health" that would incorporate mobile phones, PDAs as well as bio- and activity sensors to transform real-life patient activity into a virtual world [77]. Kwan et al. utilized the 3D capabilities of GIS systems for the geovisualization of human activity patterns [78]. Simple real-life activities of elderly patients were visualized in VEs in the Smart Condo research by Boers et al. [79]. The falls of an elderly patient were visualized with 3D graphics, utilizing smart cameras for location and fall detection, by Fleck et al. [80]. Mixed reality games such as Human Pacman [81] and ARQuake [82] combined real-world human location sensing and tangible interfaces for gaming. There are also working prototypes that sense general daily activities through mobile phones and visualize those activities with an avatar in Second Life. Incorporating built-in smartphone sensors, such as accelerometers, to detect general activities and visualize them in Second Life was done by Musolesi et al. [83]. Using the smartphone microphone to classify activities from environmental sound cues was done by Shaikh et al. [84, 85]. Finally, the Dual Reality concept by Lifton and Paradiso attempts to turn virtual environments into a medium for art and self-expression by capturing human activities through the Plug [86] sensor network and visualizing the sensed activities in Second Life [87].

3 Results

This chapter gives a summary of each study presented in the original publications that this work consists of. Each study presents a construct that utilizes human activity transformations for a particular purpose. While the original studies have evaluation and validation procedures that were performed from the studies’ original points of view, the scope of this dissertation requires some additional evaluation. This additional evaluation takes place after the summary of each original study, and the evaluations themselves are summarized in Chapter 4.

The additional evaluation is carried out as follows. First, the abstraction level of each transformation is given a qualitative description. The presented constructs are then discussed from the viewpoint of human activity transformations, reflecting on the different theoretical viewpoints presented earlier. The coordinate spaces that constitute VEs provide a rough way to categorize the scale at which events take place; therefore, the coordinate space that is affected by the activity transformation is identified. Immersion is considered here as the VE’s capability to produce an unbroken virtual reality experience. The concepts of immersion and presence from Slater and Wilbur [49] are used as the basis of the evaluation. Since no user tests were made to evaluate presence, it is speculated upon in terms of the existence of phenomena that could reduce subjective presence. For each construct, it is discussed whether the activity transformations produce visual phenomena that are blatantly unrealistic in a presence-reducing way, such as walking through walls. Milgram’s taxonomy provides the next three properties for discussion: the Extent of World Knowledge (EWK), Reproduction Fidelity (RF) and the Extent of Presence Metaphor (EPM). Each construct relies on some type of activity recognition to sense human activities; the data source as well as the activity recognition method utilized is briefly discussed. Finally, inspired by information theory, it is qualitatively discussed which specific properties of human activity the transformation is able to transmit.

3.1 Construct 1: Using Gaze Tracking and non-touch gestures to interact with a mobile 3D VE

Paper III presents a method to interact with mobile 3D virtual environments without a mouse or a keyboard. The interaction utilizes continuous gaze tracking and non-touch hand gestures for turning the viewpoint and manipulating objects. To test the interaction methods, user tests with short interviews and questionnaires were conducted. In this study, the author was specifically responsible for developing the non-touch gestures as well as designing and leading the user studies. Mr. Antti Karhu was responsible for the gaze tracking development and provided assistance in conducting the user studies. Dr. Seamus Hickey and Dr. Leena Arhippainen provided scientific expertise for the paper writing process and the user studies, respectively.

3.1.1 System Development

The purpose of the experimental system was to emulate a generic tablet device utilizing gaze tracking and non-touch gestures. The system was built on top of an HP EliteBook 2760p hybrid tablet/laptop. The laptop was augmented with two custom modalities: the gaze tracking system and the non-touch gesture recognition system. For the gaze tracking system, four infrared (IR) LED lights were installed on the face of the tablet device. A monochrome camera with an IR filter was installed underneath the device. The non-touch gesture recognition system utilizes a WAA-010 6-DOF sensor attached to the back of the user’s dominant hand. The sensor communicates with the laptop through Bluetooth. For user testing, a test scene running on the realXtend Tundra virtual space platform was generated [88]. The gaze tracking system and the non-touch gesture recognition system were separate applications and communicated with realXtend through custom interfaces utilizing TCP and JSON.
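Paper III states only that the two modality applications communicated with realXtend over custom TCP/JSON interfaces; the exact message format is not described. The following Python sketch therefore shows one hypothetical way such messages could be framed and sent. The field names, host and port are invented assumptions, not the actual interface of the prototype.

    # Hypothetical TCP/JSON message passing between a modality application and the VE.
    import json
    import socket
    import time

    def send_event(sock, event):
        # One JSON object per line keeps message framing simple over a TCP stream.
        sock.sendall((json.dumps(event) + "\n").encode("utf-8"))

    if __name__ == "__main__":
        with socket.create_connection(("127.0.0.1", 9999)) as sock:
            # A recognized non-touch gesture...
            send_event(sock, {"source": "gesture", "type": "grab_switch",
                              "timestamp": time.time()})
            # ...and a gaze sample giving the estimated point of regard in pixels.
            send_event(sock, {"source": "gaze", "x": 512, "y": 288,
                              "timestamp": time.time()})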

Gaze Tracking Implementation

The gaze tracking system utilizes the Pupil Corneal Reflection (PCR) method [89]. The PCR method tracks corneal reflections (glints) from the eyes of the user. The corneal reflections are produced by the IR LED lights and tracked by the front-facing camera. The direction of the user’s gaze is estimated by calculating feature vectors from the glints’ positions relative to the center of the user’s pupil.

Calibration is needed to acquire the model parameters. During calibration, the user’s head is held in a fixed position as he/she sequentially gazes at 16 fixed points on the screen. During use, the feature vectors are mapped through the model parameters. The output is the coordinates of the point on the screen that is the user’s point of regard.
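The exact mapping model is not detailed above, so the sketch below shows one common way such a calibration can be realized: a second-order polynomial fitted by least squares to the calibration points, mapping pupil-glint feature vectors to screen coordinates. The polynomial form and variable names are assumptions for illustration only.

    # A minimal sketch of polynomial gaze mapping fitted from calibration data.
    import numpy as np

    def design_matrix(f):
        # f is an (N, 2) array of pupil-glint difference vectors (fx, fy).
        fx, fy = f[:, 0], f[:, 1]
        return np.column_stack([np.ones_like(fx), fx, fy, fx * fy, fx**2, fy**2])

    def calibrate(features, screen_points):
        # Least-squares fit of the polynomial coefficients for x and y separately.
        A = design_matrix(features)
        coeffs, *_ = np.linalg.lstsq(A, screen_points, rcond=None)
        return coeffs  # shape (6, 2)

    def point_of_regard(coeffs, feature):
        # Map one feature vector to estimated screen coordinates.
        return design_matrix(np.atleast_2d(feature)) @ coeffs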

Non-Touch Gesture Implementation

The non-touch gestures provide virtual object manipulation capabilities for the experimental system. The manipulation of objects consists of acquisition (grabbing an object), translation, rotation and releasing. The gesture recognition system developed provides the following gestures for these manipulations: grab/switch, tilt, shake and still. Grab/switch (a fast downwards jerk) marks the selection of an object or switches between the two interaction modes (translation and rotation). Tilting is used for object translation and rotation: by slightly tilting his/her hand from the horizontal level, the user can translate or rotate objects, with the speed of translation or rotation depending on how far the user tilts. Shaking (turning the hand quickly towards left and right) marks the release of a held object. The "still" gesture meant simply keeping the hand relatively still and was used to inform the system that no gesture was taking place. It enabled continuous gesture recognition, so that the user did not need to press a button or otherwise signal his/her intention to perform a gesture. An illustration of the grab/switch, shake and tilt gestures can be seen in Figure 9. The gesture recognition system also had a throw gesture that was not used in this particular study; its use is further described in [90].

Fig 9. Grab, shake and tilt gestures, reprinted from Paper III with permission from ACM.

The "tilt" gesture was utilized measuring direct accelerometer values for pitch and roll. Other gestures were utilized through a pattern recognition system. The pattern

The pattern recognition system is a k/NN classifier that classifies the remaining gestures into classes using accelerometer and gyro values as feature vectors [91]. The feature space for "still" was trained to be large enough not to trigger other gestures when the user was manipulating objects using the "tilt" gesture. The overall gesture recognition accuracy was evaluated with a 10-fold cross-validation, which yielded approximately 98% accuracy. As with the gaze recognition technology, user tests were conducted to test the actual usability of the modalities. The detailed use of each modality is described in the "Experiment results" section.
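As an illustration of the type of evaluation described above, the following Python sketch estimates the accuracy of a k-nearest-neighbour gesture classifier with 10-fold cross-validation. The window length, value of k, class set and randomly generated data are placeholder assumptions, not the actual setup or results of Paper III.

    # Hypothetical k/NN gesture classification evaluated with 10-fold cross-validation.
    import numpy as np
    from sklearn.model_selection import cross_val_score
    from sklearn.neighbors import KNeighborsClassifier

    rng = np.random.default_rng(0)
    # Placeholder data: 200 gesture windows, each flattened to 6 channels x 32 samples.
    X = rng.normal(size=(200, 6 * 32))
    y = rng.integers(0, 3, size=200)           # e.g. grab/switch, shake, still

    clf = KNeighborsClassifier(n_neighbors=1)  # assumed k; the paper's value may differ
    scores = cross_val_score(clf, X, y, cv=10)
    print("10-fold CV accuracy: %.2f" % scores.mean())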

3.1.2 Experimental setup

Small-scale user tests were conducted to evaluate the gaze recognition and non-touch gestures. Initially, a pilot study with two subjects was conducted to find and fix any potential problems with the test setup. Nine male and four female subjects participated in the actual user test. The user tests consisted of tasks, questionnaires and an interview. In their entirety, the user tests proceeded as follows:

1. Background questionnaire
2. Introduction to task and controls
3. Gaze calibration
4. First object manipulation task (with either modality)
5. First questionnaire
6. Second object manipulation task (with either modality)
7. Second questionnaire
8. Interview.

The subjects were given a task which involved viewport turning and object manipulation in a 3D VE using a multi-modal user interface utilizing gaze tracking and non-touch gestures. To establish a baseline for comparison, we also evaluated the subjects’ performances in a similar task performed with touch-screen gestures. The test scene consisted of six dice and a goal area. The task of the subjects was to translate the dice from their initial position to the goal area and rotate the dice into a particular orientation; the subjects had to orient the numbers of the dice sequentially from 1 to 6. This required the subjects to rotate their point of view while simultaneously manipulating the location and orientation of several objects.

Half of the subjects performed the touch-screen version of the test first, and vice versa. The task completion times for both methods were measured, although the subjects were not specifically encouraged to perform the tasks as fast as they could. Each subject had one chance to complete the task. The subjects were given a short introduction but no practice round. The control taxonomies for both methods, as inspired by [92], can be seen in Figures 10 and 11.

Fig 10. Gaze and non-touch taxonomy, reprinted from Paper III with permission from ACM.

Fig 11. Touch control taxonomy, reprinted from Paper III with permission from ACM.

3.1.3 Experiment results and validation

With questionnaires and qualitative data from interviews, an evaluation of the following factors was sought: disorientation, comfort, ease of learning, naturalness, speed, accuracy, ease of use, sense of control and fun. Analysis of variance (ANOVA) was performed on the questionnaire data to compare the aforementioned factors between the gaze-gesture interface and the touch-screen interface. Task completion times were used as indicators of the fluency of the control methods.
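For readers unfamiliar with the procedure, the sketch below shows how a one-way ANOVA of this kind can be computed with SciPy. The Likert-scale scores used here are invented placeholders and are not the data of the study.

    # Illustrative one-way ANOVA on placeholder Likert-scale questionnaire scores.
    from scipy.stats import f_oneway

    gaze_gesture = [4, 5, 3, 4, 4, 5, 3, 4, 4, 5, 4, 3, 4]
    touch_screen = [3, 4, 3, 3, 4, 4, 2, 3, 4, 3, 3, 4, 3]

    f_stat, p_value = f_oneway(gaze_gesture, touch_screen)
    print("F = %.2f, p = %.3f" % (f_stat, p_value))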

Disorientation and comfort

Very little disorientation was reported by the subjects. The subjects had limited chances of controlling their orientation in the virtual space, so they seldom felt lost. No dizziness or nausea was reported by any of the subjects, including those who claimed to have a tendency for queasiness during 3D virtual space usage.

Ease of learning and naturalness

All modalities were considered easy to learn by the majority of the subjects. Also, no modality stood out as significantly easier or more difficult to learn in the opinions of individual subjects; all modalities were either equally easy or equally difficult to learn. According to the questionnaire data, the non-touch gestures were considered the easiest to learn on average, but the difference had no statistical significance according to the ANOVA analysis. Opinions on naturalness varied a lot between subjects, as they had different ideas of the concept of naturalness. Some subjects thought naturalness equals ease of use, while some regarded natural body movements as a natural modality regardless of their difficulty in controlling the tasks. In the questionnaire data, gaze tracking had the top score in naturalness on average, but with no statistical significance according to the ANOVA analysis.

Accuracy and speed of selection

With the gaze-gesture multimodal interface, object selection was conducted by looking at the targeted object and performing the "grab/switch" gesture.

After this, the object could be manipulated further. With the touch-screen modality, object selection was conducted by placing a finger on top of the desired object and holding the finger pressed for a short moment. With the questionnaires, we gathered quantitative data on the accuracy and speed of selection for each individual modality. There were significant differences in the scores received by the modalities: f(2,24) = 17.45, p < 0.05 for accuracy and f(2,24) = 7.85, p < 0.05 for speed of selection. These differences were caused by instabilities in the gaze accuracy.

The gaze tracking accuracy varied from satisfactory to intolerable between subjects. The accuracy either remained satisfactory at all times, was intolerable from beginning to end, or gradually deteriorated during the test. The inaccuracy was reported as a bias in the gaze tracking: the rectangle indicating the point of regard was in a different position than the subjects’ actual point of regard. If the gaze tracking accuracy was inadequate from the beginning, the reason was most likely a failed calibration. Deterioration of gaze tracking accuracy was due to user head movement during the test. One user with particularly bad gaze tracking accuracy had a squint; it is possible that the squint had an effect on the calibration or tracking result. There were also two cases where the calibration failure was significant enough to prevent the subject from completing the task. In one of these cases, the gaze tracking system failed to detect the subject’s pupil entirely.

The subjects who reported constantly satisfactory gaze tracking accuracy were almost as fast to finish the task with the gaze-gesture controls as with the touch-screen controls. Some users could also overcome bad gaze tracking accuracy by compensating for their point of regard as needed and could finish the task easily nevertheless. The median completion time was 7:59 minutes for the gaze-gesture interface and 4:54 for the touch-screen. The biggest difference a single user had in task completion times between gaze-gesture and touch-screen controls was over 15 minutes.

The non-touch gestures were perceived as rather fast and accurate. The sensor was fast enough to react, and the recognition of the "grab/switch" gesture was rather reliable. The subjects thought that the non-touch gestures and touch gestures were somewhat similar in accuracy and speed of selection, although some considered the touch gestures to be slightly slower.

Ease of use and sense of control

According to the questionnaire data, subjects considered object moving to be significantly easier with non-touch gestures than with touch gestures (f(1,16) = 5.24, p < 0.05).

Object turning was slightly easier with touch gestures on average, but the difference was not statistically significant. There was no significant difference in the sense of control either, although the gaze-gesture controls received a slightly higher score on average. The questionnaire mean scores for Ease of Movement, Ease of Turning and Sense of Control can be seen in Table 2.

Table 2. Ease of use and sense of control in 5-point Likert scale. Modified from Paper III.

Method/Factor       Gaze-gesture mean   Touch mean
Ease of Movement    4.1                 2.8
Ease of Turning     3.4                 4.8
Sense of Control    3.8                 3.4

Even though the non-touch gestures were seen as fast and accurate according to the questionnaire data, there were plenty of qualitative observations that point out the need for improvements. The most significant flaw was the right-handed subjects’ difficulty in tilting the sensor to the far left, and the opposite with left-handed subjects. According to some subjects, tilting backwards was more difficult than tilting forwards. Sometimes the "shake" gesture was triggered accidentally when a subject was tilting the sensor in a rapid motion. Although subjects thought translating the objects by tilting was easy enough, some initially tried to perform the translation by moving their hand as if actually picking up the object and placing it elsewhere. Some subjects also expected the orientation of the objects to be directly mapped to the pitch, roll and yaw of the sensor for more realistic turning.

Fun

According to questionnaires, no modality was significantly more fun than the others. According to the interviews, the gaze tracking modality was clearly the most interesting one, even though occasionally the tracking problems made its use stressful.

Completion Times

We measured the task completion times for the gaze-gesture control method and the touch-screen control method. The mean and median completion times were 9:47 and 7:59 minutes for the gaze-gesture interface.

The mean and median completion times were 5:08 and 4:54 for the touch-screen interface. The standard deviation was 4:15 for the gaze-gesture interface and 1:01 for the touch-screen interface. The fluctuating gaze tracking accuracy was the main cause of the large deviations in the gaze-gesture task completion times. Also, people experienced in 3D video games seemed generally to perform the fastest.

3.1.4 Discussion on the human action transformation

The transformations performed by the prototype system concerned essentially micro-level human activity, not entirely unlike what is done with game controllers such as the Microsoft Kinect or Nintendo Wii. While tilting and hand gestures were utilized to control objects in World Space, these kinds of actions would be Object Space vertex translations directly controlling object animation if transformed into human avatar movements.

The VE was non-immersive, except for the "matching" property. Except for primitive objects and viewpoint manipulation, the VE does not provide realistic sensory stimulation in terms of display properties. The environment does, however, respond to the user’s bodily movements, and the user can interact with objects in the VE. The VE is too abstract and unrealistic for discussing whether the activity transformations have the ability to increase or decrease presence. The EWK does not cover any properties of the physical environment except how the objects of the VE respond to interaction. The graphics used are very plain and abstract, not even attempting to represent reality in any way; therefore, RF is very low. The display device is limited to a small mobile monoscopic screen. The viewpoint of the user is fixed to the proximity of the dice and the goal area. The user can pan the viewport with eye movements in real time, resulting in a somewhat mid-level EPM.

While the tilting of the wrist was measured by reading raw accelerometer values, the hand gestures and eye tracking utilized classification to transform these actions into commands in the VE. In the transformation utilized in this research, only a small subset of human action is passed through; the focus is strictly on small motions of one hand and gaze direction. A summary of the theoretical properties of Construct 1 can be seen in Table 3.

Table 3. Theoretical properties of Construct 1.

Property                                    Quality
Type of translation                         Micro-level activity
Affected coordinate system                  Control inputs affect World Space and Camera Space. In humanoid avatars, inputs would apply to (sub-) Object Space
Immersion                                   Environment matches to body movements. Otherwise non-immersive.
Extent of World Knowledge (EWK)             How
Reproduction Fidelity (RF)                  Environment not modeled
Extent of Presence Metaphor (EPM)           Limited to mobile displays. The viewer is in the middle of the VE and can pan viewpoint with eye movements.
Data source and activity recognition        Wearable sensor and an infrared camera - sensor direct measurement, gesture classification, PCR
Extent of action passing the translation    Subset of hand movements. Eye movements

3.2 Construct 2: Avatar and 3D VE based visualization of elderly patient activities

Construct 2 consists of the studies described in Papers I, II, V, and VI. Papers I and II focus on sensing and classifying the activities and logical in-house location of individuals wearing accelerometer-gyro hybrid sensors. Papers V and VI additionally describe visualizing the classified activities in a 3D VE, as well as user studies with home care personnel evaluating human activity visualization in a 3D VE. The evaluation was intended to test the hypothesis that virtual environments could act as a non-intrusive alternative to direct video surveillance in cases where some level of surveillance is desired. In each study, the individuals are elderly patients living in a hospice. Papers I and II describe and use patient data captured at Oulun Diakonissalaitos (ODL) in Oulu, Finland. Papers V and VI are based on data capturing experiments held in the Karpalokoti hospice in Pyhäjärvi, Finland. Each study utilizes a similar sensor network for data capturing. The pattern recognition system used to classify subject activities varies slightly between the studies.

Multiple authors have contributed to the studies concerning this research question. In Paper I, Professor Tomohiro Kuroda and Dr. Haruo Noma provided instruction on the use of the portable sensor network system used to capture subject data, as well as participated in the data capturing experiment. Dr. Seamus Hickey also participated in the data capturing experiment, which he also organized, and provided assistance in the paper writing process.

In Paper II, Dr. Risto Honkanen developed a Neural Network [93] based classifier for subject data classification, as well as an improved version of the k/NN [91] classifier initially developed by the author of this thesis. In Paper VI, Professor Jonna Häkkilä provided expertise in performing the more in-depth user studies and analysis that augment the findings of Paper VI. The author of this dissertation participated by leading the data capturing experiments providing the data for Papers I and II, and organized and supervised the data capturing experiments providing the data for Papers V and VI. The author performed the pattern recognition analysis solely in all papers except in Paper II, in which Dr. Honkanen contributed significantly to the analysis. The virtual environment based visualizations in Papers V and VI were developed by the author. The user evaluation in Paper V was performed by the author, and in Paper VI together with Prof. Jonna Häkkilä.

3.2.1 System overview

This section gives an overview of the sensor network, pattern recognition and visualization systems used in all studies.

Sensor network

The activity data in each study is captured through a portable sensor network. An overview of the sensor network is shown in Fig 12. The sensor network consists of the following components:

1. Two wrist-held sensors to capture activities from hand motions
2. Proximity sensors placed in the close surroundings to capture a location approximation
3. A proximity sensor master module held by the subject
4. A PDA utilizing client software and Bluetooth to record all sensor data
5. An optional PC utilizing server software and WLAN to record large amounts of data transferred by the PDA.

Activity recognition systems

Each study uses a classifier for the prediction of subject activities. Essentially, this means that there is a finite set of activities from which the most probable one is chosen at a time. How this choice is made depends on the properties of the classifier.

An overview of the classifiers used in each study is given in Table 4.

Fig 12. Overview of the sensor network. Modified from paper VI.

Table 4. Overview of the classifiers used in papers I, II, V and VI.

Properties   Paper I      Paper II                    Papers V and VI
Classifier   1/NN         1/NN and Neural Network     1/NN with location constraints
Features     Mean, STD    Raw data                    Mean, STD, FFT Energy, Correlations
Accuracy     74%          Nearly 100% and 93%         80%
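As an illustration of the feature types named in Table 4, the following Python sketch computes mean, standard deviation, FFT energy and pairwise axis correlations over one window of tri-axial accelerometer data. The window length, normalisation and data layout are assumptions for illustration; they are not the exact feature definitions used in the papers.

    # Illustrative window features: mean, STD, FFT energy and axis correlations.
    import numpy as np

    def window_features(window):
        # window: (n_samples, 3) array of x, y, z accelerometer readings.
        means = window.mean(axis=0)
        stds = window.std(axis=0)
        # FFT energy per axis, excluding the DC component, normalised by length.
        spectra = np.abs(np.fft.rfft(window, axis=0))[1:]
        energy = (spectra ** 2).sum(axis=0) / len(window)
        # Pairwise correlations between axes (xy, xz, yz).
        corr = np.corrcoef(window, rowvar=False)
        correlations = corr[np.triu_indices(3, k=1)]
        return np.concatenate([means, stds, energy, correlations])

    example = np.random.default_rng(1).normal(size=(128, 3))
    print(window_features(example).shape)  # 3 + 3 + 3 + 3 = 12 features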

Visualization system

The visualization system utilizes the realXtend Tundra platform [88] to present subject activity with an animated avatar acting in a virtual environment. For the experimentation, a single dataset containing the activity of a single elderly patient was chosen in an attempt to roughly replicate all activity taking place with a finite set of activity classes. This set of actions could then be transformed into pre-recorded avatar animations whose number is equal to or greater than the number of classes representing the activity. Location is treated in a more general manner. Whereas multiple subjects performing the same activity can be presented using the same avatar animation, their surroundings might differ greatly, which adds complexity to the location presentation. With this in mind, an abstract virtual environment representation of the subject’s surroundings was created.

The hypothesis was that multiple different locations could be effectively visualized with a single virtual environment representation. The initial design can be seen in Figure 13. The actual location data is derived from the proximity sensor readings: the avatar’s location is moved within the virtual environment representation according to the closest proximity sensor.

Fig 13. VE representing elderly patient activity. Reprinted from Paper V © 2013 IEEE.

In the chosen dataset, the subject activity was represented with 18 classes that are visualized with 24 avatar animations. The six additional animations result from separate standing and sitting animations for the "Idle" activity class, as well as extra animations for activities that can be performed either while standing up or sitting down. According to the characteristics of the activities, they were divided into three categories: "Transition", "Manual actions" and "Social/Miscellaneous". Some activities are very symbolic, such as "Use objects", which represents the majority of different object manipulation activities, from reading a newspaper to wiping hands with a towel. At this point, there was no domain-specific criterion available for choosing the classes (such as relevance from the home care providers’ perspective); it was merely attempted to visualize all activities with a rough categorization. All activity classes can be seen in Table 5.

Table 5. Activity classes with their recognition accuracies.

Transition              Manual actions               Social / Miscellaneous
Walk 81%                Open 58%                     Wave hands 0%
Idle (Sit/Stand) 88%    Grab 50%                     Point 60%
Sit down 27%            Put down 52%                 Touch head 71%
Get up 37%              Pick up (from ground) 54%    Touch abdomen 66%
                        Pull 40%
                        Turn on faucet 67%
                        Turn off faucet 0%
                        Wash hands 86%
                        Sweeping motion 78%
                        Use objects 75%
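To make concrete how a classified activity and a proximity reading could drive the avatar, the following hypothetical Python sketch maps the strongest proximity reading to a logical location and an activity class to a pre-recorded animation. The sensor identifiers, location names and animation names are invented for illustration and are not those of the prototype.

    # Hypothetical mapping from sensed activity and proximity to avatar state.
    SENSOR_TO_LOCATION = {
        "sensor_bed": "bedroom",
        "sensor_sink": "bathroom",
        "sensor_table": "kitchen",
    }

    ACTIVITY_TO_ANIMATION = {
        "Walk": "walk_loop",
        "Idle": "idle_stand",
        "Wash hands": "wash_hands",
        "Use objects": "use_objects_stand",
    }

    def avatar_update(activity, proximity_readings):
        # proximity_readings: {sensor_id: signal strength}; the strongest reading
        # is taken as the subject's approximate (logical) location.
        closest = max(proximity_readings, key=proximity_readings.get)
        location = SENSOR_TO_LOCATION.get(closest, "unknown")
        animation = ACTIVITY_TO_ANIMATION.get(activity, "idle_stand")
        return location, animation

    print(avatar_update("Wash hands", {"sensor_sink": 0.9, "sensor_bed": 0.2}))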

Skeletal animations for animating humanoid avatars can be purchased online as well as downloaded free of charge from repositories such as the CMU Graphics Lab Motion Capture Database 2. In this study, the Vicon Workstation motion capture system was used to capture the required animations. The animations were refined in the Blender modeling software and transformed into Ogre [95] skeleton animations, which is the format required by the realXtend Tundra virtual environment platform.

3.2.2 Evaluation and specification gathering

To investigate whether the virtual environment visualization had the potential to transmit information, as well as its practical usefulness, the visualization was evaluated with medical personnel working with elderly patients. The evaluation consisted of two focus group interviews as well as an online survey. In addition to evaluating the current state of the prototype, it was sought to better understand the requirements medical personnel might have for this and similar systems, for future development.

The first focus group interview

The first focus group interview (FG1) was held at ODL, Oulu. The interview participants consisted of ten nurses and one medical doctor working with elderly patients in a hospice. The aim was to assess the intelligibility and potential usefulness of the virtual environment visualization. First, the prototype visualization application was shown to the participants to enquire about their first impressions.

2 http://mocap.cs.cmu.edu/

The participants were also asked how often they require information about a specific patient’s activities while he/she is not present. It was also asked which patient activities are relevant for their work. In the second phase of the interview, each participant was given a questionnaire with three questions for each activity. The questions concerned 1) the meaning of the animation, 2) the importance of the activity and 3) the intelligibility of the animation. Question 1) was open-ended, whereas questions 2) and 3) used a Likert scale in the range 1-5. After this, each activity was presented as avatar animations from video, and the participants filled in the questionnaire for each activity accordingly. With the first question, it was intended to find out whether the participants understood the meaning of the animation. With question 2), a rough subjective quantification of the relevance of each activity from the perspective of the participant’s work was sought. The intent of question 3) was to quantify the intelligibility of the animation after the true meaning of the animation was revealed to the participants.

The second focus group interview

The second focus group (FG2) consisted of two nurses working in elderly patient home care. While FG1 concentrated on evaluating the intelligibility and usefulness of the actual prototype, in FG2, additional features of the system were introduced at the concept level. The additional features were alternatives to the virtual environment representation (see Figure 14), an ability to view historical activity information, and automatic alarms. The participants were first interviewed to find out their general routines and challenges in their daily work. A concept video of the visualization system was then shown to enquire about first impressions. The next step was to present the participants with a questionnaire as in FG1, but this time asking only for the importance of each activity on a Likert scale of 1-5. Next, seven different concepts were shown for visualizing the patient’s location and activity, as well as three different concepts for the avatar representation. Each alternative was discussed, and the participants explained the pros and cons they saw in each alternative. The visualization alternatives were:

(a) Patient’s location and activity as plain text.
(b) Patient’s location on a map and activity as text.
(c) Patient’s location in a 3D virtual environment with an imaginary layout and activity displayed with animation and text.
(d) Patient’s location in a 3D virtual environment with realistic layout and activity displayed with animation and text.
(e) Patient’s location in a 3D virtual environment with realistic textures and layout. Activity is displayed with animation and text.
(f) Patient’s location in a 3D virtual environment with an imaginary layout. Activity is displayed with animation and text. The visualization has a photograph of the patient’s house as a reminder for the nurse.
(g) Patient’s location in a 3D virtual environment with realistic layout. Activity is displayed with animation and text. The visualization has a photograph of the patient’s house as a reminder for the nurse.

The visualization concepts can be seen in Figure 14.

Online survey

Seventeen participants working in private and municipal home care, aged 19-53, answered the online survey. The preferred visualization concepts for the virtual environment and avatar representation were enquired about. The activities perceived to be the most important were asked about on a Likert scale of 1-5. The survey also included open-ended questions to gather qualitative data such as general impressions. The survey also collected preferences regarding the visualization of the subject’s surroundings. The importance of the following properties of the surroundings was asked about using a Likert scale of 1-5: stove is turned on/off, water faucet is turned on/off, TV is turned on/off, front door is open, balcony door is open. It was also asked which other properties of the surroundings were important, if any.

3.2.3 Results

The results gathered from FG1, FG2 and online survey participants are summarized in this section.

User needs and general suitability

The interviews emphasized the fast-paced and intensive work of the home care providers. The stressful aspects are further intensified by temporary changes and unusual arrangements, such as during vacations and sick leaves.

The general properties of a potentially helpful mobile system were: 1) the system should display location-aware information about the next patient to be visited, 2) relevant information should be acquired at a single glance, and 3) the system should allow for taking manual notes and displaying them to the home care provider next in turn.

The first impressions ranged from positive to very negative regarding the use of activity recognition and a virtual environment in the remote surveillance of a patient. While some participants saw potential in the visualization system, it was not uncommon that the visualization was seen as somewhat intrusive. Also, a minority of participants considered the visualization to be very unethical. Privacy and ethical concerns were usually alleviated after it became clear that the visualization can show only a limited, predetermined set of actions. The survey participants were generally more concerned about privacy and ethical issues than the FG1 and FG2 participants. Some participants also thought that even if the system were helpful and accepted by the home care personnel and the patient, the patient’s relatives might be difficult to convince. Some participants thought the visualization system should be used only with memory-impaired patients instead of all patients within the scope of home care. Generally, it was regarded as a positive fact that the system uses sensors instead of cameras to visualize only a predetermined set of actions of the patient, and the patient only; anything beyond that would be too intrusive. Also, the system should be used as a support system and not to alienate the nurses from their patients.

The relevance of the visualizations

Not all activities listed in Table 5 were considered equally important. The average importance of each activity according to the quantitative survey data can be seen in Figure 15. We also asked the participants to describe activities they considered important for their work. The most requested activities were sleeping, eating, going to the toilet, falling, as well as entering and exiting the house. Other patient activities that were mentioned were inactivity, taking (or abusing) medicine, cooking, accessing the refrigerator, movement within the apartment, going to the mailbox, and detailed walking routes outside the house. For some activities, not only the mere occurrence of the activity was considered relevant but also how the activity was performed. Especially walking was considered an activity that required additional descriptive information (FG1, online survey). Properties such as whether the patient was limping, walking with or without a cane, or using his arms to keep balance were considered worth visualizing.

Fig 14. (a) Plain text. (b) 2D map. (c) 3D with standard layout. (d) 3D with realistic layout. (e) 3D with realistic layout and realistic textures. (f) 3D with standard layout and a reminder photograph. (g) 3D realistic layout and a reminder photograph. (h) Three avatar representations. Reprinted from Paper VI.

Generally, it was not considered necessary to visualize patient activity in great detail. However, an exception was the nurses who thought the visualization system would be helpful with memory-impaired patients; a detailed visualization was considered helpful in some use cases. The main use cases are presented in Table 6.

Fig 15. Average importance of each animation as seen by participants. Reprinted from Paper VI.

Table 6. Key use cases for the 3D visualization-based activity recognition system. Modified from Paper VI.

Use Case: to check how the patient fares through the day
Example Comments: "The daily rhythm of a memory-impaired patient, if something happens ... eating, taking medicine and other daily information like that..." (Online survey, "What information would ease your work?").

Use Case: to help the patient to remember what he or she has or has not done
Example Comments: "This could work with memory impaired patients when there’s concern if they are getting by. Quite often they cannot tell when you ask what they’ve been doing." (Online survey).

Use Case: to verify the validity of the patient’s own reflections or narratives
Example Comments: "There is, for instance, this one patient, who claims he has done these long walks every day, and I am sure he is not able to do that. With that kind of system, I could quickly check if there is any truth in there." (FG2).

In general, visualizations concerning the patient’s surroundings were considered more relevant than the patient’s own activities. Most of the suggestions we made in the online survey concerning visualizations of the patient’s surroundings were considered important, as can be seen in Table 7.

The participants also suggested the following visualizations: bathroom lights being turned on/off, the general cleanliness of the apartment, any objects on the floor that might cause the patient to fall, temperature, air quality and the presence of other people.

Table 7. Importance of visualizing properties of patient’s surroundings. Modified from Paper VI.

Visualization      Very Important   Important   Moderately Important   Somewhat Important   Not Important
Stove on/off       93.8% (15)       0.0% (0)    0.0% (0)               0.0% (0)             6.3% (1)
Water on/off       81.3% (13)       12.5% (2)   0.0% (0)               0.0% (0)             6.3% (1)
TV on/off          18.8% (3)        18.8% (3)   37.5% (6)              18.8% (3)            6.3% (1)
Front door open    87.5% (14)       6.3% (1)    0.0% (0)               0.0% (0)             6.3% (1)
Balcony open       56.3% (9)        31.3% (5)   6.3% (1)               0.0% (0)             6.3% (1)

Clarity of the activity visualizations

The clarity of the visualizations was studied mainly in FG1. The visualizations describing patient activity were not equally understandable according to the participants. Activities within the transition category were the easiest to understand, while the manual actions were the least clear. The social/miscellaneous category was understood relatively well. The least intuitive visualization was the highly symbolic "Use objects" animation intended for depicting a wide range of manual activities. Even though the participants were able to correctly identify the actions in the social/miscellaneous category, it received low clarity scores in the questionnaire. While this was not confirmed with the participants, the author believes the low score in fact reflects the low importance of the activities in this category. The average clarity of each animation according to the quantitative survey data can be seen in Figure 16. The question of how to visualize the patient’s surroundings was brought forth by the difficulties of understanding the manual actions category.

Representing the virtual environment

As mentioned earlier, one hypothesis was that multiple patient domiciles could be presented using a single, symbolic representation. The participants of FG1 highlighted the need for an at least somewhat responsive and animated virtual environment. The following examples were given by the participants: if the virtual character picks up an object, a symbolic object (not necessarily representing the actual object) should be seen in the hand of the virtual character.

If a piece of furniture, such as a table or a sink, is related to an action, the animation should be carefully aligned with a corresponding virtual object that is also animated as necessary. Placing the virtual character only somewhat close to the object while performing the animation is not clear enough. Also, the participants stated that the real home of the patient might have properties that are relevant for the nurses to know of, such as stairs, balconies or high doorsteps.

In FG2 and the online survey, it was sought to gather more specifications for the virtual environment, and visualization alternatives were presented for the virtual environment as well as for the virtual character, as shown in Figure 14. In general, the plain text and the 2D map (Figure 14a,b) were seen as the best options. Within the 3D alternatives, the realistic layout without textures (Figure 14d,g) was seen as the best alternative; realistic textures were seen as unnecessary cognitive load. A photograph as a memory aid (Figure 14g) was seen as helpful by some, but intrusive by others. The participants saw potential in using the visualizations in combination: plain text could be used for fast and clear overviews, while the 3D realistic layout could be switched on when in need of details, temporal overviews or otherwise browsing the patient activity history. Several participants also mentioned that the realistic layout visualization could be used to help find a fallen patient or to observe a patient walking in areas with a high risk of falling.

Fig 16. Average clarity of each animation as seen by participants. Reprinted from Paper VI.

Representing the avatar

The majority of the participants considered a monochrome avatar to be the most suitable for the application. Monochrome avatars were considered non-intrusive as well as suitable for easily distinguishing between different patients or signaling alerts using colors.

Each patient could be assigned a different color, and the virtual character could flash red if a dangerous situation, such as a fall, was detected. One special case remained in support of a realistic-looking avatar: if a memory-impaired patient has wandered outside and needs to be found, the realistic appearance of the avatar could be used for identification.

3.2.4 Discussion on the human action transformation

The transformations performed by Construct 2 produce actions that can be associated with a meaning. While the eye-hand interface of Construct 1 produced micro-level movements that can be transformed into subtle movements, the transformations produced by Construct 2 result in more general activities that can be identified with a purpose. These types of actions are animated in the Object Space of the avatar, possibly by triggering pre-captured animations.

The prototype system can be used with any type of display system; therefore, the VE is inclusive and surrounding, but not vivid due to the simplified graphics. The avatar responds to multiple bodily movements, attributing to high matching. The bodily movements are not transferred explicitly, however. The VE has minimal interaction capabilities. The recognized activities trigger animations as detected, which can result in sudden changes in activities that can be harmful for presence. Also, the rough location transitions are applied by simply teleporting the avatar between locations.

In the prototype system, the EWK amounts to rough knowledge of key locations and a pre-known set of human actions. By gathering interview data, it was found that the home care personnel considered only a subset of the action data to be relevant, while the amount of location detail should be increased. A merely abstract representation of locations was not enough; the virtual environment should match the patient’s actual environment. Therefore, the evaluation participants hoped for a higher EWK than was originally presented (adding Where as well as an increased amount of What). While EWK is less relevant for the activity transformations, the interview participants found EWK to be highly important.

Even though meaningful objects are displayed in the VE, the realism of the computer graphics is kept at a rather abstract level. The objects are modeled with simplified geometry and monochrome textures. While sketches of more realistic representations were presented as an option, it was found that the simplified and monochrome representation was actually preferred, resulting in low RF. Using the realXtend platform, immersive display devices are supported and the viewpoint can be controlled freely, resulting in high EPM. Classifiers, such as those used in these studies, assign an input vector to one of a discrete set of classes ([56], p. 179). This means that the activity set visualized at a time is also limited.

human action can pass through the transformation can be adjusted but is always limited. Generally, the smaller the number of actions transformed, the better the classification performance. A summary of the theoretical properties can be seen in Table 8.

Table 8. Theoretical properties of Construct 2.

Property                                    Quality
Type of translation                         Coarse level activity
Affected coordinate system                  Mainly Object Space
Immersion                                   Body movements match a limited set of animations. Sudden animation and location changes can reduce presence.
Extent of World Knowledge                   A general set of activities, key location properties (What without geographic Where)
Reproduction Fidelity                       Simplified geometry and monochrome textures
Extent of Presence Metaphor                 Any type of display
Activity recognition method                 Classifier with constraints, proximity sensing
Extent of action passing the translation    Predetermined set of actions, rough (logical) location

3.3 Construct 3: Using GPS to Control an Avatar in a 3D VE

Paper IV describes a study concerning the use of a GPS device to control an avatar in a virtual 3D environment. Three models for transforming a GPS signal into an avatar movement are evaluated.

3.3.1 System overview

The 3D scene used was a representation of the centre of the City of Oulu, Finland. The scene was running on the realXtend Tundra platform [88]. The model is oriented along the cardinal directions, and a single coordinate unit of the scene World Space corresponds to one meter in the real world. Due to the small size of the simulated environment, the coordinate transformation from the real world to the VE is rather straightforward: we transformed the geospatial locations acquired from a GPS device into the coordinate system of the Oulu virtual scene using equations 1, 2, 3 and 4. These equations were chosen because the simulated area is no larger than 300 m². For significantly larger distances, the Haversine formula might be more suitable [96]. In equations 1 and 2, the displacement from the scene origin, ∆lat and ∆lon, is acquired by subtracting the geographical coordinates of the GPS device, lat and lon, from the geographical coordinates of the scene origin, lat_c and lon_c. As the distances calculated are short in terms

of global scale, we transform ∆lat and ∆lon into meters by assuming that one degree of latitude equals approximately 111.28 kilometers (which, admittedly, is a slight underestimation, the degree of latitude in Oulu being closer to 111.4 km in the WGS84 system [97]). Equation 3 shows how the displacement from the origin is transformed from degrees into meters by multiplying ∆lat by 111.28 and multiplying the result by 1000. When converting longitude, the length of a degree depends on latitude; hence in equation 4, the conversion to meters is acquired by multiplying the length of a degree of latitude (111.28) by the displacement in longitudinal degrees times the cosine of latitude at a point of reference (here, the variable lat). Contrary to latitude, the final conversion from kilometers to meters is done by multiplying by −1000 instead of 1000. The reason for this is that the coordinate system in RealXtend Tundra places the positive x-axis towards east but the negative z-axis towards north.

∆lat = lat_c − lat     (1)
∆lon = lon_c − lon     (2)
x = 1000 (111.28 ∆lat)     (3)
z = −1000 (111.28 ∆lon cos(lat))     (4)

After the GPS data has been transformed into the virtual model coordinate system, the virtual character can use the transformed coordinates as waypoints and thus mimic the movements of a real person acting as the source of the GPS data. However, the initial GPS coordinates are rarely accurate enough for the purposes of virtual agent movement; the virtual character might end up elsewhere than its real-world counterpart or might try to walk into impossible locations, such as inside walls. To increase the accuracy and usefulness of the transformed coordinates, three models for enhancing the transformed coordinates were developed and tested. The first model, Path Averaging, attempts to increase the accuracy through aggregation of a large number of data points. The latter two models utilize a navigation mesh [17] to describe the locations within the virtual environment that are allowed for virtual characters.
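To make the conversion concrete, the following is a minimal sketch of equations 1–4 in Python; the function and variable names are illustrative and do not correspond to the actual implementation of Paper IV, and the cosine term assumes the latitude is given in degrees.

```python
import math

KM_PER_DEG_LAT = 111.28  # kilometers per degree of latitude, as used in equations 3 and 4

def gps_to_scene(lat, lon, lat_c, lon_c):
    """Transform a GPS fix (lat, lon) into scene coordinates (x, z)
    relative to the scene origin (lat_c, lon_c), following equations 1-4."""
    d_lat = lat_c - lat                                    # equation 1
    d_lon = lon_c - lon                                    # equation 2
    x = 1000.0 * (KM_PER_DEG_LAT * d_lat)                  # equation 3
    # A degree of longitude shrinks with the cosine of latitude; the factor of
    # -1000 maps kilometers to meters on Tundra's negative-z-towards-north axis.
    z = -1000.0 * (KM_PER_DEG_LAT * d_lon * math.cos(math.radians(lat)))
    return x, z
```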

Path Averaging

The purpose of the Path Averaging model was to examine whether a large number of inaccurate GPS points cancel out each other's inaccuracies when aggregated. The K Nearest Neighbor [91]

algorithm with K=12 was used for aggregation. The GPS coordinates of the virtual character are compared against a previously recorded teaching set of GPS coordinates. The centroid of the 12 closest points of the teaching set determines the location of the virtual character. The teaching set used in this experiment was generated manually using portable GPS transmitters and smartphones.
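The aggregation step could be sketched as follows, assuming that both the waypoints and the teaching set have already been transformed into scene coordinates; this is an illustrative sketch, not the actual implementation of Paper IV.

```python
import numpy as np
from scipy.spatial import cKDTree

def path_average(waypoints, teaching_set, k=12):
    """Replace each noisy waypoint with the centroid of its k nearest
    neighbours in a previously recorded teaching set (Path Averaging)."""
    waypoints = np.asarray(waypoints)        # shape (M, 2): x, z coordinates
    teaching_set = np.asarray(teaching_set)  # shape (N, 2)
    tree = cKDTree(teaching_set)
    _, idx = tree.query(waypoints, k=k)      # k nearest teaching points per waypoint
    return teaching_set[idx].mean(axis=1)    # centroid of the k neighbours
```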

Map Matching

Essentially, Map Matching means using prior knowledge of a location to translate unlikely coordinates into their closest probable locations. As an example, see [98]. The Map Matching model in this study utilizes a navigation mesh to move GPS points from unwalkable locations (such as inside walls) into locations that are known to be walkable. As described earlier (see section 2.1.1), a navigation mesh is a polygon mesh that defines the walkable areas of a virtual environment. Moving objects within a virtual environment are limited to navigating only inside the polygons of the mesh (here referred to as nodes). In this work, when using the Map Matching model with a navigation mesh, each GPS coordinate of a waypoint trail is tested to determine whether it lies within one of the nodes of the navigation mesh. For each coordinate falling outside the navigation mesh, the six closest nodes of the navigation mesh are sought and the coordinate is matched within the node that lies closest to the previous coordinate of the waypoint trail. By favoring waypoints residing in neighboring nodes, the resulting waypoint trails are more realistic than when simply choosing the geographically closest node. A comparison can be seen in Figure 17.
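The logic can be sketched as follows, assuming the navigation mesh nodes are triangles given as coordinate triples; snapping to the node centroid is a simplification of matching the coordinate within the node, and the helper names are illustrative rather than taken from Paper IV.

```python
import math

def centroid(tri):
    (x1, y1), (x2, y2), (x3, y3) = tri
    return ((x1 + x2 + x3) / 3.0, (y1 + y2 + y3) / 3.0)

def dist(a, b):
    return math.hypot(a[0] - b[0], a[1] - b[1])

def in_triangle(p, tri):
    """Sign-based point-in-triangle test."""
    def sign(a, b, c):
        return (a[0] - c[0]) * (b[1] - c[1]) - (b[0] - c[0]) * (a[1] - c[1])
    d1, d2, d3 = sign(p, tri[0], tri[1]), sign(p, tri[1], tri[2]), sign(p, tri[2], tri[0])
    return not ((d1 < 0 or d2 < 0 or d3 < 0) and (d1 > 0 or d2 > 0 or d3 > 0))

def map_match(trail, nodes):
    """Snap waypoints that fall outside the mesh into a nearby node,
    favouring the candidate closest to the previous (matched) waypoint."""
    matched = []
    for p in trail:
        if any(in_triangle(p, tri) for tri in nodes):
            matched.append(p)                      # already on walkable ground
            continue
        candidates = sorted(nodes, key=lambda t: dist(centroid(t), p))[:6]
        prev = matched[-1] if matched else p
        best = min(candidates, key=lambda t: dist(centroid(t), prev))
        matched.append(centroid(best))             # simplification: snap to node centroid
    return matched
```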

Fig 17. Left, a map matching to the closest node. Right, a map matching favoring nodes closer to the previous waypoint. Red color shows the uncorrected route, green the corrected route. Areas inside triangles are permissible for walking. Reprinted from Paper IV © 2013 IEEE.

Path Generation

In the Path Generation model, the GPS coordinate trail is first downsampled and map matched using the model described in the previous section. After this, an A* pathfinding algorithm is used to generate routes through the navigation mesh between the intermediate waypoints. Finally, the resulting path is smoothed using a funneling algorithm [17]. The funneling algorithm adjusts the waypoints of the path to use the shortest distances through adjacent nodes; this makes the path more realistic, as the character does not walk through the centerpoint of every navigation mesh node.
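The pathfinding step can be sketched as a standard A* search over the adjacency graph of the navigation mesh nodes. The sketch below uses node centroids for both the edge costs and the heuristic, omits the funneling step, and its data structures (adjacency and centroid dictionaries keyed by node id) are assumptions rather than the actual Paper IV implementation.

```python
import heapq, itertools, math

def a_star(adjacency, centroids, start, goal):
    """A* over navigation mesh nodes. `adjacency` maps a node id to its
    neighbouring node ids, `centroids` maps a node id to an (x, z) point.
    Returns the list of node ids from start to goal (empty if unreachable)."""
    def h(n):
        (x1, z1), (x2, z2) = centroids[n], centroids[goal]
        return math.hypot(x1 - x2, z1 - z2)

    tie = itertools.count()                       # tie-breaker for the heap
    open_set = [(h(start), next(tie), 0.0, start, None)]
    came_from, g_score = {}, {start: 0.0}
    while open_set:
        _, _, g, node, parent = heapq.heappop(open_set)
        if node in came_from:                     # already expanded
            continue
        came_from[node] = parent
        if node == goal:                          # reconstruct the path
            path = []
            while node is not None:
                path.append(node)
                node = came_from[node]
            return path[::-1]
        for nb in adjacency[node]:
            (x1, z1), (x2, z2) = centroids[node], centroids[nb]
            tentative = g + math.hypot(x1 - x2, z1 - z2)
            if tentative < g_score.get(nb, float("inf")):
                g_score[nb] = tentative
                heapq.heappush(open_set, (tentative + h(nb), next(tie), tentative, nb, node))
    return []
```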

3.3.2 Experimental setup

To test each of the path correction models, a validation experiment was performed. The experiment consisted of three trial walks in the actual downtown of Oulu while recording the GPS trail of each walk. Afterwards, each model was tested with the GPS trails produced in the experiment. The route of the first walk was 620 meters long, while the second and the third route were 560 meters in length. The second and the third route were essentially identical. A map illustrating the routes can be seen in Figure 18. Three metrics were used to evaluate the path correction models, one quantitative and two qualitative ones. The quantitative metric is the average error in meters: the distance by which the virtual character's path deviates from the actual path, averaged over eight locations within the trail. The first qualitative metric is the existence of sudden u-turns and illogical winding in the path. The second qualitative metric is the existence of critical errors in the path. Critical errors are properties of the path that cause the virtual character to walk towards walls or other areas that are considered unwalkable.

3.3.3 Results and validation

A figure depicting the average error metric for each model, as well as raw GPS data for each walk, can be seen in Figure 19. As can be seen, the data from walks two and three are especially inaccurate. The Path Generation model introduces the least average error, especially when used with original data that is greatly inaccurate. However, when the original data is fairly accurate, the Path Generation model can actually increase the average error slightly because the entire waypoint trail is abstracted to the navigation mesh.

Fig 18. Actual walking routes. Reprinted from Paper IV © 2013 IEEE.

Fig 19. Average error metrics. Reprinted from Paper IV © 2013 IEEE.

For the latter two error metrics, the original paths have to be inspected visually against their corrected counterparts. Figures 20, 21 and 22 display the results of walk three corrected with all three models. From Figure 20, it can be seen that the path averaging model moves the waypoints slightly closer to reality; however, the improvement is very marginal. The model does not introduce any sudden u-turns or unnatural winding to the path; however, for most of the time, the corrected path crosses through unwalkable locations. The map matching model generally improves the path, which mostly stays within walkable areas. However, at times, the map matching model moves the waypoints to opposite sides of obstacles, resulting in critical errors within the path. Also, as the waypoints go through the centers of the nodes of the navigation mesh, the resulting path winds unnaturally. The Path Generation model has no critical errors and, due to the funneling algorithm, the resulting path does not exhibit the unnatural winding of the map matching model. There are, however, some unrealistic u-turns and winding. As the A* algorithm finds a walkable route through the navigation mesh, the path never crosses through unwalkable locations. However, even though the Path Generation model has the least average error, the corrected path is occasionally still quite far from reality when the model is applied to Route 3.

Fig 20. Route 3 fixed with path averaging. Red is the uncorrected route, green is the corrected route. Areas inside triangles are permissible for walking. Reprinted from Paper IV © 2013 IEEE.

Fig 21. Route 3 fixed with map matching. Red is the uncorrected route, green is the corrected route. Areas inside triangles are permissible for walking. Reprinted from Paper IV © 2013 IEEE.

Fig 22. Route 3 fixed with path generation. Red is the uncorrected route, green is the corrected route. Areas inside triangles are permissible for walking. Reprinted from Paper IV © 2013 IEEE.

3.3.4 Discussion on the human action transformation

The transformation utilized in Construct 3 allows the location of an individual human to be projected into a VE. The geographical coordinates produced by the GPS device are transformed into the World Space of the VE in which an avatar navigates using the coordinates, visualizing the location of the original user. Three different models for transformation were evaluated according to three metrics: average error in meters, sudden u-turns/unrealistic winding and critical errors (navigating to non-navigable areas).

The best results were obtained by abstracting the World Space coordinates into a navigation mesh and then utilizing the A* and funneling algorithms to navigate the avatar. The VE is immersive in terms of display properties. The user's body movements do not match the avatar's; however, the location does. The VE does not contain autonomous dynamics or interaction capabilities. The abstraction of location and paths into the navigation mesh is done to increase presence through the activity transformation. Utilizing the abstractions, the avatar mimics the navigation of a real-world person without moving into places that would be unrealistic. The VE represents nine blocks of the actual city of Oulu. The model utilizes realistic graphics and scale to present the city, resulting in high RF. The navigation mesh provides the model with information on which areas are walkable and unwalkable, but other than that, the model provides no additional information concerning the city. In other words, the EWK concerns Where but not What. The construct utilizes the RealXtend [88] platform, which allows the VE to be presented on any major display device type, including CAVE systems [28] and the Oculus Rift HMD [30]. The applicability of any display device, combined with the possibility to freely control the viewport in real time, results in high EPM. The best-working activity recognition model utilizes direct measurement of GPS data combined with abstractions. Utilizing pre-known constraints in the form of the navigation mesh, an immersion-preserving abstraction is applied to the GPS signal. See Table 9 for a summary.

Table 9. Theoretical properties of Construct 3.

Property                                    Quality
Type of translation                         Location of an individual
Affected coordinate system                  World Space
Immersion                                   VE is highly immersive. GPS route correction is used to preserve presence.
Extent of World Knowledge                   User location, walkable and unwalkable areas (Where but not specifically What)
Reproduction Fidelity                       Everything is modeled utilizing modern 3D graphics
Extent of Presence Metaphor                 Can be used with any type of display device. 3D viewport control.
Activity recognition method                 GPS signal combined with map matching and pathfinding
Extent of action passing the translation    The location of an individual person is abstracted into navigation mesh scale

3.4 Construct 4: Pedestrian Simulation and Visualization

Paper VII describes a study concerning the simulation of meso-level pedestrian traffic in the city center of Oulu. Mr. Jorge Goncalves, Dr. Denzil Ferreira and Prof. Vassilis Kostakos generated the origin-destination data that serves as the core of the ground-truth model, and also participated in the model analysis and the paper writing process. The author of this thesis is responsible for the development of all models and the model analysis scripting.

3.4.1 System overview

This study relies on a rich and granular mobility dataset gathered from the City of Oulu municipal WiFi network. A preliminary analysis of this dataset resulted in an origin-destination (OD) matrix that yields the number of pedestrians moving between any two given locations on an hourly basis. In this study, the analysis is based on two one-month periods: May 2011 and May 2012. The mobility data for each month consists of approximately 40,000 unique devices, which represents roughly a quarter of the entire population moving within the limits of the WiFi network. The WiFi access points are not uniformly distributed.

The virtual city

In our simulation, we use a navigation mesh to represent the street network, which is a subset of the entire WiFi area. The simulation roughly covers the area of downtown Oulu. There are three main benefits to using a navigation mesh in this study. First, it is the de facto standard for pathfinding in games and virtual environments. Second, it made footprint analysis of pedestrian flows easy, because it was possible to quantify the number of pedestrians passing through each node. Third, it was found that the nodes of the navigation mesh can also be utilized for estimating pedestrian density. The virtual city is populated with approximately 200 virtual WiFi access points (from here on referred to as "hotspots") that spatially correspond to the ones found in the actual city. The simulated area is a subset of the area covered by the WiFi network, so some hotspots exist outside the simulated area. The hotspots are thus filtered as follows. Origin-destination pairs where both hotspots reside within the simulated area are treated as observations describing how pedestrians move in the simulated area. The origin-destination pairs where only one hotspot exists within the simulated area are

used to describe how pedestrians enter and exit the simulation. Given the geographic orientation of the simulated city, these hotspots are grouped into Northwest, Northeast, Southeast and Southwest groups for describing pedestrian flows to and from each direction.

Establishing exact walking speed

In this study, autonomous agents are used as virtual pedestrians. Each agent is assigned a natural walking speed between 1.25 m/s and 1.5 m/s [99] when created. The agent uses its natural walking speed throughout the simulation unless interfered with by other agents. When an agent is in proximity to other agents, its walking speed is adjusted according to the pedestrian density in its node. Each node maintains a pedestrian density D = N/A, where N is the number of agents within the node and A is the area of the node. The effect of density on walking speed is estimated using the density-velocity relationship described by [100, 101]. The density-velocity diagram, as seen in Figure 23, identifies four domains in which the change in velocity varies and qualitative changes in pedestrian flow can be observed. The domain edges are at D = 0.7, D = 2.3 and D = 4.7. The decline in walking speed in each domain is estimated as a line with a slope (equation 5):

m = (y2 − y1) / (x2 − x1).     (5)

The approximate walking speed for each agent in the node is found with the linear equation 6,

y = mx + b,     (6)

where x is the density within the node and m is the slope of the domain in which x resides. Although thousands of agents can be simulated in real time, some additional programming-language-independent optimizations were developed in order to perform the analysis for a month's worth of pedestrian traffic at a time. Unless a node contains more than one agent, the density calculation D = N/A is skipped. The node in which an agent resides is estimated by finding the closest node centerpoint. It was found that this estimation does not significantly reduce the accuracy of the density estimation as long as the nodes do not contain extremely small angles, i.e., as long as the nodes are closer to equilateral or right triangles than to oblique ones. The closest node center for each agent is found using a kd-tree search [102]. This increases the simulation performance significantly in comparison to performing point-in-polygon tests for each agent. Nodes with an area of less than 1 m² are ignored, so that agents cannot get caught in small oblique "invisible"

nodes that might occasionally result from the triangulation process. N is multiplied by four to increase the density-velocity effect on pedestrians, as the OD matrix yields observations from roughly one quarter of the total population.

Fig 23. Weidmann's density-velocity relationship [100, 101].
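A minimal sketch of how equations 5 and 6 can be applied per node is given below. The domain edge densities 0.7, 2.3 and 4.7 come from the text; the corresponding velocities must be read from Weidmann's curve (Figure 23) and are therefore left to the caller, so the breakpoint values in the docstring are placeholders rather than values taken from Paper VII.

```python
def walking_speed(density, natural_speed, breakpoints):
    """Piecewise-linear density-velocity relationship (equations 5 and 6).

    `breakpoints` is a list of (density, velocity) pairs at the domain edges,
    e.g. [(0.7, v1), (2.3, v2), (4.7, v3)] with v1..v3 read from Figure 23.
    Below the first edge the agent keeps its natural walking speed."""
    if density <= breakpoints[0][0]:
        return natural_speed                       # free-flow domain
    for (x1, y1), (x2, y2) in zip(breakpoints, breakpoints[1:]):
        if density <= x2:
            m = (y2 - y1) / (x2 - x1)              # equation 5
            return m * (density - x1) + y1         # equation 6 with b = y1 - m * x1
    return breakpoints[-1][1]                      # at or beyond the last domain edge
```

In the simulation itself, the density fed to such a function would be D = N/A for the agent's node, with N multiplied by four as described above.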

Ground truth model

The ground truth model is a statistical model that is based exactly on the hourly pedestrian mobility observations described by the OD matrix. The popularity of routes is transferred into the simulation by giving weights to origin-destination pairs according to the OD matrix observations. The agents move between the hotspots according to the weights. When an agent enters the simulation, its entry direction is first picked according to the entry-exit weights. After the general direction is determined, the exact entry point is picked at random among the exact entry-exit points modeled after the actual roads and sidewalks that lead to the simulated area. The agent then picks its first target among the weighted "origins" list that is formed according to the hotspots' overall popularity at the simulated time. After an agent reaches this initial origin, all subsequent destinations are picked according to the origin-destination weights, i.e. the last visited hotspot determines the probability of the next visited hotspot. This causes the agents to form their routes following the characteristics of the real pedestrian traffic of the simulated date and time. However, the agents move as individuals and do not simulate group behavior. The agents do not try to reach the exact geographical position of the hotspot, but pick their target from the ten closest nodes. The agent can also pick an

entry-exit direction as its next target. In this case, the agent picks the closest entry-exit point to its current location and, upon reaching this point, is removed from the simulation. The simulation maintains the number of pedestrians to match the observations of the OD matrix for the simulated time. If the number of pedestrians drops below this level, new pedestrians are added to the simulation through the entry-exit directions. The agents navigate using the A* algorithm [103]. The algorithm first calculates the shortest path to the target node through node centerpoints. The resulting path is further smoothed using a funnel algorithm [17]. This reduces path waypoints by modifying the route to follow the shortest path through adjacent polygons. As a result of funneling, the path aesthetically resembles natural pedestrian paths at the micro level.
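The destination selection of the ground truth model amounts to a weighted random choice. The sketch below assumes the OD matrix is available as nested dictionaries of observation counts; it is illustrative rather than the actual Paper VII implementation.

```python
import random

def next_destination(last_hotspot, od_matrix):
    """Pick the next hotspot for an agent: the OD matrix row of the last
    visited hotspot gives the weights of the candidate next hotspots."""
    row = od_matrix[last_hotspot]          # dict: hotspot id -> observed trip count
    hotspots = list(row)
    weights = [row[h] for h in hotspots]
    return random.choices(hotspots, weights=weights, k=1)[0]
```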

Comparing the ground truth model against a random model

The characteristics of the pedestrian flows were analyzed by comparing the ground truth model against a random model. In the random model, the agents do not use hotspots as targets but simply pick a random node from the navigation mesh each time they reach their current target. The models were compared by simulating a month's (May 2012) worth of pedestrian traffic while performing a footprint analysis on each model. In the footprint analysis, each node of the navigation mesh quantifies the amount of pedestrian traffic it receives by counting each route passing through it. We were specifically interested in variations between specific time periods and split the month-long period as follows. Weeks were split into three periods: week (Monday to Thursday), end of the week (Friday and Saturday) and Sunday. Each day was split into four periods: morning (8-10), working hours (11-17), evening (18-22) and night. This segmentation is culturally derived and might differ from studies performed elsewhere. The footprints were averaged within each time period, resulting in twelve distinct time segments. The random model results were then subtracted from the ground truth results so that the overestimations and underestimations of the random model could be easily identified. The deviations were not constant but fluctuated across time segments and locations.

Generalized model

Utilizing the results of the ground truth - random model comparison analysis as well as crowdsourcing, a generalized model was built to simulate realistic pedestrian traffic

without directly leveraging the observations derived from the WiFi mobility dataset. This model is from now on referred to as the "weighted POI model". The model uses crowdsourced Points of Interest (POIs), mainly from the OpenStreetMap database, to act as attractors instead of the virtual WiFi hotspots. All POIs use a 75 meter radius. The results of the ground truth vs. random model comparison analysis were used to assign weights to each POI category, as seen in Table 10. The weights attempt to counter the over- and underestimations of the random model and were derived as follows. The footprints of the comparison analysis are averaged within each POI radius. The additive inverse of the results is taken, so that the weights increase traffic at underestimated locations and vice versa. The weights are scaled between one and ten and finally averaged across each POI category. The weights are not updated hourly, as in the ground truth model, but change according to the twelve time segments. The agents pick their target destinations according to the POI category weights. Once the weighted choice of a POI category is made, the agent picks a random POI belonging to the category and proceeds to a node within its radius. Upon reaching the destination, the next choice is made in a similar way; the currently visited POI does not affect the next choice, unlike in the ground truth model.
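The weight derivation can be sketched as follows, assuming the per-POI footprint differences are signed so that negative values indicate underestimation by the random model (as in Table 12); the data structures and names are illustrative, not those of Paper VII.

```python
def poi_category_weights(poi_footprint_diff, poi_category):
    """Derive POI category weights: take the additive inverse of the per-POI
    averaged footprint differences, scale to 1..10 and average per category."""
    inverse = {poi: -diff for poi, diff in poi_footprint_diff.items()}
    lo, hi = min(inverse.values()), max(inverse.values())
    span = (hi - lo) or 1.0                     # guard against a degenerate range
    scaled = {poi: 1.0 + 9.0 * (v - lo) / span for poi, v in inverse.items()}
    per_category = {}
    for poi, weight in scaled.items():
        per_category.setdefault(poi_category[poi], []).append(weight)
    return {cat: sum(ws) / len(ws) for cat, ws in per_category.items()}
```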

3.4.2 Results and validation

The weighted POI model was evaluated by estimating how closely it resembles the pedestrian traffic generated by the ground truth model. Using weights calculated from the May 2012 mobility dataset, the weighted POI model's footprints were compared against footprints generated by the ground truth model using the May 2011 mobility dataset. To establish a baseline for comparison, we also compared the random model's footprints against the May 2011 ground truth model dataset. To estimate the spatio-temporal fluctuations only, all models were allowed to use the number of pedestrians derived from the May 2011 mobility dataset. The performance of each model was assessed by calculating the percentage difference from the ground truth model footprints. The traffic generated by the weighted POI model is closer to ground truth, especially inside the radiuses of the individual POIs. Inside POIs, the average deviation of the weighted POI model is 34%, whereas the random model deviates by 51%. In areas outside the POIs, the difference in the deviations is smaller. The weighted POI model's deviation is 47%, whereas the random model

deviates from ground truth by 52%. In total, the POI model deviates by 41% and the random model by 50%.
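As an illustration only, the per-node deviation could be computed along the following lines; the exact definition used in Paper VII may differ, so the sketch simply assumes the deviation of a node is |model − ground truth| relative to the ground truth footprint.

```python
def average_deviation(model_fp, ground_truth_fp):
    """Average relative deviation of a model's per-node footprints from the
    ground truth footprints (nodes without ground truth traffic are skipped)."""
    nodes = [n for n, gt in ground_truth_fp.items() if gt > 0]
    return sum(abs(model_fp[n] - ground_truth_fp[n]) / ground_truth_fp[n]
               for n in nodes) / len(nodes)
```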

Table 10. POI category weights for each category in each of the 12 time segments. Modified from Paper VII - Colors from www.ColorBrewer2.org.

                      Week          End of the week   Sunday
Category              M  W  E  N    M  W  E  N        M  W  E  N
Fast food             4  5  6  6    5  5  6  5        4  5  5  5
Place of worship      5  6  5  0    6  6  6  0        5  6  6  1
Cafe                  3  4  4  4    4  4  4  3        3  4  3  3
ATM                   5  6  5  1    6  6  6  0        5  5  6  2
Hospice               5  6  5  0    6  6  6  0        5  6  6  2
Theatre               6  7  6  2    6  7  6  2        6  6  6  3
Restaurant            5  6  6  5    6  6  6  4        5  6  6  4
Gym                   7  7  6  1    7  7  7  1        7  7  6  2
Cinema                4  5  6  5    5  6  6  3        4  5  5  4
Car rental            5  6  5  1    5  6  6  0        5  5  6  2
Town hall             5  6  6  1    6  6  6  1        5  6  5  2
Pub                   4  5  5  4    5  5  5  4        4  5  5  4
Nightclub             5  6  6  7    5  6  6  5        5  6  6  5
Library               5  6  6  1    6  6  7  1        5  7  6  2
Parking               5  5  5  4    5  6  6  3        4  5  5  4
Bar                   4  5  5  7    5  5  6  8        3  5  4  7
Casino                0  1  1  6    1  1  1  2        0  1  1  0
Pharmacy              4  4  4  4    4  5  4  2        4  4  4  2
Hospital              5  6  6  4    6  6  6  2        6  6  6  3
Bank                  3  4  4  4    4  4  4  3        3  4  4  3
Bureau de change      0  0  0  5    0  0  0  3        0  0  0  2

The overall difference between the models was also estimated as follows. The individual footprints per node were averaged through all time segments. Regression analysis was then performed on the resulting footprints. Scatterplots with their corresponding trends and coefficients of determination (R² values) can be seen in Figures 24 a, b and c. More specifically, A: ground truth model vs. random model, R² = 0.7228; B: ground truth model vs. POI model, R² = 0.8267; C: POI model vs. random model, R² = 0.5391. The highest R² value, between the ground truth model and the POI model, indicates that the footprints generated by the POI model match the footprints generated by the ground truth model most closely. When inspecting the deviations at individual time segments, it can be seen that the POI model outperforms the random model in all time segments except the end-of-the-week and Sunday mornings inside POIs, and the week nights outside POIs.

The comparison of the models can be seen in Table 11.

Fig 24. Scatterplots visualizing the comparison of each model's average footprints. Reprinted from Paper VII.

Table 11. Average deviation from the ground truth model. The results of the model closer to ground truth are highlighted in bold. Modified from Paper VII.

                      Week                      End of the week           Sunday
Loc.    Model         M     W     E     N       M     W     E     N       M     W     E     N       Tot.
In      Rand.         0.41  0.43  0.52  0.75    0.36  0.43  0.48  0.69    0.35  0.38  0.44  0.65    0.51
        POI           0.39  0.37  0.28  0.52    0.41  0.33  0.30  0.39    0.38  0.36  0.36  0.30    0.34
Out     Rand.         0.48  0.51  0.52  0.58    0.44  0.49  0.55  0.61    0.46  0.45  0.53  0.64    0.52
        POI           0.42  0.45  0.50  0.63    0.40  0.42  0.46  0.56    0.44  0.45  0.47  0.49    0.47
Tot.    Rand.         0.45  0.47  0.51  0.66    0.39  0.45  0.50  0.64    0.42  0.41  0.48  0.62    0.50
        POI           0.40  0.40  0.40  0.58    0.38  0.36  0.36  0.48    0.41  0.40  0.41  0.39    0.41

The over- and underestimations of different models were also inspected in each POI category. It can be seen that the POI model is usually more prone to overestimations and the random model to underestimations. The Casino and the Bureau de change categories are especially overestimated in each time segment by the weighted POI model. The

random model similarly underestimates these categories. All three night time segments are heavily underestimated in both models. In the POI model, these time segments are outliers; the POI model contains underestimations exceeding 100 negative footprints only in these time segments, whereas such underestimations are commonplace in the random model. An overview of the over- and underestimations of each category can be seen in Table 12. For a detailed inspection of the over- and underestimations of all categories and all time segments, see supplementary Tables 1 and 2 of publication VII.

Table 12. Over- and underestimations of the POI model and the Random model averaged through all time segments. The results of the model closer to ground truth are highlighted in bold. Modified from Paper VII.

Category              Random    POI
Fast food             -132      -10
Place of worship        45       11
Cafe                  -100       37
ATM                      7        9
Hospice                 17        3
Theatre                -38      -22
Restaurant             -90      -14
Gym                     -2      -31
Cinema                 -79        1
Car rental               2       12
Town hall              -24       -2
Pub                    -79      -12
Nightclub             -108      -45
Library                -19      -31
Parking                -57       -8
Bar                   -186       -8
Casino                -202      139
Pharmacy              -117       40
Hospital               -26       -9
Bank                  -128       48
Bureau de change      -242      153

3.4.3 Discussion on the human action transformation

The transformations utilized in Construct 4 produce crowd-level human activity. The location and navigation of a massive number of individuals is abstracted into crowd-level spatio-temporal fluctuations. The transformations occur in the World Space. The visualization in Construct 4 is a map visualization without immersive display properties. The ground truth model, combined with simple local avoidance, was initially utilized with a small 3D VE version of the city of Oulu consisting of nine blocks and a subset of virtual hotspots; however, as nine blocks were insufficient for proper analysis, a 2D version was used, utilizing only the navigation mesh without the 3D VE. Body

movements or individual locations are not matched. The VE has self-contained dynamics in the form of spatio-temporal crowd fluctuations and changing location attractiveness. Due to the non-immersive nature of the VE, it is difficult to estimate the transformation's implications for presence. However, if the transformation were to be applied to a 3D VE with pedestrians visualized as avatars, at least some kind of local avoidance method should be utilized to prevent the avatars from walking through each other. The EWK of Construct 4 is the highest of the constructs presented. The EWK consists of the geographical location of walkable and unwalkable areas, spatio-temporal crowd fluctuations, as well as a large number of POIs that describe different location types in the city. In the final construct, the reproduction fidelity is very low, as it only covers 2D map visualizations of pedestrian movements. The EPM is low as well: the 2D map visualization extends to monoscopic displays of any size, and a CAVE display can be used in addition to a regular display. The viewpoint control amounts to orthographic panning. The activity recognition leveraged the mobility dataset acquired from WiFi connectivity and handover data. The transformation is able to roughly depict the navigation of massive numbers of humans without the possibility of tracking individuals. A summary of the theoretical properties can be seen in Table 13.

Table 13. Theoretical properties of Construct 4.

Property                                    Quality
Type of translation                         Crowd level location
Affected coordinate system                  World Space
Immersion                                   The VE has some self-contained dynamics but is otherwise non-immersive.
Extent of World Knowledge                   Walkable areas, meso-level spatio-temporal pedestrian flows, POIs
Reproduction Fidelity                       2D map visualization (can be extended to 3D VEs)
Extent of Presence Metaphor                 Regular monoscopic display. Orthographic viewport panning.
Activity recognition method                 Statistical model acquired from mobility traces
Extent of action passing the translation    Massive level pedestrian flow properties preserved at a statistically significant level

4 Discussion

This chapter summarizes the findings from the studies presented in the previous chapter. Theoretical as well as practical implications of the results are discussed, including the reliability and validity of the results. The results present multiple opportunities for future studies. These opportunities are discussed in the final section of the dissertation.

4.1 Theoretical implications

According to the findings of this work, human activity can be transformed into VEs according to actions and location. Furthermore, the results imply that there are at least four ways to abstract human activity to virtual environments. In this dissertation, these levels of abstraction are described qualitatively as Micro level actions, Coarse level actions, Individual level location and Crowd level location. The differences between these levels of abstraction can be discussed from multiple theoretical viewpoints. Each abstraction level is useful in specific use cases. The most significant difference between the abstraction levels is the way the abstractions generate observable phenomena in the VE. Micro level activity refers to directly tracking the motion of limbs, eyes, fingers and such. In Construct 1, micro-level activity was tracked to be used as control inputs for a multimodal user interface. Micro-level activity can also be transformed into corresponding avatar movements in a VE. If utilized for avatar control, micro level activity data transforms into the Object Space of the avatar, directly affecting the vertices and thus making it suitable for presenting activity in a detail-preserving manner. Coarse level activity transforms into meaningful actions. These actions can be associated with a meaning such as "walking", "sitting", "cleaning" or "talking". When transformed into a VE, the actions affect an avatar's Object Space; however, the sensed data does not directly control the vertices of the avatar. Coarse level activity data provides a known action that can be displayed, for example, by triggering a pre-captured animation of the avatar performing a general instantiation of the action. Individual level location provides the geographical location of an individual. The geographical location data of an individual person can be transformed into the World Space of a VE that is modeled after a real location. The transformation is straightforward

if the distances between buildings and other objects in the VE scale to their real-world counterparts. The individual level location can be used to directly control the navigation of individual avatars. Crowd level location does not provide the location of individual human beings, but rather creates a statistical abstraction of general spatio-temporal properties of crowd movements, realized as individual avatars. The spatio-temporal mobility data is used to generate a number of agents that navigate in the World Space following rules derived from the mobility data. This type of transformation can be used when there is a need for massive-scale human location information, such as pedestrian density, but individual location tracking is not suitable.

4.1.1 Immersion

In Construct 1, the VE does not attempt to present a realistic location. The only immersive aspect is the VE's reaction to the user's hand and eye movements. Because Construct 1 did not utilize activity transformations for avatar control, the effect of micro-level transformations on presence through visually unrealistic behavior could not be evaluated. In Construct 2, immersion was kept low by utilizing a simplified graphical representation. However, some useful implications for presence could be found. In the VE visualization of the coarse level action data, avatar action animations are displayed as soon as the pattern recognition system classifies the actions. Also, the transition between locations is visualized by immediately changing the avatar's location in the VE. There are no transitions between action animations; as soon as the pattern recognition system classifies an action, the avatar animation is played from the beginning. This causes few problems for presence when the pattern recognition system works perfectly. However, because of the instantaneous playback of action animations, erroneous classifications produce rapid, sudden changes of animations. The lack of animation transitions and the sudden changes of avatar location are visually unrealistic, thus producing obvious deterioration of presence. The FG and survey participants that evaluated Construct 2 also gave immersion-related feedback. While low vividness was preferred, the participants stated that the avatar action animations should match the environment better, which would increase the clarity as well as the immersiveness of the VE visualization. In many senses, Construct 3 is the most immersive one. The display properties are immersive. There is some extent of matching due to location tracking. However, there

are no self-contained dynamics or interaction with the environment. Construct 3 is also the only construct that makes a specific effort to maintain presence. Transforming raw GPS data to the World Space of the virtual city model results in unrealistic avatar navigation that reduces presence. Utilizing a navigation mesh, unwalkable areas such as buildings and water could be identified. The identification of unwalkable areas allows the utilization of a pathfinding algorithm that prevents presence-deteriorating navigation of the avatar. Construct 4 utilizes a navigation mesh for agent navigation similarly to Construct 3. The VE is, however, a non-immersive map visualization. Self-contained dynamics exist in the form of crowd movements according to the simulated time. The agents are only aware of the environment, but not of each other. This results in the agents walking through other agents, which would be clearly unrealistic and could thus reduce presence if this transformation method were used in an immersive VE.

4.1.2 Mixed Reality taxonomy

Milgram's MR taxonomy was used to evaluate each construct. Evaluating the results, it can be seen that EWK has the most significant effect on the transformations at different abstraction levels. While RF and EPM can be used to describe other properties of the constructs, they have no relationship to the activity transformations. EWK, on the other hand, has to increase when the abstraction level of the activity transformation increases. In micro-level transformations, no world knowledge is required; sensor data that captures human motions is directly transformed into vertex translations within the target model. In coarse-level actions, the minimum requirement is pre-knowledge of the set of actions that are transformed. Location information is not required, but it can be utilized to increase classifier performance. In individual location transformations, the minimum requirement is that the VE itself somehow represents the real-world location in which the transformation is carried out. If the relative building and object distances match the relative distances of their real-world counterparts, the geographical location of the VE World Space origin can be used as a reference point for individual location transformations. While the graphical modeling of the VE provides the visual frames for the location transformation, the EWK can be increased utilizing a navigation mesh. A navigation mesh provides the walkable and unwalkable locations of the VE. This knowledge can be utilized to increase the realism and immersiveness of the location transformation when the source (GPS) signal

does not provide enough accuracy. The crowd level transformations require the largest EWK. Because no continuous localization is utilized, the navigation mesh is a compulsory requirement for agent navigation. In addition, the transformation depends on spatio-temporal mobility data that determines the navigation of the agents representing the transformed human crowd. The mobility data can originate from sensor observations or from a set of rules. In Construct 4, the mobility data is provided either by the OD matrix harvested from the WiFi mobility dataset or by the POI model.

4.1.3 Activity recognition

Micro level transformations identify motion and translate it into a VE by moving the vertices of the target model. This way, the target model (such as a human avatar) mimics the identified motions exactly. The identification is usually done with motion capture systems, such as those described in Chapter 2. Other options are game controllers such as the Nintendo Wii [31], PlayStation Move [32] or Microsoft Kinect [33]. In Construct 1, the motion identification was done utilizing a wearable sensor for hand movements and PCR for eye tracking. At the coarse level, the transformations rely on classification, which is a traditional pattern recognition method [56]. Each input data instance is matched to a class from a limited, predetermined set. Therefore, unknown human activity can be classified to be, for instance, walking, dishwashing or sweeping dust from a table. The set of possible (or relevant) action classes has to be known beforehand. This makes classification a very context-dependent way of recognizing human activity. However, this way, the actions can always be associated with a meaning. In Construct 2, ordinary classifiers such as K Nearest Neighbors [91] and Neural Networks [93] were used. Classifiers were also combined with location constraints and feature selection [104]. In individual level location transformations, continuous localization was utilized to track human position; there was little pattern recognition involved. The GPS signal was first directly translated into the World Space coordinate system of the VE. After this, the signal was abstracted to a navigation mesh. The navigation of the avatar was performed utilizing a pathfinding algorithm operating on the navigation mesh. Crowd level transformation does not utilize continuous localization, but large-scale human mobility tracking instead. In Construct 4, the backbone of the simulation models is the OD matrix that was formed by mining cellphone WiFi connectivity data. WiFi

network connections and horizontal handover data provided the spatio-temporal data that made it possible to develop statistical models that replicate actual pedestrian movement.

4.1.4 Extent of activity passing the transformation

Each abstraction level focuses on different aspects of human action. The abstractions pass through a subset of human activity depending on their area of focus. Micro-level human activity provides the details of motions but no context. Micro-level activity should be used when detailed visual information or the physical trajectory of human motions is needed. Coarse-level activity loses the exact details of motions but provides their purpose. For example, it can tell whether the target of observation is walking, but not how the observed person performs the actual motions that constitute walking. Coarse level activity transformation can be done if the meaning of the action is more important than the visual accuracy or other fine details of the action. Individual level location transformation essentially means the continuous localization of an individual and the visualization of the localization through an avatar. In VEs, it can be useful to abstract the geographical location into the navigation mesh of the VE. If the original localization data is perfectly accurate, the abstraction causes some loss of accuracy. With noisy source data, however, the abstraction guarantees that inaccurate source data does not cause unimmersive behavior of the avatar. Crowd level location transformation means losing the identities and exact locations of individual human beings. Large numbers of individuals are treated as crowds, and the agents displayed by the visualization do not correspond to actual individual humans but rather to their statistical representations. This kind of transformation can be used when information on large-scale human mobility is required.

4.1.5 Theoretical implications summarized

Examining the summary of the transformations in Table 14, some general conclusions can be made. The nested coordinate spaces of VEs provide their own abstraction layers for activity transformations: motions and actions take place in the Object Space, whereas locomotion takes place in the World Space. Even when the transformation provides realistic activity information as such, the transformation is not necessarily visually realistic; all transformations cause deterioration of presence unless immersion

is specifically considered in the transformation. There is a relationship between the abstraction level of the transformation, the specificity of the sensors involved and the EWK required for the transformation. Detailed activity transformations focusing on micro-actions require the most specialized sensing process. However, the process can remain oblivious to the context in which the activity takes place. On the contrary, city simulations, which transform crowd-level activity, can rely on generic equipment, such as smartphones, or even function without any sensor data by utilizing a set of rules. However, the more abstract the type of human activity transformed, the more significant the role played by knowledge of the surrounding environment.

Table 14. Significant theoretical properties summarized.

Abstraction level             Micro level action          Coarse level action        Individual level location      Crowd level location
Affected coordinate system    Object Space                Object Space               World Space                    World Space
Immersion                     Low                         Low                        High                           Minimal
EWK                           How                         What                       Where                          Where and What
Activity recognition          Motion tracking,            Classification,            Continuous localization        WiFi mobility traces,
                              eye tracking                location constraints       (GPS)                          agent simulation
Extent of activity passing    Subset of hand movements,   Set of activities,         Individual location            Crowd movements
the translation               eye movement                rough location             abstracted to navigation mesh  statistically abstracted

4.2 Practical implications

While the constructs presented in the results suggest that different abstraction levels have different uses overall, the specific use-cases of the constructs suggest that the abstraction levels might require some intermixing in actual use. While Construct 1 utilized hand and gaze tracking for viewport and object manipulation, classification, which was categorized as coarse-level activity, was required for some of the object manipulation actions. While EWK is generally not needed for coarse-level activity transformations, in the user evaluation of Construct 2, EWK was considered highly important. It was also found that

the specifics of certain actions, such as walking, were considered important. For these kinds of specifics, micro-level transformations are needed. In Constructs 3 and 4, the location transformations do not take into account how the individuals navigate between locations. Classification is one option for distinguishing between various modes of locomotion such as walking, running or bicycling.

4.2.1 Utilization potential

The results obtained from Construct 4 show how realistic information on pedestrian traffic flows can be acquired through a simulation. It is easy to think of practical implications for this kind of knowledge: municipal traffic designers, the police as well as public transportation companies can benefit greatly from knowing the temporal characteristics of pedestrian movement. The information could even be used for the estimation of business property values. The pedestrian flow of the simulation can also be artificially changed to simulate exceptional situations, such as disasters. In Papers V and VI, through the original evaluation of Construct 2, healthcare professionals were asked to evaluate the practical implications of visualizing the coarse-level activity of an elderly patient. The visualizations were evaluated together with home care providers to examine their usefulness in a practical setting. A considerable amount of novel information was found regarding the use of visualization systems in elderly home care. While the tool evaluated is still far from the level where it can be commercially utilized, the findings of the studies can be harnessed for future applications. Constructs 1 and 3 present insights into using novel control methods for VEs. The user evaluation of Construct 1 provides knowledge for the future development of gaze and gesture interfaces for mobile 3D user interfaces. Construct 3 provides instructions on how to use a GPS signal to control an avatar in a 3D VE.

4.3 Reliability and validity

The main drawback of the studies presented in this dissertation is that their original approach does not attempt to cover the theoretical viewpoints that are the main focus of this research. While each construct presents a case of human activity transformation, not all theoretical viewpoints can be considered. Construct 1 is the weakest study in this sense; it does not attempt to replicate the human activity it detects through any kind of representation of virtual human activity.

The user tests made to validate Construct 3 were somewhat limited. A more thorough examination with multiple routes could have been performed. In the massive level location visualization construct, the most obvious drawback is the difficulty of validating the simulation results in other cities. This is because municipal WiFi networks are currently rare, and the only way of performing such validation would be to perform manual pedestrian counts.

4.3.1 Technical limitations

As the constructs were not originally developed to consider the theoretical viewpoints presented in this dissertation, their technical capabilities are not optimal for that purpose. The prototype device of Construct 1 does not actually capture the real, everyday human activity that is considered to be the main interest of this work; while it does study the mobile use of gaze and gesture tracking, the technical capabilities of the device render it unusable in anything other than laboratory conditions. In the coarse level activity transformations presented with Construct 2, the main drawback is the issue of generalizability. Classifiers generally work best within a limited setting (and thus with a limited number of classes). Using similar methods in larger contexts remains a question for future research. The methods described are suitable only in situations where the visualization is applied to a limited, specific domain. Another drawback of Construct 2 is that the activity recognition approaches presented are very basic and possibly not optimal for this kind of human activity transformation. There was no comparison between different activity datasets. Construct 3 lacks performance when the GPS data is severely inaccurate. Even though the A* navigation used with the navigation mesh can guarantee that the visualization looks realistic enough, the simulated avatar can, in fact, be far off from its real-world counterpart. In the light of the studies presented in this dissertation, the recognition and transformation of actions is a more complex issue than the recognition and transformation of location. All the methods for activity transformation presented in this work utilize specialized technology to some extent. The location transformations, however, require only the utilization of contemporary smartphones.

4.4 Recommendations for future research

The studies present multiple opportunities for future research. The use cases presented in this work are in no way an exhaustive representation. Additional use-cases might bring forth types of transformations that were not provided by the use-cases presented in this work. Each construct implements its own solution for visualizing the sensed actions in a virtual environment. However, the study of virtual humans specifically focuses on how to present complex, realistic human behavior with virtual avatars [36–38]. The integration of the human activity transformations presented in this work with state-of-the-art virtual human technology provides a very interesting future research topic. The coarse activity transformation method presented in this work suffers from a scalability issue; a limitless number of actions cannot be classified (or visualized) with a limited number of classes. A hierarchical and context-dependent approach to activity recognition, combined with current virtual human activity topologies, could greatly improve the coarse-level activity transformations. Another potential topic for future research is to study real-time human activity transformations. Because the current constructs rely only on local datasets, it can only be speculated how the transformations would work in real time. In this category, the effect of transformation types on state melding in VEs is a potential research opportunity. The results of this work imply that immersion, and specifically presence, should be considered in all transformations. Visually realistic transformation of human activity is a relevant topic for future research. The crowd level transformation presented by Construct 4 is currently tested only in a single city. Validation of the POI model's performance in other cities would bring new insights. Improving the technical properties of the micro-level action, coarse level action and individual level location transformations provides further viewpoints for future research. The wearable sensors in coarse activity recognition might be replaced, for example, by modern smartphones [85]. The problems of unreliable gaze tracking could be overcome using a more sophisticated machine vision solution that estimates the user's head orientation in addition to pupil movement, overcoming calibration problems due to head movement [105]. Machine vision in general could be a more viable option for detecting micro-level movements than wearable sensors. In the slightly more distant future, opportunistic activity recognition algorithms could provide even more

interesting applications [69, 70]. If activity taking place throughout the world can be detected algorithmically, that activity can also be visualized in virtual worlds utilizing transformation methods such as the ones described in this dissertation. Theoretically, the future might bring us a virtual mirror world replicating the one we live in. However, this would also bring forth the need for research on the ethics of human activity transformations.

References

1. Mennecke BE, Konsynski BR, Townsend AM, Bray DA, Lester J, Roche E & Rowe M (2007) Second life and other virtual worlds: A roadmap for research. In: ICIS, p. 4. Association for Information Systems. URI: http://dblp.uni- trier.de/db/conf/icis/icis2007.html#MenneckeKTBLRR07. 2. Wasko M, Teigland R, Leidner D & Jarvenpaa S (2011) Stepping into the in- ternet: New ventures in virtual worlds. MIS Q. 35(3): 645–652. URI: http://dl.acm.org/citation.cfm?id=2208923.2208931. 3. Singhal S & Zyda M (1999) Networked Virtual Environments: Design and Implementation. ACM Press/Addison-Wesley Publishing Co., New York, NY, USA. 4. Bell M (2008) Toward a definition of "virtual worlds". Journal For Virtual Worlds Research 1(1). URI: https://journals.tdl.org/jvwr/index.php/jvwr/article/view/283. 5. Schroeder R (2008) Defining virtual worlds and virtual environments. Journal For Virtual Worlds Research 1(1). URI: https://journals.tdl.org/jvwr/index.php/jvwr/article/view/294. 6. Gibson W (1984) Neuromancer. New York: Ace Books. 7. Strate L (1999) The varieties of cyberspace: Problems in definition and de- limitation. Western Journal of Communication 63(3): 382–412. URI: http://www.tandfonline.com/doi/abs/10.1080/10570319909374648. 8. Liu H, Bowman M & Chang F (2012) Survey of state melding in virtual worlds. ACM Comput. Surv. 44(4): 21:1–21:25. URI: http://doi.acm.org/10.1145/2333112.2333116. 9. Hevner AR, March ST, Park J & Ram S (2004) Design science in information systems research. MIS Q. 28(1): 75–105. URI: http://dl.acm.org/citation.cfm?id=2017212.2017217. 10. Hevner AR (2007) A three cycle view of design science research. Scandinavian journal of information systems 19(2): 4. 11. Reinecke K & Bernstein A (2013) Knowing what a user likes: A design science ap- proach to interfaces that automatically adapt to culture. MIS Q. 37(2): 427–454. URI: http://dl.acm.org/citation.cfm?id=2535658.2535665. 12. Bemelmans J, Voordijk H & Vos B (2013) Designing a tool for an effective assessment of purchasing maturity in construction. Benchmarking: An International Journal 20(3): 342–361. 13. Peffers K, Tuunanen T, Rothenberger M & Chatterjee S (2007) A design science research methodology for information systems research. Journal of Management Information Systems 24(3): 45–77. URI: http://dx.doi.org/10.2753/MIS0742-1222240302. 14. Picozzi M, Verdezoto N, Pouke M, Vatjus-Anttila J & Quigley AJ (2013) Traffic visual- ization - applying information visualization techniques to enhance traffic planning. In: GRAPP/IVAPP’13, pp. 554–557. 15. Dunn F & Parberry I (2002) 3D Math Primer for Graphics and Game Development. Wordware game math library. Wordware Pub. URI: http://books.google.fi/books?id=f6RXshMOsqgC. 16. Hart P, Nilsson N & Raphael B (1968) A formal basis for the heuristic determination of minimum cost paths. Systems Science and Cybernetics, IEEE Transactions on 4(2): 100–107. 17. Demyen D & Buro M (2006) Efficient triangulation-based pathfinding. In: Proceedings of

18. Guy SJ, Chhugani J, Kim C, Satish N, Lin M, Manocha D & Dubey P (2009) Clearpath: Highly parallel collision avoidance for multi-agent simulation. In: Proceedings of the 2009 ACM SIGGRAPH/Eurographics Symposium on Computer Animation, SCA '09, pp. 177–187. ACM, New York, NY, USA. URI: http://doi.acm.org/10.1145/1599470.1599494.
19. Lemercier S, Jelic A, Kulpa R, Hua J, Fehrenbach J, Degond P, Appert-Rolland C, Donikian S & Pettré J (2012) Realistic following behaviors for crowd simulation. Computer Graphics Forum 31(2pt2): 489–498. URI: http://doi.wiley.com/10.1111/j.1467-8659.2012.03028.x.
20. da Silveira LG & Musse SR (2006) Real-time generation of populated virtual cities. In: Proceedings of the ACM Symposium on Virtual Reality Software and Technology, VRST '06, p. 155. URI: http://portal.acm.org/citation.cfm?doid=1180495.1180527.
21. Tecchia F & Chrysanthou Y (2000) Real-time rendering of densely populated urban environments. In: Rendering Techniques. URI: http://www.cs.ucy.ac.cy/~yiorgos/publications/crowd_egrw00.pdf.
22. Moeslund TB, Hilton A & Krüger V (2006) A survey of advances in vision-based human motion capture and analysis. Comput. Vis. Image Underst. 104(2): 90–126. URI: http://dx.doi.org/10.1016/j.cviu.2006.08.002.
23. Chittaro L, Buttussi F & Nadalutti D (2006) Mage-anim: A system for visual modeling of embodied agent animations and their replay on mobile devices. In: Proceedings of the Working Conference on Advanced Visual Interfaces, AVI '06, pp. 344–351. ACM, New York, NY, USA. URI: http://doi.acm.org/10.1145/1133265.1133337.
24. Buttussi F, Chittaro L & Nadalutti D (2006) H-animator: A visual tool for modeling, reuse and sharing of X3D humanoid animations. In: Proceedings of the Eleventh International Conference on 3D Web Technology, Web3D '06, pp. 109–117. ACM, New York, NY, USA. URI: http://doi.acm.org/10.1145/1122591.1122606.
25. Kenwright B (2012) Generating responsive life-like biped characters. In: Proceedings of the Third Workshop on Procedural Content Generation in Games, PCG'12, pp. 1:1–1:8. ACM, New York, NY, USA. URI: http://doi.acm.org/10.1145/2538528.2538529.
26. Nvidia 3D Vision. http://www.nvidia.com/object/3d-vision-main.html. Accessed: 2014-10-29.
27. Nintendo 3DS. http://www.nintendo.com/3ds. Accessed: 2014-10-29.
28. Cruz-Neira C, Sandin DJ, DeFanti TA, Kenyon RV & Hart JC (1992) The CAVE: Audio visual experience automatic virtual environment. Commun. ACM 35(6): 64–72. URI: http://doi.acm.org/10.1145/129888.129892.
29. Cruz-Neira C, Sandin DJ & DeFanti TA (1993) Surround-screen projection-based virtual reality: The design and implementation of the CAVE. In: Proceedings of the 20th Annual Conference on Computer Graphics and Interactive Techniques, SIGGRAPH '93, pp. 135–142. ACM, New York, NY, USA. URI: http://doi.acm.org/10.1145/166117.166134.
30. Oculus Rift. http://www.oculus.com. Accessed: 2014-10-29.
31. Nintendo Wii. http://wii.com. Accessed: 2014-10-29.
32. PlayStation Move. http://us.playstation.com/ps3/accessories/playstation-move-motion-controller-ps3.html. Accessed: 2014-10-29.
33. Microsoft Kinect. http://www.xbox.com/en-US/xbox-360/accessories/kinect. Accessed: 2014-10-29.
34. Bowman DA, Kruijff E, LaViola JJ & Poupyrev I (2004) 3D User Interfaces: Theory and Practice. Addison Wesley Longman Publishing Co., Inc., Redwood City, CA, USA.

35. Pakanen M, Arhippainen L & Hickey S (2013) Designing for 3D user experience in tablet context - design and early phase user evaluation of four 3D GUIs. International Journal On Advances in Intelligent Systems 6(3 and 4): 266–278.
36. Badler N (1997) Real-time virtual humans. In: Computer Graphics and Applications, 1997. Proceedings., The Fifth Pacific Conference on, pp. 4–13.
37. Badler NI, Palmer MS & Bindiganavale R (1999) Animation control for real-time virtual humans. Commun. ACM 42(8): 64–73. URI: http://doi.acm.org/10.1145/310930.310975.
38. Takács B & Kiss B (2003) The virtual human interface: A photorealistic digital human. IEEE Comput. Graph. Appl. 23(5): 38–45. URI: http://dx.doi.org/10.1109/MCG.2003.1231176.
39. Manninen T (2004) Rich interaction model for game and virtual environment design. Ph.D. thesis, University of Oulu, Acta Universitatis Ouluensis. Series A, Scientiae rerum naturalium nro 409.
40. Funatsu S, Koyama T, Saito S, Tokunaga T & Nakajima M (2004) Action generation from natural language. In: Proceedings of the 5th Pacific Rim Conference on Advances in Multimedia Information Processing - Volume Part III, PCM'04, pp. 15–22. Springer-Verlag, Berlin, Heidelberg. URI: http://dx.doi.org/10.1007/978-3-540-30543-9_3.
41. Moore GE (1965) Cramming more components onto integrated circuits. Electronics 38(8): 114–117. URI: http://dx.doi.org/10.1109/jproc.1998.658762.
42. Mell PM & Grance T (2011) SP 800-145. The NIST definition of cloud computing. Technical report, National Institute of Standards & Technology, Gaithersburg, MD, United States.
43. Nvidia Shield. http://shield.nvidia.com. Accessed: 2014-10-29.
44. Morie JF (2006) Virtual reality, immersion, and the unforgettable experience. In: Proceedings of SPIE, volume 6055, pp. 60551X–60551X–10. URI: http://dx.doi.org/10.1117/12.660290.
45. Biocca F & Levy MR (1995) Communication in the Age of Virtual Reality. Routledge.
46. Pausch R, Proffitt D & Williams G (1997) Quantifying immersion in virtual reality. In: Proceedings of the 24th Annual Conference on Computer Graphics and Interactive Techniques, SIGGRAPH '97, pp. 13–18. ACM Press/Addison-Wesley Publishing Co., New York, NY, USA. URI: http://dx.doi.org/10.1145/258734.258744.
47. Ryan ML (2001) Narrative As Virtual Reality: Immersion and Interactivity in Literature and Electronic Media. Johns Hopkins University Press, Baltimore, MD, USA.
48. Coleridge ST (1817) Biographia Literaria.
49. Slater M & Wilbur S (1997) A framework for immersive virtual environments (FIVE): Speculations on the role of presence in virtual environments. Presence: Teleoperators and Virtual Environments 6(6): 603–616.
50. Uno S & Slater M (1997) The sensitivity of presence to collision response. In: Virtual Reality Annual International Symposium, 1997., IEEE 1997, pp. 95–103. IEEE.
51. Milgram P & Kishino F (1994) A taxonomy of mixed reality visual displays. IEICE Transactions on Information and Systems 77(12): 1321–1329.
52. Gervautz M & Schmalstieg D (2012) Anywhere interfaces using handheld augmented reality. Computer 45(7): 26–31.
53. Fisher RA (1936) The use of multiple measurements in taxonomic problems. Annals of Eugenics 7(7): 179–188.
54. Young T & Calvert T (1974) Classification, Estimation, and Pattern Recognition. American Elsevier Pub. Co. URI: http://books.google.fi/books?id=ZWhQAAAAMAAJ.
55. Recker WW, McNally MG & Root GS (1981) Application of pattern recognition theory to activity pattern analysis.

56. Bishop CM (2006) Pattern Recognition and Machine Learning (Information Science and Statistics). Springer-Verlag New York, Inc., Secaucus, NJ, USA.
57. Watanabe S (1985) Pattern Recognition: Human and Mechanical. John Wiley & Sons, Inc., New York, NY, USA.
58. Bao L & Intille SS (2004) Activity recognition from user-annotated acceleration data. In: Pervasive Computing, pp. 1–17. Springer.
59. Preece SJ, Goulermas JY, Kenney LPJ, Howard D, Meijer K & Crompton R (2009) Activity identification using body-mounted sensors - a review of classification techniques. Physiological Measurement 30(4): R1. URI: http://stacks.iop.org/0967-3334/30/i=4/a=R01.
60. Stäger M, Lukowicz P & Tröster G (2007) Power and accuracy trade-offs in sound-based context recognition systems. Pervasive Mob. Comput. 3(3): 300–327. URI: http://dx.doi.org/10.1016/j.pmcj.2007.01.002.
61. Ward J, Lukowicz P, Tröster G & Starner T (2006) Activity recognition of assembly tasks using body-worn microphones and accelerometers. Pattern Analysis and Machine Intelligence, IEEE Transactions on 28(10): 1553–1567.
62. Cheng J, Lukowicz P, Henze N, Schmidt A, Amft O, Salvatore G & Tröster G (2013) Smart textiles: From niche to mainstream. Pervasive Computing, IEEE 12(3): 81–84.
63. Bandini S, Federici M & Manzoni S (2007) A qualitative evaluation of technologies and techniques for data collection on pedestrians and crowded situations. In: Proceedings of the 2007 Summer ..., pp. 1057–1064. URI: http://dl.acm.org/citation.cfm?id=1358075.
64. Calabrese F, Pereira F & Lorenzo GD (2010) The geography of taste: analyzing cell-phone mobility and social events. In: Pervasive Computing, pp. 22–37. URI: http://link.springer.com/chapter/10.1007/978-3-642-12654-3_2.
65. Calabrese F, Colonna M, Lovisolo P, Parata D & Ratti C (2011) Real-time urban monitoring using cell phones: A case study in Rome. IEEE Transactions on Intelligent Transportation Systems 12(1): 141–151. URI: http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=5594641.
66. Jacques Junior J, Raupp Musse S & Jung C (2010) Crowd analysis using computer vision techniques. Signal Processing Magazine, IEEE 27(5): 66–77.
67. Wirz M, Franke T, Roggen D, Mitleton-Kelly E, Lukowicz P & Tröster G (2013) Probing crowd density through smartphones in city-scale mass gatherings. EPJ Data Science 2(1): 5. URI: http://dx.doi.org/10.1140/epjds17.
68. Roggen D, Tröster G, Lukowicz P, Ferscha A, del R Millán J & Chavarriaga R (2013) Opportunistic human activity and context recognition. Computer 46(2): 36–45.
69. Lukowicz P, Pentland A & Ferscha A (2012) From context awareness to socially aware computing. Pervasive Computing, IEEE 11(1): 32–41.
70. Giannotti F, Pedreschi D, Pentland A, Lukowicz P, Kossmann D, Crowley J & Helbing D (2012) A planetary nervous system for social mining and collective awareness. The European Physical Journal Special Topics 214(1): 49–75. URI: http://dx.doi.org/10.1140/epjst/e2012-01688-9.
71. Card SK, Mackinlay JD & Shneiderman B (eds.) (1999) Readings in Information Visualization: Using Vision to Think. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA.
72. Purchase H, Andrienko N, Jankun-Kelly T & Ward M (2008) Theoretical foundations of information visualization. In: Kerren A, Stasko J, Fekete JD & North C (eds.) Information Visualization, volume 4950 of Lecture Notes in Computer Science, pp. 46–64. Springer Berlin Heidelberg. URI: http://dx.doi.org/10.1007/978-3-540-70956-5_3.

73. Liu Z & Stasko JT (2010) Theories in information visualization: What, why and how. In: Workshop on the Role of Theory in Information Visualization, InfoVis '10.
74. Shannon C (1948) A mathematical theory of communication. Bell System Technical Journal 27: 379–423, 623–656. URI: http://cm.bell-labs.com/cm/ms/what/shannonday/shannon1948.pdf.
75. Second Life. http://secondlife.com. Accessed: 2014-12-15.
76. Boulos MK & Burden D (2007) Web GIS in practice V: 3-D interactive and real-time mapping in Second Life. International Journal of Health Geographics 6(1): 51. URI: http://dx.doi.org/10.1186/1476-072X-6-51.
77. Gorini A, Gaggioli A, Vigna C & Riva G (2008) A Second Life for eHealth: Prospects for the use of 3-D virtual worlds in clinical psychology. Journal of Medical Internet Research 10(3).
78. Kwan MP & Lee J (2004) Geovisualization of human activity patterns using 3D GIS: a time-geographic approach. Spatially Integrated Social Science 27.
79. Boers N, Chodos D, Huang J, Gburzynski P, Nikolaidis I & Stroulia E (2009) The smart condo: visualizing independent living environments in a virtual world. In: Pervasive Computing Technologies for Healthcare, 2009. PervasiveHealth 2009. 3rd International Conference on, pp. 1–8.
80. Fleck S & Strasser W (2008) Smart camera based monitoring system and its application to assisted living. Proceedings of the IEEE 96(10): 1698–1714.
81. Cheok AD, Fong SW, Goh KH, Yang X, Liu W, Farzbiz F & Li Y (2003) Human Pacman: A mobile entertainment system with ubiquitous computing and tangible interaction over a wide outdoor area. In: Proc. Mobile HCI 2003, volume 2795 of LNCS, pp. 209–223. ACM, Springer-Verlag.
82. Thomas B, Close B, Donoghue J, Squires J, De Bondi P, Morris M & Piekarski W (2000) ARQuake: an outdoor/indoor augmented reality first person application. In: Wearable Computers, The Fourth International Symposium on, pp. 139–146.
83. Musolesi M, Miluzzo E, Lane ND, Eisenman SB, Choudhury T & Campbell AT (2008) The second life of a sensor - integrating real-world experience in virtual worlds using mobile phones. In: Proc. of HotEmNets '08.
84. Al Masum Shaikh M, Molla M & Hirose K (2008) Automatic life-logging: A novel approach to sense real-world activities by environmental sound cues and common sense. In: Computer and Information Technology, 2008. ICCIT 2008. 11th International Conference on, pp. 294–299.
85. Shaikh MAM, Helmut P, Hirose K & Mitsuru I (2009) Easy living in the virtual world: A noble approach to integrate real world activities to virtual worlds. In: Web Intelligence and Intelligent Agent Technologies, 2009. WI-IAT '09. IEEE/WIC/ACM International Joint Conferences on, volume 2, pp. 466–473.
86. Lifton J, Feldmeier M, Ono Y, Lewis C & Paradiso J (2007) A platform for ubiquitous sensor deployment in occupational and domestic environments. In: Information Processing in Sensor Networks, 2007. IPSN 2007. 6th International Symposium on, pp. 119–127.
87. Lifton J & Paradiso J (2010) Dual reality: Merging the real and virtual. In: Lehmann-Grube F & Sablatnig J (eds.) Facets of Virtual Environments, volume 33 of Lecture Notes of the Institute for Computer Sciences, Social Informatics and Telecommunications Engineering, pp. 12–28. Springer Berlin Heidelberg. URI: http://dx.doi.org/10.1007/978-3-642-11743-5_2.

88. Alatalo T (2011) An entity-component model for extensible virtual worlds. IEEE Internet Computing 15(5): 30–37. URI: http://dx.doi.org/10.1109/MIC.2011.82.
89. Guestrin E & Eizenman E (2006) General theory of remote gaze estimation using the pupil center and corneal reflections. Biomedical Engineering, IEEE Transactions on 53(6): 1124–1133.
90. Koskela T, Arhippainen L, Pouke M, Karhu A & Hickey S (2013) User experiences in ThrowIt: A natural UI for sharing objects between mobile devices. In: Next Generation Mobile Apps, Services and Technologies (NGMAST), 2013 Seventh International Conference on, pp. 75–80.
91. Fix E & Hodges JL (1951) Discriminatory analysis, nonparametric discrimination: Consistency properties. US Air Force School of Aviation Medicine Technical Report 4(3): 477+.
92. Bowman DA, Johnson DB & Hodges LF (1999) Testbed evaluation of virtual environment interaction techniques. In: Proceedings of the ACM Symposium on Virtual Reality Software and Technology, VRST '99, pp. 26–33. ACM, New York, NY, USA. URI: http://doi.acm.org/10.1145/323663.323667.
93. Haykin S (1994) Neural Networks: A Comprehensive Foundation. Prentice Hall PTR, Upper Saddle River, NJ, USA, 1st edition.
94. Pouke M & Häkkilä J (2013) Elderly healthcare monitoring using an avatar-based 3D virtual environment. International Journal of Environmental Research and Public Health 10(12): 7283–7298. URI: http://www.mdpi.com/1660-4601/10/12/7283.
95. OGRE - Object-Oriented Graphics Rendering Engine. http://www.ogre3d.org/. Accessed: 2014-11-21.
96. Sinnott RW (1984) Virtues of the Haversine. Sky and Telescope 68(2): 159+.
97. (1998) WGS 84 Implementation Manual.
98. Lou Y, Zhang C, Zheng Y, Xie X, Wang W & Huang Y (2009) Map-matching for low-sampling-rate GPS trajectories. In: Proceedings of the 17th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, GIS '09, pp. 352–361. ACM, New York, NY, USA. URI: http://doi.acm.org/10.1145/1653771.1653820.
99. LaPlante J & Kaeser TP (2007) A history of pedestrian signal walking speed assumptions. In: 3rd Urban Street Symposium, pp. 1–8.
100. Weidmann U (1993) Transporttechnik der Fussgänger - Transporttechnische Eigenschaften des Fussgängerverkehrs (Literaturstudie). Literature Research 90, Institut für Verkehrsplanung, Transporttechnik, Strassen- und Eisenbahnbau IVT an der ETH Zürich, ETH-Hönggerberg, CH-8093 Zürich. URI: http://www.ped-net.org. In German.
101. Seyfried A, Steffen B, Klingsch W, Lippert T & Boltes M (2007) The fundamental diagram of pedestrian movement revisited - empirical results and modelling. In: Schadschneider A, Pöschel T, Kühne R, Schreckenberg M & Wolf D (eds.) Traffic and Granular Flow '05, pp. 305–314. Springer Berlin Heidelberg. URI: http://dx.doi.org/10.1007/978-3-540-47641-2_26.
102. Maneewongvatana S & Mount D (1999) It's okay to be skinny, if your friends are fat. In: Center for Geometric Computing 4th Annual Workshop on Computational Geometry, October, pp. 1–8. URI: https://myplan.cs.umd.edu/~mount/Papers/cgc99-smpack.pdf.
103. Hart P, Nilsson N & Raphael B (1968) A formal basis for the heuristic determination of minimum cost paths. Systems Science and Cybernetics, IEEE Transactions on 4(2): 100–107.

104. Pudil P, Novovičová J & Kittler J (1994) Floating search methods in feature selection. Pattern Recogn. Lett. 15(11): 1119–1125. URI: http://dx.doi.org/10.1016/0167-8655(94)90127-9.
105. Yamazoe H, Utsumi A, Yonezawa T & Abe S (2008) Remote and head-motion-free gaze tracking for real environments with automated head-eye model calibrations. In: Computer Vision and Pattern Recognition Workshops, 2008. CVPRW '08. IEEE Computer Society Conference on, pp. 1–6.

Original publications

I Pouke M, Hickey S, Kuroda T & Noma H (2010) Activity recognition of the elderly. In: Proceedings of the 4th ACM International Workshop on Context-Awareness for Self-Managing Systems (CASEMANS '10). Article No. 7. DOI=10.1145/1858367.1858374
II Pouke M & Honkanen RT (2011) Comparison of nearest neighbour and neural network based classifications of patient's activity. In: Pervasive Computing Technologies for Healthcare (PervasiveHealth 2011) 5th International Conference on, pp. 331–335. © 2011 IEEE
III Pouke M, Karhu A, Hickey S & Arhippainen L (2012) Gaze Tracking and Non-Touch Gesture Based Interaction Method for Mobile 3D Virtual Spaces. In: Proceedings of the 25th Annual Conference of the Australian Computer-Human Interaction Special Group (OZCHI 2012), pp. 505–512. DOI=10.1145/2414536.2414614
IV Pouke M (2013) Using GPS Data to Control an Agent in a Realistic 3D Environment. In: Proceedings of the 7th International Conference on Next Generation Mobile Apps, Services and Technologies (NGMAST 2013), pp. 87–92. © 2013 IEEE
V Pouke M (2013) Using 3D Virtual Environments to Monitor Elderly Patient Activity With Low Cost Sensors. In: Proceedings of the 7th International Conference on Next Generation Mobile Apps, Services and Technologies (NGMAST 2013), pp. 81–86. © 2013 IEEE
VI Pouke M & Häkkilä J (2013) Elderly Healthcare Monitoring Using an Avatar-Based 3D Virtual Environment. International Journal of Environmental Research and Public Health, vol. 10, no. 12, pp. 7283–7298.
VII Pouke M, Goncalves J, Ferreira D & Kostakos V (2014) Practical simulation of virtual crowds using points of interest. Manuscript.

Reprinted with permission from ACM (I, III), IEEE (II, IV, V) and MDPI AG (VI).
Original publications are not included in the electronic version of the dissertation.


