
Research Collection

Doctoral Thesis

A Framework for Optimal In-Air Gesture Recognition in Collaborative Environments

Author(s): Alavi, Ali Seyed

Publication Date: 2020

Permanent Link: https://doi.org/10.3929/ethz-b-000449030

Rights / License: In Copyright - Non-Commercial Use Permitted


DISS. ETH NO. 26416

A Framework for Optimal In-Air Gesture Recognition in Collaborative Environments

A thesis submitted to attain the degree of

DOCTOR OF SCIENCES of ETH Zurich

(Dr. sc. ETH Zurich)

presented by

SEYED ALI ALAVI

Master of Science in Secure and Dependable Computer Systems

born on 15.09.1984

citizen of Iran

accepted on the recommendation of

Prof. Dr. Konrad Wegener, examiner
Prof. Dr. Andreas Kunz, co-examiner
Prof. Dr. Morten Fjeld, co-examiner

2020

Abstract

Hand gestures play an important role in communication between humans, and increasingly in the interaction between humans and computers, where users can interact with a computer system to manipulate digital content, provide input, and give commands.

Thanks to advances in computer vision and camera technology, in-air hand gestures can be tracked without the need for instrumenting the hands. This allows for a wide variety of interesting and powerful use cases.

First, hand gestures that happen naturally during human-to-human communication can be tracked and interpreted. This has been extensively used and researched, for example for communicating deictic gestures to remote participants. Most such solutions rely on communicating such gestures using extensive visual feedback, for example by showing a remote participant’s hand, or even their full body, to their remote partners. While useful for many scenarios, such heavy reliance on visual feedback limits the usability and accessibility of such solutions, for example for blind and visually impaired (BVI) participants, or for scenarios where screen real estate is limited.

Even when used for human-computer interaction, in-air interfaces rely on visual feedback. Because in-air gestures are ephemeral and provide no haptic feedback, it is difficult for a new user to perform them properly. A typical approach to address this problem is to draw the hand trajectory on the display. This causes distraction, especially if multiple users who share a single display interact with the system simultaneously. Another approach is to use a fast gesture classifier, which allows giving quick feedback to the user, even shortly before the gesture is finished, provided that it is sufficiently distinct. Due to the way most current classifiers are designed, this feedback is mainly limited to reporting whether the gesture could be classified and, if so, to which class it belonged. Such feedback has limited usefulness, as the only thing the user can do after receiving it is to repeat the gesture if it failed. Why it failed and how their performance can be improved remains unknown.

This thesis proposes methods for utilizing in-air gestures to enhance digital collaboration without heavy reliance on visual feedback. This is especially useful for collaborative scenarios where some participants have limited access to the visual channel, most notably BVI participants and remote participants, and for scenarios where the display in the collaborative environment is crowded with content, so that showing large visual cues is not desirable.

Specifically, this thesis addresses two main challenges:

• How to communicate in-air gestures, specifically deictic gestures, to blind and visually impaired participants, as well as remote participants, while minimizing (or eliminating) the need for visual feedback. For BVI participants, this is achieved by tracking deictic gestures of sighted participants, deciding whether they are performing a deictic gesture, and then communicating the target of the gesture to the BVI participants using a Braille display or a screen reader. For remote participants, this is achieved by showing the target of pointing gesture using a small highlighter on the screen, as well as by allowing the remote participant to control the opacity of the visual feedback if a more complicated visual feedback is necessary.

• How to use in-air gestures in collaborative scenarios for human-computer interaction, while minimizing the use of visual feedback. This is achieved by proposing a new algorithm for gesture recognition that can provide fast, useful, and non-distracting feedback for in-air gestures. The algorithm always keeps the user informed about the state of the gesture recognizer, and informs the user about what they need to do next to get closer to finishing a gesture by giving them non-distracting visual cues. Moreover, the proposed algorithm is independent of the speed, scale, or orientation of the gestures. This allows the users to perform gestures from different distances and angles relative to the camera, with a speed they are comfortable with, which gives them ample opportunity to learn how to perform gestures. Additionally, a new algorithm for creating large gesture sets for in-air interactions using a smaller set of gestures is introduced, thus reducing the need for learning new gestures by the users. The resulting gestures are also guaranteed to be easily detectable by the proposed gesture recognizer.

Finally, because studying these problems requires a setup capable of uninstrumented hand tracking, this thesis proposes cost-effective hardware setups that allow for setting up collaborative environments with horizontal or vertical displays that are capable of tracking in-air gestures.

Zusammenfassung

Handgesten spielen eine wichtige Rolle in der Kommunikation zwischen Menschen, aber in zunehmendem Masse auch in der Kommunikation mit Computern. Mit seinen Gesten kann der Mensch den digitalen Inhalt manipulieren, Eingaben vornehmen, oder Befehle in den Computer eingeben. Durch die stetige Weiterentwicklung in der Bildverarbeitung und der Kameratechnologie können solche frei im Raum ausgeübten Gesten erkannt werden, ohne dass hierfür Sensoren an der Hand oder an dem Arm angebracht werden müssen. Dies ermöglicht eine Vielzahl interessanter und leistungsfähiger Anwendungsfälle.

Diese Systeme können auch solche intuitiven Gesten erkennen und interpretieren, wie sie in der Kommunikation zwischen Menschen auftreten. Diese Möglichkeit wurde intensiv erforscht, beispielsweise hinsichtlich der Übertragung deiktischer Gesten an entfernte Teilnehmer. Viele der Lösungen beruhen darauf, dass eine umfassende visuelle Darstellung ermöglicht wird, indem man beispielsweise die Hand des entfernten Teilnehmers darstellt oder sogar der ganze Körper den anderen Gesprächsteilnehmern gezeigt wird. Obwohl die Lösungen für viele Anwendungen nützlich sind, so ist doch diese starke Ausrichtung auf den visuellen Wahrnehmungskanal ein limitierender Faktor für einige Personen, beispielsweise für Blinde und Sehbehinderte (BVI), aber auch für Szenarien, in denen nur kleine Bildflächen zur Verfügung stehen, z.B. Smartphones oder Tablets.

Aber auch für die Interaktion zwischen Mensch und Computer sind solche frei im Raum ausgeführten Gesten auf ein visuelles Feedback angewiesen. Da solche Gesten kurzlebig sind und keine haptische Rückmeldung liefern, fällt es einem neuen Anwender schwer, diese zu erlernen und richtig auszuführen. Eine Möglichkeit, diesem Problem zu begegnen, besteht darin, den durch die Geste ausgeführten Pfad auf einem Bildschirm darzustellen. Hierdurch entstehen aber Ablenkungen und Missverständnisse, insbesondere dann, wenn die Gesten mehrerer Anwender auf dem gleichen Bildschirm dargestellt werden. Ein anderer Ansatz besteht in einer schnellen Zuordnung der Gesten, welche dem Anwender eine Rückmeldung über die korrekte Ausführung der Geste gibt. Allerdings beschränken sich die heutigen Klassifizierungsmodule darauf, lediglich die korrekte Klassifizierung und die zugehörige Klasse auszugeben. Diese Information ist allerdings für den Anwender nur bedingt hilfreich, da er im Falle einer fehlerhaften oder nicht erkannten Geste diese einfach nur wiederholen kann. Er erhält aber keine Mitteilung darüber, warum diese Geste fehlerhaft war und wie sie korrigiert werden kann.

Im Rahmen dieser Arbeit werden Methoden entwickelt, mit welchen solche Gesten für die Verbesserung der digitalen Zusammenarbeit eingesetzt werden können, ohne sich stark auf eine visuelle Rückmeldung abstützen zu müssen. Das ist insbesondere für solche Formen der digitalen Zusammenarbeit wichtig, in denen einige Teilnehmer nur einen eingeschränkten Zugang zu dem visuellen Wahrnehmungskanal haben, wie beispielsweise BVI oder Teilnehmer mit kleinen mobilen Endgeräten; aber auch für solche Fälle, in denen der Bildschirm bereits mit anderem Inhalt komplett belegt ist. Die Arbeit fokussiert insbesondere auf die folgenden beiden Punkte:

· Ausgabe von frei im Raum ausgeführten Gesten – insbesondere deiktischen Gesten – an BVI oder an Anwender mit kleinen mobilen Endgeräten, während für den Ausführenden dieser Gesten die visuelle Rückmeldung minimiert oder komplett eliminiert wird. Hierzu werden die Gesten der sehenden Teilnehmer erfasst und entschieden, ob es sich um eine deiktische Geste handelt oder nicht. Die erkannten deiktischen Gesten werden dann für BVI auf einer Braille-Zeile oder über ein Bildschirm-Lesegerät ausgegeben. Für die Teilnehmer mit kleinen mobilen Endgeräten erfolgt die Ausgabe durch ein visuelles Hervorheben der relevanten Stellen auf dem Bildschirm, wobei der Teilnehmer zudem noch die Möglichkeit hat, die Transparenz der visuellen Ausgabe einzustellen, falls eine kompliziertere visuelle Ausgabe erforderlich ist.

· Verwendung von Gesten für die Mensch-Computer Zusammenarbeit bei gleichzeitiger Minimierung der visuellen Rückmeldung auf dem Bildschirm. Dies wird erreicht durch einen neuen Algorithmus für die Gestenerkennung, welcher eine schnelle und wenig störende Rückmeldung über die ausgeführte Geste liefert. Dieser Algorithmus informiert den Anwender über den jeweiligen Status der Gestenerkennung und darüber, was er tun muss, um die Geste korrekt zu beenden. Der Algorithmus ist unabhängig davon, mit welcher Geschwindigkeit oder in welcher Grösse die Gesten ausgeführt werden. Dadurch wird es ermöglicht, dass die Anwender die Gesten aus unterschiedlichen Entfernungen und Winkeln zur Kamera ausführen können, und in einer jeweils für den Anwender angenehmen Geschwindigkeit. Hierdurch entstehen vielfältige Möglichkeiten, das Ausführen der Gesten zu erlernen. Weiterhin wird ein Algorithmus vorgestellt, mit dem es möglich wird, einen grossen Zeichenvorrat an möglichen Gesten zu erstellen, welcher auf deutlich weniger elementaren Gesten beruht. Hierdurch entfällt die Notwendigkeit, dass der Anwender einen grossen Gestensatz erlernen muss. Gleichzeitig stellt dieser Algorithmus sicher, dass alle Gesten durch die automatische Gestenerkennung eindeutig erkennbar werden.

Zur Erprobung der oben vorgestellten Algorithmen zur Erkennung und Verarbeitung der frei im Raum ausgeführten Handgesten werden unterschiedliche Testaufbauten vorgestellt, mit denen eine Kollaborationsumgebung aufgebaut werden kann, in welcher das Erfassen der Gesten über vertikalen oder horizontalen Bildflächen möglich ist.

Acknowledgments

First of all, I thank my advisors Prof. Dr. Konrad Wegener, head of the Institute of Machine Tools and Manufacturing (IWF), and Prof. Dr. Andreas Kunz, head of the Innovation Center Virtual Reality (ICVR) group.

Professor Wegener gave me the chance to do my research in his institute. In addition to providing all the facilities and means required for my research, he ensured I had absolute freedom in what I researched and how I conducted my research. I deeply appreciate his open-mindedness and his support and encouragement for conducting a wide variety of research in IWF. I also have fond memories of being his bandmate at all the IWF Christmas parties during my time there. Thank you, Professor Wegener!

An enormous debt of gratitude goes to Prof. Dr. Andreas Kunz, without whom this thesis would have been neither started nor finished. He provided exceptional support, both intellectually and personally, during the time of my Ph.D. His unconditional support for solving administrative challenges, his deep, intellectual, inquisitive questioning, our many technical discussions (usually over a whiteboard in his office, my office, or the institute’s corridor), and his open-mindedness and patience during this research and all the turns it took were all instrumental in finishing this work. He also spent a considerable amount of time reviewing this thesis time and again, and was available day and night to provide any help and support I could ever ask for. Thanks for everything, Andreas!

I also wish to thank Prof. Dr. Morten Fjeld for providing motivation and guidance. Morten’s positive attitude and optimism are truly contagious, and his encouragement during this research went a long way.

I thank all the colleagues, staff, and friends at IWF. I especially thank Mohamadreza Afrasiabi for his friendship and support.

My special thanks go to my colleagues in ICVR, Anh Nguyen, Markus Zank, and Dr. Thomas Nescher. I enjoyed our many movie nights, lunch conversations, and Manege sessions. I also thank Thomas for generously providing the template of this thesis. And I thank Josef Meile for providing me with all the tech support I needed.

I am also thankful to all the students I supervised, whose thesis work helped to shape this research in one way or another.

My special thanks and appreciation go to my parents and my sisters for their continuous love, support, and encouragement, without whom this work would not have been possible.

Finally, I acknowledge the Swiss National Science Foundation (SNSF) and Swiss Federal Commission for Innovation and Technology (CTI) for funding the majority of this research (SNSF project number CR21I2_138601, and CTI project number 16251.2 PFES-ES).

Contents

Abbreviations xiii

1 Introduction 1
  1.1 Definition and Terminology 3
  1.2 Motivation and Problem Statement 5
  1.3 Organization 7

I Hardware for In-air Interaction in Collaborative Environments 9

2 Hardware for Detecting and Tracking In-Air Gestures 11
  2.1 Introduction 11
    2.1.1 Instrumented In-Air Interfaces 11
    2.1.2 Uninstrumented In-Air Interfaces 14
  2.2 Conclusion 15

3 Tracking In-Air Gestures in Collaborative Environments Using Commodity Hardware 17
  3.1 Depth Cameras 18
  3.2 Setup of a Collaborative Environment with a Horizontal Display 20
    3.2.1 Single Camera Setup Using Kinect V1 21
    3.2.2 Multiple Camera Setup Using Leap Motion 26
  3.3 Setup of a Collaborative Environment with a Large Vertical Display 26
    3.3.1 Methodology 28
  3.4 Conclusion 30


II Integrating and Interpreting In-Air Gestures for Enhanced Collaboration 33

4 An Accessible Environment to Integrate Blind Participants into Brainstorming Sessions 35
  4.1 Introduction 35
  4.2 System Setup 36
  4.3 Detection and Interpretation of Deictic Gestures 36
  4.4 User Studies 39
  4.5 Evaluation and Results 39
  4.6 Conclusion 40

5 Using In-Air Gestures for Enhancing Remote Collaboration in Immersive Environments 41
  5.1 Introduction 41
  5.2 Related Work 42
  5.3 Design 42
  5.4 Evaluation 45
  5.5 Conclusion 47

III In-Air Gestures for Human-Computer Interaction in Collaborative Environments 49

6 Human-Computer Interaction and In-Air Gestures 51
  6.1 Introduction 51
  6.2 Fundamental principles of interaction 52
    6.2.1 Psychology of Interaction 52
  6.3 Analysing interaction properties of user interfaces 55
    6.3.1 Command Line Interfaces 55
    6.3.2 Graphical User Interfaces 56
    6.3.3 Touch User Interfaces 57
  6.4 Gestural Interactive Systems 58
    6.4.1 Gesture Detection 58
    6.4.2 Gesture Classification 59
  6.5 State of the Art in Uninstrumented In-Air Gesture Recognizers 59
  6.6 Conclusion 65

7 Gesture Recognition for Human-Computer Interaction in Collaborative Environments 67
  7.1 Introduction 67
  7.2 Design Goals 68
  7.3 Gesture Datasets 68


  7.4 An overview of the algorithm 69
    7.4.1 Preprocessing 70
      7.4.1.1 Algorithm 1: Summarizing the point cloud 71
      7.4.1.2 Scale and orientation-free representation 73
  7.5 Distance Function 76
    7.5.1 Dynamic Time Warping 76
  7.6 Classification 77
  7.7 Providing Feedback 77
  7.8 Evaluation and Results 78
  7.9 Selecting unambiguous gestures 80
    7.9.1 Problem Statement 82
    7.9.2 N-ary Huffman Codes 83
    7.9.3 Using n-ary Huffman codes for selecting unambiguous gestures 83
  7.10 Conclusion 84

8 Conclusion and Future Work 91
  8.1 Optimizing In-Air Gestures for Minimum Fatigue 92
  8.2 Effects on gesture guessability and gesture elicitation studies 93
  8.3 Benchmarks and datasets for gesture recognizers 94

List of Figures 95

List of Tables 97

Bibliography 99

List of Publications 109


Abbreviations

2D ...... Two Dimensional

3D ...... Three Dimensional

AI ...... Artificial Intelligence

ANN ...... Artificial Neural Network

AR ...... Augmented Reality

BVIP ...... Blind and Visually Impaired Participant

CNN ...... Convolutional Neural Network

DNN ...... Deep Neural Network

DTW ...... Dynamic Time Warping

EBS ...... Electronic Brainstorming System

FOV ...... Field of View

GUI ...... Graphical User Interface

HMD ...... Head-Mounted Display

HMM...... Hidden Markov Model

k-NN ...... k-Nearest Neighbors

ML ...... Machine Learning

MR ...... Mixed Reality

NUI ...... Natural User Interface


RNN ...... Recurrent Neural Network

VR ...... Virtual Reality

1

Introduction

As for the hands, without which all action would be crippled and enfeebled, it is scarcely possible to describe the variety of their motions, since they are almost as expressive as words. For other portions of the body merely help the speaker, whereas the hands may almost be said to speak.

Quintilian (35-100 AD)

Digital collaborative environments are environments where people work together to perform a task using a computer system. During the process of collaboration, team members communicate with each other in what is known as a ‘communication space’, and work together, usually using a shared large digital display, in what is known as the ‘task space’. Figure 1.1 depicts these spaces in two collaborative environments, one with a vertical display and another with a horizontal display (although it is possible to have both types of displays in a collaborative environment, these two examples are good representatives). In some collaborative scenarios, such as remote collaborations where one or more collaborators join the collaboration remotely and access the task space over a computer network, lack of visual access to the communication space leads to a significant loss of context for the remote team member. After all, humans communicate not only verbally, but also visually, using facial expressions, body posture, hand gestures, and so on. Thus, communicating the visual elements that occur during a collaborative scenario to such participants helps them achieve a better understanding of the communication among their remote partners. These visual communication elements include the likes of pose, facial expressions, full-body gestures, and hand gestures. Due to the expressiveness of hands, much focus has been directed toward detecting and interpreting hand gestures.


Figure 1.1: Digital collaboration happens in two spaces: task space and communication space. Figures (a) and (b) depict these spaces in two different collaborative environments, one with a vertical display (a) and the other with a horizontal display (b).


1.1 Definition and Terminology

As one of the most expressive communication tools for humans, hand gestures and postures are used to convey meanings, enhance communication, or to completely replace verbal communication, as in the case of sign languages. Although the terms "hand posture" and "hand gesture" are sometimes used interchangeably, they have distinct definitions:

Hand posture is the state of the hand and fingers at an instant of time [59], for example, having the hand in a fist position. This is also sometimes referred to as hand pose.

Hand gesture is a dynamic sequence of postures over time [59]. For example, waving one’s hand to indicate a "bye" sign.

Similar definitions of posture and gesture exist for the whole body, but in this thesis, gesture, posture and pose always refer to the hand, unless explicitly specified otherwise.
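To make this distinction concrete, the following minimal Python sketch (an illustration only; the type and field names are assumptions, not taken from [59] or [44]) models a posture as a time-stamped snapshot of joint positions and a gesture as a time-ordered sequence of such snapshots.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# A 3D position of a single hand joint (x, y, z), e.g. in millimeters.
Point3D = Tuple[float, float, float]

@dataclass
class HandPosture:
    """State of the hand and fingers at a single instant of time."""
    timestamp: float                # seconds
    joints: Dict[str, Point3D]      # e.g. {"thumb_tip": (x, y, z), ...}

@dataclass
class HandGesture:
    """A dynamic sequence of postures over time, e.g. waving."""
    postures: List[HandPosture]

    def duration(self) -> float:
        """Time span covered by the gesture, in seconds."""
        if len(self.postures) < 2:
            return 0.0
        return self.postures[-1].timestamp - self.postures[0].timestamp
```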

Human gestures have been extensively studied in different research disciplines and for different purposes. As a result, different terminologies have been used in the existing literature [12, 49, 93, 21, 112]. The terminology used here is the one presented in [44], which itself mainly uses the terminology of [89]. Human gestures can be categorized into the following types:

Deictic Gesture is a gesture where the user uses pointing "to establish the identity or spatial location of an object within the context of the application" [44]. Examples of such gestures appear in many applications, such as "Put-That-There" [10], where they are used to manipulate objects, or [4], where they improve collaboration over large interactive surfaces by tracking the objects of discussion that the users are pointing to.

Manipulative Gesture is a gesture whose "intended purpose is to control some entity by applying a tight relationship between the actual movements of the gesturing hand/arm with the entity being manipulated" [89]. Examples are the well-known Pinch-to-Zoom gesture (Figure 6.4) in touch interfaces, or [39], which allows users to perform in-air interaction around a tabletop to pick up, move and resize objects.

Semaphoric Gesture "employs a stylized dictionary of static or dynamic hand or arm gestures" [89]. It is important to note that unlike manipulative gestures, semaphoric gestures can be both static (i.e. a posture), such as the "ok" sign, or dynamic, such as drawing an "X" sign in the air.

Gesticulation is a gesture that is usually performed in combination with speech, either to emphasize a part of the speech or in place of it. Also referred to as coverbal gestures, these gestures can be seen as part of the natural human language. Examples are the "waving" and "thumbs-up" gestures. It is noteworthy that one might categorize these gestures under other gesture categories: for example, an "ok" gesture can be seen as a gesticulation, even though it might also act as a semaphoric gesture. The difference is that users do not need to learn these gestures, as they already use them in their everyday conversations. Moreover, these gestures happen as part of natural conversation and speech and are not particularly targeted towards the interactive system as a command (even though an interactive system might use them as a form of user input).


Much research has been done to make such visual communication elements available to remote partners. Collaboard [54], for example, enhances remote collaboration on large interactive displays by superimposing a video of the remote partner over the task space, thus enabling the partners to gain an understanding of the location, posture and body language of their counterpart. Tele-board [35] and its updated version [111] use a similar approach. Solutions such as "3D Board" [115] extend this idea by superimposing a 3D representation of the remote partner, thus enhancing the perception of the remote partner’s body posture. Other solutions, such as "Mini-me" [85], provide a more immersive experience of remote collaboration by providing a Mixed Reality (MR) collaborative environment, where remote partners can visually perceive their partners’ body location and posture in the virtual world.

Additionally, due to the expressiveness of the hands, hand gestures can also be used to enhance the quality of interaction with the computer system, whether in a collaborative environment or not. Touchscreens, for example, have succeeded as a popular user input device by employing bare-hand interaction on the screen. Though not yet as popular as touch interaction, in-air gestural interaction is also employed as a form of interaction with a computer system, where users can manipulate digital content, provide input, and give commands to a digital system using free movements of their hands.

In-air gestures have many attributes that make them an interesting form of interaction. For example, the freedom allowed by freehand motion in the air provides many opportunities for new forms of interaction, since the users are not limited by a keyboard, mouse, or touchscreen and the constraints that these devices impose. This freedom of movement has been successfully exploited in the entertainment industry, where game controllers such as the Nintendo Wiimote [62] and Playstation Move [86] allow users to play on game consoles by performing hand movements reminiscent of those performed in real-life activities, such as sports. For example, users can play tennis or bowling as if they were holding a tennis racquet or a bowling ball. To track the users’ hands, these devices employ ideas similar to that of "Put-That-There" [10], where the user’s hands (and usually the room) are instrumented with tracking devices. These controllers have been used for other applications as well. For example, in [23] the user can directly manipulate a robot using a Wiimote controller.

Similar technologies are also employed for interaction with virtual reality environments. For example, HTC Vive [107] allows users to interact with objects in virtual reality by using handheld controllers. In order to track these controllers, and thus the position and orientation of user’s hands, infrared transmitters need to be installed in the room. Other sensors such as data gloves and WiFi transmitters [2, 101, 91] have also been used to enable in-air hand gesture interaction.

But more importantly, recent advances in computer vision have allowed for uninstrumented tracking of hands. This allows using in-air hand gestures as a form of input, enabling the users to manipulate digital objects and give commands to a computer system without requiring any physical device. The interactive system only needs to employ one or more cameras, which are usually easily available and relatively inexpensive. Thus, adding a new in-air input modality will not require expensive and custom-designed hardware.

As an example, [40] allows users to manipulate digital artifacts on and above an interactive table by adding a secondary infrared camera to an existing vision-based tabletop. Users can ‘pick’ 3D models and ‘put’ them on top of each other using bare hand movements in the 3D space above the tabletop.


In addition to directly manipulating digital content, in-air gestures are also used as a way to give commands to a system. One advantage of using these in-air command gestures, when compared to more conventional forms of input, is their extensibility. This is exploited in In-Vehicle Infotainment (IVI) systems. As the functionalities of IVIs and consequently their input requirements increase, more input controls need to be provided. Adding physical buttons and knobs is expensive, and it leads to a crowded interface. Adding more menu items to the touchscreen increases its complexity, which makes it more distracting, and it also might require increasing the screen size. Compared to physical buttons and touchscreens, on the other hand, extending in-air gestures is much cheaper. It also does not have the complexities of voice commands, such as adapting them to different accents and languages. Thus, many new cars have a camera embedded in the cabin, used for recognizing in-air gestural commands, which users could initially use to perform simple commands such as increasing or decreasing the sound volume by turning their index finger. Capitalizing on the extensibility of in-air gestures, other functionalities such as navigating through the music playlist, accepting or rejecting phone calls, and even programmable gestural commands are added to the IVI through software updates.

In-air commands have also been used to extend the capabilities of other interactive systems. For example, Song et al. [99] use in-air gestures around a mobile device to complement touch interaction. One use case they explore is a drawing application, where the user can use single-hand in-air command gestures to switch the drawing tool and its attributes while simultaneously drawing on the touchscreen with the other hand. This allows for saving valuable screen space on small-screen mobile devices by removing the need for menus and toolbars. Another prominent example of using command gestures is HoloLens [67], an Augmented Reality interface which uses in-air gestures as its main input modality. In order to accept or reject prompts, users of HoloLens perform an in-air pinching gesture.

Uninstrumented in-air interactions are also referred to as touchless interactions, highlighting the fact that users can interact with a computer system without touching any physical device. This feature makes in-air interaction attractive for use in operating rooms, where physicians and medical staff cannot touch any unsterilized input device. In-air gestures allow the users to interact with medical image viewers or surgical robots without exposing themselves to the contamination risks that may arise from touching a touchscreen, keyboard, or mouse [65]. For example, [83] replaces the mouse with a Leap Motion [61] device (enclosed in a transparent sterilized plastic bag). Here, surgeons can move their hands above the Leap Motion sensor to move the mouse cursor. In order to emulate different types of mouse button functionality (left-click, right-click, press-and-hold, ...), users need to perform specific gestures. For example, bringing all the fingers together triggers a left-click, while moving the index finger downwards while keeping the other fingers stationary triggers a double-click.
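As an illustration of how such a rule-based mapping from hand features to mouse events might look, the following sketch encodes the two example rules above; the feature names and thresholds are hypothetical and are not taken from [83].

```python
# Illustrative rule-based mapping from simple per-frame hand features to mouse
# events, in the spirit of the touchless setup described above. The feature
# names and the thresholds below are hypothetical, not part of [83].

def classify_mouse_event(finger_spread_mm: float,
                         index_drop_mm: float,
                         other_fingers_moved_mm: float) -> str:
    """Map per-frame hand features to a mouse event label."""
    # Bringing all fingers together -> left click.
    if finger_spread_mm < 15.0:
        return "left_click"
    # Index finger moves down while the other fingers stay still -> double click.
    if index_drop_mm > 20.0 and other_fingers_moved_mm < 5.0:
        return "double_click"
    return "no_event"

print(classify_mouse_event(10.0, 0.0, 0.0))   # left_click
print(classify_mouse_event(40.0, 25.0, 2.0))  # double_click
```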

1.2 Motivation and Problem Statement

The presence of hand gestures in collaborative environments, as well as the availability of tools and technologies to detect, interpret, and communicate them, motivates the idea of using them for enhancing the quality and experience of collaborative work, both as a means of communicating with the computer system, and also as a way to enhance the communication with partners who do not have visual access to the communication space. Several research efforts were motivated by this idea and provided different solutions. While in-air gestures provide many benefits, their usage faces multiple challenges. This is particularly the case when in-air gestures are used in collaborative environments.

First, many current tracking technologies cannot be easily used in a collaborative setting. Particularly, limitations of camera range and resolution, and interference with other devices used in the collaborative environment (such as interactive surfaces and other cameras) make setting up a collaborative environment which supports in-air gestural interaction challenging. During the course of this thesis, solutions to such problems are explored.

Moreover, while integrating hand gestures that happen during human communication can enhance collaboration, the typical approach to communicating such gestures is heavily visual. That is, in some way or another, the visuals of the remote partners’ body and hands are overlaid on the task space, similar to how video conferencing is done. This causes a significant portion of the task space to be occupied by the representation of the user’s body. Also, research in social psychology shows that delivering all the visual elements and social content of a meeting is not ideal, as the users will not be able to focus on the task [66, 108]. Moreover, such solutions are not accessible to an important class of participants, i.e. blind and visually impaired participants.

A similar problem is present when in-air gestures are used as a means of interacting with the computer system (i.e. as a user interface), since effectively performing such in-air gestures is heavily dependent on expressive visual feedback. Again, providing such visual feedback can occupy large sections of the screen and could be distracting. This is especially the case with in-air gestures since, unlike most conventional user interfaces, a universal interaction vocabulary does not exist for in-air gestures [75]. Thus, designers of an interactive system should devise and select an application-specific vocabulary: what gesture should the user perform to delete an object, what gesture should be used to perform an ‘undo’ action, and so on. Moreover, users need to learn how to use the gestures required for interacting with each new system. This, among other things, magnifies the need for good and meaningful feedback which facilitates this learning process.

Finally, because in collaborative environments users do not use individual devices, it is difficult to recognize which user is performing a gesture. This is important for collaborative scenarios since other participants need to know, for example, who is pointing to a particular object on the digital screen. In many cases, the system also needs to know which user is performing a command. For many collaborative scenarios, users join and leave the interactive room without signing in or out of the system, and many times the identities of the users are not known by the system before they join the collaboration. Thus, it is important to be able to identify and distinguish between the users without requiring them to authenticate or register themselves in the system.

During the course of this thesis, these problems are investigated, and systems and solutions for addressing them are proposed and evaluated. More concretely, the main research question that this thesis tries to answer is as follows: ‘How to enhance digital collaboration using in-air gestures with limited or no use of the visual channel?’. To answer this question, the following sub-questions must be answered, while taking into account the constraint set on the usage of the visual channel:


1. How to design a collaborative environment capable of sensing and tracking in-air gestures using commodity hardware?

2. How to interpret and communicate these gestures to enhance collaboration for participants who do not have access to the communication space (such as remote or blind and visually impaired participants)?

3. How to recognize in-air gestures in order to use them for communicating with the computer system in a collaborative environment?

1.3 Organization

This thesis is structured into three main parts, each answering one of the sub-questions introduced in the previous section: Part I discusses the hardware used for tracking and detecting in-air gestures, and how to use this hardware to design a collaborative environment capable of incorporating these gestures. Part II provides solutions and algorithms for interpreting and integrating in-air gestures in collaborative environments to enhance collaboration among team members, and evaluates them by presenting use cases and user studies. Finally, Part III presents solutions and algorithms which allow efficient use of in-air gestures as a user interface in collaborative environments.

A more detailed description of each chapter is as follows:

Chapter 1 provides an introduction to in-air interactions in collaborative environments and presents the motivation of this thesis.

Chapter 2 provides the related work and overviews the systems used for tracking and detecting in-air gestures, and how they are employed in collaborative environments.

Chapter 3 discusses how to use commodity hardware, specifically commodity depth cameras, to set up a collaborative environment capable of tracking in-air gestures, and concludes with recommendations and solutions for allowing tracking and detection of in-air gestures in collaborative environments with horizontal or vertical displays using commodity depth cameras.

Chapter 4 presents and evaluates a solution for enhancing collaborative brainstorming in the presence of Blind and Visually Impaired Participants (BVIP), by integrating in-air gestures using one of the setups presented in Chapter 3.

Chapter 5 presents and evaluates a solution for enhancing remote collaborative brainstorming by integrating in-air gestures using one of the setups presented in Chapter 3.

Chapter 6 gives an overview of in-air gestures as a means of human-computer interaction and the challenges that come with using them in collaborative environments, and provides a discussion of how these challenges can be explained and addressed.

Chapter 7 presents a gesture recognizer which helps address some of the challenges mentioned in Chapter 6, thus improving the interaction qualities of in-air gestures in collaborative environments. This chapter concludes by explaining the shortcomings of such gesture recognizers, exploring solutions, and presenting recommendations for how to address those shortcomings.

Chapter 8 discusses the contributions of this thesis and proposes future directions for research on using in-air gestures in collaborative environments.

Part I

Hardware for In-air Interaction in Collaborative Environments


2

Hardware for Detecting and Tracking In-Air Gestures

2.1 Introduction

The research on in-air hand gestures started in 1980 with "Put-That-There" [10], a voice-activated system for manipulating virtual objects on a large graphics display in an interactive room. To track the users’ hands, the system used a magnetic-field sensing technology, with transmitters installed in the room and sensors worn by the users on their wrists. The user could then select and move the virtual objects seen on a screen using a combination of pointing and voice commands (using a commercial voice recognizer).

2.1.1 Instrumented In-Air Interfaces

Use of handheld input devices is still a popular approach in gestural user interfaces: Nintendo’s Wii Remote [62], Sony’s Playstation Move [86], and more recently HTC’s VIVE Controller [107] are well known examples of such devices. Even though such technologies enable in-air gestural user interactions, they limit the user to simple in-air motions and prevent the users from using free-form hand gestures.

The advent of digital gloves [32], which detect the position of the user’s fingers using accelerometers, bend sensors, and optical goniometers (fiber-optic bend sensors), facilitated the use of in-air hand gestures. A well-known digital glove is CyberGlove [48], which uses resistive bend sensors on major hand and finger joints to detect the hand’s pose. One major problem of such data gloves is the need for dedicated, expensive electronics for each user. Moreover, these gloves need calibration for each user. These problems can be seen even in more advanced, recent data gloves, such as [31], which uses a stretchable capacitive sensor array, since all of these technologies rely on the relative measurement of joint angles or finger positions. Despite these shortcomings, data gloves have found their use cases in research, where users are willing to go through an initial preparation and setup, during which the gloves are also calibrated.

Figure 2.1: Handheld in-air input devices: (a) Wii Remote [62], (b) Playstation Move [86], (c) VIVE Controller [107]. These devices can be used for gestural user interfaces.

Figure 2.2: Two types of data gloves: (a) CyberGlove, a data glove using bend sensors; (b) a stretch-sensing data glove using a capacitive sensor array. Regardless of the sensor technology, both of these gloves need per-user calibration.

Advances in camera and computer vision technology motivated the advent of vision-based gloves, where markers are attached to fingers or joints, and their positions (and consequently the posture of the hand) are tracked using a camera. A well-known example of such a system is FingARTips [14], an urban planning AR system, where a head-mounted camera is used to track the user’s hands and fingers using AR markers (Figure 2.3). This enables the user to directly manipulate the objects and to give simple commands (such as copying and deleting) using gestures. The system uses a simplified model of the hand (Figure 2.4), and simple metrics such as the relative position of the fingers to each other, to the scene’s plane, or to the target object, in order to detect the user’s commands and the intended object.

Apart from being bulky and using an overly simplistic hand model, systems such as FingARTips suffer from a major problem: occlusion. The hand tracking system needs a line of sight, and the markers should always be in the camera’s view. This is a common problem with vision-based tracking in general, but it is particularly severe when tracking fingers and hand postures, since even if the full hand is in sight, fingers might occlude each other in certain poses (this is referred to as self-occlusion). Different solutions to these problems have been explored. For example, [81] uses a wrist-mounted camera to track the positions of the fingers. This reduces the possibility of occlusion by external objects, because the hand and fingers are always in direct line of sight of the camera and the camera is very close to the hand. Still, the device is not easy to wear, needs calibration, and might suffer from the self-occlusion problem for some gestures.

Figure 2.3: FingARTips setup [14]. A head-mounted camera tracks AR markers on the user’s hand.

Figure 2.4: FingARTips’ simplified model of the hand [14]

Figure 2.5: Image-based hand glove using a camera and visual markers [81]

More user-friendly approaches have been pursued by using colored data gloves, as in [109]. Here, the user wears an ordinary textile glove, which has a specific color pattern. An RGB camera captures the user’s hand, which is then compared to a database of existing patterns to infer the most likely posture (Figure 2.6).

One major advantage of data gloves is that inferring the posture is computationally inexpensive. In sensor-based data gloves, a one-to-one relation between sensors and joints is formed. For vision-based data gloves, feature extraction is simplified due to the existence of markers or color patterns.
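The following minimal sketch illustrates this point for a sensor-based glove: with a one-to-one sensor-to-joint mapping, a per-user linear calibration is enough to turn raw readings into joint angles. The joint names and calibration values are placeholders, not those of any particular glove.

```python
# Minimal sketch of why posture inference is cheap for sensor-based data gloves:
# each bend sensor maps one-to-one to a joint, so a per-user linear calibration
# (offset and gain per joint; the values below are placeholders) is all that is
# needed to turn raw readings into joint angles.

CALIBRATION = {                  # per-user, per-joint calibration parameters
    "index_mcp": (2.0, 0.35),    # (offset in degrees, degrees per raw unit)
    "index_pip": (1.5, 0.40),
    "thumb_mcp": (3.0, 0.30),
}

def joint_angles(raw_readings: dict) -> dict:
    """Convert raw bend-sensor readings into joint angles in degrees."""
    angles = {}
    for joint, raw in raw_readings.items():
        offset, gain = CALIBRATION[joint]
        angles[joint] = offset + gain * raw
    return angles

print(joint_angles({"index_mcp": 120, "index_pip": 90, "thumb_mcp": 60}))
```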

2.1.2 Uninstrumented In-Air Interfaces

Increases in computational power and advances in computer vision facilitated the implementation of bare-hand gestural interfaces. Powerful Graphics Processing Units (GPUs) and processors make the computationally expensive problem of tracking bare hands solvable in real time. Development of uninstrumented in-air interfaces is an important step towards the vision of Natural User Interfaces, as it makes the hardware even more invisible to the user, allowing for more freedom and, ideally, a better experience (as will be explained in Chapter 6, this is not as easily achieved as researchers once hoped). Many interactive systems use conventional color cameras as an input device for detecting in-air gestures. Using 2D images as input, these systems employ different computer vision algorithms to detect interactions in the air. An overview of these algorithms will be presented in Chapter 6. Additionally, the availability and affordability of better vision sensors, such as depth cameras, improve the quality of hand tracking. Using technologies such as time-of-flight, stereo vision, or structured light, these cameras can provide accurate, three-dimensional information about the state of the human body and hands. Since such 3D information can help with understanding and interpreting the gestures, many recent gestural tracking systems use such cameras.


Figure 2.6: Color data glove [109]

2.2 Conclusion

This chapter explored the history of the systems and hardware used for tracking in-air gestures. A categorization of such hardware into Instrumented and Uninstrumented was presented, with the former being hardware which requires users to wear a glove, attach markers, put a sensor on their hand, and so on, and the latter being hardware that can detect and track users’ bare hands. Chief among uninstrumented hand tracking technologies are depth cameras, which sense the 3D information of the scene, thus allowing for a more accurate and meaningful tracking and interpretation of hands and their gestures.

Due to their ease of use and setup, as well as an increase in their availability and affordability, uninstrumented hand tracking using depth cameras is becoming the tracking technology of choice. This is particularly the case for collaborative applications, where instrumenting multiple users’ hands is impractical. However, due to specific requirements of collaborative environments, such as a large tracking space and the presence of other interactive devices, their usage is not always straightforward.


In the next chapter, the feasibility of using different types of depth cameras in collaborative environments, the challenges of such use cases, and solutions to overcome these challenges are provided.

3

Tracking In-Air Gestures in Collaborative Environments Using Commodity Hardware

The usage of commodity hardware in collaborative environments allows system designers to integrate hand tracking and gesture recognition capabilities into a collaborative system, thus enhancing the user experience of the participants. As depth cameras are one of the most prominent technologies for uninstrumented hand tracking, they are the main technology used for such scenarios.

The main constraint that a depth camera should fulfill in order to be used in a collaborative environment is its capability of tracking hands in the communication space. More concretely, the range of the tracking system should cover the whole communication space, or at least an important part of it.

As already depicted in Figure 1.1, the location of this communication space depends on how the collaborative environment is set up. For a collaborative environment with a horizontal display (tabletop), the interaction space is above the display, whereas for collaborative environments with a vertical display, the communication space can be anywhere in front of and around the display.

Moreover, the size of the communication space also depends on whether the display is vertical or horizontal. Vertical displays are usually larger than horizontal displays (since a large display is too heavy and will bend under its own weight if put horizontally), and so is the communication space around them. In fact, the communication space in such a collaborative environment might include the whole room, as the users might move in the environment during collaboration. Thus, a single depth camera might not cover the whole communication space, and using multiple depth cameras might be necessary. This introduces several challenges, for example due to the possibility of multiple cameras interfering with each other. In contrast, for a small tabletop setup, a single depth camera can cover the whole interaction space, since the collaboration usually happens with the users sitting or standing around the tabletop.

This chapter explores approaches for using depth cameras in collaborative environments. First, an introduction to depth cameras, their different types, and their working principles is presented in Section 3.1, as it helps with understanding the decisions made in the remainder of this chapter. As the type of display used in the collaborative environment plays a significant role in the setup of the tracking system, the remainder of this chapter is split into two sections: Section 3.2 presents solutions for tracking hand gestures in tabletop collaborative environments, while Section 3.3 presents a solution for a larger collaborative environment with a vertical display.

Students Philipp Sinn and Luzius Brogli are acknowledged for making major contributions to the work of this chapter through their Master’s and Bachelor’s theses, which were supervised by the author.

3.1 Depth Cameras

A depth camera is a device which is capable of detecting the depth of the objects in its field of view. Commodity depth cameras employ different techniques to achieve this.

A prominent example of such cameras, the Microsoft Kinect 360 [68] (hereinafter referred to as Kinect V1), uses a technique known as structured-light 3D scanning. In this technique, a projector emits a known pattern, known as the reference, usually using infrared light (so that the pattern is invisible), and a camera observes the pattern as it is projected onto the scene. The observed pattern is then compared with the reference, and the distortion between the observed pattern and the reference is used to infer the depth map of the scene. Such depth cameras are attractive choices for commodity setups, because they are built of simple components and are thus affordable.

Time-of-Flight (ToF) is another technique employed by depth cameras. In this technique, light is emitted from the depth camera and its round-trip time from the camera’s emitter to its sensor is measured. Since the speed of light is known, it is possible to measure the traveled distance and build a depth map of the scene.

One of the challenges of ToF depth cameras is that, due to the high speed of light, accurate measurement of the round-trip time requires highly accurate time measurements, which leads to very expensive components. One solution to this problem is to indirectly measure the time of flight. For example, the Microsoft Kinect One (hereinafter referred to as Kinect V2) uses a technique known as Amplitude Modulated Continuous Wave Time of Flight (AMCW ToF). Here, the intensity of the infrared light is continuously changed in order to achieve a continuous sinusoidal signal. Each pixel in the camera’s sensor then receives the reflection of this signal from the scene. The camera then reconstructs the per-pixel signal received by the sensor. This reconstructed signal is then compared with the reference signal, and their phase shift is used to calculate the time shift, and hence the depth of each pixel. Due to the periodic nature of the emitted signal, cameras using this technique only work within a limited range (since outside of that range, the phase shift cannot be unambiguously correlated to the time shift; this range can be controlled by selecting an appropriate modulation frequency for the desired range, and can be extended by using multiple modulation frequencies). Kinect V2, for example, can track objects in a 0.40 m to 4.5 m range.
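To make the relation between modulation frequency, phase shift, and depth concrete, the following sketch computes both the depth implied by a measured phase shift and the unambiguous range of a single modulation frequency. The 30 MHz example frequency is an assumption for illustration, not a Kinect V2 specification.

```python
import math

C = 299_792_458.0  # speed of light in m/s

def depth_from_phase(phase_shift_rad: float, mod_freq_hz: float) -> float:
    """Depth implied by the measured phase shift of the modulated signal.

    The light travels to the object and back, so the round-trip time is
    phase / (2*pi*f); the one-way depth is half the corresponding distance.
    """
    return C * phase_shift_rad / (4.0 * math.pi * mod_freq_hz)

def ambiguity_range(mod_freq_hz: float) -> float:
    """Maximum unambiguous depth: beyond this, the phase wraps around."""
    return C / (2.0 * mod_freq_hz)

# Example with an assumed 30 MHz modulation frequency:
f = 30e6
print(depth_from_phase(math.pi / 2, f))  # ~1.25 m for a 90 degree phase shift
print(ambiguity_range(f))                # ~5.0 m unambiguous range
```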

Since each pixel needs several readings to reconstruct the reflected signal, AMCW ToF cameras need faster sensors and processors than those of the structured-light cameras to achieve similar frame rates.


As a result, they are usually more expensive than structured-light cameras, but significantly less expensive than direct ToF cameras.

Stereo vision is another technique used for computing the depth map of a scene. Having two cameras looking at the scene from two known vantage points and with known viewing angles, it is possible to infer the depth of an object based on the disparity of the position of that object in the two camera planes. One of the fundamental challenges of stereo-vision depth sensing is that the system should know that an object seen by one camera is the same object seen by the other camera. More generally, the system should be able to find and match the corresponding pixels from both cameras. This is usually addressed by extracting features from each image and matching similar features to each other. To simplify this feature extraction, several techniques can be employed (marker-based feature matching, using structured light, and so on), although recent advanced computer vision algorithms can do a decent job of feature extraction without using such techniques.
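Once corresponding pixels have been matched, the depth itself follows from simple triangulation. The sketch below shows the standard relation for rectified cameras; the focal length, baseline, and disparity values in the example are assumptions for illustration.

```python
def depth_from_disparity(focal_px: float, baseline_m: float,
                         disparity_px: float) -> float:
    """Depth of a matched point seen by two rectified cameras.

    focal_px: focal length in pixels, baseline_m: distance between the cameras,
    disparity_px: horizontal shift of the matched pixel between the two images.
    """
    if disparity_px <= 0:
        raise ValueError("disparity must be positive for a point in front of the cameras")
    return focal_px * baseline_m / disparity_px

# Example with assumed parameters: 700 px focal length, 6 cm baseline,
# 35 px disparity -> depth of 1.2 m.
print(depth_from_disparity(700.0, 0.06, 35.0))
```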

Moreover, the accuracy of the depth inference depends on the accuracy of the angles and positions between the cameras. The cameras also need to be calibrated, and any change in the setup requires a recalibration. As a result, setting up a stereo-vision depth camera using commodity cameras is challenging and error-prone. Such problems can be avoided by using pre-calibrated stereo-vision cameras. These devices consist of two pre-calibrated cameras, packaged into a single device.

Leap Motion [61] is such a device. The device incorporates two calibrated infrared cameras, embedded into a small package, and constantly projects infrared light. The cameras are thus able to see the reflection of the objects in their field of view. The device is specifically designed for articulated hand and finger tracking, and has dedicated software to extract features from the images, match the pixels from the two cameras to each other, and infer skeletal models of the hands and fingers. While Leap Motion provides many benefits, it achieves its high (sub-millimeter) accuracy at the cost of a short tracking range (60 centimeters).

Table 3.1 presents a summary of the features of each depth camera described in this section.

Name          Technology        Range           Field of View (H x V)   Type of Light
Kinect V1     Structured Light  0.8 m - 4 m     57.5° x 43°             Infrared
Kinect V2     AMCW ToF          0.5 m - 4.5 m   70° x 60°               Infrared
Leap Motion   Stereovision      < 60 cm         150° x 120°             Infrared

Table 3.1: Comparison of commodity depth cameras. H x V in the Field of View column stands for Horizontal and Vertical, respectively.


3.2 Setup of a Collaborative Environment with a Horizontal Display

The setup of a collaborative environment starts with the choice of the display. At the time of this research, one of the major interactive tabletops was Microsoft PixelSense [69], an infrared multi-user and multi-touch tabletop. Its affordability, the availability of its hardware, and the presence of extensive software libraries made PixelSense the tabletop of choice for tabletop collaboration.

In order to decide how to track the hand gestures that happen above this tabletop, the type and location of these gestures first need to be studied. A pilot study with three participants was thus performed, in which the users collaborated on a brainstorming task. As the results clearly show, the overwhelming majority of gestures (93%) were pointing gestures (see Figure 3.1).

Figure 3.1: Ratio of different gesture types to the total gestures

Not all pointing gestures are equal; they differ both in the manner in which they are performed and in the context that they add to the communication. Figure 3.2 shows the types of pointing gestures that happened above the tabletop, as well as the ratio of deictic pointing gestures (that is, those that refer to an artifact in the task space). Overall, 38% of these pointing gestures are deictic. The study also showed that most of these deictic gestures happen within a height of 465 millimeters above the screen. Thus, the Interaction Volume (IV), that is, the space in which the deictic gestures happen, is defined as a cuboid, with its base being the same size as the screen (88.55 cm x 49.81 cm) and its height being 465 mm.
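A minimal sketch of how a tracked fingertip can be tested against this interaction volume is shown below; the coordinate frame (origin at a screen corner, z pointing up from the display) is an assumption made for illustration.

```python
# Illustrative test for whether a tracked fingertip lies inside the Interaction
# Volume (IV) defined above: a box with the screen as its base (885.5 mm x
# 498.1 mm) and a height of 465 mm. The coordinate frame (origin at one screen
# corner, z perpendicular to the display) is an assumption.

IV_WIDTH_MM, IV_DEPTH_MM, IV_HEIGHT_MM = 885.5, 498.1, 465.0

def in_interaction_volume(x_mm: float, y_mm: float, z_mm: float) -> bool:
    """Return True if the point lies within the interaction volume."""
    return (0.0 <= x_mm <= IV_WIDTH_MM and
            0.0 <= y_mm <= IV_DEPTH_MM and
            0.0 <= z_mm <= IV_HEIGHT_MM)

print(in_interaction_volume(400.0, 250.0, 120.0))  # True: above the screen centre
print(in_interaction_volume(400.0, 250.0, 600.0))  # False: above the IV
```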


Figure 3.2: Different types of pointing gestures, and their ratio to the total number of pointing gestures

3.2.1 Single Camera Setup Using Kinect V1

Given the information presented in Section 3.1, a setup with a single Kinect V1 depth camera was designed. In addition to being the most affordable of the available options, Kinect V1’s range and field of view can easily cover the interaction volume.

In order to provide the depth camera with an unobstructed view of the participants’ hands, it is set to have a bird’s eye view onto the PixelSense.

Given Kinect V1's field of view (57.5° x 43°), the desired tracking distance, and the fact that the accuracy of depth measurement drops as the distance to the camera increases, Kinect V1 is set up at the minimum distance that still allows tracking the full IV. The schematic of this setup is depicted in Figure 3.3, and the realized setup is shown in Figure 3.4.
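The minimum mounting distance follows from the camera's field of view and the table footprint. The following Python sketch illustrates this geometry under the simplifying assumptions that the camera is centered above the table and its optical axis is perpendicular to the surface; the helper name is hypothetical:

    import math

    def min_mount_height(width_m, depth_m, hfov_deg, vfov_deg):
        """Smallest camera-to-table distance at which the full table footprint
        (and hence the interaction volume above it) fits into the field of view."""
        need_h = (width_m / 2) / math.tan(math.radians(hfov_deg) / 2)
        need_v = (depth_m / 2) / math.tan(math.radians(vfov_deg) / 2)
        return max(need_h, need_v)

    # PixelSense footprint and Kinect V1 field of view from this chapter.
    print(round(min_mount_height(0.8855, 0.4981, 57.5, 43.0), 2))  # ~0.81 m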

The main problem with such a setup is that it causes unwanted touch detections on PixelSense. PixelSense uses an infrared light source which projects infrared light from the back of the screen. A matrix of infrared sensors then detects the reflections on the screen, which can be caused by placing a finger, a palm, or any other IR-reflective object on the screen. These reflections are then analyzed using a computer vision algorithm to infer the type of object(s) that caused them.

Consequently, from the point of view of Kinect’s sensor, the whole screen appears as a bright plane, and thus the projected patterns cannot be distinguished from PixelSense’s backlight. Moreover, the glossy frame around the screen is reflective to infrared light, and thus its depth cannot be properly inferred. As a result, the whole table surface cannot be detected by Kinect’s depth sensor. These situations are shown in Figure 3.5.

Notably, as depicted in Figure 3.5c, any object above the screen can still be detected and its depth inferred by Kinect, because the lower side of the object blocks the table's infrared light while the dot pattern on the upper side of the object can be seen by Kinect. But, as shown in Figure 3.5d, Kinect's projected pattern is seen by PixelSense, which causes noticeable noise and makes PixelSense register 'ghost' touches and objects (that is, touches and objects are detected on the screen even though no real touch or object is present).

In order to study the characteristics of these unwanted touch events, two separate recordings were performed.


Figure 3.3: Design of the bird's eye view of Kinect over PixelSense [55]

In the first recording, the Kinect camera was turned on, without any user touching the screen. In the second recording, the camera was turned off and a user was working with PixelSense, performing typical tasks: tapping, double-tapping, and dragging.

The results of these recordings are shown in Figure 3.6.

Three important insights can be gained from these recordings. First, the number of ghost touches when a Microsoft Kinect is turned on is very high. In 15 minutes of recording, 9361 ghost touch events were detected. This corresponds to 624 ghost touches per minute, which makes PixelSense unusable.

Secondly, Kinect's projected pattern causes touch events with a small variety of sizes. That is, the histogram of sizes of such events is quite sparse. This can be explained by the fact that Kinect's projected pattern is static and consists of a limited constellation of dots. Thus, the constellation of sensors that are triggered by such patterns is also limited. On the other hand, touches that happen due to real user interaction vary in size, as the contact area of a human finger constantly changes while the user presses down his or her finger and while the finger is dragged on the screen. These are depicted in Figure 3.6. Despite this difference, many noise touch events have sizes similar to those of real touch events, which complicates filtering them based on size alone.

Thirdly, unlike the sizes, the durations of events in the two recordings are distributed similarly: whether triggered by Kinect's projector or by reflections from the user's fingers, PixelSense's sensors are activated for a wide variety of durations.


Figure 3.4: The realized setup of Kinect and PixelSense [55]

This is unexpected for ghost touches: because the projected pattern is constant, one might expect the ghost touches to persist for a long time (in fact, one would expect constant ghost touch events for as long as Kinect's projection is on). This discrepancy between the expected duration of noise touches and the measured ones can be explained by the fact that variations in PixelSense's projector intensity, as well as in the infrared component of the ambient light, can make the sensors oscillate between saturated and unsaturated states. That is, the superimposition of the ambient infrared light and Kinect's pattern is at times marginally above the threshold of PixelSense's infrared sensors, and at times below it.

To confirm this hypothesis, a simple experiment was performed in which the ambient light was controlled by turning off all the lights. This reduction in ambient light reduces the intensity of the total infrared light observed by PixelSense's sensors, and it eliminated all the ghost touches. While this experiment suggests some solutions to the problem (i.e. keeping the collaborative environment dark, or controlling the infrared component of the lights), such solutions are not practical.

Based on the insight gained from this simple experiment, another approach for reducing the intensity of the infrared light seen by PixelSense's sensors was taken: reducing the intensity of the light projected by Kinect. Because Kinect is capable of tracking objects up to 3 meters away, and since in the current setup it only needs to sense objects within an 80 cm distance, it is possible to reduce the intensity of its emitted light while still sensing the target objects.

Since changing the internals of Kinect is neither possible nor practical, a number of experiments with external optical attenuators, such as diffusion films and plexiglass sheets, were conducted. Such filters either


(a) The scene, as seen by Kinect's IR camera. (b) User's hand, as seen by PixelSense's sensor when Kinect's IR projector is off.

(c) Depth map of the scene, as inferred by Kinect. The darker the pixel, the closer it is to the camera. Black indicates pixels whose depth could not be inferred. (d) User's hand, as seen by PixelSense's sensor when Kinect's IR projector is on.

Figure 3.5: How Kinect and PixelSense see a user and his hand [55].

(a) Registered touch events from Kinect (b) Registered touch events from the user with Kinect off

Figure 3.6: Size and duration of real and ghost touches [55]


deformed the projected pattern, or did not attenuate the infrared light significantly.

The best results were obtained by attaching a linear polarization filter to Kinect's projector. Because PixelSense's sensors are behind a linear polarization filter, arranging Kinect's polarization filter perpendicular to PixelSense's further decreased the number of ghost touches significantly. Within 15 minutes of recording, only 281 ghost touches were detected (corresponding to roughly 19 events per minute), a 97% drop in ghost touches compared to the original setup. Moreover, the distribution of touches changed to a more manageable one: all ghost touches had durations below 50 milliseconds, and they also showed less variety in their sizes. See Figure 3.7.
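Given this observation, residual ghost touches can in principle be discarded with a simple duration filter. The sketch below is a minimal illustration under the assumption, suggested by the measurements above, that remaining ghost touches vanish within about 50 milliseconds; the TouchEvent container is hypothetical, and very short real taps would also be dropped by such a filter:

    from dataclasses import dataclass

    @dataclass
    class TouchEvent:
        # Hypothetical container for what the tabletop reports per contact.
        start_ms: float
        end_ms: float
        size_px: int  # contact area in sensor pixels

    def filter_ghost_touches(events, min_duration_ms=50.0):
        """Keep only contacts that last at least `min_duration_ms`, on the
        assumption that residual ghost touches all vanish within ~50 ms."""
        return [e for e in events if (e.end_ms - e.start_ms) >= min_duration_ms]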

(a) A linear polarization filter attached in front of Kinect's infrared projector (horizontally). (b) Arrangement of Kinect and PixelSense with the linear polarization filter installed.

(c) Registered touch events from Kinect after the filter is applied.

Figure 3.7: Linear polarization filter for attenuating Kinect’s effect on PixelSense [55]

To verify whether the intensity of Kinect projector is still high enough to detect the farthest objects in this setup, a thin paperback book was put on the screen, and the depth of a fixed point on it, as reported


by Kinect with and without the filter, was recorded. As the results show, applying the filter introduced an error in the depth measurement. Moreover, applying the filter has an adverse effect on the hand's depth map. As shown in Figure 3.8, while the arm and the hand are still detected, the fingers are generally lost and cannot be tracked by the depth camera.

Figure 3.8: User’s hand, arm and fingers as seen by the depth camera [55]

Thus, even though such a setup can still be viable for tracking coarse hand movements in the communi- cation space, it is not a viable setup for tracking pointing and deictic gestures.

3.2.2 Multiple Camera Setup Using Leap Motion

To track individual fingers, another setup is realized. Here, instead of using a single sensor with a bird’s eye view, multiple low-range sensors are installed around PixelSense. This setup allows for noise-free interaction with PixelSense, and also enables articulated finger tracking.

But because, unlike in the previous setup, the depth cameras are not aware of the position of the table, the pointing trajectory of the users' fingers cannot directly be intersected with the table's plane. This is addressed by an initial calibration phase. During this phase, three circles are shown on the tabletop's screen, and the users are asked to touch them. When a touch event happens, the position of the tip of the finger, as seen by the Leap Motion, is also recorded. After all three circles have been touched, the plane of the table is reconstructed in the coordinate system of the Leap Motion. This is repeated for each camera.

After calibration is done, the vectors of the pointing gestures are intersected with PixelSense's plane to find the approximate target of a pointing gesture.
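A minimal Python sketch of the calibration and intersection steps is given below; the function names and data layout are hypothetical, and the actual implementation may differ:

    import numpy as np

    def table_plane_from_touches(p1, p2, p3):
        """Reconstruct the tabletop plane from three calibration touches,
        all expressed in the Leap Motion's coordinate system."""
        p1, p2, p3 = (np.asarray(p, dtype=float) for p in (p1, p2, p3))
        normal = np.cross(p2 - p1, p3 - p1)
        normal /= np.linalg.norm(normal)
        return p1, normal  # a point on the plane and its unit normal

    def intersect_pointing_ray(origin, direction, plane_point, plane_normal):
        """Intersect the finger's pointing ray with the table plane.
        Returns None if the ray is (nearly) parallel to the plane."""
        origin = np.asarray(origin, dtype=float)
        direction = np.asarray(direction, dtype=float)
        denom = np.dot(plane_normal, direction)
        if abs(denom) < 1e-6:
            return None
        t = np.dot(plane_normal, plane_point - origin) / denom
        return origin + t * direction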

3.3 Setup of a Collaborative Environment with a Large Vertical Display

Since collaboration in collaborative environments with large vertical displays usually happens in larger rooms, and users might move around in such environments, the cameras used for tracking gestures need to have large tracking ranges. Since most commodity depth cameras have limited ranges, it is also necessary to use multiple cameras in such environments.

Thus, Kinect V2, which has the largest tracking range among the shortlisted depth cameras, was selected. Multiple Kinect V2s have already been used in other research. For example, [27] uses multiple Kinect V2s for quantitative gait analysis over a 10 meter walking distance. As shown in Figure 3.11, the cameras in this setup are far away from each other (1.5 meters), and each has an angle of 70° relative to the direction of the walkway. That is, the setup is specifically designed for a user walking along a straight walkway.


Figure 3.9: Multiple Leap Motions set up around a PixelSense tabletop [55]. The angle of the Leap Motions prevents interference between them and PixelSense, so no ghost touches are registered.

Figure 3.10: The touch position is inferred by the camera and the tabletop in different coordinate systems.


Figure 3.11: A multiple Kinect V2 setup to extend the tracking range of users walking in a straight walkway [27]

But in a collaborative environment, where users might freely move around, different arrangements are usually necessary. A close look into the working principles of Kinect V2, as explained in Section 3.1, suggests that multiple Kinect V2s might be affected by interference. Thus, it is important to know whether or not there is a possibility of interference between multiple Kinect V2s, and if so, under which circumstances and in which kinds of arrangements.

3.3.1 Methodology

To study the effects of using multiple Kinect V2s on depth measurement in collaborative environments, a study with two Kinect V2s in a variety of arrangements was conducted. These variations were achieved by fixing the position and angle of one sensor and moving and rotating the other one. The room in which the cameras were set up was not emptied and was furnished with typical furniture (such as desks, cupboards, etc.), which resembles a typical room used for ad-hoc meetings and collaborative work. A 100 mm by 100 mm flat surface, covered with paper in order to reduce specular reflection, was then selected as the region of interest (see Figure 3.12). Using a laser range finder, the fixed camera was set up to be parallel to the region of interest.

Using this setup, the ground truth was first established by measuring the average depth of the surface, observed over 100 frames, using only the fixed camera, while the other camera was turned off.

The effect of interference was then measured in two different settings: in one, the moving camera was facing the fixed one, and in the other, it was facing the region of interest. In both cases, the average depth of the surface, measured by the fixed camera over 100 frames, was calculated, and the difference between this measurement and the ground truth was studied. These measurements were repeated over several sessions.
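The following Python sketch illustrates how such a per-session error can be computed, assuming each recording is available as a sequence of depth maps in millimeters; the helper and data layout are hypothetical:

    import numpy as np

    def mean_roi_depth(frames, roi):
        """Average depth of the region of interest over one recording.

        `frames` is an iterable of H x W depth maps in millimeters; `roi` is a
        (y0, y1, x0, x1) slice covering the 100 mm x 100 mm target patch.
        Zero-valued pixels (no depth reading) are ignored."""
        y0, y1, x0, x1 = roi
        samples = []
        for depth in frames:
            patch = depth[y0:y1, x0:x1].astype(float)
            valid = patch[patch > 0]
            if valid.size:
                samples.append(valid.mean())
        return float(np.mean(samples)) if samples else float("nan")

    # Interference error of one session: difference to the ground truth that was
    # recorded with the second camera switched off.
    # error_mm = mean_roi_depth(session_frames, roi) - ground_truth_mm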

Over most of the measurement sessions, no effect of interference was observed. Occasionally, though, when the camera constellations were too similar to each other (i.e. when the cameras were facing the same target from similar angles), a significant error in depth was observed. This is depicted in Figure 3.13.


(a) Sensors facing each other (b) Sensors facing the target

Figure 3.12: Sensor arrangements for interference measurement [58]

Even though sporadic and occasional, such error was observed multiple times, and only when the sensors were both facing the target from similar distances and angles (in particular, angles between 10° and 40°).

Figure 3.13: Depth error between two Kinect V2s, when both face the same target [58]. The error was measured when the moving sensor was at distances of 0.5 m, 1 m, and 1.35 m to the target. The distance between the target and the fixed sensor was 1345 mm.

Finally, another setup, common in collaborative environments, was analyzed, where depth cameras are installed facing the display. This arrangement is typically used for tracking the body of the user in front of the vertical display. When a single Kinect V2 is used, a black spot (indicating no depth information) is observed by the camera. This is the reflection of the infrared emitter, and due to its high intensity, its depth cannot be inferred by the camera. See Figure 3.14.


Figure 3.14: A Kinect V2 facing a vertical display. A black spot (indicated by a red rectangle), where Kinect V2 cannot infer any depth information, is caused by the surface's reflective material [58].

When two Kinect V2s are used, in addition to the reflection of its own emitter, another black spot is occasionally observed, which can be attributed to the reflection of the other camera's infrared emitter. As seen in Figure 3.15, this black spot does not appear constantly, and its shape and size change.

Since Kinect V2's software and hardware are closed source, it is difficult to know the exact reason for either of these behaviours. But the most probable explanation, which accounts both for the sporadic error in depth measurements and for the occasional observation of the second Kinect V2 by the first one, is that the cameras' infrared emissions are occasionally in sync and thus cannot be filtered by the camera. This also explains why this phenomenon only happens when the sensors are in a similar arrangement relative to the target: if the angle of emission, or the phase shift, differs strongly, Kinect V2's software or hardware filters can filter the other camera's signal out.

3.4 Conclusion

This chapter presented solutions for setting up depth cameras in collaborative environments.

For horizontal displays, a Kinect V1 installed over a Microsoft PixelSense causes major interference. It was shown that filtering using a linear polarization filter and controlling the infrared component of the ambient light can address this problem, and the resulting depth information is good enough for tracking the palm and wrist, even though individual fingers cannot be reliably tracked. For tracking articulated hands and individual fingers, a carefully arranged multi-camera setup using Leap Motion devices is a viable solution.

For larger collaborative environments, which usually include a vertical display, using multiple Kinect V2 cameras is a viable solution, even though care needs to be taken to ensure the cameras are not too close to each other. Moreover, arranging two cameras in front of a vertical display causes occasional disturbances due to reflections of the cameras' infrared emitters.


Figure 3.15: Two Kinect V2s facing a vertical display. A black spot (indicated by a red rectangle) is always observed by the Kinect V2 used for measurement. Another black spot (indicated by a yellow rectangle) is occasionally observed, with varying shapes and sizes, which can be attributed to the infrared emitter of the other camera.

In the next chapters, the lessons learned in this chapter are used to set up collaborative environments capable of tracking and interpreting in-air gestures for enhancing collaboration.


Part II

Integrating and Interpreting In-Air Gestures for Enhanced Collaboration


4

An Accessible Environment to Integrate Blind Participants into Brainstorming Sessions

4.1 Introduction

In Chapter 3, a number of setups for tracking in-air gestures in collaborative environments were proposed. Such setups can be used to enhance the quality of collaboration in different scenarios. One such scenario is when one or more participants are blind or visually impaired. During a typical digital collaborative work session, such participants usually face serious challenges, since a significant portion of the information exchange during such meetings is visual: not only the digital artifacts in the task space, but also much of the information in the communication space is visual.

Much of the research in accessibility for Blind and Visually Impaired (BVI) users focuses on the accessibility of digital artifacts. Some of this research focuses on presenting challenging textual content, such as mathematical notation [8], while other work focuses on making graphical information, such as line graphs [90] or Unified Modeling Language (UML) diagrams [51], accessible to BVI users.

This chapter studies the feasibility and effectiveness of making in-air gestures accessible to the BVI participants in a brainstorming meeting using a mind map, a hierarchical diagram that is used for visual organization of information.

Much of the work in this chapter is the result of a collaboration with Dr. Klaus Miesenberger, Stephan Pölzer, and their group from Johannes Kepler University of Linz and Dr. Max Mühlhäuser, Dr. Dirk Schnelle-Walka, and their group from Technical University of Darmstadt, and their contributions are acknowledged by the author.

4.2 System Setup

The presence of BVI participants in the room makes a tabletop setting more practical: users do not move around, which helps the BVI participants to be spatially aware of the position of the other participants. As such, the setup from Section 3.2 is used to enable tracking of in-air gestures.

The BVI participant sits at a desk next to the touchscreen with his or her accessible laptop (Figure 4.1). This arrangement allows the BVI participant to comfortably follow and engage in the discussions while interacting with the mindmap application through his or her accessible laptop.

Figure 4.1: Sighted participants and the BVI participant collaborating using the realized setup. [88]

Software for creating, manipulating, and presenting the mindmap was also developed. The software works in two modes: a touch mode, used on the tabletop, which presents the mindmap visually and allows sighted participants to create and manipulate the mindmap using touch interaction, and a GUI mode, which allows the BVI participants to work with the mindmap using conventional GUI controls, made accessible to BVI users through a keyboard and an accessible output (a Braille display or a screen reader).

4.3 Detection and Interpretation of Deictic Gestures

While the presented setup enables tracking hands and fingers, it is still necessary to properly detect and interpret the deictic gestures. One obvious approach for detecting deictic gestures is to track the trajectory of the index finger and intersect its pointing line with the table. Using this approach, the error in detecting the target of a pointing gesture depends not only on the inherent measurement error of the sensor, but also on the distance of the finger to the tabletop (see Figure 4.3).
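One simplified way to see this relationship (an illustrative model with assumed numbers, not the error analysis of the thesis) is to bound the lateral displacement of the estimated target by a fixed positional sensor error plus the lever effect of an angular error over the finger-to-table distance:

    import math

    def lateral_error_mm(finger_height_mm, angular_error_deg, sensor_error_mm=0.1):
        """Rough bound on how far the estimated target can land from the intended
        one: a fixed positional sensor error plus the lever effect of an angular
        error in the pointing direction over the finger-to-table distance."""
        return sensor_error_mm + finger_height_mm * math.tan(math.radians(angular_error_deg))

    # e.g. a 2 degree direction error at 300 mm above the table shifts the
    # estimated target by roughly 10.6 mm.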

To mitigate this, visual feedback is provided which indicates to the users how the system tracks their pointing gestures, so they can correct them. For each sensor, feedback is provided in the form of a semi-transparent circle which highlights the target of the pointing gesture. Since the system should support simultaneous tracking of pointing gestures by multiple users, a unique color is assigned to each


Figure 4.2: Graphical User Interface for the BVI participants. BVI users can interact with this interface using a keyboard as the input modality, and a Braille display or screen reader as the output modality. Mindmap events, such as adding or changing nodes, as well as the selection of a node (with a pointing gesture), appear as messages in the lower part of the interface. [87]

Figure 4.3: Error span of tracking [57]


sensor's highlighter. It is important to note that, as explained in Chapter 3, each sensor is dedicated to one user (see Figure 4.1). Thus, each user can follow the feedback of his or her pointing gesture by following the highlighter with the color assigned to them. This, in combination with the Leap Motion device's high measurement accuracy (±0.1 mm), provides a usable solution for tracking pointing gestures despite measurement and performance inaccuracies.

Moreover, it is necessary to distinguish deictic pointing gestures from simple pointing gestures. Since deictic gestures are co-verbal pointing gestures [21], it is possible to interpret any pointing gesture that happens simultaneously with words like “this”, “that”, or “there” as a deictic gesture. To achieve this, a microphone array is installed in the meeting room. As opposed to a single microphone, the array allows for better audio quality in a noisy environment. A speech recognizer is used to detect the characteristic words “this”, “that”, and “there”, and the blind user is notified if any of these words is detected ‘in parallel’ with a pointing gesture. ‘Parallel’ is defined according to [76], which defines parallel modalities as inputs that appear in the same time span.

In order to decide on this time span, the transcript and video from the study in Section 3.2 were used, and the durations of different types of gestures were measured. The duration is measured from the time the finger hovers above the target to the time it is moved away from the target. The results are shown in Figure 4.4.

Figure 4.4: Duration of gestures in a brainstorming meeting [57]. Relevant gestures are deictic, while irrelevant gestures are not.

The results showed that a pointing gesture can last for up to 3.5 seconds and that the characteristic words appear within this time window.

Thus, when a pointing gesture is detected, the system looks for characteristic words that appear in a 3.5-second window around it, and if one is found, it triggers a signal to the BVI participant. The trigger is a simple audio signal that the BVI participant can hear using his or her headphones. Moreover, the node in the mindmap that the highlighter is closest to, if any, is assumed to be the target of the deictic gesture and appears in the BVI user interface. Thus, upon hearing the audio signal, the BVI user can activate the screen reader to hear the content of the node that was pointed at.
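The following Python sketch illustrates this pairing logic, assuming timestamped gesture and keyword events from the tracker and the speech recognizer; the symmetric placement of the 3.5-second window around the gesture is an assumption, and the function and its inputs are hypothetical:

    def deictic_targets(pointing_gestures, keywords, window_s=3.5):
        """Pair pointing gestures with characteristic words ("this", "that",
        "there") that fall inside a 3.5-second window around the gesture.

        Each gesture is a (timestamp_s, target_node) tuple and each keyword a
        (timestamp_s, word) tuple from the speech recognizer."""
        triggers = []
        for g_time, node in pointing_gestures:
            if any(abs(w_time - g_time) <= window_s / 2 for w_time, _ in keywords):
                triggers.append(node)  # notify the BVI participant about this node
        return triggers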


4.4 User Studies

Using this setup, four user studies were performed. The main question that the user study tried to answer was whether such a setup enhances the experience of BVI participants when brainstorming with sighted participants.

Each user study involved a collaborative brainstorming meeting with BVI and sighted participants. Of the four groups, three had one BVI participant and two sighted participants, and one had one BVI participant and one sighted participant. One sighted participant, who was familiar with the system, participated in all of the meetings in order to offer guidance and answer other participants’ questions about the operation of the system, while the rest of the participants (sighted and BVI) were different for each meeting.

Of the four BVI participants, two were involved in the project and were familiar with the concept of a mindmap as well as the user interface of the software, while both were completely new for the other two BVI participants. For these users, an initial introduction was given to familiarize them with the concept of a mindmap, as well as with the software. The sighted users did not need an initial introduction to the concept of a mindmap, or to learn how to use the software, since they were familiar with the former, and the software user interface in the touch mode was easy to use and similar to software they already knew.

For each meeting, a different topic was chosen by the participants: ‘study’, ‘holiday’, ‘restaurant delivery service’, and ‘organization of the institute’s 25 year anniversary’. After the topic was chosen, participants were asked to develop ideas related to the topic of discussion in a 15-minute brainstorming session, and to use the mindmap software to help with the brainstorming process.

4.5 Evaluation and Results

In order to evaluate the effectiveness of the system in enhancing the experience of BVI participants, the BVI participants were asked to reflect on the following:

1. Effect and importance of communicating the pointing gestures

2. General idea of presenting the information

3. Understanding of the mindmap and the changes to it during the meeting

All the BVI participants appreciated that they had access to the non-verbal communication that happens among the sighted participants. Having access to a previously inaccessible form of communication helped them to gain a better understanding of the mindmap. They especially liked that they did not need to rely solely on verbal communication when a deictic gesture happened, since the target of the pointing gesture was available to them via the user interface.

All the BVI participants evaluated the general idea positively and considered it a suitable system for enhancing their collaborative experience. This was also confirmed by observing that all the BVI participants immediately engaged in the idea generation process by adding ideas to the mindmap using the


provided software.

Another piece of positive feedback concerned the fact that the system allows the BVI participants to not only follow the meeting but also actively contribute to it by adding and editing nodes in the mindmap. All of the participants liked that their participation was enhanced because they could affect the discussion and the meeting in real time, so they did not feel that they slowed down the process.

Moreover, they did not perceive the mindmap as too complex. The BVI participants mentioned that they had an understanding of the mindmap and were able to follow the changes that happened to it during the meeting. This was confirmed by the fact that the BVI participants were fully involved and integrated in the generation and manipulation of the mindmap using the provided software.

4.6 Conclusion

This chapter showed how the tabletop setup presented in Chapter 3 can be used to enhance the experience of blind and visually impaired users. The performed user study showed that even a simple integration of in-air gestures, especially pointing gestures, can enhance the experience of BVI participants. As a result, the system was accepted by all the participants, and it specifically helped the BVI participants to actively contribute to the generation of the mindmap and to follow the discussion that happens around it. In other words, the system reduced the information gap between BVI and sighted participants. Moreover, providing visual feedback in the form of a highlighter enabled the sighted users to direct their pointing gestures towards the topic of their attention, without being distracted by it, hence reducing the technical accuracy requirements of tracking the pointing gesture. In conclusion, the results showed that even a non-visual representation and interpretation of pointing gestures can enhance the engagement and satisfaction of participants in a brainstorming meeting with blind and visually impaired participants.

Building on the lessons learned in this chapter, Chapter 5 will study how pointing gestures can be minimally represented in remote collaboration in order to enhance the experience of the remote participant.

5

Using In-Air Gestures for Enhancing Remote Collaboration in Immersive Environments

5.1 Introduction

Brainstorming is used as a form of creative problem solving, where participants collaborate to generate new ideas, processes, and products. Increasingly, one or more participants join such meetings remotely. In order to ensure a high-quality collaboration, it is important to facilitate such remote collaborations by ensuring that remote and collocated participants have a good understanding of task space and communication space.

In addition to the importance of communicating deictic gestures, which was shown in the previous chapter, research indicates that in collaborative problem solving using multiple devices, having a separate device dedicated to providing a view of the environment helps with more effective decision making in the team [13]. Moreover, it is well known that when participants are allowed to freely manipulate the content of the shared screen, without a moderator, they will lose track of the task at hand [100, 38].

In this chapter a system for moderated remote collaboration, which communicates deictic gestures, as well as a view of the collaborative environment, is presented. The system is designed in a way that ensures the content on the shared display is not occluded by these layers of extra information. The system is inspired by ‘Metaplan’ [37], a well-known method for facilitated brainstorming, which uses notes for generating ideas, and dictates asymmetric roles among participants: a moderator who arranges and organizes notes on a shared board, and other participants who generate ideas on the notes and hand them to the moderator.

5.2 Related Work

Many previous systems have tried to digitize collaborative brainstorming. Designers’ Outpost [53], for example, is specifically designed for brainstorming with notes. The system allows the participants to use a normal pen and paper and then digitize the notes, but it only supports colocated collaboration. Other systems, such as Firestorm [19], [38], and [28], also allow for colocated note-based meetings. IdeaVis [29] employed horizontal and vertical displays for colocated teams, where facilitators monitor the progress of the colocated team. An extension of Designers’ Outpost [22] was one of the earliest brainstorming solutions supporting distributed teams, but each team had to have identical setups (large vertical displays). TeleBoard [35] allowed each team member to use their personal computer to generate ideas, but also assumed similar setups (large displays) for the shared content among remote teams. CrowdBoard [5] allows a remote crowd to generate ideas, which are then displayed on a large display.

Despite such systems, solutions for supporting remote and co-located teams with asymmetric roles are not well explored. Thus, a system that can support brainstorming with such settings can fill this gap.

5.3 Design

In order to address the shortcomings of the existing solutions, the following core requirements are defined for the system:

• R1: Support brainstorming with asymmetric roles (i.e. facilitated brainstorming) for remote and colocated participants.

• R2: Provide a full view of the collaborative environment.

• R3: Support deictic gestures.

As in previous chapters, since the goal is to use commodity, off-the-shelf hardware, the system uses mobile and tablet devices. Moreover, the collaborative environment uses a commodity vertical display, similar to the one used in Chapter 3.

In order to ensure that the remote and colocated participants have a similar experience when generating ideas, all of the participants need to use their personal touch devices (mobiles or tablets) to generate ideas. They can then submit their ideas to be displayed on the shared screen, where the moderator will organize them (R1).

Additionally, the view of the shared screen is available on each personal device. Thus, participants can switch between this shared view and their personal view, used for generating ideas, in order to obtain an understanding of the collaborative environment. Moreover, they can imitate a deictic gesture by touching their personal display in the shared view, which causes a highlighter to appear in the corresponding location on the shared screen, which the moderator and colocated participants will see (see Figure 5.1). These partially satisfy R2 and R3. Still, there are two major shortcomings:

1. The provided shared view only shows what is on the shared screen, not the whole collaborative environment.


Figure 5.1: (a) Generating a note. (b) Overview of the shared screen. The user is using a pen to ‘point’ to content on the shared screen. (c) A highlighter (red spot) indicates where the remote user is pointing. [60]


2. While collocated participants can understand the deictic gestures of the remote participant, the deictic gestures of collocated participants are not communicated to the remote participant.

These problems are largely addressed by extending the collaborative environment with a dedicated device for capturing a full view of the environment. A tablet, equipped with a 360° lens (Figure 5.2) and positioned on the shared desk, captures a panoramic video of the collaborative environment. This setup is depicted in Figure 5.3.

Figure 5.2: A commodity tablet equipped with a 360° lens [60]

The video is then projected onto a cube, to resemble a room, and the content of the shared board is overlaid on the corresponding wall (see Figure 5.4).
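For illustration, the following Python sketch shows one common way such a projection can be computed for a single cube face, assuming the panoramic frame is available as an equirectangular image; it is not necessarily the projection used in the implemented system, and the helper name is hypothetical:

    import numpy as np

    def equirect_to_front_face(equirect, face_size=512):
        """Sample the front (+z) cube face from an equirectangular panorama.

        `equirect` is an H x W x 3 image covering 360 degrees horizontally and
        180 degrees vertically; other faces follow by permuting the axes."""
        h, w, _ = equirect.shape
        u = np.linspace(-1.0, 1.0, face_size)          # left .. right on the face
        v = np.linspace(-1.0, 1.0, face_size)          # top .. bottom on the face
        uu, vv = np.meshgrid(u, v)
        x, y, z = uu, -vv, np.ones_like(uu)            # 3D viewing directions
        lon = np.arctan2(x, z)                         # longitude, -pi .. pi
        lat = np.arctan2(y, np.sqrt(x**2 + z**2))      # latitude, -pi/2 .. pi/2
        src_x = ((lon / (2 * np.pi) + 0.5) * (w - 1)).astype(int)
        src_y = ((0.5 - lat / np.pi) * (h - 1)).astype(int)
        return equirect[src_y, src_x]                  # nearest-neighbour sampling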

In order to show the gestures of the facilitator, a Kinect V2 is installed next to the vertical display, which tracks the facilitator’s body. The skeletal output of Kinect is then used to control an avatar, which resembles the position and body pose of the facilitator (see Figure 5.5).

The user can also manipulate this view, by performing rotate and zoom gestures on the touch screen (see Figure 5.6).


Figure 5.3: Setup for capturing the full view of the environment [60]


Figure 5.6: Remote participants can rotate and zoom the view for a more immersive experience [60].

Finally, the remote participant can switch to a ‘Through-glass’ view to see the collaborative environment while looking at the shared screen content, by lowering the opacity of the shared screen. Using this mode, the remote participant can see the collocated collaborators.


Figure 5.4: Projection of the panoramic video onto a cube. The content of the shared board is overlaid on the corresponding wall [60].

5.4 Evaluation

The developed system, called DigiMetaplan, was then evaluated to measure how well it supports facilitated brainstorming. Twenty participants (13 male, 7 female, average age 30 ± 12.3) were divided into five groups. All participants were highly competent with using touch screens. In each group, the facilitator was selected by members of the group.

Each group brainstormed on a topic that they were interested in. Groups 1 and 5 brainstormed on the design of a futuristic kitchen, groups 2 and 3 on the design of a maker bike, and group 4 on marketing strategies for a university’s brand.

The user study resulted in 4 hours and 24 minutes of recorded video. Several factors, such as turn-taking and the usage of different aspects of the system, were measured. Of interest to this chapter is the usage of the pointing highlighter. It was observed that the pointing highlighter feature was used extensively by remote participants. It served different purposes, such as highlighting a note that they were explaining or discussing, pointing to a target region to suggest moving a note there, or virtually drawing an idea. What was surprising was that even collocated members, especially the ones from groups 1 and 3, also used the highlighter on their phones to perform such tasks, even though they could use physical gestures. They believed that this would help the moderator understand the target of their pointing gesture more easily.


Figure 5.5: The moderator's avatar can be seen in the remote participant's device [60].

Figure 5.7: Through-glass view allows the remote participant to simultaneously see content of the shared screen, as well as the collaborative environment [60]



5.5 Conclusion

This chapter presented a system for facilitated remote brainstorming using commodity hardware. The system allows remote and colocated participants to observe important in-air gestures of their counterparts, without occluding the content of the shared screen. This is achieved by four means: visualizing the pointing gestures of remote partners using a small highlighter, showing a panoramic view of the collaborative environment to the remote partner, visualizing an avatar of the moderator for the remote partner, as well as providing a see-through view of the shared screen, through which the room and the colocated participants are visible while the contents of the shared screen remain visible as well.

This concludes Part II, which investigated enhancing human-human communication in brainstorming using in-air gestures. Part III studies another use case of in-air gestures and investigates how they can be used for human-computer interaction in collaborative environments.


Part III

In-Air Gestures for Human-Computer Interaction in Collaborative Environments


6

Human-Computer Interaction and In-Air Gestures

6.1 Introduction

A User Interface (UI) is “the medium through which the communication between users and computers takes place. The UI translates a user’s actions and states (inputs) into a representation the computer can understand and act upon, and it translates the computer’s actions and states (outputs) into a representation the human user can understand and act upon” [42]. In other words, the user can ‘interact’ with a computer system through a user interface.

In the age of early computers, user interfaces were non-existent: users had to write their programs on punch cards and hand them to the computer operator, who fed them to the mainframe computer. Depending on the schedule of the mainframe and the importance of the job, the user might get the results of the program back only after days. And a small error in the program (i.e. in the content of the punch cards) would necessitate repeating this whole process, which might have led to days of additional delay. This was quite likely, since the programs were written using an abstract language which made them difficult to comprehend and review. Thus, it was an error-prone, ‘human-mediated’, non-instantaneous ‘interaction’ with a computer system.

The advent of time-sharing computers drastically changed the way users interacted with computers. Multiple users were able to use teletypes to interact with a computer system, shared among them, and ‘instantly’ see the results in a character display (such as a teleprinter). This was probably the earliest form of a real-time, direct human-computer interaction.

Studying how time-sharing and teletypes were combined to provide a new form of interaction reveals a common pattern: advances in computer science theory (in this case time-sharing algorithms), and improvements of computer hardware (in this case, increase in internal memories), allowed for a more

efficient, usable form of interaction.

These interactions might be based on old, familiar metaphors: teletypes, invented a century before computers (to enable telegraph operators to send messages without knowledge of Morse code), were used as devices that let users give input to and receive output from computers. But many new applications and technologies require completely new forms of interaction. For example, graphical user interfaces and touchscreens introduced completely new forms of interaction such as drag-and-drop and pinch-to-zoom. Thus, the success of a user interface is not merely dependent on its technological qualities and capabilities, but also on how well the user can interact with the computer system using the user interface.

Thus, an interactive system (that is, a computer system which allows interaction between the user and the computer system) is distinguished by its two major components:

• The medium (i.e. the user interface) that is used to allow the user to interact with the system.

• The method by which the interface is used to interact with the interactive system. That is, the ways in which the user can give input to and receive output from the system.

6.2 Fundamental principles of interaction

The interaction between people and technology has been a subject of psychology as much as it has been a subject of technology. In order to understand how people interact with technology, one needs to understand the psychology of interaction.

Donald A. Norman, cognitive scientist and usability engineer, made important contributions to the study of interaction and interactive systems. In particular, his seminal books on usability, ‘Design of Everyday Things’ [77] and ‘User centered system design: New perspectives on human-computer interaction’ (with Stephen W. Draper) [79], provide a solid, widely adopted model for studying the psychology of interaction and how humans interact with objects, as described in the following section.

6.2.1 Psychology of Interaction

When interacting with a system, the user has some specific goals in mind. These goals are formed in the user’s mind and are only expressed in psychological terms, that is, they exist as thoughts rather than physical or graphical objects. The system’s state, on the other hand, is presented in physical (or graphical) terms. This significant difference between goal and system state creates two well-known Gulfs that need to be bridged if the system is to be used [78]:

• Gulf of Execution, where the user tries to figure out how to operate a system.

• Gulf of Evaluation, where the user tries to understand what happened.

These gulfs are unidirectional: the Gulf of Execution goes from the goals (i.e. the user) to the physical system, and the Gulf of Evaluation from the system to the goals (see Figure 6.1).

These gulfs are bridged by seven stages of the Action Cycle [78]:



Figure 6.1: Gulfs of Execution and Evaluation are unidirectional

1. Goal

2. Plan

3. Specify

4. Perform

5. Perceive

6. Interpret

7. Compare

This is represented in Figure 6.2.

In this context, several properties of the system determine the quality of user interaction with the system:

• Affordance is the relationship between people (or any interacting agent) and objects. More concretely, affordance is a relationship between capabilities of an agent and properties of an object. A book affords carrying. That is, a person can carry a book. It also affords reading, but only if it is written in the language the agent can understand. It also might not afford carrying if it is too heavy or if the agent is too weak.

• Signifiers are signs that help an agent perceive affordances of an object. Thickness of a book is a signifier for the agent about its carrying affordance. The language of the book cover is a



Figure 6.2: Seven stages of the action cycle

signifier for its reading affordance. Buttons on a mouse signify its clicking affordance. A PUSH sign on a door signifies its pushing affordance. As can be seen in these examples, signifiers can be intentional (such as a “PUSH” sign) or unintentional (the thickness of a book). Moreover, some objects do not provide any signifier for their affordances: a touchpad affords tapping, but it lacks a signifier (such as a button) for this affordance, while a mouse’s buttons are signifiers for its clicking affordance. Objects may even have contradicting signifiers: a door with a “PUSH” sign on it, but with a knob which can be grabbed and pulled.

• Constraints limit the user’s interaction with an object. Screen borders offer a constraint for pointer movement. Thus, the user realizes that moving the mouse further in the same direction is not useful once the pointer reaches the screen border.

• Mapping is a relationship between members of two sets of entities. For example, keys on a keyboard are mapped to specific characters. Computer mouse movements are mapped to movements of the mouse pointer. How these mappings are defined has an important effect on the usability of these devices: when the mouse is moved to the right, the pointer on the screen is also moved to the right, and when it is moved to the left, the pointer is also moved to the left. A different mapping, for example moving the pointer to the left when the mouse is moved to the right, would have an adverse effect on the quality of interaction of the computer mouse. Mappings are not always this obvious. For example, keys on a standard QWERTY keyboard have multiple mappings. Modifier


keys, such as Shift, can change the mapping of keyboard: the ’A’ key is mapped to either ’a’ or ’A’, depending on the state of the Shift key.

• Feedback is communicating the results of an action. Instant, informative feedback helps the user to know about the success or failure of his or her action. At the same time, too much feedback can hinder the flow of interaction, as it might make the users unable to follow it, or cause them to ignore the important parts. The click sound of the computer mouse, or the highlighting of icons once selected, are forms of good, informative feedback.

• Conceptual Model is the understanding of the user about how the system works. Affordances, constraints, etc. can help in shaping the conceptual model. Moreover, different users form different conceptual models of a system, and this affects how they interact with the system. A user familiar with capacitive touch screens removes her gloves before interacting with her phone’s touchscreen. An inexperienced, non-technical user might be surprised by the lack of a response, and just try to press harder on the screen.

Thus, a user interface which provides meaningful affordances, signifiers, constraints, mappings and feedback, and helps the user form a proper conceptual model of the system, can provide a high-quality interaction.

6.3 Analysing interaction properties of user interfaces

In this section, interaction properties of well-known user interfaces are presented: Command Line Interfaces, Graphical User Interfaces, and touch interfaces. These are later contrasted with interaction properties of in-air gestures.

6.3.1 Command Line Interfaces

A Command Line Interface (CLI) is a text-based interface, where users type in a sequence of commands, and the computer system interprets and tries running those commands. In a typical CLI, the inputs are entered using a keyboard, the characters are displayed on the terminal as the user types them in, and the system either performs the command or shows an error on the screen in case of erroneous input. A text cursor (also known as caret) indicates the position of the next character. The cursor is typically shaped as an underscore, a vertical line, or a solid rectangle, and can be blinking or steady.

With respect to interaction properties, the CLI offers the following:

• Affordance: A CLI affords receiving text input from the keyboard, executing it, and displaying the results on the screen.

• Signifiers: the prompt is a signifier for the system’s readiness to receive text input. A blinking caret also signifies the responsiveness of the system.

• Constraints: user input is constrained by the characters on the keyboard. New user inputs are constrained by the position of the caret and prompt: the user cannot enter inputs behind the prompt



(a) Live feedback on erroneous input (b) Live feedback on expected input

Figure 6.3: CLI with live feedback, in two states: an unknown command (a) and a known command (b), indicated by the color of the command (1). Caret (2) shows where the next character will be entered, while the prompt (3) shows constraints and context of the user input.

sign, and since there is only one caret, the user can only type one key at a time (i.e. no parallel

inputs). The commands are also constrained by the underlying software (for example bash or Windows Command Prompt).

• Mappings: each input is mapped to an action. For example, pressing letter ’a’ concatenates the current command with a letter ’a’. Pressing the return key executes the current command. Pressing the tab key completes the current command. Although certain actions can change the mode of the interaction (e.g. depending on the caps lock status, the keys are mapped to either capital or small letters), in each mode, there is a one to one mapping between input and output.

• Feedback: after every keystroke, the visual state of the terminal is updated (except when entering hidden inputs, such as passwords). After entering a letter, that letter is shown in the command line. Hitting return shows the result of the executed commands, if any, and shows a new, clean prompt after the command is finished executing, so the user knows the system is ready for new commands. On typing an unrecognized command, pressing the tab key does not complete the command, while if the command is recognized, the tab completes it.

All the above helped CLIs to be among the most ubiquitous user interfaces, still widely in use decades after their introduction.

6.3.2 Graphical User Interfaces

Graphical User Interfaces (GUIs) allow users to interact with a computer system via graphical metaphors. WIMP (Windows, Icon, Menu, Pointer), the most popular form of GUIs, allows the user to interact with the software using a mouse, by performing actions on icons (clicking, drag-and-drop, ...). Applications are represented as windows, and a pointer indicates the current position of the mouse, while menus and icons present shortcuts to an action (such as opening an application or executing a task). A WIMP GUI supports the following interaction properties:

• Affordance: Icons and menus afford ‘clicking’. That is, a user can click on them.

• Signifiers: Some WIMP systems signify clicking affordance by highlighting the item under the


pointer, or changing the pointer’s shape. For example, hovering the mouse pointer over a hyperlink signifies its ‘click’ affordance by changing the mouse pointer from an arrow shape to a hand shape.

• Constraints: The pointer movement is constrained by the screen border.

• Mappings: Clicking the left mouse button is typically mapped to the activate command. Double-clicking the left button is mapped to the execute command. Clicking the right button is mapped to showing the context menu. Holding a button and moving the mouse around is mapped to dragging icons and windows.

• Feedback: Movement of the pointer is used as feedback for movement of the mouse. Lack of pointer movement while the mouse is being moved is an indicator of a technical error (disconnected cable, low battery, system freeze, ...). When an icon is clicked on, it is highlighted. When an object is being dragged, a lower-opacity copy of the icon is moved alongside the pointer.

As can be seen, both the software and the hardware support many interaction properties, leading to a smooth user experience. Still, usability challenges remain. For example, the lack of a signifier for the clicking affordance of the mouse wheel causes a lot of users to miss the affordance of the middle-button click.

6.3.3 Touch User Interfaces

Touch-screens are among the most popular Natural User Interfaces (NUI), a type of user interface that enables users to interact with technology in a smooth, intuitive, and natural way, and with less visible hardware: to type without a physical keyboard, to draw without a mouse, and so on.

This is thanks to improvements in touchscreen technologies, especially capacitive touchscreens, innovations in touch interaction metaphors (such as pinch-to-zoom), as well as the universal success of smartphones.

In this section, a successful, widely-adopted touch interaction, the pinch-to-zoom interaction (Figure 6.4), is explored and its interaction properties are analyzed.

• Affordance: Modern, moderate to large size touch screens afford multi-touch interaction. Small touch-screens, like the ones found in smart-watches, cannot afford such interactions due to their small screen size.

• Signifiers: The size of the touchscreen may signify its multi-touch capabilities. Often, signifiers are learned from external sources not present on the device: advertisements, experience with devices, learning from more experienced users, etc. Sometimes a border around an image indicates that it affords being pinched and zoomed into.

• Constraints: Physical borders on the screen indicate a hard limit on the size of the gesture. Often, the underlying software limits the zoom factor of the image.

• Mappings: Pinching an image and increasing the distance between the touch points (typically between the thumb and pointing fingers) is mapped to zoom in. Decreasing the distance between


Figure 6.4: Pinch to zoom

the touch points is mapped to zoom out. More accurately, the ratio between the current and the initial distance between the touch points is mapped to the zoom factor (see the sketch after this list).

• Feedback: Continuous feedback is probably the most crucial part of the pinch-to-zoom interaction. The change in the object’s size indicates the pinch-to-zoom affordance of the object to the user. A lack of pinch-to-zoom affordance, or a constraint on the zoom factor, is indicated by a lack of feedback. In the normal case, the object’s size increases and decreases as the user changes the distance between the touch points.
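As a small illustration of the mapping described in this list (not taken from any particular toolkit), the zoom factor applied to the object can be computed as the ratio of the current to the initial distance between the two touch points:

    import math

    def zoom_factor(p0_a, p0_b, p_a, p_b):
        """Ratio of the current to the initial distance between the two touch
        points, i.e. the factor applied to the object's size during the pinch."""
        d0 = math.dist(p0_a, p0_b)  # distance when the pinch started
        d = math.dist(p_a, p_b)     # current distance between the two touches
        return d / d0 if d0 > 0 else 1.0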

6.4 Gestural Interactive Systems

A vision-based gestural interactive system, that is, an interactive system that uses in-air gestures as a form of input, consists of one or more cameras, a gesture recognizer, and an application that uses the output of the gesture recognizer as a form of input. Thus, the gesture recognizer is a core component of any gestural interface, and extensive research has been done on recognizing hand gestures and signs. The gesture recognizer is responsible for two tasks:

1. Gesture detection: detecting when a gesture is started and finished.

2. Gesture classification: classifying the type of the gesture.

6.4.1 Gesture Detection

Detecting the start and end of a gesture is a problem that is specific to in-air gesture recognizers: for touch gestures, the beginning and the end of a gesture can be defined as the time when the finger is put on the touchscreen and the time it is lifted off the screen, respectively. In contrast, in-air gestures are in an always-on state, and the gesture recognizer should be able to distinguish between intentional gestures and unintentional hand movements. This problem is referred to as the ‘live mic’ problem 1 or the ‘Midas Touch’ 2.

The most common ways to address this problem are:

Using a reserved action In some systems, a reserved action is used to indicate the start or end of a gesture. For example, one might use pinching with the index finger and thumb to indicate the start of a gesture.

Using a clutch A ‘clutch’ [113] action is used to transition the state of the system to a gesture-recording state, and to leave the gesture-recording state when the clutch is deactivated.

Multi-modal input Some systems use another input modality to indicate the start or end of a gesture. For example, when users intend to start a gesture, they press (or press and hold) a button, and press it again (or release it) when the gesture is finished.

It is important to note the difference between a reserved action and a clutch. Users need to perform the reserved action when they start and end each gesture, which can become distracting and reduces the usability of the system. A clutch, on the other hand, transitions the state of the system to gesture recording and stays in this state until the clutch is deactivated. For example, some systems that use depth cameras dedicate two imaginary planes, between which the user motions are recognized as gestures. This allows users to perform more than one action per clutch activation.
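As an illustration of the two-plane clutch just described, the following sketch keeps the system in a gesture-recording state while the tracked hand stays between two depth planes. The plane values, the (x, y, z) convention, and the class name are assumptions made for this example only.

# Illustrative depth-based clutch: gestures are recorded only while the
# tracked hand stays between two imaginary planes in front of the display.
NEAR_PLANE = 0.3  # metres from the camera (assumed calibration value)
FAR_PLANE = 0.6   # metres from the camera (assumed calibration value)

class DepthClutch:
    def __init__(self):
        self.recording = False
        self.trajectory = []

    def update(self, hand_position):
        """Feed one tracked hand position (x, y, z). Returns the recorded
        trajectory when the clutch deactivates, otherwise None."""
        x, y, z = hand_position
        inside = NEAR_PLANE <= z <= FAR_PLANE
        if inside and not self.recording:
            # Hand entered the interaction volume: start recording.
            self.recording, self.trajectory = True, []
        elif not inside and self.recording:
            # Hand left the volume: stop recording and hand off the data.
            self.recording = False
            finished, self.trajectory = self.trajectory, []
            return finished
        if self.recording:
            self.trajectory.append((x, y))
        return None

Because the state only changes when the hand crosses a plane, the user can perform several gestures during a single activation, which is exactly what distinguishes a clutch from a reserved action.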

6.4.2 Gesture Classification

In addition to detecting that a gesture has happened, the gesture recognizer needs to classify it. Ideally, the gesture recognizer should report the gesture in the gesture vocabulary that the user intended to perform.

Different gesture recognizers address these tasks differently. The following section presents the state of the art in in-air gesture recognizers and elaborates on how each of these tasks is addressed.

6.5 State of the Art in Uninstrumented In-Air Gesture Recognizers

Developing a gesture recognizer starts with selecting the set of gestures that the recognizer should be able to detect and classify, that is, the gesture vocabulary of the interactive system. Then, samples of the gestures should be collected. This is usually done by asking a study group to perform the gestures multiple times.

1 The term ‘live mic’ is used since the always-on state of the in-air input device (typically a camera) resembles that of an always-on microphone in a live radio program, causing everything said by anyone near the microphone to be broadcast, even if the speaker does not intend it to be. 2 The term Midas Touch [52] refers to the Greek myth of King Midas, who obtained a golden touch that would turn everything he touched into gold, even if he did not intend it to. This had unwanted consequences, including him starving to death, since every food and drink that he touched would turn into gold.


The collected data are then used to train a classifier and are thus referred to as the training data.

The type of data and the classifier used for gesture classification have a profound effect on the performance and quality of the gesture recognizer, as well as on the overall interactive system.

Many vision-based gesture recognizers are trained with the raw data collected from the image. That is, the classifier’s task is to classify a sequence of images and decide whether they belong to any gesture in the gesture vocabulary, or that they do not represent any gesture.

Thus, most such gesture recognizers use advanced computer vision classification techniques to achieve a highly accurate classification of gestures. Many of these recognizers use computer vision techniques for the task of gesture detection as well, while some use another input modality for detecting the beginning and end of a gesture.

For example, Ohn-Bar et al. [80] use depth and RGB images to detect segments of the incoming data that contain a gesture (i.e. gesture detection). To do so, they assume that the mere presence of a hand in a frame indicates the existence of a gesture and that the absence of a hand indicates that there is no gesture. That is, they use the presence and absence of a hand in the input image as a clutch, and temporally segment the incoming video into sequences of frames containing a hand. A segmented sequence is then converted into a spatiotemporal descriptor, which is then passed to a Support Vector Machine (SVM) classifier. Figure 6.5 shows the main outline of this gesture recognizer.

Figure 6.5: Overview of the pipeline of a typical vision-based gesture recognizer [80]

On the other hand, [71] use a multi-sensor setup for gesture detection and classification. Using a short-range radar sensor, they detect video segments with significant hand motion and assume that only those segments are gesture segments. Thus, they transition the system into a gesture-recording state only after significant motion is detected by the radar sensor, and turn off gesture classification if no significant motion is detected. Similar to [80], the frames of a segment are then combined and passed to a classifier. They use a Three-Dimensional Convolutional Neural Network (3D CNN) for the classification of gestures.

The use of 3D CNNs is a popular technique for gesture recognition, and more generally for human action recognition [98, 103, 45]. These Deep Neural Networks (DNN) consume a sequence of 2D images, which forms a ‘volume’ (hence 3D), as their input. This allows them to take both spatial and temporal features of the input into account.

Molchanov et al. [70] also use a 3D CNN to classify gestures. They use the VIVA dataset [80], which contains 19 hand gestures stored as sequences of depth and intensity images. The gestures are performed by 8 subjects under different illuminations and at different speeds. In order to account for this difference in speed, they first preprocess the segments and re-sample the data to reach a normalized speed and illumination. These normalized data are then spatiotemporally augmented (using scaling, rotation, increasing or decreasing the temporal duration by 20%, ...) to reach a sufficiently large training data set, on which the network is then trained.

Since the dataset contains segmented gestures, the gesture recognizer does not need to address the problem of gesture detection. Such segmented datasets are used by many other gesture recognizers, such as [1], which is also a deep learning-based gesture recognizer, and does not address the gesture detection problem.

Regardless of whether they use a segmented dataset or a multi-modal input, such gesture recognizers take a sequence of frames as their input. This sequence is either the full sequence of frames of a ‘gesture candidate’, as presented in a dataset, or extracted from a stream of frames using a gesture detection mechanism (such as a multi-modal input) which indicates the beginning and end of the gesture candidate. That is, given a sequence of frames (of RGB, depth, or skeletal information) forming the whole gesture candidate, they report whether the gesture candidate belongs to any gesture class, and if so, which one.

Because these gesture recognizers need the whole frame sequence before they can perform the classification task, they are categorized as ‘offline’ gesture recognizers. The term ‘offline’ here refers to a category of algorithms that need the whole input before being able to perform their calculations and generate a result. They are contrasted with ‘online’ algorithms, which can perform their calculations and generate an output with partial data. That is, online algorithms can produce a result as soon as a new data point comes in; they do not require the whole input.
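The distinction can be summarized in a few lines of code. The two callbacks (classify_sequence and update_with_frame) are hypothetical stand-ins for a concrete classifier; the sketch only illustrates the control flow.

def recognize_offline(frames, classify_sequence):
    """Offline: the whole (pre-segmented) frame sequence must be available
    before any result can be produced."""
    return classify_sequence(frames)

def recognize_online(frame_stream, update_with_frame, state=None):
    """Online: a (possibly partial) result is produced after every frame."""
    for frame in frame_stream:
        state, result = update_with_frame(state, frame)
        yield result  # e.g. 'no gesture yet', a partial match, or a class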

Neverova et al. [74] propose a different approach for online gesture recognition. Using both RGB-D and skeletal information for each gesture (collected using a motion capturing device) from the ChaLearn 2014 dataset [36], they first temporally scale the training data (i.e. resample the frame sequence) with varying resampling factors, with the resampled sequence always having 5 frames. They then train multiple classifiers, one for each temporal scale. This gives a degree of robustness to changes in gesture performance speed.

During runtime, as soon as a new frame comes in, a buffer of the most recent 5 frames is created and passed to each classifier. This enables the gesture recognizer to run in an ‘online’ fashion. The outputs of the classifiers (the probability of the gesture belonging to each class) are then fused using a simple voting strategy. That is, they calculate a weighted sum of the outputs of each classifier for each class and report the class with the maximum score (the weights are found empirically).
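The fusion step can be sketched as follows. The probabilities and weights below are illustrative placeholders, not values from [74].

import numpy as np

def fuse_predictions(per_scale_probs, weights):
    """Weighted-sum fusion of per-temporal-scale class probabilities.
    per_scale_probs: one array of shape (num_classes,) per classifier.
    weights: one scalar per classifier (found empirically)."""
    scores = sum(w * p for w, p in zip(weights, per_scale_probs))
    return int(np.argmax(scores)), scores

# Example: three temporal scales, four gesture classes.
probs = [np.array([0.10, 0.70, 0.10, 0.10]),
         np.array([0.20, 0.50, 0.20, 0.10]),
         np.array([0.10, 0.60, 0.20, 0.10])]
predicted_class, fused_scores = fuse_predictions(probs, weights=[0.5, 0.3, 0.2])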

An interesting consequence of the gesture recognizer being online is that it first classifies a gesture and then detects it. That is, they pass the gesture classification score to another classifier which reports when a gesture is started and when it is finished (see Figure 6.6). Only if the gesture is detected is the result of the classification considered valid. This is in contrast with most gesture recognizers, such as [71], which only run the gesture classification after a gesture is detected (using its radar input device).

Figure 6.6: Gesture detection using the confidence of the prediction [74]

A rather similar approach is used by Molchanov et al. [84], who use a Recurrent 3D CNN. Recurrent Neural Networks (RNN) are stateful, meaning that their result does not only depend on the current input, but also on the previous ones.

In order to run the gesture classifier in an online fashion, they segment the input stream into 8-frame clips and pass each clip to the classifier. Given that most gestures consist of the three phases of preparation, nucleus, and retraction [25, 47], with the nucleus being the most discriminative, they only consider a gesture nucleus to be detected when the classification confidence is above a certain threshold. This is very similar to Neverova et al.'s approach.

Moreover, since a gesture usually spans multiple clips, it is possible to classify it before it is fully finished, because the classifier might already have enough clips of the gesture to classify it confidently. [84]'s motivation for this minimal or even negative delay was to provide timely feedback, thus improving the usability of the system that uses this gesture recognizer.

Although recognizing gestures from sequences of images achieves good classification accuracy, especially when DNNs are used for classification, such recognizers have several disadvantages for gesture recognition in interactive applications.

First, since every application needs a different set of gestures, one should either design a new network or, more typically, tune an existing one for every new set of gestures. This process is time-consuming and computationally expensive. It also requires collecting a large amount of data (in excess of thousands or tens of thousands of images) from a variety of users to avoid overfitting the model, which is itself an expensive and time-consuming task. Thus, extending such gesture recognizers to recognize new gestures is very challenging.

But more importantly, these classifiers only output the probability that a gesture belongs to a class, and nothing else. Thus, their output is not insightful and does not help the user to build a conceptual model of the system.

A different approach is to first pass the image stream to a vision-based skeletal tracker (also known as a pose estimator), and then use the output of the skeletal tracker to classify gestures. Even though the skeletal data is also inferred from the image stream, the inference uses a highly generalized classifier which is trained on millions of training samples and is highly optimized. More importantly, building a gesture classifier on top of the pose data leads to a simpler classifier.

For example, Microsoft Kinect uses multiple per-pixel classifiers to infer a body part map from the input depth-images and proposes a set of candidate full-body poses (see Figure 6.7). The pose classifiers use simple, randomly generated depth features, but it is the sheer amount of data they are trained on that makes Kinect skeletal tracking perform well. The training dataset consisted of around half a million frames, collected from hundreds of videos of people performing different activities. This dataset was then used to generate a larger set of synthesized depth images, to account for the variety in human body shapes and sizes [97].

Figure 6.7: Overview of Kinect skeletal tracking pipeline [97]

There is also other research, such as [102], which focuses on skeletal tracking of hands, using long-range depth cameras (i.e. Kinect) by incorporating a detailed smooth hand model for per-frame estimation of the posture. More recent approaches to articulated hand tracking use deep neural networks, such as [26, 72], which estimate hand skeletal model from a single depth image. OpenPose [15], an important contribution in this area, manages to estimate full-body and hand poses of multiple persons from a single 2D image in real-time, thus enabling the use of an inexpensive RGB camera for designing in-air user interfaces.

Gesture recognizers that use skeletal information still need to address the problem of gesture detection.


Some, such as [20], manage to achieve a high classification accuracy (83%), but the result is an offline classifier. That is, they had to either manually crop the gesture sequence before training and testing the classifier, or ask the users to perform an activation command (fully open palm) before and after performing each gesture.

Other skeletal gesture recognizers, such as [17, 82], are also trained and tested on pre-segmented data. [82], for example, asked the users to press a toggle button before starting a gesture and after finishing one.

Recently, many skeletal gesture recognizers have tried to take successful 2D gesture recognition algorithms, primarily used for touch interaction, and adapt them to the 3D in-air recognition task. In particular, an important class of 2D gesture recognizers, the $-family recognizers ($1 Recognizer [114], $N Recognizer [6], $P Recognizer [106], and recently the $Q Recognizer [105]), which use template matching to recognize single-stroke and multi-stroke gestures on touch screens, have been adapted to 3D gestures [17].

One advantage of such gesture recognizers is that they are simple and easy to implement and adapt, and they require far less training data than image-based gesture recognizers. They still need to address the problem of gesture detection, though. Similar to many image-based gesture recognizers, [17] assumes a pre-segmented sequence of skeletal information, thus avoiding the problem of gesture detection. It is therefore an offline gesture recognizer.

It was only recently that the problem of online gesture recognition (i.e. detecting and classifying a gesture before it ends) gained some traction. Most notably, Eurographics 2019's SHape REtrieval Contest (SHREC) [16] addressed this problem by running a contest on online gesture classification. While [17] tried to adapt their offline, template-based recognizer to an online one by introducing a window (effectively storing the trajectories as they come in and comparing them with the templates), all other contestants used deep neural networks to perform the task of single-trajectory, single-stroke gesture recognition. The target gestures, as well as samples of them being performed by a user, are presented in Figure 6.8 and Figure 6.9.

Figure 6.8: Gesture templates for SHREC [16]

Figure 6.9: SHREC gestures as performed by a user. Red indicates the part of the trajectory that should be detected as a gesture. [16]


6.6 Conclusion

This chapter presented an overview of the main concepts of user interfaces and human-computer interaction, and used those principles to show why many in-air gestural interfaces cannot provide a good user experience, mainly due to a lack of timely or meaningful feedback. Moreover, examples of gestural interfaces were provided, and it was shown that the lack of such feedback is rooted in the way the in-air gesture recognizer works. Most gesture recognizers lack the ability to provide fast feedback before the gesture is finished, and even when they do provide feedback, it carries only limited useful information for the user (for example, whether the gesture was successful or not).

These problems are addressed in the next chapter, where an online, articulation-free gesture recognizer is proposed, and it is shown how it can be used for developing better gestural interfaces.


7

Gesture Recognition for Human-Computer Interaction in Collaborative Environments

7.1 Introduction

A gesture recognizer is the core component of any gestural interface. This is particularly the case when the interactions are performed using semaphorical gestures. As described in Chapter 6, any gesture recognizer can provide information about the state of its classification after a gesture is completed, but using this information for providing feedback leads to late feedback. Since humans are extremely sensitive to the response time of user interfaces, and lags of more than 100 milliseconds are considered annoying [18], such feedback is more disturbing than useful. There have been efforts to provide feedback with zero or negative lag, most notably in [84], where a Convolutional Neural Network (CNN) gesture recognizer is trained to provide fast feedback before the gesture is finished. Such efforts are limited to giving binary feedback to the user, that is, whether the gesture was successfully recognized or not. Moreover, [84] requires an extensive training phase, like any other deep neural network, as it requires a large amount of training data; training the network can also take a significant amount of time. Other work, such as [50], also focuses on tailoring a gesture recognizer for providing timely haptic feedback, but their feedback is likewise limited to whether the start or end of the gesture was detected and whether the gesture could be successfully classified.

In this chapter, a gesture recognizer which provides fast, insightful feedback is presented. Using the hand's trajectory as input, extracted from a skeletal tracker, the gesture recognizer provides continuous classification results while the gesture is being performed by the user, thus enabling the application to provide continuous feedback. It uses the confidence of classification as a measure of detection, similar to [74] and [84].

Moreover, the gesture recognizer does not need extensive training and is capable of detecting and classifying gestures with a small set of training data. The classification accuracy of the presented framework is comparable to that of state-of-the-art gesture recognizers.

7.2 Design Goals

The design goals of the gesture recognizer are as follows:

1. Ease of defining new gestures without extensive data gathering or long training time.

2. Ability to continuously classify the gestures as they are being performed, so the user has the opportunity to correct the gesture while performing it.

3. Being independent of the speed of the gesture: this allows the user to perform the gestures at their own speed, thus having more opportunity for learning.

The focus of this design is on recognizing semaphorical dynamic and unistroke gestures, as they are the most used gestures [34].

7.3 Gesture Datasets

For evaluating the quality of the gesture recognizer presented here, to inform the design of the algorithms, and to present ideas with concrete examples, it is necessary to decide on a set of gestures that will be used in this chapter. While it is possible to collect gesture data specifically for the purpose of this research, using public datasets is preferred. Apart from providing access to a large amount of data recorded from multiple users, public gesture datasets enable comparison of the results with gesture recognizers that used the same dataset. This simplifies the evaluation of the proposed method.

Among the publicly available gesture datasets, many are targeted towards gesticulation and activity recognition tasks, such as ChaLearn [36], which includes a set of Italian gesticulations, Microsoft Research Daily Activity (MSRDailyActivity3D) [110], which consists of depth and RGB images of a number of daily activities, and [64], which consists of a set of depth and skeletal records for a subset of static American Sign Language gestures. Many datasets involving semaphorical gestures, such as [80], only consist of RGB images.

The most relevant dataset to the work of this chapter is the one collected for the Eurographics 2019 SHape REtrieval Contest (SHREC) [16], which consists of five simple planar gestures (i.e. gestures that can be performed on a plane). This is an important observation, since all other gesture datasets, and most of the applications using semaphorical single-stroke in-air gestures, also focus on planar gestures. Thus, it is possible to also use publicly available 2D gesture datasets as a source of extra data. Of the available 2D unistroke gesture datasets, the most widely used one was introduced by [114] and is used by many touch and in-air gesture recognizers, such as [17].

Thus, in this chapter, the datasets used by [16] and [114] are used. Figure 7.1 depicts the gestures included in these datasets.


(a) The gestures in the $-family dataset [114].

(b) The gestures in the SHREC contest dataset [16]. From left to right, these are referred to as X, O, V, ∧, and square.

Figure 7.1: Gesture datasets

7.4 An overview of the algorithm

The gesture recognition algorithm proposed here works in an online fashion; that is, as new tracking data comes in, the gesture recognizer updates its state, and if this leads to a confident classification of a gesture, the result of the classification is returned.

To achieve this, the recognizer first preprocesses the received data points in two steps:

1. Creating a summary of the observed points. The created summary is independent of the number of points, and its size only depends on the shape of the trajectory. This makes the algorithm suitable for online execution (since otherwise, it would need to store all the observed data points). This is discussed in detail in Section 7.4.1.1.

2. Converting the summary to a feature vector (as described in Section 7.4.1.2).

The same preprocessing is also performed on the gestures in the training set. The resulting feature vectors are then used to train a classifier, using the distance function described in Section 7.5. The classification method is explained in Section 7.6.

Moreover, on the arrival of every new data point, a partial matching with every gesture in the gesture vocabulary is performed, and the unmatched parts of the gestures are reported to the user as signifiers and as a form of feedback.

Figure 7.2: Overview of the gesture recognizer

7.4.1 Preprocessing

Many gesture recognizers, such as the $-family of gesture recognizers [105, 114, 6] and many other recognizers influenced by them (such as Protractor [63], $N-Protractor [7], and !FTL [104] for 2D gestures, or 3D recognizers such as [17]), need a preprocessing step on the gestures to be able to perform efficient template matching and to factor out differences in gesture speed as well as sampling rate.


While different in details, all of these gesture recognizers perform similar preprocessing on the template, as well as on the user’s gesture:

1. Resample the target gestures into a fixed number of points N and create a template library of these resampled gestures.

2. Resample the user gestures into the same number of points N and compare it with the template library.

In Figure 7.3, an example of resampling a star sign using different values for N is shown.

Figure 7.3: An example of resampling in the $1 recognizer. Many other recognizers are inspired by the $1 recognizer and follow similar resampling approaches. [114]

This approach works well for gestures on touch screens, because the beginning and end of each gesture are known. This is not the case with in-air gestures, due to their always-on state. Thus, applying the same approach to in-air gestures requires segmenting the gestures using a reserved action, or a multi-modal gesture activator, to mark the start and end of each gesture, which is not desirable.

Thus, a preprocessing algorithm for an in-air gesture recognizer cannot assume knowledge of the beginning and end of gestures if it is meant to be online.

In this section, first a preprocessing algorithm, Algorithm 1, that works on segmented data points is presented. It is then converted to an online algorithm, presented as Algorithm 2, in order to run on unsegmented data points.

7.4.1.1 Algorithm 1: Summarizing the point cloud

As a first step, the set of points representing the user's hand positions is summarized using Algorithm 1. The algorithm takes a list of points and reduces it to a smaller subset of representative points. The intuition behind this algorithm is that the only points of significance are those whose addition introduces a sharp turn in the stroke. A sharp turn is defined as an angle between consecutive vectors that is larger than a constant Θ. If a point does not cause a sharp turn, it is merged into the previous ones. Figure 7.4 depicts this idea.

Applying the summarizing algorithm to a gesture results in a representative of the gesture, with all turns of angle Θ or less filtered out. To show the effect of the algorithm, it is applied to the same star stroke represented in Figure 7.3, with different Θ values; the results are shown in Figure 7.5.

Algorithm 1 Summarizing a list of points
1: function SUMMARIZE(points, Θ)
2:   n ← len(points)
3:   output_1 ← points_0
4:   output_2 ← points_1
5:   for i = 2..n−1 do
6:     m ← len(output)
7:     U ← output_m − output_{m−1}
8:     V ← points_i − output_m
9:     if ∠(U, V) > Θ then
10:      output_{m+1} ← points_i
11:    else
12:      output_m ← points_i
13:    end if
14:  end for
15:  return output
16: end function

Figure 7.4: Summarizing algorithm

(a) When the angle α is greater than Θ, the algorithm keeps the last point (P1) and adds the new one (P2) to the list of points. (b) When the angle α is less than or equal to Θ, the algorithm replaces the last point (P1) with the new point (P2).

(a) Θ = 0, which results in the input point cloud (b) Θ = π/6.0 (c) Θ = π/3.0

Figure 7.5: Applying the summarizing algorithm on a star point cloud

More importantly, the size of the output is independent of the size of the input and only depends on the number of sharp turns in the point cloud. Moreover, the algorithm only depends on the last points of the summary, and is thus suitable for online execution. In other words, it is not necessary to keep a memory of all the observed points. The algorithm can be rewritten as an online algorithm, as described in Algorithm 2.

Algorithm 2 Online summarization algorithm
1: function ONLINESUMMARIZE(summary, point, Θ)
2:   n ← len(summary)
3:   if n ≤ 1 then
4:     return summary + [point]
5:   end if
6:   U ← summary_n − summary_{n−1}
7:   V ← point − summary_n
8:   if ∠(U, V) > Θ then
9:     summary_{n+1} ← point
10:  else
11:    summary_n ← point
12:  end if
13:  return summary
14: end function
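For concreteness, a direct Python rendering of Algorithm 2 for 2D points is given below; the angle helper and the example trajectory are illustrative additions.

import math

def angle_between(u, v):
    """Unsigned angle between two 2D vectors, in radians."""
    dot = u[0] * v[0] + u[1] * v[1]
    nu, nv = math.hypot(*u), math.hypot(*v)
    if nu == 0 or nv == 0:
        return 0.0
    # Clamp to avoid domain errors caused by floating-point noise.
    return math.acos(max(-1.0, min(1.0, dot / (nu * nv))))

def online_summarize(summary, point, theta):
    """One step of Algorithm 2: if the turn at the last summary point is
    sharper than theta, the new point is appended; otherwise it replaces
    the last summary point."""
    if len(summary) <= 1:
        return summary + [point]
    u = (summary[-1][0] - summary[-2][0], summary[-1][1] - summary[-2][1])
    v = (point[0] - summary[-1][0], point[1] - summary[-1][1])
    if angle_between(u, v) > theta:
        summary.append(point)
    else:
        summary[-1] = point
    return summary

# Usage: feed tracked palm positions one by one.
summary = []
for p in [(0, 0), (1, 0.02), (2, 0.01), (2.5, 1.0), (2.6, 2.0)]:
    summary = online_summarize(summary, p, theta=math.pi / 6)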

To better illustrate the working of the algorithm, the same star gesture, as performed online, and its summary, as calculated after the arrival of each data point, are shown in Figure 7.6.

7.4.1.2 Scale- and orientation-free representation

While the summarization algorithm makes the representation independent of the stroke speed, the resulting summary is still sensitive to the scale and orientation of the strokes. This might pose a problem, since the size of a gesture depends on the user's arm length and their distance from the tracking system. Similarly, the orientation of the gesture might depend on the relative position of the camera to the user's hand. This is especially the case in a collaborative scenario, where a single camera is used for multiple users.

To address these problems, a scale-free and orientation-free representation of the gesture is necessary. To achieve this, the consecutive points of the calculated summary are first converted into vectors. These vectors are then normalized using the sum of their magnitudes and rotated based on the angle of the first vector. The procedure presented in Algorithm 3 generates such a representation.

This scale-free and orientation-free representation of the points can be used as the feature vector for the point cloud, which can then be used for classification.

It is noteworthy that Algorithm 3 depends on the sum of all vectors' lengths (denoted as L on line 7 of Algorithm 3) and thus needs to hold on to the full list of points in the summary. While this might seem problematic (since the number of points, as explained before, depends on the user's gesture speed and the tracking rate), it should be pointed out that this algorithm is applied to the summary of the points, whose size is independent of the number of originally tracked points.


(a) n=34, m=2 (b) n=34, m=2 (c) n=34, m=2 (d) n=34, m=2

(e) n=34, m=2

Figure 7.6: Running of the downsampling algorithm while a gesture is being performed. n is the number of points tracked by the tracking system (black). m is the number of points in the summary (red).

Algorithm 3 Scale-free and orientation-free representation of a set of points
1: function NORMALIZEANDROTATE(summary)
2:   n ← len(summary)
3:   for i = 1..n do
4:     v_i ← summary_i − summary_{i−1}
5:   end for
6:   for i = 1..n do
7:     L ← L + |v_i|
8:   end for
9:   for i = 1..n do
10:    output_i ← (v_i − v_{i−1}) / L
11:  end for
12:  return output
13: end function
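The following Python sketch implements the representation as described in the prose above (vectors between consecutive summary points, normalized by the sum of their magnitudes, and rotated so that the first vector has angle zero). It is an interpretation of that description rather than a literal transcription of Algorithm 3, so treat it as a sketch.

import math

def normalize_and_rotate(summary):
    """Scale- and orientation-free feature vector: consecutive summary
    points become vectors, each is scaled by the total path length, and
    all are rotated so that the first vector has angle 0."""
    vectors = [(b[0] - a[0], b[1] - a[1])
               for a, b in zip(summary, summary[1:])]
    total = sum(math.hypot(vx, vy) for vx, vy in vectors)
    if total == 0:
        return []
    # Rotate everything by the negative angle of the first vector.
    phi = math.atan2(vectors[0][1], vectors[0][0])
    cos_p, sin_p = math.cos(-phi), math.sin(-phi)
    features = []
    for vx, vy in vectors:
        rx = (vx * cos_p - vy * sin_p) / total
        ry = (vx * sin_p + vy * cos_p) / total
        features.append((rx, ry))
    return features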


(a) Summary of the gesture

Magnitude: 0.19  0.19  0.21  0.20  0.20
Angle: 0°  140°  142°  140°  143°

(b) Feature vector

Figure 7.7: A star gesture summary (a) and its time-, orientation-, and scale-invariant representation (b)

Thus, the size of the summary array is constant with respect to the number of tracked points and is directly related to the number of sharp corners in the gesture shape.

Figure 7.7 shows this representation for the star sign. One nice property of this feature vector is that it can be easily interpreted by humans. That is, not only is this representation useful for classifying gestures, it can also be used as a guideline for how to perform a gesture. For example, the representation in Figure 7.7 can be interpreted as a step-by-step instruction set for how to perform the star gesture:

1. Draw a line

2. Draw the next line with similar length, with an angle of 140◦ to the previous line.

3. Draw the next line with similar length, with an angle of 142◦ to the previous line.

4. Draw the next line with similar length, with an angle of 140◦ to the previous line.

5. Draw the next line with similar length, with an angle of 143◦ to the previous line.

This provides a good opportunity for live feedback, as these instructions can be provided to the user application and visualized as the next steps required to finish a gesture. This will be addressed in detail in Section 7.7.


7.5 Distance Function

To decide whether two gestures are similar or not, a distance function needs to be defined. If properly defined, the distance between two similar gestures should be reported as very low (ideally 0) and the distance between two different gestures as very high.

Finding a distance function for a particular feature vector involves some experimentation. Closely observing the intraclass variations of the feature vectors (i.e. the variation of feature vectors of the same class of gestures), such as the ones depicted in Figures 7.7a, 7.8b, and 7.9b, provides some important insights:

1. Similar gestures might result in feature vectors with different lengths.

2. The extra dimensions have very low magnitudes.

The appearance of these extra dimensions is due to the existence of many tiny sharp turns in the gesture, which is caused by the noise of the tracking device, trembling of the hand, and so on.

While it might be tempting to simply filter out these tiny vectors, it is important to remember that the lengths of these vectors are normalized by the total sum of all the vectors currently in the summary, and that the summary is updated online: as the user moves their hands and performs new gestures, old trajectories are dequeued from the summary and new ones are enqueued. The dequeued trajectories might have been very large strokes which caused all other strokes to be normalized to vectors with smaller magnitudes.

Thus, any distance function defined for the feature vectors should be able to calculate the distance of feature vectors with varying lengths. Moreover, the existence of negligible elements in the feature vector should not significantly increase the distance between two different feature vectors.

A distance function that satisfies both of these requirements is Dynamic Time Warping.

7.5.1 Dynamic Time Warping

Dynamic Time Warping (DTW) is a dynamic programming algorithm which gives the distance of two sequences after they are optimally warped so that they match each other. Formally, the DTW distance of two sequences X = (x_1, x_2, ..., x_n) and Y = (y_1, y_2, ..., y_m) is defined as:

D(X, Y) = f(n, m)

f(t, i) = d(x_t, y_i) + min{ f(t, i−1), f(t−1, i), f(t−1, i−1) }

f(0, 0) = 0,  f(t, 0) = f(0, i) = ∞

where d(x_t, y_i) is the distance between the elements x_t and y_i. For the purpose of calculating the DTW distance of two gesture summaries, a good distance function is simply the magnitude of the difference between the elements (notice that each element in the feature vector is itself a vector):

d(x, y) = |x − y|
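For concreteness, a plain Python version of this recurrence is sketched below, using the element distance just defined over 2D feature elements; it omits the optimizations an online variant would add.

import math

def element_distance(x, y):
    """d(x, y) = |x - y| for 2D feature elements (each element is a vector)."""
    return math.hypot(x[0] - y[0], x[1] - y[1])

def dtw_distance(X, Y):
    """Dynamic Time Warping distance between two feature vectors X and Y,
    following the recurrence above."""
    n, m = len(X), len(Y)
    INF = float("inf")
    f = [[INF] * (m + 1) for _ in range(n + 1)]
    f[0][0] = 0.0
    for t in range(1, n + 1):
        for i in range(1, m + 1):
            cost = element_distance(X[t - 1], Y[i - 1])
            f[t][i] = cost + min(f[t][i - 1], f[t - 1][i], f[t - 1][i - 1])
    return f[n][m]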

Figures 7.8 to 7.11 show the DTW distance between a number of gestures and a template star gesture.

There exists an online DTW algorithm [94]; thus, instead of rerunning DTW on all the elements in the feature vector whenever a new point comes in, it is possible to simply update the matrix with the newest arrival. Experiments showed, however, that the constant factor of the online version of DTW, combined with the added complexity, outweighs its benefits given the low-dimensional feature vectors used here.

7.6 Classification

The introduced distance function is used in a k-Nearest Neighbors (k-NN) classifier. A k-NN classifier assigns its input to the class to which the majority of its k nearest neighbors belong. ‘Nearest’ is determined using the DTW distance function, and the inputs are the feature vectors based on the normalized (rotation-free and scale-free) representation of the trajectory.

Because gestures can have a varying number of angles, the gesture summary and the resulting normalized representation have variable length. Since it is easier to work with fixed-length feature vectors, the feature vector is zero-padded to a fixed length. To do this, an upper bound for the length of the feature vector is set: the gestures in the training set are first summarized and normalized, and the maximum size of the normalized vectors, multiplied by two, is used as the fixed size of the feature vector. For the SHREC dataset, this leads to 20-dimensional feature vectors.

To find k (the number of neighbors for k-NN classification), a grid search is performed: for every k between 1 and 10, a classifier is trained on a random subset of the training dataset and evaluated on the remainder of the training dataset. The classifier with k = 3 leads to the highest classification accuracy (87.77%). As a result, a k-NN classifier with k = 3 is used for classifying the gestures.
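A sketch of the padding and grid-search steps is shown below. It reuses dtw_distance from the earlier DTW sketch; the helper names and the 70/30 split are assumptions made for illustration.

import random
from collections import Counter

def pad(features, length):
    """Zero-pad a variable-length feature vector to a fixed length."""
    return features + [(0.0, 0.0)] * (length - len(features))

def knn_classify(sample, training_set, k):
    """training_set: list of (feature_vector, label) pairs; distance is the
    dtw_distance function defined earlier."""
    neighbours = sorted(training_set, key=lambda item: dtw_distance(sample, item[0]))
    votes = Counter(label for _, label in neighbours[:k])
    return votes.most_common(1)[0][0]

def grid_search_k(training_set, k_values=range(1, 11), holdout=0.3):
    """Pick k by training on a random subset and validating on the rest."""
    data = list(training_set)
    random.shuffle(data)
    split = int(len(data) * (1 - holdout))
    train, validate = data[:split], data[split:]
    best_k, best_acc = None, -1.0
    for k in k_values:
        correct = sum(knn_classify(f, train, k) == label for f, label in validate)
        accuracy = correct / max(1, len(validate))
        if accuracy > best_acc:
            best_k, best_acc = k, accuracy
    return best_k, best_acc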

7.7 Providing Feedback

The presented gesture recognition algorithm allows continuous feedback to be given to the users. In order to provide this feedback, the point up to which each gesture template has been matched, as stored by the DTW algorithm, is first recovered. To this end, the DTW algorithm is augmented as follows:

D(X, Y) = f(n, m)

BestPath(X, Y) = g(n, m)

f(t, i) = d(x_t, y_i) + min{ f(t, i−1), f(t−1, i), f(t−1, i−1) }

g(t, i) = argmin{ f(t, i−1), f(t−1, i), f(t−1, i−1) }

f(0, 0) = 0,  f(t, 0) = f(0, i) = ∞

where ‘argmin’ returns the (t, i) pair that minimizes the DTW alignment at each step. After running DTW, the unmatched remainder of each gesture summary in the gesture vocabulary is returned and presented to the user. This acts both as a signifier and as feedback, since the user can use this information to understand the state of the gesture recognizer and what needs to be performed next in order to complete a particular gesture. See Figure 7.12.
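One possible way to obtain such an unmatched remainder is sketched below. This is an assumption about how the partial alignment could be computed (an open-ended DTW whose alignment may end anywhere in the template), not necessarily the exact bookkeeping of the augmented recurrence above; it reuses element_distance from the earlier DTW sketch.

def unmatched_remainder(X, template):
    """Align the observed partial summary X against a full gesture template
    and return the template's suffix that has not been matched yet."""
    n, m = len(X), len(template)
    INF = float("inf")
    f = [[INF] * (m + 1) for _ in range(n + 1)]
    f[0][0] = 0.0
    for t in range(1, n + 1):
        for i in range(1, m + 1):
            cost = element_distance(X[t - 1], template[i - 1])
            f[t][i] = cost + min(f[t][i - 1], f[t - 1][i], f[t - 1][i - 1])
    # Template prefix best explained by the observed trajectory so far.
    best_end = min(range(1, m + 1), key=lambda i: f[n][i])
    return template[best_end:]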

7.8 Evaluation and Results

The proposed gesture recognizer is evaluated on the SHREC dataset [16]. The dataset contains recordings of 13 users, each performing five gestures (as seen in Figure 7.1b). Each user records each gesture 3 times, so the whole dataset contains 195 recordings.

The gestures are recorded using a Leap Motion controller. Each recording contains the user's articulated hand skeletal data (positions of all the joints and the center of the palm) before, during, and after the gesture was performed. The recordings have a constant frame rate of 20 frames per second.

The dataset is split into a training set (performed by 4 users, i.e. 60 recordings), and a test set, which consists of the remaining 135 recordings.

To train the classifier, the training set is used. Each recording is split into two parts: gesture and non-gesture. Thus, the data is labeled with six classes: five classes correspond to the gesture types, and one corresponds to non-gestures. Each labeled recording then goes through the summarization and normalization steps and is used to train the k-NN classifier (with k = 3). Only the trajectory of the center of the palm was used for both training and testing.

The results were then compared to the gesture recognizers presented in [16]. It is important to mention that the competing gesture recognizers are not speed-, rotation-, or scale-invariant. Thus, to ensure a fair comparison, the normalization step was not used in this evaluation. The classifier was evaluated on the test data, and the rates of correct classifications, false positives, mislabelings, and missed gestures were calculated:


1. Correct Classification: Percentage of gestures that were correctly classified.

2. False Positives: Percentage of non-gestures (random trajectories of the user's hand in the air) incorrectly classified as a gesture.

3. Mislabeling: Percentage of gestures that were incorrectly classified.

4. Missed: Percentage of gestures that were not recognized as a gesture (within 2.5 seconds after the gesture was performed).

Table 7.1 presents the result of this comparison.

Method       Correctly Classified   Mislabeled   False Positive   Missed
ALG          84.7%                  9.9%         1.9%             3.5%
uDeepGRU2    85.2%                  7.4%         3%               4.4%
uDeepGRU1    79.3%                  8.1%         3%               9.6%
uDeepGRU3    79.3%                  8.1%         2.2%             10.4%
SW 3-cent    75.6%                  16.3%        2.2%             5.9%
DeA          51.9%                  18.5%        25.2%            4.4%
AJ-RN        28.1%                  43%          23%              5.9%
PI-RN        11.1%                  39.3%        48.9%            0.7%
Seg.LSTM1    11.1%                  28.9%        60%              0%
Seg.LSTM2    6.7%                   25.2%        68.1%            0%

Table 7.1: Comparison of the algorithm presented here (denoted as ALG) with the other gesture recognizers presented in [16].

As the results show, the algorithm presented here performs better than all of the other methods except one, and performs almost as well as the top contender, uDeepGRU2, with a correct classification rate of 84.7%. The rate of misclassifications, however, proved to be comparatively high.

To investigate this, the classification results for each individual gesture were examined further. While SQUARE and CIRCLE had low false positive rates, the V and CARET gestures showed a very high rate of false positives.

One observation is that gestures such as V and CARET have a very high likelihood of appearing unintentionally while the user is interacting with an in-air gesture recognition system. This is also clearly described in the related publication: “gestures chosen were quite simple and that makes difficult to distinguish them from other hand actions classified as non-gesture.” [16]

Another observation is that these gestures can occur while one is performing other gestures. For example, the X gesture can be seen as a combination of a V gesture and a CARET gesture. Because the gesture recognition happens online, the recognizer incorrectly reports two gestures: a V gesture and a CARET gesture.

Thus, it can be argued that the high false positive rate of the gesture recognition is a consequence of poor gesture selection: the set of gestures selected is not suitable for true online gesture recognition. Since any observed trajectory that exists in the set of templates is reported as a gesture, undesired effects occur when the gestures in the gesture set are subsequences of each other.

Most gesture recognizers, even online ones, address this problem by delaying the reporting of a gesture. This requires making assumptions about the gesture speed and tracking rate. Instead, it is possible to address this problem with the following complementary two-step approach:

1. The designer should select a small set of unambiguous gestures. These gestures should not be subsequences of each other. Moreover, the gestures should be simple to perform and unlikely to happen unintentionally.

2. Use the selected simple gestures as a basis for more complicated gestures.

While the burden of the first step is on the shoulders of the system designer and is dependent on the application and cultural context, it is relatively easy to manually select a small set of unambiguous gestures. The problem arises when the target application requires a large number of gestures. In the next section, a solution for addressing this problem is introduced.

7.9 Selecting unambiguous gestures

To state the posed problems more clearly, it is easier to treat gestures as a language. This, of course, is not unprecedented: research in psychology, linguistics, and neuroscience all testifies to the inherent link between gestures and natural languages. [46], for example, concludes that "language must have evolved in the oral–aural and kinesic modalities together". While this supports the idea of treating gestures as a language, or at least as a part of language, here the reasoning is mostly practical: treating gestures as a language allows the use of many of the algorithms, tools, and solutions that exist in the domain of languages. In particular, it facilitates finding optimal solutions for these three common problems:

1. Selecting unambiguous gestures

2. Defining an optimal set of gestural commands

3. Detecting gestural commands

Specifically, through this section, the following terminology is used:

Gesture Vocabulary is the set of gestures that a user can use to perform operations.

Referent is an action or operation that a gesture should invoke.

There is typically a one-to-one relation between a gesture and a referent.


Moreover, the following notations are used through this section:

A single gesture is denoted by a letter resembling the gesture trajectory, styled with an overline. For example O indicates a gesture with a circle trajectory. See Figure 7.1b.

A gesture sequence is denoted by a sequence of single gestures, styled with an overline. For example, OX indicates drawing an O gesture immediately followed by an X gesture.

When designing an interactive system with in-air gestures, system developers face some common problems. A main initial problem is which gestures to select for the target application.

This decision, of course, depends on soft (non-technical) factors such as the physical capabilities of the users, the dominant culture of the users, and also the appropriateness of the gestures to the command they are assigned to. What constitutes the appropriateness of a gesture to a command is dependent on the application as well as the target end-users.

Recently, the most common way of finding an appropriate gesture for a referent is by performing gesture elicitation studies, where users are presented with a set of referents and are asked to perform a gesture appropriate for each referent.

As the number of referents increases, the gesture elicitation process becomes more cumbersome: the diversity of gestures proposed by users increases, and it becomes more difficult to find consensus. Moreover, gesture studies tend to suffer from legacy bias [73], where users' experience with other user interfaces and modalities affects their proposed gestures. This can be problematic because gestures from other user interfaces might not be appropriate or even applicable to the gestural interface in question. For example, many users might suggest a pinch-and-zoom gesture for the zoom action, while the tracking system is not capable of tracking individual fingers.

Moreover, designers should make sure that the set of proposed gestures are easy to learn and distinguish by the users, and that the gesture recognition system is capable of recognizing them accurately and unambiguously.

Since in-air gestures suffer from the live-mic problem (as discussed in Chapter 6), this introduces another important question: how to distinguish between individual gestures without requiring the user to activate and deactivate an activation clutch.

Consider the following set of gestures: X, |, and ∞. Even though each gesture is distinct enough to be remembered by the users correctly, a |X sequence can look similar to an ∞ gesture (since drawing an ∞ in the air would, inevitably, look like a continuous X).

Thus, the designers either need to design the system to expect a time gap between gestures, or devise a gesture vocabulary which avoids this problem. While the former solution is applicable, it reduces the interaction speed and increases the rate of false negatives (since a user might start performing a gesture too early). The latter solution might lead to the design of overly complicated gestures. This section provides a method for automatically generating such a gesture vocabulary.


7.9.1 Problem Statement

Definition 7.9.1. (Unambiguous Set) Given a set of gestures G, and P being the set of all permutations of G, a set S ⊆ P is an unambiguous set if and only if its elements do not share any common prefix.

As an example, given G = {X, O}, the set S = {X, OX} is an unambiguous set, while the set T = {X, XO} is not, because X and XO share the common prefix X.

Problem 1. Given a gesture vocabulary G, construct an unambiguous set of gesture sequences S.

Definition 7.9.2. (Code, Codeword) Given an alphabet Σ = {σ1, . . . , σn} and a set of words W = {w1, . . . , wn}, a code C is a mapping from W to Σ*. C(wj) is called the codeword for word wj.

Definition 7.9.3. (Prefix-free Set) A set S ⊂ Σ* is a prefix-free set if and only if none of its elements is a prefix of another element.

Definition 7.9.4. (Prefix Code) A prefix code is a code whose set of codewords is a prefix-free set.

Thus, a gesture vocabulary is unambiguous if and only if it is a prefix-free set. Therefore, to construct an unambiguous gesture vocabulary from a set of gestures, it is enough to create a prefix code for the target actions (referents) using the set of gestures as the alphabet.
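Checking the prefix-free property itself is straightforward. The helper below is purely illustrative and represents gesture sequences as tuples of gesture names.

def is_prefix_free(sequences):
    """True if no gesture sequence is a prefix of another, i.e. the set of
    sequences forms an unambiguous gesture vocabulary."""
    seqs = [tuple(s) for s in sequences]
    for i, a in enumerate(seqs):
        for j, b in enumerate(seqs):
            if i != j and b[:len(a)] == a:
                return False
    return True

# Using the single gestures of Figure 7.1b as the alphabet:
print(is_prefix_free([("X",), ("O", "X")]))  # True: unambiguous
print(is_prefix_free([("X",), ("X", "O")]))  # False: "X" is a prefix of "X O"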

Since a sequence of prefix codewords is uniquely decodable (that is, words can be unambiguously identified without the need for a special marker between them), prefix codes are widely used in communication, as they reduce the number of bits required to send the same amount of data. In order to further reduce the cost of communication, algorithms such as Huffman coding [43] minimize the length of the prefix codewords by ensuring that the more frequent a word, the shorter its codeword.

The Huffman algorithm achieves this by constructing a binary tree which puts the more frequent words closer to the root. See Figure 7.13 for an example.


Figure 7.13: A binary tree representing a prefix code. To assign a codeword to each word wj, one traverses the tree from the root to the corresponding leaf node and concatenates the edge labels. The tree presented here represents the following codewords: C(w1) = 0, C(w2) = 10, C(w3) = 110, C(w4) = 111.


Even though Huffman coding is mainly used for communication and thus typically operates on a binary alphabet, it can be generalized to n-ary alphabets as well.

7.9.2 N-ary Huffman Codes

Let Σ be an alphabet {a1, a2, ..., an}, W be a set of words {w1, w2, ..., wm}, and F be the corresponding word frequencies, i.e. F = {fi | fi = frequency(wi), 0 ≤ fi ≤ 1, 1 ≤ i ≤ m}, where the frequencies sum to 1 over W. An n-ary tree is built by following the algorithm below:

1. If necessary, add dummy symbols with frequency 0 to W, so that m ≡ 1 (mod n − 1).

2. Create a leaf node for each symbol and add it to the priority queue.

3. While there is more than one node in the queue:

a) Remove the n nodes of highest priority (lowest frequency) from the queue.

b) Create a new internal node with these n nodes as children and with frequency equal to the sum of the n nodes’ frequencies.

c) Add the new node to the queue.

4. The remaining node is the root node and the tree is complete.

Generally, the frequency function in Huffman coding is taken to be the probability of each word, which guarantees a minimal expected length of a coded sequence.

7.9.3 Using n-ary Huffman codes for selecting unambiguous gestures

Now, letting the alphabet Σ be the set of gestures and the set of words W be the set of referents, it is easy to see that the set of codewords constructed using the n-ary Huffman algorithm is an unambiguous gesture set.

It is important to note that the system designers need to provide the frequencies of the referents to be able to use the Huffman algorithm. While this should ideally be done by measuring the frequency of referents in real runs of the application through user studies, it is worth mentioning that the frequency of the referents (that is, the operations) is sometimes independent of the input modality of the application, so it is possible to measure operation frequency using other input modalities. This makes measuring the referent frequencies easier, since designers can build prototypes, or use an available version of the same application with conventional user interfaces (such as GUI or touch), to measure the frequency of operations and use it as a proxy for the referent frequencies.
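The construction described in Section 7.9.2, applied to referents and gestures, can be sketched as follows. The referent names and frequencies in the example are placeholders, not values from the thesis.

import heapq
import itertools

def gesture_huffman(referent_freqs, gestures):
    """Build an n-ary Huffman code over the gesture alphabet.
    referent_freqs: dict mapping referent -> relative frequency.
    gestures: list of basic gestures, e.g. ["X", "O", "V"].
    Returns a dict mapping each referent to its gesture sequence."""
    n = len(gestures)
    tie = itertools.count()  # tie-breaker so heap entries never compare dicts
    heap = [(freq, next(tie), {ref: ()}) for ref, freq in referent_freqs.items()]
    # Pad with zero-frequency dummies so the leaf count m satisfies
    # m ≡ 1 (mod n - 1), which lets every merge take exactly n nodes.
    while n > 2 and (len(heap) - 1) % (n - 1) != 0:
        heap.append((0.0, next(tie), {}))
    heapq.heapify(heap)
    while len(heap) > 1:
        merged, total = {}, 0.0
        for gesture in gestures:  # one edge label (gesture) per child
            freq, _, codes = heapq.heappop(heap)
            total += freq
            for ref, code in codes.items():
                merged[ref] = (gesture,) + code
        heapq.heappush(heap, (total, next(tie), merged))
    return heap[0][2]

# Illustrative referents and frequencies (placeholders).
freqs = {"select": 0.5, "delete": 0.25, "undo": 0.15, "group": 0.10}
print(gesture_huffman(freqs, ["X", "O", "V"]))

By construction, the resulting gesture sequences are prefix-free, and the most frequent referents receive the shortest sequences.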


7.10 Conclusion

This chapter proposed a method for improving the interaction properties of in-air interaction by introducing an online in-air gesture recognizer. While the main design criterion for the gesture recognizer was the ability to provide continuous, interpretable feedback, the resulting gesture recognizer outperforms most of the state-of-the-art recognizers in terms of classification accuracy, even though it uses a much simpler approach for detecting and classifying the gestures.

While the algorithm presented in this chapter enables fast classification of gestures, it has a major limitation: the proposed recognizer is greedy. That is, the gesture recognizer reports the detection of a gesture as soon as a trajectory similar to a gesture in the gesture vocabulary is observed. This causes problems when the gesture vocabulary is ambiguous, that is, when some of the gestures in the vocabulary are subsequences of other gestures. While designing a small unambiguous gesture set is relatively easy, it becomes challenging as the size of the vocabulary grows. Thus, another contribution of this chapter is a method for automatically generating a large unambiguous gesture vocabulary using a smaller one as the basis.


(a) A star gesture

Magnitude: 0.19  0.19  0.21  0.20  0.20
Angle: 0°  140°  142°  140°  143°

(b) Feature vector

(c) DTW Matrix

Figure 7.8: Matrix of DTW distance (c) between the star gesture in (a) and the gesture in Figure 7.7. The shades of gray indicate the difference between each vector element (the darker the shade, the less the distance). The red line indicates the path the minimum distance is found on. The distance is calculated as 0.0148


Magnitude: 0.19  0.19  0.01  0.21  0.20  0.16  0.01
Angle: 0°  140°  140°  116°  140°  144°  131°

(a) Feature Vector

(a) Feature Vector

(b) DTW Matrix

Figure 7.9: DTW distance between a star gesture and another star gesture which includes a noise vector. The shades of gray indicate the difference between each vector element (the darker the shade, the less the distance). The red line indicates the path the minimum distance is found on. The distance is calculated as 0.0136


Magnitude: 0.19  0.19  0.01  0.21  0.20  0.21  0.01  0.02
Angle: 0°  140°  140°  142°  140°  143°  131°  131°

Figure 7.10: DTW distance between a star gesture and another star gesture which includes a noise vector. The shades of gray indicate the difference between each vector element (the darker the shade, the less the distance). The red line indicates the path the minimum distance is found on. The distance is calculated as 0.0138


Magnitude: 0.35  0.35  0.24  0.06
Angle: 0°  138°  91°  80°

Figure 7.11: DTW distance between a circle gesture and a star gesture. The shades of gray indicate the difference between each vector element (the darker the shade, the less the distance). The red line indicates the path the minimum distance is found on. The distance is calculated as 0.0879


Figure 7.12: Presenting partial matches of each gesture as a form of feedback to the user. Each feature in the feature vector is represented as a small square. Features of similar colors are sufficiently similar. Since each feature is a vector, sufficiently similar features are vectors with similar size and direction. Grayed features are features that are not yet matched. The unmatched features are then represented as feedback to the user. That is, the user needs to perform the presented sequence in order to finish the gesture.


8

Conclusion and Future Work

This thesis presented techniques, methods, and algorithms for enhancing digital collaboration by incorporating in-air gestures. These methods addressed two separate use cases for in-air gestures in collaborative environments: to facilitate the communication between participants, and to facilitate the interaction between participants and the computer system.

First, this thesis evaluated the feasibility of using commodity hardware for tracking in-air gestures in different collaborative environments, where either a horizontal or a vertical interactive display acts as the main input and output device. In particular, it investigated the problems, mainly caused by interference, that arise when depth cameras are used to track pointing gestures, and provided solutions to overcome them. These solutions did not require any change to the hardware internals, and they enable simple, quick, and low-cost setups for tracking in-air gestures in collaborative environments.

These setups were then used in two particular collaborative scenarios in order to enhance communication between participants:

• One scenario addressed the problem of communicating pointing gestures to blind and visually impaired participants in collaborative brainstorming using mindmaps. By detecting the target of pointing gestures on the mindmap, as well as detecting keywords such as ‘this’, ‘there’, and ‘that’, the system informed BVI participants about the occurrence of a deictic gesture and the content it pointed to, thus giving them a better understanding of the non-verbal communication among sighted participants. The outputs of the system were communicated to the BVI participant using a dedicated Braille display as well as a screen reader. A number of user studies showed that all the BVI participants evaluated this system positively.

• Another scenario investigated how to effectively communicate in-air gestures to remote participants in remote collaborative brainstorming sessions, where one or more users participate remotely using a handheld touch device, while the others are located in a collaborative room with a vertical display. In such a scenario, tracking the deictic gestures of all the collocated participants and communicating them to the remote participants is of little practical value, mainly because commodity hardware does not provide enough accuracy for tracking deictic gestures in a large room. Instead, a view of the interactive room was provided to the remote participants, which they could interact with. Remote participants could then ‘point’ to specific areas on the interactive display by touching the area of interest on their handheld device. This was then communicated to the collocated participants by highlighting the touched area. Interestingly, some collocated users also used this feature to perform deictic gestures, instead of simply pointing to the screen, because they found it to better capture the attention of the other collocated users.

Finally, another contribution of this thesis is a method for improving the interaction properties of in-air interactions (i.e. when in-air gestures are used to communicate with the shared computer system) by introducing an online in-air gesture recognizer. While the main design criterion for the gesture recognizer was the ability to provide continuous, interpretable feedback, the resulting recognizer outperforms most state-of-the-art recognizers in terms of classification accuracy, even though it uses a much simpler approach for detecting and classifying gestures. While most of the existing work treats in-air gesture recognition as a special case of action recognition, this thesis looked at the gesture recognizer as part of an interactive system; thus, the recognizer was designed with principles of interaction in mind. This is further extended with the introduction of a gesture recommendation method that takes usability into account by proposing gesture combinations that minimize fatigue, a common problem in in-air interaction.

All of these contributions facilitate the design and usage of new in-air gestural interactive systems and allow for faster prototyping and iteration.

While these methods were shown to work, there are still limitations and shortcomings that require further research. In particular, this thesis showed that further research on improving the usability of in-air gestures for human-computer interaction is of value.

In the following sections, three future research directions are discussed.

8.1 Optimizing In-Air Gestures for Minimum Fatigue

One of the main usability issues of in-air gestures is the problem of arm fatigue. Both static and dynamic muscle contractions can result in local muscle fatigue when the endurance time for the muscle is exceeded [9].

In the domain of user interfaces, arm fatigue was first studied in the context of interaction with vertical touch screens, where users experienced arm fatigue (referred to as the gorilla-arm effect [11]) while interacting with the screen.

Fatigue is defined as “an exercise-induced decrease in the maximal force produced by a muscle” [24], and has an inverse relation to endurance, “the maximum amount of time that a muscle can maintain a contraction level before needing rest” [41].

In order to optimize a gesture vocabulary for minimum fatigue, it is first necessary to quantify the fatigue. Quantifying muscular fatigue is a well-known topic in ergonomics and human factors research. For example, fatigue and endurance are used as metrics to evaluate the ergonomics of a task and to predict and prevent musculoskeletal disorders.

A classic study of fatigue and endurance was conducted by Rohmert [92], which resulted in a well-known intensity-endurance curve, now used as a baseline for further research on muscular fatigue and endurance. Such curves result from measuring the intensity of a task, expressed as a percentage, against the endurance time in seconds. The results are then used to form an empirical model, often expressed as an exponential or power function.

While much research uses invasive measurement tools such as EMG (electromyography) [96], simple, non-invasive measurements have provided good results for measuring fatigue in in-air gestures. As an example, Hincapié-Ramos et al. [41] introduce a simple method for measuring endurance for in-air gestures based on Rohmert’s formulation of endurance (E, expressed in seconds) as a function of intensity (I) [92]:

E(I) = \frac{c_1}{(I - c_2)^{c_3}} - c_4 \qquad (8.1)

where c_1, c_2, c_3, and c_4 are empirically determined constants (c_1 = 1236.5, c_2 = 0.15, c_3 = 0.618, c_4 = 72.5) [41], and I is the normalized intensity, defined as “the ratio between the average torque applied in the interaction and the maximum torque” [41]. While the maximum shoulder torque can be found in the literature, a skeletal tracking system such as Kinect is used to estimate the shoulder torque for each interaction. Finally, Consumed Endurance (CE) is defined as the ratio of the interaction time to the endurance at the shoulder joint.
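A minimal sketch of this computation, using the constants from [41] quoted above and assuming the normalized intensity has already been estimated from skeletal tracking, could look as follows (function names are illustrative):

```python
# Endurance / Consumed Endurance (CE) sketch. Constants follow
# Hincapié-Ramos et al. [41]; the torque estimation from skeletal data is
# not reproduced here and the intensity is assumed to be given.

C1, C2, C3, C4 = 1236.5, 0.15, 0.618, 72.5

def endurance_seconds(intensity: float) -> float:
    """Rohmert-style endurance E(I) in seconds for a normalized intensity I
    (average applied shoulder torque divided by maximum shoulder torque)."""
    if intensity <= C2:
        # Guard (an assumption): below this threshold the model diverges,
        # i.e. endurance is treated as effectively unlimited.
        return float("inf")
    return C1 / (intensity - C2) ** C3 - C4

def consumed_endurance(interaction_time_s: float, intensity: float) -> float:
    """CE = interaction time divided by endurance at the shoulder joint."""
    return interaction_time_s / endurance_seconds(intensity)

# Example: a 30-second in-air interaction at 40% of maximum shoulder torque.
print(consumed_endurance(30.0, 0.40))
```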

While [41] uses CE as a measure for designing gestures, expressing fatigue as a quantifiable measure also allows for computationally generating large gesture sets. For example, such a measure can be used as a cost function for generating larger gesture sets from a smaller one using the Huffman algorithm with unequal letter costs [33, 30] (a simplified, cost-ranked illustration follows). This can be a valuable future research direction, as it can simplify the manual task of designing large gesture vocabularies.
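As a toy illustration of the cost-driven idea (deliberately much simpler than the unequal-letter-cost Huffman construction of [33, 30]): given assumed per-primitive fatigue costs, fixed-length primitive sequences, which are trivially unambiguous among themselves, can be ranked by total cost and assigned to commands in order of frequency. All primitive names and cost values below are illustrative assumptions.

```python
# Toy sketch of cost-aware vocabulary generation. A real construction would
# use the unequal-letter-cost Huffman algorithm over variable-length,
# prefix-free sequences; here fixed-length sequences keep the example simple.

from itertools import product

primitive_cost = {"swipe": 0.8, "circle": 1.5, "push": 1.1}   # assumed CE-like costs

def ranked_sequences(length=2):
    """All fixed-length primitive sequences, cheapest total cost first.
    Fixed-length sequences are trivially unambiguous among themselves."""
    combos = product(primitive_cost, repeat=length)
    return sorted(combos, key=lambda c: sum(primitive_cost[p] for p in c))

def assign_gestures(commands_by_frequency, length=2):
    """Give the most frequent commands the least fatiguing gesture sequences."""
    return dict(zip(commands_by_frequency, ranked_sequences(length)))

print(assign_gestures(["copy", "paste", "undo", "delete"]))
# e.g. {'copy': ('swipe', 'swipe'), 'paste': ('swipe', 'push'), ...}
```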

8.2 Effects on gesture guessability and gesture elicitation studies

The common practice in selecting a gesture vocabulary is the gesture elicitation process, a user-centric approach that aims to achieve a guessable and appropriate gesture set. As a consequence, many gesture datasets, which most gesture recognizers are designed and optimized for, consist of the gestures that come out of these studies. One common theme in gesture elicitation studies is that they all assume a gesture to be performed without any feedback from the computer system. While varied in details and techniques, the core of this process is to present users with an effect (referent) and ask them to suggest a gesture for it.


One important direction of research is then to study how adding feedback and signifiers affects the selection of gestures, and in particular how incorporating principles of interaction in a gesture elicitation study affects the suggested gestures. For example, will users come up with simpler gesture sets if they already see some kind of feedback?

A more fundamental question is how much value gesture elicitation adds if the interactive system is equipped with real-time feedback and appropriate signifiers. In other words, how important is guessability if the system is made discoverable by providing enough feedback, signifiers, and so on?

8.3 Benchmarks and datasets for gesture recognizers

Another direction is in developing better metrics for gesture recognizers. So far, most research in gesture recognizers is focused on the accuracy of classification. This is mainly due to the fact that in-air gesture recognition is approached as an action recognition problem. This ignores the use case of in-air gesture recognition: when used as an input modality for an interactive system, a user is ‘interacting’ with the gesture recognizer. For example, a gesture recognizer that has a high classification accuracy, but does not inform the user about the reason behind a false classification, might be worse than one with lower classification accuracy but capable of providing useful feedback. Similarly, a gesture recognizer with a higher false-negative rate might be better than one with a higher false-positive rate, because the latter requires users to perform an action to undo the false action and then redo the intended gesture, while the former does not require an undo step.

Consequently, almost all the publicly available datasets for in-air gestures are collected in isolation: a number of users are asked to perform a set of gestures in front of a camera or a tracking device.

One way of taking such effects into account is to define benchmark applications and collect gesture datasets while users interact with these applications. Since most gesture recognizers are tested and trained on such datasets, this approach can provide a meaningful improvement in the quality of gesture recognizers.

List of Figures

1.1 Digital collaboration spaces...... 2

2.1 Handheld in-air input devices...... 12
2.2 Different types of data gloves...... 12
2.3 FingARTips setup...... 13
2.4 FingARTips’ simplified model of the hand...... 13
2.5 Image-based hand glove using a camera and visual markers...... 14
2.6 A color data glove...... 15

3.1 Ratio of different gesture types to the total gestures...... 20
3.2 Different types of pointing gestures, and their ratio to the total number of pointing gestures...... 21
3.3 Design of the bird’s eye view of Kinect over PixelSense...... 22
3.4 The realized setup of Kinect and PixelSense...... 23
3.5 How Kinect and PixelSense see a user and his hand...... 24
3.6 Size and duration of real and ghost touches...... 24
3.7 Linear polarization filter for attenuating Kinect’s effect on PixelSense...... 25
3.8 User’s hand, arm and fingers as seen by the depth camera...... 26
3.9 Multiple Leap Motions set up around a PixelSense tabletop...... 27
3.10 The touch position is inferred by the camera and the tabletop in different coordinate systems...... 27
3.11 A multiple Kinect V2 setup to extend the tracking range of users walking in a straight walkway...... 28
3.12 Sensor arrangements for interference measurement...... 29
3.13 Depth error between two Kinect V2s, when both face the same target...... 29
3.14 A Kinect V2 facing a vertical display...... 30
3.15 Two Kinect V2s facing a vertical display...... 31


4.1 Sighted participants and the BVI participant collaborating using the realized setup.... 36
4.2 Graphical User Interface for the BVI participants...... 37
4.3 Error span of tracking...... 37
4.4 Duration of gestures in a brainstorming meeting...... 38

5.1 User interfaces for generating notes, and the shared screen view...... 43
5.2 A commodity tablet equipped with a 360° lens...... 43
5.3 Setup for capturing the full view of the environment...... 44
5.6 Remote participants can rotate and zoom the view for a more immersive experience... 44
5.4 Projection of the panoramic video into a cube...... 45
5.5 Moderator’s avatar can be seen in the remote participants’ device...... 46
5.7 Through-glass view...... 46

6.1 Gulfs of Execution and Evaluation are unidirectional...... 53
6.2 Seven stages of the action cycle...... 54
6.3 CLI with live feedback...... 56
6.4 Pinch to zoom...... 58
6.5 Overview of the pipeline of a typical vision-based gesture recognizer...... 60
6.6 Gesture detection using the confidence of the prediction...... 62
6.7 Overview of Kinect skeletal tracking pipeline...... 63
6.8 Gesture templates for SHREC...... 64
6.9 SHREC gestures as performed by a user...... 64

7.1 Gesture datasets...... 69
7.2 Overview of the gesture recognizer...... 70
7.3 An example of resampling in the $1 recognizer...... 71
7.4 Summarizing algorithm...... 72
7.5 Applying the summarizing algorithm on a star point cloud...... 72
7.6 Running of the downsampling algorithm while a gesture is being performed...... 74
7.7 A star gesture summary and its time-, orientation-, and scale-invariant representation.. 75
7.13 A binary tree representing a prefix code...... 82
7.8 Matrix of DTW distance between two star gestures...... 85
7.9 DTW distance between a star gesture and a noisy star gesture - 1...... 86
7.10 DTW distance between a star gesture and a noisy star gesture - 2...... 87
7.11 DTW distance between a circle gesture and a star gesture...... 88
7.12 Presenting partial matches of each gesture as a form of feedback to the user...... 89

List of Tables

3.1 Comparison of commodity depth cameras. H x V in Field of View column stands for Horizontal and Vertical, respectively...... 19

7.1 Comparison of the algorithm presented here (denoted as ALG) with other gesture recognizers presented in [16]...... 79


Bibliography

[1] M. Abavisani, H. R. V. Joze, and V. M. Patel, “Improving the performance of unimodal dynamic hand-gesture recognition with multimodal training,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 1165–1174.

[2] H. Abdelnasser, M. Youssef, and K. A. Harras, “Wigest: A ubiquitous wifi-based gesture recog- nition system,” in 2015 IEEE Conference on Computer Communications (INFOCOM). IEEE, 2015, pp. 1472–1480.

[3] A. Alavi and A. Kunz, “In-air eyes-free text entry: A work in progress,” in 20th ACM Conference on Intelligent User Interfaces (ACM IUI 2015). ETH Zürich, 2015.

[4] A. Alavi and A. Kunz, “Tracking deictic gestures over large interactive surfaces,” Computer Sup- ported Cooperative Work (CSCW), vol. 24, no. 2-3, pp. 109–119, 2015.

[5] S. Andolina, H. Schneider, J. Chan, K. Klouche, G. Jacucci, and S. Dow, “Crowdboard: augment- ing in-person idea generation with real-time crowds,” in Proceedings of the 2017 ACM SIGCHI Conference on Creativity and Cognition, 2017, pp. 106–118.

[6] L. Anthony and J. O. Wobbrock, “A lightweight multistroke recognizer for user interface proto- types,” Proceedings of Graphics Interface 2010, 2010.

[7] L. Anthony and J. O. Wobbrock, “$N-Protractor: A Fast and Accurate Multistroke Recognizer,” Proceedings of Graphics Interface 2012, 2012.

[8] D. Archambault, D. Fitzpatrick, G. Gupta, A. I. Karshmer, K. Miesenberger, and E. Pontelli, “Towards a universal maths conversion library,” in International Conference on Computers for Handicapped Persons. Springer, 2004, pp. 664–669.

[9] A. Bhattacharya and J. D. McGlothlin, Occupational ergonomics: theory and applications – p. 75. CRC Press, 2012.


[10] R. A. Bolt, “‘Put-that-there’: Voice and gesture at the graphics interface,” in Proceedings of the 7th Annual Conference on Computer Graphics and Interactive Techniques, ser. SIGGRAPH ’80. New York, NY, USA: ACM, 1980, pp. 262–270. [Online]. Available: http://doi.acm.org/10.1145/800250.807503

[11] S. Boring, M. Jurmu, and A. Butz, “Scroll, tilt or move it: Using mobile phones to continu- ously control pointers on large public displays,” Proceedings of the 21st Annual Conference of the Australian Computer-Human Interaction Special Interest Group - Design: Open 24/7, OZCHI ’09, vol. 411, pp. 161–168, 2009. [Online]. Available: https://www.medien.ifi.lmu.de/pubdb/ publications/pub/boring2009ozchi/boring2009ozchi.pdf

[12] M. Brereton, N. Bidwell, J. Donovan, B. Campbell, and J. Buur, “Work at hand: An exploration of gesture in the context of work and everyday life to inform the design of gestural input devices,” in Proceedings of the Fourth Australasian User Interface Conference on User Interfaces 2003 - Volume 18, ser. AUIC ’03. Darlinghurst, Australia: Australian Computer Society, Inc., 2003, pp. 1–10. [Online]. Available: http://dl.acm.org/citation.cfm?id=820086.820088

[13] F. Brudy, J. K. Budiman, S. Houben, and N. Marquardt, “Investigating the role of an overview device in multi-device collaboration,” in Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems, 2018, pp. 1–13.

[14] V. Buchmann, S. Violich, M. Billinghurst, and A. Cockburn, “FingARtips - Gesture based direct manipulation in augmented reality,” Proceedings GRAPHITE 2004 - 2nd International Confer- ence on Computer Graphics and Interactive Techniques in Australasia and Southeast Asia, pp. 212–221, 2004. [Online]. Available: http://www.cosc.canterbury.ac.nz/andrew.cockburn/papers/ fingartips.pdf

[15] Z. Cao, G. Hidalgo, T. Simon, S.-E. Wei, and Y. Sheikh, “OpenPose: realtime multi-person 2D pose estimation using Part Affinity Fields,” in arXiv preprint arXiv:1812.08008, 2018.

[16] F. M. Caputo, S. Burato, G. Pavan, T. Voillemin, H. Wannous, J. P. Vandeborre, M. Maghoumi, and E. M. Taranta II, “SHREC 2019 Track: Online Gesture Recognition,” 2019. [Online]. Available: https://www.maghoumi.com/wp-content/uploads/2019/08/SHREC2019-OnlineGestureRecognition.pdf

[17] F. M. Caputo, P. Prebianca, A. Carcangiu, L. D. Spano, and A. Giachetti, “Comparing 3d trajec- tories for simple mid-air gesture recognition,” Computers & Graphics, vol. 73, pp. 17–25, 2018.

[18] S. K. Card, G. G. Robertson, and J. D. Mackinlay, “The information visualizer, an information workspace,” in Proceedings of the SIGCHI Conference on Human factors in computing systems. ACM, 1991, pp. 181–186.

[19] A. Clayphan, A. Collins, C. Ackad, B. Kummerfeld, and J. Kay, “Firestorm: a brainstorming application for collaborative group work at tabletops,” in Proceedings of the ACM international conference on interactive tabletops and surfaces, 2011, pp. 162–171.

[20] Q. De Smedt, H. Wannous, and J. P. Vandeborre, “Skeleton-Based Dynamic Hand Gesture Recog- nition,” in IEEE Computer Society Conference on Computer Vision and Pattern Recognition Work- shops, 2016.


[21] J. Eisenstein and R. Davis, “Visual and linguistic information in gesture classification,” in Pro- ceedings of the 6th International Conference on Multimodal Interfaces, ser. ICMI ’04. New York, NY, USA: ACM, 2004, pp. 113–120. [Online]. Available: http://doi.acm.org/10.1145/1027933. 1027954

[22] K. M. Everitt, S. R. Klemmer, R. Lee, and J. A. Landay, “Two worlds apart: bridging the gap between physical and virtual media for distributed design collaboration,” in Proceedings of the SIGCHI conference on Human factors in computing systems, 2003, pp. 553–560.

[23] A. Gams and P.-A. Mudry, “Gaming controllers for research robots: controlling a humanoid robot using a WIIMOTE,” Proc. of the 17th Int. Electrotechnical and Computer Science Conference (ERK 2008), pp. 191–194, 2008. [Online]. Available: http://www.ieee.si/erk08/

[24] S. Gandevia, G. M. Allen, J. E. Butler, and J. L. Taylor, “Supraspinal factors in human muscle fatigue: evidence for suboptimal output from the motor cortex.” The Journal of physiology, vol. 490, no. 2, pp. 529–536, 1996.

[25] D. M. Gavrila, “The visual analysis of human movement: A survey,” Computer vision and image understanding, vol. 73, no. 1, pp. 82–98, 1999.

[26] L. Ge, H. Liang, J. Yuan, and D. Thalmann, “3D convolutional neural networks for efficient and robust hand pose estimation from single depth images,” Proceedings - 30th IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, vol. 2017-January, pp. 5679–5688, 2017. [Online]. Available: http://openaccess.thecvf.com/content_cvpr_2017/papers/Ge_3D_Convolutional_Neural_CVPR_2017_paper.pdf

[27] D. J. Geerse, B. H. Coolen, and M. Roerdink, “Kinematic validation of a multi-kinect v2 in- strumented 10-meter walkway for quantitative gait assessments,” PloS one, vol. 10, no. 10, p. e0139913, 2015.

[28] F. Geyer, U. Pfeil, A. Höchtl, J. Budzinski, and H. Reiterer, “Designing reality-based interfaces for creative group work,” in Proceedings of the 8th ACM conference on Creativity and cognition, 2011, pp. 165–174.

[29] F. Geyer, J. Budzinski, and H. Reiterer, “Ideavis: a hybrid workspace and interactive visualization for paper-based collaborative sketching sessions,” in Proceedings of the 7th Nordic Conference on Human-Computer Interaction: Making Sense Through Design, 2012, pp. 331–340.

[30] Y. Gheraibia, S. Kabir, A. Moussaoui, and S. Mazouzi, “Optimised cost considering huffman code for biological data compression,” International Journal of Information and Communication Technology, vol. 13, no. 3, pp. 275–290, 2018.

[31] O. Glauser, S. Wu, D. Panozzo, O. Hilliges, and O. Sorkine-Hornung, “Interactive hand pose estimation using a stretch-sensing soft glove,” ACM Transactions on Graphics, vol. 38, no. 4, pp. 1–15, 2019. [Online]. Available: http://dl.acm.org/citation.cfm?doid=3306346.3322957

[32] D. Glove, “Historical development of hand gesture recognition,” Cognitive Science and Technology, vol. PartF1, pp. 5–29, 2014. [Online]. Available: https://pdfs.semanticscholar.org/3416/9a33e0666bb82fbe927f5d6020e2f28bef96.pdf

[33] M. J. Golin, C. Kenyon, and N. E. Young, “Huffman coding with unequal letter costs,” Confer- ence Proceedings of the Annual ACM Symposium on Theory of Computing, pp. 785–791, 2002. [Online]. Available: https://www.cs.ucr.edu/{~}neal/Golin02Huffman.pdf

[34] C. Groenewald, C. Anslow, J. Islam, C. Rooney, P. Passmore, and W. Wong, “Understanding 3d mid-air hand gestures with interactive surfaces and displays: A systematic literature review,” in Proceedings of the 30th International BCS Human Computer Interaction Conference: Fusion!, ser. HCI ’16. Swindon, UK: BCS Learning & Development Ltd., 2016, pp. 43:1–43:13. [Online]. Available: https://doi.org/10.14236/ewic/HCI2016.43

[35] R. Gumienny, L. Gericke, M. Quasthoff, C. Willems, and C. Meinel, “Tele-board: Enabling effi- cient collaboration in digital design spaces,” in Proceedings of the 2011 15th International Con- ference on Computer Supported Cooperative Work in Design (CSCWD). IEEE, 2011, pp. 47–54.

[36] I. Guyon, V. Athitsos, P. Jangyodsuk, and H. J. Escalante, “The ChaLearn gesture dataset (CGD 2011),” Machine Vision and Applications, vol. 25, no. 8, pp. 1929–1951, 2014.

[37] N. Habershon, “Metaplan (r): Achieving two-way communications,” Journal of European Indus- trial Training, 1993.

[38] O. Hilliges, L. Terrenghi, S. Boring, D. Kim, H. Richter, and A. Butz, “Designing for collaborative creative problem solving,” in Proceedings of the 6th ACM SIGCHI conference on Creativity & cognition, 2007, pp. 137–146.

[39] O. Hilliges, S. Izadi, A. D. Wilson, S. Hodges, A. Garcia-Mendoza, and A. Butz, “Interactions in the air: adding further depth to interactive tabletops,” Proceedings of the 22nd annual ACM symposium on User interface software and technology, pp. 139–148, 2009. [Online]. Available: http://portal.acm.org/citation.cfm?doid=1622176.1622203

[40] O. Hilliges, S. Izadi, A. D. Wilson, S. Hodges, A. Garcia-Mendoza, and A. Butz, “Interactions in the air: Adding further depth to interactive tabletops,” in Proceedings of the 22Nd Annual ACM Symposium on User Interface Software and Technology, ser. UIST ’09. New York, NY, USA: ACM, 2009, pp. 139–148. [Online]. Available: http://doi.acm.org/10.1145/1622176.1622203

[41] J. D. Hincapié-Ramos, X. Guo, P. Moghadasian, and P. Irani, “Consumed endurance: A metric to quantify arm fatigue of mid-air interactions,” Conference on Human Factors in Computing Systems - Proceedings, pp. 1063–1072, 2014. [Online]. Available: http://hci.cs.umanitoba.ca/assets/publication_files/Consumed_Endurance_-_CHI_2014.pdf

[42] D. Hix and H. R. Hartson, Developing User Interfaces: Ensuring Usability Through Product & Process. New York, NY, USA: John Wiley & Sons, Inc., 1993.

[43] D. A. Huffman, “A method for the construction of minimum-redundancy codes,” Proceedings of the IRE, vol. 40, no. 9, pp. 1098–1101, 1952.

[44] M. Karam and m. c. Schraefel, “A Taxonomy of Gestures in Human Computer Interactions,” Technical Report, Electronics and Computer Science, pp. 1–45, 2005. [Online]. Available: http://eprints.soton.ac.uk/261149/

[45] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei, “Large-scale video classification with convolutional neural networks,” in Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, 2014, pp. 1725–1732.

[46] A. Kendon, “Reflections on the “gesture-first” hypothesis of language origins,” Psychonomic bul- letin & review, vol. 24, no. 1, pp. 163–170, 2017.

[47] A. Kendon, “Current issues in the study of gesture,” The biological foundations of gestures: Motor and semiotic aspects, vol. 1, pp. 23–47, 1986.

[48] G. D. Kessler, L. F. Hodges, and N. Walker, “Evaluation of the CyberGlove as a Whole-Hand Input Device,” ACM Transactions on Computer-Human Interaction (TOCHI), vol. 2, no. 4, pp. 263–283, 1995. [Online]. Available: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.499.9332&rep=rep1&type=pdf

[49] S. Kettebekov, “Exploiting prosodic structuring of coverbal gesticulation,” in Proceedings of the 6th International Conference on Multimodal Interfaces, ser. ICMI ’04. New York, NY, USA: ACM, 2004, pp. 105–112. [Online]. Available: http://doi.acm.org/10.1145/1027933.1027953

[50] K. Kim, J. Kim, J. Choi, J. Kim, and S. Lee, “Depth camera-based 3D hand gesture controls with immersive tactile feedback for natural mid-air gesture interactions,” Sensors, vol. 15, no. 1, pp. 1022–1046, 2015.

[51] A. King, P. Blenkhorn, D. Crombie, S. Dijkstra, G. Evans, and J. Wood, “Presenting uml software engineering diagrams to blind people,” in International Conference on Computers for Handi- capped Persons. Springer, 2004, pp. 522–529.

[52] R. Kjeldsen and J. Hartman, “Design issues for vision-based computer interaction systems,” ACM International Conference Proceeding Series, vol. 15-16-November-2001, p. 27, 2001. [Online]. Available: http://echo.iat.sfu.ca/library/kjeldsen_01_visionBased_interaction_systems.pdf

[53] S. R. Klemmer, M. W. Newman, R. Farrell, M. Bilezikjian, and J. A. Landay, “The designers’ outpost: a tangible interface for collaborative web site,” in Proceedings of the 14th annual ACM symposium on User interface software and technology, 2001, pp. 1–10.

[54] A. Kunz, T. Nescher, and M. Küchler, “CollaBoard: A novel interactive electronic whiteboard for remote collaboration with people on content,” Proceedings - 2010 International Conference on Cyberworlds, CW 2010, pp. 430–437, 2010.

[55] A. Kunz, A. Alavi, and P. Sinn, “Integrating pointing gesture detection for enhancing brainstorm- ing meetings using kinect and pixelsense,” Procedia CIRP, vol. 25, pp. 205–212, 2014.

[56] A. Kunz, K. Miesenberger, M. Mühlhäuser, A. Alavi, S. Pölzer, D. Pöll, P. Heumader, and D. Schnelle-Walka, “Accessibility of brainstorming sessions for blind people,” in International Conference on Computers for Handicapped Persons. Springer, Cham, 2014, pp. 237–244.


[57] A. Kunz, D. Schnelle-Walka, A. Alavi, S. Pölzer, M. Mühlhäuser, and K. Miesenberger, “Making tabletop interaction accessible for blind users,” in Proceedings of the Ninth ACM International Conference on Interactive Tabletops and Surfaces, 2014, pp. 327–332.

[58] A. Kunz, L. Brogli, and A. Alavi, “Interference measurement of kinect for xbox one,” in Proceedings of the 22nd ACM Conference on Virtual Reality Software and Technology. ACM, 2016, pp. 345–346.

[59] J. J. LaViola, Jr., “A survey of hand posture and gesture recognition techniques and technology,” Providence, RI, USA, Tech. Rep., 1999.

[60] K.-D. Le, M. Fjeld, A. Alavi, and A. Kunz, “Immersive environment for distributed creative col- laboration,” in Proceedings of the 23rd ACM Symposium on Virtual Reality Software and Technol- ogy. ACM, 2017, p. 16.

[61] LeapMotion.

[62] M. Lee, R. Green, and M. Billinghurst, “3D natural hand interaction for AR applications,” in 2008 23rd International Conference Image and Vision Computing New Zealand, IVCNZ, 2008.

[63] Y. Li, “Protractor: a fast and accurate gesture recognizer,” in Proceedings of the SIGCHI Confer- ence on Human Factors in Computing Systems. ACM, 2010, pp. 2169–2172.

[64] G. Marin, F. Dominio, and P. Zanuttigh, “Hand gesture recognition with leap motion and kinect devices,” in 2014 IEEE International Conference on Image Processing (ICIP). IEEE, 2014, pp. 1565–1569.

[65] A. Mewes, B. Hensen, F. Wacker, and C. Hansen, “Touchless interaction with software in inter- ventional radiology and surgery: a systematic literature review,” International Journal of Com- puter Assisted Radiology and Surgery, vol. 12, no. 2, pp. 291–305, Feb 2017. [Online]. Available: https://doi.org/10.1007/s11548-016-1480-6

[66] N. Michinov, “Is electronic brainstorming or brainwriting the best way to improve creative per- formance in groups? an overlooked comparison of two idea-generation techniques,” Journal of Applied Social Psychology, vol. 42, pp. E222–E243, 2012.

[67] Microsoft, “Microsoft HoloLens,” 2015. [Online]. Available: https://www.microsoft.com/microsoft-hololens/en-us (Accessed: 12.08.2019)

[68] Microsoft Kinect, “Microsoft Kinect.” [Online]. Available: https://developer.microsoft.com/ en-us/windows/kinect (Accessed: 2019.08.12)

[69] Microsoft PixelSense, “Microsoft PixelSense,” 2013. [Online]. Available: http://www.microsoft. com/enus/PixelSense/default.aspx (Accessed: 23.10.2013)

[70] P. Molchanov, S. Gupta, K. Kim, and J. Kautz, “Hand gesture recognition with 3d convolutional neural networks,” in Proceedings of the IEEE conference on computer vision and pattern recog- nition workshops, 2015, pp. 1–7.

[71] P. Molchanov, S. Gupta, K. Kim, and K. Pulli, “Multi-sensor system for driver’s hand-gesture recognition,” in 2015 11th IEEE international conference and workshops on automatic face and gesture recognition (FG), vol. 1. IEEE, 2015, pp. 1–8.

[72] G. Moon, J. Yong Chang, and K. Mu Lee, “V2v-posenet: Voxel-to-voxel prediction network for accurate 3d hand and human pose estimation from a single depth map,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.

[73] M. R. Morris, A. Danielescu, S. Drucker, D. Fisher, B. Lee, M. C. Schraefel, and J. O. Wobbrock, “Reducing legacy bias in gesture elicitation studies,” Interactions, vol. 21, no. 3, pp. 40–45, 2014. [Online]. Available: http://faculty.washington.edu/wobbrock/pubs/interactions-14.pdf

[74] N. Neverova, C. Wolf, G. W. Taylor, and F. Nebout, “Multi-scale deep learning for gesture de- tection and localization,” in European Conference on Computer Vision. Springer, 2014, pp. 474–490.

[75] M. Nielsen, M. Störring, T. B. Moeslund, and E. Granum, “A procedure for developing intuitive and ergonomic gesture interfaces for HCI.” [Online]. Available: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.62.4970&rep=rep1&type=pdf

[76] L. Nigay and J. Coutaz, “A generic platform for addressing the multimodal challenge,” in Pro- ceedings of the SIGCHI conference on Human factors in computing systems, 1995, pp. 98–105.

[77] D. Norman, The design of everyday things: Revised and expanded edition. Basic books, 2013.

[78] D. A. Norman, “Cognitive engineering,” User centered system design, vol. 31, p. 61, 1986.

[79] D. A. Norman and S. W. Draper, User centered system design: New perspectives on human- computer interaction. CRC Press, 1986.

[80] E. Ohn-Bar and M. M. Trivedi, “Hand gesture recognition in real time for automotive interfaces: A multimodal vision-based approach and evaluations,” IEEE transactions on intelligent trans- portation systems, vol. 15, no. 6, pp. 2368–2377, 2014.

[81] V. F. Pamplona, L. A. F. Fernandes, J. Prauchner, L. P. Nedel, and M. M. Oliveira, “The image- based data glove,” in Proceedings of X Symposium on Virtual Reality (SVR’2008), João Pessoa, 2008. Anais do SVR 2008, SBC. Porto Alegre, RS: SBC, 2008, pp. 204–211.

[82] G. Paraskevopoulos, E. Spyrou, D. Sgouropoulos, T. Giannakopoulos, and P. Mylonas, “Real-time arm gesture recognition using 3D skeleton joint data,” Algorithms, vol. 12, no. 5, pp. 1–17, 2019.

[83] J. Pauchot, L. Di Tommaso, A. Lounis, M. Benassarou, P. Mathieu, D. Bernot, and S. Aubry, “Leap motion gesture control with carestream software in the operating room to control imaging: Installation guide and discussion,” Surgical innovation, vol. 22, no. 6, pp. 615–620, 2015.

[84] P. Molchanov, X. Yang, S. Gupta, K. Kim, S. Tyree, and J. Kautz, “Online Detection and Classification of Dynamic Hand Gestures with Recurrent 3D Convolutional Neural Networks,” IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2016, 2016.

[85] T. Piumsomboon, G. A. Lee, J. D. Hart, B. Ens, R. W. Lindeman, B. H. Thomas, and M. Billinghurst, “Mini-me: An adaptive avatar for mixed reality remote collaboration,” in Proceedings of the 2018 CHI conference on human factors in computing systems, 2018, pp. 1–13.

[86] Playstation, “Playstation.com,” 2019. [Online]. Available: https://www.playstation.com/en-gb/ explore/accessories/playstation-move-motion-controller/ (Accessed: 2019.08.12)

[87] S. Pölzer, D. Schnelle-Walka, D. Pöll, P. Heumader, and K. Miesenberger, “Making brainstorming meetings accessible for blind users.”

[88] S. Pölzer, A. Kunz, A. Alavi, and K. Miesenberger, “An accessible environment to integrate blind participants into brainstorming sessions,” in International Conference on Computers Helping Peo- ple with Special Needs. Springer, Cham, 2016, pp. 587–593.

[89] F. Quek, D. McNeill, R. Bryll, S. Duncan, X.-F. Ma, C. Kirbas, K. E. McCullough, and R. Ansari, “Multimodal human discourse: Gesture and speech,” ACM Trans. Comput.-Hum. Interact., vol. 9, no. 3, pp. 171–193, Sep. 2002. [Online]. Available: http://doi.acm.org/10.1145/568513.568514

[90] R. Ramloll, W. Yu, S. Brewster, B. Riedel, M. Burton, and G. Dimigen, “Constructing sonified haptic line graphs for the blind student: first steps,” in Proceedings of the fourth international ACM conference on Assistive technologies, 2000, pp. 17–25.

[91] S. Ren, H. Wang, B. Li, L. Gong, H. Yang, C. Xiang, and B. Li, “Robust contactless gesture recog- nition using commodity wifi,” in Proceedings of the 2019 International Conference on Embedded Wireless Systems and Networks. Junction Publishing, 2019, pp. 273–275.

[92] W. Rohmert, “Determination of the recovery pause for static work of man,” Internationale Zeitschrift Fur Angewandte Physiologie, Einschliesslich Arbeitsphysiologie, vol. 18, pp. 123–164, 1960.

[93] D. Rubine, “Specifying gestures by example,” in Proceedings of the 18th Annual Conference on Computer Graphics and Interactive Techniques, ser. SIGGRAPH ’91. New York, NY, USA: ACM, 1991, pp. 329–337. [Online]. Available: http://doi.acm.org/10.1145/122718.122753

[94] Y. Sakurai, C. Faloutsos, and M. Yamamuro, “Stream monitoring under the time warping dis- tance,” in 2007 IEEE 23rd International Conference on Data Engineering. IEEE, 2007, pp. 1046–1055.

[95] D. Schnelle-Walka, A. Alavi, P. Ostie, M. Mühlhäuser, and A. Kunz, “A mind map for brain- storming sessions with blind and sighted persons,” in International Conference on Computers for Handicapped Persons. Springer, Cham, 2014, pp. 214–219.

[96] D. Seth, D. Chablat, F. Bennis, S. Sakka, M. Jubeau, and A. Nordez, “Validation of a New Dynamic Muscle Fatigue Model and DMET Analysis,” 2016. [Online]. Available: https://hal.archives-ouvertes.fr/hal-01420684/document

[97] J. Shotton, A. Fitzgibbon, M. Cook, T. Sharp, M. Finocchio, R. Moore, A. Kipman, and A. Blake, “Real-time human pose recognition in parts from single depth images,” Studies in Computational Intelligence, vol. 411, pp. 119–135, 2013.


[98] K. Simonyan and A. Zisserman, “Two-stream convolutional networks for action recognition in videos,” in Advances in neural information processing systems, 2014, pp. 568–576.

[99] J. Song, G. Sörös, F. Pece, S. R. Fanello, S. Izadi, C. Keskin, and O. Hilliges, “In-air gestures around unmodified mobile devices,” UIST 2014, pp. 319–329, 2014. [Online]. Available: http://dx.doi.org/10.1145/2642918.2647373

[100] M. Stefik, G. Foster, D. G. Bobrow, K. Kahn, S. Lanning, and L. Suchman, “Beyond the chalk- board: computer support for collaboration and problem solving in meetings,” Communications of the ACM, vol. 30, no. 1, pp. 32–47, 1987.

[101] S. Tan and J. Yang, “Wifinger: leveraging commodity wifi for fine-grained finger gesture recogni- tion,” in Proceedings of the 17th ACM international symposium on mobile ad hoc networking and computing. ACM, 2016, pp. 201–210.

[102] J. Taylor, L. Bordeaux, T. Cashman, B. Corish, C. Keskin, T. Sharp, E. Soto, D. Sweeney, J. Valentin, B. Luff, A. Topalian, E. Wood, S. Khamis, P. Kohli, S. Izadi, R. Banks, A. Fitzgibbon, and J. Shotton, “Efficient and precise interactive hand tracking through joint, continuous optimization of pose and correspondences,” ACM Transactions on Graphics, vol. 35, no. 4, 2016. [Online]. Available: https://www.microsoft.com/en-us/research/wp-content/uploads/2016/07/SIGGRAPH2016-SmoothHandTracking.pdf

[103] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri, “Learning spatiotemporal features with 3d convolutional networks,” in Proceedings of the IEEE international conference on com- puter vision, 2015, pp. 4489–4497.

[104] J. Vanderdonckt, P. Roselli, and J. L. Pérez-Medina, “!ftl, an articulation-invariant stroke gesture recognizer with controllable position, scale, and rotation invariances,” in Proceedings of the 20th ACM International Conference on Multimodal Interaction, ser. ICMI ’18. New York, NY, USA: ACM, 2018, pp. 125–134. [Online]. Available: http://doi.acm.org/10.1145/3242969.3243032

[105] R.-D. Vatavu and J. O. Wobbrock, “$Q: A Super-Quick, Articulation-Invariant Stroke-Gesture Recognizer for Low-Resource Devices,” 2018. [Online]. Available: http://faculty.washington.edu/wobbrock/pubs/mobilehci-18.pdf

[106] R.-D. Vatavu, L. Anthony, and J. O. Wobbrock, “Gestures as point clouds: A $P recognizer for user interface prototypes,” Proceedings of the 14th ACM international conference on Multimodal interaction, 2012.

[107] Vive, “vive.com.” [Online]. Available: https://www.vive.com/eu/accessory/controller/ (Accessed: 2019-08-12)

[108] J. B. Walther, J. F. Anderson, and D. W. Park, “Interpersonal effects in computer-mediated interac- tion: A meta-analysis of social and antisocial communication,” Communication research, vol. 21, no. 4, pp. 460–487, 1994.

[109] H. Wang, M. M. Ullah, A. Klaser, I. Laptev, and C. Schmid, “Evaluation of local spatio-temporal features for action recognition,” BMVC 2009 - British Machine Vision Conference, pp. 124.1–124.11, 2009. [Online]. Available: http://hal.inria.fr/inria-00439769

[110] J. Wang, Z. Liu, J. Chorowski, Z. Chen, and Y. Wu, “Robust 3D action recognition with random occupancy patterns,” in Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 7573 LNCS, no. PART 2, 2012, pp. 872–885.

[111] M. Wenzel and C. Meinel, “Full-body webrtc video conferencing in a web-based real-time collab- oration system,” in 2016 IEEE 20th International conference on computer supported cooperative work in design (CSCWD). IEEE, 2016, pp. 334–339.

[112] A. Wexelblat, “Research challenges in gesture: Open issues and unsolved problems,” in Pro- ceedings of the International Gesture Workshop on Gesture and Sign Language in Human- Computer Interaction. Berlin, Heidelberg: Springer-Verlag, 1998, pp. 1–11. [Online]. Available: http://dl.acm.org/citation.cfm?id=647590.728557

[113] D. Wigdor and D. Wixon, Brave NUI World: Designing Natural User Interfaces for Touch and Gesture, 1st ed. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 2011.

[114] J. O. Wobbrock, A. D. Wilson, and Y. Li, “Gestures without libraries, toolkits or training: a $1 recognizer for user interface prototypes,” in Proceedings of the 20th annual ACM symposium on User interface software and technology - UIST ’07, 2007.

[115] J. Zillner, C. Rhemann, S. Izadi, and M. Haller, “3d-board: a whole-body remote collaborative whiteboard,” in Proceedings of the 27th annual ACM symposium on User interface software and technology, 2014, pp. 471–479.

List of Publications

Journal Papers

• A. Kunz, K. Miesenberger, M. Mühlhäuser, A. Alavi, S. Pölzer, D. Pöll, P. Heumader, and D. Schnelle-Walka, “Accessibility of brainstorming sessions for blind people,” in International Conference on Computers for Handicapped Persons. Springer, Cham, 2014, pp. 237–244

• A. Alavi and A. Kunz, “Tracking deictic gestures over large interactive surfaces,” Computer Supported Cooperative Work (CSCW), vol. 24, no. 2-3, pp. 109–119, 2015

Conference Papers

• A. Kunz, D. Schnelle-Walka, A. Alavi, S. Pölzer, M. Mühlhäuser, and K. Miesenberger, “Making tabletop interaction accessible for blind users,” in Proceedings of the Ninth ACM International Conference on Interactive Tabletops and Surfaces, 2014, pp. 327–332

• A. Alavi and A. Kunz, “In-air eyes-free text entry: A work in progress,” in 20th ACM Conference on Intelligent User Interfaces (ACM IUI 2015). ETH Zürich, 2015

• S. Pölzer, A. Kunz, A. Alavi, and K. Miesenberger, “An accessible environment to integrate blind participants into brainstorming sessions,” in International Conference on Computers Helping People with Special Needs. Springer, Cham, 2016, pp. 587–593

• A. Kunz, L. Brogli, and A. Alavi, “Interference measurement of kinect for xbox one,” in Proceed- ings of the 22nd ACM Conference on Virtual Reality Software and Technology. ACM, 2016, pp. 345–346

• K.-D. Le, M. Fjeld, A. Alavi, and A. Kunz, “Immersive environment for distributed creative collaboration,” in Proceedings of the 23rd ACM Symposium on Virtual Reality Software and Technology. ACM, 2017, p. 16

• D. Schnelle-Walka, A. Alavi, P. Ostie, M. Mühlhäuser, and A. Kunz, “A mind map for brainstorming sessions with blind and sighted persons,” in International Conference on Computers for Handicapped Persons. Springer, Cham, 2014, pp. 214–219

• A. Kunz, A. Alavi, and P. Sinn, “Integrating pointing gesture detection for enhancing brainstorming meetings using kinect and pixelsense,” Procedia CIRP, vol. 25, pp. 205–212, 2014
