TOWARDS A ROBUST FRAMEWORK FOR VISUAL HUMAN-ROBOT INTERACTION

Junaed Sattar

School of Computer Science, McGill University, Montréal

November 2011

A Thesis submitted to McGill University in partial fulfillment of the requirements of the degree of Doctor of Philosophy

© Junaed Sattar, MMXI

ABSTRACT

This thesis presents a vision-based interface for human-robot interaction and control for autonomous robots in arbitrary environments. Vision has the advantage of being a low-power, unobtrusive sensing modality. The advent of robust algorithms and a significant increase in computational power are the two most significant reasons for its widespread integration. The research presented in this dissertation looks at visual sensing as an intuitive and uncomplicated method for a human operator to communicate at close range with a mobile robot. The array of communication paradigms we investigate includes, but is not limited to, visual tracking and servoing, programming of robot behaviors with visual cues, visual feature recognition and mapping, identification of individuals through gait characteristics using spatio-temporal visual patterns, and quantifying the performance of these human-robot interaction approaches. The proposed framework enables a human operator to control and program a robot without the need for any complicated input interface, and also enables the robot to learn about its environment and the operator using the visual interface. We investigate the applicability of machine learning methods – supervised learning in particular – to train the vision system using stored training data. A key aspect of our work is a system for human-robot dialog for safe and efficient task execution under uncertainty. We present extensive validation through a set of human-interface trials, and also demonstrate the applicability of this research in the field on the Aqua amphibious robot platform in the underwater domain. While our framework is not specific to robots operating in the underwater domain, vision underwater is affected by a number of issues, such as lighting variations and color degradation. Evaluating the approach in such difficult operating conditions provides a definitive validation of our approach.

RÉSUMÉ

This thesis presents a vision-based interface that enables interaction between humans and robots, as well as the control of autonomous robots operating in arbitrary environments. Vision has the advantage of being an unobtrusive, low-power sensing modality. The availability of robust algorithms and a significant increase in computational power are two of the most important reasons for its widespread integration. The research presented in this dissertation evaluates visual sensing as a simple and intuitive way for a human operator to communicate at close range with a mobile robot. The set of communication paradigms studied includes, but is not limited to, visual tracking and servoing, the use of visual cues to program robot behaviors, visual recognition, and the recognition of individuals by their characteristic body movements using spatio-temporal visual patterns, while quantifying the performance of this approach to human-robot interaction. The proposed framework allows the human operator to program and control a robot without requiring a complex input interface. It also allows the robot to learn key characteristics of its environment and of its human operator through the visual interface. We study the applicability of machine learning methods, supervised learning in particular, for training the vision system from stored training data. An important aspect of this research is the development of a human-robot dialog system that allows the safe and efficient execution of tasks under uncertainty. We present extensive validation through numerous human-interface trials, and we also demonstrate the applications of this research in the field on board the Aqua amphibious robot. While our framework is not specific to underwater robotics, vision underwater is affected by many factors, notably varying illumination and color degradation. Evaluating the approach under such difficult operating conditions provides a definitive validation of our research.

ACKNOWLEDGEMENT

The author would like to gratefully acknowledge the contribution and support of colleagues, friends and family, without which this thesis would not have been possible. First and foremost, I would like to express my gratitude to Gregory Dudek – supervisor, friend and mentor for years. Without his infectious enthusiasm and steadfast faith in my abilities, two theses, more than 10 field trials and a variety of stimulating research projects would not have seen the light of day. I would not be pursuing a doctoral degree, let alone in the field of Robotics, if it were not for Greg. Rarely am I at a loss for words to describe my gratitude, but no amount of words would ever do justice to the inspiration he has provided. I am also thankful to the Dudek family – Nick, Natasha and Krys – for treating me as one of their own, and for unhesitatingly helping out on numerous research field trials, often without being asked.

My appreciation also extends to the various members of my PhD committee, and faculty members at the School of Computer Science and Center for Intelligent Machines (CIM). In particular, I am thankful to Joelle Pineau, Tal Arbel, Doina Precup, Godfried Toussaint, Sue Whitesides, and Frank Ferrie, who have provided valuable guidance during the course of my PhD. I would like to thank Michael Langer, for it was his course in Computational Perception that sparked my interest in machine vision all those years ago. Meyer Nahon and Inna Sharf have assisted with ideas regarding the Aqua platform, and in various field trials over the years in Barbados. I also acknowledge the input provided by Nicholas Roy and Nando de Freitas. The staff at CIM and the School of Computer Science, both past and present, have made my life easier in so many ways, and my appreciation goes to them as well: Cynthia Davidson, Marlene Gray, Jan Binder, Patrick McLean, Diti Anastasopoulos, Sheryl Morrissey and Vanessa Bettencourt-Laberge.

All of this research has been validated on-board the Aqua platform, which has proven to be an amazingly robust and versatile robotic platform to build my research upon. For that, I am thankful to my friend and colleague Christopher Prahacs, the man behind the design and construction of the Aqua robots. Without his unwavering dedication, impeccable work ethic and extreme patience, there would not be even one Aqua robot, let alone three. I am grateful for having the opportunity to work alongside and learn from Chris, lessons very few textbooks can teach, if any.

My lab-mates and friends at the Mobile Robotics Lab have provided the best support any doctoral student can ask for. I would like to thank Eric Bourque for his advice and amazing insights into programming in general and the world of open-source software. Ioannis Rekleitis gets a big thank-you for providing video and photographic support for almost all the Aqua trials. Malika Meghjani, Gabrielle Charette, Yogesh Girdhar, Anqi Xu, Florian Shkurti, Nicolas Plamondon, Olivia Chiu, Matt Garden, Dave Meger, Dimitri Marinakis and Bir Bikram Dey have all played key roles in assisting with robot trials and providing stimulating discussions towards my research. Special gratitude goes to Philippe Giguère, for sharing the task of playing "robot parent" with Chris Prahacs and myself over the years. Travis Thomson and Erica Dancose take credit for translating the thesis abstract into French.

Last but not least, an enormous amount of gratitude goes to my family. No words would suffice to express the love and support Rafa has given me all these years. She has unwaveringly seen me through the toughest of times, and I am thankful to have her beside me. As she approaches the end of her own doctoral journey, I hope I can be for her what she has been for me. This thesis is for Nadyne, my little angel, and my father, the late M. A. Sattar.


TABLE OF CONTENTS

LIST OF FIGURES
LIST OF TABLES

CHAPTER 1. Introduction
  1.1. A Framework for Visual Human-Robot Interaction
  1.2. Visual Human-Robot Interaction
  1.3. Overview of Approach
    1.3.1. Visual Programming of a Mobile Robot
    1.3.2. People Tracking
    1.3.3. Risk-Uncertainty Assessment
  1.4. Motivation
  1.5. Contributions
  1.6. Statement of Originality
  1.7. Document Outline

CHAPTER 2. A Framework for Robust Visual HRI
  2.1. Related Work
    2.1.1. Visual Tracking
    2.1.2. Distribution Similarity Measures
    2.1.3. Visual Servoing
    2.1.4. Biological Motion Tracking
    2.1.5. Fiducials for Augmented Reality
    2.1.6. Machine Learning in Computer Vision
    2.1.7. Human-Robot Interaction and Human-Robot Dialogs
  2.2. A Framework for Visual Human-Robot Interaction
    2.2.1. High-frequency Methods
    2.2.2. Intermediate-frequency Methods
    2.2.3. Low-frequency Methods
  2.3. Conclusion

CHAPTER 3. A Visual Language for Robot Programming
  3.1. Introduction
  3.2. Methodology
    3.2.1. RoboChat grammar and syntax
  3.3. RoboChat Gestures
  3.4. Fourier Tags
  3.5. Conclusion

CHAPTER 4. Tracking Biological Motion
  4.1. Introduction
  4.2. Methodology
    4.2.1. Fourier Tracking
    4.2.2. Multi-directional Motion Detection
    4.2.3. Position Tracking Using an Unscented Kalman Filter
    4.2.4. Parameter Tuning
  4.3. Conclusions

CHAPTER 5. Machine Learning for Robust Tracking
  5.1. Introduction
  5.2. Methodology
    5.2.1. Visual Tracking by Color Thresholding
    5.2.2. Ensemble Tracking
    5.2.3. Choice of trackers
  5.3. Conclusion

CHAPTER 6. Confirmations in Human-Robot Dialog
  6.1. Introduction
  6.2. Methodology
    6.2.1. Uncertainty Modeling
    6.2.2. Cost Analysis
    6.2.3. Confirmation Space
  6.3. Conclusions

CHAPTER 7. The Aqua Family of Amphibious Robots
  7.1. Computing
  7.2. Vision Hardware
  7.3. Software Architecture
    7.3.1. Operating Systems
    7.3.2. Robot control
    7.3.3. Implementation of vision-interaction algorithms
    7.3.4. Vision-guided autonomous control
  7.4. Conclusion

CHAPTER 8. Experiments and Evaluations – Explicit Interactions
  8.1. RoboChat Human-Interface Trials
    8.1.1. Human Interaction Study
    8.1.2. Criteria
    8.1.3. Results: Study A
    8.1.4. Results: Study B
    8.1.5. RoboChat field trials
  8.2. Quantitative Analysis of Confirmation in Dialogs
    8.2.1. Language Description
    8.2.2. User Study
    8.2.3. User Interface
    8.2.4. Field Trials
    8.2.5. Results
  8.3. Conclusion

CHAPTER 9. Experiments and Evaluations – Implicit Interactions
  9.1. Evaluations of the Spatio-Chromatic Tracker
    9.1.1. Training
    9.1.2. Tracking Experiments
    9.1.3. Experimental Snapshots
    9.1.4. Tracking results
  9.2. Experiments with the Fourier Tracker
    9.2.1. Experimental Setup
    9.2.2. Results
    9.2.3. Performance Evaluation
  9.3. Conclusion

CHAPTER 10. Conclusions
  10.1. Summary
  10.2. Future Directions
  10.3. Final Thoughts

LIST OF FIGURES

1.1 The Aqua amphibious robot coordinating with a diver during a tetherless open-water deployment.
1.2 Different stages of a human-robot interaction scenario.
2.1 Different existing planar markers.
2.2 The three interaction layers in our proposed Visual-HRI framework.
2.3 An abstract breakdown for the proposed Visual-HRI framework.
2.4 "Blob" tracker tracking a target underwater.
2.5 Outline of the process of tracking a diver using flipper oscillations.
2.6 Diver using RoboChat to communicate with the Aqua robot.
2.7 Performance data from human-interface trials of the RoboChat input scheme.
3.1 A diver controlling the Aqua robot using visual cues.
3.2 Comparison of C (left) and RoboChat (right) syntax.
3.3 RoboChat BNF grammar.
3.4 A human-interface trial in progress.
3.5 Bottom left: the frequency-domain encoding of 210; top left: the corresponding spatial signal; right: the Fourier tag of the number 210 constructed from the signal shown on the top left.
3.6 Fourier tags underwater on the sea bed and with a diver.
4.1 External and robot-eye-view images of typical underwater scenes during target tracking by an autonomous underwater robot.
4.2 Effects on light rays in underwater domains.
4.3 Outline of the directional Fourier motion detection and tracking process. The Gaussian-filtered temporal image is split into sub-windows, and the average intensity of each sub-window is calculated for every timeframe. For the length of the filter, a one-dimensional intensity vector is formed, which is then passed through an FFT operator. The resulting amplitude plot can be seen, with the symmetric half removed.
4.4 Directions of motion for Fourier tracking, also depicted in 3D in a diver-swimming sequence.
4.5 Conic space covered by the directional Fourier operators.
4.6 A schematic showing the RunLength and BoxSize parameters for the Fourier tracker.
5.1 A color blob tracker tracking a red-colored object.
5.2 Outline of the ensemble tracker using AdaBoost.
5.3 Examples of the different tracker types used during boosting.
6.1 The Aqua robot being operated via gestures formed by fiducial markers.
6.2 Control flow in our risk-uncertainty model.
6.3 The role of assessors in risk/cost estimation.
6.4 A pictorial depiction of the set of programs Pi = {p1, p2, ..., p13} in the Safety-Likelihood graph. Programs in the non-shaded areas are in the Confirmation Space.
6.5 Pictorial depiction of the task selection process from a range of likely alternative programs. The straight line indicates the average cost of likely programs.
7.1 A cutaway annotated diagram of the Aqua robot, showing salient hardware components. Image courtesy of C. Prahacs.
7.2 The hardware schematic of the Aqua robot.
7.3 Aqua remote operator console, working in simulation mode.
7.4 Aqua robot functional schematic diagram.
8.1 Different ARTag structures used during the experiments.
8.2 A subset of gestures presented to participants in Study B.
8.3 Studies A and B: average time taken per command using ARTag markers and hand gestures.
8.4 Study B: average time taken per command using ARTag markers and hand gestures. In both plots, user 4 is the "expert user".
8.5 Study B: trial progression using tags.
8.6 Study B: trial progression using hand gestures.
8.7 An example set of tags used in field trials to command the robot, which corresponds to program #1 shown in Tab. 8.3.
8.8 Mapping of mouse gestures to robot commands for the user trials. On the left, the command mapping can be seen. The figure on the right demonstrates a user trial in progress, with the straight lines depicting a user's attempts to issue the commands shown on the left.
8.9 Field trials of the proposed algorithm on board the Aqua robot.
8.10 Results from user studies, timing 8.10(a) and confirmations 8.10(b).
8.11 Error filter rate plot over all user studies data.
8.12 Mistakes with respect to program length, all users combined.
8.13 The Aqua robot with ARTag markers used for RoboChat token delivery.
9.1 Overall system architecture of the spatio-temporal tracker, showing the off-line components and the PID controller.
9.2 Comparison of the non-boosted color segmentation tracker (top row) with the boosted version (bottom row) on a sequence of four images. The target in the last frame is missed by the non-boosted tracker, but the boosted tracker is able to detect it.
9.3 Top row: output of non-boosted color segmentation tracker on a video sequence of a diver swimming, shown by the yellow squares. Bottom row: output of boosted color segmentation tracker on the diver tracking sequence, shown by the white squares.
9.4 Tracking results showing accuracy (top) and false positives (bottom), and the superiority of the boosted tracker (blue bars) as compared to the non-boosted tracker (red bars).
9.5 Percentage of each type of tracker (shown left) chosen by AdaBoost on the diver-following dataset shown in Fig. 9.3.
9.6 Top row: two sample testing images from Dataset 1, with yellow targets. Bottom row: output of one of the trackers chosen by AdaBoost. While the target is still detected by the tracker, so are a variety of background objects, which affects tracking accuracy.
9.7 Snapshot image showing direction of a swimmer's motion (in green) and an arbitrary direction without a diver (in red).
9.8 Contrasting frequency responses for directions with and without diver motion in a given image sequence.
9.9 Frequency responses for two different directions of diver motion in a single image sequence.
9.10 Effect of diver's distance from camera on the amplitude spectra. Being farther away from the camera produces higher energy responses (Fig. 9.10(b)) in the low-frequency bands, compared to divers swimming closer (Fig. 9.10(d)).
9.11 Additional instantaneous amplitude responses at various times during diver tracking. Figures 9.11(a) and 9.11(b) are Fourier signatures of the diver's flippers, whereas Figures 9.11(c) and 9.11(d) are examples of random (i.e., non-diver) locations.
9.12 Effect of the λ and κ parameters on tracker timing, with time taken shown per detection.
9.13 Effect of the RunLength and BoxSize parameters on tracker accuracy.
10.1 The Aqua robot swims under an Autonomous Surface Vehicle. The robot operator can be seen swimming on the surface.

LIST OF TABLES

8.1 Tasks used in Study A.
8.2 Example of a long command used in Study B.
8.3 Programs used in the user study.
8.4 Programs used in the field trials in Lake Ouareau, Québec.
8.5 Programs used in the open-ocean field trials in Barbados.
9.1 Low-frequency amplitude responses for multiple motion directions.

CHAPTER 1

Introduction

1.1. A Framework for Visual Human-Robot Interaction

This thesis presents a framework for Human-Robot Interaction (HRI) for mobile robots in arbitrary environments, with vision as the primary sensing modality. In particular, the work presented here enables mobile robots to accompany a human operator and act as an assistant, using the robot's on-board visual sensors to communicate (both explicitly and implicitly) with the operator. Such a scenario can easily be envisioned in many robot-assisted tasks, including but not limited to search-and-rescue, remote surveillance, domestic assistance, medical aid, and a variety of maintenance operations. The ability to communicate with a human operator in an error-free and efficient manner is of utmost importance in all of these application domains. For example, consider an autonomous underwater vehicle (i.e., an AUV) attempting to learn the procedures of a ship inspection task from a human scuba diver, so that future operations can be performed automatically by the robot without any involvement of the diver. Without a suite of algorithms for visually following the diver, understanding the diver's instructions, and executing the instructions safely and reliably, it would be difficult for the robot to carry out such a task. As such, this thesis delves into the realms of vision-based gesture recognition, visual tracking, machine learning for enhanced parameterized target tracking, and quantitative modeling of human-robot dialogs, independent of the actual communication medium.

The research presented in this thesis focuses primarily on the creation and taxonomy of a suite of vision-based techniques to assist in human-robot interaction, accompanied by a mathematical model of human-robot dialogs with the goal of safe task execution in the presence of uncertainty and high cost of commanded operations. The term robust is applied not particularly in the context of robust statistics, but more in the context of safe, accurate and error-tolerant operations. Throughout the duration of this research, we have investigated problems in several sub-areas, such as visual biometrics for detecting human motion, structured languages for human-robot communication, supervised learning for object tracking, and probabilistic modeling of human-robot dialogs in the presence of non-trivial uncertainty. Substantial human-interface trials have been performed, and validation of our contributions has taken place on-board a legged amphibious vehicle, in a variety of natural environments. We dedicate one chapter of this thesis to describing the significant system development that has taken place towards this validation, and also outline the human-interface trials in greater detail in Chapter 8.

This introduction begins by describing the discipline of human-robot interaction, and the repercussions of the coexistence of robots with humans in a generally ubiquitous manner. We then turn our focus to specifics, outlining the particular issues we are interested in and providing a high-level overview of the techniques we intend to apply. This is followed by the motivations behind this work. We conclude the chapter with an outline of the entire document and some statements on the originality of contributions.

1.2. Visual Human-Robot Interaction

Robots are becoming a more common presence in our everyday lives, and are playing a significant role in improving the quality of life through their applications in a variety of domains. A number of these applications, some of which we have mentioned previously, require a robot to interact closely with one or more persons, often in a predominantly human-centric (i.e., social) context. The advent of faster computing equipment and sensors, along with state-of-the-art algorithms, is contributing to the rapid adoption of robotic technologies. It comes as no surprise that a principled, well-formed approach to human-robot interaction is of utmost importance for robotics to gain even wider acceptance. An intuitive, natural (in the context of human-to-human communication) interaction medium would significantly contribute towards bridging the communication gap that currently exists between man and machine. Vision is one such sensing medium that satisfies these criteria, and there thus exists a compelling motivation to use machine vision for human-robot interaction.

In this work, the term Visual Human-Robot Interaction is used to describe the scenario where a mobile robot is working in close proximity with a human operator (see Figure 1.1), and instructions from the human operator are parsed by the robot using its visual sensors (i.e., cameras). Various forms of human-robot communication have been addressed in the literature, and a number of systems have found implementations in a variety of robotic platforms (e.g., [56, 57]). We focus particularly on a variation of cognitive HRI, with vision as the interaction sensing modality. This thesis presents a framework for vision-based HRI which incorporates a suite of algorithms in different stages of communication, to enable a mobile robot to track, follow, understand gestures, and assess uncertainty and risks on a continual basis throughout the duration of the interaction. We consider the latter capability necessary to ensure the safe and error-free behavior of the robot while communicating with a human operator.

Figure 1.1. The Aqua amphibious robot coordinating with a diver during a tetherless open-water deployment.

We consider interactions between a human and a robot to take either an explicit or implicit form, regardless of the actual interaction modalities being used. In explicit interactions, the robot is given instructions by the operator, using a well-defined input modality (vision in our case); speech, gestures and touch are a few examples of such explicit interactions. Implicit interactions enable the robot to passively communicate with the operator, by tracking his movements, learning about appearance changes, assessing the effects of executing certain instructions and so on. The instructions given through explicit interactions rely heavily on the successful execution of these implicit interactions. We consider any algorithm that communicates with a human to be an interaction algorithm. Thus, a visual localization and mapping algorithm will not be considered an interaction algorithm, but an algorithm that performs human motion detection will be classified as such.

Continuing along the same vein, to be able to coordinate with a human, a robot requires the ability to uniquely identify the operator. Often, it might be desirable to have the robot follow the human operator in order to accomplish some particular task. In many cases, and especially in outdoor environments, tracking algorithms can suffer from errors arising from ambiguous features. For example, if a visual tracker were tuned to track objects based on the presence of a specific color, it may be confused if there are multiple objects in the scene with similar hues. Hence it is important for a mobile robot to focus on distinctive characteristics of human motion to be able to consistently track its operator.

In recent times, mobile robotics and particularly machine vision have benefited from the application of machine learning algorithms, in many instances outperforming existing algorithms. Viola and Jones, for example, have demonstrated the applicability of ensemble learning in face detection tasks with a high degree of accuracy [99]. We have applied learning to achieve more robust visual tracking, in an attempt to address the issues in changing appearances that arise from lighting variations and color changes.

For explicitly communicating with a robot, there must exist some language that is understood by the robot, and in our case, that language needs to be visual. While free-form gestures are the most widely used form of visual communication (sign language as an example), current algorithms for detecting gestures in natural environments lack the accuracy and robustness necessary for application in HRI [19]. We look at a visual programming language for explicit HRI, which relies on the robot detecting unique "tokens" as they are being sent by the operator. Instead of hand gestures, we have used a set of fiducials [28] to visually program the robot, with each individual fiducial marker (i.e., "tag") corresponding to a unique language token. By mapping these gestures to discrete commands, the robot is able to execute the commanded task. The next section describes this concept in more detail along with a broader overview of our HRI framework.

1.3. Overview of Approach

In this section, we provide an overview to illustrate the particular nature of the problem we are addressing in this thesis. We assume a mobile robot is deployed as an assistant to a human operator, often accompanying the operator as he carries out a set of tasks. Also, the robot is capable of autonomously executing a set of instructions when commanded by the operator. We also assume there exists a language via which the human operator can send explicit instructions to the robot. The robot will have some form of feedback mechanism available, not necessarily visual, to establish bi-directional dialog with the human. After the sequence of instructions has been given, the robot estimates the degree of uncertainty in the input (i.e., the probability of the input that was "observed" by the robot), and also estimates the cost of execution, given the most likely of all possible commands. When the uncertainty in the input is low and the task cost is within a tolerance level, the robot will execute the given program. Otherwise, the robot will request feedback to confirm the command, or to reprogram in case an error was observed. Our problem is to address each of these specific requirements along the chain of interaction to enable the robot to communicate robustly with the operator, and to provide a model for risk-uncertainty assessment so that task execution occurs only when a guarantee of safety is achieved.

Figure 1.2 outlines a rather typical human-robot interaction scenario. Such a scenario can be envisioned, for example, in a human-assisted inspection task, where a human observer might instruct a robot to carry out potentially risky inspections, and thus a high degree of certainty is essential for the safety of both the operator and the robot.

Figure 1.2. Different stages of a human-robot interaction scenario.

1.3.1. Visual Programming of a Mobile Robot. Our approach towards allowing a human to program a robot relies on a set of discrete gestures, formed by using fiducials, that are mapped one-to-one to a set of robot commands. Fiducials are widely used in Augmented Reality applications, and by design are easily detectable by a machine vision system. These special markers do not require sophisticated camera systems, and often work without camera calibration data. The latter feature is often useful, as calibration data is not available for many off-the-shelf cameras, and requiring it would also necessitate a recalibration every time the robot's cameras are replaced (e.g., for upgrades or maintenance). Using fiducials also ensures a high recognition rate, and reduces deployment overhead, as only a limited number of cue cards need to be carried by the operator. When a discrete gesture is "drawn" by the operator, we are able to use the gesture itself, and a variety of other parameters associated with the gesture formation process, to represent a large number of language tokens. These parameters include, but are not limited to, the specific fiducials used to create the gesture, the direction of the gesture, and its distance from the robot. Using just these three features, and at most two fiducials at a time to create the gestures, the maximum possible number of individual language tokens that can be represented is

$$ N_{\mathrm{Tokens}} = N_{\mathrm{directions}} \times D_{\mathrm{distance}} \times \binom{N_{\mathrm{Gesture}}}{2} \tag{1.1} $$

Equation 1.1 highlights the fact that the space of total possible language tokens can be made arbitrarily large.
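As a quick numerical illustration (not code from the thesis), the sketch below evaluates Equation 1.1 for assumed values of the three parameters; the variable names and the example figures are hypothetical.

```cpp
#include <iostream>

// Hypothetical illustration of Equation 1.1: the number of distinct language
// tokens grows with the number of gesture directions, distance bins, and
// unordered pairs of fiducials shown at a time.
int main() {
    const int n_directions = 4;   // assumed number of distinguishable gesture directions
    const int d_distance   = 3;   // assumed number of distance bins from the robot
    const int n_fiducials  = 10;  // assumed number of distinct fiducial markers

    // "N_Gesture choose 2": unordered pairs of fiducials used to form a gesture.
    const int fiducial_pairs = n_fiducials * (n_fiducials - 1) / 2;

    const int n_tokens = n_directions * d_distance * fiducial_pairs;
    std::cout << "Representable language tokens: " << n_tokens << std::endl;  // 540
    return 0;
}
```

Even with these modest, assumed values, well over five hundred distinct tokens can be represented.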

1.3.2. People Tracking. As stated in the preceding sections, robots require a robust, accurate person tracker to be able to accompany a human operator. In a vision-based HRI framework, we envision a visual tracker to fill that role, and we demonstrate the effectiveness of such a technique through a visual diver-following algorithm for underwater robots. Called Fourier Tracking, this algorithm detects the oscillations of a diver's flippers to determine his position in the robot's field of view, and then applies a variant of the Kalman filter to accurately track that position. Capable of detecting and tracking motion simultaneously in a number of directions, this algorithm can track multiple divers in the scene. While such spatio-temporal features have been used in the past (e.g., in [59]; the reader is referred to Chapter 2 for more references), this method is unique in the range of target directions it can account for during tracking (including motion directly away from the camera) and in the way it accounts for both diver and vehicle motion.
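To make the idea concrete, the sketch below shows the kind of computation Fourier Tracking builds on: the mean intensity of an image sub-window is sampled once per frame, its amplitude spectrum is computed, and strong energy in a low-frequency band is taken as evidence of periodic flipper motion. This is a minimal sketch and not the thesis implementation; the frequency band and decision threshold are assumed values.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Minimal sketch: given the mean intensity of a sub-window sampled once per
// frame, compute a one-sided DFT amplitude spectrum and test whether most of
// the non-DC energy falls in a low-frequency band consistent with flipper
// oscillations. Band limits and threshold are assumptions, not thesis values.
bool looksLikePeriodicGait(const std::vector<double>& intensity, double fps) {
    const double kPi = 3.14159265358979323846;
    const std::size_t n = intensity.size();
    if (n < 16) return false;

    std::vector<double> amplitude(n / 2, 0.0);
    for (std::size_t k = 0; k < n / 2; ++k) {           // one-sided spectrum
        double re = 0.0, im = 0.0;
        for (std::size_t t = 0; t < n; ++t) {
            const double angle = 2.0 * kPi * k * t / n;
            re += intensity[t] * std::cos(angle);
            im -= intensity[t] * std::sin(angle);
        }
        amplitude[k] = std::sqrt(re * re + im * im);
    }

    double bandEnergy = 0.0, totalEnergy = 0.0;
    for (std::size_t k = 1; k < n / 2; ++k) {            // skip the DC component
        const double freqHz = k * fps / n;
        totalEnergy += amplitude[k];
        if (freqHz >= 0.25 && freqHz <= 1.5)             // assumed kick-frequency band
            bandEnergy += amplitude[k];
    }
    return totalEnergy > 0.0 && (bandEnergy / totalEnergy) > 0.6;  // assumed threshold
}
```

In practice an FFT would replace the naive DFT above, and the detection would feed a filter-based tracker, as described in Chapter 4.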

1.3.3. Risk-Uncertainty Assessment. Uncertainty in the communication between a robot and a human operator can lead to unexpected behavior from the robot, and in the worst case, can be hazardous for people and the robot itself. To handle uncertainty and high-risk task execution scenarios by a mobile robot, we propose an algorithm that asks for confirmation from the human to prevent the robot from behaving erroneously. Specifically, given an instruction (with non-zero uncertainty), our approach searches the space of all possible commands and their associated costs to determine whether a confirmation is required. In case of misinterpretation of commands, or a mistake by the operator, this model ensures that high-risk tasks are not executed without confirmation. We describe this approach in greater detail in Chapter 6.
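The decision rule can be summarized, in a deliberately simplified form, as follows: execute only when the most likely interpretation is both sufficiently probable and sufficiently cheap, and otherwise ask the operator to confirm. The sketch below is a hypothetical illustration of that rule, not the model of Chapter 6; the candidate structure and the two thresholds are assumptions.

```cpp
#include <vector>

// Hypothetical sketch of a confirmation decision: given a posterior over
// candidate programs and a per-program execution cost, execute only when the
// most likely program is both probable enough and cheap enough; otherwise ask
// the operator to confirm. Thresholds are assumptions for illustration.
struct Candidate {
    double probability;  // posterior probability of this interpretation
    double cost;         // estimated cost (risk) of executing it
};

enum class Decision { Execute, AskForConfirmation };

Decision decide(const std::vector<Candidate>& candidates,
                double minProbability = 0.8,   // assumed certainty threshold
                double maxCost = 10.0) {       // assumed cost tolerance
    const Candidate* best = nullptr;
    for (const Candidate& c : candidates)
        if (best == nullptr || c.probability > best->probability) best = &c;

    if (best == nullptr) return Decision::AskForConfirmation;
    if (best->probability >= minProbability && best->cost <= maxCost)
        return Decision::Execute;
    return Decision::AskForConfirmation;  // high uncertainty or high risk
}
```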

1.4. Motivation

Recent advances in hardware and algorithms have made Robotics one of the fastest growing disciplines of the 21st century [2]. Along with rapid developments in research, robotics industries are also on the rise, with a variety of target domains, including but not limited to domestic aid, entertainment, education, mining and military applications. Many of these applications require robots to coexist in a predominantly human-centric environment, assisting humans in, or even replacing them from, hazardous work environments. It is our opinion that Robotics, and Human-Robot Interaction in particular, is thus a domain of research that deserves increasing attention.

A specific application that motivates our work stems from the Aqua underwater research project [23], and its applications to marine biology in particular. A common application in marine surveillance is the task of Scene Acquisition and Site Re-inspection (SASR), where a marine biologist identifies areas of the underwater ecosystem (e.g., coral reefs) that require long-term monitoring. This is required to analyze certain species of marine life that live in that particular area, so that the effect of environmental changes on their behavioral patterns can be firmly established. This monitoring task is repeated on a daily basis, over a period often spanning months at a time. For a human diver to monitor an underwater scene for an extended duration is naturally hazardous, and entails non-trivial risk. Also, studies have shown that the presence of a diver disturbs the natural behavior patterns of marine species. An underwater vehicle, on the other hand, can replace the diver in such a surveillance task, and provide more reliable data for the environment in question. This requires the robot to have the ability to understand the diver's commands, follow him as he swims to the specific area, and eventually return to "home base" at the end of the task. With data collected from the first day's task, the robot should be able to do all of the above autonomously, for the duration of the surveillance project. Our investigation of HRI has its roots in this application, and the research presented in this thesis addresses some specific areas which enable a robot to carry out tasks such as SASR.

In a more general view, this thesis looks at a particular approach to HRI problems, where vision sensing is used as the communication modality. While speech has been the predominant natural method of communication, certain application areas do not allow speech to be used. Space and the underwater domain are two prime examples, and they are also application domains which would benefit from robotic technologies. The previous scenario also serves as an example where vision is the applicable natural communication protocol. Gestures are commonly used by scuba divers to communicate with each other, and therefore also incur a low learning cost (for the operators) for human-robot communication. Machine vision carries the benefit of being a low-power, unobtrusive sensing technology, and is also an established and active area of research.

Part of the work presented in this dissertation involves research in human tracking, and that alone has a potentially large application base. Pedestrian tracking for smart road vehicles, security surveillance applications, social robots, caregiver robots and smart homes are a few of the areas where tracking human motion is a fundamental requirement. In our work, the Fourier Tracker provides the people tracking component. One can envision similar techniques being applied to areas such as those enumerated above, by focusing on domain-specific attributes of human motion.

1.5. Contributions

In this section, we identify the intellectual contributions made in this thesis. The core research presented here contributes various techniques towards the development of a vision-based Human-Robot Interaction framework. Namely, the major contributions are as follows:

• A human-robot dialog model that evaluates the need for interaction based on utility and risk [83].
• A visual language for programming robotic systems using gestures, and the human-interface studies towards quantifying its performance [24].
• A visual biometric system for detecting and tracking multiple scuba divers in the underwater domain [80].
• A framework combining the major contributions listed above and some of the minor contributions listed below to create an operational HRI system for an amphibious robot that includes gesture input, risk assessment, visual tracking, and general vision-based navigation and guidance capability [81].

Additional, minor contributions are:

• A machine learning algorithm to learn the spatial color distribution of objects to achieve robust tracking under variable lighting and color distortion [79].
• A Graphical State-Space Programming (GSSP) paradigm, where a human operator can program a mobile robot using a graphical user interface to perform certain tasks based on its (either two-dimensional or three-dimensional) position [87].
• Creation of a new class of fiducials, where information is stored in the frequency spectrum, as opposed to the spatial distribution of the image bits. Called Fourier Tags, these fiducials exhibit graceful degradation properties, where information content attenuates gradually with distance, as opposed to having a sudden cutoff [75].
• Implementation of the above-mentioned suite of algorithms in the form of a software framework for human-robot interaction. Coded in C++ and spanning over 90,000 lines of code, this software framework incorporates all the major and minor contributions listed so far, and enables both real-time operation on-board the Aqua robot and off-board rapid prototyping of new algorithms.

1.6. Statement of Originality

Parts of the research presented in this thesis have already been published [76, 85, 75, 24, 77, 105, 84, 78, 81, 80, 79, 87, 82] or are in the process of being submitted for publication to peer-reviewed conferences or journals. A number of colleagues, other than my supervisor and PhD committee members, have made contributions to this work. Anqi Xu assisted in the work presented in Chapter 3. Dr. Philippe Giguère collaborated on the system development work relating to robot control for the Aqua platforms, although the systems development relating to the HRI framework was the author's own contribution. Dr. Nicholas Roy provided valuable guidance, along with Dr. Joelle Pineau and Dr. Gregory Dudek, in the work that led to the quantitative modeling of human-robot dialog presented in Chapter 6.


1.7. Document Outline

In this thesis, we describe a complete framework for machine vision-based Human-Robot Interaction and a suite of algorithms that contribute to this framework. These algorithms, in most cases, are concerned with communications between a human and a robot, in both implicit and explicit ways. This introductory chapter presented an overview of our problem and discussed some motivations for pursuing this line of research. We structure this thesis in two distinct parts. In the first, we describe the HRI framework, and the associated features and algorithms that fill the different roles required by the framework. In the second, we present experimental evaluations of these algorithms and the framework as a whole. The evaluations are performed both in controlled environments and in outdoor field trials on-board the Aqua vehicle. Because of the nature of this research, a significant number of human-interface trials were performed. The experimental results themselves are presented in two parts – one part looking at the human-interface trials of the class of explicit interaction algorithms, and the other presenting evaluation results from techniques that address the class of implicit interactions.

We dive into the rest of the thesis by discussing related work in Chapter 2. Namely, we present past and related work in the specific areas that contribute to our framework, and briefly discuss some of the theoretical approaches we have relied upon in our research. Also in this chapter, we explain in detail the visual-interaction framework we propose to characterize, along with a review of existing research in the relevant fields. In particular, we discuss related work in human-robot interaction, visual tracking, spatio-temporal filtering, machine learning and dialog systems, as they form the core of our proposed framework. The concept of Graphical State-Space Programming, or GSSP, is also introduced in this chapter.


The rest of the document is structured as follows. In Chapters 3 to 6, we look at the specific algorithms we developed as part of this larger framework. Chapter 3 describes the visual programming language we created to communicate explicit commands to a mobile robot. An algorithm we developed to uniquely identify and track scuba divers in underwater environments by using spatio-temporal signatures is described in Chapter 4. The Spatio-Chromatic tracker, a machine learning-based approach towards robust tracking, is presented in Chapter 5. A quantitative analysis of human-robot interaction systems in arbitrary domains can be found in Chapter 6, where we perform a probabilistic analysis of interaction under uncertainty, and suggest actions to avoid dangerous task execution under severe uncertainty and high task cost. The validation and testing process required a significant effort in systems development, both in the hardware and software domains. We present the fundamentals of the Aqua robot platform, from both hardware and software perspectives, in Chapter 7. The experimental results are presented across two chapters, with Chapter 8 describing our evaluation results of the explicit communication algorithms and Chapter 9 presenting results from the implicit interaction techniques. Lastly, we present concluding remarks and lay out possible directions of future research in Chapter 10.

CHAPTER 2

A Framework for Robust Visual HRI

In this chapter, we formally introduce our framework for vision-based Human-Robot Interaction, and explain the various components that constitute this framework. In particular, we outline the different layers of interaction we envision in this framework, from the perspectives of both execution time and computational cost. The class of algorithms that plugs into the different layers is reviewed in brief, as a precursor to later chapters that are dedicated to more detailed discussions. In the context of this framework, we also cover past research in the various disciplines which this work encompasses. The first half of this chapter presents a review of related work, which also motivates the construction of the visual-HRI framework.

Our approach to Human-Robot Interaction looks collectively at a set of "tasks" being performed by the robot to facilitate communication with a human operator in order to execute a set of given commands. The specific set of interaction "behaviors" or algorithms performed by the robot is task (and in some cases, domain) dependent, but the general functioning of the framework remains unchanged. As described later in the chapter, our visual-HRI framework classifies this set of tasks, from the robot's perspective, according to computational load and rate of invocation. The naturally-occurring, inversely-correlated relationship between these two seemingly contradictory properties gives rise to a coherent stratification which forms the backbone of our framework. Section 2.2 provides a detailed discussion of this architecture.

2.1. Related Work

To visually interact with a human operator, a robot needs a set of abilities that enable it to work in concert with the operator. These capabilities include, but are not limited to, accepting communications, identifying and following the operator, and learning from existing data to further enhance the visual perception subsystem. Given an instruction from the operator, the robot should also be able to assess the uncertainty in perception and the risk involved in carrying out the task, before any actual task execution is performed. These abilities are incorporated in our framework using algorithms for visual tracking, visual servoing, spatio-temporal and steerable filters, fiducial markers, gesture recognition and human-robot dialog management. As such, components of our proposed visual HRI framework have seen extensive research in the fields of Mobile Robotics, Machine Vision, Human-Machine Interaction, Machine Learning and Augmented Reality.

2.1.1. Visual Tracking. Visual tracking is the process of repeatedly detecting a feature or set of features in a sequence of input images. Choosing features to track can be a complicated problem, since noise in the sensor (i.e., camera), lighting and visibility changes, refraction, and the appearance of multiple similar objects in the image frame, among others, can create unforeseen problems. Since tracking is primarily an on-line, real-time application of vision, a tracking algorithm must be fast as well as accurate. A tracker also needs to be robust, so that the effects of false targets and occlusions are minimized.

Choosing a feature to track is an important step in tracking algorithm design, as already stated above. Over the years, a large amount of work has been done on tracking algorithms that track features ranging from shape and motion to color and gray-scale intensities. Techmer [95] uses object contours to detect motion and track targets. Freedman and Brandstein [32] have investigated detecting object contours in cluttered environments. Isard and Blake [40] introduced the "CONDENSATION" or Conditional Density Propagation algorithm for stochastically tracking curves or contour shapes in clutter. This is also known as tracking using Particle Filters. Tracking contours usually involves an iterative scheme that converges to the shape being tracked after a finite number of iterations, and uses a probabilistic approach to converge to the best solution at that instant. While they can be quite accurate, contour trackers rely heavily on a clear view of the target that shows object boundaries distinctly compared to the background.

Using color features in visual tracking is an attractive option because of its simplicity and robustness under partial occlusion, depth and scale changes [47]. Nevertheless, there exist some significant problems that need to be addressed in order to design a robust and accurate color tracker. The most severe problem with color cues is color constancy [30], which is defined as the removal of color bias due to the effect of illumination. Issues like shadows, changes in illumination and camera characteristics affect the phenomenon of color constancy. Since we are considering color trackers suitable for real-time applications (such as autonomous underwater or aerial vehicles), we seek a robust and efficient representation of the object colors, resulting in faster and more accurate computation. The color space [47, 45] plays an important role in computational accuracy and robustness. Tracking algorithms used in this framework all operate in the normalized-RGB (i.e., hue) space, which is obtained by dividing the individual RGB values of each pixel by the sum of the values in the R, G and B channels.

The simplest approach to color-based tracking is to use a segmentation algorithm to detect objects of interest using their color features (our specific tracker that works by learning spatial color distribution makes use of this approach). The output of the segmentation algorithm is a set of (possibly disconnected) regions in a binary image that match the color properties being tracked. These regions are termed 'blobs', and hence the approach is known as Color Blob Tracking. We attempt to form these blobs through a thresholding process. By thresholding, we refer to the operation where pixels are selected in the output if and only if their color values fall within a certain range. These blob trackers use the average normalized-RGB values in a fixed-size window to set the low and high thresholds for the segmentation process.

The color histogram tracker [94] works by first creating a color histogram of a fixed sub-region of the image, which we refer to as the target model histogram; this model is created presumably in the immediate neighborhood of the target to be tracked. During the tracking stage, every incoming frame from the camera is divided into rectangular regions and their histograms are calculated. The similarity between each new candidate histogram and the target model histogram is calculated according to a distribution similarity measure. The sub-window with the highest match is chosen as the probable sub-window containing the target. The pattern of scanning the image for the target can be sequential, or follow a spiral pattern starting from the location of the target found in the previous frame. Depending on the application, the size and shape of the sub-window can also be made to change dynamically, although that makes the tracker computationally more expensive, which may affect real-time performance.
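A minimal sketch of the thresholding step described above is given below; it is illustrative rather than the thesis code, and the packed-RGB input format and threshold structure are assumptions.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Sketch of color-blob thresholding in the normalized-RGB space: each pixel's
// chromaticity is compared against low/high thresholds (assumed to be set from
// a window around the target); the output is a binary mask from which
// connected "blobs" can then be extracted.
struct Thresholds { double rLo, rHi, gLo, gHi; };

std::vector<std::uint8_t> thresholdNormalizedRGB(const std::vector<std::uint8_t>& rgb,  // packed R,G,B
                                                 int width, int height,
                                                 const Thresholds& th) {
    std::vector<std::uint8_t> mask(static_cast<std::size_t>(width) * height, 0);
    for (int i = 0; i < width * height; ++i) {
        const double r = rgb[3 * i + 0];
        const double g = rgb[3 * i + 1];
        const double b = rgb[3 * i + 2];
        const double sum = r + g + b + 1e-6;            // avoid division by zero
        const double rn = r / sum, gn = g / sum;        // normalized (chromaticity) values
        if (rn >= th.rLo && rn <= th.rHi && gn >= th.gLo && gn <= th.gHi)
            mask[i] = 255;                              // pixel belongs to the color blob
    }
    return mask;
}
```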


Mean-shift tracking [18] performs visual tracking by attempting to maximize the correlation between two statistical models of the underlying color distribution in the image. The correlation between the two distributions is expressed as a measurement derived from the Bhattacharyya coefficient [97]. Mean-shift trackers have been used to track objects based on color or texture, by building a statistical distribution of the feature being tracked. In effect, the mean-shift tracker relies on the mean-shift vector [17] to detect the direction of the change in gradient and correspondingly point to the (possible) new location of the target being tracked.

2.1.2. Distribution Similarity Measures. The tracking algorithms depend on a statistical measurement of similarity between two histograms to detect a possible match between the target and a candidate location. We assume two histograms (i.e., distributions) H and K, both having the same number of bins, N.

The measurements compare h_i and k_j for i = j, where h_i and k_j are the i-th and j-th bins of histograms H and K, respectively.

2.1.2.1. The Bhattacharyya Distance Measure. The Bhattacharyya coefficient [9] has a direct geometric interpretation with respect to two distributions; for two m-dimensional unit vectors p and q, it is equal to the cosine of the angle between them. The Bhattacharyya distance between two histograms can be found using the following expression:

$$ \rho_{\mathrm{Bhattacharyya}}(H, K) = \sum_{i=1}^{N} \sqrt{h_i k_i} \tag{2.1} $$

It has been shown in [17] that this measure is near-optimal and possesses scale-invariant properties.
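A direct sketch of Equation 2.1 follows (not the thesis code; it assumes both histograms are normalized to sum to one):

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Bhattacharyya coefficient of Equation 2.1 for two normalized histograms with
// the same number of bins. A value of 1 indicates identical distributions;
// values near 0 indicate a poor match.
double bhattacharyyaCoefficient(const std::vector<double>& h,
                                const std::vector<double>& k) {
    double rho = 0.0;
    for (std::size_t i = 0; i < h.size() && i < k.size(); ++i)
        rho += std::sqrt(h[i] * k[i]);  // sum over bins of sqrt(h_i * k_i)
    return rho;
}
```

In the mean-shift literature, a distance is then typically derived from the coefficient, for example d = sqrt(1 − ρ).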

2.1.3. Visual Servoing. Visual servoing is a well-studied area of robotics [39], one which brings the theory of active vision into practice in real-world applications. Robotic assembly lines, autonomous aircraft control (e.g., landing and hovering), robot-assisted surgery [52], and the navigation and guidance of underwater robots [103] are applications where visual servoing has been highly successful. Visual servoing algorithms are generally classified as either image-based or position-based: in the former, tracking and manipulator control calculations are performed in the image space, and in the latter, they are performed in the robot manipulator configuration space. A proportional-integral-derivative (PID) controller [53] is commonly found in visual servo systems to control the motion of the robot or the manipulator. The gains of PID controllers are usually tuned manually, although sophisticated systems exist which apply adaptive control to automatically adjust the controller gains.
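For illustration, a discrete PID update of the kind used in such servo loops is sketched below; this is a generic textbook form under assumed gains, not the controller used on Aqua.

```cpp
// Minimal sketch of a discrete PID controller for image-based visual servoing:
// the error might be, for example, the horizontal offset in pixels between the
// tracked target and the image center, and the output a yaw command. Gains
// and the usage example are assumptions.
class PIDController {
public:
    PIDController(double kp, double ki, double kd) : kp_(kp), ki_(ki), kd_(kd) {}

    double update(double error, double dt) {
        integral_ += error * dt;
        const double derivative = (dt > 0.0) ? (error - prevError_) / dt : 0.0;
        prevError_ = error;
        return kp_ * error + ki_ * integral_ + kd_ * derivative;
    }

private:
    double kp_, ki_, kd_;
    double integral_ = 0.0;
    double prevError_ = 0.0;
};

// Example (hypothetical gains and frame rate):
//   PIDController yawCtrl(0.8, 0.05, 0.2);
//   double yawCmd = yawCtrl.update(targetX - imageCenterX, 1.0 / 15.0);
```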

2.1.4. Biological Motion Tracking. A key aspect of our work is a filter-based characterization of the motion field in an image sequence. This has been a problem of longstanding relevance and activity, and were it not for the need for a real-time, low-overhead solution, we would be using a full family of steerable filters, or a related filtering mechanism [31, 33]. In fact, since our system needs to be deployed in a hard real-time context on an embedded system, we have opted to use a sparse set of filters combined with a robust tracker. This depends, in part, on the fact that we can consistently detect the motion of our target human from a potentially complex motion field.

Tracking humans using their motion on land, in two degrees of freedom, was examined by Niyogi and Adelson [59]. They look at the positions of the head and ankles, respectively, and detect the presence of a human walking pattern by looking at a "braided pattern" at the ankles and a straight-line translational pattern at the position of the head. In their work, however, the person has to walk across the image plane roughly orthogonal to the viewing axis for the detection scheme to work.


There is evidence that people can be discriminated from other objects, as well as from one another, based on motion cues alone (although the precision of this discrimination may be limited). In the seminal work using "moving light displays", Rashid observed [69] that humans are exquisitely sensitive to human-like motions using even very limited cues. Several researchers have looked into the task of tracking a person by identifying walking gaits. Recent advancements in the field of Biometrics have also shown promise in identifying humans from gait characteristics [58]. It appears that different people have characteristic gaits, and it may be possible to identify a person using the coordinated relationship between their head, hands, shoulders, knees, and feet. Particularly in the context of biometric person identification, automated analysis of human motion or walking gaits [91, 90] has yielded interesting results.

In a similar vein, several research groups have explored the detection of humans on land from either static visual cues or motion cues. Such methods typically assume an overhead, lateral or other view that allows various body parts to be detected, or facial features to be seen. Notably, many traditional methods have difficulty if the person is walking directly away from the camera. In contrast, this chapter proposes a technique that functions without requiring a view of the face, arms or hands (any of which may be obscured in the case of scuba divers). In addition, in our particular tracking scenario the diver can point directly away from the robot that is following him, as well as move in an arbitrary direction during the course of the tracking process.

While tracking underwater swimmers visually has not been explored in great depth in the past, some prior work has been done in the field of underwater visual tracking and visual servoing for autonomous underwater vehicles. Naturally, this is closely related to generic servo-control. In that context, on-line real-time performance is crucial. On-line tracking systems, in conjunction with a robust control scheme, provide underwater robots with the ability to visually follow targets underwater [86].


Previous work on spatio-temporal detection and tracking of biological motion underwater has been shown to work well [77], but only when the motion of the diver is directly towards or away from the camera. Our current work looks at motion in a variety of directions over the spatio-temporal domain, incorporates a variation of the Kalman filter, and also estimates diver distance; it is thus a significant improvement over that particular technique. In terms of the tracking process itself, the Kalman filter [42] is, of course, the preeminent classical methodology for real-time tracking. It depends, however, on a linear model of system dynamics. Many real systems, including our model of human swimmers, are non-linear, and the linearization needed to implement a Kalman filter needs to be carefully managed to avoid poor performance or divergence. The Unscented Kalman Filter [41] we deploy was developed to facilitate non-linear control and tracking, and can be regarded as a compromise between Kalman filtering and fully non-parametric Condensation [40].
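For reference, the sketch below implements the linear predict/update cycle that the Unscented Kalman Filter generalizes, here for a constant-velocity target measured in image coordinates. The state layout, noise magnitudes and frame rate are illustrative assumptions rather than the parameters of our tracker; the UKF replaces the linear transition and measurement matrices with sigma-point propagations through a non-linear swimmer model, but the overall predict/update structure is the same.

import numpy as np

dt = 1.0 / 15.0                       # assumed frame period
F = np.array([[1, 0, dt, 0],          # constant-velocity state transition
              [0, 1, 0, dt],
              [0, 0, 1, 0],
              [0, 0, 0, 1]], dtype=float)
H = np.array([[1, 0, 0, 0],           # we only measure image position (u, v)
              [0, 1, 0, 0]], dtype=float)
Q = np.eye(4) * 1e-2                  # process noise (placeholder)
R = np.eye(2) * 4.0                   # measurement noise in pixels^2 (placeholder)

x = np.zeros(4)                       # state: [u, v, du, dv]
P = np.eye(4) * 100.0                 # initial uncertainty

def kf_step(z):
    """One predict/update cycle given a pixel measurement z = (u, v)."""
    global x, P
    # Predict
    x = F @ x
    P = F @ P @ F.T + Q
    # Update
    y = np.asarray(z) - H @ x                    # innovation
    S = H @ P @ H.T + R                          # innovation covariance
    K = P @ H.T @ np.linalg.inv(S)               # Kalman gain
    x = x + K @ y
    P = (np.eye(4) - K @ H) @ P
    return x[:2]                                 # filtered target position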

2.1.5. Fiducials for Augmented Reality. Systems that use fiducial markers can be decomposed into an operational chain with three phases: 1) the design and synthesis of the fiducial itself, 2) the detection and extraction of the fiducial in the image, and 3) the decoding of the payload from the marker image. In many systems the fiducials carry little semantic information, and may not even be labeled, in which case the third stage is missing. Markers of this type include indistinguishable light emitting diodes [102]. A number of planar marker systems exist that embed information in encoded patterns, and a few are designed explicitly for localization as required for applications such as augmented reality (AR) [61]. Among general-purpose marker schemes, the US Postal Service uses MaxiCode (Fig. 2.1(a)) to encode shipping information on packages, while QuickResponse and Data Matrix (Fig. 2.1(b)) are also used to encode information needed for part labeling.


Figure 2.1. Different existing planar markers: (a) MaxiCode, (b) DataMatrix Symbol, (c) ARToolkit, (d) ARTag.

These encoding methods all use error correction schemes to recover the encoded information in cases where part of the information retrieved from the symbols is corrupted. Although useful for encoding information with more sophistication than standard commercial bar-codes, these systems are not very useful as fiducials. There are two reasons for this: first, they can be hard to detect in a large field of view with perspective distortion, and second, they have to occupy a large portion of the image to be detected robustly. The latter requirement severely limits the range at which these markers can be detected. ARToolKit [43] and ARTag markers (Figs. 2.1(c) and 2.1(d) respectively) are designed to be used as fiducial markers for augmented reality applications (hence the AR prefix). Both tag systems are bi-tonal, using markers made up of black and white square patterns, a choice that seeks to limit the effects of lighting variations on the tag detection process. ARToolKit encodes a feature vector of fixed length (usually 256 or 1024 elements) inside a black quadrilateral pattern outline. This vector is compared using correlation to a library of known markers, and the presence of a tag can be detected by thresholding on the confidence factor output by the system (Owen [61] has modified the ARToolkit markers to include Fourier encodings). ARTag, on the other hand, applies digital techniques to encode and match patterns inside the square boundary of the tags and also uses Forward Error


Correction (FEC) and Cyclic Redundancy Check (CRC) methods to encode a 10-bit payload in a 36-bit binary sequence. The ARTag system relies on quadrilateral detection to identify the four corners of the tag boundary and then samples the image inside the detected region using a 6 × 6 grid. This sampled interior is then checked with the aforementioned CRC and FEC codes to retrieve the actual 10-bit binary sequence embedded within the tag. An additional bonus with both these tag libraries is the ability to estimate orientation, which makes them suitable for camera calibration or pose estimation. CyberCode [70] uses two-dimensional bar codes as a basis for a marker system, placing particular emphasis on phase 3 of the operational chain described above. These can carry multi-bit labels, but their detection depends on their having good visibility in the image. However, if resolution is insufficient, the data may not be intelligible even if the side bars that aid detection are found. CyberCode also has drawbacks as a fiducial marker system because it provides only three salient points, and hence, once the sidebars are located, it can only correct for affine, not perspective, distortion. One method that uses a multi-scale pattern is that of Cho and Neumann [15], whose targets are composed of a small set of colored concentric circles, with a slight resemblance to a marker scheme we developed as part of the research presented in this thesis. Their approach uses a fixed set of colored rings arranged in a square. Color is employed to ease detection but should also, in theory, allow for greater information encoding as compared to achromatic targets. On the other hand, printing costs, control of printer gamut mapping, and color constancy issues might make colored fiducials unattractive for many applications. Claus and Fitzgibbon [16] developed a system of fiducials that employs machine learning to optimize the markers with respect to lighting conditions. This approach appears promising but is both computationally quite demanding, and also

fails non-gracefully as resolution drops off. Circular markers are also possible in fiducial systems, as demonstrated by the PhotoModeler “Coded Marker Module” system [55]; the downsides of using circular markers are the high rate of false positives and negatives, inter-marker confusion, and the difficulty of pose estimation from a single visible marker. That said, circular markers can be used to encode gradually decaying information, an example of which we discuss in Chapter 3.
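Returning to the square-tag decoding pipeline described above (quadrilateral detection, interior sampling on a 6 × 6 grid, then CRC/FEC checking), the sketch below illustrates the sampling step using OpenCV. It is a generic illustration of phases 2 and 3 of the operational chain under assumed grid and cell sizes, not ARTag's actual decoder; the error-checking step is left as a stub.

import cv2
import numpy as np

def sample_tag_bits(gray, corners, grid=6, cell=8):
    """Rectify the tag interior given its four corners and sample a grid of bits.

    gray    : single-channel image containing the detected tag
    corners : 4x2 array of corner pixel coordinates (clockwise from top-left)
    """
    side = grid * cell
    dst = np.array([[0, 0], [side - 1, 0],
                    [side - 1, side - 1], [0, side - 1]], dtype=np.float32)
    H = cv2.getPerspectiveTransform(np.asarray(corners, dtype=np.float32), dst)
    patch = cv2.warpPerspective(gray, H, (side, side))
    bits = np.zeros((grid, grid), dtype=np.uint8)
    for r in range(grid):
        for c in range(grid):
            cell_px = patch[r * cell:(r + 1) * cell, c * cell:(c + 1) * cell]
            bits[r, c] = 1 if cell_px.mean() > 127 else 0   # bi-tonal threshold
    return bits

def decode_payload(bits):
    """Stub: a real system would apply CRC/FEC here to recover the 10-bit ID."""
    return bits.flatten()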

2.1.6. Machine Learning in Computer Vision. Tools from Machine Learning have been widely adopted in many different applications in computer vision in recent years. Learning is one of the current frontiers for computer vision research and has been receiving increased attention. Machine learning has strong potential to contribute to the development of flexible and robust vision algorithms, promising practical vision systems with a higher level of competence and greater generality, as well as architectures that speed up system development and provide better performance. Ensemble methods [34] such as Boosting or Bagging combine several hypotheses into one prediction. They work better than the best individual hypothesis from the same class because they reduce bias or variance (or both). Boosting is a method of finding a highly accurate hypothesis (classification rule) by combining many “weak” hypotheses, each of which is only moderately accurate. The notion of the weak hypothesis is formalized as follows: a “weak” learning algorithm performs better than random guessing, i.e., it has an error rate of less than 50 per cent. Typically, each weak hypothesis is a simple rule which can be used to generate a predicted classification for any instance. Theoretical analysis [88] suggests that Boosting is markedly resistant to over-fitting, unlike many other classification methods. The AdaBoost algorithm [35], first introduced by Freund and Schapire, is a fast and robust Boosting algorithm. AdaBoost works by maintaining a set of weights over the

training set. Initial weights are set uniformly, but as training progresses, the weights of the wrongly classified training examples are increased to force the subsequent weak learners to focus more on the ‘hard’ examples and boost overall accuracy. Viola and Jones’ seminal paper [99] on object detection and recognition introduced a feasible application of Boosting algorithms as well as the concept of cascaded detectors. Avidan [5] has implemented an Ensemble Tracker using AdaBoost, using a collection of “weak trackers” to come up with a robust estimate of the target location in video sequences. We have used AdaBoost to create robust trackers for arbitrary domains and objects, based on color cues (see Chapter 5). Mobile robot localization [96], gait selection [14] and environmental estimation problems [36] have also seen applications of various other machine learning techniques. Many variations of AdaBoost and other Boosting algorithms exist, for multi-class problems (AdaBoost.M2, for example) and regression, although in this thesis we only use the original AdaBoost algorithm for classification.
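As a concrete illustration of the weight-update mechanism just described, the sketch below trains a binary AdaBoost classifier from single-threshold decision stumps, following the standard textbook formulation with labels in {-1, +1}. It is not the boosted tracker of Chapter 5; the feature representation and number of rounds are arbitrary placeholders.

import numpy as np

def train_adaboost(X, y, rounds=20):
    """X: (n, d) features; y: labels in {-1, +1}. Returns a list of weighted stumps."""
    n, d = X.shape
    w = np.full(n, 1.0 / n)                    # uniform initial weights
    ensemble = []
    for _ in range(rounds):
        best = None
        # Exhaustively search for the stump (feature, threshold, polarity)
        # with the lowest weighted error.
        for j in range(d):
            for thr in np.unique(X[:, j]):
                for polarity in (+1, -1):
                    pred = polarity * np.where(X[:, j] > thr, 1, -1)
                    err = np.sum(w[pred != y])
                    if best is None or err < best[0]:
                        best = (err, j, thr, polarity)
        err, j, thr, polarity = best
        err = max(err, 1e-10)                  # avoid division by zero
        alpha = 0.5 * np.log((1.0 - err) / err)
        pred = polarity * np.where(X[:, j] > thr, 1, -1)
        # Increase weights of misclassified examples, decrease the rest.
        w *= np.exp(-alpha * y * pred)
        w /= w.sum()
        ensemble.append((alpha, j, thr, polarity))
    return ensemble

def predict(ensemble, X):
    score = np.zeros(X.shape[0])
    for alpha, j, thr, polarity in ensemble:
        score += alpha * polarity * np.where(X[:, j] > thr, 1, -1)
    return np.sign(score)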

2.1.7. Human-Robot Interaction and Human-Robot Dialogs. A core component of our research uses a gesture-like interface to accomplish human-robot interaction, and this is somewhat related to visual programming languages. The inference process we use is based on a Markovian dialog model. Hence, we briefly comment on prior work, necessarily in a rather cursory manner, in a variety of disparate and rich domains. As this particular research builds on our past work in vision-based human-robot interaction, we briefly revisit that work in this section as well. Our previous work looked at using visual communications, and specifically visual servo-control with respect to a human operator, to handle the navigation of an underwater robot [86]. In that work, while the robot follows a diver to maneuver, the diver can only modulate the robot’s activities by making hand signals that are interpreted by a human operator on the surface. Application of that work, where

robot control was purely “open-loop”, motivated our work on explicit human-robot interaction. Visual communication has also been used by several authors to allow communication between systems, for example in the work of Dunbabin et al. [25]. The work of Waldherr, Romero and Thrun [100] exemplifies the explicit communication paradigm in which hand gestures are used to interact with a robot and lead it through an environment. Tsotsos et al. [98] considered a gestural interface for non-expert users, in particular disabled children, based on a combination of stereo vision and keyboard-like input. As an example of implicit communication, Rybski and Voyles [73] developed a system whereby a robot could observe a human performing a task and learn about the environment. Gesture-based robot control has been considered extensively in Human-Robot Interaction (HRI). This includes explicit as well as implicit communication frameworks between human operators and robotic systems. Several authors have considered specialized gestural behaviors [46] or strokes on a touch screen to control basic robot navigation. Skubic et al. have examined the combination of several types of human interface components, with special emphasis on speech, to express spatial relationships and spatial navigation tasks [93]. Vision-based gesture recognition has long been considered for a variety of tasks, and has proven to be a challenging problem examined for over 20 years, with diverse well-established applications [27][63]. The types of gestural vocabularies range from extremely simple actions, such as a closed fist versus an open hand, to very complex languages, such as American Sign Language (ASL). ASL allows for the expression of substantial affect and individual variation, making it exceedingly difficult to deal with in its complete form. For example, Derpanis et al. [19] considered the interpretation of elementary ASL primitives (i.e., simple component motions) and achieved 86 to 97 per cent recognition rates under controlled conditions. While such rates are good, they are disturbingly low for open-loop robot-control purposes.


While our current work looks at interaction under uncertainty in any input modality, researchers have investigated uncertainty modeling in human-robot communication with specific input methods. For example, Pateras et al. applied fuzzy logic to reduce uncertainty when decomposing high-level task descriptions into robot sensor-specific commands in a spoken-dialog HRI model [62]. Montemerlo et al. have investigated risk functions for safer navigation and environmental sampling for the Nursebot robotic nurse in the care of the elderly [57]. Bayesian risk estimates and active learning in POMDP formulations in a limited-interaction dialog model [20] and spoken language interaction models [21] have also been investigated in the past. Researchers have also applied planning cost models for efficient human-robot interaction tasks [48][49]. One particular approach towards task execution with communication under uncertainty involves modeling the dialog mechanism using a Bayes’ risk model. In this model, a Decision Function creates a mapping from an observation to an action. The optimal value of the Decision Function (i.e., the one that minimizes the total risk) is referred to as the Bayes’ risk. In a human-robot dialog scenario, the observations are the inputs observed by the robot’s sensors, and the actions are commands corresponding to the observed input statement. To create a risk-minimizing formulation, the Bayes’ risk model requires a set of prior probabilities over the space of all possible input tokens, given a language. For a general-purpose dialog model, this requirement alone creates an insurmountable difficulty. As such, our policy is to take a different approach from the Bayes’ risk model, eliminating the need for prior probabilities altogether.
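For reference, the Bayes' risk decision rule described above can be written as follows, where $o$ is the observed input, $c$ ranges over the commands the operator may have intended, $a$ over executable actions, and $\lambda(a, c)$ is the cost of taking action $a$ when the operator intended $c$; this is the generic decision-theoretic formulation rather than one taken from a specific prior system.

\begin{equation}
  R(a \mid o) = \sum_{c} \lambda(a, c)\, P(c \mid o),
  \qquad
  a^{*}(o) = \arg\min_{a} R(a \mid o).
\end{equation}

The expected value of $R(a^{*}(o) \mid o)$ over observations is the Bayes' risk; evaluating $P(c \mid o)$ is what requires the prior over all possible input tokens that our approach avoids.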

2.2. A Framework for Visual Human-Robot Interaction

This section discusses the framework we propose for visual human-robot interaction. We look at the different visual sensing algorithms discussed in the background


Figure 2.2. The three interaction layers in our proposed Visual-HRI framework.

section above, and discuss the applications and integration of these approaches to yield a robust, consistent and useful control-interface scheme. The key focus in this thesis is a hierarchical breakdown of vision algorithms in a three-layer interaction framework. The hierarchy arises from two aspects – the frequency at which interaction is required, and the computational cost incurred to perform the visual task. Thus, our interaction framework classifies vision algorithms from high-to-low interaction frequencies, or conversely, from low-to-high computational cost, as depicted in Fig. 2.2. The three different categories arise naturally from the need to acquire information for coherent interaction. Capabilities such as visual tracking are semantically low-level, but require frequent updates for successful operation. Visually commanding the robot is not necessary at such a high frequency, and thus calls for a lower-frequency interaction. Object recognition and contextual content analysis are computationally expensive, but are performed at an even coarser time scale. Our current work has addressed techniques belonging to the individual levels of this hierarchy. A detailed view of the interaction scheme can be found in Fig. 2.3.


Figure 2.3. An abstract breakdown for the proposed Visual-HRI framework.
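One way to realize this layered organization in software is as a single sensing loop that invokes each layer at its own period. The sketch below shows such a scheduler with placeholder rates and module names (track_target, parse_symbols, assess_task_risk); these are illustrative assumptions and not the actual components or rates of our implementation.

import time

# Placeholder callbacks standing in for the three layers of the framework.
def track_target(frame):        pass   # high-frequency: tracking and servoing
def parse_symbols(frame):       pass   # intermediate-frequency: symbolic commands
def assess_task_risk(frames):   pass   # low-frequency: dialog and risk assessment

LAYERS = [
    (track_target,     1),    # every frame
    (parse_symbols,    5),    # every 5th frame
    (assess_task_risk, 50),   # every 50th frame
]

def interaction_loop(get_frame, fps=15):
    history, frame_idx = [], 0
    while True:
        frame = get_frame()
        history.append(frame)
        for layer_fn, period in LAYERS:
            if frame_idx % period == 0:
                layer_fn(history if layer_fn is assess_task_risk else frame)
        frame_idx += 1
        time.sleep(1.0 / fps)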

Another classification of these interaction algorithms is one that arises from the nature of the interaction itself, independent of the computational cost and invocation rate. In this thesis, we call Explicit Interactions those algorithms which involve a human operator interacting directly with the robot; usually, methods that enable a human operator to send commands to the robot, or conversely, methods that enable a robot to acknowledge receipt of a command or send feedback to the operator, are classified as such. Implicit Interactions, on the other hand, are algorithms that contribute to human-robot interaction in less direct ways, and do not involve communicating directly with the operator. From the perspective of a classical operating system architecture, implicit interaction algorithms take up the role of daemon processes, working in the background to support tasks communicated by the

explicit interaction methods (i.e., foreground processes). This thesis looks at a visual programming language and the paradigm of Graphical State-Space Programming as two examples of explicit interactions. People-following, Spatio-Chromatic tracking and task cost assessment are examples of implicit interaction algorithms. While the secondary classification based on the directness of the interaction is an important consideration, we concentrate on the three-layer architecture for our framework, as it leads to a natural separation between the algorithms, and also strengthens the inter-layer synergy, as will be explained later in this chapter (and also later in this thesis). The following sections describe the purposes of the individual layers of interaction, and conclude with a discussion of the overall operation of the integrated framework.

2.2.1. High-frequency Methods. We aim to construct a vision-based interaction scheme whose components operate at different interaction frequencies. High-frequency methods are executed frequently, and this layer contains the least computation-intensive algorithms. The following subsections discuss some of these algorithms and their utilities. 2.2.1.1. Visual Tracking and Servoing. To visually track the targets in question, we use the color information of the target object. An array of trackers is used, ranging from a thresholding-based color segmentation (a.k.a. “blob”) tracker to a more sophisticated kernel-based mean-shift tracker, as discussed in the previous section. While the color segmentation tracker is computationally less expensive, the mean-shift tracker has more accurate tracking performance at the cost of more computational power. The mean-shift tracker is also better equipped to track objects that have diverse color characteristics, compared to the blob tracker, which performs well when tracking monochromatic objects. Figure 2.4 shows a blob tracker tracking a pink target. Figure 2.4(a) shows the input image, while Fig. 2.4(b) shows the

normalized-RGB view of the input image to the tracker. Using the normalized-RGB colorspace makes the tracking algorithms more robust to lighting variations [92], as the output in Fig. 2.4(c) shows.

(a) Blob tracker tracking a pink target. (b) The normalized-RGB image. (c) Image after segmentation, with the tracker output marked with a cross-hair.

Figure 2.4. “Blob” tracker tracking a target underwater.
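A minimal version of this blob tracker is sketched below: the image is converted to normalized RGB (each channel divided by the channel sum), a color threshold segments the target, and the centroid of the segmented pixels gives the cross-hair position. The threshold ranges are placeholders for a pink target, not the calibrated values used on the robot.

import numpy as np

def normalized_rgb(image):
    """Convert an HxWx3 RGB image to normalized-RGB (chromaticity) space."""
    img = image.astype(np.float32)
    s = img.sum(axis=2, keepdims=True) + 1e-6
    return img / s                      # each channel now in [0, 1]

def blob_track(image, r_range=(0.45, 0.70), g_range=(0.15, 0.35)):
    """Return the (u, v) centroid of pixels inside the color thresholds,
    or None if the target is not visible. Threshold ranges are placeholders."""
    nrgb = normalized_rgb(image)
    mask = ((nrgb[:, :, 0] > r_range[0]) & (nrgb[:, :, 0] < r_range[1]) &
            (nrgb[:, :, 1] > g_range[0]) & (nrgb[:, :, 1] < g_range[1]))
    ys, xs = np.nonzero(mask)
    if len(xs) < 50:                    # too few pixels: declare the target lost
        return None
    return float(xs.mean()), float(ys.mean())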

To further enhance tracking performance, particularly keeping in mind varying lighting and visibility conditions, we have designed a tracker which learns the spatial distribution of hues for an object of interest, and uses a bank of “weak” trackers (i.e., trackers that are computationally inexpensive but do not exhibit high accuracy) to localize the target in the image sequence. Using an approach known as “Ensemble Learning” in the Machine Learning literature, this algorithm creates a more accurate tracker by combining the output of a set of such weak trackers. Robust to changes in illumination and occlusion, and able to track objects with a range of color characteristics, this tracker has significantly improved performance over the other trackers we have deployed. We name this algorithm Spatio-Chromatic Tracking, and we discuss it in depth in Chapter 5. 2.2.1.2. Spatio-temporal Feature Tracking. The presence of multiple targets with similar features often confuses visual trackers, and using color features alone for tracking people and other objects could very well lead to ambiguity. For a robot to keep track of a human operator, it is imperative that the visual tracking algorithms

are able to remove such ambiguity by selecting features that are unique to biological entities. We have chosen to use motion as a unique signature in this framework. Biological motion is inherently periodic in the low frequencies, as has been found in previous research (see Sec. 2.1.4). We focus on extracting periodic gait information from human operators by analyzing a spatial signal in the XYT domain. That is, a location in the image space is analyzed across successive frames for periodicity; such periodicity not only marks a potential candidate location for a human being in the image, but can also be used to uniquely identify which person we are looking at. The intensity signal thus extracted is subjected to a Fourier transform, and low-frequency responses are extracted. Image locations that exhibit high responses in the low frequencies are taken as potential locations. We have validated this approach on video footage of scuba divers swimming underwater [77][80], by exploiting the undulating motion of the diver’s flippers. The person in question does not have to travel in any particular direction from the camera’s perspective for this scheme to work. Figure 2.5 summarizes the process for tracking a diver using the above-mentioned approach. Chapter 4 describes the theoretical aspects of this tracker in more detail, with experimental results presented in Chapter 8.
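The periodicity test at a single image location can be sketched as follows: collect the intensity at that location over a window of frames, take the magnitude of its Fourier transform, and score the location by the energy concentrated in a low-frequency band consistent with a kicking gait. The window length, frame rate and frequency band below are illustrative assumptions, not the parameters of the detector described in Chapter 4.

import numpy as np

def gait_score(intensity_series, fps=15.0, band=(0.3, 1.5)):
    """Score a pixel location for low-frequency periodicity.

    intensity_series : 1-D array of intensities at one (x, y) location over time
    band             : frequency band (Hz) assumed for flipper oscillation
    """
    sig = np.asarray(intensity_series, dtype=float)
    sig = sig - sig.mean()                       # remove the DC component
    spectrum = np.abs(np.fft.rfft(sig))
    freqs = np.fft.rfftfreq(len(sig), d=1.0 / fps)
    in_band = (freqs >= band[0]) & (freqs <= band[1])
    total = spectrum.sum() + 1e-9
    return spectrum[in_band].sum() / total       # fraction of energy in the band

# Locations whose score exceeds a threshold are kept as candidate diver positions
# and passed on to the tracker.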

2.2.2. Intermediate-frequency Methods. Intermediate-frequency methods lie between the High- and Low-frequency methods in terms of invocation rate and computational load. As such, their primary use is in establishing the connection between the methods in those two layers, by executing tasks that require a moderate degree of interaction between the robot and the human. Robust and efficient interaction is essential for successful task coordination between a robot and the human operator. To that end, we propose a communication and programming paradigm with the aid of artificially engineered markers for communicating with and configuring robots. The prerequisites of such a system are the ease of use, ease of deployment, and the


Figure 2.5. Outline of the process of tracking a diver using flipper oscillations.

easy portability and learning of the scheme by the operators themselves. This particular approach towards human-robot communication belongs in the intermediate-frequency layer of our framework. To this end, we have created and evaluated, in both artificial and natural environment trials, a visual programming language called RoboChat. RoboChat is a stack-based language which takes numeric tokens as input, where the tokens themselves can have both atomic and complex semantic meanings. The constructs of the underlying robot programming language embedded in RoboChat are mapped into individual tokens (i.e., marker IDs). RoboChat is independent of the method of


(a) Diver instructing the robot using RoboChat (b) Robot’s camera view

Figure 2.6. Diver using RoboChat to communicate with the Aqua robot.

delivery of the numeric tokens; this fact enables us to use the scheme with different visual symbolic markers, each with its own corresponding detection and decoding library. We have used ARTag [29] and ARToolkitPlus [65] markers as front-ends to the RoboChat language, which are detected robustly in both underwater and terrestrial domains. An example interaction scenario with the RoboChat scheme can be seen in Fig. 2.6, where a scuba diver is giving instructions to the Aqua robot using a set of fiducials. We have also devised an extension to this scheme, in which an operator performs discrete geometric gestures with a pair of such markers. Called RoboChat Gestures, this scheme reduces the number of markers an operator has to carry, as the command space expressed by performing gestures with a pair of markers can contain a large number of language constructs. To evaluate the utility of the RoboChat scheme compared to natural gestures, we perform human-interface studies to measure the usability and accuracy of our proposed system. The ARTag-based system is compared to a hand gesture system in two different experiments, as competing input mechanisms for environments unsuitable for conventional input devices. In the first, we investigate the performance of the two systems in a stressful environment, similar to the one scuba

divers must face underwater, for example. The second study aims to compare the two input mechanisms in the presence of different and significantly larger vocabulary sizes, without any simulated distractions. The main task in both studies is to input a sequence of action commands, with the possibility of specifying additional parameters. Each command in our experiments was a sequence of atomic commands to be sent to the robot, although the hand gesture system is interpreted by a human observer with significant expertise in the system. We experiment with command sequences from a fixed vocabulary, and also look at the effect of varying vocabulary sizes on the performance of both systems. Not surprisingly, the hand gesture system performs better than the RoboChat scheme with a fixed vocabulary size (Fig. 2.7(a)), but the data suggests that the marker scheme is capable of matching half the speed of the hand gestures. The data from the second study, shown in Fig. 2.7(b), suggests that the two input interfaces have very similar performances under the new constraints. Major contributing factors include the increase in the vocabulary size and the inclusion of many abstract action tokens. This variation takes away the crucial advantage hand gestures had in the former study, as participants are forced to search through the gesture sheet rather than remembering all the hand gestures. Details of the RoboChat language are presented in Chapter 3, with user studies and field trial results appearing in Chapter 8.

2.2.3. Low-frequency Methods. The most computationally expensive, and thus the least frequently invoked, algorithms in our framework reside in this layer. Functionally opposite to the algorithms residing in the High-Frequency layer, the methods in this level work with information gathered over a period of time, and often the outputs of the algorithms in the previous layers are given as input. In this thesis, we created a quantitative model of task cost and input uncertainty as an


(a) Small vocabulary with distractions: average time taken per command using ARTag markers and using hand gestures.

(b) Large vocabulary without distractions: average time taken per command using ARTag markers and using hand gestures.

Figure 2.7. Performance data from human-interface trials of the RoboChat input scheme.

instantiation of a low-frequency algorithm, an overview of which is presented below. We defer in-depth discussion of this algorithm until Chapter 6.


2.2.3.1. Quantitative modeling of risk and uncertainty in human-robot dialogs. In an arbitrary human-robot interaction scenario, a human operator gives instructions to a robot through a specific modality, which includes, but is not limited to, gestures, speech and tactile input. In almost every such scenario, the uncertainty in input is non-trivial (in an average-case model). In other words, the possibility of having a fully error- and uncertainty-free communication protocol is low. Thus, there will always exist some uncertainty in the input (i.e., instructions) given to the robot. The source of such input noise could be imperfect sensing, errors made by the users while programming the robot, or even environmental disturbances (e.g., loud background noise in the case of speech, or poor visibility in the case of visual gestures). Given such noisy input, a robot must evaluate possible task costs and assess the risks involved in executing each possible set of instructions, as perceived by the robot’s input sensors. The goal, given the input and the risk assessment results, is to execute a high-risk command if and only if the operator truly asked for it. To that effect, we have devised a quantitative model for assessing risk and uncertainty in human-robot dialogs. The essence of our approach is to model the input using a Hidden Markov Model and generate a set of possible alternative inputs, and then assess the cost of all possible input commands using a task simulator. Given this input belief tracking and the set of task costs, we use a Decision Function to decide whether a confirmation should be requested from the user. To evaluate this scheme, on-the-bench user studies have been performed, and a working implementation is currently on-board the Aqua robot. We have conducted a number of field trials of the system in underwater environments, where a scuba diver communicated with the robot using the RoboChat scheme, and the dialog engine asked for confirmation, if required, before executing the instructed command.
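The decision step can be illustrated with a short sketch: given a belief over candidate commands (e.g., from the input model) and a simulated cost for each, the robot either executes the most likely command or asks for confirmation when the expected cost of being wrong is too high. The belief representation, cost values and threshold below are placeholders for illustration, not the actual decision function developed in Chapter 6.

def decide(belief, task_cost, risk_threshold=0.3):
    """belief    : dict mapping candidate commands to probabilities (sums to 1)
       task_cost : dict mapping candidate commands to simulated execution cost
       Returns ('execute', cmd) or ('confirm', cmd)."""
    best = max(belief, key=belief.get)
    # Expected cost of executing `best` when the operator actually meant something else.
    expected_wrong_cost = sum(p * task_cost[best]
                              for cmd, p in belief.items() if cmd != best)
    if expected_wrong_cost > risk_threshold:
        return ('confirm', best)      # ask the operator before acting
    return ('execute', best)

# Illustrative use: an ambiguous reading of a high-cost command.
belief = {'SURFACE_NOW': 0.6, 'TAKE_PICTURE': 0.4}
cost = {'SURFACE_NOW': 1.0, 'TAKE_PICTURE': 0.1}
print(decide(belief, cost))           # -> ('confirm', 'SURFACE_NOW')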


2.3. Conclusion

This chapter introduced our vision-based framework for human-robot interaction, and looked at previous research in the different domains this framework encompasses. We have defined the role of interaction algorithms, and have presented the classification of these algorithms based on their invocation frequencies and computational costs. The roles of implicit and explicit interactions were also discussed. Note, however, that all three categories of interaction algorithms under the three-layer model can map into either the implicit or the explicit interaction class. In our framework, the invocation rate of the algorithms is assessed during a human-robot interaction scenario, and we do not take into account processing or computations performed a priori during this assessment. An example case would be the Spatio-Temporal Tracker we introduced in Sec. 2.2.1.2. The ensemble tracker requires off-line training before it is deployed on-board the robot, and this off-line training time is not considered for its classification as a high-frequency algorithm. On the other hand, if the algorithm were learning during tracking by using real-time Boosting, it would be classified as both a high- and a medium-frequency algorithm (or even a low-frequency one, depending on the time-scale granularity of learning), with the tracker working at a higher invocation rate than the “real-time” learning. The purpose of our framework is to facilitate close interaction between a robot and a human operator; that is, we envision scenarios where there will be a dialog-based interaction targeted towards the execution of a well-defined task. One can imagine a lesser level of involvement between a human operator and a robot, where the operator is involved with a higher-level design of a mission that the robot has to carry out. In this “macro level” task description, the mission is described as a whole based on major “waypoints”, and the details of the tasks are described for each such waypoint. Our work on Graphical State-Space Programming (GSSP) looks into

39 CHAPTER 2. A FRAMEWORK FOR ROBUST VISUAL HRI such a level of interaction. While not a core component of the research presented in this thesis, the GSSP paradigm provides a higher-level, graphical method to robot mission programming, without being concerned with the more traditional textual programming approach. In this thesis, we shall not delve into the details of the GSSP approach or the experimental evaluations performed towards assessing its usability, which can be found in [87]. The next chapter discusses the details of RoboChat, a visual programming lan- guage for robots. We also briefly present an extension to the RoboChat language called RoboChat Gesture. Alongside our investigation into gesture-based robot pro- gramming, we looked into the creation of robust fiducials as a tool to visually com- municate with a robot. The Fourier Tag series of fiducials are a direct offshoot of this research, and we briefly present the inner workings and benefits of these tags, in the context of visual robot programming. Subsequent chapters describe the Fourier and the Spatio-Chromatic trackers, and a generative model of confirmations in human- robot dialogs.

CHAPTER 3

A Visual Language for Robot Programming

Robust and efficient interaction is essential for successful task coordination between a robot and the human operator. A human operator must be able to communicate instructions to the robot in as clear and unambiguous a manner as possible. To that end, we propose a communication and programming paradigm with the aid of artificially engineered markers for communicating with and configuring robots. The prerequisites of such a system are the ease of use, ease of deployment, and the easy portability and learning of the scheme by the operators themselves. This chapter describes an interaction paradigm for controlling a robot using hand gestures. In particular, we are interested in the control of a mobile robot by an on-site human operator in arbitrary environments. In this context, vision-based control is very attractive, and we propose a robot control and programming mechanism based on visual symbols. A human operator presents engineered visual targets to the robotic system, which recognizes and interprets them. We describe the approach behind this visual language and propose a specific gesture language called “RoboChat”. RoboChat allows an operator to control a robot and even express complex programming concepts, using a sequence of visually presented symbols, encoded into fiducial markers. In Chapter 8, we evaluate the efficiency and robustness of this symbolic communication scheme by comparing it to traditional gesture-based interaction involving a remote human operator.

3.1. Introduction

The core of the research presented in this thesis deals with the interaction between a robot and a human operator. We are interested in domains where the human and the robot work together at the same location to accomplish various tasks. In our particular domain instance, this research deals with human-robot interaction underwater, where the available modes of communication are highly constrained and physically restricted. This chapter describes an approach to control a robot by using visual gestures from a human operator (Fig. 3.1): for example, to tell the robot to follow the operator, to acquire photographs, to go to a specified location, or to execute some complex procedure. In general, the term gesture refers to free-form hand motions, but in this work we use the term gesture to refer to the manual selection of symbolic markers. In our current application, a scuba diver underwater is joined by a semi-autonomous robot acting as an assistant. This application context serves as a motivation for the general communications problem. Conventionally, divers converse with each other using hand signals as opposed to speech or writing. This is because the aquatic environment does not allow for simple and reliable acoustic and radio communication, and because the physical and cognitive burdens of writing or using other similar communication media are generally undesirable. On the other hand, visual gestures do not rely on complicated or exotic hardware, do not require strict environmental settings, and can convey a wide range of information with minimal

physical and mental effort from the user. Furthermore, by combining spatial gestures with other visual communication modes, a large and expressive vocabulary can be obtained. Thus, gestures provide a natural mechanism for the diver to use in communicating with the robot. In fact, our prior work involving human-robot interaction for underwater robots has already exploited the use of gestures for communication, although it is mediated by a human operator on the surface who interprets the gestures [86]. Alternative methods of controlling a robot in arbitrary environments could include a keyboard or some other tactile device, but such methods can be unappealing since they entail costly fortification, require physical contact between the operator and the robot, or necessitate some supplementary communications scheme between a remote device and the robot. In contrast, this vision-based communication scheme can easily be realized using any inexpensive printable medium (such as laminated paper), functions through passive sensing, and provides a direct interface between the user and the robot. While our approach was motivated by the desire to control a robot underwater, the methodology is more generally applicable to human-robot interaction scenarios. Traditional methods for human-robot interaction are based on speech, the use of a keyboard, or free-form gestures. Even in terrestrial environments, each of these interfaces has drawbacks, including interference from ambient noise (either acoustic or optical), the need for proximity and physical contact, or the potential ambiguity in the transduction and interpretation process (i.e., both speech and gesture recognition systems can be error-prone). As such, our approach to robust non-contact visual robot control may have applications in diverse environments that are far more prosaic than the deep undersea.


Figure 3.1. A diver controlling the Aqua robot using visual cues.

media such as speech or writing come from the vast amount of information that can be associated with a simple shape or motion. The difficulty with natural gestures is that they are very hard to interpret. This difficulty stems from several factors, which include the repeatability of the gestures, the need to identify the operator’s body parts in images with variable lighting and image content, and the possibility that the operators themselves are inconsistent with the gestures being used. These and other factors have made gesture interpretation a stimulating research area for over a decade (see Chapter 2), but the unresolved challenges make it problematic for robust robot control. Our approach is to use symbolic targets manipulated by an operator to effect robot control. By using carefully engineered targets, we can achieve great robustness and accuracy while retaining a large measure of convenience. In this chapter, we describe our approach towards creating this symbolic language, and in Chapter 8, we evaluate the precise extent to which it remains convenient and usable, and attempt to measure and quantify the loss of convenience relative to natural free-form gestures

between humans. The symbolic tokens expressed by the targets are used to compose “command lists”, or more accurately program fragments, in a robot control language we have developed called “RoboChat”. One challenge has been to select an appropriate level of abstraction for RoboChat that is expressive yet convenient, a domain-specific trade-off which is faced by conventional programming language designers as well. To extend the functionality and increase user convenience, we extended the RoboChat paradigm to include simple geometric hand-gestures formed by two symbolic markers. The benefit of this scheme is a reduction in the number of markers the operator needs to carry, while maintaining a comparable degree of expressiveness. Although the system is to be used on an underwater robot vehicle, we report a human interface performance evaluation conducted on dry land. This evaluation compares different operating modes employed to send commands in a simulated robotics context and, in some cases, includes a distractor task to replicate the cognitive load factors that arise underwater. We also report qualitative results from a test of the system with a fully deployed underwater robot, but due to the logistical constraints involved we were not able to run an actual multi-user performance evaluation underwater.

Figure 3.2. Comparison of C (left) and RoboChat (right) syntax.


3.2. Methodology

The RoboChat language has two major components that differ primarily in the mode of delivery. The original RoboChat mechanism defined the language, the mode of delivery and the language parser. RoboChat Gestures, additionally, defines the rules for instruction delivery using free-form gestures and also describes the gesture detection and interpretation algorithm. This section describes the original RoboChat scheme in detail. Our approach to robot control using visual signaling is to have an operator describe actions and behaviors to the robot by holding up a sequence of engineered targets. Each target represents one symbolic token, and the sequence of tokens constitutes a program to modulate the robot’s behavior. Because the user may be subject to conflicting demands (i.e., they may be cognitively loaded and/or distracted), we use only the token ordering to construct commands, while disregarding the timing of the presentation, the delay between tokens, and the motion of the targets within the field of view. The sequence of markers constitutes utterances that express “commands” to the robot. While we refer to the utterances as commands, the semantics of the individual statements in RoboChat do not have to imply a change of program state. In this chapter, we define the RoboChat language which is, in turn, used to control the robot’s operation by providing parameter input to RoboDevel, a generic robot control architecture [74]. RoboChat can be used to alter a single parameter in RoboDevel once, or to iteratively and regularly alter parameters based on ongoing sensor feedback. In order to assess the RoboChat paradigm, we describe in Chapter 8 a series of human interaction studies, as well as a qualitative validation in the field (i.e., under water). The human interaction studies compare the performance of a human operator using RoboChat with a human operator controlling the robot using

conventional hand gestures interpreted by a human assistant. Factors that may relate to the usability of such a signaling system are the extent to which the operator is distracted by other stimuli and activities, and the complexity of the vocabulary being used to control the robot. Thus, we have performed some of our usability studies in the presence of a distractor task, and have examined performance as a function of changing vocabulary size. While we considered several alternative distractor tasks, we eventually settled on the game of “Pong” [44], which must be successfully played while the robot control session is underway. A suitable distractor must be fairly accessible to all users, continually demanding of attention, yet allow the core task to still be achievable (and, of course, it needs to be safe and not too disagreeable to the subjects). Pong was judged to best fit the above set of requirements, and was thus chosen for our experiments.

3.2.1. RoboChat grammar and syntax. Constructing an appropriate language for gestural control of a robot involves several competing priorities: the language must be abstract enough to be succinct, low-level enough to allow the tweaking of detailed parameters, reasonably easy to use for users with limited or no programming background, and flexible enough to allow for the specification of unanticipated behaviors by technically-oriented users conducting experiments. Two of these priorities dominate the RoboChat design: maintaining a minimal vocabulary size and allowing commands to be specified using as few markers as possible, even though commands may have optional parameters. Because this language is designed specifically to control robots, movement and action commands are treated as atomic structures in RoboChat. Different commands may require different arguments: for example, the MOVE_FORWARD command needs DURATION and SPEED arguments among other possible ones. Arguments

are implemented as shared variables; thus, after a parameter has been set once, all subsequent commands requiring this parameter will automatically use the previous value by default. Although the structure of RoboChat is well-defined, the specific command vocabulary is task-dependent and can be expanded or substituted if necessary. However, RoboChat does have a core set of basic tokens, including numerical digits, arithmetic operators, and relational operators. Additionally, RoboChat defines a limited number of variables, including the aforementioned command parameters, as well as some general-purpose variable names. RoboChat features two control flow constructs – the if-else statement, and the indexed iterator statement. The former construct allows the user to implement decision logic, while the latter immensely cuts down on the required number of tokens for repeated commands. Arguably the most important feature of RoboChat is the ability to define and execute macros (i.e., macro-instructions). The user can encapsulate a list of expressions into a numerically tagged macro, which can then be called upon later. This feature allows the reuse of code, which is essential when trying to minimize the number of tokens needed to specify behavior. As seen from the BNF (Backus-Naur Form) grammar of RoboChat in Fig. 3.3, every construct is designed to minimize the number of tokens needed to express that construct. Reverse Polish notation (RPN) is heavily exploited to achieve this minimization – operators and operands are presented using RPN, eliminating the need for both an assignment operator and an end-of-command marker, while still allowing one-pass “compilation”. Additionally, the use of RPN in the more abstract control flow constructs eliminates the need for various delimiters common to most programming languages, such as THEN or { ... } (body code brackets).


Figure 3.3. RoboChat BNF grammar.

RoboChat interprets the tokens in real-time, but only executes the commands upon detection of the EXECUTE token. This feature allows for batch processing, and also enables error recovery via the RESET token.
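To illustrate how a postfix token stream of this kind can be parsed in a single pass, the sketch below implements a toy stack-based interpreter that accumulates tokens until an EXECUTE token arrives and discards them on RESET. The token names and the two-argument MOVE_FORWARD command are illustrative placeholders; this is not the RoboChat grammar of Fig. 3.3.

# Toy postfix interpreter in the spirit of a stack-based token language.
# Numeric tokens are pushed; command tokens pop their arguments.

COMMANDS = {'MOVE_FORWARD': 2}          # command name -> number of arguments (assumed)

def run_tokens(tokens):
    stack, program = [], []
    for tok in tokens:
        if tok == 'RESET':
            stack, program = [], []     # error recovery: discard everything
        elif tok == 'EXECUTE':
            return program              # batch execution point
        elif tok in COMMANDS:
            argc = COMMANDS[tok]
            args = [stack.pop() for _ in range(argc)][::-1]
            program.append((tok, args))
        else:
            stack.append(int(tok))      # numeric operand
    return []                           # no EXECUTE token seen yet

# Example: "10 5 MOVE_FORWARD EXECUTE" -> [('MOVE_FORWARD', [10, 5])]
print(run_tokens(['10', '5', 'MOVE_FORWARD', 'EXECUTE']))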

Figure 3.4. A human-interface trial in progress.


3.3. RoboChat Gestures

In spite of its utility, RoboChat suffers from three critical weaknesses in its user interface. First of all, because a separate fiducial marker is required for each robot instruction, the number of markers associated with robot commands may be significantly large for a sophisticated robotic system. This requirement can impede the diver’s locomotive capabilities, since he must ensure the secure transportation of this large number of marker cards underwater. Secondly, the mappings between robot instructions and symbolic markers are completely arbitrary, as the diver must first read the labels on each card to locate a particular token. Thirdly, as a consequence of the previous two deficiencies, the diver may require a significant amount of time to locate the desired markers to formulate a syntactically correct script, which may be unacceptable for controlling a robot in real-time. To address the aforementioned weaknesses in the core RoboChat interface, we construct an interaction paradigm called RoboChat Gestures, which can be used as a supplementary input scheme for RoboChat. The main premise is for the operator to formulate discrete motions using a pair of fiducial markers. By interpreting different motions as robot commands, the operator is no longer required to carry one marker per instruction. The trajectories of RoboChat Gestures are derived from different types of traditional gestures, to take advantage of existing associations and conventions in the form of embedded information. This introduces a natural relationship between trajectories and their meanings, which alleviates the cognitive strain on the operator. Additionally, the robot can process the observed gestures and extract features from the motion, such as its shape, orientation, or size. Each gesture is mapped to a command, while the extracted features are associated with various parameters for that instruction. Because much of the information is now embedded in each trajectory, RoboChat Gestures can express the same amount of information

that the previous RoboChat interface could, but in significantly less time, and using only two fiducial markers. RoboChat Gestures is motivated partly by the traditional hand signals scuba divers use to communicate with one another. As mentioned in Sec. 3.2, the original RoboChat scheme was developed as an automated input interface to preclude the need for a human interpreter or a remote video link. Usability studies of RoboChat (described in Chapter 8) suggest that naive subjects were able to formulate hand signals faster than searching through printed markers. This difference was apparent even when the markers were organized into indexed flip books to enhance rapid deployment. We believe that this discrepancy in performance was due to the intuitive relationships that existed between the hand signals and the commands they represented. These natural relationships served as useful mnemonics, which allowed the diver to quickly generate the input without actively considering each individual step in performing the gesture. The RoboChat Gestures scheme employs the same technique as hand signals to increase its performance. Each gesture comprises a sequence of motions performed using two fiducial markers, whose trajectory and shape imply a relevant action known to be associated with this gesture. Because different instructions can now be specified using the same pair of markers, the total number of visual targets required to express the RoboChat vocabulary is reduced considerably, making the system much more portable. This benefit is particularly rewarding to scuba divers, who already have to attend to many instruments attached to their dive gear. In general, the expression space for RoboChat Gestures comprises several dimensions. Different features may be used in the identification process, including the markers’ IDs, the shape of the trajectory drawn, its size, its orientation, and the time taken to trace out the gesture. In addition, the gestures provide a way to communicate out-of-band signals, for example to stop the robot in case of an emergency. To optimize the system’s

usability, numerical values for these non-deterministic features are converted from a continuous representation to a discrete one, for both signal types.
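The conversion from continuous trajectory measurements to discrete feature values can be sketched as follows: a recorded marker trajectory is summarized by a few geometric features (extent, principal orientation, duration), and each feature is quantized into a small number of bins before lookup in a gesture table. The feature set and bin boundaries here are illustrative assumptions, not the ones used in RoboChat Gestures.

import numpy as np

def trajectory_features(points, timestamps):
    """points: (n, 2) marker centers over time; timestamps: length-n array (seconds)."""
    pts = np.asarray(points, dtype=float)
    extent = np.linalg.norm(pts.max(axis=0) - pts.min(axis=0))    # rough size
    # Orientation of the principal axis of the trajectory, in degrees.
    centered = pts - pts.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    angle = np.degrees(np.arctan2(vt[0, 1], vt[0, 0])) % 180.0
    duration = timestamps[-1] - timestamps[0]
    return extent, angle, duration

def discretize(extent, angle, duration):
    """Quantize continuous features into coarse bins (placeholder boundaries)."""
    size_bin = 'small' if extent < 80 else 'large'                # pixels
    angle_bin = int(angle // 45)                                  # 4 orientation bins
    speed_bin = 'slow' if duration > 2.0 else 'fast'              # seconds
    return size_bin, angle_bin, speed_bin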

3.4. Fourier Tags

Programming tools like RoboChat (along with additions like RoboChat Gestures) provide an operator with powerful mechanisms for instructing a robot. While RoboChat is independent of the exact input modality being used, it is imperative that the modality provide a low-error, high-bandwidth (in terms of information that can be transmitted at a given time) communication channel. With respect to the visual programming model, we have opted to use fiducials as our input tool, and the ARTag fiducials in particular. The ARTag markers, and all similar fiducial systems we are aware of, encode one or more bits of data (the information payload) using a geometric pattern. If the viewing conditions deteriorate (due to distance, camera noise, fog, or other factors), the pattern eventually becomes ambiguous and no further information is extractable. Note that some systems such as ARTag use error-correcting codes which can tolerate partial occlusion, but eventually even such robust systems fail. Specifically, while using tags for mobile robot control, we have observed that circumstances often arise where a landmark is observable, but its pattern information is not clear enough to be used. The Fourier tag was developed to address precisely the problem of inadequate resolution, as well as simple detection and fast decoding. Specifically, as a Fourier tag is viewed with diminishing accuracy, the information it encodes degrades gracefully. When partly recognized, the high-order bits of the numerical encoding of the pattern’s identity are preserved, and the low-order bits decay away successively. This is because the Fourier tag encodes bits in the amplitude spectrum of its Fourier transform, with successively lower-order bits using successively higher frequencies. Since the imaging process can be approximated as a low-pass filtering transformation of the image, the


Figure 3.5. Bottom left: the frequency-domain encoding of 210; top left: the corresponding spatial signal; right: the Fourier tag of the number 210 constructed from the signal shown on the top left.

low-order bits are selectively lost as the image of the tag loses resolution with distance (due to many processes that can include perspective foreshortening and atmospheric scattering). An example of a Fourier tag can be seen in Fig. 3.5. The process of constructing a Fourier tag is as follows: the 8-bit binary representation of the number to be encoded is first represented in an amplitude spectrum, with each higher-order (i.e., more significant) bit occupying a lower frequency band. By explicitly encoding in this manner, we ensure that, under gradual degradation, the least-significant bits of the encoded value are the first ones to be corrupted. This can be seen in the bottom-left plot in Fig. 3.5 for the encoding of the number 210 (whose binary representation is 11010010). Note that the plot is symmetrical around the Nyquist frequency.
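The encode/decode idea can be sketched in a few lines: each bit of the payload is mapped to one frequency, the one-dimensional spatial signal is synthesized as a sum of cosines (equivalent to taking an inverse Fourier transform of the chosen amplitude spectrum), and the decoder thresholds the magnitude spectrum of whatever portion of the signal it can still resolve. The signal length and frequency assignment below are illustrative assumptions, and the final step of rotating the 1-D signal into the circular 2-D tag is omitted.

import numpy as np

N = 256                                    # length of the 1-D spatial signal (assumed)

def encode(value, n_bits=8):
    """Synthesize a 1-D signal whose low frequencies carry the high-order bits."""
    bits = [(value >> (n_bits - 1 - i)) & 1 for i in range(n_bits)]   # MSB first
    x = np.arange(N)
    signal = np.zeros(N)
    for i, b in enumerate(bits):
        freq = i + 1                       # MSB -> lowest frequency (1 cycle per signal)
        signal += b * np.cos(2 * np.pi * freq * x / N)
    return signal

def decode(signal, n_bits=8):
    """Recover the bits by thresholding the FFT magnitude at the assigned frequencies."""
    spectrum = np.abs(np.fft.rfft(signal)) / (len(signal) / 2.0)
    bits = [1 if spectrum[i + 1] > 0.5 else 0 for i in range(n_bits)]
    return sum(b << (n_bits - 1 - i) for i, b in enumerate(bits))

sig = encode(210)
print(decode(sig))                         # -> 210
# Low-pass filtering `sig` (as imaging at a distance does) attenuates the high
# frequencies first, so the low-order bits are lost before the high-order ones,
# which is the graceful degradation property described above.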


Figure 3.6. Fourier tags underwater on the sea bed and with a diver.

Performing an inverse Fourier transform on this amplitude signal produces a one-dimensional spatial (i.e., intensity) signal, shown in the top left of Fig. 3.5. By rotating this signal around the zero-pixel location, the Fourier tag image, such as the one shown at the right of Fig. 3.5, is produced. The most significant benefit of using the Fourier tags over other tag mechanisms is this feature of gradual degradation. Particularly in underwater domains, vision is affected by a plethora of issues, ranging from drastic changes in lighting to light absorption by suspended particles. In such environments, using tags with an abrupt information cutoff can be problematic, since for the robot to decipher commands, the tag must be correctly and fully interpreted by the on-board vision system. Often, this requires the divers to show the tags at exact distances and exact viewing angles, resulting in extremely time-consuming communication. This is also prone to operator error, and in extreme cases can cause a total breakdown in communication. With the Fourier tags, while the complete information may not be available under such difficult visual conditions, by encoding important information in the higher-order bits of the tag (such as a “tag signature”), the robot can identify the presence of a tag from partial information. As a result, the robot can reorient itself, or close

the distance so that full information is available. In certain scenarios, conveying partial information could be sufficient to make the robot carry out immediate actions, such as an emergency rise to the surface. Further details on the implementation and experimental validation of Fourier tags can be found in [75]. Figure 3.6 demonstrates the Fourier tags being used as communication markers (left) and underwater landmarks for surveillance tasks (right).

3.5. Conclusion

We have presented a visual communication and programming paradigm for mobile robots based on visual tags. This system is optimized for operating in underwater environments, but can be used in other contexts. The evaluation results, which contain a qualitative validation in the field and a controlled human interface study, are presented later in this thesis, in Chapter 8. This method of sending commands also enables us to operate the robot without a tether. The use of symbolic markers is very convenient and provides several important advantages. It is somewhat surprising that such a system has not been exploited more heavily in the past, especially given its effectiveness as reflected in our study data. The experimental results, as described in Chapter 8, demonstrate that the tag-based system can be at least as efficient as traditional gestural communication protocols, given enough training to the assistant. It also eliminates the need for a human interpreter, thereby reducing the sources of error due to communication. Possible future research includes the investigation of multiple markers to send compound commands. The use of alternative marker systems that degrade more gracefully under impaired visibility, such as the Fourier tags presented in this chapter, helps to increase the performance of the system. It may also be appropriate to exploit probabilistic reasoning, for example in the form of a Markov model, to improve the robustness of the language over sequences of several tokens (although this approach

would imply losing some expressive power). In terms of a probabilistic measure of input-language accuracy, we address the issue of uncertainty in the input by using a quantitative model that couples command uncertainty and task evaluation costs in an attempt to ensure safe command execution. This approach is described in Chapter 6. The next chapter describes an algorithm for tracking (multiple) scuba divers by an underwater robot. For a mobile robot to act as a diver's companion, it is important for the robot to possess such abilities. As it is not as direct a method of interaction as the scheme presented in this chapter, it falls into the category of implicit interaction algorithms.

CHAPTER 4

Tracking Biological Motion

This chapter describes an algorithm for underwater robots to visually detect and track human motion. Our objective is to enable human-robot interaction by allowing a robot to follow behind a human moving freely in three dimensions (i.e., in six degrees of freedom). In particular, we have developed a system to allow a robot to detect, track and follow a scuba diver by using frequency-domain detection of biological motion patterns. The motion of biological entities is characterized by combinations of periodic motions which are inherently distinctive. This is especially true of human swimmers. By using the frequency-space response of spatial signals over a number of video frames, we attempt to identify signatures pertaining to biological motion. This technique is applied to track scuba divers in underwater domains, typically with the robot swimming behind the diver. The algorithm is able to detect a range of motions, which includes motion directly away from or towards the camera. The motion of the diver relative to the vehicle is then tracked using an Unscented Kalman

Filter (UKF), an approach for non-linear estimation. The efficiency of our approach makes it attractive for real-time applications on-board our underwater vehicle, and in future applications we intend to track scuba divers in real-time with the robot. This chapter presents an algorithmic description of our approach.

(a) The Aqua underwater robot visually servoing off a target carried by a diver. (b) Typical visual scene encountered by an AUV while tracking scuba divers.

Figure 4.1. External and robot-eye-view images of typical underwater scenes during target tracking by an autonomous underwater robot.

4.1. Introduction

Motion cues have been shown to be powerful indicators of human activity and have been used to identify a person's position, behavior and identity. In this work, we exploit motion signatures to facilitate visual servoing, as part of a larger human-robot interaction framework. From the perspective of visual control of an autonomous robot, the ability to distinguish between mobile and static objects in a scene is vital for safe and successful navigation. For the vision-based tracking of human targets, motion patterns are an important signature, since they can provide a distinctive cue to disambiguate between people and other non-biological objects,

including moving objects, in the scene. We look at both of these features in this chapter. Our work exploits motion-based tracking as one input cue to facilitate human-robot interaction. An important sub-task for our robot, like many others, is for it to follow a human operator (as can be seen in Fig. 4.1(a)). We facilitate the detection and tracking of the human operator using the spatio-temporal signature of human motion (the psychological effect of which on human perception has been investigated by Rashid et al. [69]). In practice, this detection and servo-control behavior is just one of a suite of vision-based interaction mechanisms. In the context of servo-control, we need to detect a human, estimate his image coordinates (and possibly image velocity), and exploit this in a control loop. We use the periodicity inherently present in biological motion, and swimming in particular, to detect human scuba divers. Divers normally swim with a distinctive kicking gait which, like walking, is periodic, but also somewhat individuated. In many practical situations, the preferred applications of AUV technologies call for close interactions with humans. The underwater environment poses new challenges and pitfalls that invalidate assumptions required for many established algorithms in autonomous mobile robotics. In particular, the effect of distorted optics under water is a challenging problem to overcome for machine vision systems, and thus for any autonomous system that is guided by visual sensing. Refraction, absorption and scattering of light are three major contributors to optical distortion in underwater domains (illustrated in Fig. 4.2), and can be encountered in most open-water (e.g., oceanic) environments.

the information collected by following the diver to carry out the inspection. This approach also has the added advantage of not requiring a second person tele-operating the robot, which simplifies the operational loop and reduces the associated overhead of robot deployment. Our approach to tracking scuba divers in underwater video footage and real-time streaming video thus arises from the need for such semi-autonomous behaviors and visual human-robot interaction in arbitrary environments. The approach is computationally efficient for deployment on-board an autonomous underwater robot. Visual tracking is performed in the spatio-temporal domain in the image space; that is, spatial frequency variations are detected in the image space in different motion directions across successive frames. The frequencies associated with a diver's gaits (flipper motions) are identified and tracked. Coupled with a visual servoing mechanism, this feature enables an underwater vehicle to follow a diver without any external operator assistance, in environments similar to that shown in Fig. 4.1(b). The ability to track spatio-temporal intensity variations using the frequency domain is not only useful for tracking scuba divers, but can also be useful for detecting the motion of particular species of marine life [6] or surface swimmers [50]. It is also associated with terrestrial motion like walking or running, and our approach seems appropriate for certain terrestrial applications as well. It appears that most biological motion, underwater as well as on land, is associated with periodic motion, but we concentrate our attention on tracking human scuba divers and servoing off their position.

Figure 4.2. Effects on light rays in underwater domains.


Figure 4.3. Outline of the Directional Fourier motion detection and tracking process. The Gaussian-filtered temporal image is split into sub-windows, and the average intensity of each sub-window is calculated for every time-frame. For the length of the filter, a one-dimensional intensity vector is formed, which is then passed through an FFT operator. The resulting amplitude plot is shown, with the symmetric half removed.

Our robot has been developed with marine ecosystem inspection as a key application area. Recent initiatives for the protection of coral reefs call for long-term monitoring of such reefs and of the species that depend on reefs for habitat and food supply [1]. We envision our vehicle having the ability to follow scuba divers around such reefs and to assist in monitoring and mapping the distributions of different species of coral. This application area is representative of the general class of deployments to which this technique can be applied.

4.2. Methodology

To track scuba divers in the video sequences, we exploit the periodicity and motion invariance properties that characterize biological motion. To fuse the responses

of the multiple frequency detectors, we combine their output with an Unscented Kalman Filter [41]. The core of our approach is to use periodic motion as the signature of biological propulsion and, specifically for person-tracking, to detect the kicking gait of a person swimming underwater. While different divers have distinct kicking gaits, the periodicity of swimming (and walking) is universal. Our approach, thus, is to examine the amplitude spectrum of rectangular slices through the video sequence along the temporal axis. We do this by computing a windowed Fourier transform on the image to search for regions that have substantial band-pass energy at a suitable frequency. The flippers of a scuba diver normally oscillate at frequencies between 1 and 2 Hz [101]. Any region of the image that exhibits high energy responses at those frequencies is a potential location of a diver. The essence of our technique is therefore to convert a video sequence into a sampled frequency-domain representation in which we accomplish detection, and then to use these responses for tracking. To do this, we need to sample the video sequence in both the spatial and temporal domains and compute local amplitude spectra. This could be accomplished via an explicit filtering mechanism such as steerable filters, which might directly yield the required bandpass signals. Instead, we employ windowed Fourier transforms on selected space-time regions, which are, in essence, 3-dimensional blocks of data from the video sequence (a 2-dimensional region of the image extended in time). In principle, one could directly employ color information at this stage as well, but due to the need to limit computational cost and the low mutual information content between color channels (especially underwater), we perform the frequency analysis on luminance signals only. We look at the method of Fourier Tracking in Sec. 4.2.1. In Sec. 4.2.2, we describe the multi-directional version of the Fourier tracker and the motion detection algorithm in the XYT domain. The application of the Unscented Kalman Filter for position tracking is discussed in Sec. 4.2.3. Section 4.2.4 looks at two parameters

that affect the tracker, and lays out a set of experiments for quantitative assessment of tracker performance.

4.2.1. Fourier Tracking. The core concept of the tracking algorithm presented here is to take a time-varying spatial signal (from the robot's camera) and use the well-known discrete-time Fourier transform to convert the signal from the spatial to the frequency domain. Since the target of interest will typically occupy only a region of the image at any time, we naturally need to perform spatial and temporal windowing. The standard equations relating the spatial and frequency domains are as follows.

$$x[n] = \frac{1}{2\pi} \int_{2\pi} X(e^{j\omega})\, e^{j\omega n}\, d\omega \qquad (4.1)$$

$$X(e^{j\omega}) = \sum_{n=-\infty}^{+\infty} x[n]\, e^{-j\omega n} \qquad (4.2)$$

where $x[n]$ is a discrete aperiodic function, and $X(e^{j\omega})$ is periodic in $\omega$ with period $2\pi$. Equation 4.1 is referred to as the synthesis equation, and Eq. 4.2 is the analysis equation, where $X(e^{j\omega})$ is often called the spectrum of $x[n]$ [60]. The coefficients of the converted signal correspond to the amplitude and phase of complex exponentials of harmonically-related frequencies present in the spatial domain. For our application, we do not consider phase information, but look only at the absolute amplitudes of the coefficients at the above-mentioned frequencies. The phase information might be useful in determining the relative positions of the undulating flippers, for example. It might also be used to provide a discriminator between specific individuals. Moreover, by not differentiating between the individual flippers during tracking, we achieve a speed-up in the detection of high-energy responses, at the expense of sacrificing relative phase information.


Spatial sampling is accomplished using a Gaussian windowing function at regular intervals and in multiple directions over the image sequence. The Gaussian is appropriate since it is well known to simultaneously optimize localization in both space and frequency space [22]. It is also a separable filter, making it computationally efficient. Note, as an aside, that a box filter for sampling can be simple and efficient, but box filters produce undesirable ringing in the frequency domain, which can lead to unstable tracking. The Gaussian filter has good frequency-domain properties and can be computed recursively, making it exceedingly efficient.
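As an illustration of this detection step, the following sketch (with an assumed frame rate, window size and band limits, not the parameters used on the robot) extracts the Gaussian-weighted mean intensity of one sub-window over a run of frames and measures how much of the non-DC spectral energy falls in the 1–2 Hz flipper band.

```python
import numpy as np

def gait_band_energy(frames, cx, cy, box=24, fps=15.0, band=(1.0, 2.0)):
    """Mean Gaussian-weighted intensity of a sub-window over time, followed by
    the fraction of (non-DC) spectral energy inside the flipper band.

    `frames` is a T x H x W array of grayscale images; the box size, fps and
    the 1-2 Hz band are illustrative assumptions."""
    half = box // 2
    ys, xs = np.mgrid[-half:half, -half:half]
    gauss = np.exp(-(xs ** 2 + ys ** 2) / (2.0 * (box / 4.0) ** 2))
    gauss /= gauss.sum()

    patch = frames[:, cy - half:cy + half, cx - half:cx + half]
    trace = (patch * gauss).sum(axis=(1, 2))          # 1-D temporal intensity signal

    amps = np.abs(np.fft.rfft(trace - trace.mean()))  # drop the DC component
    freqs = np.fft.rfftfreq(len(trace), d=1.0 / fps)
    in_band = (freqs >= band[0]) & (freqs <= band[1])
    return amps[in_band].sum() / (amps.sum() + 1e-9)  # near 1 => diver-like motion

# A window is flagged as a candidate diver location when this fraction
# exceeds a tuned threshold.
```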

(a) Motion directions covered by the various directional Fourier operators, depicted in a 2D spatial arrangement. (b) Image slices along the time axis showing 5 out of 17 possible track directions while tracking a scuba diver.

Figure 4.4. Directions of motion for Fourier tracking, also depicted in 3D in a diver-swimming sequence.

4.2.2. Multi-directional Motion Detection. To detect motion in multiple directions, we use a predefined set of vectors, each of which is composed of a set of small rectangular sub-windows in spatio-temporal space. The trajectory of each of these sub-windows is governed by a corresponding starting and ending point in the image. At any given time T, each rectangular window resides at a particular position along its trajectory and represents a Gaussian-weighted gray-scale intensity

function of that particular region in the image. Over the entire trajectory, these windows generate a vector of intensity values along a certain direction in the image, producing a purely temporal signal for amplitude computation. We weight these velocity vectors with an exponential filter, such that intensities from a more recent location of the sub-window receive a higher weight than those from the same location further in the past. This weighting helps to maintain the causal nature of the frequency filter applied to this velocity vector. In the current work, we extract 17 such velocity vectors (as seen in Fig. 4.4) and apply the Fourier transform to them (17 is the optimum number of vectors we can process in quasi-real time on our robot hardware). The space formed by the velocity vectors is a conic in the XYT space, as depicted in Fig. 4.5. Each such signal provides an amplitude spectrum that can be matched to a profile of a typical human gait. A statistical classifier trained on a large collection of human gait signals would be ideal for matching these amplitude spectra to human gaits (as exemplified by [8]). However, these human-associated signals appear to be easy to identify, and as such, an automated classifier is not currently used. Currently, we use two different approaches to select candidate spectra. In the first, we choose the particular direction that exhibits significantly higher energy amplitudes in the low-frequency bands when compared to the higher-frequency bands. In the second approach, we pre-compute by hand an amplitude spectrum from video footage of a swimming diver, and use this amplitude spectrum as a true reference. To find possible matches, we use the Bhattacharyya measure [9] to find similar amplitude spectra, and choose those as possible candidates.
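The spectrum-matching step of the second approach can be sketched as follows; the Bhattacharyya coefficient is computed over normalized amplitude spectra, and the 0.8 acceptance cutoff and function names are illustrative assumptions rather than the tuned values used in the experiments.

```python
import numpy as np

def bhattacharyya_coefficient(p, q):
    """Bhattacharyya coefficient between two normalized amplitude spectra
    (treated as discrete distributions); 1.0 means identical."""
    p = np.asarray(p, dtype=float); p = p / (p.sum() + 1e-12)
    q = np.asarray(q, dtype=float); q = q / (q.sum() + 1e-12)
    return np.sum(np.sqrt(p * q))

def select_candidate_directions(spectra, reference, min_coeff=0.8):
    """Return indices of directional spectra that resemble the hand-computed
    reference gait spectrum; the cutoff is an illustrative assumption."""
    scores = [bhattacharyya_coefficient(s, reference) for s in spectra]
    return [i for i, c in enumerate(scores) if c >= min_coeff], scores
```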

4.2.3. Position Tracking Using an Unscented Kalman Filter. Each of the directional Fourier motion operators outputs an amplitude spectrum of the different frequencies present in its associated direction. As described in Sec. 4.2.2, we look at the amplitudes of the low-frequency components of these directional operators; the ones that exhibit high responses are chosen as possible positions of the diver, and thus the position of the diver can be tracked across successive frames.


Figure 4.5. Conic space covered by the directional Fourier operators.

To further enhance the tracking performance, we run the output of the motion detection operators through an Unscented Kalman Filter (UKF), also referred to as the Sigma-Point Kalman Filter, or SPKF (although we use the term UKF throughout this dissertation). The UKF is a highly effective filter for state estimation problems, and is suitable for systems with a non-linear process model [41]. Compared to the Extended Kalman Filter (EKF), the UKF captures the non-linearity in the process and observation models up to the second order of the Taylor series expansion, whereas the EKF only captures the first-order expansion term. The track trajectory and the motion perturbation are highly non-linear, owing to the undulating propulsion resulting from flipper motion and to underwater currents and surges. We chose the UKF as an appropriate filtering mechanism because of this inherent non-linearity, and also because of its computational efficiency.


According to the UKF model, an $N$-dimensional random variable $x$ with mean $\hat{x}$ and covariance $P_{xx}$ is approximated by $2N+1$ points known as the sigma points. The sigma points at iteration $k-1$, denoted by $\chi^i_{k-1|k-1}$, are derived using the following set of equations:

$$\chi^0_{k-1|k-1} = \hat{x}^a_{k-1|k-1}$$

$$\chi^i_{k-1|k-1} = \hat{x}^a_{k-1|k-1} + \left(\sqrt{(N+\lambda)\,P^a_{k-1|k-1}}\right)_i \qquad i = 1, \ldots, N$$

$$\chi^i_{k-1|k-1} = \hat{x}^a_{k-1|k-1} - \left(\sqrt{(N+\lambda)\,P^a_{k-1|k-1}}\right)_{i-N} \qquad i = N+1, \ldots, 2N$$

where $\left(\sqrt{(N+\lambda)\,P^a_{k-1|k-1}}\right)_i$ is the $i$-th column of the matrix square root of $(N+\lambda)\,P^a_{k-1|k-1}$, and $\lambda$ is a predefined constant that dictates the spread of the sigma points. For the diver's location, the estimated position $x$ is a two-dimensional random variable, and thus the filter requires 5 sigma points. The sigma points are generated around the mean position estimate by projecting the mean along the X and Y axes, and are propagated through a non-linear motion model (i.e., the transition model) $f$; the estimated mean (i.e., the diver's estimated location), $\hat{x}$, is then calculated as a weighted average of the transformed points:

$$\chi^i_{k|k-1} = f(\chi^i_{k-1|k-1}) \qquad i = 0, \ldots, 2N \qquad (4.3)$$

$$\hat{x}_{k|k-1} = \sum_{i=0}^{2N} W^i\, \chi^i_{k|k-1} \qquad (4.4)$$


where the $W^i$ are the constant weights for the state (position) estimator. As the initial estimate of the diver's location for the UKF, we choose the center point of the vector producing the highest low-frequency amplitude response. Ideally, the non-linear motion model for a scuba diver would be learned from video training data, but for this application we use a hand-crafted model created by manually observing such footage. The non-linear motion model we employ predicts forward motion of the diver with a higher probability than vertical motion, which in turn is favored over lateral motion. For our application, a small number of iterations of the UKF (approximately between 5 and 7) is sufficient to ensure convergence.
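A minimal sketch of sigma-point generation and the prediction step (Eqs. 4.3–4.4) is given below; the weights follow the standard unscented transform, the covariance update is omitted for brevity, and the forward-biased motion model is only an illustrative stand-in for the hand-crafted model described above.

```python
import numpy as np

def sigma_points(x, P, lam=1.0):
    """Generate 2N+1 sigma points around mean x with covariance P."""
    n = len(x)
    S = np.linalg.cholesky((n + lam) * P)        # matrix square root
    pts = [x]
    pts += [x + S[:, i] for i in range(n)]
    pts += [x - S[:, i] for i in range(n)]
    return np.array(pts)                          # shape (2N+1, N)

def ukf_predict(x, P, f, lam=1.0):
    """Propagate sigma points through the motion model f and return the
    predicted mean (Eqs. 4.3-4.4); covariance update omitted."""
    n = len(x)
    chi = sigma_points(x, P, lam)
    chi_pred = np.array([f(c) for c in chi])
    w = np.full(2 * n + 1, 1.0 / (2 * (n + lam)))
    w[0] = lam / (n + lam)
    return w @ chi_pred, chi_pred

# Illustrative hand-crafted model: the diver drifts mostly "forward" in x.
def forward_biased_model(pos, dx=3.0, dy=0.5):
    return pos + np.array([dx, dy])

x0 = np.array([160.0, 120.0])                     # initial image-space estimate
P0 = np.diag([25.0, 25.0])
x_pred, _ = ukf_predict(x0, P0, forward_biased_model)
```

For a two-dimensional position state this generates exactly the 5 sigma points mentioned above.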

Figure 4.6. A schematic showing the RunLength and BoxSize parameters for the Fourier tracker.


4.2.4. Parameter Tuning. From the brief outline above, two distinct parameters are seen to affect the performance of the Fourier tracker – namely the size of the rectangular sub-windows, and the number of frames to look at in the time direction (i.e., the time duration in number of frames) for calculating the amplitude spectrum. The performance of the tracker, in terms of accuracy and speed, is directly dependent on these parameters. In this work, the component of the Fourier tracker which locates a diver in the image frame prior to tracking by the UKF is referred to as the Fourier Detector. To detect the intensity variations caused by the oscillation of the diver's flippers, we average the intensities at each frame (i.e., in each time-step) in these sub-windows. We refer to the size of the sub-windows as the BoxSize parameter, denoted by the symbol κ. The duration over which these intensity variations are collected is the second parameter the tracker depends on. We call this the RunLength parameter, and denote it by the symbol λ. Figure 4.6 depicts these two parameters of the Fourier tracker over an image sequence. To assess the effect of these parameters on tracker performance, we perform a number of experiments, which are described in detail, along with our findings, in Chapter 9.

4.3. Conclusions

In this chapter, we present a technique for robust detection and tracking of biological motion underwater, specifically for tracking human scuba divers. We consider the ability to visually detect biological motion an important feature for any mobile robot, especially one that must interact with a human operator in underwater environments. At the larger scale of visual human-robot interaction, such a feature forms an essential component of the communication paradigm, through which an autonomous vehicle can effectively recognize and accompany its human controller. The algorithm presented here is conceptually simple and easy to implement. Significantly, this algorithm

is optimized for real-time use on-board an underwater robot. While we apply a heuristic for modeling the motion of the scuba diver to feed into the UKF for position tracking, we strongly believe that, with the proper training data, a more descriptive and accurate model can be learned. Incorporating such a model promises to increase the performance of the motion tracker. While color information can be valuable as a tracking cue, we do not look at color in conjunction with this method. Hues are affected by the optics of the underwater medium, which changes object appearances drastically. Lighting variations, suspended particles and artifacts like silt and plankton scatter, absorb or refract light underwater, which directly affects the performance of otherwise-robust tracking algorithms [76]. To reduce these effects and still have useful color information for robustly tracking objects underwater, we have developed a machine learning approach based on the classic Boosting technique. In that work, we train our visual tracker with a bank of spatio-chromatic filters [79] that aim to capture the distribution of color on the target object, along with color variations caused by the above-mentioned phenomena. Using these filters and training for a particular diver's flipper, robust color information can be incorporated in the Fourier tracking mechanism, and be directly used as an input to the UKF. While this will increase the computational cost somewhat, and also introduce color dependency, we believe investigating the applicability of this machine learning approach in our Fourier tracker framework is a promising avenue for future research. The next chapter discusses the Spatio-Chromatic tracker in detail.

CHAPTER 5

Machine Learning for Robust Tracking

In this chapter, we present an application of machine learning to the semi-automatic synthesis of robust servo-trackers for underwater robotics. In particular, we investigate an approach based on the use of Boosting for robust visual tracking of colored objects in an underwater environment. To this end, we use AdaBoost [35], the most common variant of the Boosting algorithm, to select a number of low-complexity but moderately accurate color feature trackers, and we combine their outputs. The novelty of our approach lies in the design of this family of weak trackers, which enhances a straightforward color segmentation tracker in multiple ways. From a large and diverse family of possible color segmentation trackers (hereafter referred to interchangeably as ‘filters’), we select a small subset that optimizes the performance of our trackers. The tracking process applies these trackers to the input video frames, and the final tracker output is chosen based on the weights of the final array of trackers. By using computationally inexpensive, but somewhat accurate, trackers as members of the ensemble, the system is able to run at quasi real-time (i.e., approximately 65% of the camera frame-rate), and is thus deployable on-board our underwater robot. We present quantitative cross-validation results of our Spatio-Chromatic visual tracker in Chapter 9, and point out challenges faced in the experiments we performed. This chapter concludes with an overall discussion of our approach and suggests directions of future research in the area of real-time ensemble tracking.

5.1. Introduction

As part of our visual HRI framework, we investigate the application of machine learning algorithms, ensemble learning in particular, to visual tracking and vision-based robot control. This chapter looks at using machine learning to create a robust tracking system for an autonomous robot. This research is motivated by the unique challenges posed by the underwater environment (as discussed briefly in Chapter 4) to machine vision systems, and by the effect of learning algorithms on tracking accuracy and performance. The difficulties faced by machine vision in underwater domains can be regarded as a particular instance of a rather general case of challenging visual conditions, and we demonstrate the effectiveness of machine learning in alleviating such categories of problems. To this end, we begin with a broad range of filters tuned to chromatic and spatial variations. From this collection, we select a small subset that shows promise, and ‘boost’ their outputs using the AdaBoost [35] algorithm. The training process we use obtains weights for the AdaBoost ‘weak learners’, which in our case are a set of computationally-inexpensive visual trackers. The final tracking decision is a weighted average of the outputs of the individual trackers chosen by AdaBoost to be part of the ensemble. Since the filters are tuned to hue variations over space, we call this a Spatio-Chromatic tracker.


The goal is to have a robust visual tracker that is able to overcome the challenges of reduced and degraded visibility by automatically choosing a subset of the visual trackers at its disposal that performs well under the prevailing conditions. Vision has the advantage of being a passive sensing medium, and is thus both non-intrusive and energy efficient. These are both important considerations in applications ranging from environmental assays to security surveillance. Alternative sensing media such as sonar also suffer from several deficiencies which make them difficult to use for tracking moving targets at close range in potentially turbulent water. In addition, challenges arising from variable lighting, water salinity, suspended particles, color degradation and other similar issues often invalidate vision algorithms that are known to work well in terrestrial environments. Since the underwater environment is hazardous for humans, more so for extended periods of time, an autonomous robotic vehicle has certain advantages for many underwater applications. Examples of such applications have been mentioned earlier in this thesis, in the context of the application areas of the Aqua vehicle. The underwater domain poses complicated challenges in designing efficient vision systems, as quantitative analysis is difficult to perform in sub-sea environments. With the goal of circumventing these issues, we make use of video footage and ensemble learning, with minimal user input, to arrive at an efficient tracking algorithm that works robustly for underwater targets.

5.2. Methodology

The following sections describe the core of our approach to robust visual tracking using AdaBoost learning. Section 5.2.1 briefly describes the color-threshold tracking algorithm that our weak learners are based on. Technical details of the learning process are described in Sec. 5.2.2, and tracker selection is explained in Sec. 5.2.3.


(a) A red target. (b) Segmented output showing the output blob.

Figure 5.1. A color blob tracker tracking a red-colored object.

5.2.1. Visual Tracking by Color Thresholding. Let us revisit the concept of color threshold-based tracking, to serve as a prelude to the tracking algorithm presented in this chapter. The goal of visual tracking in general is to localize a target of interest in a sequence of images or video frames. Successful visual tracking is an essential prerequisite of a number of algorithms for robot control and manipulation, such as visual servoing. Our goal, in this particular HRI framework, is to achieve robust visual tracking for a mobile robot tracking a variety of targets (simultaneously or one at a time), under changeable visual conditions. Note that the Fourier Tracker, as described in Chapter 4, is designed to track scuba divers or biological entities exhibiting periodic motion – it would not perform well in tracking non-biological entities such as other AUVs or ROVs. To compensate for the changing conditions (e.g., lighting), we combine a color-thresholding based tracking technique with machine learning, the Boosting algorithm in particular. In a thresholding-based tracking approach, a segmentation algorithm outputs (possibly disconnected) regions in a binary image that match the color properties of the tracked object. These regions are commonly referred to as ‘blobs’, and hence the approach is

often known as color-blob tracking. We attempt to form these blobs through a thresholding process. By thresholding, we refer to the operation where pixels are labeled as part of a target blob if and only if their color values fall within a certain range. The tracker is initialized with the target's color properties; for example, in the RGB color space, the tracker is initialized with the red, green and blue color values of the tracked object, within some degree of tolerance. Next, each pixel in the input image is scanned in a sequential manner. The pixels falling within the threshold of the color values of the target are switched on in the output image, and the other pixels are turned off. Figure 5.1 shows the segmentation output of tracking a red object. The target is framed by a yellow rectangle for clarity. The tracker was tuned beforehand to the red rectangular target in Fig. 5.1(a). The segmentation process produced the image in Fig. 5.1(b). The tracking algorithm detects this blob in the binary image in every frame, and calculates its centroid. This centroid is taken as the new target location. This process iterates to achieve target localization in the image space. One rather obvious downside to using a naive color blob tracker as explained above is the presence of duplicate targets. For example, in Fig. 5.1(a) above, if any other red-colored object appears in the scene, the segmentation process will generate another blob for that object. This second blob will affect the calculation of the center of mass for the tracker; the effect will be more prominent if the two generated blobs are disconnected, i.e., far apart in the image frame. Therefore, the tracker works accurately only when there are no similarly-colored objects in the camera's field-of-view. Coupling the simple blob tracker with a probabilistic model of the known position of the target in the previous frames can eliminate this problem to a large degree. A mean-shift tracker has proven useful [106] in achieving exactly this goal. More profound problems with this type of tracker are the need for a simple target model in the color space, and the lack of shape or spatial information

of color arrangements. In our design of weak trackers, we explicitly address these two issues.
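For reference, a minimal sketch of the naive thresholding tracker described above is shown below; the RGB reference color and tolerance are illustrative initialization values, not those used in our experiments.

```python
import numpy as np

def track_color_blob(frame_rgb, target_rgb, tol=40):
    """Naive color-blob tracker: threshold in RGB, return the blob centroid.

    `frame_rgb` is an H x W x 3 array; `target_rgb` and `tol` are the color
    reference and tolerance the tracker is initialized with."""
    diff = np.abs(frame_rgb.astype(int) - np.asarray(target_rgb, dtype=int))
    mask = np.all(diff <= tol, axis=2)            # binary segmentation output
    ys, xs = np.nonzero(mask)
    if len(xs) == 0:
        return None, mask                         # target lost in this frame
    return (xs.mean(), ys.mean()), mask           # centroid = new target location

# Note: a second similarly-colored object would pull this centroid off target,
# which is exactly the failure mode the boosted ensemble is meant to address.
```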

5.2.2. Ensemble Tracking. In this approach to ensemble tracking (the reader is referred to the background discussed in Chapter 2 for a summary of ensemble learning methods), we apply the Boosting algorithm to visual tracking and address tracking as a binary classification problem. An ensemble of classifiers (from here on interchangeably referred to as trackers or filters) is trained to distinguish between the background and the target object. These weak trackers are then combined and their outputs are strengthened using AdaBoost, as depicted in Figure 5.2.

Figure 5.2. Outline of the ensemble tracker using AdaBoost.

For visual tracking in color, we choose the normalized-RGB color space, as it has been shown to be more robust against lighting variations [92], and such variations are predominant in underwater vision systems. Our weak trackers are a collection of individual channel trackers, working on the red, green and blue color channels individually, or as a combination of two or more of these three channels.


Algorithm 1: The AdaBoost algorithm

1: Initialize the data weights $\{\omega_n\}$ by setting $\omega_n^{(1)} = \frac{1}{N}$.
2: for $m = 1, \ldots, M$ do
3:    Fit a classifier $y_m(x)$ to the training data by minimizing the weighted error function
      $$J_m = \sum_{n=1}^{N} \omega_n^{(m)}\, I(y_m(x_n) \neq t_n) \qquad (5.1)$$
      where $I(y_m(x_n) \neq t_n)$ is the indicator function, equal to 1 when $y_m(x_n) \neq t_n$ and 0 otherwise.
4:    Evaluate the quantities
      $$\epsilon_m = \frac{\sum_{n=1}^{N} \omega_n^{(m)}\, I(y_m(x_n) \neq t_n)}{\sum_{n=1}^{N} \omega_n^{(m)}} \qquad (5.2)$$
      and then use these to evaluate
      $$\alpha_m = \ln\left(\frac{1 - \epsilon_m}{\epsilon_m}\right) \qquad (5.3)$$
5:    Update the data weights
      $$\omega_n^{(m+1)} = \omega_n^{(m)}\, e^{\alpha_m I(y_m(x_n) \neq t_n)} \qquad (5.4)$$
6: end for
7: Make predictions using the final combined hypothesis, given by
   $$Y_M(x) = \mathrm{sign}\left(\sum_{m=1}^{M} \alpha_m\, y_m(x)\right) \qquad (5.5)$$

The training process randomly picks a tracker from a bank of trackers and sets the associated thresholds for that tracker in the normalized-RGB space, attempting to fit the training set to the tracker. If, for any instance in the training set, the tracker outputs the target location within a small margin of error, we label that output as correct for the instance; otherwise, the label is set as incorrect. Any tracker with more than a 50 per cent error rate is discarded by the Boosting process. At each iteration, any training instance that has been misclassified by the current tracker is given a higher

weight according to the AdaBoost algorithm (see Algorithm 1). This ensures that subsequent trackers focus more on the instances that were wrongly classified by the current tracker. The cycle iterates until the improvement in the training error falls below a certain threshold, or terminates after a finite upper bound on the number of iterations is reached. At the end of the training process, an array of weak trackers and their associated weights is available for use in the tracking process. The final chosen trackers are usually a mixture of different tracker types, both in terms of threshold and design. Assuming each weak tracker in the ensemble is represented as $t_i$, and the AdaBoost-calculated weights for these trackers are $w_i$, the final tracker output becomes:
$$T_{final} = \sum_i w_i \times t_i$$
We discuss the trackers used in the training stage in more detail in Sec. 5.2.3. Currently, we perform training as an off-line process, using video or discrete image sequences of target objects, as well as synthetic data, to train our trackers. Weights for the weak trackers are generated by AdaBoost and passed on to the on-line stage, where they are multiplied by the individual tracker outputs and eventually summed to create the final tracking decision.
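The following sketch adapts the training loop of Algorithm 1 to weak trackers: a tracker is counted as correct on a frame when its estimate falls within a small pixel margin of the annotated target location, and the final output normalizes the weighted sum $\sum_i w_i t_i$ into a weighted average of positions. The margin, round count and callable interface are assumptions for illustration, not the exact implementation.

```python
import numpy as np

def boost_trackers(candidate_trackers, frames, truth_xy, rounds=10, margin=15.0):
    """AdaBoost-style selection of weak trackers (a sketch of Algorithm 1).

    Each tracker maps a frame to an (x, y) estimate or None; an estimate is
    'correct' if it lies within `margin` pixels of the annotated location."""
    n = len(frames)
    w = np.full(n, 1.0 / n)                        # data weights
    ensemble = []                                   # list of (alpha_m, tracker)
    for _ in range(rounds):
        best = None
        for trk in candidate_trackers:
            miss = np.array([
                1.0 if (est := trk(f)) is None or
                np.hypot(est[0] - t[0], est[1] - t[1]) > margin else 0.0
                for f, t in zip(frames, truth_xy)])
            eps = (w * miss).sum() / w.sum()        # weighted error, cf. Eq. (5.2)
            if eps < 0.5 and (best is None or eps < best[0]):
                best = (eps, trk, miss)
        if best is None:
            break                                   # no tracker beats chance
        eps, trk, miss = best
        alpha = np.log((1.0 - eps) / max(eps, 1e-9))    # cf. Eq. (5.3)
        w = w * np.exp(alpha * miss)                # cf. Eq. (5.4)
        ensemble.append((alpha, trk))
    return ensemble

def ensemble_estimate(ensemble, frame):
    """Weighted average of the weak trackers' position estimates."""
    votes = [(a, trk(frame)) for a, trk in ensemble]
    votes = [(a, p) for a, p in votes if p is not None]
    total = sum(a for a, _ in votes)
    if not votes or total == 0:
        return None
    x = sum(a * p[0] for a, p in votes) / total
    y = sum(a * p[1] for a, p in votes) / total
    return (x, y)
```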

5.2.3. Choice of trackers. For weak trackers in our approach, RGB channel trackers are chosen to track the target objects and output the locations individually, or in combination with each other. We use trackers for each of the red, blue and green channels, and combine them to create four different classes of trackers (i.e., weak learners). The goal is to create a maximally simple set of trackers to capture a broad range of color values and their spatial distributions. These four classes of trackers are enumerated below:


• Type 1 trackers consist of a single threshold on either the R, G or B channel, and segment the image based on this threshold and color channel.
• Type 2 trackers are a combination of two channels and their associated two thresholds, and are useful for detecting borders of target objects in both horizontal (Type 2.a) and vertical (Type 2.b) directions. These trackers are able to detect both the target and the background and, in combination with the Type 1 trackers, are able to localize the target outline as well as the centroid.
• Type 3 trackers create a one-dimensional mask similar to the Type 2 trackers, but with three channels. These trackers are designed on the concept of the “on-center-off-surround” and “off-center-on-surround” cells in the human visual cortex. These cells have been found to contribute to the edge detection process in the visual cortex [38]. In cases where the target object may be in a cluttered background, or partially occluded, these types of trackers are better able to distinguish the target from the surrounding background. This is very similar to the filters used, for example, to detect the eyes and the nose in face detection, as demonstrated in the work by Viola and Jones [99].
• Type 4 trackers are similar to the Type 3 trackers in their design goal, but incorporate four channels at the same time. These filters are applied at an image location and at the regions immediately to the right, below, and diagonally right and below.

Figure 5.3 shows some of these different types of weak trackers. In practice, the individual blocks in the trackers calculate the average color intensity in a square neighborhood in the image, instead of just using the value of one pixel. Various combinations of these trackers make up the bank of weak classifiers we want to boost.


Figure 5.3. Examples of the different tracker types used during Boosting.

We perform tracking on colored objects in different video sequences, using a large number of image frames. By using a combination of tracker types and threshold values (which are real values between 0 and 1), a large number of weak trackers can be generated, creating a large space of learners for the Boosting process to choose from. For example, for a Type 4 tracker with a size of $n \times n$ pixels, if $C_{TR}$ is the color value chosen from the normalized-RGB space, the possible number of all Type 4 trackers, $N_{Trackers}$, would be:

$$N_{Trackers} = \left(\frac{C_{TR}}{n^2}\right)^2 \times 2^n \qquad (5.6)$$

Considering that the values of $C_{TR}$ lie in the real-valued range $[0, 1]$, the actual number of such possible trackers becomes extremely large. Our family of trackers addresses some of the previously-mentioned shortcomings faced by a regular segmentation tracker. By design, these trackers look at color

arrangements in the image space, and thus also eliminate the need for a simple color model for the target objects. Trackers that perform no better than random chance are rejected by the first stage of the learning process, but the enormous number of available trackers significantly increases the possibility that a considerable fraction of these trackers still pass the weak learning criterion.

5.3. Conclusion

We have presented a visual tracking algorithm based on AdaBoost for robust visual servoing of an underwater robot. In Chapter 9, we show evaluations of this tracker that demonstrate improved tracking performance relative to manually-tuned trackers. The computational complexity of this Spatio-Chromatic tracker puts the implementation within the bounds of real-time performance, making it applicable for visual servo-control not only in the underwater domain, but in terrestrial environments as well. More importantly, the boosted tracker can be constructed automatically, whereas the purely segmentation-based tracker needs to be manually tuned, often requiring frequent adjustments. As further enhancements to the current work, we are investigating different trackers to integrate into the collection of weak trackers for the Boosting process. Specific tracker types may perform better in particular environments, and we are looking at ways to automate the process of choosing the right tracker for the task. We are also investigating on-line Boosting for adaptively retraining the tracker to dynamically improve robustness. In the next chapter, we switch our attention to the process of safe task execution in the presence of communication uncertainty. Serving as a complementary algorithm to mechanisms such as the RoboChat scheme described in Chapter 3, this technique provides a quantitative model of confirmations to ensure expensive tasks are not

executed blindly by a mobile robot. The following chapter describes this approach, which straddles the boundary between explicit and implicit human-robot interaction.

CHAPTER 6

Confirmations in Human-Robot Dialog

In any robotic task, particularly those involving humans, safe task execution is a prime necessity. Towards that goal, we present a technique for robust human-robot interaction that takes into consideration uncertainty in the input and task execution costs incurred by the robot. Specifically, this research aims to quantitatively model feedback, in the form of confirmations, as required by a robot while communicating with a human operator to perform a particular task. Our goal is to model human-robot interaction from the perspective of risk minimization, taking into account errors in communication, the “risk” involved in performing the required task, and task execution costs. Given an input modality with non-trivial uncertainty, we calculate the cost associated with performing the task specified by the user, and if deemed necessary, ask the user for confirmation. The estimated task cost and the uncertainty measure are given as input to a Decision Function, the output of which is then used to decide whether to execute the task or request clarification from the user. In cases where the cost or uncertainty (or both) is estimated to be exceedingly high by the system, task execution is deferred until a significant reduction in the output of the Decision Function is achieved. Chapter 8 presents evaluation results, where we test our system through human-interface experiments based on this visual HRI framework deployed on-board the Aqua family of amphibious robots, and demonstrate the utility of the framework in the presence of large task costs and uncertainties. We also present qualitative results of our algorithm from field trials of our robots in both open- and closed-water environments.

6.1. Introduction

When a human gives instructions to a robot using a “natural” interface, communication errors are often present. For some activities the implications of such errors are trivial, while for others there may be potentially severe consequences. In this work, we consider how such errors can be handled explicitly in the context of risk minimization. While fully autonomous behaviors remain the ultimate goal of robotics research, there will always be a prevailing need for robots to act as assistants to humans. As such, we focus, in the interim, on a control regime between full tele-operation and complete autonomy, where a semi-autonomous robot acts as an assistant to a human operator and a robust interaction mechanism exists between the two. In particular, whereas many robotic systems operate using only imperative commands, we wish to enable the system to engage in a dialog with the user. This research is a natural extension of previous work on visual languages for robot control and programming (as presented in Chapter 3), which has been successfully used to operate the Aqua family of underwater robots. In that work, divers communicate with the robot visually using a set of fiducial markers, by forming discrete geometric gestures. While this fiducial-based visual control language, RoboChat, has

proven to be robust and accurate, we do not have any quantitative measure of uncertainty or cost assessment related to the tasks at hand. The algorithm we propose here is designed to be an adjunct to a language such as RoboChat and to provide a measure of uncertainty in the utterances. Moreover, by providing additional robustness (e.g., through uncertainty reduction and ensuring robot safety) as a result of the dialog mechanism itself, a reduced level of performance is required from the base communication system, allowing for more flexible alternative mechanisms. From a broader perspective, the research presented in this chapter fits into more general classes of human-robot interaction frameworks, where there exists some form of communication between the robot and a human operator, not necessarily bidirectionally homogeneous (i.e., the robot and the human may employ different methods of communication to send messages to each other). In our particular HRI framework, vision is the communication medium used by the human operator to send messages to the robot. The robot uses a suite of machine vision-based algorithms to decipher complex visual commands, identify and follow the operator, learn to follow other objects of interest, and also engage in a dialog with a human operator. Different algorithms in this HRI suite are triggered as the instructions from the human operator are parsed and evaluated. As any input modality will always be corrupted by communication noise, it is imperative that, before carrying out given instructions, the robot minimizes the uncertainty present in the dialog. It is at this stage that an approach such as the one presented here finds a suitable application. Any interaction protocol will carry a certain degree of uncertainty with it. For accurate human-robot communication, that uncertainty must be incorporated and accounted for by a command-execution interface. In the presence of high uncertainty, a large degree of risk, or moderate uncertainty coupled with substantial risk, the robot should ask for confirmation. The principled basis for this decision to ask for confirmation is our concern.


Our current work has been developed specifically, but not exclusively, for applications in the domain of underwater robotics. In this context, the robot operates in concert with a human, and the primary risk factors are measured as a function of the difficulties incurred if the robotic system fails, and as a function of the total length of an experiment. The longer the diver has to stay underwater, the less desirable the situation. In addition, if the robot fails far from the diver it is much more serious than if it fails nearby. Finally, if the robot travels far away, it is intrinsically more dangerous due to reduced visibility, currents and other factors. Thus, risk is primarily described in terms of risk to the human operator from a more extensive experiment, and risk to the human and the robot as a result of being separated during a procedure or as a result of a failure during the execution of a task. The work described in this chapter focuses on two principal ideas: uncertainty in the input language used for human-robot interaction, and analysis of the cost of the task. We present a theoretical model for initiating dialogs between a robot and a human operator, using a model of task costs and a model of uncertainty in the input scheme. A Decision Function takes both these parameters as input, and based on the output of this function, the system prompts the user for feedback (e.g., in the form of confirmation of the commands), or executes the given command. The cost assessment is a combination of external costs in the form of operational risk, and internal costs expressed in terms of operational overhead. The rest of this chapter is structured as follows: in Sec. 6.2, we describe the structure of our approach to formulating human-robot dialogs with uncertainty and cost assessment, and discuss each of these aspects of our algorithm in greater detail in Sec. 6.2.1 and Sec. 6.2.2. We introduce the concept of the Confirmation Space in Sec. 6.2.3, through which we explain the generation of task confirmation requests by the robot. Section 6.3 draws conclusions on our approach, and discusses some potential avenues of future research.


Figure 6.1. The Aqua robot being operated via gestures formed by fiducial markers.

6.2. Methodology

In a typical human-robot interaction scenario, the human operator instructs the robot to perform a task. Traditionally this takes place using a pragmatic interface (such as keyboards or mice), but the term “human-robot interaction” usually implies more “natural” modalities such as speech, hand gestures or physical body movements. Our approach is, in principle, independent of the specific modality, but our experimental validation described later in the thesis uses gestures. The essence of our approach is to execute costly activities only if we are sufficiently certain they have been indicated. For actions that have low cost, we are willing to execute them even when the level of certainty is low, since little is lost if they are executed inappropriately.


Figure 6.2. Control flow in our risk-uncertainty model.

Whatever the modality, the robot has to determine the instructions, and for most natural interfaces this entails a substantial degree of uncertainty. The interaction starts with the human operator providing input utterances to the robot. The robot estimates a set of actions and, using a simulator, generates a plan (in this case a potential trajectory) needed to perform the given task. The generated action and trajectory are then evaluated by a cost and risk analysis module, comprised of a set of Assessors. This module outputs an estimated total cost which, together with the uncertainty in the input dialog, is fed into a Decision Function. If the relationship between cost and uncertainty is unacceptable, then the robot decides to ask for feedback. Otherwise, the robot executes the instructed task. A flowchart illustrating the control flow in this process can be seen in Figure 6.2. The core of our approach relies on calculating a probabilistic measure of the uncertainty in the input language, and also calculating the cost involved in making the robot perform the task as instructed by the human operator. The following two subsections describe these two aspects of our framework in detail.


6.2.1. Uncertainty Modeling. To interact with a mobile robot, a human operator has to use a mode of communication, such as speech, gestures, a touch interface, or mouse input on a computer screen. In practice, there will almost always be noise in the system that will introduce uncertainty into the dialog. In our human-robot interaction framework, utterances are considered to be inputs to the system. Gestures, $g_i$, are symbols containing specific instructions to the robot. A gesture set, $G$, is made up of a finite number of gestures,

$$G = \{g_1, g_2, \cdots, g_n\} \qquad (6.1)$$

Each gesture $g_i$ has associated with it a probability $p(g_i)$ of being in an utterance. The robot is aware of the gesture set $G$ (i.e., the “language” set), and the probabilities $p(g_i)$ are precomputed and available to the system before interaction begins with a human operator. This notion of discrete symbols combined with probabilities is commonplace in speech understanding [67]. While these may appear to be restrictive assumptions, they are also very realistic, as bounding the language set is a pragmatic course of action, given the computational capabilities of the robot and the cognitive capacity of the human operator in the operating environment. One can imagine a scenario where a human operator is interacting with a robot to achieve a common task within the constraints of available resources (e.g., power, time, or the life-support system of the human operator), and a compact language set is essential in such circumstances [105]. Statements, $S$ (i.e., sentences), in our framework can be atomic gestures (e.g., “go left”, or “take picture”), or they can be compound commands composed of several atomic gestures, including repetitions. We assume each gesture is uttered independently by the operator, and thus the probability of a sequence of gestures chained together to form a compound statement simply becomes:


$$p(S) \equiv p(g_1, g_2, \ldots, g_n) = \prod_{i=1}^{n} p(g_i) \qquad (6.2)$$

Given a table of values for all $p(g_i)$, it is trivial to compute the probability of any given sequence of gestures. Each statement is formed by passing an observation sequence through a Hidden Markov Model engine (described below), which calculates the likelihood of the sequence. The set of statements contains the possible sentences that may have been communicated to the robot, based on the observation sequences detected by the system. Programs or Tasks are a proper subset of statements. By definition, we require programs to contain only consistent instructions (i.e., instructions that are legal in semantics and syntax). This implies that, for programs $P$,

$$P \subset S \qquad (6.3)$$

Each program $P_i$ has a likelihood of occurrence $l_i$ and a cost $c_i$ associated with it. The question remains, however: given an observation to the system and a set of consistent programs, which program should be executed by the robot, given the likelihood and cost values for each consistent program? Since there will always exist uncertainty in the input observations, we can model the input language scheme as a Hidden Markov Model [68], with the actual input gestures becoming the hidden states of the HMM. An HMM requires three matrices to be specified to estimate the input utterances, namely:

(i) the initial probabilities of the hidden states, $\Pi$;
(ii) the transition probabilities between the hidden states, $A$; and
(iii) the confusion matrix, or the emission probabilities, $B$.


For any given input mechanism, we assume the matrices can be estimated or learned (e.g., in our case, we learn the matrices using real-world operational data from the RoboChat language). Once the matrices are available, the Baum-Welch algorithm can be used to estimate the HMM parameters, and the Viterbi algorithm can be applied to estimate the likelihood of the input utterance [7].
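As a concrete illustration, the sketch below implements the Viterbi recursion directly (no external HMM library is assumed) to recover the most likely hidden gesture sequence and its log-likelihood from the matrices Π, A and B; the two-gesture matrices at the end are purely illustrative, not learned RoboChat values.

```python
import numpy as np

def viterbi(obs, pi, A, B):
    """Most likely hidden gesture sequence and its log-likelihood.

    pi: initial probabilities (n_states,), A: transitions (n_states, n_states),
    B: emission/confusion matrix (n_states, n_observations)."""
    n_states = len(pi)
    T = len(obs)
    logd = np.full((T, n_states), -np.inf)
    back = np.zeros((T, n_states), dtype=int)
    logd[0] = np.log(pi) + np.log(B[:, obs[0]])
    for t in range(1, T):
        for j in range(n_states):
            scores = logd[t - 1] + np.log(A[:, j])
            back[t, j] = np.argmax(scores)
            logd[t, j] = scores[back[t, j]] + np.log(B[j, obs[t]])
    path = [int(np.argmax(logd[-1]))]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1], float(logd[-1].max())      # gesture indices, log-likelihood

# Hypothetical two-gesture language ("go left", "take picture") with a noisy
# detector; the matrices below are illustrative only.
pi = np.array([0.5, 0.5])
A  = np.array([[0.7, 0.3], [0.4, 0.6]])
B  = np.array([[0.9, 0.1], [0.2, 0.8]])
states, loglik = viterbi([0, 0, 1], pi, A, B)
```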

6.2.2. Cost Analysis. Once the uncertainty is computed, we perform a cost assessment of the given task. This is performed irrespective of the uncertainty; i.e., a low uncertainty measure will not cause the cost calculation to be suspended. To estimate the cost of executing a robot plan expressed by a program, we use a set of Assessors that are applied to the robot state as the task is simulated. After executing each command in the input statement, the set of assessors examines the current state of the robot and produces an estimated value of risk. At the end of the simulation, the overall program cost is the sum of all the assessors' outputs over the duration of the simulated program. This sum is taken into consideration by the Decision Function (defined below). The assessors are an integral part of the robot task simulator, as, along with a reasonably accurate estimate of the outcome of the command executions, an estimate of the overall task cost needs to be calculated simultaneously. The assessors are, by design, domain-specific, and are similar to “plug-in” modules for the task simulator. They embody domain knowledge in the task simulator and provide a conservative (i.e., over-estimated) guess of the task execution costs during command simulation. Figure 6.3 shows a sample set of assessors for an underwater robot, with programs as inputs and associated cost estimates as outputs. The four example assessors shown – for energy consumption, depth, path quality and total distance – provide cost assessments at each step of the simulation run (in both event-driven and timed simulation models).
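A minimal sketch of this plug-in structure is shown below; the class names, state fields and penalty constants are illustrative assumptions, and the simulator step function is treated as a black box supplied by the surrounding system.

```python
from abc import ABC, abstractmethod

class Assessor(ABC):
    """Plug-in cost/risk estimator consulted at every simulation step."""
    @abstractmethod
    def assess(self, state) -> float:
        ...

class DepthAssessor(Assessor):
    def __init__(self, max_safe_depth=10.0, penalty=5.0):
        self.max_safe_depth = max_safe_depth
        self.penalty = penalty

    def assess(self, state) -> float:
        # Conservative: penalize every meter beyond the assumed safe depth.
        excess = max(0.0, state["depth"] - self.max_safe_depth)
        return self.penalty * excess

class DistanceAssessor(Assessor):
    def assess(self, state) -> float:
        return 0.1 * state["distance_from_operator"]

def simulate_program_cost(program, simulator_step, initial_state, assessors):
    """Run the program through the (black-box) simulator step function and
    accumulate the assessors' outputs into a total cost estimate."""
    state, total = dict(initial_state), 0.0
    for command in program:
        state = simulator_step(state, command)
        total += sum(a.assess(state) for a in assessors)
    return total
```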


Figure 6.3. The role of assessors in risk/cost estimation.

We approach the cost factor from two different perspectives: namely, the risk as seen from the operator's perspective, and the cost in terms of the operational overhead incurred by the robot while attempting to perform the task. In conventional dialog models used for confirmation only, Bayes risk is sometimes applied [56], where the system confirms only in order to avoid error. Moreover, to create a risk-minimizing formulation, the Bayes risk model requires a set of prior probabilities over the space of all possible input tokens, given a language. On the other hand, there are often scenarios where the system should ask for confirmation even for high-confidence programs, because executing the task would place the underlying system in a high-risk state. For these reasons, we adopt a different approach that does not require prior probabilities.

6.2.2.1. Risk Measurement. Risk encompasses many factors, including domain-specific ones. In our case, the risk model reflects the difficulty of recovering the robot in the event of a total systems failure. In addition, the level of risk to the human operators is a function of time. Examples of high-risk scenarios could be the robot venturing too far from safety, drifting too close to obstacles or other objects that pose a significant threat to the robot, or requiring excessive time to perform the

92 6.2 METHODOLOGY task, etc. We denote the set of such factors by A = {α1, α2, . . . αn}. The examples presented here are by no means exhaustive, but only serve to demonstrate a possible set of parameters that can be included for measuring risk. 6.2.2.2. Cost Measurement. This component measures the operational over- head associated with robot operation, over the duration of the task to be executed. The overhead measures are a function of factors such as power consumption, battery condition, system temperature, computational load, total distance traveled, etc. We denote the set of such factors by B = {β1, β2, . . . βn}. Note that the exact measure- ment of these factors is not possible until the task at hand is completed, hence the initial values obtained are estimates based on simulation or past system operational benchmarks. One can apply machine learning methods, supervised learning in par- ticular, to learn a model of system overhead, although in this work we do not enforce any particular model. 6.2.2.3. Decision Function. Let f be the risk measurement function, and ϕ denote the overhead cost measurement function. Then, overall operational cost, C becomes,

$$C = f(\alpha_1, \alpha_2, \ldots, \alpha_n) + \varphi(\beta_1, \beta_2, \ldots, \beta_n) \tag{6.4}$$

If we denote the uncertainty measure as U, the Decision Function ρ can be expressed as

$$\tau = \rho(C, U) \tag{6.5}$$

The function ρ is directly proportional to both the cost measure C and the uncertainty U. If τ exceeds a given threshold, the system prompts the user for clarification, and the feedback is passed through the uncertainty model and cost-estimation process in a similar fashion. Until the value of τ falls below the threshold, the system keeps asking the user to provide feedback. To describe the structure of the function ρ and to estimate the threshold τ, we introduce the concept of the Confirmation Space in the following section.

6.2.3. Confirmation Space. Before executing a program with high cost or low likelihood, a robot should confirm the desirability of the task with the user. This ensures that the user truly requested the task and that the robot did not misinterpret the command. Since asking for feedback from the user is itself not a cost-free task, any HRI system should ideally minimize the number of confirmation requests. There are three possible alternatives to choose from, namely,

(i) Pick the safest (i.e., lowest-cost) program and execute it.
(ii) Pick the program with the highest likelihood, ignoring the task cost, and execute it.
(iii) Pick a combination of the two above, combining high task likelihood with low cost, and execute it.

Clearly, considering cost without regard for likelihood, and vice-versa, would be foolhardy. Even in the absence of user error, that is, even when the command is perceived with perfect certainty, there could very well be expensive programs that the robot should verify before execution. As such, we opt for option (iii) above. We generate all possible consistent programs based on the observed input (by using the confusion matrix B of gestures, gi), and pass them through the HMM to obtain their observation likelihoods. These programs are also passed through the task simulator (i.e., the set of assessors) to evaluate the cost measures for all of the programs. When the inverses of the cost values (i.e., safety) are plotted against the observation likelihoods of these programs, we obtain the Safety-Likelihood Graph, as illustrated in Figure 6.4. The Safety-Likelihood graph has three distinct regions where input commands may reside. The two shaded areas at the extremities indicate regions where tasks have high certainty and high safety (upper right), and low likelihood and low safety


Figure 6.4. A pictorial depiction of the set of programs Pi = {p1, p2, . . . , p13} in the Safety-Likelihood graph. Programs in the non- shaded areas are in the Confirmation Space.

(lower left). Tasks that fall outside these shaded areas lie in between these extremes, and are candidates for requiring confirmation. As such, we label this region the Confirmation Space in the Safety-Likelihood graph. Once the possible sentences have been generated and their corresponding likelihoods and costs have been computed, we have available a set of programs with their beliefs and associated costs (as seen in Fig. 6.5). To pick the most likely sentence


Figure 6.5. Pictorial depiction of the task selection process from a range of likely alternative programs. The straight line indicates the average cost of likely programs.

with the best cost return, we multiply the inverse of the cost value by the inverse of the uncertainty and choose the program that maximizes this product. More formally, from a set of m possible programs, we choose the first program that maximizes the above-mentioned product. That is,

$$P_{\mathrm{chosen}} = \arg\max_{m} \left( \frac{1}{C_m} \times \frac{1}{U_m} \right) \tag{6.6}$$

We take the average cost of all possible programs and set that as the value of the threshold τ. Next, we compare the cost of Pchosen to the threshold. If it exceeds the threshold, we ask for confirmation. Otherwise, we execute the program as instructed.
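The following is a small sketch of this selection and confirmation rule, under the assumption that the candidate set is non-empty and that all costs and uncertainties are strictly positive; the structure names are illustrative, not the thesis code.

```cpp
// Illustrative sketch: among the consistent candidate programs, pick the one
// maximizing (1/C) * (1/U), then request confirmation if its cost exceeds the
// threshold tau, taken here as the average cost of all candidates.
#include <vector>
#include <cstddef>

struct Candidate {
    double cost;         // C_m from the assessors / task simulator
    double uncertainty;  // U_m from the HMM-based uncertainty model
};

struct Decision {
    std::size_t chosen;      // index of the selected program
    bool needsConfirmation;  // true if the user should be asked first
};

Decision decide(const std::vector<Candidate>& programs) {
    // Select the program maximizing the product of inverse cost and inverse uncertainty.
    std::size_t best = 0;
    double bestScore = -1.0;
    for (std::size_t m = 0; m < programs.size(); ++m) {
        double score = (1.0 / programs[m].cost) * (1.0 / programs[m].uncertainty);
        if (score > bestScore) { bestScore = score; best = m; }
    }

    // Threshold tau: the average cost over all candidate programs.
    double tau = 0.0;
    for (const auto& p : programs) tau += p.cost;
    tau /= static_cast<double>(programs.size());

    return { best, programs[best].cost > tau };
}
```

Calling decide() on the scored candidates yields both the program to execute and whether a confirmation dialog should be raised first.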


6.3. Conclusions

This chapter presents an approach to human-robot dialog in the context of obtaining assurance prior to actions that are both risky and uncertain. Our model of risk is slightly unconventional in that it expresses the risk of a system failure and of the associated recovery procedure that may be required on the part of a human operator. Our model of dialog uncertainty is a direct product of the HMM used for recognition, and by simulating the program, and the likely alternatives that the observation encodes, we can obtain an estimate of the risk involved in executing the action. Experimental evaluations of this approach can be found in Chapter 8. By seeking confirmation for particularly costly actions when they are also uncertain, we demonstrate experimentally that this approach achieves a reduction in the overall cost of action while requiring a relatively small number of confirmatory interactions. In our current framework, we do not combine a failure-based risk model with a cost function based on Bayes risk. This appears to be a challenging undertaking due to the intrinsic complexity of the computation required, but it would be an appealing synthesis that would capture most of the key aspects of our problem domain. It remains an open problem for the moment. We are also interested in evaluating the interaction mechanism across a wider user population and a larger range of dialog models, across multiple robotic platforms, including terrestrial and aerial vehicles. The next chapter looks at the systems architecture of the Aqua robot, which has been the validation platform for this research. We focus on the sensing hardware, and describe the software architecture put in place that enables human-interaction behaviors in Aqua.


CHAPTER 7

The Aqua Family of Amphibious Robots

The visual HRI framework and the associated algorithms presented so far in this thesis have all been validated on-board the Aqua autonomous vehicle described in this chapter. In fact, the algorithms presented thus far enable autonomous capabilities in the Aqua robot. As such, a significant amount of system development work has been performed to incorporate these features on-board the robot. This chapter looks into the architecture of the perception engine, in terms of algorithms and sensory hardware, of the Aqua amphibious robots. Currently, there are three different evolutions of these robots in existence, although from the perspective of the research presented in this thesis, only the most recent version of these robots is relevant to the discussion. A vision-based control scheme enables the robot to navigate underwater, follow targets of interest, and interact with human operators. The visual framework presented in this thesis enables deployment of the vehicle in underwater environments along with a human scuba diver as the operator, without requiring any external tethered control. With a focus on the underlying software and hardware infrastructure, we look at the practical issues pertaining to system implementation as it applies to our framework, from the choice of operating systems to the design of the communication bus. As an analog to what is commonly referred to as the Guidance-Navigation-Control (GNC) bus in the space technology literature [51], this systems development effort produced a control and communication infrastructure for the Aqua robots. In addition to supporting the research in this thesis, this software framework has aided research in state estimation in underwater environments [89] and navigation summaries [37]. The Aqua family of robots are hexapodal amphibious robots capable of both underwater and ground locomotion, as well as surface swimming. The robots use six flippers for propulsion, which also act as hydroplanes for depth and direction control under water. They are power-autonomous, with two on-board lithium-ion batteries providing power for up to six hours. The robots have two computers on-board, one for control and the other for visual processing. For sensing, the robots are equipped with three cameras (two in front, one in the back), an inertial measurement unit (IMU), depth sensors, and motor current and thermal sensors. The robots have a fiber optic tether connection which acts as a conduit for video and data during remotely controlled operations. Figure 7.1 shows a cutaway diagram of the current generation of the Aqua robot.

7.1. Computing

All Aqua robots are equipped with two primary computers – one for performing computations related to vision sensing (hereafter referred to as the vision stack) and the other to generate motor commands to drive the robot within hard real-time [12] constraints (hereafter referred to as the control stack). The vision and control stacks conform to the PC104/Plus form-factor, and are thus “stackable” to accommodate


Figure 7.1. A cutaway annotated diagram of the Aqua robot, showing salient hardware components. Image courtesy of C. Prahacs.

connections to other peripheral cards of the same form-factor, by sharing the ISA and PCI buses with other computers and peripherals on the same stack. The control stack uses a Pentium-class CPU with a clock speed on the order of 300 MHz, and runs the QNX1 real-time operating system to ensure hard real-time operation (running at ≈1000 Hz) for robot control using the RoboDevel robot control software suite (we discuss the software system in more detail in the following sections). Since neither the control stack operating system (OS) nor the RoboDevel library has a high memory demand, we use 256 MB of RAM on the control stack. On the other hand, the vision stack does all of the vision processing on-board, and is thus equipped

1http://www.qnx.com

with a scalable-frequency Pentium-M class processor with a gigabyte of RAM. The vision stack can operate at a maximum clock rate of 1.6 GHz, but during idle times is scaled down by the OS to 600 MHz. This feature keeps the operating temperature of the robot low and, more importantly, prolongs battery life. A stripped-down, highly-optimized version of the Linux operating system was developed to support the work reported in this thesis, and it runs on the vision stack. This OS, which we call Vizix, runs the code that implements the vision-interaction algorithms. In all generations of Aqua robots, the control stack and the vision stack communicate via an Ethernet bus. For storage, both stacks use CompactFlash (CF) cards of various capacities, with the control stack using more robust versions of the CF medium (to protect against magnetic fields and high temperatures arising from the proximity to motors and motor driver circuitry).

Figure 7.2. The hardware schematic of the Aqua robot.


7.2. Vision Hardware

There are three generations of the Aqua robot currently in operation, with very similar mechanical and locomotive capabilities, but with slightly different internal architectures for vision sensing. Since the bulk of the validation of this research was performed on-board the current generation of the Aqua robot, we limit our discussion to this specific instance. Each of the Aqua robots, as mentioned earlier in this chapter, has three cameras for visual sensing, two in the front and one in the back. In the current generation of the Aqua robots, all three cameras have been upgraded to IIDC2-compliant FireWire cameras, with the front cameras having a higher resolution than the rear (1024 × 768 as opposed to 640 × 480). Internally, the FireWire bus is used to transport image frames from the cameras to the vision stack, and also to the external Operator Control Unit (OCU). For video and control data transport over the fiber optic cable, we use a pair of fiber optic media converter multiplexer-demultiplexer cards. For communication with the outside world, the Ethernet-over-FireWire (ETH1394) protocol is used to route command and control packets over the FireWire bus. The interconnection of the various components of the vision subsystem is depicted in Fig. 7.2. The robot is capable of tetherless (i.e., autonomous) behavior with this architecture, with the added benefit of having live FireWire streams available for a remote HUD application in the tethered mode. Because of the design of the FireWire bus, camera data is readily available at any end point. This also creates an opportunity for running vision processing off-board, for example to perform quantitative analysis and simultaneous data collection at a remote computer.

2Instrumentation & Industrial Digital Camera Standard


7.3. Software Architecture

To harness the utility provided by the visual sensors and the computing power on board the Aqua robots, a significant amount of infrastructure has been built to establish a proper software system for vision-based operations. Our investigation into visual interaction techniques includes this system-building task. The software infrastructure spans the choice of OS, the implementation of the core vision-based interaction algorithms, support code for device drivers, robot control, and robot tele-operation.

7.3.1. Operating Systems. As mentioned before, the robot control loop runs with a hard real-time constraint, which means a particular operation is considered to have executed successfully if and only if it succeeds within the maximum time allocated for the given task. For the Aqua robots, this time limit is 1 millisecond. This constraint enforces the use of an operating system with a real-time kernel, and we have chosen the QNX operating system for our purposes. The QNX OS is a commercial UNIX-like real-time OS, although for non-commercial, research and educational purposes, the OS is available for free. QNX is a micro-kernel-based [104] OS, built on the principle of running most of the OS as a number of small tasks. It is this micro-kernel architecture that enables the OS to enforce hard real-time constraints on running programs. For the vision stack, we did not have a similar hard real-time constraint, but the vision processing needed to be fast and efficient for the robot to maintain responsive behavior. The availability of a rich development platform was also a driving factor in choosing an operating system for the vision stack. From these requirements, we considered using a small-footprint Linux distribution, which would be fast and also well suited for installation on CompactFlash media without adversely affecting the health and lifespan of the card. We performed trials with several Linux

distributions, including small-footprint ones such as Damn Small Linux3, but none of them met the needs of the system (in particular, the size and speed requirements). In the end, we created the operating environment for the vision stack based on the Ubuntu Linux distribution. This OS, christened Vizix, has been designed to be lightweight, fast and robust for deployment on the flash memory storage commonly found in embedded systems. The core of Vizix is built from source, and the kernel and other essential components of the OS are highly optimized for the vision stack hardware. Also, by writing temporary data (such as log files and temporary storage) to RAM disks, frequent writes to the CF card are avoided, which extends the life of the card by preserving write cycles. While Vizix is primarily used as a vision computing platform, it also provides essential connectivity services (over wired and wireless Ethernet (802.11a/b/g/n), and the Bluetooth 2.0 Enhanced Data Rate protocol) and provides support for video and image data logging both on-board the robot and off-board. The primary vision application runs on Vizix, and is designed to be an ensemble of the different techniques we have integrated into the robots.

7.3.2. Robot control. Real-time control of the robot's motions is performed by a software suite called RoboDevel, a suite of software systems designed to control the RHex [3] family of legged robots (including descendants of RHex), providing a platform for writing robot-specific behaviors, as well as graphical tools for creating a graphical user interface (GUI) for a remote OCU. The RoboDevel suite also includes a simulation suite to facilitate off-line development. Figure 7.3 shows an instance of the remote OCU front end, operating in simulation mode.

7.3.3. Implementation of vision-interaction algorithms. Due to the real-time requirements for operating the robot, the algorithm implementations need to adhere to hard real-time constraints, and that calls for efficient implementations in

3Damn Small Linux, http://www.damnsmalllinux.org/


Figure 7.3. Aqua remote operator console, working in simulation mode.

a real-time capable programming language. The visual HRI framework presented in this thesis has been implemented in C++, with visual processing code based on the VXL4 vision libraries, and robot control code based on the RoboDevel suite. This framework, called VisionSandbox, is a collection of algorithms for visual tracking, diver following [80], learning-based object detection [79], gesture-based human-robot communication [24] and risk assessment in human-robot dialogs [83]. VisionSandbox contains two major components – one for off-board, on-the-bench algorithm development, prototyping and testing, and the other for real-time deployment on the Aqua robots. The robot implementation runs embedded on-board the vision stack, with no external operator assistance or monitoring. In such a “tetherless”

4Vision “Something” Libraries, http://vxl.sourceforge.net

operating mode, the robot can operate autonomously and can interact with a diver with the aid of the RoboChat gesture system. While tethered, the vision code can be run on-board, but it can also be run remotely on an off-board computer. In this mode, it is possible to monitor real-time behavioral responses in the visual interaction mode, and also to modify robot behavior through a graphical user interface.

7.3.4. Vision-guided autonomous control. With the aid of the RoboChat scheme, together with the visual tracking and servoing mechanism, the robot has the ability to operate without a tethered remote controller. The vision stack runs the on-board component of VisionSandbox, which contains the implementations of the algorithms in our visual HRI framework. The robot controller code is a separate executable that runs on the control stack, and at power-up both of these programs come on-line. The vision client immediately goes into fiducial detection mode and waits for input from the human operator. Once it detects a valid tag (or a set of tags, the correctness of which is enforced by the RoboChat language grammar), the vision client communicates with the control client over the network using the User Datagram Protocol (UDP), sending robot behavior commands and reading back robot responses. Once put into swimming mode, the vision client has the ability to control five degrees of motion of the robot, and can also engage the visual servoing system to track and follow an object underwater in fully autonomous mode.
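As an illustration of this vision-to-control hand-off, the sketch below sends a single behavior command over UDP and reads back a reply using POSIX sockets; the address, port and message format are assumptions for illustration and do not reflect the actual protocol used on the robot.

```cpp
// Illustrative sketch using POSIX sockets (address, port and message format are
// assumptions, not the robot's actual protocol): the vision client sends one
// behavior command string to the control client over UDP and reads back a reply.
#include <arpa/inet.h>
#include <sys/socket.h>
#include <unistd.h>
#include <cstdio>
#include <string>
#include <iostream>

int main() {
    int sock = socket(AF_INET, SOCK_DGRAM, 0);
    if (sock < 0) { std::perror("socket"); return 1; }

    sockaddr_in control{};                          // control-stack endpoint (hypothetical)
    control.sin_family = AF_INET;
    control.sin_port = htons(5000);
    inet_pton(AF_INET, "192.168.0.2", &control.sin_addr);

    const std::string cmd = "FORWARD 5";            // example RoboChat-style behavior command
    sendto(sock, cmd.c_str(), cmd.size(), 0,
           reinterpret_cast<const sockaddr*>(&control), sizeof(control));

    char reply[256] = {0};                          // read back the robot's response
    socklen_t len = sizeof(control);
    ssize_t n = recvfrom(sock, reply, sizeof(reply) - 1, 0,
                         reinterpret_cast<sockaddr*>(&control), &len);
    if (n > 0) std::cout << "robot replied: " << reply << std::endl;

    close(sock);
    return 0;
}
```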

7.4. Conclusion

In this chapter, we have presented an overview of the Aqua amphibious robot platform, with an emphasis on the sensing and computing hardware, and the software components that enable Aqua to exhibit semi-autonomous behavior and interact with a scuba diver. While a detailed description of all the robot components is beyond the scope of this thesis (and also not relevant to the core research presented here),

we believe it is important to highlight the amount of systems development that was required to implement and validate our HRI framework. Nevertheless, for the sake of completeness, we present a functional schematic diagram of the Aqua robot in Figure 7.4, with all key hardware components and data pathways. The next two chapters present the experimental evaluations of the algorithms presented in Chapters 3 to 6. The first of these chapters discusses the human-interface experiments performed to evaluate the quality of the RoboChat scheme, and also presents experimental verification of the confirmation system presented in Chapter 6.


Figure 7.4. Aqua robot functional schematic diagram.


CHAPTER 8

Experiments and Evaluations – Explicit Interactions

The framework presented in this thesis aims to create a consistent, robust mechanism for efficient visual human-robot interaction. The preceding chapters (except the one on the Aqua platform) discuss this framework, with each chapter presenting one algorithm that contributes to its core. To evaluate the performance and usability of the framework as a whole, as well as of its individual components, we have performed an extensive set of experiments, both on-board our robot and off-board in the form of human-interface experiments. The next two chapters describe these experiments in greater detail. We first turn our attention to the experiments performed to evaluate the algorithms for explicit interactions described in Chapter 3, where we presented a fiducial-assisted, gesture-based visual programming language. In addition, we also present experimental results, both off-board and on-board, for our dialog confirmation algorithm presented in Chapter 6.

8.1. RoboChat Human-Interface Trials

This section describes a set of human-interface tests we performed to evaluate the RoboChat language scheme. RoboChat, as described in Chapter 3, is a well-defined formal language for robot programming, and in our framework it is used as the primary means of programming the robot through visual cues. We perform a set of experiments, described in the following sections, to evaluate the speed, accuracy and adaptability of the RoboChat scheme compared to free-form hand gestures. Of course, the accurate detection of free-form hand gestures in arbitrary (i.e., non-engineered) environments is a non-trivial issue, and thus the hand-gesture detection is performed by a human observer.

8.1.1. Human Interaction Study. An important concern regarding human-robot interfaces in arbitrary environments is the effect of a given interface on the cognitive load [13] of the user. Cognitive load theory suggests that humans are able to learn quickly and execute learned behavior efficiently when there is substantial familiarity in the methods and modalities used to teach new skills. The RoboChat scheme, while similar to free-form hand-gesture control, is a novel approach to robot control, as it uses fiducials to deliver commands to the robot. Particularly in the case of underwater operations, “Task Loading” [10] in scuba divers is a significant concern. Task Loading refers to the phenomenon where the cognitive ability of scuba divers is overloaded as a result of the multitude of responsibilities that require their attention at a given instant. This often results in mistakes in performing basic scuba diving functions, including actions that are essential for diver safety. Any task or set of tasks that significantly task-loads a scuba diver would be undesirable to perform in underwater domains. Our human-interface experiments are thus geared towards assessing usability under cognitive load, and under task loading in particular.


Two sets of studies were conducted using the marker-based input scheme in combination with the RoboChat language, to assess their usability. The ARTag marker scheme, discussed briefly in Chapter 2, was used in the interaction experiments. In particular, the ARTag mechanism is compared to a hand-gesture system, as competing input devices for environments unsuitable for the use of conventional input interfaces. The first study investigates the performance of the two systems in a stressful environment, similar to one that real-world operators might encounter (e.g., scuba divers underwater). The second study aims to compare the two input mechanisms in the presence of different vocabulary sizes. The main task in both studies is to input a sequence of action commands, with the possibility of specifying additional parameters. The RoboChat format is used with both input devices, although in the case of the hand-signal system, the gestures are interpreted by an expert human operator at a remote location, who subsequently validates the correctness of the input using the RoboChat syntax. This setup is realistic because, in the specific case of the Aqua robot, the diver's hand signals are interpreted by an operator on land, who then takes control of the robot. Also, the operator is not forced to be unbiased when interpreting gestures, because realistically the robot operator will guess and infer what the diver is trying to communicate if the hand gestures are ambiguously perceived.

Before starting each of the two sessions (using different input devices), participants are briefed on the RoboChat syntax, and are given the chance to practice using the devices and the RoboChat language on a limited version of the experiment interface. This way, the participant acquires familiarity with the workings of the system before attempting to carry out the complete experiments.

8.1.1.1. Study A. In the first study, 50 ARTag markers are provided in groups of six, in a cubical dice configuration (examples of which can be seen in Fig. 8.1). The participants are allowed to place the dice in any configuration in the provided


1   TURN RIGHT, REVERSE, EXECUTE
2   FORWARD, TURN LEFT, FORWARD, EXECUTE
3   REVERSE, TURN RIGHT, FORWARD, REVERSE, EXECUTE
4   REVERSE, TURN RIGHT, REVERSE, TURN LEFT, REVERSE, EXECUTE
5   TURN RIGHT, SURFACE, TURN LEFT, SURFACE, REVERSE, TURN LEFT, STOP, EXECUTE
6   STOP, FORWARD, SURFACE, TURN RIGHT, SURFACE, EXECUTE
7   REVERSE, STOP, SURFACE, FORWARD, STOP, SURFACE, TURN RIGHT, EXECUTE
8   FORWARD, STOP, FORWARD, SURFACE, STOP, EXECUTE
9   TURN RIGHT, TURN LEFT, SURFACE, EXECUTE
10  FORWARD, REVERSE, FORWARD, STOP, EXECUTE
11  FORWARD, REVERSE, TURN RIGHT, EXECUTE
Table 8.1. Tasks used in Study A.

work area, and they are encouraged to place them so that the cubes are easily accessible. The hand gestures in this study are predetermined and are visually demonstrated to the participants, who are then asked to remember all the gestures. During the experiment session, the participants must rely on memory alone to recall the gestures, similar to the scenario faced by scuba divers. In this study, most of the commands in the sequence do not take parameters (making the language simpler). Even for those few that require parameters, the variables conserve their values across commands, so not all parameters need to be re-entered for subsequent commands. The system tells the participant when the entered command is incorrect, and proceeds to the next command only after receiving the previous one correctly. Incorrect commands are penalized both as a user mistake and in terms of the additional time taken to enter the command correctly. The set of tasks used is presented in Table 8.1.

The stress factor in the first study is introduced by asking participants to play a game of Pong (a classic table-tennis video game [44] from the 1970s) during the


Figure 8.1. Different ARTag structures used during the experiments.

experimental sessions. Note that a suitable distractor task must be fairly accessible to all users and continually demanding of attention, yet still allow the core task to be achievable. Thus, after considering several alternatives, Pong was eventually chosen as the distraction task. This particular implementation of Pong uses the mouse to control the user's paddle. As such, participants are effectively limited to using only one hand to manipulate the markers and to mark out gestures, while constantly controlling the mouse with the other hand. But since some of the predefined hand gestures require the use of both hands, this distraction introduces additional stress for the participants in terms of alternately showing gestures and playing Pong. In this study, the system (controlled by the operator for the hand-gesture session) informs the participant when the entered command is incorrect, and proceeds to the next command only after the current command is entered correctly. The participants are told to complete the sessions as quickly as possible, but also with as little error as possible.


FORWARD, TURN LEFT, FORWARD, TURN RIGHT, REVERSE, STOP, FORWARD, SURFACE, TURN RIGHT, SURFACE, STOP, FORWARD, DEPTH, 10, TURN RIGHT, FORWARD, STOP, TAKE PICTURE, SURFACE, GPSFIX, DEPTH, 15, FORWARD, TURN RIGHT, FORWARD, TAKE PICTURE, 10, TURN LEFT, FORWARD, SURFACE, STOP, EXECUTE
Table 8.2. Example of a long command used in Study B.

Limited feedback is given to participants, mainly in the form of an audio beep to acknowledge the reception of a marker (automatically generated by the ARTag detection code) or of a gesture (manually triggered by the ‘operator’). The participant does not know which token has been detected, and therefore must make an effort to ensure that the correct token is entered at all times, as well as remember all the tokens entered. The sole exception to this restriction is the EXECUTE token – when the language is used to control a robot, the EXECUTE token will trigger robot actions that can be readily observed by the robot operator. Thus, during the experiments, the user is notified through textual feedback whenever the EXECUTE token has been detected. Before starting each of the two sessions, participants are given the chance to practice using the input device as well as RoboChat on a single-command version of the session. These practice sessions are not timed, and their main purpose is to let the participants understand how the overall system works before attempting to carry out the rest of the sessions.

8.1.1.2. Study B. The second study shares many similarities with the first (an example task is seen in Table 8.2), but the parameter of interest is no longer the participant's concentration level, but rather the performance difference using different vocabulary sizes. Two vocabulary sets are given in this study – the first set contains only 4 action commands, while the second includes 32. This distinction is mentioned to every participant so that they can use this information to their advantage. Due


Figure 8.2. A subset of gestures presented to participants in study B.

to the increase in vocabulary size, the ARTag markers are provided in two different media – the digits are still offered on dice, while the whole vocabulary set is also organized into a flipbook (also shown in Fig. 8.1). The flipbook is separated by several high-level tabs into different token groups (such as ‘digits’, ‘parameters’ and ‘commands’). Each tag sheet has a low-level tab on the side, listing the mappings for the two markers on both sides of the sheet. This feature halves the number of sheets needed, and by grouping similar mapping pairs into single sheets, it increases the access speed of the device. The same vocabulary-size issue arises for the hand gestures. Real scuba divers are required to remember a number of hand signals, but because it is unrealistic to ask participants to remember more than 50 different hand gestures under the tight time constraints of the experiment, a gesture lookup sheet is given to each participant (a subset of such gestures can be seen in Fig. 8.2). The subjects are encouraged to

familiarize themselves with this cheat sheet during the practice sessions, to ensure that they spend minimal time searching for particular hand signals. There is no distraction factor in this second study, but at the same time, the system accepts incorrect commands without informing the participants or making them re-enter the commands. The users are informed of this condition, and are requested to constantly keep track of the entered tokens and to try to make as few mistakes as possible.

8.1.2. Criteria. Two criteria are used to compare the performance of the two input interfaces. The first criterion is speed, i.e., the average time it takes to enter a command. A distinction is made between the two studies regarding this metric: in the first study, the input time per command is measured from the time a command is shown on screen until the time the command is correctly entered by the participant, whereas in the second study, the command speed does not take into consideration the correctness of the command. The second study also uses the average time per individual token as a comparison metric. This metric demonstrates the raw access speeds of both input interfaces outside the context of RoboChat or any other specific environment. The second criterion used to compare the two systems is the error rate associated with each input scheme. Once again, due to the distinction in how incorrect commands are treated between the two studies, results from this metric cannot be compared directly between studies. This criterion is used to examine whether the two schemes affect the user's performance in working with RoboChat differently. In total, 12 subjects participated in Study A, whereas 4 subjects participated in Study B. One of the participants, present in both studies, has extensive experience with ARTag markers, RoboChat, and the hand-gesture system. This expert user is included in the dataset to demonstrate the performance of a well-trained user.


(a) Study A: Average time taken per command using ARTag markers (in red) and using hand gestures (in dark blue).

(b) Study B: Average time taken per command using ARTag markers (in red) and using hand gestures (in dark blue).

Figure 8.3. Studies A and B: Average time taken per command using ARTag markers and hand gestures.

However, this user had no prior knowledge of the actual experiments, and is therefore capable of exhibiting similar performance improvements over the course of the sessions.


8.1.3. Results: Study A. One obvious observation we can make from the performance data is that the gesture system allows for faster communication than the marker system. The ratio between the two input techniques for some users surpasses 3:1 in favor of hand gestures, while data from other users (including those from the expert user) show ratios of lower than 2:1. Since all users have experience with primitive hand gestures, we can infer that the users who did almost as well with markers as with gestures simply adapted to the marker system more quickly. Thus, the data suggest that the ARTag markers are capable of reaching at least half the speed of the hand gestures, even given only limited practice. It is worth noting that, contrary to the hand gestures, which are chosen to have intuitive and natural mappings to their corresponding tokens, the mappings between the ARTag markers and tokens are completely arbitrary. To further substantiate the hypothesis that the enhanced performance of hand gestures is due to familiarity, note that Fig. 8.3(a) indicates that the spread of the average time per command using gestures (± 3 seconds) is much smaller than that for markers (± 8 seconds). Arguably, the more sporadic spread for the markers is due to unfamiliarity with this new input interface. The distraction task (playing Pong) also plays an important role in increasing the performance disparity between the two systems. For each token, the participants need to search through the entire ARTag vocabulary set for the correct marker, whereas the associated hand gesture can be much more easily recalled from memory. Since the Pong game requires the participant's attention on an ongoing basis, the symbol-search process was repeatedly disrupted by the distraction task, amplifying the marker search time. In terms of the error rate associated with each system, all the participants displayed error rates of roughly 5 percent for both systems. This finding is surprising and interesting, because even though the symbolic system is harder to learn, it does


(a) Study B: Average time taken per command using ARTag markers.

(b) Study B: Average time taken per command using hand gestures.

Figure 8.4. Study B: Average time taken per command using ARTag markers and hand gestures. In both plots, user 4 is the “expert user”.

not seem to generate more errors than the gesture system, even for inexperienced users.

8.1.4. Results: Study B. The data from study B suggests that the two input interfaces have very similar performances under the new constraints. Major


Figure 8.5. Study B: Trial progression using tags.

contributing factors include the increase in the vocabulary size and the inclusion of many abstract action tokens (such as RECORD_VIDEO and POWER_CYCLE). This variation takes away the crucial advantage gestures had in the former study, and participants are now forced to search through the gesture sheet rather than remembering the many hand gestures. Essentially, in this study, the command-speed criterion boils down to the search speed for each input device, and therefore depends on the reference structure, whether it is the ARTag flipbook or the gesture cheat sheet. Using these two engineered reference structures, the experimental data show that the speed performance of the two input systems is actually very similar. Interestingly enough, the data spread between systems is actually reversed, as shown in Fig. 8.4. With the exception of the expert user, the average command and token speeds for all the participants using ARTag markers are almost identical,


Figure 8.6. Study B: Trial progression using hand gestures.

whereas the same speeds using gestures are now erratic between individuals. This result can be attributed to the fact that, since the gestures are not kept in memory, different subjects adapt to the cheat-sheet setup at different speeds. One interesting observation from Fig. 8.5 is that the command speeds for the non-expert users all seem to decay in a similar fashion as the number of trials increases. That is, as the users acquire more experience programming the system with the tags, their programming times improve. This gradual improvement suggests that the ARTag marker system is still relatively quick to learn. This hypothesis is reinforced by the nearly constant command speeds of the expert user. In comparison, Fig. 8.6 shows that the command speeds for the gesture portion of the study are more or less linear, with a chaotic nature due to the spread explanation

above. This result further strengthens the theory presented in the previous section, that all the participants have prior experience with the general hand-gesture system, and only have to learn the particular supplied gestures. The two vocabulary sets employed did not have any negative impact on the subjects' performance. Even though most participants acknowledged that the initial commands only use a small subset of the larger vocabulary set, the difference in input speed is too small to be significant. The expert user's data show almost a 1:1 ratio between the two command input speeds. Because the expert user is familiar with all the specified hand gestures as well as with the configuration of the ARTag flipbook, his data suggest that the ARTag markers can rival the gesture system in terms of speed, given enough training. As for errors in the different sessions, although most commands were entered correctly, the RESET token was employed on several occasions. This result simply indicates that, without distractions, RoboChat can be used easily without committing any non-reversible errors.

8.1.5. RoboChat field trials. The results from our controlled usability study were corroborated by field trials in which the visual, symbol-driven RoboChat interface was used on-board the Aqua robot operating underwater. These trials were conducted both in a large enclosed swimming pool at a depth of roughly 2 meters, and in a large open-water lake at depths ranging from 0 to 6 meters. While the demands imposed by the experimental conditions precluded quantitative measurements like those in the preceding section, both the hand signals and the RoboChat symbols were used. The simple fact that the RoboChat system makes a tether for remote control unnecessary makes it an extremely desirable system for underwater deployment. The subjective response of two divers familiar with controlling the robot is that the RoboChat system is easy and convenient to use.


In general, the RoboChat controller may reduce the cognitive load on the diver, but it does imply an additional risk if the device showing the symbolic markers becomes lost or inaccessible. The system was also validated in a terrestrial environment in conjunction with a Nomadics SuperScout robot (controlled using the RoboDevel [74] software package). This interface also appeared to be convenient, but probably would have been more effective in combination with a keyboard for low-level programming.

8.2. Quantitative Analysis of Confirmation in Dialogs

In this section, we present the experimental results of our algorithm for generating confirmations in human-robot dialogs in the presence of uncertainty. Similar to the RoboChat experiments presented in the previous section, we perform a two-part evaluation of this system – a quantitative analysis based on “off-board” human-interaction studies performed in a controlled, in-lab environment, and a qualitative assessment based on extensive field trials with the Aqua robot. The field trials also served as a benchmark to qualitatively assess the feasibility of a real-world deployment of this system. Note that for the latter, the algorithm was completely implemented to run on-board the Aqua robot, in real time, with the RoboChat scheme serving as the input modality. The field trials were performed both in open-water (i.e., oceans and lakes) and in closed-water (i.e., swimming pool) environments. In the off-board experiments, a group of users was asked to program the robot to perform certain tasks, with an input modality that ensured a non-trivial amount of uncertainty in communication. Since the key concept in this work involves a human-robot dialog mechanism, we did not require task execution for the off-board trials. Results and experiences from both sets of experiments are presented in the following sections, preceded by a brief description of the input language.


8.2.1. Language Description. While in field deployments we use fiducials to instruct the robot and thus generate tokens for the RoboChat language parser, the language is independent of the particular input modality used, as long as the modality used generates independent tokens for use by RoboChat. This feature enables us to use the same language set for field trials and off-board human interface trials.

8.2.2. User Study. We performed a set of user studies to collect quantitative performance measures of our algorithm. When the system operates as a diver's assistant in underwater environments, fiducials are used to engage in a dialog with the robot. However, in the off-board bench trials, we employed a simplified “gesture-only language”, where the users were limited to using mouse input. We used a vocabulary set of 18 tokens defined by oriented mouse gestures, and as such each segment is bounded by a 20°-wide arc. The choice of mouse gestures stemmed from the need to introduce uncertainty in the input modality, while keeping the cognitive load roughly comparable to that experienced by scuba divers. We could not use the ARTag scheme for these experiments, as the ARTag library neither provides a confidence factor for tag detection, nor does it exhibit a significant false-positive detection rate. Using ARTags would therefore have provided insufficient data to thoroughly validate the algorithm for an arbitrary input modality. To calculate uncertainty in the input, we trained a Hidden Markov Model using commonly used programs given to the robot (such as those used in previous experiments and field trials, i.e., real operational data). To estimate task costs, we simulated the programs using a custom-built simulation engine and used a set of assessors that take into account the operating context of an autonomous underwater vehicle. The

simulator has been designed to take into account the robot's velocity, maneuverability and propulsion characteristics in order to accurately and realistically simulate the trajectories taken by the robot while executing commands such as those used in our experiments. In choosing assessors for the user studies, we considered factors that directly affect underwater robot operations. For example, the distance traveled by the robot (and the farthest distance it travels from the start point) often has a direct bearing on the outcome of the mission, as the probability of robot recovery is inversely proportional to these factors. That is because energy consumption is directly proportional to the distance traveled. Robot safety (e.g., the chance of collisions) is also significantly compromised by traveling large distances. In particular, we applied the following assessors during the user studies (a simple sketch of how these quantities can be computed follows the list):

(i) Total distance: The operating cost and risk factors both increase with total distance traveled by the robot. The cost associated with the amount of wear is a function of total travel, and higher travel distances also increase external operational risks.
(ii) Farthest distance: The farther the robot goes from the initial position (i.e., operator's position), the higher the chance of losing the robot. In the event that the robot encounters unusual circumstances which it is not equipped to handle, the involvement of a human operator is also a small possibility, thereby increasing the overall task cost.
(iii) Execution Time: An extremely long execution time also carries the overhead of elevated operational and external risk.
(iv) Average Distance: While the farthest and total distance metrics consider extremes in range and travel, respectively, the average distance looks at the distance of the robot (from start location) where most of the task execution time is spent.
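Below is a small sketch of how these four quantities might be computed from a simulated, time-stamped trajectory; the data structures and sampling assumptions are illustrative and not taken from the thesis simulator.

```cpp
// Illustrative sketch (the data structures are assumptions, not the thesis
// simulator): compute the four user-study assessor values from a time-stamped,
// simulated trajectory with at least two samples and increasing timestamps.
#include <vector>
#include <cmath>
#include <algorithm>
#include <cstddef>

struct Sample { double t, x, y, z; };   // time (s) and position (m) along the simulated run

struct AssessorValues {
    double totalDistance;    // path length traveled
    double farthestDistance; // maximum straight-line distance from the start point
    double executionTime;    // duration of the simulated task
    double averageDistance;  // time-weighted mean distance from the start point
};

AssessorValues assess(const std::vector<Sample>& traj) {
    AssessorValues v{0.0, 0.0, 0.0, 0.0};
    const Sample& s0 = traj.front();
    double weighted = 0.0;
    for (std::size_t i = 1; i < traj.size(); ++i) {
        const Sample& a = traj[i - 1];
        const Sample& b = traj[i];
        // Incremental path length between consecutive samples.
        v.totalDistance += std::hypot(std::hypot(b.x - a.x, b.y - a.y), b.z - a.z);
        // Straight-line distance from the start point at this sample.
        double fromStart = std::hypot(std::hypot(b.x - s0.x, b.y - s0.y), b.z - s0.z);
        v.farthestDistance = std::max(v.farthestDistance, fromStart);
        weighted += fromStart * (b.t - a.t);   // distance weighted by elapsed time
    }
    v.executionTime = traj.back().t - s0.t;
    v.averageDistance = weighted / v.executionTime;
    return v;
}
```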


ID  Sequence  Confirm?
1   FORWARD, 3, PICTURE, LEFT, 3, PICTURE, UP, GPSFIX, GPSBEARING, EXECUTE  No
2   FORWARD, 9, LEFT, 6, FORWARD, 9, MOVIE, 9, RIGHT, 3, SURFACE, STOP, GPSFIX, EXECUTE  Yes
3   LEFT, 6, RIGHT, 3, MOVIE, 3, TUNETRACKER, FOLLOW, 6, UP, GPSFIX, EXECUTE  No
Table 8.3. Programs used in the user study.

Figure 8.7. An example set of tags used in field trials to command the robot, which corresponds to program # 1 shown in Tab. 8.3.

Each user was given three programs to send to the system, and each program was performed three times. A total of 10 users participated in the trials, resulting in 30 trials for each program, and 90 programs in all; please refer to Table 8.3 for the programs used in the experiments, and whether confirmations were expected or not. Except for mistakes that created inconsistent programs, users did not receive any feedback about the correctness of their program. When a user finished “writing” a program, she either received feedback notifying her of program completion, or a confirmation dialog was generated based on the output of the Decision Function. The users were informed beforehand about the estimated cost of each program; i.e., whether or not to expect a confirmation request. In case of a confirmation request for Programs 1 and 3, the users were instructed to redo the program. For Program 2, the users were informed of the approximate values of the outputs of the assessors.


If the values shown in the confirmation request exceeded the expected numbers by 10%, the users were required to reprogram the task. Thus, in all cases, users were required to continue the programming task until the output of the system (i.e., either quantitative values from the assessor outputs or confirmation dialogs) was consistent with the expected behavior. It is worth noting, however, that this does not necessarily indicate correctness of the programming, but merely indicates that the Decision Function has judged the input program (and its likely alternatives) to be sufficiently safe (i.e., “inexpensive”) for execution.

Figure 8.8. Mapping of mouse gestures to robot commands for the user trials. On the left, the command mapping can be seen. The figure on the right demonstrates a user trial in progress, with the straight lines depicting a user’s attempts to issue the commands shown on the left.

8.2.3. User Interface. The experimental setup for the user studies consisted of an application in which users were asked to draw straight lines with a mouse in a drawing area to communicate with the robot. Specifically, straight lines drawn by dragging the mouse at different orientations mapped to corresponding robot commands. The orientation-to-command mappings were set up before the trials

began, and were kept constant throughout the complete set of user trials. A screenshot of the application shown to the users can be seen in Fig. 8.8. The users were presented with a mapping of orientations to commands, as shown on the left. Each arc (bounded by two lines) corresponds to a range of orientations which map to the command labeled on the arc (e.g., a line with an orientation between 10 degrees and 30 degrees corresponds to the RIGHT command). The language contained a maximum of 18 possible commands, and thus each command occupied a 20-degree arc of the angular distribution (a simple sketch of this orientation-to-command mapping is given at the end of this section). The drawing area, shown on the right, began as a blank area with only the X and Y axes displayed. It is important to note that while the boundaries of the individual ranges were shown on the left, the drawing area itself had no such visual cues overlaid on it to assist users, and hence they had to estimate the limits of the correct arc for each command by eye, which added to the input uncertainty. The trial number and the run number (out of a possible 3 for both) are displayed in the top left corner of the drawing area. Two buttons, one for advancing to the next run and one for quitting the trial, are available. No feedback is given to the users after each command is entered (i.e., after drawing each line), and so the users are not notified of mistakes that may have been made during programming. This ensures that mistakes made during programming are propagated to the dialog mechanism and handled by the uncertainty estimation algorithm. The design of the user interface was influenced by the need to simulate the cognitive loads experienced by operators of a mobile robot in real-world scenarios. In many cases, operators of field robots (or, in general, applied robots in arbitrary domains) are exposed to high cognitive loads, where the task of operating a mobile robot is multiplexed with other concurrent tasks in the environment. This is especially true in the underwater environment, where a scuba diver is under substantial cognitive load maintaining dive gear and associated life-support equipment; the task of communicating with a robot adds to that already-significant workload. Also, as the goal of our dialog


ID  Sequence  Confirm?
1   FORWARD, 9, LEFT, 5, FORWARD, 9, LEFT, 5, STOP, MOVIE, 9, EXECUTE  Yes
2   FORWARD, 5, LEFT, 3, FORWARD, 5, LEFT, 3, FORWARD, 5, LEFT, 3, STOP, EXECUTE  No
3   SWIMCIRCLE, 3, STOP, EXECUTE  No
4   SWIMCIRCLE, 3, FORWARD, 5, PICTURE, LEFT, 2, PICTURE, FORWARD, 3, PICTURE, RIGHT, 2, PICTURE, SURFACE, STOP, EXECUTE  Yes
5   TUNETRACKER, FOLLOW, 9, SURFACE, STOP, EXECUTE  Yes
Table 8.4. Programs used in the field trials in Lake Ouareau, Québec.

mechanism is to capture programming errors under input uncertainty and high task costs, the desire was to deploy an interface, however unintuitive, that induces errors in input, and thus makes it possible to rigorously test the dialog engine at length.
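The following is a minimal sketch of the orientation-to-command mapping described above; the command table, its ordering and the arc boundaries are hypothetical stand-ins, since the actual mapping used in the trials is the one shown to users in Fig. 8.8.

```cpp
// Illustrative sketch: map the orientation of a drawn line segment to one of
// 18 commands, with each command occupying a 20-degree arc of the full circle
// (18 x 20 = 360 degrees). The table below is a hypothetical example.
#include <cmath>
#include <string>
#include <array>

std::string commandForStroke(double x0, double y0, double x1, double y1) {
    static const std::array<std::string, 18> kCommands = {
        "FORWARD", "RIGHT", "LEFT", "REVERSE", "UP", "SURFACE",
        "STOP", "PICTURE", "MOVIE", "GPSFIX", "GPSBEARING", "DEPTH",
        "FOLLOW", "TUNETRACKER", "EXECUTE", "RESET", "SWIMCIRCLE", "NPICTURE"
    };
    const double kPi = 3.14159265358979323846;

    // Orientation of the stroke in degrees, normalized to [0, 360).
    double deg = std::atan2(y1 - y0, x1 - x0) * 180.0 / kPi;
    if (deg < 0.0) deg += 360.0;

    // Each command owns one 20-degree arc.
    int index = static_cast<int>(deg / 20.0) % 18;
    return kCommands[index];
}
```

Because the user must estimate the arc boundaries by eye, small angular errors can land a stroke in an adjacent arc, which is exactly the input uncertainty the dialog mechanism is designed to absorb.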

8.2.4. Field Trials. We performed field trials of our system on-board the Aqua underwater robot, in both open-water and closed-water environments. In both trials, the robot was visually programmed with the same language set used for the user studies, with ARTag and ARToolkitPlus [65] fiducials used as input tokens; see Tab. 8.4 and Tab. 8.5 for the programs used in the field trials. The assessors used for the user studies were also used in the field trials; in addition, we provided an assessor to take into account the depth of the robot during task execution. Because of the inherent difficulty in operating underwater, the trials were not timed. Users were asked to do each program once. Unlike in the user study, where there was no execution stage, the robot performed the tasks that it was programmed to do, when


ID  Sequence  Confirm?
1   TUNETRACKER, FOLLOW, 5, STOP, FORWARD, 4, LEFT, 4, FORWARD, 4, STOP, EXECUTE  Yes
2   SWIMCIRCLE, 5, NPICTURE, 5, FORWARD, 5, STOP, EXECUTE  No
3   SWIMCIRCLE, 9, STOP, EXECUTE  No
4   FORWARD, 5, RIGHT, 5, SWIMCIRCLE, 5, STOP, EXECUTE  No
5   FORWARD, 5, LEFT, 5, FORWARD, 5, RIGHT, 5, FORWARD, 5, LEFT, 5, FORWARD, 5, RIGHT, 5, FORWARD, 5, STOP, EXECUTE  No
Table 8.5. Programs used in the open-ocean field trials in Barbados.

given positive confirmation to do so. In all experimental cases, the robot behaved consistently, asking for confirmations when required, and executing tasks immediately when the tasks were inexpensive to perform. Unlike the user study, where the users had no feedback, the field-trial participants were given limited feedback in the form of symbol acknowledgement using a micro organic light-emitting diode (µOLED) display at the back of the robot. Also unlike the user studies, the field-trial participants were given access to a command to delete the program and start from the beginning, in case they made a mistake. A pictorial demonstration of our system in action during field trials can be seen in Fig. 8.9, which shows the visual programming and command feedback through the µOLED screen.

8.2.5. Results. From the user studies, it was observed that in cases where the programs were correctly entered, the system behaved consistently in terms of


(a) Divers programming Aqua during pool trials. (b) A diver programming Aqua during an HRI trial held at a lake in central Québec.

(c) Example of command acknowledgement given on the LED screen of the Aqua robot during field trials.

Figure 8.9. Field trials of the proposed algorithm on board the Aqua robot.

Program 2 was the only one that issued confirmations, while Programs 1 and 3 only confirmed that the task would be executed as instructed. As mentioned in Sec. 8.2.2, the users were not given any feedback on program correctness.

Thus, the programs sent to the robot were not accurate in some trials; i.e., the input programs did not exactly match the programs listed in Tab. 8.3. In case of mistakes, the Decision Function evaluated the input program and its most likely alternatives, and allowed a program to be executed without confirmation only if the task was evaluated to be less costly.

The cost of feedback, not unexpectedly, is the time required to program the robot. As seen in Figure 8.10(a), all three programs took more time to program on average with confirmations (top bar in each program group). From the user study data, we see that the use of confirmations increases total programming time by approximately 50%. Although the users paid a penalty in terms of programming time, the absence of such safety checks would have meant greater risk to the system and a higher probability of task failure. This was illustrated in every case where the system issued a confirmation request; one example is a trial of Program 3 by User 2. The input to the system was given as "LEFT 9 RIGHT 3 MOVIE 3 FOLLOW FOLLOW 9 UP GPSFIX EXECUTE", where the mistakes are in bold. The system took note of the change in duration from 6 × 3 = 18 seconds to 9 × 3 = 27 seconds on two occasions, but more importantly, the FOLLOW command was issued without a TUNETRACKER command. This, and the change of parameters to higher values, prompted the system to generate a confirmation request, which helped the user realize that mistakes had been made in programming. A subsequent reprogramming fixed the mistakes and the task was successfully accepted without a confirmation. The distribution of confirmation requests and the total number of programming attempts are shown in Figure 8.10(b). To further establish the benefits of this approach, we introduce a metric termed the Error Filter Rate (EFR). The EFR is the ratio of the number of confirmations to the number of mistakes made by users during a programming task; i.e.,

EFR = Confirmations / Total Errors.
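For concreteness, the EFR computation can be sketched as follows; the Trial structure and its field names are hypothetical stand-ins for the logged study data, not the actual analysis code.

#include <vector>

// Illustrative Error Filter Rate (EFR) computation: the fraction of user
// programming errors for which the dialog engine requested a confirmation.
struct Trial {
    int errors;         // mistakes made while entering the program
    int confirmations;  // confirmation requests issued for those mistakes
};

double errorFilterRate(const std::vector<Trial>& trials) {
    int totalErrors = 0, totalConfirmations = 0;
    for (const Trial& t : trials) {
        totalErrors += t.errors;
        totalConfirmations += t.confirmations;
    }
    if (totalErrors == 0) return 0.0;   // no errors recorded: report 0 rather than divide by zero
    return static_cast<double>(totalConfirmations) / totalErrors;
}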


(a) Programming times, all users combined.

(b) Programming attempts and generated confirmations, all users combined.

Figure 8.10. Results from user studies, timing 8.10(a) and confirmations 8.10(b).


The EFR indicates the percentage of erroneous inputs that the system deemed to be dangerous; in other words, a low EFR value does not necessarily indicate a low error rate in programming, but rather that most of the commands to the robot were interpreted as low-risk. In our studies, we achieved an EFR of approximately 72.8 per cent, as can be seen in Fig. 8.11, indicating that the system interpreted roughly 72% of the erroneous commands as high-risk and intervened (with confirmation dialogs) to confirm the user's true intent. One of the tangential issues in the study was the effect of long programming sequences on the mistakes made by the users; i.e., whether more mistakes were made in longer programs. While this might seem like an obvious conclusion, we did not observe this behavior in the user trials, as demonstrated in Fig. 8.12. More detailed analysis and further experiments are required to obtain a definitive answer to this question.

Figure 8.11. Error filter rate plot over all user studies data.


Figure 8.12. Mistakes with respect to program length, all users combined.

During the field trials, we were not able to collect quantitative data: the overheads of underwater deployment, diver training, extreme task loading, personnel safety concerns, and the absence of accurate data collection tools were a few of the insurmountable difficulties we faced. However, the system consistently generated confirmations based on the cost of the task. In the underwater environment, where divers are cognitively loaded with maintaining dive gear and other life-support tasks, having feedback on input and the ability to start over proved to be especially important. These two features relieved some of the burden of programming, and also ensured correct task execution by the robot, as the diver could restart programming in case of mistakes.


Figure 8.13. The Aqua robot with ARTag markers used for RoboChat token delivery.

8.3. Conclusion

In this chapter, we have presented the human-interaction studies and field-trial validations of RoboChat and of an algorithm for safe task execution under input uncertainty. These two algorithms constitute the core interaction mechanism in our framework, enabling users to communicate with robots by sending commands and engaging in dialog to ensure their commands are correctly interpreted by the robot. It is our belief that the results presented in this chapter clearly demonstrate the usability and utility of these two approaches, particularly when validated on the Aqua robot in an environment as challenging as the underwater domain. The restrictions posed by the underwater medium are quite severe even for human-human communication, and they also create difficulties in using established communication methods such as radio or electronic visual cues. RoboChat, together with the dialog engine, provides a simple yet highly effective means of communication, requiring nothing more than fiducials printed on cards.

This makes for an extremely low-cost deployment (Fig. 8.13 shows the Aqua robot with a set of tags), and the intuitive nature of the gestures allows human operators to adapt to the system relatively easily. The results of the user studies on RoboChat demonstrate this fact. The dialog mechanism, as an adjunct to RoboChat, ensures that high-risk tasks are not executed without obtaining confirmation from the user, thus ensuring operational safety. Both systems have been implemented to run in real time on-board the Aqua vehicle, and are available as a robot operating mechanism. In the next chapter, we shift our focus from direct to indirect human-robot communication and present experimental results for the Fourier and Spatio-Chromatic trackers, both of which belong to the class of implicit interactions.


CHAPTER 9

Experiments and Evaluations – Implicit Interactions

Throughout this thesis, we have presented algorithms for vision-based human-robot interaction. The previous chapter presented experimental validation of one part of this collection of algorithms for HRI, where the communication between the robot and a human takes place via explicit instructions. That is, the human operator instructs the robot to carry out a task or set of tasks using an explicit, well-defined means of communication. While explicit communication is important for error-free, unambiguous interaction, a variety of other algorithms are necessary for establishing a seamless operational interface between a human and a robot. As such, one part of the research presented here has investigated methods of implicit visual interaction. This chapter presents the experimental results of these proposed techniques, namely the Spatio-Chromatic and Fourier trackers. In a similar vein to the previous chapter, the results reported here are evaluated both off-board and on-board, with the on-board experiments focusing on qualitative evaluation of the system.

9.1. Evaluations of the Spatio-Chromatic Tracker

In this section, we describe the experimental setup used to validate our boosted tracking algorithm. The training and validation processes are explained in detail, with accuracy results based on underwater footage.

9.1.1. Training. To train the system, we use video sequences containing different target objects in aquatic environments, approximately 2000 frames for each target. Six different target objects of varying color characteristics were used, resulting in approximately 12000 training frames. The color-channel trackers are trained on these images using a variant of the AdaBoost algorithm [35], which results in a set of weak trackers and their corresponding weights. Our implementation of AdaBoost is written in C++, with the VXL vision library providing image-processing support. Ground-truth data for training is obtained manually, by grabbing frames from the video files (with dimensions of 1024 × 768 pixels), inspecting them, and setting the bounds by hand. This data is supplied to the trainer as an uncompressed plain-text file. The ground-truth data contains the target boundaries, given by the (x, y) coordinates of the top-left and bottom-right corners, as input data (i.e., X), and as output Y ∈ {−1, +1}, where −1 indicates the target is not in the frame. In such cases, the target rectangle coordinates are set to negative values. The training method works as follows: each tracker outputs a target center (x, y) coordinate. If the target location falls within the boundary of the target region in the training data for a given instance, or is within ±10% of the original boundary size, we assume an output of +1. Any other output location is treated as misclassified.

Based on this training logic, the AdaBoost algorithm trains the system for either 4000 rounds, or until the improvement in the training error is less than ε = 0.01 (although in our experiments, the training algorithm rarely ran the full 4000 iterations).
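As a rough illustration of this training loop, the following C++ sketch runs a standard discrete AdaBoost over pre-computed weak-tracker agreements with the ground truth. The data structures are simplified, and the stopping test uses the selected weak tracker's weighted error as a proxy for the training error, so it should be read as an approximation of the procedure described above rather than the thesis implementation.

#include <algorithm>
#include <cmath>
#include <vector>

// A weak tracker's per-frame agreement with ground truth: +1 if its predicted
// target centre fell inside the annotated region (with the +/-10% margin
// already applied), -1 otherwise.
struct WeakTracker {
    std::vector<int> prediction;
};

struct BoostedTracker {
    std::vector<int> chosen;     // indices of selected weak trackers
    std::vector<double> alpha;   // their weights
};

BoostedTracker trainAdaBoost(const std::vector<WeakTracker>& weak,
                             const std::vector<int>& label,     // +1 / -1 per frame
                             int maxRounds = 4000,
                             double minImprovement = 0.01) {
    const std::size_t n = label.size();
    std::vector<double> w(n, 1.0 / n);        // per-frame weights
    BoostedTracker strong;
    double prevError = 1.0;

    for (int round = 0; round < maxRounds; ++round) {
        // Pick the weak tracker with the lowest weighted error.
        int best = -1;
        double bestErr = 1.0;
        for (std::size_t j = 0; j < weak.size(); ++j) {
            double err = 0.0;
            for (std::size_t i = 0; i < n; ++i)
                if (weak[j].prediction[i] != label[i]) err += w[i];
            if (err < bestErr) { bestErr = err; best = static_cast<int>(j); }
        }
        if (best < 0 || bestErr >= 0.5) break;     // no useful weak tracker left

        double a = 0.5 * std::log((1.0 - bestErr) / std::max(bestErr, 1e-12));
        strong.chosen.push_back(best);
        strong.alpha.push_back(a);

        // Re-weight frames: mistakes gain weight, correct frames lose weight.
        double z = 0.0;
        for (std::size_t i = 0; i < n; ++i) {
            w[i] *= std::exp(-a * label[i] * weak[best].prediction[i]);
            z += w[i];
        }
        for (double& wi : w) wi /= z;

        // Stop early once the error stops improving appreciably.
        if (round > 0 && prevError - bestErr < minImprovement) break;
        prevError = bestErr;
    }
    return strong;
}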

9.1.2. Tracking Experiments. We have tested our boosted tracker on video sequences containing a target object in an aquatic environment. Validation data are supplied in the same format as the training data, containing the limits of the target object in a plain-text file. The tracking algorithm is run on the validation set and its output is compared to the ground truth. For real-time servo control, we define the error measures for both the X and Y axes as the difference between the tracker output and the center of the image frame (the set-point). The goal of visual servo control is to minimize these errors. The error values are fed into two Proportional-Integral-Derivative (PID) controllers, one for each axis of motion, which generate pitch and yaw commands for the robot controller. The robot controller accepts the pitch and yaw commands from the PID controllers and adjusts the leg positions accordingly to achieve the desired robot pose. A schematic outline of this process can be seen in Fig. 9.1.
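A minimal sketch of this servoing step is shown below; the gains, structure names and command interface are illustrative assumptions, not the controller actually deployed on the robot.

// Pixel error between the tracked target and the image centre is converted
// into pitch and yaw commands by two independent PID loops.
struct PID {
    double kp, ki, kd;
    double integral = 0.0, prevError = 0.0;

    double update(double error, double dt) {
        integral += error * dt;
        double derivative = (error - prevError) / dt;
        prevError = error;
        return kp * error + ki * integral + kd * derivative;
    }
};

struct Command { double pitch; double yaw; };

// (tx, ty) is the tracker output in pixels; the image centre is the set-point.
Command servoStep(PID& pitchPid, PID& yawPid,
                  double tx, double ty,
                  double imgWidth, double imgHeight, double dt) {
    double errX = tx - imgWidth / 2.0;    // horizontal error drives yaw
    double errY = ty - imgHeight / 2.0;   // vertical error drives pitch
    return Command{ pitchPid.update(errY, dt), yawPid.update(errX, dt) };
}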

9.1.3. Experimental Snapshots. We have evaluated the boosted tracker using several different datasets with different targets, containing both synthetic and natural images. The performance of the boosted tracker is compared against the non-boosted color segmentation ("blob") tracker [76] in terms of tracking accuracy. The test video sequences were acquired from field trials of the Aqua robot in the Caribbean Sea, near Barbados. To demonstrate the performance improvement achieved by the boosting process, we present two short sequences of underwater tracking data. In the example sequence shown in Fig. 9.2, the non-boosted color blob tracker fails to lock on to the target, as seen in the fourth image of the sequence, possibly due to a change in illumination.


Figure 9.1. Overall system architecture of the Spatio-Temporal tracker, showing the off-line components and the PID controller.

Figure 9.2 (bottom row) shows the output of the boosted tracker on the same sequence; it has successfully found the target in the fourth frame. Figure 9.3 shows examples of the robot tracking the flippers of the scuba diver. While the non-boosted blob tracker performs reasonably well in tracking the flippers (Fig. 9.3, top), the boosted tracker has better accuracy, as demonstrated in the second row of Fig. 9.3. We consider the accuracy rate for both the boosted tracker and the color segmentation tracker, and also record the number of false positives detected by each tracker. The results are shown in Fig. 9.4(a) and Fig. 9.4(b), respectively.


(a) Frame 1 (b) Frame 2 (c) Frame 3 (d) Frame 4

Figure 9.2. Comparison of the non-boosted color segmentation tracker (top row) with the boosted version (bottom row) on a sequence of four images. The target in the last frame is missed by the non-boosted tracker, but the boosted tracker is able to detect it.

Another point of interest is the spread of the different types of trackers chosen by the AdaBoost process. In Fig. 9.5, we present the percentage of each family of trackers chosen after training terminates, on the diver-following dataset depicted in Fig. 9.3. The total number of each tracker type is presented, across different values of hue.

(a) Frame 1. (b) Frame 2. (c) Frame 3. (d) Frame 4.

Figure 9.3. Top row: Output of non-boosted color segmentation tracker on a video sequence of a diver swimming, shown by the yellow squares. Bottom row: Output of boosted color segmentation tracker of the diver tracking sequence, shown by the white squares.


(a) Frames with correct tracking.

(b) Frames tracking false positives.

Figure 9.4. Tracking results showing accuracy (top) and false positives (bottom), and the superiority of the boosted tracker (blue bars) as compared to the non-boosted tracker (red bars).


Figure 9.5. Percentage of each type of tracker (shown left) chosen by AdaBoost on the diver following dataset shown in Fig. 9.3.

9.1.4. Tracking Results. We use three targets of different colors to quantitatively evaluate the boosted tracker. As shown in Fig. 9.4(a), the boosted tracker performs better than the standard segmentation-based algorithm in tracking each type of colored target. In Figure 9.4(b), we see that the boosted tracker detects fewer false positives for targets in Datasets 2 and 3, but more for targets in Dataset 1. Dataset 1 contained a significant number of predominantly yellow targets, and one possible explanation for this error is the similarity in color between the surface reflection and the target being tracked. In a significant number of frames, the robot is actually directed upwards, capturing the surface and the light falling on it. The surface lighting effect exhibits characteristics similar to a yellow target and is classified as a false positive. Examples of this phenomenon can be seen in Fig. 9.6.

9.2. Experiments with the Fourier Tracker

This section presents the results of experiments with the Fourier tracker, the visual algorithm for following scuba divers in underwater scenes. The algorithm has also been experimentally validated on video footage of divers swimming in open-water and closed-water environments.

Both types of video sequences pose significant challenges due to the unconstrained motion of the robot and the diver, and to poor imaging conditions, observed particularly in the open-water footage because of suspended particles, water salinity and varying lighting conditions. The algorithm outputs the direction corresponding to the most dominant biological motion present in the sequence, together with the most likely location of the entity generating the motion response. Since the Fourier tracker looks backward in time every N frames to find the new direction and location of the diver, its output only becomes available after a "bootstrap phase" of N frames.

Figure 9.6. Top row: Two sample testing images from Dataset 1, with yellow targets. Bottom row: Output of one of the trackers chosen by AdaBoost. While the target is still detected by the tracker, so are a variety of background objects, which affects tracking accuracy.

We present the experimental setup in Sec. 9.2.1, and the findings and results in Sec. 9.2.2.

9.2.1. Experimental Setup. We conduct experiments off-line on video sequences recorded from the cameras of an underwater robot. The video sequences contain footage of one or more divers swimming in different directions across the image frame, which makes them suitable for validating our approach. We run our algorithm on a total of 2530 frames of a diver swimming in a pool, and 2680 frames of a diver swimming in the open ocean, collected from open-ocean field trials of the robot. In total, the frames amount to over 10 minutes of video footage across both environments. The Xvid-compressed video frames have dimensions of 768 × 576 pixels, the detector operated at a rate of approximately 10 frames per second, and the time window for the Fourier tracker in this experiment is 15 frames, corresponding to approximately 1.5 seconds of footage. Each rectangular sub-window is 40 × 30 pixels in size (one-fourth in each dimension). The sub-windows do not overlap each other along the trajectory in a given direction. Ground truth for evaluation was obtained by hand, as discussed in Sec. 9.2.3.1.

For visual servoing off the responses from the frequency operators, we couple the motion tracker with a simple Proportional-Integral-Derivative (PID) controller, similar to the case of the Spatio-Temporal tracker. The PID controller accepts image-space coordinates as input and outputs motor commands for the robot such that the error between the desired position of the tracked diver and the current position is minimized. While essential for following any arbitrary target, the servoing technique is not an integral part of this motion detection algorithm, and thus runs independently of any specific visual tracking algorithm.
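To make the frequency test concrete, the following sketch scores one sub-window trajectory by the low-frequency amplitude of its mean-intensity time series; the naive DFT, the choice of low-frequency bins and the function name are simplifying assumptions rather than the exact detector implementation.

#include <cmath>
#include <vector>

// The mean intensity of a sub-window is sampled over lambda frames, a
// discrete Fourier transform is taken, and the amplitude in the low-frequency
// bins (excluding DC) is accumulated as the score for that direction/location.
double lowFrequencyEnergy(const std::vector<double>& intensity,   // lambda samples
                          int lowBins = 3) {                      // bins 1..lowBins
    const std::size_t n = intensity.size();
    const double twoPi = 2.0 * 3.14159265358979323846;
    double energy = 0.0;
    // Only the low bins below the Nyquist limit are needed; DC (k = 0) is skipped.
    for (int k = 1; k <= lowBins && k < static_cast<int>(n) / 2; ++k) {
        double re = 0.0, im = 0.0;
        for (std::size_t t = 0; t < n; ++t) {
            double phase = twoPi * k * t / n;
            re += intensity[t] * std::cos(phase);
            im -= intensity[t] * std::sin(phase);
        }
        energy += std::sqrt(re * re + im * im);   // amplitude of bin k
    }
    return energy;
}

In the full tracker, the direction and location whose chain of sub-windows yields the strongest such low-frequency response would be reported as the diver's motion.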


9.2.2. Results. Figure 9.7 shows a diver swimming along a diagonal direction away from the camera, as depicted by the green arrow. No part of the diver falls along the direction shown by the red arrow, and as such there is no component of motion present in that direction. Figures 9.8(a) and 9.8(b) show the Fourier filter output for those two directions, respectively (the green bars correspond to the response along the green direction, and similarly for the red bars). The DC component and the symmetric half of the FFT above the Nyquist frequency have been manually removed. The plots clearly show a much higher low-frequency response along the direction of the diver's motion, and an almost negligible response in the low frequencies (in fact, in all frequencies) in the direction containing no motion component, as seen from the amplitude values.

Figure 9.7. Snapshot image showing the direction of a swimmer's motion (in green) and an arbitrary direction without a diver (in red).

Note that the lane markers on the bottom of the pool (which appear periodically in the image sequence) do not generate frequency responses strong enough to be categorized as biological motion along the direction of the red line.

(a) Frequency responses along the motion of the diver, depicted by the green arrow in Fig. 9.7.

(b) Frequency responses along the direction depicted by the red arrow. Note the low amplitude values.

Figure 9.8. Contrasting frequency responses for directions with and without diver motion in a given image sequence.

In Fig. 9.9(a), we demonstrate the performance of the detector in tracking multiple divers swimming in different directions. The sequence shows a diver swimming in a direction away from the robot, while another diver swims in front of her across the image frame, in an orthogonal direction.


Direction         Lowest-frequency amplitude response
Left-to-right     205.03
Right-to-left     209.40
Top-to-bottom     242.26
Up-from-center    251.61
Bottom-to-top     281.22
Table 9.1. Low-frequency amplitude responses for multiple motion directions.

The amplitude responses obtained from the Fourier operators along the directions of motion, at the fundamental frequency, are listed in ascending order in Tab. 9.1. The first two rows correspond to the direction of motion of the diver going across the image, while the bottom three rows represent the diver swimming away from the robot. As expected, the diver closest to and unobstructed from the camera produces the highest responses, but the motion of the other diver also produces significant low-frequency responses. The other 12 directions exhibit negligible amplitude responses at the proper frequencies compared to the directions presented in the table. The FFT plots for motion in the bottom-to-top and left-to-right directions are shown in Figs. 9.9(b) and 9.9(c), respectively. Some additional Fourier detector responses are shown in Fig. 9.11. As before, the FFT plots have the DC component and the symmetric half removed for presentation clarity.

An interesting side-effect of the Fourier tracker is the effect of the diver's distance from the robot (and hence the camera) on the low-frequency signal. Close proximity to the robot (i.e., the camera) results in a lower variation of the intensity amplitude, and thus the resulting Fourier amplitude spectrum exhibits lower energy in the low-frequency bands. Figure 9.10 shows two sequences of scuba divers swimming away from the robot, with the second diver closer to the camera. The amplitude responses have similar patterns, exhibiting high energy in the low-frequency regions.


(a) An image sequence capturing two divers swimming in orthogonal directions.

(b) Frequency responses for the diver swimming away from the robot (red cross) in Fig. 9.9(a). (c) Frequency responses for the diver swimming across the robot (yellow cross) in Fig. 9.9(a).

Figure 9.9. Frequency responses for two different directions of diver motion in a single image sequence.

The spectrum on top, however, has more energy in the low-frequency bands than the one on the bottom, where the diver is closer to the camera.

9.2.3. Performance Evaluation.




Figure 9.10. Effect of the diver's distance from the camera on the amplitude spectra. Being farther away from the camera produces higher energy responses (Fig. 9.10(b)) in the low-frequency bands, compared to divers swimming closer (Fig. 9.10(d)).

9.2.3.1. Datasets. To measure tracker accuracy and timing, we conducted a set of experiments by running the tracker on three different datasets (with available ground truth) of scuba divers swimming. These datasets were collected in a variety of environmental conditions: one dataset is of a diver swimming in a pool, and the others are of scuba divers swimming in open-ocean environments. Each dataset contains approximately 3000 frames, which accounts for a total of about 40 minutes of footage.




Figure 9.11. Additional instantaneous amplitude responses at various times during diver tracking. Figures 9.11(a) and 9.11(b) are Fourier signatures of the diver's flippers, whereas Figures 9.11(c) and 9.11(d) are examples of random (i.e., non-diver) locations.

The datasets were collected from the robot's on-board cameras, running at a rate of 15 frames per second, and the frames have dimensions of 720 × 480 pixels. While the footage is in color, we normalize the images to grayscale and use only the luminosity channel (the average of the R, G and B channels) for Fourier tracking, discarding all color information.
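The luminosity conversion mentioned above amounts to a per-pixel average; a small sketch follows, with the interleaved-RGB buffer layout assumed purely for illustration.

#include <vector>

// Reduce a colour frame to its luminosity channel: the plain average of the
// R, G and B values of each pixel, as described above.
std::vector<double> toLuminosity(const std::vector<unsigned char>& rgb) {
    std::vector<double> gray;
    gray.reserve(rgb.size() / 3);
    for (std::size_t i = 0; i + 2 < rgb.size(); i += 3)
        gray.push_back((rgb[i] + rgb[i + 1] + rgb[i + 2]) / 3.0);
    return gray;
}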


9.2.3.2. Experiments. We created 15 configurations of the Fourier tracker by setting the RunLength (i.e., λ) and BoxSize (i.e., κ) arguments to different values, and ran all 15 configurations on the three datasets mentioned above. The λ parameter was set to one half of, the same as, and twice the camera frame-rate, corresponding to λ values of 8, 15 and 30 frames. The κ parameter was set to maintain an aspect ratio of 5:4, at 5 different scales, with dimensions of 10 × 8, 20 × 16, 40 × 32, 80 × 64, and 160 × 128 pixels. We computed running time and accuracy for all configurations. In the experiments, we used the UKF component to track targets after the Fourier detector outputs the diver's initial location. The accuracy measure includes the output of the UKF, with the Fourier detector serving as the measurement for the filter. Thus, while we vary the two parameters of the detector, the UKF parameters remain unchanged throughout the experiments. To obtain a representative evaluation of the tracker's performance, we executed the test runs on-board the Aqua amphibious robot, thus ensuring that execution occurs on the exact computing hardware available on the robot. The tracker is implemented in C++, and is part of a larger vision-based human-robot interaction framework called VisionSandbox [81]. The VisionSandbox framework is a collection of algorithms for visual tracking, diver following, learning-based object detection [79], gesture-based human-robot communication [24] and risk assessment in human-robot dialogs [83]. Performance data was collected and stored in MATLAB .mat files for off-line analysis.

9.2.3.3. Results. Figure 9.12(a) shows the time taken by the Fourier tracker per tracking sequence (i.e., over a length of λ frames). We observe from the experiments that the time taken by the tracker increases linearly as the λ parameter is increased. As can be seen, the time taken at each value of λ virtually doubles with the doubling of the λ parameter for κ values up to 80 × 64 pixels, and is almost five times higher when κ is set to 160 × 128 pixels. This is easily explained: the output of the tracker is not available until a vector of intensity values has been accumulated, and such vectors only become available every λ frames.

If the camera operates at a frame-rate of C_f, then the time T_iv taken for each intensity vector to become available is

T_iv = RunLength / C_f.

Thus, for a given camera (i.e., a given frame-rate), the time taken by the Fourier tracker is directly proportional to the value of the λ parameter. The effect of varying the κ parameter is not as linear, as shown in Fig. 9.12(b). For the first three κ values of 10 × 8, 20 × 16 and 40 × 32 pixels, execution time does not increase proportionally. On the other hand, the time required for the 80 × 64 κ is almost double that required for the 40 × 32 κ. For the sake of brevity, the plots are labeled with the width of the κ parameter only (i.e., "BoxSize 80" denotes a κ value of 80 × 64 pixels). The timing data suggest that a larger value of κ and a longer λ would be detrimental to the real-time performance we seek. Smaller values of both result in the Fourier tracker running at approximately 10 frames per second, with the Aqua vehicle receiving control inputs from the Fourier tracker at the same rate. For the aquatic environment, and considering the dynamics of the robot, a command rate of 10 Hz is more than sufficient to control the robot to follow targets based on visual input; given that a scuba diver swims at a sustained speed of less than 0.5 meters per second [4, 64], even a tracking rate of 1 Hz is sufficient for the robot to keep track of a diver using the Fourier tracker. From the timing data, it is thus evident that for all κ and λ values we tested, the Fourier tracker is able to run sufficiently fast.
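The relation above is simple arithmetic; the short program below evaluates it for the three RunLength values tested, assuming the 15 frames-per-second camera rate reported earlier (the program itself is only a worked example, not part of the tracker).

#include <cstdio>

// For a camera frame-rate C_f, each intensity vector (and hence each
// detection) becomes available every T_iv = RunLength / C_f seconds.
int main() {
    const double cameraFps = 15.0;            // C_f, matching the dataset frame-rate
    const int runLengths[] = {8, 15, 30};     // lambda values tested
    for (int lambda : runLengths) {
        double tIv = lambda / cameraFps;      // seconds per detection
        std::printf("RunLength %2d -> one detection every %.2f s (%.2f Hz)\n",
                    lambda, tIv, 1.0 / tIv);
    }
    return 0;
}

At λ = 15 this gives one detection per second, which is consistent with the 1 Hz tracking rate argued above to be sufficient for following a slowly swimming diver.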


(a) Timing with different RunLength values, all BoxSizes.

(b) Timing with different BoxSize values, all RunLengths.

Figure 9.12. Effect of the λ and κ parameters on tracker timing, with time taken shown per detection.

Consequently, the accuracy of the tracker for these different configurations becomes the significant statistic. We executed the Fourier tracker on the same datasets, annotated with ground-truth locations for the diver, and measured the number of frames in which the tracker was able to correctly locate the diver.

(a) Accuracy with different RunLength values, all BoxSizes.

(b) Accuracy with different BoxSize values, all RunLengths.

Figure 9.13. Effect of the RunLength and BoxSize parameters on tracker accuracy.

The results are summarized in Fig. 9.13. Figure 9.13(a) shows the accuracy rate of the Fourier tracker with different run lengths.

A λ value of 15 yields the highest accuracy, while the λ value of 30 frames is by far the worst performer. This argues against using a longer duration for Fourier signature detection. On the other hand, as shown in Fig. 9.13(b), the difference in accuracy across κ values is quite significant. A κ value of 80 × 64 pixels (shown as "BoxSize 80") yields very high accuracy across the entire dataset, and across all λ values. The κ values below and above 80 × 64 pixels show decreasing accuracy, indicating no benefit in arbitrarily increasing κ. Moreover, from the timing data, it is evident that a larger κ contributes to a slower-running tracker, making larger values of κ undesirable.
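The accuracy measure used in this comparison can be sketched as follows, under the assumption that a detection counts as correct when the reported centre falls inside the annotated ground-truth rectangle (reusing the negative-coordinate convention from the training data to mark frames without a target); the structures are hypothetical stand-ins for the annotation format.

#include <vector>

// Fraction of annotated frames in which the tracker's reported centre lies
// inside the ground-truth rectangle for that frame.
struct Rect { double x1, y1, x2, y2; };   // top-left and bottom-right corners
struct Point { double x, y; };

double trackingAccuracy(const std::vector<Point>& predicted,
                        const std::vector<Rect>& groundTruth) {
    std::size_t correct = 0, total = 0;
    for (std::size_t i = 0; i < predicted.size() && i < groundTruth.size(); ++i) {
        const Rect& g = groundTruth[i];
        if (g.x1 < 0) continue;            // target absent in this frame (negative coordinates)
        ++total;
        if (predicted[i].x >= g.x1 && predicted[i].x <= g.x2 &&
            predicted[i].y >= g.y1 && predicted[i].y <= g.y2)
            ++correct;
    }
    return total ? static_cast<double>(correct) / total : 0.0;
}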

9.3. Conclusion

This chapter presented experimental results for the Spatio-Temporal and Fourier trackers, the two tracking algorithms currently included in the visual HRI framework. Both algorithms are designed for robust visual tracking, but their application domains are quite distinct. The Spatio-Temporal tracker has wider, more generic applicability, as long as enough color information is available in the visual scene. The Fourier tracker, conversely, does not use color at all, relying only on motion signatures to detect and track scuba divers. In principle, it is trivial to tune the Fourier tracker to detect a different frequency of oscillation, and thus track a different class of biological (or even non-biological) entities. From a broader perspective, the two trackers fill different niches in the visual tracking domain, but both are essential components of the visual HRI framework described in this thesis. The experimental results demonstrate the usability of both, and also provide us with ideas for improvements and extensions. For the Fourier tracker, a very feasible future direction would be to use it in terrestrial (i.e., non-underwater) environments to track and even follow humans, or to infer patterns from human activities by merely tracking people's motion trajectories. An interesting extension to the Spatio-Temporal tracker would be the possibility of using the Fourier tracker (or a version thereof) as a "weak learner" in the boosting process.

With the current accuracy of the Fourier tracker, the weak-learner criterion is easily satisfied, but computational cost is currently a large hindrance to implementing it in real time. In general, using a variety of weak trackers that track a number of different cues (e.g., color, frequency, shape, etc.) might provide a higher degree of robustness, but we have not explored that direction during the course of this research.

This concludes the experimental evaluations of the algorithms and the visual HRI framework. We conclude the thesis in the next chapter by summarizing the research and providing some thoughts on future avenues that can be explored towards enhanced visual human-robot interaction.


CHAPTER 10

Conclusions

10.1. Summary

This thesis presents a multi-layer framework for human-robot interaction, using vision as the primary communication medium. The framework classifies a suite of algorithms (mostly based on machine vision) by their frequency of invocation during an interaction scenario; that is, how often the algorithm executes as a robot attempts to establish robust and efficient communication with a human operator. Similar to the classic three-layer architectures in robotics systems (e.g., Brooks' Subsumption architecture [11]), this framework also includes three distinct layers. Unlike Subsumption, however, where "behaviors" from one layer completely "subsume" behaviors from a lower-priority layer, our framework allows for communication between algorithms in different layers (see Chapter 2), and does not strictly enforce task preemption. The core of this research lies not only in the formulation of the architecture, but also in the formulation and implementation of various algorithms that demonstrate the applicability of this framework towards successful human-robot interaction. Algorithms for vision-based robot programming, dialog management, person following and learned target tracking are presented in Chapters 3 to 6 of this thesis, and these algorithms constitute the principal algorithmic contributions of this work. Experimental results, both on-board and off-board, demonstrate success to a varying degree, show promise in the current thread of research, and also point to future directions and improvements to the current techniques. To keep the thesis coherent and focused on the algorithmic contributions, the discussion of systems development has been kept to a minimum, although the amount of work required to implement the framework (and the accompanying algorithms) on-board the Aqua vehicle was substantial, including approximately 50K lines of code and the installation of a varied set of hardware sensors and supporting software drivers.

The state of robotics today is very similar to the state of personal computers at the beginning of the 1980s, and research in human-robot interaction carries the ultimate goal of making robots work in human-centric environments. It is our belief that any algorithm for human-robot interaction should be validated on "real" robots: robots that have the demonstrated ability to work outside the research lab, in arbitrary environments. The Aqua robot provides us with a robust electro-mechanical platform and a varied set of sensory capabilities. Multiple field trials of the robot have proven its usability in a variety of environmental conditions, from the tropics to the Arctic, and have proven its worth as a useful companion to human operators. Particularly in the shallow underwater environment, Aqua has demonstrated its use through a variety of tasks, with ship hull inspection, pipeline and cable monitoring, and surveillance of marine ecosystems being a few.


The task of inspecting the health of marine life, both animal and plant, has drawn particular attention, as marine biologists have found the use of a robotic vehicle significantly advantageous. Surveys at coral reefs typically require a marine biologist, in full scuba gear and with audio or video recording equipment, to stay submerged for a number of hours. This creates difficulty in data collection and poses significant risk to the diver, as she not only has to operate the required recording apparatus, but also has to maintain life-support equipment. Staying under water for more than 90 minutes requires extra oxygen tanks, and prolonged stays are also very likely to cause a drop in body temperature, causing discomfort at the very least; extended underwater dives may well result in more serious physiological damage. Aqua is capable of approximately six hours of autonomous operation on a single battery charge, and can acquire visual and other sensory data on-board, or transmit them off-board, for as long. We have performed numerous field tests of divers working with Aqua on different missions, using RoboChat to program the robot to carry out a variety of tasks. Given the capabilities provided by the visual HRI framework, one can imagine a diver "teaching" Aqua visually by performing a surveillance task, and then sending the robot on subsequent missions of longer duration, thus removing the human from potentially unsafe conditions. This scenario captures one of the motivations of our research: removing a human from harm's way and using a robot to carry out operations that would otherwise pose significant risk to human beings. From the experimental validations of the visual HRI framework and its component algorithms, we believe that this research brings us closer to that goal.

10.2. Future Directions

The contributions made through this research are a step towards achieving efficient human-robot interaction; however, there is clear room for improvement and for further research to enhance our framework. The problem we tackled is that of a robot and a human working together, with the robot acting as an assistant, aiding the human in performing tasks, in some cases tasks that could be hazardous for the human.

To establish direct dialog between them, we proposed the RoboChat scheme together with a model for confirmations under uncertainty. These two methods complement each other in creating a robust, safe interaction modality, although one could see both the language itself and the confirmation model being improved. RoboChat includes a number of features that make it attractive as a visual language, but in its current state it does not readily support expansion to new language constructs. Such a feature would be useful as the robot is equipped with new capabilities and sensors, thereby extending the range of possible operations. It is also not trivial to manipulate certain robot parameters, which are often specific to the platform on which RoboChat is operating. This is a more difficult problem to address, as the control of robot-specific parameters would require a reduction of the inherent generality that RoboChat provides. A "middleware" layer that interprets RoboChat commands and in turn converts them into platform-specific commands would be an elegant solution, in that it would preserve RoboChat's generic nature. This approach would, however, require each robot to have its own middleware layer, which might need significant development. Using a standard robotics middleware such as the Robot Operating System (ROS) [66] would alleviate the problem to some degree. The exact form and content of the confirmation message is also currently an open problem. We have minimized the length and complexity of the dialog feedback to reduce cognitive load. However, the context of the exact confirmation message is not clearly comprehensible in this form; that is, the true reasons for the confirmations are not conveyed to the user. Often, the system assesses a task as high-cost based on one part of the input, particularly for long input phrases. The current feedback system either simply prompts for confirmation, or repeats the entire input as feedback before prompting for confirmation. In many situations, this can be redundant and create excessive cognitive load for the user.

A streamlined form that only asks for confirmation of the part of the input that carries high uncertainty is more desirable in such circumstances. This, however, is not trivial, as processing the sentence requires a detailed analysis of the semantics of the particular language being used, which might prove to be computationally infeasible. That said, it is a challenging and rewarding direction of research worth investigating. Incorporating methods to express a probability distribution over an arbitrary language is also a challenging undertaking, although there have been recent developments [54] in this direction.

The Fourier tracker has shown great promise in tracking scuba divers, and our intent is to apply this tracker in terrestrial domains. The key challenge here is the extraction of oscillations as signatures of biological motion. Visual sensing has been used in underwater domains on-board the Aqua robot towards this goal, but one can envision a variety of sensors being used for similar purposes. Multimodal sensors such as RGB-D cameras, as popularized by the Microsoft Kinect, have seen adoption by the robotics community in recent times, and the outcomes of our own preliminary evaluations were positive as well. It would be trivial to equip a terrestrial (i.e., non-underwater) robot with such a sensor and perform people tracking using the Fourier tracker. The specific case of RGB-D sensors opens up the possibility of multimodal interactions, and in general that is a rich future direction of research. Irrespective of the particular platform, using multiple sensory sources for human-robot interaction has the potential for information-rich communication. Similar to sensor fusion in various robotics tasks (e.g., SLAM or state estimation), multiple information sources are likely to provide an HRI system with a higher degree of robustness. A single modality may be insufficient for a human operator to engage in a degree of interaction that might be deemed high-quality.

In our work, we have used vision as the primary modality for communication, but we have relied on a number of other sensors (such as IMUs, depth gauges, and other proprioceptive sensors) to execute commanded tasks. In other words, while vision has been used as the explicit tool for interaction, implicit interactions have relied on a number of other sensors, including vision. In the future, we intend to extend the use of multiple modalities to explicit interaction as well, particularly towards human-behavior extraction techniques to establish user intent.

Another particularly appealing domain of future HRI research involves multi-robot scenarios; that is, a group of robots, possibly of varied characteristics, interacting with their human operator or operators to perform coordinated tasks. Examples of such scenarios can be imagined in production assembly lines, large-scale surveillance and inspection, and search-and-rescue problems, among others. An interesting feature of this class of problems is that the robots not only have to interact with humans, but quite possibly with each other as well. This "robot-robot interaction" problem has been addressed in depth in the literature, in the form of multi-robot exploration [72], cooperative localization [71], data muling [26] and so forth. Nevertheless, a reinvestigation of such problems may be of interest in the broader context of human-robot interaction. Currently, we are looking at multi-robot exploration scenarios for marine environments, with cooperation between a human operator and different classes of autonomous marine vehicles, particularly between autonomous underwater vehicles (AUVs) and autonomous surface vehicles (ASVs) (as seen in Fig. 10.1).

10.3. Final Thoughts

We have presented a framework for visual human-robot interaction and a number of methods towards robust communication between a robot and its human operator or user. Applications of such a system can be envisioned in domains where the robot works as an assistant to a human, aiding in task execution and, in some cases, removing the human from potentially hazardous environmental conditions.


Figure 10.1. The Aqua robot swims under an Autonomous Surface Vehicle. The robot operator can be seen swimming on the surface.

We have relied on a variety of core techniques to create this framework and to provide a mobile robot with capabilities that require minimal user intervention for deployment. In a number of field trials, the usability of our system has been demonstrated: a single scuba diver has been able to operate the robot with the help of a set of fiducials and perform a variety of tasks. In general, as robots become more mainstream and the robotics industry stands on the verge of explosive growth, better and more efficient algorithms are required for human-robot coexistence in predominantly human environments. The potential for future developments in human-robot interaction is thus immense, arising from both practical and theoretical needs.


REFERENCES

[1] Coral Reef Conservation Act of 2000. P.L. 106-562; 16 U.S.C. 6401 et seq., December 2003.
[2] A Roadmap for US Robotics: From Internet to Robotics, May 2009.
[3] R. Altendorfer, N. Moore, H. Komsuoglu, M. Buehler, H. B. Brown Jr., D. McMordie, U. Saranli, R. Full, and D. E. Koditschek. RHex: A biologically inspired hexapod runner. Autonomous Robots, 11:207–213, 2001.
[4] B. G. Andersen. Measurement of scuba diver performance in open ocean environment. American Society of Mechanical Engineers, Papers, pages 8–16, 1969.
[5] Shai Avidan. Ensemble tracking. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 29(2):261–271, February 2007.
[6] Richard Bainbridge. The speed of swimming of fish as related to size and to the frequency and amplitude of the tail beat. Journal of Experimental Biology, 35(1):109–133, March 1958.

[7] L. E. Baum, T. Petrie, G. Soules, and N. Weiss. A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains. The Annals of Mathematical Statistics, pages 164–171, 1970.
[8] R. Begg and J. Kamruzzaman. A machine learning approach for automated recognition of movement patterns using basic, kinetic and kinematic gait data. Journal of Biomechanics, 38(3):401–408, 2005.
[9] A. Bhattacharyya. On a measure of divergence between two statistical populations defined by their probability distributions. Bulletin of the Calcutta Mathematical Society, (35):99–110, 1943.
[10] M. A. Blumenberg. Human factors in diving. Technical Report ADA322423, University of California at Berkeley, 1996.
[11] Rodney A. Brooks. Intelligence without representation. Artificial Intelligence, 47:139–159, 1991.
[12] Giorgio C. Buttazzo. Hard Real-Time Computing Systems: Predictable Scheduling Algorithms and Applications. Springer, 2nd edition, October 2004.
[13] Paul Chandler and John Sweller. Cognitive load theory and the format of instruction. Cognition and Instruction, 8(4):293–332, 1991.
[14] S. Chernova and M. Veloso. An evolutionary approach to gait learning for four-legged robots. Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 3:2562–2567, 2004.
[15] Youngkwan Cho and Ulrich Neumann. Multi-ring color fiducial systems for scalable fiducial tracking augmented reality. Presence: Teleoperators and Virtual Environments, 10(6):599–612, December 2001.
[16] David Claus and Andrew W. Fitzgibbon. Reliable fiducial detection in natural scenes. In Proceedings of the 8th European Conference on Computer Vision (ECCV'04), pages 469–480, May 2004.


[17] Dorin Comaniciu and Peter Meer. Mean Shift: A robust approach toward feature space analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(5):603–619, May 2002.
[18] Dorin Comaniciu, Visvanathan Ramesh, and Peter Meer. Kernel-based object tracking. IEEE Transactions on Pattern Analysis and Machine Intelligence, 25(5):564–575, 2003.
[19] K. G. Derpanis, R. P. Wildes, and J. K. Tsotsos. Hand gesture recognition within a linguistics-based framework. In Proceedings of the European Conference on Computer Vision (ECCV), pages 282–296, 2004.
[20] F. Doshi, J. Pineau, and N. Roy. Reinforcement learning with limited reinforcement: Using Bayes risk for active learning in POMDPs. In Proceedings of the 25th International Conference on Machine Learning, pages 256–263. ACM, New York, NY, USA, 2008.
[21] Finale Doshi and Nicholas Roy. Spoken language interaction with model uncertainty: an adaptive human-robot interaction system. Connection Science, 20(4):299–318, 2008.
[22] Richard O. Duda, Peter E. Hart, and David G. Stork. Pattern Classification. Wiley-Interscience, 2nd edition, October 2000.
[23] Gregory Dudek, Michael Jenkin, Chris Prahacs, Andrew Hogue, Junaed Sattar, Philippe Giguère, Andrew German, Hui Liu, Shane Saunderson, Arlene Ripsman, Saul Simhon, Luiz Abril Torres-Mendez, Evangelos Milios, Pifu Zhang, and Ioannis Rekleitis. A visually guided swimming robot. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 3604–3609, Edmonton, Alberta, Canada, August 2005.
[24] Gregory Dudek, Junaed Sattar, and Anqi Xu. A visual language for robot control and programming: A human-interface study. In Proceedings of the International Conference on Robotics and Automation (ICRA), pages 2507–2513, Rome, Italy, April 2007.


[25] Matthew Dunbabin, Iuliu Vasilescu, Peter Corke, and Daniela Rus. Data muling over underwater wireless sensor networks using an autonomous underwater vehicle. In Proceedings of the International Conference on Robotics and Automation (ICRA 2006), Orlando, Florida, May 2006.
[26] Matthew Dunbabin, Iuliu Vasilescu, Peter Corke, and Daniela Rus. Experiments with cooperative control of two autonomous underwater vehicles. In Proceedings of the International Conference on Robotics and Automation (ICRA 2006), Rio de Janeiro, Brazil, July 2006.
[27] R. Erenshteyn, P. Laskov, R. Foulds, L. Messing, and G. Stern. Recognition approach to gesture language understanding. In 13th International Conference on Pattern Recognition, volume 3, pages 431–435, August 1996.
[28] Mark Fiala. ARTag Revision 1, a fiducial marker system using digital techniques. Technical Report NRC 47419, National Research Council, Canada, November 2004.
[29] Mark Fiala. ARTag, a fiducial marker system using digital techniques. In Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05), Volume 2, pages 590–596, Washington, DC, USA, 2005. IEEE Computer Society.
[30] Graham D. Finlayson. Computational color constancy. In International Conference on Pattern Recognition, volume 1, pages 191–196, Barcelona, Spain, September 2000.
[31] David J. Fleet and Allan D. Jepson. Computation of component velocity from local phase information. International Journal of Computer Vision, 5(1):77–104, August 1990.


[32] Daniel Freedman and Michael S. Brandstein. Contour tracking in clutter: a subset approach. International Journal of Computer Vision, 38(2):173–186, 2000.
[33] W. T. Freeman and E. H. Adelson. The design and use of steerable filters. IEEE Transactions on Pattern Analysis and Machine Intelligence, 13(9):891–906, 1991.
[34] Yoav Freund. Boosting a weak learning algorithm by majority. In Proceedings of the Third Annual Workshop on Computational Learning Theory (COLT '90), pages 202–216, San Francisco, CA, USA, 1990. Morgan Kaufmann Publishers Inc.
[35] Yoav Freund and Robert E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119–139, August 1997.
[36] Philippe Giguère and Gregory Dudek. Clustering sensor data for terrain identification using a windowless algorithm. In Robotics: Science and Systems IV, pages 25–32. The MIT Press, 2008.
[37] Yogesh Girdhar and Gregory Dudek. ONSUM: A system for generating online navigation summaries. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 746–751, October 2010.
[38] E. Bruce Goldstein. Sensation and Perception. Wadsworth Publishing Company, 6th edition, August 2001.
[39] S. A. Hutchinson, G. D. Hager, and P. I. Corke. A tutorial on visual servo control. IEEE Transactions on Robotics and Automation, 12(5):651–670, October 1996.


[40] Michael Isard and Andrew Blake. Condensation – conditional density propagation for visual tracking. International Journal of Computer Vision, 29(1):5–28, 1998.
[41] Simon Julier and Jeffrey K. Uhlmann. A new extension of the Kalman filter to nonlinear systems. In Signal Processing, Sensor Fusion, and Target Recognition VI, pages 182–193, 1997.
[42] Rudolph Emil Kalman. A new approach to linear filtering and prediction problems. Transactions of the ASME–Journal of Basic Engineering, 82(Series D):35–45, 1960.
[43] H. Kato and M. Billinghurst. Marker tracking and HMD calibration for a video-based augmented reality conferencing system. In Proceedings of the 2nd International Workshop on Augmented Reality (IWAR 99), San Francisco, October 1999.
[44] Steve L. Kent. The Ultimate History of Video Games: From Pong to Pokémon and Beyond: The Story Behind the Craze that Touched our Lives and Changed the World. Prima, 2001.
[45] G. J. Klinker, S. A. Shafer, and Takeo Kanade. A physical approach to color image understanding. International Journal of Computer Vision, 4(1):7–38, January 1990.
[46] David Kortenkamp, Eric Huber, and R. Peter Bonasso. Recognizing and interpreting gestures on a mobile robot. In Proceedings of the Thirteenth National Conference on Artificial Intelligence (AAAI'96), Volume 2, pages 915–921. AAAI Press, 1996.
[47] Vladimir Kravtchenko. Tracking color objects in real time. Master's thesis, University of British Columbia, Vancouver, British Columbia, November 1999.


[48] K. Krebsbach, D. Olawsky, and M. Gini. An empirical study of sensing and defaulting in planning. In Artificial Intelligence Planning Systems: Proceedings of the First International Conference, June 15-17, 1992, College Park, Maryland, page 136. Morgan Kaufmann, 1992.
[49] D. Kulic and E. Croft. Safe planning for human-robot interaction. In Proceedings of the 2004 IEEE International Conference on Robotics and Automation (ICRA'04), volume 2, 2004.
[50] Michael F. Land. Optics of the eyes of marine animals. In P. J. Herring, A. K. Campbell, M. Whitfield, and L. Maddock, editors, Light and Life in the Sea, pages 149–166. Cambridge University Press, Cambridge, UK, 1990.
[51] Wiley J. Larson and James R. Wertz. Space Mission Analysis and Design. Microcosm, 3rd edition, October 1999.
[52] S. Lavallee, L. Brunie, B. Mazier, and P. Cinquin. Matching of medical images for computed and robot assisted surgery. In Proceedings of the Annual International Conference of the IEEE Engineering in Medicine and Biology Society, volume 13, pages 39–40, 1991.
[53] J. R. Leigh. Control Theory. Institution of Electrical Engineers, July 2004.
[54] Percy Liang, Michael I. Jordan, and Dan Klein. Learning programs: A hierarchical Bayesian approach. In Johannes Fürnkranz and Thorsten Joachims, editors, Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 639–646, Haifa, Israel, June 2010. Omnipress.
[55] Niels Lynnerup, Marie Andersen, and Helle Petri Lauritsen. Facial image identification using PhotoModeler®. Legal Medicine, 5(3):156–160, 2003.

[56] Teruhisa Misu and Tatsuya Kawahara. Bayes risk-based dialogue management for document retrieval system with speech interface. Speech Communication, 52(1):61–71, 2010.


[57] Michael Montemerlo, Joelle Pineau, Nicholas Roy, Sebastien Thrun, and V. Verma. Experiences with a mobile robotic guide for the elderly. In Proceedings of the 18th National Conference on Artificial Intelligence (AAAI), pages 587–592, 2002.
[58] M. S. Nixon, T. N. Tan, and R. Chellappa. Human Identification Based on Gait. The Kluwer International Series on Biometrics. Springer-Verlag New York, Inc., Secaucus, NJ, USA, 2005.
[59] S. A. Niyogi and E. H. Adelson. Analyzing and recognizing walking figures in XYT. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 469–474, 1994.
[60] Alan V. Oppenheim, Alan S. Willsky, and S. Hamid Nawab. Signals & Systems. Prentice-Hall, Inc., Upper Saddle River, NJ, USA, 2nd edition, 1996.
[61] Charles B. Owen, Fan Xiao, and Paul Middlin. What is the best fiducial? In Proceedings of the First IEEE International Augmented Reality Toolkit Workshop, pages 98–105, Darmstadt, Germany, September 2000.
[62] Claudia Pateras, Gregory Dudek, and Renato De Mori. Understanding referring expressions in a person-machine spoken dialogue. In Proceedings of the International Conference on Acoustics, Speech, and Signal Processing (ICASSP '95), volume 1, pages 197–200, May 1995.
[63] Vladimir Pavlovic, Rajeev Sharma, and Thomas S. Huang. Visual interpretation of hand gestures for human-computer interaction: A review. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(7):677–695, 1997.
[64] D. R. Pendergast, M. Tedesco, D. M. Nawrocki, and N. M. Fisher. Energetics of underwater swimming with SCUBA. Medicine & Science in Sports & Exercise, 28(5):573, 1996.


[65] I. Poupyrev, H. Kato, and M. Billinghurst. ARToolkit User Manual Version 2.33. Human Interface Technology Lab, University of Washington, Seattle, Washington, 2000.

[66] Morgan Quigley, Ken Conley, Brian P. Gerkey, Josh Faust, Tully Foote, Jeremy Leibs, Rob Wheeler, and Andrew Y. Ng. ROS: an open-source Robot Operating System. In ICRA Workshop on Open Source Software, 2009.

[67] Lawrence Rabiner and Biing-Hwang Juang. Fundamentals of Speech Recognition. Prentice Hall, 1st edition, April 1993.

[68] L. R. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2):257–286, 1989.

[69] R. F. Rashid. Toward a system for the interpretation of moving light displays. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2(6):574–581, November 1980.

[70] Jun Rekimoto and Yuji Ayatsuka. CyberCode: Designing augmented reality environments with visual tags. In Proceedings of DARE 2000 on Designing Augmented Reality Environments, pages 1–10, Elsinore, Denmark, 2000.

[71] Ioannis M. Rekleitis, Gregory Dudek, and Evangelos Milios. Multi-robot collaboration for robust exploration. Annals of Mathematics and Artificial Intelligence, 31(1-4):7–40, 2001.

[72] Ioannis M. Rekleitis, Ai Peng New, and Howie Choset. Distributed coverage of unknown/unstructured environments by mobile sensor networks. In Alan C. Schultz, Lynne E. Parker, and Frank Schneider, editors, 3rd International NRL Workshop on Multi-Robot Systems, pages 145–155, Washington, D.C., March 14–16, 2005. Kluwer.

[73] Paul E. Rybski and Richard M. Voyles. Interactive task training of a mobile robot through human gesture recognition. In IEEE International Conference on Robotics and Automation, volume 1, pages 664–669, 1999.


[74] Uluc Saranli and Eric Klavins. Object oriented state machines. Embedded Systems Programming, May 2002.

[75] Junaed Sattar, Eric Bourque, Philippe Giguère, and Gregory Dudek. Fourier tags: Smoothly degradable fiducial markers for use in human-robot interaction. In Proceedings of the Fourth Canadian Conference on Computer and Robot Vision, pages 165–174, Montréal, QC, Canada, May 2007.

[76] Junaed Sattar and Gregory Dudek. On the performance of color tracking algorithms for underwater robots under varying lighting and visibility. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), pages 3550–3555, Orlando, Florida, May 2006.

[77] Junaed Sattar and Gregory Dudek. Where is your dive buddy: tracking humans underwater using spatio-temporal features. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 3654–3659, San Diego, California, USA, October 2007.

[78] Junaed Sattar and Gregory Dudek. A boosting approach to visual servo-control of an underwater robot. In Experimental Robotics – The Eleventh International Symposium (ISER), Springer Tracts in Advanced Robotics, volume 54, pages 417–428, Athens, Greece, July 2008.

[79] Junaed Sattar and Gregory Dudek. Robust servo-control for underwater robots using banks of visual filters. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), pages 3583–3588, Kobe, Japan, May 2009.

[80] Junaed Sattar and Gregory Dudek. Underwater human-robot interaction via biological motion identification. In Proceedings of the International Conference on Robotics: Science and Systems V (RSS), pages 185–192, Seattle, Washington, USA, June 2009. MIT Press.


[81] Junaed Sattar and Gregory Dudek. A vision-based control and interaction framework for a legged underwater robot. In Proceedings of the Sixth Canadian Conference on Robot Vision (CRV), pages 329–336, Kelowna, British Columbia, May 2009.

[82] Junaed Sattar and Gregory Dudek. Reducing uncertainty in human-robot interaction – a cost analysis approach. In Proceedings of the Twelfth International Symposium on Experimental Robotics (ISER), New Delhi and Agra, India, December 2010. In press.

[83] Junaed Sattar and Gregory Dudek. Towards quantitative modeling of task confirmations in human-robot dialog. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), pages 1957–1963, Shanghai, China, May 2011.

[84] Junaed Sattar, Gregory Dudek, Olivia Chiu, Ioannis Rekleitis, Alec Mills, Philippe Giguère, Nicolas Plamondon, Chris Prahacs, Yogesh Girdhar, Meyer Nahon, and John-Paul Lobos. Enabling autonomous capabilities in underwater robotics. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 3628–3634, Nice, France, September 2008.

[85] Junaed Sattar, Philippe Giguère, and Gregory Dudek. Sensor-based behavior control for an autonomous underwater vehicle. International Journal of Robotics Research, 28(6):701–713, June 2009.

[86] Junaed Sattar, Philippe Giguère, Gregory Dudek, and Chris Prahacs. A visual servoing system for an aquatic swimming robot. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 1483–1488, Edmonton, Alberta, Canada, August 2005.

[87] Junaed Sattar, Anqi Xu, Gabrielle Charette, and Gregory Dudek. Graphical state-space programmability as a natural interface for robotic control. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), pages 4609–4614, Anchorage, Alaska, USA, May 2010.

[88] Robert E. Schapire. A brief introduction to boosting. In Proceedings of the Sixteenth International Joint Conference on Artificial Intelligence (IJCAI '99), pages 1401–1406, San Francisco, CA, USA, 1999. Morgan Kaufmann Publishers Inc.

[89] Florian Shkurti, Ioannis Rekleitis, Milena Scaccia, and Gregory Dudek. State estimation of an underwater robot using visual and inertial information. In Proceedings of the 2011 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS '11) (to appear), San Francisco, USA, September 2011.

[90] Hedvig Sidenbladh and Michael J. Black. Learning the statistics of people in images and video. International Journal of Computer Vision, 54(1-3):181–207, 2003.

[91] Hedvig Sidenbladh, Michael J. Black, and David J. Fleet. Stochastic tracking of 3D human figures using 2D image motion. In Proceedings of the European Conference on Computer Vision, volume 2, pages 702–718, 2000.

[92] Wladyslaw Skarbek and Andreas Koschan. Colour image segmentation – a survey. Technical report, Technical University of Berlin, Fachbereich 13 Informatik, Franklinstrasse 28/29, 10587 Berlin, Germany, October 1994.

[93] M. Skubic, D. Perzanowski, S. Blisard, A. Schultz, W. Adams, M. Bugajska, and D. Brock. Spatial language for human-robot dialogs. IEEE Transactions on Systems, Man and Cybernetics, Part C, 34(2):154–167, May 2004.

[94] Michael J. Swain and Dana H. Ballard. Color indexing. International Journal of Computer Vision, 7(1):11–32, November 1991.


[95] Axel Techmer. Contour-based motion estimation and object tracking for real-time applications. In International Conference on Image Processing, volume 3, pages 648–651, Thessaloniki, Greece, October 2001.

[96] S. Thrun. Bayesian landmark learning for mobile robot localization. Machine Learning, 33(1):41–76, 1998.

[97] G. T. Toussaint and B. K. Bhattacharya. Optimal algorithms for computing the minimum distance between two finite planar sets. Pattern Recognition Letters, 2:79–82, 1983.

[98] J. K. Tsotsos, G. Verghese, S. Dickinson, M. Jenkin, A. Jepson, E. Milios, F. Nuflo, S. Stevenson, M. Black, D. Metaxas, S. Culhane, Y. Ye, and R. Mann. PLAYBOT: A visually-guided robot for physically disabled children. Image and Vision Computing, 16(4):275–292, April 1998.

[99] Paul Viola and Michael Jones. Robust real-time object detection. International Journal of Computer Vision, 57(2):137–154, 2004.

[100] S. Waldherr, S. Thrun, and R. Romero. A gesture-based interface for human-robot interaction. Autonomous Robots, 9(2):151–173, 2000.

[101] T. Wannier, C. Bastiaanse, G. Colombo, and V. Dietz. Arm to leg coordination in humans during walking, creeping and swimming activities. Experimental Brain Research, 141(3):375–379, December 2001. PMID: 11715082.

[102] Greg Welch, Gary Bishop, Leandra Vicci, Stephen Brumback, Kurtis Keller, and D'nardo Colucci. The HiBall tracker: High-performance wide-area tracking for virtual and augmented environments. In Symposium on Virtual Reality Software and Technology, pages 1–10, December 1999.

[103] Louis Whitcomb. Underwater robotics: Out of the research laboratory and into the field. In IEEE International Conference on Robotics and Automation, pages 85–90. IEEE, 2000.


[104] Rob Williams. Real-Time Systems Development. Butterworth-Heinemann, December 2005.

[105] Anqi Xu, Gregory Dudek, and Junaed Sattar. A natural gesture interface for operating robotic systems. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), pages 3557–3563, Pasadena, California, May 2008.

[106] Richard Y. D. Xu, John G. Allen, and Jesse S. Jin. Robust mean-shift tracking with extended fast colour thresholding. In International Symposium on Intelligent Multimedia, Video and Speech Processing (ISIMP 2004), pages 542–545, Hong Kong, October 2004.

Document Log:

Manuscript Version 1.0, typeset by AMS-LaTeX — 28 November 2011

Junaed Sattar

McGill University, 3480 University St., Montréal (Québec) H3A 2A7, Canada, Tel.: (514) 398-7071
E-mail address: [email protected]

Typeset by AMS-LaTeX