Head Gesture Recognition in Intelligent Interfaces: The Role of Context in Improving Recognition
Louis-Philippe Morency and Trevor Darrell
MIT CSAIL, Cambridge, MA 02139, USA
[email protected], [email protected]

ABSTRACT
Acknowledging an interruption with a nod of the head is a natural and intuitive communication gesture which can be performed without significantly disturbing a primary interface activity. In this paper we describe vision-based head gesture recognition techniques and their use for common user interface commands. We explore two prototype perceptual interface components which use detected head gestures for dialog box confirmation and document browsing, respectively. Tracking is performed using stereo-based alignment, and recognition proceeds using a trained discriminative classifier. An additional context learning component is described, which exploits interface context to obtain robust performance. User studies with prototype recognition components indicate quantitative and qualitative benefits of gesture-based confirmation over conventional alternatives.

Categories and Subject Descriptors
I.2.10 [Artificial Intelligence]: Vision and Scene Understanding—Motion; H.5.2 [Information systems]: User Interfaces

General Terms
Design, Experimentation, Algorithms, Human Factors, Performance

Keywords
Head gesture, context-based recognition, nodding, nod recognition, multi-modal input, user study, IUI design

1. INTRODUCTION
When people interact naturally with each other, it is common to see indications of acknowledgment, agreement, or disinterest given with a simple head gesture. We conjecture that a human-computer interface that could perceive such gestures would enable less disruptive notification, as well as make other interface actions more efficient or easy to perform.

Modern computer interfaces often interrupt a user's primary activity with a notification about an event or condition, which may or may not be relevant to the main activity. Currently, a user must shift keyboard or mouse focus to attend to the notification, and use keyboard and mouse events to page through the displayed information and dismiss the notification before returning to the main activity. Requiring keyboard or mouse events to respond to notifications can clearly disrupt users during certain activities or tasks.

Recent advances in computer vision have led to efficient and robust head pose tracking systems, which can return the position and orientation of a user's head through automatic passive observation. Efficient methods for recognizing head gestures with discriminatively trained statistical classifiers have also been proposed. In this work we use a robust real-time head tracking and gesture recognition system which was originally developed for interacting with conversational robots or animated characters [15]. We show here how detected head gestures can be used for interface commands under a traditional windows-based graphical user interface.

We explore two types of head gesture-based controls: dialog box acknowledgment/agreement, and document browsing. These components were chosen because they support the aspects of the notification scenario described above. The first allows a user to accept or reject a dialog box or other notification window by nodding or shaking their head. The second allows a user to page through a document using head nods.

Head gestures have been used in conversational speech interfaces [15], where context was shown to be a critical factor in obtaining reliable recognition. Recognition based only on vision can be inaccurate or erratic; by learning a predictor of gesture likelihood from the current dialog state, robust performance was obtained. We develop an analogous context predictor here, but exploit cues from the window manager rather than a dialog model. Figure 1 presents our framework for context-based gesture recognition. We show that learning simple cues about interface state, including features from mouse activity and window manager state, can similarly reduce erroneous detections by the gesture recognition system.

We conduct user studies with the gesture-based recognition components. Results indicate that users feel acknowledgment and agreement are the preferred modes of use, and quantitative evaluation shows task performance is significantly improved under the gesture-based component. Users do not significantly prefer the gesture-based scrolling component, however, and task performance there is equivalent. We conclude with a discussion of the applicability of head-gesture components for acknowledgment/agreement, and possible modifications to the scrolling component based on user feedback.

[Figure 1 (diagram): window manager events feed a contextual predictor; visual input, after preprocessing, feeds the vision-based gesture recognizer; a multi-modal integrator fuses both streams into output events.]
Figure 1: Framework for context-based gesture recognition. The contextual predictor translates contextual features into a likelihood measure, similar to the visual recognizer output. The multi-modal integrator fuses these visual and contextual likelihood measures.
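As a minimal illustration of this framework (a sketch, not the system described in this paper), the following Python snippet pairs a contextual predictor with a simple integrator. The two interface features (seconds since a dialog box appeared, seconds since the last mouse event), the toy training data, the logistic-regression predictor, and the additive fusion rule with its threshold are all illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Contextual predictor (sketch): maps interface-state features to the
# likelihood that a deliberate head gesture is occurring. The features
# are hypothetical: [seconds since a dialog box appeared,
#                    seconds since the last mouse event].
X_ctx = np.array([
    [0.5, 2.0],  # dialog just popped up, mouse idle -> gesture likely
    [1.0, 3.5],
    [9.0, 0.1],  # no recent dialog, mouse active -> gesture unlikely
    [7.0, 0.2],
])
y_ctx = np.array([1, 1, 0, 0])  # 1 = user gestured in this situation
contextual_predictor = LogisticRegression().fit(X_ctx, y_ctx)

def integrate(vision_score, ctx_features, threshold=1.0):
    """Multi-modal integrator (sketch): fuse the visual recognizer's
    confidence with the contextual likelihood and threshold the sum."""
    p_ctx = contextual_predictor.predict_proba([ctx_features])[0, 1]
    return vision_score + p_ctx > threshold

# The same borderline visual detection is accepted only in a
# supportive interface context:
print(integrate(0.6, [0.4, 2.5]))  # context favors a nod -> True
print(integrate(0.6, [8.0, 0.1]))  # context does not -> False
```

Only the data flow mirrors Figure 1: two likelihood streams, one contextual and one visual, enter a single fused decision.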
2. BACKGROUND
Several authors have proposed face tracking for pointer or scrolling control and have reported successful user studies [20, 11]. In contrast to eye gaze [22], users seem to be able to maintain fine motor control of head gaze at or below the level needed to make fine pointing gestures¹. However, many of these systems required users to manually initialize or reset tracking, and they supported a direct manipulation style of interaction rather than recognizing distinct gestures.

There has been substantial research in hand/body gesture for human-computer interaction. Lenman et al. explored the use of pie and marking menus in hand gesture-based interaction [12]. Cohen et al. studied the issues involved in controlling computer applications via hand gestures composed of both static and dynamic symbols [5].

Several authors have considered head gesture in interactive dialog with a conversational agent. Cassell [2, 3] and Sidner [18, 19, 17] have developed rich models of multimodal output in the context of embodied natural language conversation, including multimodal representations of agreement and grounding gestures, but used mostly simple vision-based inputs such as face detection. Nakano et al. analyzed eye gaze and head nods in human-computer conversation [16].

Recognition of head gestures has been demonstrated by tracking eye position over time. Kapoor and Picard presented a technique to recognize head nods and head shakes based on two Hidden Markov Models (HMMs) trained and tested using 2D coordinate results from an eye gaze tracker [9]. Kawato and Ohya suggested a technique for head gesture recognition using between-eye templates [10]. When compared with eye gaze, head gaze can be more accurate when dealing with low-resolution images and can be estimated over a larger range than eye gaze [13]. Fujie et al. also used HMMs to perform head nod recognition [7]; they combined head gesture detection with prosodic recognition of Japanese spoken utterances to determine strongly positive, weakly positive, and negative responses to yes/no utterances.

In this paper, we use a head gesture recognition approach based on head velocities classified with a support vector machine (SVM). Our stereo-based head pose tracker is robust to strong illumination changes, initializes automatically without user intervention, and can re-initialize automatically if tracking is lost (which is rare) [14]. The SVM approach to head gesture recognition was shown to outperform previous techniques based on HMMs [15].
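To illustrate this classification step, the sketch below trains an SVM on fixed-length windows of head rotation velocities. It is not the recognizer of [15]: the window length, the synthetic pitch/yaw velocity data, and the RBF kernel are assumptions made for the example.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
WIN = 32  # frames per classification window (assumed)

def velocity_window(nod):
    """Synthetic stand-in for the tracker's output: a window of
    (pitch, yaw) angular velocities. A nod appears as an oscillating
    pitch velocity; a non-nod is low-amplitude noise."""
    t = np.linspace(0.0, 2.0 * np.pi, WIN)
    noise = 0.1 * rng.standard_normal(WIN)
    pitch = np.sin(3.0 * t) + noise if nod else noise
    yaw = 0.1 * rng.standard_normal(WIN)
    return np.concatenate([pitch, yaw])  # flatten to one feature vector

X = np.stack([velocity_window(i % 2 == 0) for i in range(200)])
y = np.array([i % 2 == 0 for i in range(200)], dtype=int)

clf = SVC(kernel="rbf", probability=True).fit(X, y)

# At run time, the last WIN frames of tracked velocities would be
# flattened the same way and scored as each new frame arrives:
print(clf.predict_proba(X[:2])[:, 1])  # nod likelihood per window
```

The sliding-window scoring shown in the last lines is what produces the per-frame visual likelihood that the integrator above consumes.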
To our knowledge, no previous work has investigated recognition-based head-gesture controls in the context of a conventional windows-based interface.

3. GESTURE-BASED INTERACTIONS
Head nods and head shakes are natural gestures commonly used during face-to-face interaction. Inspired by research on human-ECA interaction [19, 16], we propose head gesture-based controls for two conventional windows interfaces: dialog boxes and document browsing.

Dialog boxes are special windows used by computer programs or by the operating system to display information to the user, or to get a response if needed [21]. We focus our attention on two types of dialog boxes: notification dialog boxes and question dialog boxes.

Notification dialog boxes are one-button windows that show information from an application and wait for the user to acknowledge the information and click a confirmation button. During human-to-human interaction, the process of ensuring common understanding is called grounding [4]. Grounding is also present during interactions with embodied conversational agents, and human participants naturally head nod as non-verbal feedback for grounding [16]. From these observations, we can expect human participants to naturally accept head nodding as a way to answer notification dialog boxes.
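As a sketch of how a recognized nod or shake could drive such a dialog box, the snippet below wires gesture events into a Tkinter question dialog. Tkinter merely stands in for whatever window manager hooks a real system would use, and the on_gesture entry point, which the (assumed) recognizer would call, is hypothetical.

```python
import tkinter as tk

def show_question(prompt, on_answer):
    """Question dialog answerable by button click or head gesture."""
    root = tk.Tk()
    root.title("Question")
    tk.Label(root, text=prompt).pack(padx=20, pady=10)

    def answer(yes):
        on_answer(yes)
        root.destroy()

    tk.Button(root, text="Yes", command=lambda: answer(True)).pack(
        side="left", padx=20, pady=10)
    tk.Button(root, text="No", command=lambda: answer(False)).pack(
        side="right", padx=20, pady=10)

    def on_gesture(gesture):
        # A head nod is equivalent to clicking "Yes", a shake to "No";
        # other recognizer outputs are ignored.
        if gesture == "nod":
            answer(True)
        elif gesture == "shake":
            answer(False)

    root.on_gesture = on_gesture  # hook for the (assumed) recognizer
    return root

root = show_question("Save changes?",
                     lambda yes: print("saved" if yes else "discarded"))
# Simulate the recognizer detecting a nod two seconds after the dialog
# appears; a real system would call on_gesture from the vision pipeline.
root.after(2000, lambda: root.on_gesture("nod"))
root.mainloop()
```

For a one-button notification dialog, the same hook would simply treat a nod as the acknowledgment click.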