A Gesture Based Interface for Human-Robot Interaction
Stefan Waldherr
Computer Science Department
Carnegie Mellon University
Pittsburgh, PA, USA
Roseli Romero
Instituto de Ciências Matemáticas e de Computação
Universidade de São Paulo
São Carlos, SP, Brazil
Sebastian Thrun
Computer Science Department
Carnegie Mellon University
Pittsburgh, PA, USA
Abstract.
Service robotics is currently a pivotal research area in robotics, with enormous societal potential. Since service robots directly interact with people, finding "natural" and easy-to-use user interfaces is of fundamental importance. While past work has predominantly focused on issues such as navigation and manipulation, relatively few robotic systems are equipped with flexible user interfaces that permit controlling the robot by "natural" means.
This paper describes a gesture interface for the control of a mobile robot equipped with a manipulator. The interface uses a camera to track a person and recognize gestures involving arm motion. A fast, adaptive tracking algorithm enables the robot to track and follow a person reliably through office environments with changing lighting conditions. Two alternative methods for gesture recognition are compared: a template-based approach and a neural network approach. Both are combined with the Viterbi algorithm for the recognition of gestures defined through arm motion in addition to static arm poses. Results are reported in the context of an interactive clean-up task, where a person guides the robot to specific locations that need to be cleaned and instructs the robot to pick up trash.
1. Introduction
The field of robotics is currently undergoing a change. While in the past, robots were predominantly used in factories for purposes such as manufacturing and transportation, a new generation of "service robots" has recently begun to emerge (Schraft and Schmierer, 1998). Service robots cooperate with people, and assist them in their everyday tasks. Specific examples of commercial service robots include the Helpmate robot, which has already been deployed at numerous hospitals worldwide (King and Weiman, 1990), an autonomous cleaning robot that has
successfully been deployed in a supermarket during opening hours (Endres et al., 1998), and the Robo-Caddy (www.icady.com/homefr.htm), a robot designed to make life easier by carrying around golf clubs. Few of these robots can interact with people other than by avoiding them. In the near future, similar robots are expected to appear in various branches of entertainment, recreation, health care, nursing, etc., and they are expected to interact closely with people.
This upcoming generation of service robots opens up new research opportunities. While the issue of robot navigation has been researched quite extensively (Cox and Wilfong, 1990; Kortenkamp et al., 1998; Borenstein et al., 1996), comparatively little attention has been paid to issues of human-robot interaction (see Section 5 for a discussion of related literature). However, many service robots will be operated by non-expert users, who might not even be capable of operating a computer keyboard. It is therefore essential that these robots be equipped with "natural" interfaces that facilitate the interaction between robots and people.
Nevertheless, the need for more effective human-robot interfaces has been well recognized by the research community. For example, Torrance developed in his M.S. thesis a natural language interface for teaching mobile robots the names of places in an indoor environment (Torrance, 1994). Due to the lack of a speech recognition system, his interface still required the user to operate a keyboard; nevertheless, the natural language component made instructing the robot significantly easier.
More recently, Asoh and colleagues developed an interface that integrates a speech recognition system into a phrase-based natural language interface (Asoh et al., 1997). The authors successfully instructed their "office-conversant" robot to navigate to office doors and other significant places in their environment through verbal commands. Among people, communication often involves more than spoken language. For example, it is far easier to point to an object than to verbally describe its exact location. Gestures are an easy way to give geometrical information to the robot. Hence, other researchers have proposed vision-based interfaces that allow people to instruct mobile robots via arm gestures (see Section 5). For example, both Kortenkamp and colleagues (Kortenkamp et al., 1996) and Kahn (Kahn, 1996) have developed mobile
robot systems instructed through arm poses. Most previous mobile robot approaches recognize only static arm poses as gestures, and they cannot recognize gestures that are defined through specific temporal patterns, such as waving. Motion gestures, which are commonly used for communication among people, provide additional freedom in the design of gestures. In addition, they reduce the chances of accidentally classifying arm poses as gestures that were not intended as such. Thus,
they appear better suited for human-robot interaction than static pose gestures alone.
Mobile robot applications of gesture recognition impose several requirements on the system. First of all, the gesture recognition system needs to be small enough to "fit" on the robot, as processing power is generally limited. Since both the human and the robot may be moving while a gesture is shown, the system may not assume a static background or a fixed location of the camera or of the human who performs a gesture. In fact, some tasks may involve following a person around, in which case the robot must be able to recognize gestures while it tracks a person, and to adapt to possibly drastic changes in lighting conditions. Additionally, the system must work at an acceptable speed. Naturally, one would want a 'Stop' gesture to halt the robot immediately, and not five seconds later. These requirements must be taken into consideration when designing the system.
This paper presents a vision-based interface that has been designed to instruct a mobile robot through both pose and motion gestures (Waldherr et al., 1998). At the lowest level, an adaptive dual-color tracking algorithm enables the robot to track and follow a person around at speeds of up to 30 cm per second while avoiding collisions with obstacles. This tracking algorithm quickly adapts to different lighting conditions, segmenting the image to find the person's position relative to the center of the image and using a pan/tilt unit to keep the person centered. Gestures are recognized in two phases: in the first, individual camera images are mapped into a vector that specifies the likelihoods of individual poses. We compare two different approaches, one based on neural networks and one that uses a graphical correlation-based template matcher. In the second phase, the Viterbi algorithm is employed to dynamically match the stream of image data with pre-defined temporal gesture templates.
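The paper does not spell out this matching step in pseudocode. As an illustration only, the following minimal Python sketch shows one plausible form of the second phase, assuming a gesture template is a sequence of pose indices, each of which may stretch over several frames (a left-to-right alignment with self-transitions). All names and the uniform transition model are our own, not taken from the paper.

```python
import numpy as np

def viterbi_gesture_score(pose_likelihoods, template):
    """Align a stream of per-frame pose-likelihood vectors with a
    gesture template (a sequence of pose indices) via Viterbi.

    pose_likelihoods: (T, P) array; entry [t, p] is the likelihood
                      that frame t shows pose p (phase-one output).
    template:         pose indices defining the gesture, e.g. a raised
                      arm followed by a sideways arm.
    Returns the log-likelihood of the best alignment.
    """
    T, S = len(pose_likelihoods), len(template)
    log_obs = np.log(pose_likelihoods + 1e-12)   # guard against log(0)

    # delta[s]: best log-score of any alignment that ends in template
    # state s after the frames seen so far.
    delta = np.full(S, -np.inf)
    delta[0] = log_obs[0, template[0]]           # alignment starts in state 0

    for t in range(1, T):
        new_delta = np.full(S, -np.inf)
        for s in range(S):
            stay = delta[s]                               # dwell in state s
            advance = delta[s - 1] if s > 0 else -np.inf  # advance from s-1
            new_delta[s] = max(stay, advance) + log_obs[t, template[s]]
        delta = new_delta

    return delta[-1]   # a full match must end in the final template state
```

Each pre-defined template can then be scored against the incoming likelihood stream, with a threshold deciding whether any gesture was shown at all.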
The work reported here goes beyond the design of the gesture interface. One of the goals of this research is to investigate the usability of gesture interfaces in the context of a realistic service robot application. The interface was therefore integrated into our existing robot navigation and control software. The task of the robot is motivated by the "clean-up-the-office" task at the 1994 mobile robot competition (Simmons, 1995). There, a robot had to autonomously search an office for objects scattered on the floor, and to deposit them in nearby trash bins. Our task differs in that we want a human to guide the robot to the trash using gestures. The robot should then pick up the trash, carry it to a trash bin, and dump it there.
The remainder of the paper is organized as follows. Section 2 describes our approach to visual servoing, followed by the main gesture recognition algorithm described in Section 3. Experimental results are discussed in Section 4. Finally, the paper is concluded by a discussion of related work (Section 5) and a general discussion (Section 6).
2. Finding and Tracking People
Finding and tracking people is the core of any vision-based gesture recognition system. After all, the robot must know where in the image the person is. Visual tracking of people has been studied extensively over the past few years (Darrel et al., 1996; Crowley, 1997; Wren et al., 1997). However, the vast majority of existing approaches assumes that the camera is mounted at a fixed location. Such approaches typically rely on a static background, so that human motion can be detected, e.g., through image differencing. In the case of robot-based gesture recognition, one cannot assume that the background is static. While the robot tracks and follows people, background and lighting conditions often change considerably. In addition, processing power on a robot is usually limited, which imposes an additional burden. As we will see in this section, an adaptive color-based approach tracking both face and shirt color appears to be capable of tracking a person under changing lighting conditions.
Our approach tracks the person based on a combination of two colors: face color and shirt color. Face color as a feature for tracking people has been used before, leading to remarkably robust face tracking as long as the lighting conditions do not change too much (Yang and Waibel, 1995). The advantage of color over more complicated models (e.g., head motion or facial features; Rowley et al., 1998) lies in the speed at which camera images can be processed, a crucial aspect when implemented on a low-end robotic platform. In our experiments, however, we found color alone to be insufficient for tracking people using a moving robot; hence, we extend our algorithm to track a second color, typically aligned vertically with a person's face: shirt color. The remainder of this section describes the basic algorithm, which runs at approximately 10 Hz on a low-end PC.
2.1. Color Filters
Our approach adopts a basic color model commonly used in the face detection and tracking literature. To detect face color, it is advantageous to assume that camera images are represented in the RGB color space. More specifically, each pixel in the camera image is represented by a triplet $P = (R, G, B)$. RGB implicitly encodes color, brightness, and saturation. If two pixels in a camera image, $P_1$ and $P_2$, are proportional, that is,

$$\frac{R_{P_1}}{R_{P_2}} = \frac{G_{P_1}}{G_{P_2}} = \frac{B_{P_1}}{B_{P_2}}, \tag{1}$$

they share the same color but, if $R_{P_1} \neq R_{P_2}$, differ in brightness.

[Figure 1. Face-color distribution in chromatic color space; the two horizontal axes are the chromatic coordinates r and g. The distribution was obtained from the person's face shown in Figure 2a.]
Since brightness may vary drastically with the lighting conditions, it is common practice to represent face color in the chromatic color space. Chromatic colors (Wyszecki and Styles, 1982), also known as pure colors, are defined by a projection $\Re^3 \mapsto \Re^2$, where

$$r = \frac{R}{R + G + B} \tag{2}$$

and

$$g = \frac{G}{R + G + B}. \tag{3}$$
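For concreteness, the projection of Equations (2) and (3) can be written in a few lines of NumPy. This sketch is ours; the guard against black pixels (where R + G + B = 0) is an implementation detail the paper does not discuss.

```python
import numpy as np

def to_chromatic(image_rgb):
    """Project an RGB image (H x W x 3) into chromatic color space,
    per Equations (2) and (3): r = R/(R+G+B), g = G/(R+G+B)."""
    rgb = image_rgb.astype(np.float64)
    total = rgb.sum(axis=2)
    total[total == 0.0] = 1.0    # avoid dividing by zero on black pixels
    r = rgb[..., 0] / total
    g = rgb[..., 1] / total
    return r, g
```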
Following the approach by Yang et al. (Yang and Waibel, 1995), our approach represents the chromatic face color model by a Gaussian model $M = (\mu_r, \mu_g, \Sigma)$, characterized by the means $\mu_r$ and $\mu_g$,

$$\mu_r = \frac{1}{N} \sum_{i=0}^{N} r_i \tag{4}$$

$$\mu_g = \frac{1}{N} \sum_{i=0}^{N} g_i \tag{5}$$
and the covariance matrix

$$\Sigma = \begin{pmatrix} \sigma_r^2 & \sigma_{rg} \\ \sigma_{gr} & \sigma_g^2 \end{pmatrix}, \tag{6}$$

where

$$\sigma_r^2 = \sum_{i=0}^{N} (r_i - \mu_r)^2, \tag{7}$$

$$\sigma_g^2 = \sum_{i=0}^{N} (g_i - \mu_g)^2 \tag{8}$$

and

$$\sigma_{rg} = \sigma_{gr} = \sum_{i=0}^{N} (r_i - \mu_r)(g_i - \mu_g). \tag{9}$$
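Estimating this model from pixels sampled inside a detected face region is straightforward; the sketch below is one possible reading of Equations (4)-(9). Note that, as printed, Equations (7)-(9) carry no 1/N normalization; scaling Σ by a constant only rescales the resulting Mahalanobis distances, so any detection threshold simply absorbs it.

```python
import numpy as np

def fit_face_color_model(r_samples, g_samples):
    """Fit the Gaussian face-color model M = (mu_r, mu_g, Sigma) of
    Equations (4)-(9) from 1-D arrays of chromatic pixel samples."""
    mu_r = r_samples.mean()                      # Equation (4)
    mu_g = g_samples.mean()                      # Equation (5)
    dr = r_samples - mu_r
    dg = g_samples - mu_g
    # Equations (7)-(9), written as plain sums as in the paper.
    sigma = np.array([[np.sum(dr * dr), np.sum(dr * dg)],
                      [np.sum(dg * dr), np.sum(dg * dg)]])
    return mu_r, mu_g, sigma
```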
Typically, the color distribution of human face color is centered in a small area of the chromatic color space, i.e., only a few of all possible colors actually occur in a human face. Figure 1 shows such a face-color distribution for the person shown in Figure 2a. The two horizontal axes represent the r and g values, respectively, and the vertical axis is the response to the color model. The higher the response, the better the person's face matches the color model.
Under the Gaussian model of face color given above, the Mahalanobis distance of the $i$-th pixel's chromatic values $(r_i, g_i)$ is given by

$$\left( r_i - \mu_r, \; g_i - \mu_g \right) \Sigma^{-1} \left( r_i - \mu_r, \; g_i - \mu_g \right)^T.$$
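Evaluated over a whole image, this distance yields a response map in which small values indicate face-like color, and thresholding it produces a binary face-color mask. The sketch below continues the earlier ones and expands the quadratic form for the 2 x 2 case; it is our illustration, not code from the paper.

```python
import numpy as np

def face_color_response(r, g, mu_r, mu_g, sigma):
    """Per-pixel Mahalanobis distance of chromatic values (r, g) from
    the face-color model; smaller distances mean more face-like color."""
    inv = np.linalg.inv(sigma)   # Sigma^-1 (any scale is absorbed by the threshold)
    dr = r - mu_r
    dg = g - mu_g
    # Expand (d^T Sigma^-1 d) for the 2x2 case, evaluated per pixel.
    return (inv[0, 0] * dr * dr
            + (inv[0, 1] + inv[1, 0]) * dr * dg
            + inv[1, 1] * dg * dg)
```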