
From: AAAI-00 Proceedings. Copyright © 2000, AAAI (www.aaai.org). All rights reserved.

Visual Event Classification via Force Dynamics

Jeffrey Mark Siskind
NEC Research Institute, Inc.
4 Independence Way
Princeton NJ 08540 USA
609/951–2705
[email protected]
http://www.neci.nj.nec.com/homepages/qobi

Abstract

This paper presents an implemented system, called LEONARD, that classifies simple spatial motion events, such as pick up and put down, from video input. Unlike previous systems that classify events based on their motion profile, LEONARD uses changes in the state of force-dynamic relations, such as support, contact, and attachment, to distinguish between event types. This paper presents an overview of the entire system, along with the details of the algorithm that recovers force-dynamic interpretations using prioritized circumscription and a stability test based on a reduction to linear programming. This paper also presents an example illustrating the end-to-end performance of LEONARD classifying an event from video input.

Introduction

People can describe what they see. If someone were to pick up a block and ask you what you saw, you could say The person picked up the block. In doing so, you describe both objects, like people and blocks, and events, like pickings up. Most recognition research in machine vision has focussed on recognising objects. In contrast, this paper describes a system for recognising events. Objects correspond roughly to the noun vocabulary in language. In contrast, events correspond roughly to the verb vocabulary in language. The overall goal of this research is to ground the lexical semantics of verbs in visual perception.

A number of reported systems can classify event occurrences from video or simulated video, among them, Yamoto, Ohya, & Ishii (1992), Regier (1992), Pinhanez & Bobick (1995), Starner (1995), Siskind & Morris (1996), Bailey et al. (1998), and Bobick & Ivanov (1998). While they differ in their details, by and large, these systems classify event occurrences by their motion profile. For example, a pick up event is described as a sequence of two subevents: the agent moving towards the patient while the patient is at rest above the source, followed by the agent moving with the patient away from the source. Such systems use some combination of relative and absolute; linear and angular; positions, velocities, and accelerations as the features that drive classification.

These systems follow the tradition of linguists and cognitive scientists, such as Leech (1969), Miller (1972), Schank (1973), Jackendoff (1983), or Pinker (1989), that represent the lexical semantics of verbs via the causal, aspectual, and directional qualities of motion. Some linguists and cognitive scientists, such as Herskovits (1986) and Jackendoff & Landau (1991), have argued that force-dynamic relations (Talmy 1988), such as support, contact, and attachment, are crucial for representing the lexical semantics of spatial prepositions. For example, in some situations, part of what it means for one object to be on another object is for the former to be in contact with, and supported by, the latter. In other situations, something can be on something else by way of attachment, as in the knob on the door. Siskind (1992) has argued that changes in the state of force-dynamic relations play a more central role in specifying the lexical semantics of simple spatial motion verbs than motion profile. The particular linear and angular velocities and accelerations don't matter when picking something up or putting something down. What matters is a state change. When picking something up, the patient is initially supported by being on top of the source. Subsequently, the patient is supported by being attached to the agent. Likewise, when putting something down, the reverse is true. The patient starts out being supported by being attached to the agent. It is subsequently supported by being on top of the goal. Furthermore, what distinguishes putting something down from dropping it is that, in the former, the patient is always supported, while in the latter, the patient undergoes unsupported motion.
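To make the contrast with motion-profile approaches concrete, the following is a minimal sketch of this state-change view. The per-frame state record and the three recognisers are illustrative stand-ins, not LEONARD's actual event-logic definitions.

```python
from dataclasses import dataclass
from typing import List

@dataclass(frozen=True)
class PatientState:
    """Hypothetical force-dynamic relations holding for the patient in one frame."""
    supported: bool          # the patient is not in free fall
    attached_to_agent: bool  # supported by attachment to the agent
    on_source: bool          # resting on top of the source
    on_goal: bool            # resting on top of the goal

def pick_up(states: List[PatientState]) -> bool:
    """Initially supported by resting on the source, finally supported by
    attachment to the agent, and supported throughout."""
    return (bool(states)
            and states[0].on_source
            and states[-1].attached_to_agent
            and all(s.supported for s in states))

def put_down(states: List[PatientState]) -> bool:
    """The reverse state change, again with support throughout."""
    return (bool(states)
            and states[0].attached_to_agent
            and states[-1].on_goal
            and all(s.supported for s in states))

def drop(states: List[PatientState]) -> bool:
    """Like put down, except the patient undergoes unsupported motion."""
    return (bool(states)
            and states[0].attached_to_agent
            and states[-1].on_goal
            and any(not s.supported for s in states))
```

Note that no velocities or accelerations appear: only the truth values of the force-dynamic relations at the start, end, and interior of the event interval matter.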
Siskind (1995), among others, describes a system for recovering force-dynamic relations from simulated video and using those relations to perform event classification. Mann, Jepson, & Siskind (1997), among others, describes a system for recovering force-dynamic relations from video but does not use those relations to perform event classification. This paper describes a system, called LEONARD, that recovers force-dynamic relations from video and uses those relations to perform event classification. It is the first reported system that goes all the way from video to event classification using recovered force dynamics. LEONARD is a complex, comprehensive system. Video input is processed using a real-time colour- and motion-based segmentation procedure to place a convex polygon around each participant object in each input frame. A tracking procedure then computes the correspondence between the polygons in each frame and those in adjacent frames. LEONARD then constructs force-dynamic interpretations of the resulting polygon movie. These interpretations are constructed out of predicates that describe the attachment relations between objects, the qualitative depth of objects, and their groundedness. Some interpretations are consistent in that they describe stable scenes. Others are inconsistent in that they describe unstable scenes. LEONARD performs model reconstruction, selecting as models only those interpretations that explain the stability of the scene. Kinematic stability analysis is performed efficiently via a reduction to linear programming. There are usually multiple models, i.e., stable interpretations of each scene. LEONARD selects a preferred subset of models using prioritized, cardinality, and temporal circumscription. Event classification is efficiently performed on this preferred subset of models using an interval-based event logic. A precise description of the entire system is beyond the scope of this paper. The remainder of this paper focuses on kinematic stability analysis and model reconstruction. It also presents an example of the entire system in operation. Future papers will describe other components of this system in greater detail.
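The details of the reduction from stability analysis to linear programming are not given in this excerpt. Purely as an illustration of the general idea, namely deciding stability by asking whether some assignment of contact forces satisfies the equilibrium constraints, which is a linear feasibility problem, here is a sketch for a toy case (a horizontal rod resting on point supports) using SciPy's linprog. This is not LEONARD's actual formulation.

```python
import numpy as np
from scipy.optimize import linprog

def rod_is_stable(support_xs, mass_x, weight=1.0):
    """Toy stability test: can non-negative normal forces at the supports
    balance gravity acting at mass_x?  Force balance and torque balance are
    linear equality constraints, so stability is an LP feasibility question."""
    n = len(support_xs)
    A_eq = np.array([[1.0] * n,          # sum of normal forces = weight
                     list(support_xs)])  # sum of torques about x=0 = weight * mass_x
    b_eq = np.array([weight, weight * mass_x])
    result = linprog(c=np.zeros(n), A_eq=A_eq, b_eq=b_eq,
                     bounds=[(0, None)] * n, method="highs")
    return result.status == 0            # feasible <=> a balancing assignment exists

print(rod_is_stable([0.0, 2.0], mass_x=1.0))  # True: centre of mass between supports
print(rod_is_stable([0.0, 2.0], mass_x=3.0))  # False: the rod would tip
```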
Kinematic Stability Analysis

Let us consider a simplified world that consists of line segments. Polygons can be treated as collections of rigidly attached line segments. Let us denote line segments by the symbol $l$. In this simplified world, some line segments will not need to be supported. Such line segments are said to be grounded. Let us denote the fact that $l$ is grounded by the property $g(l)$. In this simplified world, the table top and the agent's hand will be grounded.

In this simplified world, line segments can be joined together. If $l_i$ and $l_j$ are joined, the constraint on their relative motion is specified by three relations $\leftrightarrow_1$, $\leftrightarrow_2$, and $\leftrightarrow_\theta$. If $l_i \leftrightarrow_1 l_j$, then the position of the joint along $l_i$ is fixed. Likewise, if $l_i \leftrightarrow_2 l_j$, then the position of the joint along $l_j$ is fixed. And if $l_i \leftrightarrow_\theta l_j$, then the relative orientation of $l_i$ and $l_j$ is fixed. Combinations of these three relations allow specifying a variety of joint types. If $l_i \leftrightarrow_1 l_j \wedge l_i \leftrightarrow_2 l_j \wedge l_i \leftrightarrow_\theta l_j$, then $l_i$ and $l_j$ are rigidly joined. If $l_i \leftrightarrow_1 l_j \wedge l_i \leftrightarrow_2 l_j \wedge l_i \not\leftrightarrow_\theta l_j$, then $l_i$ and $l_j$ are joined by a revolute joint. If $l_i \not\leftrightarrow_1 l_j \wedge l_i \leftrightarrow_2 l_j \wedge l_i \leftrightarrow_\theta l_j$, then $l_i$ and $l_j$ are joined by a prismatic joint that allows $l_j$ to slide along $l_i$. If $l_i \leftrightarrow_1 l_j \wedge l_i \not\leftrightarrow_2 l_j \wedge l_i \leftrightarrow_\theta l_j$, then $l_i$ and $l_j$ are joined by a prismatic joint that allows $l_i$ to slide along $l_j$. If $l_i \not\leftrightarrow_1 l_j \wedge l_i \not\leftrightarrow_2 l_j \wedge l_i \not\leftrightarrow_\theta l_j$, then $l_i$ and $l_j$ are not joined.

Line segments are also assigned to layers. The relation $l_i \bowtie l_j$ denotes that $l_i$ and $l_j$ are on the same layer; line segments on the same layer cannot pass through one another. This qualitative treatment of depth does not model objects touching one another along the focal axis of the observer.

A scene is a set $L$ of line segments. An interpretation of a scene is a quintuple $\langle g, \leftrightarrow_1, \leftrightarrow_2, \leftrightarrow_\theta, \bowtie \rangle$. It is convenient to depict scene interpretations graphically. Figure 1 shows the graphical representation of the predicates $g$, $\leftrightarrow_1$, $\leftrightarrow_2$, $\leftrightarrow_\theta$, and $\bowtie$.

Figure 1: A graphical representation of scene interpretations. Consider the vertical lines to be $l_i$ and the horizontal lines to be $l_j$. (a) depicts $g(l_j)$. (b) depicts $l_i \not\leftrightarrow_1 l_j \wedge l_i \leftrightarrow_2 l_j \wedge l_i \not\leftrightarrow_\theta l_j$. (c) depicts $l_i \leftrightarrow_1 l_j \wedge l_i \not\leftrightarrow_2 l_j \wedge l_i \not\leftrightarrow_\theta l_j$. (d) depicts $l_i \leftrightarrow_1 l_j \wedge l_i \leftrightarrow_2 l_j \wedge l_i \not\leftrightarrow_\theta l_j$. (e) depicts $l_i \not\leftrightarrow_1 l_j \wedge l_i \not\leftrightarrow_2 l_j \wedge l_i \leftrightarrow_\theta l_j$. (f) depicts $l_i \not\leftrightarrow_1 l_j \wedge l_i \leftrightarrow_2 l_j \wedge l_i \leftrightarrow_\theta l_j$. (g) depicts $l_i \leftrightarrow_1 l_j \wedge l_i \not\leftrightarrow_2 l_j \wedge l_i \leftrightarrow_\theta l_j$. (h) depicts $l_i \leftrightarrow_1 l_j \wedge l_i \leftrightarrow_2 l_j \wedge l_i \leftrightarrow_\theta l_j$. The $\bowtie$ relation is depicted by assigning layer indices to line segments. (i) depicts $l_i \bowtie l_j$. (j) depicts $l_i \not\bowtie l_j$.

An interpretation is admissible if the following conditions hold:

• For all $l_i$ and $l_j$, if $l_i \leftrightarrow_1 l_j$, $l_i \leftrightarrow_2 l_j$, or $l_i \leftrightarrow_\theta l_j$, then $l_i$ intersects $l_j$. In other words, attached line segments must intersect.

• For all $l_i$ and $l_j$, $l_i \leftrightarrow_1 l_j$ iff $l_j \leftrightarrow_2 l_i$.

• $\leftrightarrow_\theta$ is symmetric.

• For all $l_i$ and $l_j$, if $l_i \bowtie l_j$, then $l_i$ and $l_j$ do not overlap. In other words, line segments on the same layer must not overlap. Two line segments overlap if they intersect in a noncollinear fashion and the point of intersection is not an endpoint of either line segment.

• $\bowtie$ is symmetric and transitive.
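Given the two geometric predicates (segment intersection, and the overlap test defined above), the admissibility conditions are purely relational and can be checked directly. The sketch below is illustrative only: the data layout (name-indexed segments, relations as sets of ordered pairs) and the caller-supplied intersects predicate are assumptions, not the paper's implementation.

```python
from itertools import product

def crosses_interior(p, q):
    """The text's notion of 'overlap': p and q intersect noncollinearly at a
    point that is not an endpoint of either segment.  Segments are endpoint
    pairs ((x1, y1), (x2, y2))."""
    (x1, y1), (x2, y2) = p
    (x3, y3), (x4, y4) = q
    d1x, d1y = x2 - x1, y2 - y1
    d2x, d2y = x4 - x3, y4 - y3
    denom = d1x * d2y - d1y * d2x
    if denom == 0:                      # parallel or collinear: no noncollinear crossing
        return False
    t = ((x3 - x1) * d2y - (y3 - y1) * d2x) / denom
    u = ((x3 - x1) * d1y - (y3 - y1) * d1x) / denom
    return 0 < t < 1 and 0 < u < 1      # strictly interior to both segments

def admissible(segments, j1, j2, jtheta, same_layer, intersects):
    """Check the admissibility conditions on an interpretation.

    segments   -- dict mapping segment names to endpoint pairs
    j1, j2, jtheta, same_layer -- relations as sets of (name, name) pairs
    intersects -- caller-supplied predicate on two segments (sharing any
                  point, including endpoints and collinear contact)
    """
    flip = lambda r: {(b, a) for (a, b) in r}
    # 1. Attached line segments must intersect.
    if any(not intersects(segments[a], segments[b])
           for (a, b) in j1 | j2 | jtheta):
        return False
    # 2. l_i <->_1 l_j iff l_j <->_2 l_i.
    if flip(j1) != j2:
        return False
    # 3. <->_theta is symmetric.
    if flip(jtheta) != jtheta:
        return False
    # 4. Line segments on the same layer must not overlap.
    if any(crosses_interior(segments[a], segments[b]) for (a, b) in same_layer):
        return False
    # 5. The same-layer relation is symmetric and transitive (checked literally;
    #    note that symmetry plus transitivity entails the reflexive pairs for
    #    any segment that is related to another).
    if flip(same_layer) != same_layer:
        return False
    return all((a, d) in same_layer
               for (a, b), (c, d) in product(same_layer, repeat=2) if b == c)
```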