From: AAAI-00 Proceedings. Copyright © 2000, AAAI (www.aaai.org). All rights reserved.

Visual Event Classification via Force Dynamics

Jeffrey Mark Siskind
NEC Research Institute, Inc.
4 Independence Way
Princeton NJ 08540 USA
609/951-2705
[email protected]
http://www.neci.nj.nec.com/homepages/qobi

Abstract

This paper presents an implemented system, called LEONARD, that classifies simple spatial motion events, such as pick up and put down, from video input. Unlike previous systems that classify events based on their motion profile, LEONARD uses changes in the state of force-dynamic relations, such as support, contact, and attachment, to distinguish between event types. This paper presents an overview of the entire system, along with the details of the algorithm that recovers force-dynamic interpretations using prioritized circumscription and a stability test based on a reduction to linear programming. This paper also presents an example illustrating the end-to-end performance of LEONARD classifying an event from video input.

Introduction

People can describe what they see. If someone were to pick up a block and ask you what you saw, you could say The person picked up the block. In doing so, you describe both objects, like people and blocks, and events, like pickings up. Most recognition research in machine vision has focussed on recognising objects. In contrast, this paper describes a system for recognising events. Objects correspond roughly to the noun vocabulary in language. In contrast, events correspond roughly to the verb vocabulary in language. The overall goal of this research is to ground the lexical semantics of verbs in visual perception.

A number of reported systems can classify event occurrences from video or simulated video, among them Yamoto, Ohya, & Ishii (1992), Regier (1992), Pinhanez & Bobick (1995), Starner (1995), Siskind & Morris (1996), Bailey et al. (1998), and Bobick & Ivanov (1998). While they differ in their details, by and large these systems classify event occurrences by their motion profile. For example, a pick up event is described as a sequence of two subevents: the agent moving towards the patient while the patient is at rest above the source, followed by the agent moving with the patient away from the source. Such systems use some combination of relative and absolute; linear and angular; positions, velocities, and accelerations as the features that drive classification. These systems follow the tradition of linguists and cognitive scientists, such as Leech (1969), Miller (1972), Schank (1973), Jackendoff (1983), or Pinker (1989), that represent the lexical semantics of verbs via the causal, aspectual, and directional qualities of motion.

Some linguists and cognitive scientists, such as Herskovits (1986) and Jackendoff & Landau (1991), have argued that force-dynamic relations (Talmy 1988), such as support, contact, and attachment, are crucial for representing the lexical semantics of spatial prepositions. For example, in some situations, part of what it means for one object to be on another object is for the former to be in contact with, and supported by, the latter. In other situations, something can be on something else by way of attachment, as in the knob on the door. Siskind (1992) has argued that changes in the state of force-dynamic relations play a more central role in specifying the lexical semantics of simple spatial motion verbs than motion profile. The particular linear and angular velocities and accelerations don't matter when picking something up or putting something down. What matters is a state change. When picking something up, the patient is initially supported by being on top of the source. Subsequently, the patient is supported by being attached to the agent. Likewise, when putting something down, the reverse is true. The patient starts out being supported by being attached to the agent. It is subsequently supported by being on top of the goal. Furthermore, what distinguishes putting something down from dropping it is that, in the former, the patient is always supported, while in the latter, the patient undergoes unsupported motion.

Siskind (1995), among others, describes a system for recovering force-dynamic relations from simulated video and using those relations to perform event classification. Mann, Jepson, & Siskind (1997), among others, describe a system for recovering force-dynamic relations from video but do not use those relations to perform event classification. This paper describes a system, called LEONARD, that recovers force-dynamic relations from video and uses those relations to perform event classification. It is the first reported system that goes all the way from video to event classification using recovered force dynamics.

LEONARD is a complex, comprehensive system. Video input is processed using a real-time colour- and motion-based segmentation procedure to place a convex polygon around each participant object in each input frame. A tracking procedure then computes the correspondence between the polygons in each frame and those in adjacent frames. LEONARD then constructs force-dynamic interpretations of the resulting polygon movie. These interpretations are constructed out of predicates that describe the attachment relations between objects, the qualitative depth of objects, and their groundedness. Some interpretations are consistent in that they describe stable scenes. Others are inconsistent in that they describe unstable scenes. LEONARD performs model reconstruction, selecting as models only those interpretations that explain the stability of the scene. Kinematic stability analysis is performed efficiently via a reduction to linear programming. There are usually multiple models, i.e. stable interpretations, of each scene. LEONARD selects a preferred subset of models using prioritized, cardinality, and temporal circumscription. Event classification is efficiently performed on this preferred subset of models using an interval-based event logic. A precise description of the entire system is beyond the scope of this paper. The remainder of this paper focuses on kinematic stability analysis and model reconstruction. It also presents an example of the entire system in operation. Future papers will describe other components of this system in greater detail.

Kinematic Stability Analysis

Let us consider a simplified world that consists of line segments. Polygons can be treated as collections of rigidly attached line segments. Let us denote line segments by the symbol l. In this simplified world, some line segments will not need to be supported. Such line segments are said to be grounded. Let us denote the fact that l is grounded by the property g(l). In this simplified world, the table top and the agent's hand will be grounded.

A scene is a set L of line segments. An interpretation of a scene is a quintuple ⟨g, ↔1, ↔2, ↔θ, ⋈⟩. It is convenient to depict scene interpretations graphically. Figure 1 shows the graphical representation of the predicates g, ↔1, ↔2, ↔θ, and ⋈.

Figure 1: A graphical representation of scene interpretations. Consider the vertical lines to be li and the horizontal lines to be lj. (a) depicts g(lj); (b) through (h) depict the combinations of the li ↔1 lj, li ↔2 lj, and li ↔θ lj relations and their negations. The ⋈ relation is depicted by assigning layer indices to line segments: (i) depicts li ⋈ lj and (j) depicts ¬(li ⋈ lj).
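Concretely, a scene and an interpretation can be represented directly from these definitions. The sketch below is illustrative Python with invented names, not LEONARD's actual representation: a scene is a list of segments, and an interpretation is the quintuple of a grounded set and four index-pair relations.

```python
from dataclasses import dataclass
from typing import FrozenSet, Tuple

Point = Tuple[float, float]

@dataclass(frozen=True)
class Segment:
    p: Point          # endpoint p(l)
    q: Point          # endpoint q(l)

@dataclass(frozen=True)
class Interpretation:
    """The quintuple <g, joined1, joined2, joined_theta, same_layer>.

    g is the set of grounded segment indices; each relation is a set of
    segment-index pairs.
    """
    grounded: FrozenSet[int] = frozenset()
    joined1: FrozenSet[Tuple[int, int]] = frozenset()
    joined2: FrozenSet[Tuple[int, int]] = frozenset()
    joined_theta: FrozenSet[Tuple[int, int]] = frozenset()
    same_layer: FrozenSet[Tuple[int, int]] = frozenset()

# Example: a grounded table-top segment with a block segment resting on it.
table = Segment((0.0, 0.0), (4.0, 0.0))
block = Segment((1.0, 0.0), (1.0, 1.0))
scene = [table, block]
interp = Interpretation(grounded=frozenset({0}),
                        same_layer=frozenset({(0, 1), (1, 0)}))
```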
In this simplified world, line segments can be joined together. If li and lj are joined, the constraint on their relative motion is specified by three relations: ↔1, ↔2, and ↔θ. If li ↔1 lj, then the position of the joint along li is fixed. Likewise, if li ↔2 lj, then the position of the joint along lj is fixed. And if li ↔θ lj, then the relative orientation of li and lj is fixed. Combinations of these three relations allow specifying a variety of joint types. If li ↔1 lj ∧ li ↔2 lj ∧ li ↔θ lj, then li and lj are rigidly joined. If li ↔1 lj ∧ li ↔2 lj ∧ ¬(li ↔θ lj), then li and lj are joined by a revolute joint. If ¬(li ↔1 lj) ∧ li ↔2 lj ∧ li ↔θ lj, then li and lj are joined by a prismatic joint that allows lj to slide along li. If li ↔1 lj ∧ ¬(li ↔2 lj) ∧ li ↔θ lj, then li and lj are joined by a prismatic joint that allows li to slide along lj. If ¬(li ↔1 lj) ∧ ¬(li ↔2 lj) ∧ ¬(li ↔θ lj), then li and lj are not joined. A total of eight different kinds of joints are possible, including ones that are simultaneously revolute and prismatic. In this simplified world, line segments reside on parallel planes that are perpendicular to the focal axis of the observer.

An interpretation is admissible if the following conditions hold:
• For all li and lj, if li ↔1 lj, li ↔2 lj, or li ↔θ lj, then li intersects lj. In other words, attached line segments must intersect.
• For all li and lj, li ↔1 lj iff lj ↔2 li.
• ↔θ is symmetric.
• For all li and lj, if li ⋈ lj, then li and lj do not overlap. In other words, line segments on the same layer must not overlap. Two line segments overlap if they intersect in a noncollinear fashion and the point of intersection is not an endpoint of either line segment.
• ⋈ is symmetric and transitive.

(Note that whether two line segments intersect or overlap is purely a geometric property of the scene L and is independent of any interpretation of that scene.) Only admissible interpretations will be considered.
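The admissibility conditions can be checked mechanically. The following sketch is illustrative, not LEONARD's code: the helper and relation names are invented, segments are endpoint pairs, each relation is a set of index pairs, and reflexive same-layer pairs are ignored for brevity in the transitivity check.

```python
from itertools import product

def _cross(o, a, b):
    # z-component of (a - o) x (b - o); zero iff o, a, b are collinear.
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def _on_segment(seg, p):
    (a, b) = seg
    return (_cross(a, b, p) == 0 and
            min(a[0], b[0]) <= p[0] <= max(a[0], b[0]) and
            min(a[1], b[1]) <= p[1] <= max(a[1], b[1]))

def intersects(s, t):
    (p1, p2), (p3, p4) = s, t
    d1, d2 = _cross(p3, p4, p1), _cross(p3, p4, p2)
    d3, d4 = _cross(p1, p2, p3), _cross(p1, p2, p4)
    if ((d1 > 0) != (d2 > 0)) and ((d3 > 0) != (d4 > 0)) and 0 not in (d1, d2, d3, d4):
        return True          # proper interior crossing
    return any(_on_segment(u, p) for u, p in ((t, p1), (t, p2), (s, p3), (s, p4)))

def overlaps(s, t):
    # Noncollinear intersection not at an endpoint of either segment.
    (p1, p2), (p3, p4) = s, t
    d1, d2 = _cross(p3, p4, p1), _cross(p3, p4, p2)
    d3, d4 = _cross(p1, p2, p3), _cross(p1, p2, p4)
    return ((d1 > 0) != (d2 > 0)) and ((d3 > 0) != (d4 > 0)) and 0 not in (d1, d2, d3, d4)

def admissible(scene, j1, j2, jtheta, same_layer):
    idx = range(len(scene))
    for i, j in product(idx, idx):
        if i == j:
            continue
        attached = (i, j) in j1 or (i, j) in j2 or (i, j) in jtheta
        if attached and not intersects(scene[i], scene[j]):
            return False                 # attached segments must intersect
        if ((i, j) in j1) != ((j, i) in j2):
            return False                 # li joined1 lj iff lj joined2 li
        if ((i, j) in jtheta) != ((j, i) in jtheta):
            return False                 # the orientation relation is symmetric
        if (i, j) in same_layer:
            if (j, i) not in same_layer:
                return False             # the same-layer relation is symmetric
            if overlaps(scene[i], scene[j]):
                return False             # same-layer segments must not overlap
    # The same-layer relation must be transitive (reflexive pairs omitted).
    for (i, j), (k, m) in product(same_layer, same_layer):
        if j == k and i != m and (i, m) not in same_layer:
            return False
    return True

# A block standing on a table top, touching it at one endpoint: admissible.
table = ((0.0, 0.0), (4.0, 0.0))
block = ((1.0, 0.0), (1.0, 1.0))
print(admissible([table, block], {(0, 1)}, {(1, 0)}, set(), {(0, 1), (1, 0)}))  # True
```

Note that, as the conditions state, `intersects` and `overlaps` depend only on the scene geometry, never on the interpretation.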
This simplified world uses an impoverished notion of depth. All that is important is whether two given line segments reside on the same plane. Such line segments are said to be on the same layer. Let us denote the fact that li and lj are on the same layer by the relation li ⋈ lj. This impoverished notion of depth lacks any notion of depth order. It cannot model objects being in front of or behind other objects. It also lacks any notion of adjacency in depth. It cannot model objects touching one another along the focal axis of the observer.

Statically, the line segments in a scene have fixed positions and orientations. Let us denote the coordinates of a point p as x(p) and y(p). And let us denote the endpoints of a line segment l as p(l) and q(l). And let us denote the length of a line segment l as ||l||. And let us denote the position of a line segment l as the position of its midpoint c(l). And let us denote the orientation θ(l) of a line segment l as the angle of the vector from p(l) to q(l). The quantities p(l), q(l), ||l||, c(l), and θ(l) are all fixed given a static scene. And p(l) and q(l) are related to ||l||, c(l), and θ(l) as follows:

c(l) = (p(l) + q(l)) / 2
||l|| = √((q(l) − p(l)) · (q(l) − p(l)))
θ(l) = tan⁻¹ [(y(q(l)) − y(p(l))) / (x(q(l)) − x(p(l)))]
x(p(l)) = x(c(l)) − (1/2) ||l|| cos θ(l)
y(p(l)) = y(c(l)) − (1/2) ||l|| sin θ(l)
x(q(l)) = x(c(l)) + (1/2) ||l|| cos θ(l)
y(q(l)) = y(c(l)) + (1/2) ||l|| sin θ(l)

Let us postulate an unknown instantaneous motion for each line segment in the scene. This can be represented by associating a linear and angular velocity with each line segment. Let us denote such velocities with the variables ċ(l) and θ̇(l). Let us assume that there is no motion in depth, so the ⋈ relation does not change. If the scene contains n line segments, then there will be 3n scalar variables, because ċ has x and y components. Assuming that the line segments are rigid, i.e. that instantaneous motion does not lead to a change in their length, one can relate ṗ(l) and q̇(l), the instantaneous velocities of the endpoints, to ċ(l) and θ̇(l), using the chain rule as follows:

ṗ(l) = [∂p(l)/∂x(c(l))] ẋ(c(l)) + [∂p(l)/∂y(c(l))] ẏ(c(l)) + [∂p(l)/∂θ(l)] θ̇(l)
q̇(l) = [∂q(l)/∂x(c(l))] ẋ(c(l)) + [∂q(l)/∂y(c(l))] ẏ(c(l)) + [∂q(l)/∂θ(l)] θ̇(l)

Note that ṗ(l) and q̇(l) are linear in ċ(l) and θ̇(l).

Each of the components of an admissible interpretation of a scene can be viewed as imposing constraints on the instantaneous motions of the line segments in that scene. The simplest case is the g property. If g(l), then ċ(l) = 0 and θ̇(l) = 0. Note that these equations are linear in ċ(l) and θ̇(l).

Let us now consider the li ↔1 lj and li ↔2 lj relations and the constraint that they impose on the motions of li and lj. First, let us denote the intersection of li and lj as I(li, lj). If we let

A = [ y(p(li)) − y(q(li))    x(q(li)) − x(p(li)) ]
    [ y(p(lj)) − y(q(lj))    x(q(lj)) − x(p(lj)) ]

b = [ y(p(li)) (x(q(li)) − x(p(li))) + x(p(li)) (y(p(li)) − y(q(li))) ]
    [ y(p(lj)) (x(q(lj)) − x(p(lj))) + x(p(lj)) (y(p(lj)) − y(q(lj))) ]

then I(li, lj) = A⁻¹ b.

Next, let us compute İ(li, lj), the velocity of the intersection of the two line segments li and lj as they move. Let α be the vector containing the elements x(p(li)), y(p(li)), x(q(li)), y(q(li)), x(p(lj)), y(p(lj)), x(q(lj)), and y(q(lj)). And let β be the vector containing the elements x(c(li)), y(c(li)), θ(li), x(c(lj)), y(c(lj)), and θ(lj). And let γ be the vector where γk = ∂I/∂αk. And let D be the matrix where Dkl = ∂αk/∂βl. İ(li, lj) can be computed by the chain rule as follows:

İ(li, lj) = γᵀ D β̇

Note that İ(li, lj) is linear in ċ(li), θ̇(li), ċ(lj), and θ̇(lj) because all of the partial derivatives are constant.

Next, let us denote by ρ(p, l), where p is a point on l, the fraction of the distance where p lies between p(l) and q(l):

ρ(p, l) = (y(p) − y(p(l))) / (y(q(l)) − y(p(l)))    if x(p(l)) = x(q(l))
ρ(p, l) = (x(p) − x(p(l))) / (x(q(l)) − x(p(l)))    otherwise

And if 0 ≤ ρ ≤ 1, let us denote by l(ρ) the point that is the fraction ρ of the distance between p(l) and q(l):

l(ρ) = p(l) + ρ (q(l) − p(l))

And let us denote by l̇(ρ) the velocity of the point that is the fraction ρ of the distance between p(l) and q(l) as l moves. Let α be the vector containing the elements x(p(l)), y(p(l)), x(q(l)), and y(q(l)). And let β be the vector containing the elements x(c(l)), y(c(l)), and θ(l). And let γ be the vector where γk = ∂l(ρ)/∂αk. And let D be the matrix where Dkl = ∂αk/∂βl. Again, by the chain rule:

l̇(ρ) = γᵀ D β̇

Again, note that l̇(ρ) is linear in ċ(l) and θ̇(l) because all of the partial derivatives are constant.

The ↔1 constraint can now be formulated as follows: if li ↔1 lj, then

l̇i(ρ(I(li, lj), li)) = İ(li, lj)

And the ↔2 constraint can now be formulated as follows: if li ↔2 lj, then

l̇j(ρ(I(li, lj), lj)) = İ(li, lj)

Again, note that these equations are linear in ċ(li), θ̇(li), ċ(lj), and θ̇(lj).

Let us now consider the li ↔θ lj relation and the constraint that it imposes on the motions of li and lj. If li ↔θ lj, then

θ̇(li) = θ̇(lj)

Again, note that this equation is linear in θ̇(li) and θ̇(lj).
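The static geometry underlying these constraints is straightforward to compute. The following is an illustrative Python sketch, not LEONARD's code: the midpoint, length, and orientation of a segment, the intersection I(li, lj) = A⁻¹b solved with Cramer's rule, and the ρ and l(ρ) maps.

```python
import math

# A segment l is a pair of endpoints ((x, y), (x, y)); names are invented.

def c(l):
    """Midpoint c(l) = (p(l) + q(l)) / 2."""
    (p, q) = l
    return ((p[0] + q[0]) / 2.0, (p[1] + q[1]) / 2.0)

def length(l):
    """||l||."""
    (p, q) = l
    return math.hypot(q[0] - p[0], q[1] - p[1])

def theta(l):
    """Orientation of the vector from p(l) to q(l)."""
    (p, q) = l
    return math.atan2(q[1] - p[1], q[0] - p[0])

def intersection(li, lj):
    """I(li, lj) = A^-1 b, solved with Cramer's rule.

    det is zero iff the lines through li and lj are parallel.
    """
    (pi, qi), (pj, qj) = li, lj
    a11, a12 = pi[1] - qi[1], qi[0] - pi[0]
    a21, a22 = pj[1] - qj[1], qj[0] - pj[0]
    b1 = pi[1] * (qi[0] - pi[0]) + pi[0] * (pi[1] - qi[1])
    b2 = pj[1] * (qj[0] - pj[0]) + pj[0] * (pj[1] - qj[1])
    det = a11 * a22 - a12 * a21
    return ((b1 * a22 - b2 * a12) / det, (a11 * b2 - a21 * b1) / det)

def rho(p, l):
    """Fraction of the way from p(l) to q(l) at which the point p lies."""
    (pl, ql) = l
    if pl[0] == ql[0]:                    # vertical segment: use y instead
        return (p[1] - pl[1]) / (ql[1] - pl[1])
    return (p[0] - pl[0]) / (ql[0] - pl[0])

def point_at(l, r):
    """l(rho) = p(l) + rho (q(l) - p(l))."""
    (pl, ql) = l
    return (pl[0] + r * (ql[0] - pl[0]), pl[1] + r * (ql[1] - pl[1]))

li = ((0.0, 0.0), (2.0, 2.0))
lj = ((0.0, 2.0), (2.0, 0.0))
I = intersection(li, lj)
print(I)                                  # (1.0, 1.0)
print(point_at(li, rho(I, li)))           # (1.0, 1.0)
```

As a sanity check, interpolating back with l(ρ(I, l)) reproduces the intersection point, which is exactly the identity the ↔1 and ↔2 constraints rely on.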

The same-layer relation li ⋈ lj imposes the constraint that the motion of li and lj must not lead to an instantaneous penetration of one by the other. An instantaneous penetration can occur only when the endpoint of one line segment touches the other line segment. Without loss of generality, let us assume that p(li) touches lj. Let p⊥ denote a vector of the same magnitude as p rotated counterclockwise 90°:

(x, y)⊥ = (−y, x)

Let σ be a vector that is normal to lj, in the direction towards li:

σ = [(q(lj) − p(lj))⊥ · (q(li) − p(li))] (q(lj) − p(lj))⊥

An instantaneous penetration can occur only when the velocity of p(li) in the direction of σ is less than the velocity of the point of contact in the same direction. The velocity of p(li) is ṗ(li). And the velocity of the point of contact is l̇j(ρ(p(li), lj)). Thus if li ⋈ lj and p(li) touches lj, then

ṗ(li) · σ ≥ l̇j(ρ(p(li), lj)) · σ    (1)

Again, note that this inequality is linear in ċ(li), θ̇(li), ċ(lj), and θ̇(lj).

We wish to determine the stability of a scene under an admissible interpretation. A scene is unstable if there is an assignment of linear and angular velocities to the line segments in the scene that satisfies the above constraints and decreases the potential energy of the scene. The potential energy of a scene is the sum of the potential energies of the line segments in that scene. The potential energy of a line segment l is proportional to its mass times y(c(l)). We can take the mass of a line segment to be proportional to its length. So the potential energy E can be taken as Σ_{l∈L} ||l|| y(c(l)). The potential energy can decrease if Ė < 0. By scale invariance, if Ė can be less than zero, then it can be equal to any value less than zero, in particular −1. Thus a scene is unstable under an admissible interpretation iff the constraint Ė = −1 is consistent with the above constraints. Note that Ė is linear in all of the ċ(l) values. Thus the stability of a scene under an admissible interpretation can be determined by a reduction to linear programming.

Model Reconstruction

Let us define a model of a scene as an admissible interpretation under which the scene is stable. LEONARD enumerates the models of each frame in each movie it processes. This is called model reconstruction. There are usually multiple models of any given scene. For example, if the scene contains a collection of overlapping polygons, then the polygons can be rigidly attached and any of them can be grounded. In a certain sense, some models make weaker assumptions than others. For example, it is always possible to explain the stability of an object by grounding it. Thus a model that has fewer grounded objects makes weaker assumptions than a model with more grounded objects. Similarly, whenever it is possible to explain the stability of an object that is supported by being above and on the same layer as another stable object, it is also possible to explain its stability by instead being attached to that object. Thus a model that has fewer attachment relations makes weaker assumptions than a model with more attachment relations. Accordingly, during model reconstruction, LEONARD selects models that make the weakest assumptions. It does so by a process of prioritized circumscription (McCarthy 1980).

To limit the search space, LEONARD considers only rigid and revolute joints. It does not consider prismatic joints. In other words, only interpretations that meet the following constraint are considered:

(∀ li, lj ∈ L) [(li ↔1 lj) ↔ (li ↔2 lj)] ∧ [(li ↔θ lj) → (li ↔1 lj)]

Let us define several preference relations between components of interpretations. First, let us define a preference relation between two grounded properties. Let us say that g is preferred to g′, denoted g ≺ g′, if g ≠ g′ and (∀ l ∈ L) g(l) → g′(l). In other words, g is preferred to g′ if they are different and every grounded line segment in the former is grounded in the latter. Along these lines, let us define a similar preference relation between two ↔1 relations. Let us say that ↔1 is preferred to ↔1′, denoted ↔1 ≺ ↔1′, if ↔1 ≠ ↔1′ and (∀ li, lj ∈ L) li ↔1 lj → li ↔1′ lj. In other words, ↔1 is preferred to ↔1′ if they are different and every pair of line segments that is attached in the former is attached in the latter. Similarly, let us define a preference relation between two ↔θ relations. Let us say that ↔θ is preferred to ↔θ′, denoted ↔θ ≺ ↔θ′, if ↔θ ≠ ↔θ′ and (∀ li, lj ∈ L) li ↔θ lj → li ↔θ′ lj. In other words, ↔θ is preferred to ↔θ′ if they are different and every pair of line segments that is rigidly attached in the former is rigidly attached in the latter. Finally, let us define a preference relation between two same-layer relations. Let us say that ⋈ is preferred to ⋈′, denoted ⋈ ≺ ⋈′, if ⋈ ≠ ⋈′ and (∀ li, lj ∈ L) li ⋈ lj → li ⋈′ lj. In other words, ⋈ is preferred to ⋈′ if they differ and every pair of line segments that is on the same layer in the former is on the same layer in the latter.

Given a set G of grounded properties, its minimal elements ĝ are those grounded properties g ∈ G such that there is no g′ ∈ G where g′ ≺ g. Similarly for the ↔1, ↔θ, and ⋈ relations. LEONARD uses the following prioritized circumscription process. It first finds all minimal grounded properties ĝ such that there exist relations ↔1, ↔θ, and ⋈ where the interpretation ⟨ĝ, ↔1, ↔2, ↔θ, ⋈⟩ is admissible and the scene is stable under that interpretation. Note that there must be at least one such minimal grounded property ĝ. For each such minimal grounded property ĝ, it then finds all minimal ↔̂1 relations such that there exist relations ↔θ and ⋈ where the interpretation ⟨ĝ, ↔̂1, ↔̂1, ↔θ, ⋈⟩ is admissible and the scene is stable under that interpretation. Note that there must be at least one such minimal ↔̂1 relation for each minimal grounded property. For each such minimal ↔̂1 relation, taken with the corresponding minimal grounded property ĝ, it then finds all minimal ↔̂θ relations such that there exists a same-layer relation ⋈ where the interpretation ⟨ĝ, ↔̂1, ↔̂1, ↔̂θ, ⋈⟩ is admissible and the scene is stable under that interpretation. Note that there must be at least one such minimal ↔̂θ relation for each minimal ↔̂1 relation. Finally, for each such minimal ↔̂θ relation, taken with the corresponding minimal ↔̂1 relation and minimal grounded property ĝ, it then finds all minimal same-layer relations ⋈̂ such that the interpretation ⟨ĝ, ↔̂1, ↔̂1, ↔̂θ, ⋈̂⟩ is admissible and the scene is stable under that interpretation. Note that there must be at least one such minimal same-layer relation ⋈̂ for each such minimal ↔̂θ relation. For each such minimal same-layer relation ⋈̂, taken with the corresponding minimal ↔̂θ and ↔̂1 relations and minimal grounded property ĝ, prioritized circumscription returns the interpretation ⟨ĝ, ↔̂1, ↔̂1, ↔̂θ, ⋈̂⟩ as a minimal model of the scene.
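Each stability query made by the steps above is the linear-programming feasibility test described earlier. The toy sketch below assumes SciPy is available and is not LEONARD's implementation: it assembles only the grounding equalities and the potential-energy row (the joint equalities and same-layer inequalities derived in the kinematic stability analysis are omitted), then checks whether Ė = −1 is consistent.

```python
import numpy as np
from scipy.optimize import linprog

def unstable(lengths, grounded):
    """True iff Edot = -1 is consistent with the assembled constraints.

    The unknowns are (xdot(c(l)), ydot(c(l)), thetadot(l)) per segment.
    """
    n = len(lengths)
    rows, rhs = [], []
    e_row = np.zeros(3 * n)               # Edot = sum_l ||l|| ydot(c(l))
    for k, wl in enumerate(lengths):
        e_row[3 * k + 1] = wl
    rows.append(e_row); rhs.append(-1.0)
    for k in grounded:                    # g(l): cdot(l) = 0, thetadot(l) = 0
        for off in range(3):
            row = np.zeros(3 * n)
            row[3 * k + off] = 1.0
            rows.append(row); rhs.append(0.0)
    res = linprog(np.zeros(3 * n), A_eq=np.array(rows), b_eq=np.array(rhs),
                  bounds=[(None, None)] * (3 * n))
    return res.status == 0                # feasible => the scene can lose energy

print(unstable([2.0], grounded=[]))       # True: a free segment can fall
print(unstable([2.0], grounded=[0]))      # False: a grounded segment cannot
```

Because the objective is zero, the solver is being used purely as a feasibility oracle, mirroring the paper's observation that Ė = −1 either is or is not consistent with the linear constraints.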
The above prioritized circumscription procedure orders models by an inclusion relation and selects minimal models according to that inclusion relation. This has the following disadvantage. If a scene has one block resting on top of another block, there will be two minimal models: one where the bottom block is grounded and the top block is on the same layer as the bottom block, and one where the top block is grounded and the bottom block is attached to the top block. The latter has more attachment assertions than the former but is still minimal because it is generated for a different minimal grounded property ĝ. Nonetheless, it is desirable to prefer the former model to the latter model. This is done by a second pass of circumscription that uses a cardinality-based preference metric rather than one based on inclusion. Let us define the cardinality of a grounded property g, denoted ||g||, as the number of line segments l ∈ L for which g(l) is true. And let us define the cardinality of a ↔1 relation, denoted ||↔1||, as the number of pairs of line segments li, lj ∈ L for which li ↔1 lj is true. Similarly for the ↔θ and ⋈ relations. Furthermore, let us define the cardinality of an interpretation I, denoted ||I||, as the quadruple ⟨||g||, ||↔1||, ||↔θ||, ||⋈||⟩. Now, let us define a preference relation between two interpretations. Let us say that I is preferred to I′, denoted I ≺ I′, if ||I|| is lexicographically less than ||I′||. Given a set of interpretations, its minimal elements are those interpretations I in the set such that there is no I′ in the set where I′ ≺ I. Cardinality circumscription returns the minimal elements of the set of minimal models produced by prioritized circumscription.

Prioritized and cardinality circumscription are not sufficient to prune all of the spurious models. The following situation often arises. During a pick up event, the agent is grounded before it grasps the patient, but once the patient is grasped and lifted there is an ambiguity as to whether the agent is grounded and the patient is supported by being attached to the agent, or whether the patient is grounded and the agent is supported by being attached to the patient. In some sense, the former is preferable to the latter, since it does not require the ground assertion to move from the agent to the patient. More generally, sequences of models are preferred when they entail fewer changes in assertions. This is a form of temporal circumscription (Mann & Jepson 1998). Taken together, prioritized, cardinality, and temporal circumscription typically yield a small number of models for each movie that usually correspond to natural pretheoretic human intuition.

Examples

The techniques described in this paper have been implemented in a system called LEONARD. Figure 2 shows the results of processing four short movies with LEONARD. Each column shows a subset of the frames from a single movie. From left to right, the movies have 29, 34, 16, and 16 frames respectively. Each movie was processed by the segmentation procedure to place a convex polygon around the coloured and moving objects in each frame. The tracking procedure computed the correspondence between the polygons in each frame and those in the adjacent frames. The model reconstruction procedure was used to construct force-dynamic models of each frame. Prioritized, cardinality, and temporal circumscription were used to prune the space of models. Each frame is shown with the results of segmentation and model reconstruction superimposed on the original video image.

For the leftmost column, notice that LEONARD determines that the lower block and the hand are grounded for the entire movie, that the upper block is supported by being on the same layer as the lower block for frames 2, 4, and 8, and that the upper block is supported by being rigidly attached to the hand for frames 20, 22, and 24. For the second column, notice that LEONARD determines that the lower block and the hand are grounded for the entire movie, that the upper block is supported by being rigidly attached to the hand for frames 5 and 10, and that the upper block is supported by being on the same layer as the lower block for frames 21, 24, 26, and 29. For the third column, notice that LEONARD determines that the lower block and the hand are grounded for the entire movie and that the upper block is supported for the entire movie by being on the same layer as the lower block. For the rightmost column, notice that LEONARD determines that the lower block and the hand are grounded for the entire movie and that the upper block is supported for the entire movie by being rigidly attached to the hand.

LEONARD was given the following event definitions when processing these four movies:

PICKUP(x, y) ≜ ¬SUPPORTED(x) ∧ SUPPORTED(y) ∧ (¬ATTACHED(x, y); ATTACHED(x, y))

PUTDOWN(x, y) ≜ ¬SUPPORTED(x) ∧ SUPPORTED(y) ∧ (ATTACHED(x, y); ¬ATTACHED(x, y))

Essentially, these define pick up and put down as events where the agent is not supported throughout the event, the patient is supported throughout the event, and the agent grasps or releases the patient respectively. LEONARD correctly recognises the movie in the leftmost column as depicting a pick up event and the movie in the second column as depicting a put down event. More importantly, LEONARD correctly recognises that the remaining two movies do not depict any of the defined event types. Note that systems that classify events based on motion profiles will often mistakenly classify these last two movies as either pick up or put down events because they have similar motion profiles.
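LEONARD evaluates such definitions with an interval-based event logic. As a rough, simplified illustration only (whole-movie intervals, an invented per-frame encoding, and invented names), the two definitions can be checked against a sequence of per-frame force-dynamic models:

```python
def classify(frames, x='agent', y='patient'):
    """Return 'pick up', 'put down', or None.

    Each frame is a dict with 'supported' (a set of objects) and
    'attached' (a set of object pairs).
    """
    if not frames:
        return None
    if any(x in f['supported'] for f in frames):
        return None                       # NOT SUPPORTED(x) must hold throughout
    if not all(y in f['supported'] for f in frames):
        return None                       # SUPPORTED(y) must hold throughout
    attach = [(x, y) in f['attached'] for f in frames]
    # (NOT ATTACHED(x, y); ATTACHED(x, y)): one false-to-true transition.
    if not attach[0] and attach[-1] and attach == sorted(attach):
        return 'pick up'
    # (ATTACHED(x, y); NOT ATTACHED(x, y)): one true-to-false transition.
    if attach[0] and not attach[-1] and attach == sorted(attach, reverse=True):
        return 'put down'
    return None

print(classify([
    {'supported': {'patient'}, 'attached': set()},
    {'supported': {'patient'}, 'attached': {('agent', 'patient')}},
]))                                       # prints: pick up
```

A movie in which the patient is ever unsupported, such as a drop, fails the SUPPORTED(y) conjunct and is classified as neither event type, mirroring the behaviour described above for the last two movies.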
Conclusion

I have presented a comprehensive system that recovers force-dynamic interpretations from video and uses those interpretations to recognise event occurrences. Force dynamics is fundamentally more robust than motion profile for classifying events. Two occurrences of the same event type
might have very different motion profiles. And occurrences of two different event types might have very similar motion profiles. This has been demonstrated by an implementation that distinguishes between occurrences and nonoccurrences of pick up and put down events despite similar motion profiles.

Figure 2: The results of processing four short movies with LEONARD. The segmented polygons and force-dynamic models for each frame are overlayed on the video image for that frame. LEONARD successfully recognises the movie in the leftmost column as a pick up event and the movie in the second column as a put down event.

Acknowledgments

Amit Roy Chowdhury implemented an early version of the stability-checking algorithm described in this paper.

References

Bailey, D. R.; Chang, N.; Feldman, J.; and Narayanan, S. 1998. Extending embodied lexical development. In Proceedings of the 20th Annual Conference of the Cognitive Science Society.
Bobick, A. F., and Ivanov, Y. A. 1998. Action recognition using probabilistic parsing. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 196–202.
Herskovits, A. 1986. Language and Spatial Cognition: An Interdisciplinary Study of the Prepositions in English. New York, NY: Cambridge University Press.
Jackendoff, R., and Landau, B. 1991. Spatial language and spatial cognition. In Napoli, D. J., and Kegl, J. A., eds., Bridges Between Psychology and Linguistics: A Swarthmore Festschrift for Lila Gleitman. Hillsdale, NJ: Lawrence Erlbaum Associates.
Jackendoff, R. 1983. Semantics and Cognition. Cambridge, MA: The MIT Press.
Leech, G. N. 1969. Towards a Semantic Description of English. Indiana University Press.
Mann, R., and Jepson, A. 1998. Toward the computational perception of action. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 794–799.
Mann, R.; Jepson, A.; and Siskind, J. M. 1997. The computational perception of scene dynamics. Computer Vision and Image Understanding 65(2).
McCarthy, J. 1980. Circumscription—a form of non-monotonic reasoning. Artificial Intelligence 13(1–2):27–39.
Miller, G. A. 1972. English verbs of motion: A case study in semantics and lexical memory. In Melton, A. W., and Martin, E., eds., Coding Processes in Human Memory. Washington, DC: V. H. Winston and Sons, Inc. Chapter 14, 335–372.
Pinhanez, C., and Bobick, A. 1995. Scripts in machine understanding of image sequences. In AAAI Fall Symposium Series on Computational Models for Integrating Language and Vision.
Pinker, S. 1989. Learnability and Cognition. Cambridge, MA: The MIT Press.
Regier, T. P. 1992. The Acquisition of Lexical Semantics for Spatial Terms: A Connectionist Model of Perceptual Categorization. Ph.D. Dissertation, University of California at Berkeley.
Schank, R. C. 1973. The fourteen primitive actions and their inferences. Memo AIM-183, Stanford Artificial Intelligence Laboratory.
Siskind, J. M., and Morris, Q. 1996. A maximum-likelihood approach to visual event classification. In Proceedings of the Fourth European Conference on Computer Vision, 347–360. Cambridge, UK: Springer-Verlag.
Siskind, J. M. 1992. Naive Physics, Event Perception, Lexical Semantics, and Language Acquisition. Ph.D. Dissertation, Massachusetts Institute of Technology, Cambridge, MA.
Siskind, J. M. 1995. Grounding language in perception. Artificial Intelligence Review 8:371–391.
Starner, T. E. 1995. Visual recognition of American Sign Language using hidden Markov models. Master's thesis, Massachusetts Institute of Technology, Cambridge, MA.
Talmy, L. 1988. Force dynamics in language and cognition. Cognitive Science 12:49–100.
Yamoto, J.; Ohya, J.; and Ishii, K. 1992. Recognizing human action in time-sequential images using hidden Markov model. In Proceedings of the 1992 IEEE Conference on Computer Vision and Pattern Recognition, 379–385. IEEE Press.